LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support
@ 2019-04-23 23:31 Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 01/19] driver core: add per device iommu param Jacob Pan
                   ` (18 more replies)
  0 siblings, 19 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Shared virtual address (SVA), a.k.a, Shared virtual memory (SVM) on Intel
platforms allow address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security.
This series is intended to enable SVA virtualization, i.e. shared guest
application address space and physical device DMA address. Only IOMMU portion
of the changes are included in this series. Additional support is needed in
VFIO and QEMU (will be submitted separately) to complete this functionality.

To make incremental changes and reduce the size of each patchset. This series
does not inlcude support for page request services.

In VT-d implementation, PASID table is per device and maintained in the host.
Guest PASID table is shadowed in VMM where virtual IOMMU is emulated.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables


This work is based on collaboration with other developers on the IOMMU
mailing list. Notably,

[1] [PATCH v6 00/22] SMMUv3 Nested Stage Setup by Eric Auger
https://lkml.org/lkml/2019/3/17/124

[2] [RFC PATCH 2/6] drivers core: Add I/O ASID allocator by Jean-Philippe
Brucker
https://www.spinics.net/lists/iommu/msg30639.html

[3] [RFC PATCH 0/5] iommu: APIs for paravirtual PASID allocation by Lu Baolu
https://lkml.org/lkml/2018/11/12/1921

[4] [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual
    Address (SVA)
    https://lwn.net/Articles/754331/

There are roughly three parts:
1. Generic PASID allocator [1] with extension to support custom allocator
2. IOMMU cache invalidation passdown from guest to host
3. Guest PASID bind for nested translation

All generic IOMMU APIs are reused from [1], which has a v7 just published with
no real impact to the patches used here. It is worth noting that unlike sMMU
nested stage setup, where PASID table is owned by the guest, VT-d PASID table is
owned by the host, individual PASIDs are bound instead of the PASID table.

This series is based on the new VT-d 3.0 Specification (https://software.intel.com/sites/default/files/managed/c5/15/vt-directed-io-spec.pdf).
This is different than the older series in [4] which was based on the older
specification that does not have scalable mode.


ChangeLog:
	- V2
	  - Rebased on Joerg's IOMMU x86/vt-d branch v5.1-rc4
	  - Integrated with Eric Auger's new v7 series for common APIs
	  (https://github.com/eauger/linux/tree/v5.1-rc3-2stage-v7)
	  - Addressed review comments from Andy Shevchenko and Alex Williamson on
	    IOASID custom allocator.
	  - Support multiple custom IOASID allocators (vIOMMUs) and dynamic
	    registration.


Jacob Pan (16):
  driver core: add per device iommu param
  iommu: introduce device fault data
  iommu: introduce device fault report API
  iommu: Introduce attach/detach_pasid_table API
  ioasid: Convert ioasid_idr to XArray
  ioasid: Add custom IOASID allocator
  iommu/vt-d: Add custom allocator for IOASID
  iommu/vt-d: Replace Intel specific PASID allocator with IOASID
  iommu/vt-d: Move domain helper to header
  iommu/vt-d: Add nested translation support
  iommu: Add guest PASID bind function
  iommu/vt-d: Add bind guest PASID support
  iommu/vtd: Clean up for SVM device list
  iommu: Add max num of cache and granu types
  iommu/vt-d: Support flushing more translation cache types
  iommu/vt-d: Add svm/sva invalidate function

Jean-Philippe Brucker (1):
  drivers core: Add I/O ASID allocator

Liu, Yi L (1):
  iommu: Introduce cache_invalidate API

Lu Baolu (1):
  iommu/vt-d: Enlightened PASID allocation

 drivers/base/Kconfig        |   6 +
 drivers/base/Makefile       |   1 +
 drivers/base/ioasid.c       | 265 ++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/Kconfig       |   1 +
 drivers/iommu/dmar.c        |  48 ++++++++
 drivers/iommu/intel-iommu.c | 236 ++++++++++++++++++++++++++++++++++--
 drivers/iommu/intel-pasid.c | 189 ++++++++++++++++++++++++-----
 drivers/iommu/intel-pasid.h |  24 +++-
 drivers/iommu/intel-svm.c   | 289 +++++++++++++++++++++++++++++++++++---------
 drivers/iommu/iommu.c       | 188 +++++++++++++++++++++++++++-
 include/linux/device.h      |   3 +
 include/linux/intel-iommu.h |  41 ++++++-
 include/linux/intel-svm.h   |   7 ++
 include/linux/ioasid.h      |  53 ++++++++
 include/linux/iommu.h       | 121 +++++++++++++++++++
 include/uapi/linux/iommu.h  | 255 ++++++++++++++++++++++++++++++++++++++
 16 files changed, 1625 insertions(+), 102 deletions(-)
 create mode 100644 drivers/base/ioasid.c
 create mode 100644 include/linux/ioasid.h
 create mode 100644 include/uapi/linux/iommu.h

-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 01/19] driver core: add per device iommu param
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 02/19] iommu: introduce device fault data Jacob Pan
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

DMA faults can be detected by IOMMU at device level. Adding a pointer
to struct device allows IOMMU subsystem to report relevant faults
back to the device driver for further handling.
For direct assigned device (or user space drivers), guest OS holds
responsibility to handle and respond per device IOMMU fault.
Therefore we need fault reporting mechanism to propagate faults beyond
IOMMU subsystem.

There are two other IOMMU data pointers under struct device today, here
we introduce iommu_param as a parent pointer such that all device IOMMU
data can be consolidated here. The idea was suggested here by Greg KH
and Joerg. The name iommu_param is chosen here since iommu_data has been used.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lkml.org/lkml/2017/10/6/81
---
 include/linux/device.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 4e6987e..2cd48a6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -42,6 +42,7 @@ struct iommu_ops;
 struct iommu_group;
 struct iommu_fwspec;
 struct dev_pin_info;
+struct iommu_param;
 
 struct bus_attribute {
 	struct attribute	attr;
@@ -959,6 +960,7 @@ struct dev_links_info {
  * 		device (i.e. the bus driver that discovered the device).
  * @iommu_group: IOMMU group the device belongs to.
  * @iommu_fwspec: IOMMU-specific properties supplied by firmware.
+ * @iommu_param: Per device generic IOMMU runtime data
  *
  * @offline_disabled: If set, the device is permanently online.
  * @offline:	Set after successful invocation of bus type's .offline().
@@ -1052,6 +1054,7 @@ struct device {
 	void	(*release)(struct device *dev);
 	struct iommu_group	*iommu_group;
 	struct iommu_fwspec	*iommu_fwspec;
+	struct iommu_param	*iommu_param;
 
 	bool			offline_disabled:1;
 	bool			offline:1;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 02/19] iommu: introduce device fault data
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 01/19] driver core: add per device iommu param Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-25 12:46   ` Jean-Philippe Brucker
  2019-04-23 23:31 ` [PATCH v2 03/19] iommu: introduce device fault report API Jacob Pan
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan, Liu, Yi L

Device faults detected by IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch introduces
a generic device fault data structure.

The fault can be either an unrecoverable fault or a page request,
also referred to as a recoverable fault.

We only care about non internal faults that are likely to be reported
to an external subsystem.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
v4 -> v5:
- simplified struct iommu_fault_event comment
- Moved IOMMU_FAULT_PERM outside of the struct
- Removed IOMMU_FAULT_PERM_INST
- s/IOMMU_FAULT_PAGE_REQUEST_PASID_PRESENT/
  IOMMU_FAULT_PAGE_REQUEST_PASID_VALID

v3 -> v4:
- use a union containing aither an unrecoverable fault or a page
  request message. Move the device private data in the page request
  structure. Reshuffle the fields and use flags.
- move fault perm attributes to the uapi
- remove a bunch of iommu_fault_reason enum values that were related
  to internal errors
---
 include/linux/iommu.h      |  44 +++++++++++++++++
 include/uapi/linux/iommu.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+)
 create mode 100644 include/uapi/linux/iommu.h

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 480921d..810bde2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
 #define IOMMU_WRITE	(1 << 1)
@@ -49,6 +50,7 @@ struct device;
 struct iommu_domain;
 struct notifier_block;
 struct iommu_sva;
+struct iommu_fault_event;
 
 /* iommu fault flags */
 #define IOMMU_FAULT_READ	0x0
@@ -58,6 +60,7 @@ typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
 			struct device *, unsigned long, int, void *);
 typedef int (*iommu_mm_exit_handler_t)(struct device *dev, struct iommu_sva *,
 				       void *);
+typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
 
 struct iommu_domain_geometry {
 	dma_addr_t aperture_start; /* First address that can be mapped    */
@@ -301,6 +304,46 @@ struct iommu_device {
 	struct device *dev;
 };
 
+/**
+ * struct iommu_fault_event - Generic fault event
+ *
+ * Can represent recoverable faults such as a page requests or
+ * unrecoverable faults such as DMA or IRQ remapping faults.
+ *
+ * @fault: fault descriptor
+ * @iommu_private: used by the IOMMU driver for storing fault-specific
+ *                 data. Users should not modify this field before
+ *                 sending the fault response.
+ */
+struct iommu_fault_event {
+	struct iommu_fault fault;
+	u64 iommu_private;
+};
+
+/**
+ * struct iommu_fault_param - per-device IOMMU fault data
+ * @dev_fault_handler: Callback function to handle IOMMU faults at device level
+ * @data: handler private data
+ *
+ */
+struct iommu_fault_param {
+	iommu_dev_fault_handler_t handler;
+	void *data;
+};
+
+/**
+ * struct iommu_param - collection of per-device IOMMU data
+ *
+ * @fault_param: IOMMU detected device fault reporting data
+ *
+ * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
+ *	struct iommu_group	*iommu_group;
+ *	struct iommu_fwspec	*iommu_fwspec;
+ */
+struct iommu_param {
+	struct iommu_fault_param *fault_param;
+};
+
 int  iommu_device_register(struct iommu_device *iommu);
 void iommu_device_unregister(struct iommu_device *iommu);
 int  iommu_device_sysfs_add(struct iommu_device *iommu,
@@ -500,6 +543,7 @@ struct iommu_ops {};
 struct iommu_group {};
 struct iommu_fwspec {};
 struct iommu_device {};
+struct iommu_fault_param {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
new file mode 100644
index 0000000..edcc0dd
--- /dev/null
+++ b/include/uapi/linux/iommu.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _UAPI_IOMMU_H
+#define _UAPI_IOMMU_H
+
+#include <linux/types.h>
+
+#define IOMMU_FAULT_PERM_WRITE	(1 << 0) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 1) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 2) /* privileged */
+
+/*  Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* pasid entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * Unrecoverable fault data
+ * @reason: reason of the fault
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ * @pasid: contains process address space ID, used in shared virtual memory
+ * @perm: Requested permission access using by the incoming transaction
+ * (IOMMU_FAULT_PERM_* values)
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason; /* enum iommu_fault_reason */
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_PERM_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 2)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 3)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/*
+ * Page Request data (aka. recoverable fault data)
+ * @flags : encodes whether the pasid is valid and whether this
+ * is the last page in group
+ * @pasid: pasid
+ * @grpid: page request group index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+	__u32   flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ *
+ * @type contains fault type
+ */
+
+struct iommu_fault {
+	__u32	type;   /* enum iommu_fault_type */
+	__u32	reserved;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+	};
+};
+#endif /* _UAPI_IOMMU_H */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 03/19] iommu: introduce device fault report API
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 01/19] driver core: add per device iommu param Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 02/19] iommu: introduce device fault data Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 04/19] iommu: Introduce attach/detach_pasid_table API Jacob Pan
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Traditionally, device specific faults are detected and handled within
their own device drivers. When IOMMU is enabled, faults such as DMA
related transactions are detected by IOMMU. There is no generic
reporting mechanism to report faults back to the in-kernel device
driver or the guest OS in case of assigned devices.

This patch introduces a registration API for device specific fault
handlers. This differs from the existing iommu_set_fault_handler/
report_iommu_fault infrastructures in several ways:
- it allows to report more sophisticated fault events (both
  unrecoverable faults and page request faults) due to the nature
  of the iommu_fault struct
- it is device specific and not domain specific.

The current iommu_report_device_fault() implementation only handles
the "shoot and forget" unrecoverable fault case. Handling of page
request faults or stalled faults will come later.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
v6 -> v7:
- use struct iommu_param *param = dev->iommu_param;

v4 -> v5:
- remove stuff related to recoverable faults
---
 drivers/iommu/iommu.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h |  36 +++++++++++++-
 2 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f8fe112..75c352c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -648,6 +648,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 		goto err_free_name;
 	}
 
+	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
+	if (!dev->iommu_param) {
+		ret = -ENOMEM;
+		goto err_free_name;
+	}
+	mutex_init(&dev->iommu_param->lock);
+
 	kobject_get(group->devices_kobj);
 
 	dev->iommu_group = group;
@@ -678,6 +685,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 	mutex_unlock(&group->mutex);
 	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
+	kfree(dev->iommu_param);
 err_free_name:
 	kfree(device->name);
 err_remove_link:
@@ -724,7 +732,7 @@ void iommu_group_remove_device(struct device *dev)
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 
 	trace_remove_device_from_group(group->id, dev);
-
+	kfree(dev->iommu_param);
 	kfree(device->name);
 	kfree(device);
 	dev->iommu_group = NULL;
@@ -859,6 +867,131 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
 /**
+ * iommu_register_device_fault_handler() - Register a device fault handler
+ * @dev: the device
+ * @handler: the fault handler
+ * @data: private data passed as argument to the handler
+ *
+ * When an IOMMU fault event is received, this handler gets called with the
+ * fault event and data as argument.
+ *
+ * Return 0 if the fault handler was installed successfully, or an error.
+ */
+int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = 0;
+
+	/*
+	 * Device iommu_param should have been allocated when device is
+	 * added to its iommu_group.
+	 */
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+	/* Only allow one fault handler registered for each device */
+	if (param->fault_param) {
+		ret = -EBUSY;
+		goto done_unlock;
+	}
+
+	get_device(dev);
+	param->fault_param =
+		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
+	if (!param->fault_param) {
+		put_device(dev);
+		ret = -ENOMEM;
+		goto done_unlock;
+	}
+	mutex_init(&param->fault_param->lock);
+	param->fault_param->handler = handler;
+	param->fault_param->data = data;
+	INIT_LIST_HEAD(&param->fault_param->faults);
+
+done_unlock:
+	mutex_unlock(&param->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
+
+/**
+ * iommu_unregister_device_fault_handler() - Unregister the device fault handler
+ * @dev: the device
+ *
+ * Remove the device fault handler installed with
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = 0;
+
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+
+	if (!param->fault_param)
+		goto unlock;
+
+	/* we cannot unregister handler if there are pending faults */
+	if (!list_empty(&param->fault_param->faults)) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	kfree(param->fault_param);
+	param->fault_param = NULL;
+	put_device(dev);
+unlock:
+	mutex_unlock(&param->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
+
+
+/**
+ * iommu_report_device_fault() - Report fault event to device
+ * @dev: the device
+ * @evt: fault event data
+ *
+ * Called by IOMMU model specific drivers when fault is detected, typically
+ * in a threaded IRQ handler.
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	struct iommu_param *param = dev->iommu_param;
+	struct iommu_fault_param *fparam;
+	int ret = 0;
+
+	/* iommu_param is allocated when device is added to group */
+	if (!param || !evt)
+		return -EINVAL;
+
+	/* we only report device fault if there is a handler registered */
+	mutex_lock(&param->lock);
+	fparam = param->fault_param;
+	if (!fparam || !fparam->handler) {
+		ret = -EINVAL;
+		goto done_unlock;
+	}
+	ret = fparam->handler(evt, fparam->data);
+done_unlock:
+	mutex_unlock(&param->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_fault);
+
+/**
  * iommu_group_id - Return ID for a group
  * @group: the group to ID
  *
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 810bde2..a42019a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -311,11 +311,13 @@ struct iommu_device {
  * unrecoverable faults such as DMA or IRQ remapping faults.
  *
  * @fault: fault descriptor
+ * @list pending fault event list, used for tracking responses
  * @iommu_private: used by the IOMMU driver for storing fault-specific
  *                 data. Users should not modify this field before
  *                 sending the fault response.
  */
 struct iommu_fault_event {
+	struct list_head list;
 	struct iommu_fault fault;
 	u64 iommu_private;
 };
@@ -324,10 +326,13 @@ struct iommu_fault_event {
  * struct iommu_fault_param - per-device IOMMU fault data
  * @dev_fault_handler: Callback function to handle IOMMU faults at device level
  * @data: handler private data
- *
+ * @faults: holds the pending faults which needs response, e.g. page response.
+ * @lock: protect pending PRQ event list
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
+	struct list_head faults;
+	struct mutex lock;
 	void *data;
 };
 
@@ -341,6 +346,7 @@ struct iommu_fault_param {
  *	struct iommu_fwspec	*iommu_fwspec;
  */
 struct iommu_param {
+	struct mutex lock;
 	struct iommu_fault_param *fault_param;
 };
 
@@ -433,6 +439,15 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data);
+
+extern int iommu_unregister_device_fault_handler(struct device *dev);
+
+extern int iommu_report_device_fault(struct device *dev,
+				     struct iommu_fault_event *evt);
+
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -737,6 +752,25 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline
+int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	return 0;
+}
+
+static inline
+int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_group_id(struct iommu_group *group)
 {
 	return -ENODEV;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 04/19] iommu: Introduce attach/detach_pasid_table API
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (2 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 03/19] iommu: introduce device fault report API Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 05/19] iommu: Introduce cache_invalidate API Jacob Pan
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan, Liu, Yi L

In virtualization use case, when a guest is assigned
a PCI host device, protected by a virtual IOMMU on the guest,
the physical IOMMU must be programmed to be consistent with
the guest mappings. If the physical IOMMU supports two
translation stages it makes sense to program guest mappings
onto the first stage/level (ARM/Intel terminology) while the host
owns the stage/level 2.

In that case, it is mandated to trap on guest configuration
settings and pass those to the physical iommu driver.

This patch adds a new API to the iommu subsystem that allows
to set/unset the pasid table information.

A generic iommu_pasid_table_config struct is introduced in
a new iommu.h uapi header. This is going to be used by the VFIO
user API.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

---

This patch generalizes the API introduced by Jacob & co-authors in
https://lwn.net/Articles/754331/

v4 -> v5:
- no returned valued for dummy definition of iommu_detach_pasid_table
- fix order in comment
- added Jean's R-b

v3 -> v4:
- s/set_pasid_table/attach_pasid_table
- restore detach_pasid_table. Detach can be used on unwind path.
- add padding
- remove @abort
- signature used for config and format
- add comments for fields in the SMMU struct

v2 -> v3:
- replace unbind/bind by set_pasid_table
- move table pointer and pasid bits in the generic part of the struct

v1 -> v2:
- restore the original pasid table name
- remove the struct device * parameter in the API
- reworked iommu_pasid_smmuv3
---
 drivers/iommu/iommu.c      | 19 +++++++++++++++++++
 include/linux/iommu.h      | 18 ++++++++++++++++++
 include/uapi/linux/iommu.h | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 75c352c..2a68786 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1528,6 +1528,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_attach_pasid_table(struct iommu_domain *domain,
+			     struct iommu_pasid_table_config *cfg)
+{
+	if (unlikely(!domain->ops->attach_pasid_table))
+		return -ENODEV;
+
+	return domain->ops->attach_pasid_table(domain, cfg);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_pasid_table);
+
+void iommu_detach_pasid_table(struct iommu_domain *domain)
+{
+	if (unlikely(!domain->ops->detach_pasid_table))
+		return;
+
+	domain->ops->detach_pasid_table(domain);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a42019a..131cf80 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -227,6 +227,8 @@ struct iommu_sva_ops {
  * @sva_bind: Bind process address space to device
  * @sva_unbind: Unbind process address space from device
  * @sva_get_pasid: Get PASID associated to a SVA handle
+ * @attach_pasid_table: attach a pasid table
+ * @detach_pasid_table: detach the pasid table
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  */
 struct iommu_ops {
@@ -286,6 +288,9 @@ struct iommu_ops {
 				      void *drvdata);
 	void (*sva_unbind)(struct iommu_sva *handle);
 	int (*sva_get_pasid)(struct iommu_sva *handle);
+	int (*attach_pasid_table)(struct iommu_domain *domain,
+				  struct iommu_pasid_table_config *cfg);
+	void (*detach_pasid_table)(struct iommu_domain *domain);
 
 	unsigned long pgsize_bitmap;
 };
@@ -394,6 +399,9 @@ extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_attach_pasid_table(struct iommu_domain *domain,
+				    struct iommu_pasid_table_config *cfg);
+extern void iommu_detach_pasid_table(struct iommu_domain *domain);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
@@ -897,6 +905,13 @@ iommu_aux_get_pasid(struct iommu_domain *domain, struct device *dev)
 	return -ENODEV;
 }
 
+static inline
+int iommu_attach_pasid_table(struct iommu_domain *domain,
+			     struct iommu_pasid_table_config *cfg)
+{
+	return -ENODEV;
+}
+
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata)
 {
@@ -918,6 +933,9 @@ static inline int iommu_sva_get_pasid(struct iommu_sva *handle)
 	return IOMMU_PASID_INVALID;
 }
 
+static inline
+void iommu_detach_pasid_table(struct iommu_domain *domain) {}
+
 #endif /* CONFIG_IOMMU_API */
 
 #ifdef CONFIG_IOMMU_DEBUGFS
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index edcc0dd..532a640 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -112,4 +112,51 @@ struct iommu_fault {
 		struct iommu_fault_page_request prm;
 	};
 };
+
+/**
+ * SMMUv3 Stream Table Entry stage 1 related information
+ * The PASID table is referred to as the context descriptor (CD) table.
+ *
+ * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
+   or 2-level table)
+ * @s1dss: STE s1dss (specifies the behavior when pasid_bits != 0
+   and no pasid is passed along with the incoming transaction)
+ * Please refer to the smmu 3.x spec (ARM IHI 0070A) for full details
+ */
+struct iommu_pasid_smmuv3 {
+#define PASID_TABLE_SMMUV3_CFG_VERSION_1 1
+	__u32	version;
+	__u8 s1fmt;
+	__u8 s1dss;
+	__u8 padding[2];
+};
+
+/**
+ * PASID table data used to bind guest PASID table to the host IOMMU
+ * Note PASID table corresponds to the Context Table on ARM SMMUv3.
+ *
+ * @version: API version to prepare for future extensions
+ * @format: format of the PASID table
+ * @base_ptr: guest physical address of the PASID table
+ * @pasid_bits: number of PASID bits used in the PASID table
+ * @config: indicates whether the guest translation stage must
+ * be translated, bypassed or aborted.
+ */
+struct iommu_pasid_table_config {
+#define PASID_TABLE_CFG_VERSION_1 1
+	__u32	version;
+#define IOMMU_PASID_FORMAT_SMMUV3	1
+	__u32	format;
+	__u64	base_ptr;
+	__u8	pasid_bits;
+#define IOMMU_PASID_CONFIG_TRANSLATE	1
+#define IOMMU_PASID_CONFIG_BYPASS	2
+#define IOMMU_PASID_CONFIG_ABORT	3
+	__u8	config;
+	__u8    padding[6];
+	union {
+		struct iommu_pasid_smmuv3 smmuv3;
+	};
+};
+
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 05/19] iommu: Introduce cache_invalidate API
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (3 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 04/19] iommu: Introduce attach/detach_pasid_table API Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 06/19] drivers core: Add I/O ASID allocator Jacob Pan
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Liu, Yi L, Liu, Jacob Pan

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

In any virtualization use case, when the first translation stage
is "owned" by the guest OS, the host IOMMU driver has no knowledge
of caching structure updates unless the guest invalidation activities
are trapped by the virtualizer and passed down to the host.

Since the invalidation data are obtained from user space and will be
written into physical IOMMU, we must allow security check at various
layers. Therefore, generic invalidation data format are proposed here,
model specific IOMMU drivers need to convert them into their own format.

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
v6 -> v7:
- detail which fields are used for each invalidation type
- add a comment about multiple cache invalidation

v5 -> v6:
- fix merge issue

v3 -> v4:
- full reshape of the API following Alex' comments

v1 -> v2:
- add arch_id field
- renamed tlb_invalidate into cache_invalidate as this API allows
  to invalidate context caches on top of IOTLBs

v1:
renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
header. Commit message reworded.
---
 drivers/iommu/iommu.c      | 14 +++++++++
 include/linux/iommu.h      | 15 +++++++++
 include/uapi/linux/iommu.h | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2a68786..498c28a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1547,6 +1547,20 @@ void iommu_detach_pasid_table(struct iommu_domain *domain)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_pasid_table);
 
+int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+			   struct iommu_cache_invalidate_info *inv_info)
+{
+	int ret = 0;
+
+	if (unlikely(!domain->ops->cache_invalidate))
+		return -ENODEV;
+
+	ret = domain->ops->cache_invalidate(domain, dev, inv_info);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 131cf80..4b92e4b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -229,6 +229,7 @@ struct iommu_sva_ops {
  * @sva_get_pasid: Get PASID associated to a SVA handle
  * @attach_pasid_table: attach a pasid table
  * @detach_pasid_table: detach the pasid table
+ * @cache_invalidate: invalidate translation caches
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  */
 struct iommu_ops {
@@ -292,6 +293,9 @@ struct iommu_ops {
 				  struct iommu_pasid_table_config *cfg);
 	void (*detach_pasid_table)(struct iommu_domain *domain);
 
+	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
+				struct iommu_cache_invalidate_info *inv_info);
+
 	unsigned long pgsize_bitmap;
 };
 
@@ -402,6 +406,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
 extern int iommu_attach_pasid_table(struct iommu_domain *domain,
 				    struct iommu_pasid_table_config *cfg);
 extern void iommu_detach_pasid_table(struct iommu_domain *domain);
+extern int iommu_cache_invalidate(struct iommu_domain *domain,
+				  struct device *dev,
+				  struct iommu_cache_invalidate_info *inv_info);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
@@ -936,6 +943,14 @@ static inline int iommu_sva_get_pasid(struct iommu_sva *handle)
 static inline
 void iommu_detach_pasid_table(struct iommu_domain *domain) {}
 
+static inline int
+iommu_cache_invalidate(struct iommu_domain *domain,
+		       struct device *dev,
+		       struct iommu_cache_invalidate_info *inv_info)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #ifdef CONFIG_IOMMU_DEBUGFS
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 532a640..61a3fb7 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -159,4 +159,82 @@ struct iommu_pasid_table_config {
 	};
 };
 
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
+	IOMMU_INV_GRANU_PASID,	/* pasid-selective invalidation */
+	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
+};
+
+/**
+ * Address Selective Invalidation Structure
+ *
+ * @flags indicates the granularity of the address-selective invalidation
+ * - if PASID bit is set, @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the
+ *   address range.
+ * - if ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific id and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - if neither PASID or ARCHID is set, global addr invalidation applies
+ * - LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space id
+ * @archid: architecture-specific id
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+	__u64	addr;
+	__u64	granule_size;
+	__u64	nb_granules;
+};
+
+/**
+ * First level/stage invalidation information
+ * @cache: bitfield that allows to select which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > pasid > addr
+ *
+ * Not all the combinations of cache/granularity make sense:
+ *
+ *         type |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * granularity	|		|		|      cache	|
+ * -------------+---------------+---------------+---------------+
+ * DOMAIN	|	N/A	|       Y	|	Y	|
+ * PASID	|	Y	|       Y	|	Y	|
+ * ADDR		|       Y	|       Y	|	N/A	|
+ *
+ * Invalidations by %IOMMU_INV_GRANU_ADDR use field @addr_info.
+ * Invalidations by %IOMMU_INV_GRANU_PASID use field @pasid.
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
+	__u32	version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
+	__u8	cache;
+	__u8	granularity;
+	__u8	padding[2];
+	union {
+		__u64	pasid;
+		struct iommu_inv_addr_info addr_info;
+	};
+};
+
+
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (4 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 05/19] iommu: Introduce cache_invalidate API Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-24  6:19   ` Christoph Hellwig
  2019-04-25 10:17   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 07/19] ioasid: Convert ioasid_idr to XArray Jacob Pan
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Some devices might support multiple DMA address spaces, in particular
those that have the PCI PASID feature. PASID (Process Address Space ID)
allows to share process address spaces with devices (SVA), partition a
device into VM-assignable entities (VFIO mdev) or simply provide
multiple DMA address space to kernel drivers. Add a global PASID
allocator usable by different drivers at the same time. Name it I/O ASID
to avoid confusion with ASIDs allocated by arch code, which are usually
a separate ID space.

The IOASID space is global. Each device can have its own PASID space,
but by convention the IOMMU ended up having a global PASID space, so
that with SVA, each mm_struct is associated to a single PASID.

The allocator doesn't really belong in drivers/iommu because some
drivers would like to allocate PASIDs for devices that aren't managed by
an IOMMU, using the same ID space as IOMMU. It doesn't really belong in
drivers/pci either since platform device also support PASID. Add the
allocator in drivers/base.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/base/Kconfig   |   6 +++
 drivers/base/Makefile  |   1 +
 drivers/base/ioasid.c  | 106 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ioasid.h |  40 +++++++++++++++++++
 4 files changed, 153 insertions(+)
 create mode 100644 drivers/base/ioasid.c
 create mode 100644 include/linux/ioasid.h

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700e..47c1348 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -182,6 +182,12 @@ config DMA_SHARED_BUFFER
 	  APIs extension; the file's descriptor can then be passed on to other
 	  driver.
 
+config IOASID
+	bool
+	help
+	  Enable the I/O Address Space ID allocator. A single ID space shared
+	  between different users.
+
 config DMA_FENCE_TRACE
 	bool "Enable verbose DMA_FENCE_TRACE messages"
 	depends on DMA_SHARED_BUFFER
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 1574520..aafa2ac 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_PINCTRL) += pinctrl.o
 obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
 obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
 obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
+obj-$(CONFIG_IOASID) += ioasid.o
 
 obj-y			+= test/
 
diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
new file mode 100644
index 0000000..cf122b2
--- /dev/null
+++ b/drivers/base/ioasid.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * I/O Address Space ID allocator. There is one global IOASID space, split into
+ * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
+ * free IOASIDs with ioasid_alloc and ioasid_free.
+ */
+#include <linux/idr.h>
+#include <linux/ioasid.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+struct ioasid_data {
+	ioasid_t id;
+	struct ioasid_set *set;
+	void *private;
+	struct rcu_head rcu;
+};
+
+static DEFINE_IDR(ioasid_idr);
+
+/**
+ * ioasid_alloc - Allocate an IOASID
+ * @set: the IOASID set
+ * @min: the minimum ID (inclusive)
+ * @max: the maximum ID (exclusive)
+ * @private: data private to the caller
+ *
+ * Allocate an ID between @min and @max (or %0 and %INT_MAX). Return the
+ * allocated ID on success, or INVALID_IOASID on failure. The @private pointer
+ * is stored internally and can be retrieved with ioasid_find().
+ */
+ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
+		      void *private)
+{
+	int id = -1;
+	struct ioasid_data *data;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return INVALID_IOASID;
+
+	data->set = set;
+	data->private = private;
+
+	idr_preload(GFP_KERNEL);
+	idr_lock(&ioasid_idr);
+	data->id = id = idr_alloc(&ioasid_idr, data, min, max, GFP_ATOMIC);
+	idr_unlock(&ioasid_idr);
+	idr_preload_end();
+
+	if (id < 0) {
+		kfree(data);
+		return INVALID_IOASID;
+	}
+	return id;
+}
+EXPORT_SYMBOL_GPL(ioasid_alloc);
+
+/**
+ * ioasid_free - Free an IOASID
+ * @ioasid: the ID to remove
+ */
+void ioasid_free(ioasid_t ioasid)
+{
+	struct ioasid_data *ioasid_data;
+
+	idr_lock(&ioasid_idr);
+	ioasid_data = idr_remove(&ioasid_idr, ioasid);
+	idr_unlock(&ioasid_idr);
+
+	if (ioasid_data)
+		kfree_rcu(ioasid_data, rcu);
+}
+EXPORT_SYMBOL_GPL(ioasid_free);
+
+/**
+ * ioasid_find - Find IOASID data
+ * @set: the IOASID set
+ * @ioasid: the IOASID to find
+ * @getter: function to call on the found object
+ *
+ * The optional getter function allows to take a reference to the found object
+ * under the rcu lock. The function can also check if the object is still valid:
+ * if @getter returns false, then the object is invalid and NULL is returned.
+ *
+ * If the IOASID has been allocated for this set, return the private pointer
+ * passed to ioasid_alloc. Otherwise return NULL.
+ */
+void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
+		  bool (*getter)(void *))
+{
+	void *priv = NULL;
+	struct ioasid_data *ioasid_data;
+
+	rcu_read_lock();
+	ioasid_data = idr_find(&ioasid_idr, ioasid);
+	if (ioasid_data && ioasid_data->set == set) {
+		priv = ioasid_data->private;
+		if (getter && !getter(priv))
+			priv = NULL;
+	}
+	rcu_read_unlock();
+
+	return priv;
+}
+EXPORT_SYMBOL_GPL(ioasid_find);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
new file mode 100644
index 0000000..6f3655a
--- /dev/null
+++ b/include/linux/ioasid.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_IOASID_H
+#define __LINUX_IOASID_H
+
+#define INVALID_IOASID ((ioasid_t)-1)
+typedef unsigned int ioasid_t;
+typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void *data);
+
+struct ioasid_set {
+	int dummy;
+};
+
+#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
+
+#ifdef CONFIG_IOASID
+ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
+		      void *private);
+void ioasid_free(ioasid_t ioasid);
+
+void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
+		  bool (*getter)(void *));
+
+#else /* !CONFIG_IOASID */
+static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
+				    ioasid_t max, void *private)
+{
+	return INVALID_IOASID;
+}
+
+static inline void ioasid_free(ioasid_t ioasid)
+{
+}
+
+static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
+				bool (*getter)(void *))
+{
+	return NULL;
+}
+#endif /* CONFIG_IOASID */
+#endif /* __LINUX_IOASID_H */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 07/19] ioasid: Convert ioasid_idr to XArray
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (5 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 06/19] drivers core: Add I/O ASID allocator Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 08/19] ioasid: Add custom IOASID allocator Jacob Pan
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

IDR is to be replaced by XArray, keep up with the changes.
XArray has internal locking for normal APIs used here, also removed
radix tree related preload.

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/base/ioasid.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
index cf122b2..c4012aa 100644
--- a/drivers/base/ioasid.c
+++ b/drivers/base/ioasid.c
@@ -4,7 +4,7 @@
  * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
  * free IOASIDs with ioasid_alloc and ioasid_free.
  */
-#include <linux/idr.h>
+#include <linux/xarray.h>
 #include <linux/ioasid.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -16,13 +16,12 @@ struct ioasid_data {
 	struct rcu_head rcu;
 };
 
-static DEFINE_IDR(ioasid_idr);
-
+static DEFINE_XARRAY_ALLOC(ioasid_xa);
 /**
  * ioasid_alloc - Allocate an IOASID
  * @set: the IOASID set
  * @min: the minimum ID (inclusive)
- * @max: the maximum ID (exclusive)
+ * @max: the maximum ID (inclusive)
  * @private: data private to the caller
  *
  * Allocate an ID between @min and @max (or %0 and %INT_MAX). Return the
@@ -41,13 +40,13 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 
 	data->set = set;
 	data->private = private;
+	if (xa_alloc(&ioasid_xa, &id, data, XA_LIMIT(min, max), GFP_KERNEL)) {
+		pr_err("Failed to alloc ioasid from %d to %d\n", min, max);
+		goto exit_free;
+	}
 
-	idr_preload(GFP_KERNEL);
-	idr_lock(&ioasid_idr);
-	data->id = id = idr_alloc(&ioasid_idr, data, min, max, GFP_ATOMIC);
-	idr_unlock(&ioasid_idr);
-	idr_preload_end();
-
+	data->id = id;
+exit_free:
 	if (id < 0) {
 		kfree(data);
 		return INVALID_IOASID;
@@ -64,12 +63,8 @@ void ioasid_free(ioasid_t ioasid)
 {
 	struct ioasid_data *ioasid_data;
 
-	idr_lock(&ioasid_idr);
-	ioasid_data = idr_remove(&ioasid_idr, ioasid);
-	idr_unlock(&ioasid_idr);
-
-	if (ioasid_data)
-		kfree_rcu(ioasid_data, rcu);
+	ioasid_data = xa_erase(&ioasid_xa, ioasid);
+	kfree_rcu(ioasid_data, rcu);
 }
 EXPORT_SYMBOL_GPL(ioasid_free);
 
@@ -93,7 +88,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 	struct ioasid_data *ioasid_data;
 
 	rcu_read_lock();
-	ioasid_data = idr_find(&ioasid_idr, ioasid);
+	ioasid_data = xa_load(&ioasid_xa, ioasid);
 	if (ioasid_data && ioasid_data->set == set) {
 		priv = ioasid_data->private;
 		if (getter && !getter(priv))
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 08/19] ioasid: Add custom IOASID allocator
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (6 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 07/19] ioasid: Convert ioasid_idr to XArray Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-25 10:03   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation Jacob Pan
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Sometimes, IOASID allocation must be handled by platform specific
code. The use cases are guest vIOMMU and pvIOMMU where IOASIDs need
to be allocated by the host via enlightened or paravirt interfaces.

This patch adds an extension to the IOASID allocator APIs such that
platform drivers can register a custom allocator, possibly at boot
time, to take over the allocation. Xarray is still used for tracking
and searching purposes internal to the IOASID code. Private data of
an IOASID can also be set after the allocation.

There can be multiple custom allocators registered but only one is
used at a time. In case of hot removal of devices that provides the
allocator, all IOASIDs must be freed prior to unregistering the
allocator. Default XArray based allocator cannot be mixed with
custom allocators, i.e. custom allocators will not be used if there
are outstanding IOASIDs allocated by the default XA allocator.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/base/ioasid.c  | 182 ++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/ioasid.h |  15 +++-
 2 files changed, 187 insertions(+), 10 deletions(-)

diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
index c4012aa..5cb36a4 100644
--- a/drivers/base/ioasid.c
+++ b/drivers/base/ioasid.c
@@ -17,6 +17,120 @@ struct ioasid_data {
 };
 
 static DEFINE_XARRAY_ALLOC(ioasid_xa);
+static DEFINE_MUTEX(ioasid_allocator_lock);
+static struct ioasid_allocator *ioasid_allocator;
+
+static LIST_HEAD(custom_allocators);
+/*
+ * A flag to track if ioasid default allocator already been used, this will
+ * prevent custom allocator from being used. The reason is that custom allocator
+ * must have unadulterated space to track private data with xarray, there cannot
+ * be a mix been default and custom allocated IOASIDs.
+ */
+static int default_allocator_used;
+
+/**
+ * ioasid_register_allocator - register a custom allocator
+ * @allocator: the custom allocator to be registered
+ *
+ * Custom allocator take precedence over the default xarray based allocator.
+ * Private data associated with the ASID are managed by ASID common code
+ * similar to data stored in xa.
+ *
+ * There can be multiple allocators registered but only one is active. In case
+ * of runtime removal of an custom allocator, the next one is activated based
+ * on the registration ordering.
+ */
+int ioasid_register_allocator(struct ioasid_allocator *allocator)
+{
+	struct ioasid_allocator *pallocator;
+	int ret = 0;
+
+	if (!allocator)
+		return -EINVAL;
+
+	mutex_lock(&ioasid_allocator_lock);
+	if (list_empty(&custom_allocators))
+		ioasid_allocator = allocator;
+	else {
+		/* Check if the allocator is already registered */
+		list_for_each_entry(pallocator, &custom_allocators, list) {
+			if (pallocator == allocator) {
+				pr_err("IOASID allocator already exist\n");
+				ret = -EEXIST;
+				goto out_unlock;
+			}
+		}
+	}
+	list_add_tail(&allocator->list, &custom_allocators);
+
+out_unlock:
+	mutex_unlock(&ioasid_allocator_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_register_allocator);
+
+/**
+ * ioasid_unregister_allocator - Remove a custom IOASID allocator
+ * @allocator: the custom allocator to be removed
+ *
+ * Remove an allocator from the list, activate the next allocator in
+ * the order it was  registration.
+ */
+void ioasid_unregister_allocator(struct ioasid_allocator *allocator)
+{
+	if (!allocator)
+		return;
+
+	if (list_empty(&custom_allocators)) {
+		pr_warn("No custom IOASID allocators active!\n");
+		return;
+	}
+
+	mutex_lock(&ioasid_allocator_lock);
+	list_del(&allocator->list);
+	if (list_empty(&custom_allocators)) {
+		pr_info("No custom IOASID allocators\n");
+		/*
+		 * All IOASIDs should have been freed before the last allocator
+		 * is unregistered.
+		 */
+		BUG_ON(!xa_empty(&ioasid_xa));
+		ioasid_allocator = NULL;
+	} else if (allocator == ioasid_allocator) {
+		ioasid_allocator = list_entry(&custom_allocators, struct ioasid_allocator, list);
+		pr_info("IOASID allocator changed");
+	}
+	mutex_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
+
+/**
+ * ioasid_set_data - Set private data for an allocated ioasid
+ * @ioasid: the ID to set data
+ * @data:   the private data
+ *
+ * For IOASID that is already allocated, private data can be set
+ * via this API. Future lookup can be done via ioasid_find.
+ */
+int ioasid_set_data(ioasid_t ioasid, void *data)
+{
+	struct ioasid_data *ioasid_data;
+	int ret = 0;
+
+	ioasid_data = xa_load(&ioasid_xa, ioasid);
+	if (ioasid_data)
+		ioasid_data->private = data;
+	else
+		ret = -ENOENT;
+
+	/* getter may use the private data */
+	synchronize_rcu();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_set_data);
+
 /**
  * ioasid_alloc - Allocate an IOASID
  * @set: the IOASID set
@@ -31,7 +145,7 @@ static DEFINE_XARRAY_ALLOC(ioasid_xa);
 ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		      void *private)
 {
-	int id = -1;
+	int id = INVALID_IOASID;
 	struct ioasid_data *data;
 
 	data = kzalloc(sizeof(*data), GFP_KERNEL);
@@ -40,14 +154,37 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 
 	data->set = set;
 	data->private = private;
+
+	/*
+	 * Use custom allocator if available, otherwise use default.
+	 * However, if there are active IOASIDs already been allocated by default
+	 * allocator, custom allocator cannot be used.
+	 */
+	if (!default_allocator_used && ioasid_allocator) {
+		mutex_lock(&ioasid_allocator_lock);
+		id = ioasid_allocator->alloc(min, max, ioasid_allocator->pdata);
+		mutex_unlock(&ioasid_allocator_lock);
+		if (id == INVALID_IOASID) {
+			pr_err("Failed ASID allocation by custom allocator\n");
+			goto exit_free;
+		}
+		/*
+		 * Use XA to manage private data also sanitiy check custom
+		 * allocator for duplicates.
+		 */
+		min = id;
+		max = id + 1;
+	} else
+		default_allocator_used = 1;
+
 	if (xa_alloc(&ioasid_xa, &id, data, XA_LIMIT(min, max), GFP_KERNEL)) {
 		pr_err("Failed to alloc ioasid from %d to %d\n", min, max);
 		goto exit_free;
 	}
-
 	data->id = id;
+
 exit_free:
-	if (id < 0) {
+	if (id < 0 || id == INVALID_IOASID) {
 		kfree(data);
 		return INVALID_IOASID;
 	}
@@ -59,12 +196,29 @@ EXPORT_SYMBOL_GPL(ioasid_alloc);
  * ioasid_free - Free an IOASID
  * @ioasid: the ID to remove
  */
-void ioasid_free(ioasid_t ioasid)
+int ioasid_free(ioasid_t ioasid)
 {
 	struct ioasid_data *ioasid_data;
+	int ret = 0;
+
+	if (ioasid_allocator) {
+		mutex_lock(&ioasid_allocator_lock);
+		ret = ioasid_allocator->free(ioasid, ioasid_allocator->pdata);
+		mutex_unlock(&ioasid_allocator_lock);
+	}
+	if (ret) {
+		pr_err("ioasid %d custom allocator free failed\n", ioasid);
+		return ret;
+	}
 
 	ioasid_data = xa_erase(&ioasid_xa, ioasid);
+
 	kfree_rcu(ioasid_data, rcu);
+
+	if (xa_empty(&ioasid_xa))
+		default_allocator_used = 0;
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(ioasid_free);
 
@@ -79,7 +233,8 @@ EXPORT_SYMBOL_GPL(ioasid_free);
  * if @getter returns false, then the object is invalid and NULL is returned.
  *
  * If the IOASID has been allocated for this set, return the private pointer
- * passed to ioasid_alloc. Otherwise return NULL.
+ * passed to ioasid_alloc. Private data can be NULL if not set. Return an error
+ * if the IOASID is not found or not belong to the set.
  */
 void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 		  bool (*getter)(void *))
@@ -89,11 +244,20 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 
 	rcu_read_lock();
 	ioasid_data = xa_load(&ioasid_xa, ioasid);
-	if (ioasid_data && ioasid_data->set == set) {
-		priv = ioasid_data->private;
-		if (getter && !getter(priv))
-			priv = NULL;
+	if (!ioasid_data) {
+		priv = ERR_PTR(-ENOENT);
+		goto unlock;
+	}
+	if (set && ioasid_data->set != set) {
+		/* data found but does not belong to the set */
+		priv = ERR_PTR(-EACCES);
+		goto unlock;
 	}
+	/* Now IOASID and its set is verified, we can return the private data */
+	priv = ioasid_data->private;
+	if (getter && !getter(priv))
+		priv = NULL;
+unlock:
 	rcu_read_unlock();
 
 	return priv;
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 6f3655a..e773c13 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -5,20 +5,33 @@
 #define INVALID_IOASID ((ioasid_t)-1)
 typedef unsigned int ioasid_t;
 typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void *data);
+typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
+typedef int (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
 
 struct ioasid_set {
 	int dummy;
 };
 
+struct ioasid_allocator {
+	ioasid_alloc_fn_t alloc;
+	ioasid_free_fn_t free;
+	void *pdata;
+	struct list_head list;
+};
+
 #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
 
 #ifdef CONFIG_IOASID
 ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
 		      void *private);
-void ioasid_free(ioasid_t ioasid);
+int ioasid_free(ioasid_t ioasid);
 
 void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
 		  bool (*getter)(void *));
+int ioasid_register_allocator(struct ioasid_allocator *allocator);
+void ioasid_unregister_allocator(struct ioasid_allocator *allocator);
+
+int ioasid_set_data(ioasid_t ioasid, void *data);
 
 #else /* !CONFIG_IOASID */
 static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (7 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 08/19] ioasid: Add custom IOASID allocator Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-24 17:27   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

From: Lu Baolu <baolu.lu@linux.intel.com>

If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the
IOMMU driver should rely on the emulation software to allocate
and free PASID IDs. The Intel vt-d spec revision 3.0 defines a
register set to support this. This includes a capability register,
a virtual command register and a virtual response register. Refer
to section 10.4.42, 10.4.43, 10.4.44 for more information.

This patch adds the enlightened PASID allocation/free interfaces
via the virtual command register.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-pasid.c | 70 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-pasid.h | 13 ++++++++-
 include/linux/intel-iommu.h |  2 ++
 3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 03b12d2..5b1d3be 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -63,6 +63,76 @@ void *intel_pasid_lookup_id(int pasid)
 	return p;
 }
 
+int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
+{
+	u64 res;
+	u64 cap;
+	u8 err_code;
+	unsigned long flags;
+	int ret = 0;
+
+	if (!ecap_vcs(iommu->ecap)) {
+		pr_warn("IOMMU: %s: Hardware doesn't support virtual command\n",
+			iommu->name);
+		return -ENODEV;
+	}
+
+	cap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
+	if (!(cap & DMA_VCS_PAS)) {
+		pr_warn("IOMMU: %s: Emulation software doesn't support PASID allocation\n",
+			iommu->name);
+		return -ENODEV;
+	}
+
+	raw_spin_lock_irqsave(&iommu->register_lock, flags);
+	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
+	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
+		      !(res & VCMD_VRSP_IP), res);
+	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+
+	err_code = VCMD_VRSP_EC(res);
+	switch (err_code) {
+	case VCMD_VRSP_EC_SUCCESS:
+		*pasid = VCMD_VRSP_RESULE(res);
+		break;
+	case VCMD_VRSP_EC_UNAVAIL:
+		pr_info("IOMMU: %s: No PASID available\n", iommu->name);
+		ret = -ENOMEM;
+		break;
+	default:
+		ret = -ENODEV;
+		pr_warn("IOMMU: %s: Unkonwn error code %d\n",
+			iommu->name, err_code);
+	}
+
+	return ret;
+}
+
+void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
+{
+	u64 res;
+	u8 err_code;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&iommu->register_lock, flags);
+	dmar_writeq(iommu->reg + DMAR_VCMD_REG, (pasid << 8) | VCMD_CMD_FREE);
+	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
+		      !(res & VCMD_VRSP_IP), res);
+	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+
+	err_code = VCMD_VRSP_EC(res);
+	switch (err_code) {
+	case VCMD_VRSP_EC_SUCCESS:
+		break;
+	case VCMD_VRSP_EC_INVAL:
+		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
+		break;
+	default:
+		pr_warn("IOMMU: %s: Unkonwn error code %d\n",
+			iommu->name, err_code);
+	}
+}
+
 /*
  * Per device pasid table management:
  */
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 23537b3..0999dfe 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -19,6 +19,16 @@
 #define PASID_PDE_SHIFT			6
 #define MAX_NR_PASID_BITS		20
 
+/* Virtual command interface for enlightened pasid management. */
+#define VCMD_CMD_ALLOC			0x1
+#define VCMD_CMD_FREE			0x2
+#define VCMD_VRSP_IP			0x1
+#define VCMD_VRSP_EC(e)			(((e) >> 1) & 0x3)
+#define VCMD_VRSP_EC_SUCCESS		0
+#define VCMD_VRSP_EC_UNAVAIL		1
+#define VCMD_VRSP_EC_INVAL		1
+#define VCMD_VRSP_RESULE(e)		(((e) >> 8) & 0xfffff)
+
 /*
  * Domain ID reserved for pasid entries programmed for first-level
  * only and pass-through transfer modes.
@@ -69,5 +79,6 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct device *dev, int pasid);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, int pasid);
-
+int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
+void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
 #endif /* __INTEL_PASID_H */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 6925a18..bff907b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -173,6 +173,7 @@
 #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
 #define ecap_flts(e)		(((e) >> 47) & 0x1)
 #define ecap_slts(e)		(((e) >> 46) & 0x1)
+#define ecap_vcs(e)		(((e) >> 44) & 0x1)
 #define ecap_smts(e)		(((e) >> 43) & 0x1)
 #define ecap_dit(e)		((e >> 41) & 0x1)
 #define ecap_pasid(e)		((e >> 40) & 0x1)
@@ -289,6 +290,7 @@
 
 /* PRS_REG */
 #define DMA_PRS_PPR	((u32)1)
+#define DMA_VCS_PAS	((u64)1)
 
 #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)			\
 do {									\
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (8 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-24 17:27   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID Jacob Pan
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan, Liu

When VT-d driver runs in the guest, PASID allocation must be
performed via virtual command interface. This patch register a
custom IOASID allocator which takes precedence over the default
IDR based allocator. The resulting IOASID allocation will always
come from the host. This ensures that PASID namespace is system-
wide.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d93c4bd..ec6f22d 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1711,6 +1711,8 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
 		if (ecap_prs(iommu->ecap))
 			intel_svm_finish_prq(iommu);
 	}
+	ioasid_unregister_allocator(&iommu->pasid_allocator);
+
 #endif
 }
 
@@ -4811,6 +4813,46 @@ static int __init platform_optin_force_iommu(void)
 	return 1;
 }
 
+static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
+{
+	struct intel_iommu *iommu = data;
+	ioasid_t ioasid;
+
+	/*
+	 * VT-d virtual command interface always uses the full 20 bit
+	 * PASID range. Host can partition guest PASID range based on
+	 * policies but it is out of guest's control.
+	 */
+	if (min < PASID_MIN || max > PASID_MAX)
+		return -EINVAL;
+
+	if (vcmd_alloc_pasid(iommu, &ioasid))
+		return INVALID_IOASID;
+
+	return ioasid;
+}
+
+static int intel_ioasid_free(ioasid_t ioasid, void *data)
+{
+	struct iommu_pasid_alloc_info *svm;
+	struct intel_iommu *iommu = data;
+
+	if (!iommu || !cap_caching_mode(iommu->cap))
+		return -EINVAL;
+	/*
+	 * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
+	 * We can only free the PASID when all the devices are unbond.
+	 */
+	svm = ioasid_find(NULL, ioasid, NULL);
+	if (!svm) {
+		pr_warn("Freeing unbond IOASID %d\n", ioasid);
+		return -EBUSY;
+	}
+	vcmd_free_pasid(iommu, ioasid);
+
+	return 0;
+}
+
 int __init intel_iommu_init(void)
 {
 	int ret = -ENODEV;
@@ -4912,6 +4954,22 @@ int __init intel_iommu_init(void)
 				       "%s", iommu->name);
 		iommu_device_set_ops(&iommu->iommu, &intel_iommu_ops);
 		iommu_device_register(&iommu->iommu);
+		if (cap_caching_mode(iommu->cap) && sm_supported(iommu)) {
+			/*
+			 * Register a custom ASID allocator if we are running
+			 * in a guest, the purpose is to have a system wide PASID
+			 * namespace among all PASID users.
+			 * There can be multiple vIOMMUs in each guest but only
+			 * one allocator is active. All vIOMMU allocators will
+			 * eventually be calling the same host allocator.
+			 */
+			iommu->pasid_allocator.alloc = intel_ioasid_alloc;
+			iommu->pasid_allocator.free = intel_ioasid_free;
+			iommu->pasid_allocator.pdata = (void *)iommu;
+			ret = ioasid_register_allocator(&iommu->pasid_allocator);
+			if (ret)
+				pr_warn("Custom PASID allocator registeration failed\n");
+		}
 	}
 
 	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index bff907b..c24c8aa 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -31,6 +31,7 @@
 #include <linux/iommu.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/dmar.h>
+#include <linux/ioasid.h>
 
 #include <asm/cacheflush.h>
 #include <asm/iommu.h>
@@ -549,6 +550,7 @@ struct intel_iommu {
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	struct page_req_dsc *prq;
 	unsigned char prq_name[16];    /* Name for PRQ interrupt */
+	struct ioasid_allocator pasid_allocator; /* Custom allocator for PASIDs */
 #endif
 	struct q_inval  *qi;            /* Queued invalidation info */
 	u32 *iommu_state; /* Store iommu states between suspend and resume.*/
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (9 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-25 10:04   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 12/19] iommu/vt-d: Move domain helper to header Jacob Pan
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Make use of generic IOASID code to manage PASID allocation,
free, and lookup.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/Kconfig       |  1 +
 drivers/iommu/intel-iommu.c |  9 ++++-----
 drivers/iommu/intel-pasid.c | 36 ------------------------------------
 drivers/iommu/intel-svm.c   | 41 ++++++++++++++++++++++++-----------------
 4 files changed, 29 insertions(+), 58 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 6f07f3b..7f92009 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -204,6 +204,7 @@ config INTEL_IOMMU_SVM
 	bool "Support for Shared Virtual Memory with Intel IOMMU"
 	depends on INTEL_IOMMU && X86
 	select PCI_PASID
+	select IOASID
 	select MMU_NOTIFIER
 	help
 	  Shared Virtual Memory (SVM) provides a facility for devices
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ec6f22d..785330a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5153,7 +5153,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
 	domain->auxd_refcnt--;
 
 	if (!domain->auxd_refcnt && domain->default_pasid > 0)
-		intel_pasid_free_id(domain->default_pasid);
+		ioasid_free(domain->default_pasid);
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -5171,9 +5171,8 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
 	if (domain->default_pasid <= 0) {
 		int pasid;
 
-		pasid = intel_pasid_alloc_id(domain, PASID_MIN,
-					     pci_max_pasids(to_pci_dev(dev)),
-					     GFP_KERNEL);
+		pasid = ioasid_alloc(NULL, PASID_MIN, pci_max_pasids(to_pci_dev(dev)) - 1,
+				domain);
 		if (pasid <= 0) {
 			pr_err("Can't allocate default pasid\n");
 			return -ENODEV;
@@ -5210,7 +5209,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
 	spin_unlock(&iommu->lock);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 	if (!domain->auxd_refcnt && domain->default_pasid > 0)
-		intel_pasid_free_id(domain->default_pasid);
+		ioasid_free(domain->default_pasid);
 
 	return ret;
 }
diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 5b1d3be..d339e8f 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -26,42 +26,6 @@
  */
 static DEFINE_SPINLOCK(pasid_lock);
 u32 intel_pasid_max_id = PASID_MAX;
-static DEFINE_IDR(pasid_idr);
-
-int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp)
-{
-	int ret, min, max;
-
-	min = max_t(int, start, PASID_MIN);
-	max = min_t(int, end, intel_pasid_max_id);
-
-	WARN_ON(in_interrupt());
-	idr_preload(gfp);
-	spin_lock(&pasid_lock);
-	ret = idr_alloc(&pasid_idr, ptr, min, max, GFP_ATOMIC);
-	spin_unlock(&pasid_lock);
-	idr_preload_end();
-
-	return ret;
-}
-
-void intel_pasid_free_id(int pasid)
-{
-	spin_lock(&pasid_lock);
-	idr_remove(&pasid_idr, pasid);
-	spin_unlock(&pasid_lock);
-}
-
-void *intel_pasid_lookup_id(int pasid)
-{
-	void *p;
-
-	spin_lock(&pasid_lock);
-	p = idr_find(&pasid_idr, pasid);
-	spin_unlock(&pasid_lock);
-
-	return p;
-}
 
 int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
 {
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 8f87304..8fff212 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -25,6 +25,7 @@
 #include <linux/dmar.h>
 #include <linux/interrupt.h>
 #include <linux/mm_types.h>
+#include <linux/ioasid.h>
 #include <asm/page.h>
 
 #include "intel-pasid.h"
@@ -211,7 +212,9 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	rcu_read_lock();
 	list_for_each_entry_rcu(sdev, &svm->devs, list) {
 		intel_pasid_tear_down_entry(svm->iommu, sdev->dev, svm->pasid);
-		intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
+		/* for emulated iommu, PASID cache invalidation implies IOTLB/DTLB */
+		if (!cap_caching_mode(svm->iommu->cap))
+			intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
 	}
 	rcu_read_unlock();
 
@@ -332,16 +335,15 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 		if (pasid_max > intel_pasid_max_id)
 			pasid_max = intel_pasid_max_id;
 
-		/* Do not use PASID 0 in caching mode (virtualised IOMMU) */
-		ret = intel_pasid_alloc_id(svm,
-					   !!cap_caching_mode(iommu->cap),
-					   pasid_max - 1, GFP_KERNEL);
-		if (ret < 0) {
+		/* Do not use PASID 0, reserved for RID to PASID */
+		svm->pasid = ioasid_alloc(NULL, PASID_MIN,
+					pasid_max - 1, svm);
+		if (svm->pasid == INVALID_IOASID) {
 			kfree(svm);
 			kfree(sdev);
+			ret = ENOSPC;
 			goto out;
 		}
-		svm->pasid = ret;
 		svm->notifier.ops = &intel_mmuops;
 		svm->mm = mm;
 		svm->flags = flags;
@@ -351,7 +353,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 		if (mm) {
 			ret = mmu_notifier_register(&svm->notifier, mm);
 			if (ret) {
-				intel_pasid_free_id(svm->pasid);
+				ioasid_free(svm->pasid);
 				kfree(svm);
 				kfree(sdev);
 				goto out;
@@ -367,7 +369,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 		if (ret) {
 			if (mm)
 				mmu_notifier_unregister(&svm->notifier, mm);
-			intel_pasid_free_id(svm->pasid);
+			ioasid_free(svm->pasid);
 			kfree(svm);
 			kfree(sdev);
 			goto out;
@@ -400,7 +402,12 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
 	if (!iommu)
 		goto out;
 
-	svm = intel_pasid_lookup_id(pasid);
+	svm = ioasid_find(NULL, pasid, NULL);
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+
 	if (!svm)
 		goto out;
 
@@ -422,7 +429,7 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
 				kfree_rcu(sdev, rcu);
 
 				if (list_empty(&svm->devs)) {
-					intel_pasid_free_id(svm->pasid);
+					ioasid_free(svm->pasid);
 					if (svm->mm)
 						mmu_notifier_unregister(&svm->notifier, svm->mm);
 
@@ -457,10 +464,11 @@ int intel_svm_is_pasid_valid(struct device *dev, int pasid)
 	if (!iommu)
 		goto out;
 
-	svm = intel_pasid_lookup_id(pasid);
-	if (!svm)
+	svm = ioasid_find(NULL, pasid, NULL);
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
 		goto out;
-
+	}
 	/* init_mm is used in this case */
 	if (!svm->mm)
 		ret = 1;
@@ -567,13 +575,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 
 		if (!svm || svm->pasid != req->pasid) {
 			rcu_read_lock();
-			svm = intel_pasid_lookup_id(req->pasid);
+			svm = ioasid_find(NULL, req->pasid, NULL);
 			/* It *can't* go away, because the driver is not permitted
 			 * to unbind the mm while any page faults are outstanding.
 			 * So we only need RCU to protect the internal idr code. */
 			rcu_read_unlock();
-
-			if (!svm) {
+			if (IS_ERR(svm) || !svm) {
 				pr_err("%s: Page request for invalid PASID %d: %08llx %08llx\n",
 				       iommu->name, req->pasid, ((unsigned long long *)req)[0],
 				       ((unsigned long long *)req)[1]);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 12/19] iommu/vt-d: Move domain helper to header
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (10 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-24 17:27   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 13/19] iommu/vt-d: Add nested translation support Jacob Pan
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Move domainer helper to header to be used by SVA code.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 6 ------
 include/linux/intel-iommu.h | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 785330a..77bbe1b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -427,12 +427,6 @@ static void init_translation_status(struct intel_iommu *iommu)
 		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
 }
 
-/* Convert generic 'struct iommu_domain to private struct dmar_domain */
-static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
-{
-	return container_of(dom, struct dmar_domain, domain);
-}
-
 static int __init intel_iommu_setup(char *str)
 {
 	if (!str)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index c24c8aa..48fa164 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -597,6 +597,12 @@ static inline void __iommu_flush_cache(
 		clflush_cache_range(addr, size);
 }
 
+/* Convert generic 'struct iommu_domain to private struct dmar_domain */
+static inline struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
+{
+	return container_of(dom, struct dmar_domain, domain);
+}
+
 /*
  * 0: readable
  * 1: writable
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 13/19] iommu/vt-d: Add nested translation support
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (11 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 12/19] iommu/vt-d: Move domain helper to header Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-26 15:42   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 14/19] iommu: Add guest PASID bind function Jacob Pan
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan, Liu, Yi L

Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
With PASID granular translation type set to 0x11b, translation
result from the first level(FL) also subject to a second level(SL)
page table translation. This mode is used for SVA virtualization,
where FL performs guest virtual to guest physical translation and
SL performs guest physical to host physical translation.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/intel-pasid.c | 101 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-pasid.h |  11 +++++
 2 files changed, 112 insertions(+)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index d339e8f..04127cf 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -688,3 +688,104 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 
 	return 0;
 }
+
+/**
+ * intel_pasid_setup_nested() - Set up PASID entry for nested translation
+ * which is used for vSVA. The first level page tables are used for
+ * GVA-GPA translation in the guest, second level page tables are used
+ * for GPA to HPA translation.
+ *
+ * @iommu:      Iommu which the device belong to
+ * @dev:        Device to be set up for translation
+ * @pgd:        First level PGD, treated as GPA
+ * @pasid:      PASID to be programmed in the device PASID table
+ * @flags:      Additional info such as supervisor PASID
+ * @domain:     Domain info for setting up second level page tables
+ * @addr_width: Address width of the first level (guest)
+ */
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+			struct device *dev, pgd_t *gpgd,
+			int pasid, int flags,
+			struct dmar_domain *domain,
+			int addr_width)
+{
+	struct pasid_entry *pte;
+	struct dma_pte *pgd;
+	u64 pgd_val;
+	int agaw;
+	u16 did;
+
+	if (!ecap_nest(iommu->ecap)) {
+		pr_err("No nested translation support on %s\n",
+		       iommu->name);
+		return -EINVAL;
+	}
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	pasid_clear_entry(pte);
+
+	/* Sanity checking performed by caller to make sure address
+	 * width matching in two dimensions:
+	 * 1. CPU vs. IOMMU
+	 * 2. Guest vs. Host.
+	 */
+	switch (addr_width) {
+	case 57:
+		pasid_set_flpm(pte, 1);
+		break;
+	case 48:
+		pasid_set_flpm(pte, 0);
+		break;
+	default:
+		dev_err(dev, "Invalid paging mode %d\n", addr_width);
+		return -EINVAL;
+	}
+
+	/* Setup the first level page table pointer in GPA */
+	pasid_set_flptr(pte, (u64)gpgd);
+	if (flags & PASID_FLAG_SUPERVISOR_MODE) {
+		if (!ecap_srs(iommu->ecap)) {
+			pr_err("No supervisor request support on %s\n",
+			       iommu->name);
+			return -EINVAL;
+		}
+		pasid_set_sre(pte);
+	}
+
+	/* Setup the second level based on the given domain */
+	pgd = domain->pgd;
+
+	for (agaw = domain->agaw; agaw != iommu->agaw; agaw--) {
+		pgd = phys_to_virt(dma_pte_addr(pgd));
+		if (!dma_pte_present(pgd)) {
+			dev_err(dev, "Invalid domain page table\n");
+			return -EINVAL;
+		}
+	}
+	pgd_val = virt_to_phys(pgd);
+	pasid_set_slptr(pte, pgd_val);
+	pasid_set_fault_enable(pte);
+
+	did = domain->iommu_did[iommu->seq_id];
+	pasid_set_domain_id(pte, did);
+
+	pasid_set_address_width(pte, agaw);
+	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+
+	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+	pasid_set_present(pte);
+
+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pte, sizeof(*pte));
+
+	if (cap_caching_mode(iommu->cap)) {
+		pasid_cache_invalidation_with_pasid(iommu, did, pasid);
+		iotlb_invalidation_with_pasid(iommu, did, pasid);
+	} else
+		iommu_flush_write_buffer(iommu);
+
+	return 0;
+}
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 0999dfe..c4fc1af 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -42,6 +42,7 @@
  * to vmalloc or even module mappings.
  */
 #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
+#define PASID_FLAG_NESTED		BIT(1)
 
 struct pasid_dir_entry {
 	u64 val;
@@ -51,6 +52,11 @@ struct pasid_entry {
 	u64 val[8];
 };
 
+#define PASID_ENTRY_PGTT_FL_ONLY	(1)
+#define PASID_ENTRY_PGTT_SL_ONLY	(2)
+#define PASID_ENTRY_PGTT_NESTED		(3)
+#define PASID_ENTRY_PGTT_PT		(4)
+
 /* The representative of a PASID table */
 struct pasid_table {
 	void			*table;		/* pasid table pointer */
@@ -77,6 +83,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, int pasid);
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+			struct device *dev, pgd_t *pgd,
+			int pasid, int flags,
+			struct dmar_domain *domain,
+			int addr_width);
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 struct device *dev, int pasid);
 int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 14/19] iommu: Add guest PASID bind function
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (12 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 13/19] iommu/vt-d: Add nested translation support Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-26 15:53   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support Jacob Pan
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Guest shared virtual address (SVA) may require host to shadow guest
PASID tables. Guest PASID can also be allocated from the host via
enlightened interfaces. In this case, guest needs to bind the guest
mm, i.e. cr3 in guest phisical address to the actual PASID table in
the host IOMMU. Nesting will be turned on such that guest virtual
address can go through a two level translation:
- 1st level translates GVA to GPA
- 2nd level translates GPA to HPA
This patch introduces APIs to bind guest PASID data to the assigned
device entry in the physical IOMMU. See the diagram below for usage
explaination.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process mm, FL only |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |
    '-------------'
Guest
------| Shadow |--------------------------|------------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.---------------------.
    |             |   |Set SL to GPA-HPA    |
    |             |   '---------------------'
    '-------------'

Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommu.c      | 20 ++++++++++++++++++++
 include/linux/iommu.h      | 10 ++++++++++
 include/uapi/linux/iommu.h | 15 ++++++++++++++-
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 498c28a..072f8f3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1561,6 +1561,26 @@ int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
 
+int iommu_sva_bind_gpasid(struct iommu_domain *domain,
+			struct device *dev, struct gpasid_bind_data *data)
+{
+	if (unlikely(!domain->ops->sva_bind_gpasid))
+		return -ENODEV;
+
+	return domain->ops->sva_bind_gpasid(domain, dev, data);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_bind_gpasid);
+
+int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev,
+			int pasid)
+{
+	if (unlikely(!domain->ops->sva_unbind_gpasid))
+		return -ENODEV;
+
+	return domain->ops->sva_unbind_gpasid(dev, pasid);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4b92e4b..611388e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -231,6 +231,8 @@ struct iommu_sva_ops {
  * @detach_pasid_table: detach the pasid table
  * @cache_invalidate: invalidate translation caches
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @sva_bind_gpasid: bind guest pasid and mm
+ * @sva_unbind_gpasid: unbind guest pasid and mm
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -295,6 +297,10 @@ struct iommu_ops {
 
 	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
 				struct iommu_cache_invalidate_info *inv_info);
+	int (*sva_bind_gpasid)(struct iommu_domain *domain,
+			struct device *dev, struct gpasid_bind_data *data);
+
+	int (*sva_unbind_gpasid)(struct device *dev, int pasid);
 
 	unsigned long pgsize_bitmap;
 };
@@ -409,6 +415,10 @@ extern void iommu_detach_pasid_table(struct iommu_domain *domain);
 extern int iommu_cache_invalidate(struct iommu_domain *domain,
 				  struct device *dev,
 				  struct iommu_cache_invalidate_info *inv_info);
+extern int iommu_sva_bind_gpasid(struct iommu_domain *domain,
+		struct device *dev, struct gpasid_bind_data *data);
+extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
+				struct device *dev, int pasid);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 61a3fb7..5c95905 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -235,6 +235,19 @@ struct iommu_cache_invalidate_info {
 		struct iommu_inv_addr_info addr_info;
 	};
 };
-
+/**
+ * struct gpasid_bind_data - Information about device and guest PASID binding
+ * @gcr3:	Guest CR3 value from guest mm
+ * @pasid:	Process address space ID used for the guest mm
+ * @addr_width:	Guest address width. Paging mode can also be derived.
+ */
+struct gpasid_bind_data {
+	__u64 gcr3;
+	__u32 pasid;
+	__u32 addr_width;
+	__u32 flags;
+#define	IOMMU_SVA_GPASID_SRE	BIT(0) /* supervisor request */
+	__u8 padding[4];
+};
 
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (13 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 14/19] iommu: Add guest PASID bind function Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-26 16:15   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 16/19] iommu/vtd: Clean up for SVM device list Jacob Pan
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan, Liu, Yi L

When supporting guest SVA with emulated IOMMU, the guest PASID
table is shadowed in VMM. Updates to guest vIOMMU PASID table
will result in PASID cache flush which will be passed down to
the host as bind guest PASID calls.

For the SL page tables, it will be harvested from device's
default domain (request w/o PASID), or aux domain in case of
mediated device.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c |   4 +
 drivers/iommu/intel-svm.c   | 174 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  10 ++-
 include/linux/intel-svm.h   |   7 ++
 4 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 77bbe1b..89989b5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5768,6 +5768,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.dev_enable_feat	= intel_iommu_dev_enable_feat,
 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+	.sva_bind_gpasid	= intel_svm_bind_gpasid,
+	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
+#endif
 };
 
 static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 8fff212..0a973c2 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -227,6 +227,180 @@ static const struct mmu_notifier_ops intel_mmuops = {
 
 static DEFINE_MUTEX(pasid_mutex);
 static LIST_HEAD(global_svm_list);
+#define for_each_svm_dev() \
+	list_for_each_entry(sdev, &svm->devs, list)	\
+	if (dev == sdev->dev)				\
+
+int intel_svm_bind_gpasid(struct iommu_domain *domain,
+			struct device *dev,
+			struct gpasid_bind_data *data)
+{
+	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+	struct intel_svm_dev *sdev;
+	struct intel_svm *svm = NULL;
+	struct dmar_domain *ddomain;
+	int pasid_max;
+	int ret = 0;
+
+	if (WARN_ON(!iommu) || !data)
+		return -EINVAL;
+
+	if (dev_is_pci(dev)) {
+		pasid_max = pci_max_pasids(to_pci_dev(dev));
+		if (pasid_max < 0)
+			return -EINVAL;
+	} else
+		pasid_max = 1 << 20;
+
+	if (data->pasid <= 0 || data->pasid >= pasid_max)
+		return -EINVAL;
+
+	ddomain = to_dmar_domain(domain);
+	/* REVISIT:
+	 * Sanity check adddress width and paging mode support
+	 * width matching in two dimensions:
+	 * 1. paging mode CPU <= IOMMU
+	 * 2. address width Guest <= Host.
+	 */
+	mutex_lock(&pasid_mutex);
+	svm = ioasid_find(NULL, data->pasid, NULL);
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+	if (svm) {
+		if (list_empty(&svm->devs)) {
+			dev_err(dev, "GPASID %d has no devices bond but SVA is allocated\n",
+				data->pasid);
+			ret = -ENODEV; /*
+					* If we found svm for the PASID, there must be at
+					* least one device bond, otherwise svm should be freed.
+					*/
+			goto out;
+		}
+		for_each_svm_dev() {
+			/* In case of multiple sub-devices of the same pdev assigned, we should
+			 * allow multiple bind calls with the same PASID and pdev.
+			 */
+			sdev->users++;
+			goto out;
+		}
+	} else {
+		/* We come here when PASID has never been bond to a device. */
+		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+		if (!svm) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		/* REVISIT: upper layer/VFIO can track host process that bind the PASID.
+		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
+		 * ownership.
+		 */
+		svm->mm = get_task_mm(current);
+		svm->pasid = data->pasid;
+		refcount_set(&svm->refs, 0);
+		ioasid_set_data(data->pasid, svm);
+		INIT_LIST_HEAD_RCU(&svm->devs);
+		INIT_LIST_HEAD(&svm->list);
+
+		mmput(svm->mm);
+	}
+	svm->flags |= SVM_FLAG_GUEST_MODE;
+	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
+	if (!sdev) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	sdev->dev = dev;
+	sdev->users = 1;
+
+	/* Set up device context entry for PASID if not enabled already */
+	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
+	if (ret) {
+		dev_err(dev, "Failed to enable PASID capability\n");
+		kfree(sdev);
+		goto out;
+	}
+
+	/*
+	 * For guest bind, we need to set up PASID table entry as follows:
+	 * - FLPM matches guest paging mode
+	 * - turn on nested mode
+	 * - SL guest address width matching
+	 */
+	ret = intel_pasid_setup_nested(iommu,
+				dev,
+				(pgd_t *)data->gcr3,
+				data->pasid,
+				data->flags,
+				ddomain,
+				data->addr_width);
+	if (ret) {
+		dev_err(dev, "Failed to set up PASID %d in nested mode, Err %d\n",
+			data->pasid, ret);
+		kfree(sdev);
+		goto out;
+	}
+
+	init_rcu_head(&sdev->rcu);
+	refcount_inc(&svm->refs);
+	list_add_rcu(&sdev->list, &svm->devs);
+ out:
+	mutex_unlock(&pasid_mutex);
+	return ret;
+}
+
+int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+{
+	struct intel_svm_dev *sdev;
+	struct intel_iommu *iommu;
+	struct intel_svm *svm;
+	int ret = -EINVAL;
+
+	mutex_lock(&pasid_mutex);
+	iommu = intel_svm_device_to_iommu(dev);
+	if (!iommu)
+		goto out;
+
+	svm = ioasid_find(NULL, pasid, NULL);
+	if (IS_ERR(svm)) {
+		ret = PTR_ERR(svm);
+		goto out;
+	}
+
+	if (!svm)
+		goto out;
+
+	for_each_svm_dev() {
+		ret = 0;
+		sdev->users--;
+		if (!sdev->users) {
+			list_del_rcu(&sdev->list);
+			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+			/* TODO: Drain in flight PRQ for the PASID since it
+			 * may get reused soon, we don't want to
+			 * confuse with its previous live.
+			 * intel_svm_drain_prq(dev, pasid);
+			 */
+			kfree_rcu(sdev, rcu);
+
+			if (list_empty(&svm->devs)) {
+				list_del(&svm->list);
+				kfree(svm);
+				/*
+				 * We do not free PASID here until explicit call
+				 * from the guest to free.
+				 */
+				ioasid_set_data(pasid, NULL);
+			}
+		}
+		break;
+	}
+ out:
+	mutex_unlock(&pasid_mutex);
+
+	return ret;
+}
 
 int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
 {
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 48fa164..5d67d0d4 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -677,7 +677,9 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
 int intel_svm_init(struct intel_iommu *iommu);
 extern int intel_svm_enable_prq(struct intel_iommu *iommu);
 extern int intel_svm_finish_prq(struct intel_iommu *iommu);
-
+extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
+		struct device *dev, struct gpasid_bind_data *data);
+extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
 struct svm_dev_ops;
 
 struct intel_svm_dev {
@@ -693,12 +695,16 @@ struct intel_svm_dev {
 
 struct intel_svm {
 	struct mmu_notifier notifier;
-	struct mm_struct *mm;
+	union {
+		struct mm_struct *mm;
+		u64 gcr3;
+	};
 	struct intel_iommu *iommu;
 	int flags;
 	int pasid;
 	struct list_head devs;
 	struct list_head list;
+	refcount_t refs; /* # of devs bond to the PASID */
 };
 
 extern struct intel_iommu *intel_svm_device_to_iommu(struct device *dev);
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index e3f7631..34b0a3b 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -52,6 +52,13 @@ struct svm_dev_ops {
  * do such IOTLB flushes automatically.
  */
 #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
+/*
+ * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
+ * In this case the mm_struct is in the guest kernel or userspace, its life
+ * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
+ * means to bind/unbind guest CR3 with PASIDs allocated for a device.
+ */
+#define SVM_FLAG_GUEST_MODE	(1<<2)
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 16/19] iommu/vtd: Clean up for SVM device list
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (14 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-26 16:19   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 17/19] iommu: Add max num of cache and granu types Jacob Pan
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

Use combined macro for_each_svm_dev() to simplify SVM device iteration.

Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-svm.c | 76 ++++++++++++++++++++++-------------------------
 1 file changed, 36 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 0a973c2..39dfb2e 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -447,15 +447,13 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 				goto out;
 			}
 
-			list_for_each_entry(sdev, &svm->devs, list) {
-				if (dev == sdev->dev) {
-					if (sdev->ops != ops) {
-						ret = -EBUSY;
-						goto out;
-					}
-					sdev->users++;
-					goto success;
+			for_each_svm_dev() {
+				if (sdev->ops != ops) {
+					ret = -EBUSY;
+					goto out;
 				}
+				sdev->users++;
+				goto success;
 			}
 
 			break;
@@ -585,40 +583,38 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
 	if (!svm)
 		goto out;
 
-	list_for_each_entry(sdev, &svm->devs, list) {
-		if (dev == sdev->dev) {
-			ret = 0;
-			sdev->users--;
-			if (!sdev->users) {
-				list_del_rcu(&sdev->list);
-				/* Flush the PASID cache and IOTLB for this device.
-				 * Note that we do depend on the hardware *not* using
-				 * the PASID any more. Just as we depend on other
-				 * devices never using PASIDs that they have no right
-				 * to use. We have a *shared* PASID table, because it's
-				 * large and has to be physically contiguous. So it's
-				 * hard to be as defensive as we might like. */
-				intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
-				intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
-				kfree_rcu(sdev, rcu);
-
-				if (list_empty(&svm->devs)) {
-					ioasid_free(svm->pasid);
-					if (svm->mm)
-						mmu_notifier_unregister(&svm->notifier, svm->mm);
-
-					list_del(&svm->list);
-
-					/* We mandate that no page faults may be outstanding
-					 * for the PASID when intel_svm_unbind_mm() is called.
-					 * If that is not obeyed, subtle errors will happen.
-					 * Let's make them less subtle... */
-					memset(svm, 0x6b, sizeof(*svm));
-					kfree(svm);
-				}
+	for_each_svm_dev() {
+		ret = 0;
+		sdev->users--;
+		if (!sdev->users) {
+			list_del_rcu(&sdev->list);
+			/* Flush the PASID cache and IOTLB for this device.
+			 * Note that we do depend on the hardware *not* using
+			 * the PASID any more. Just as we depend on other
+			 * devices never using PASIDs that they have no right
+			 * to use. We have a *shared* PASID table, because it's
+			 * large and has to be physically contiguous. So it's
+			 * hard to be as defensive as we might like. */
+			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+			intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
+			kfree_rcu(sdev, rcu);
+
+			if (list_empty(&svm->devs)) {
+				ioasid_free(svm->pasid);
+				if (svm->mm)
+					mmu_notifier_unregister(&svm->notifier, svm->mm);
+
+				list_del(&svm->list);
+
+				/* We mandate that no page faults may be outstanding
+				 * for the PASID when intel_svm_unbind_mm() is called.
+				 * If that is not obeyed, subtle errors will happen.
+				 * Let's make them less subtle... */
+				memset(svm, 0x6b, sizeof(*svm));
+				kfree(svm);
 			}
-			break;
 		}
+		break;
 	}
  out:
 	mutex_unlock(&pasid_mutex);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 17/19] iommu: Add max num of cache and granu types
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (15 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 16/19] iommu/vtd: Clean up for SVM device list Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-26 16:22   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types Jacob Pan
  2019-04-23 23:31 ` [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

To convert to/from cache types and granularities between generic and
VT-d specific counterparts, a 2D arrary is used. Introduce the limits
to help define the converstion array size.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/uapi/linux/iommu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 5c95905..2d8fac8 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -197,6 +197,7 @@ struct iommu_inv_addr_info {
 	__u64	granule_size;
 	__u64	nb_granules;
 };
+#define NR_IOMMU_CACHE_INVAL_GRANU	(3)
 
 /**
  * First level/stage invalidation information
@@ -235,6 +236,7 @@ struct iommu_cache_invalidate_info {
 		struct iommu_inv_addr_info addr_info;
 	};
 };
+#define NR_IOMMU_CACHE_TYPE		(3)
 /**
  * struct gpasid_bind_data - Information about device and guest PASID binding
  * @gcr3:	Guest CR3 value from guest mm
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (16 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 17/19] iommu: Add max num of cache and granu types Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-27  9:04   ` Auger Eric
  2019-04-23 23:31 ` [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan

When Shared Virtual Memory is exposed to a guest via vIOMMU, extended
IOTLB invalidation may be passed down from outside IOMMU subsystems.
This patch adds invalidation functions that can be used for additional
translation cache types.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/dmar.c        | 48 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h | 21 ++++++++++++++++----
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 9c49300..680894e 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1357,6 +1357,20 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 	qi_submit_sync(&desc, iommu);
 }
 
+void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64 addr, u32 pasid,
+		unsigned int size_order, u64 granu)
+{
+	struct qi_desc desc;
+
+	desc.qw0 = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
+		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
+	desc.qw1 = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_IH(0) |
+		QI_EIOTLB_AM(size_order);
+	desc.qw2 = 0;
+	desc.qw3 = 0;
+	qi_submit_sync(&desc, iommu);
+}
+
 void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask)
 {
@@ -1380,6 +1394,40 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 	qi_submit_sync(&desc, iommu);
 }
 
+void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64 granu)
+{
+	struct qi_desc desc;
+
+	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
+		QI_DEV_IOTLB_PFSID(pfsid);
+	desc.qw1 |= QI_DEV_EIOTLB_GLOB(granu);
+
+	/* If S bit is 0, we only flush a single page. If S bit is set,
+	 * The least significant zero bit indicates the size. VT-d spec
+	 * 6.5.2.6
+	 */
+	if (!size)
+		desc.qw0 = QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
+	else {
+		unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
+
+		desc.qw1 = QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
+	}
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
+{
+	struct qi_desc desc;
+
+	desc.qw0 = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
+	desc.qw1 = 0;
+	desc.qw2 = 0;
+	desc.qw3 = 0;
+	qi_submit_sync(&desc, iommu);
+}
 /*
  * Disable Queued Invalidation interface.
  */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 5d67d0d4..38e5efb 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -339,7 +339,7 @@ enum {
 #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
 #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
 #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
-#define QI_IOTLB_AM(am)		(((u8)am))
+#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
 
 #define QI_CC_FM(fm)		(((u64)fm) << 48)
 #define QI_CC_SID(sid)		(((u64)sid) << 32)
@@ -357,17 +357,22 @@ enum {
 #define QI_PC_DID(did)		(((u64)did) << 16)
 #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
 
-#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
+/* PASID cache invalidation granu */
+#define QI_PC_ALL_PASIDS	0
+#define QI_PC_PASID_SEL		1
 
 #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
 #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
-#define QI_EIOTLB_AM(am)	(((u64)am))
+#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
 #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
 #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
 #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
 
+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL		1
+#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
+
 #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
 #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
 #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
@@ -658,8 +663,16 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
 			     u8 fm, u64 type);
 extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 			  unsigned int size_order, u64 type);
+extern void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64 addr,
+			u32 pasid, unsigned int size_order, u64 type);
 extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask);
+
+extern void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
+
+extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
+
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
 
 extern int dmar_ir_support(void);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function
  2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
                   ` (17 preceding siblings ...)
  2019-04-23 23:31 ` [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types Jacob Pan
@ 2019-04-23 23:31 ` Jacob Pan
  2019-04-26 17:23   ` Auger Eric
  18 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-23 23:31 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Jacob Pan, Liu, Yi L

When Shared Virtual Address (SVA) is enabled for a guest OS via
vIOMMU, we need to provide invalidation support at IOMMU API and driver
level. This patch adds Intel VT-d specific function to implement
iommu passdown invalidate API for shared virtual address.

The use case is for supporting caching structure invalidation
of assigned SVM capable devices. Emulated IOMMU exposes queue
invalidation capability and passes down all descriptors from the guest
to the physical IOMMU.

The assumption is that guest to host device ID mapping should be
resolved prior to calling IOMMU driver. Based on the device handle,
host IOMMU driver can replace certain fields before submit to the
invalidation queue.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 159 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 89989b5..54a3d22 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5338,6 +5338,164 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
 	aux_domain_remove_dev(to_dmar_domain(domain), dev);
 }
 
+/*
+ * 2D array for converting and sanitizing IOMMU generic TLB granularity to
+ * VT-d granularity. Invalidation is typically included in the unmap operation
+ * as a result of DMA or VFIO unmap. However, for assigned device where guest
+ * could own the first level page tables without being shadowed by QEMU. In
+ * this case there is no pass down unmap to the host IOMMU as a result of unmap
+ * in the guest. Only invalidations are trapped and passed down.
+ * In all cases, only first level TLB invalidation (request with PASID) can be
+ * passed down, therefore we do not include IOTLB granularity for request
+ * without PASID (second level).
+ *
+ * For an example, to find the VT-d granularity encoding for IOTLB
+ * type and page selective granularity within PASID:
+ * X: indexed by iommu cache type
+ * Y: indexed by enum iommu_inv_granularity
+ * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
+ *
+ * Granu_map array indicates validity of the table. 1: valid, 0: invalid
+ *
+ */
+const static int inv_type_granu_map[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU] = {
+	/* PASID based IOTLB, support PASID selective and page selective */
+	{0, 1, 1},
+	/* PASID based dev TLBs, only support all PASIDs or single PASID */
+	{1, 1, 0},
+	/* PASID cache */
+	{1, 1, 0}
+};
+
+const static u64 inv_type_granu_table[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU] = {
+	/* PASID based IOTLB */
+	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
+	/* PASID based dev TLBs */
+	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
+	/* PASID cache */
+	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
+};
+
+static inline int to_vtd_granularity(int type, int granu, u64 *vtd_granu)
+{
+	if (type >= NR_IOMMU_CACHE_TYPE || granu >= NR_IOMMU_CACHE_INVAL_GRANU ||
+		!inv_type_granu_map[type][granu])
+		return -EINVAL;
+
+	*vtd_granu = inv_type_granu_table[type][granu];
+
+	return 0;
+}
+
+static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
+{
+	u64 nr_pages;
+	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
+	 * granu size in contiguous memory.
+	 */
+
+	nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
+	return order_base_2(nr_pages);
+}
+
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	struct intel_iommu *iommu;
+	unsigned long flags;
+	int cache_type;
+	u8 bus, devfn;
+	u16 did, sid;
+	int ret = 0;
+	u64 granu;
+	u64 size;
+
+	if (!inv_info || !dmar_domain ||
+		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+		return -EINVAL;
+
+	if (!dev || !dev_is_pci(dev))
+		return -ENODEV;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	spin_lock(&iommu->lock);
+	spin_lock_irqsave(&device_domain_lock, flags);
+	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+	if (!info) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	sid = PCI_DEVID(bus, devfn);
+	size = to_vtd_size(inv_info->addr_info.granule_size, inv_info->addr_info.nb_granules);
+
+	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, NR_IOMMU_CACHE_TYPE) {
+
+		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
+		if (ret) {
+			pr_err("Invalid range type %d, granu %d\n", cache_type,
+				inv_info->granularity);
+			break;
+		}
+
+		switch (BIT(cache_type)) {
+		case IOMMU_CACHE_INV_TYPE_IOTLB:
+			if (size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
+				pr_err("Address out of range, 0x%llx, size order %llu\n",
+					inv_info->addr_info.addr, size);
+				ret = -ERANGE;
+				goto out_unlock;
+			}
+
+			qi_flush_piotlb(iommu, did, mm_to_dma_pfn(inv_info->addr_info.addr),
+					inv_info->addr_info.pasid,
+					size, granu);
+
+			/*
+			 * Always flush device IOTLB if ATS is enabled since guest
+			 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
+			 * down. REVISIT: cannot assume Linux guest
+			 */
+			if (info->ats_enabled) {
+				qi_flush_dev_piotlb(iommu, sid, info->pfsid,
+						inv_info->addr_info.pasid, info->ats_qdep,
+						inv_info->addr_info.addr, size,
+						granu);
+			}
+			break;
+		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
+			if (info->ats_enabled) {
+				qi_flush_dev_piotlb(iommu, sid, info->pfsid,
+						inv_info->addr_info.pasid, info->ats_qdep,
+						inv_info->addr_info.addr, size,
+						granu);
+			} else
+				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
+
+			break;
+		case IOMMU_CACHE_INV_TYPE_PASID:
+			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid);
+
+			break;
+		default:
+			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
+				cache_type);
+			ret = -EINVAL;
+		}
+	}
+out_unlock:
+	spin_unlock(&iommu->lock);
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	return ret;
+}
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot)
@@ -5769,6 +5927,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
 #ifdef CONFIG_INTEL_IOMMU_SVM
+	.cache_invalidate	= intel_iommu_sva_invalidate,
 	.sva_bind_gpasid	= intel_svm_bind_gpasid,
 	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
 #endif
-- 
2.7.4


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-23 23:31 ` [PATCH v2 06/19] drivers core: Add I/O ASID allocator Jacob Pan
@ 2019-04-24  6:19   ` Christoph Hellwig
  2019-04-25 18:19     ` Jacob Pan
  2019-04-25 10:17   ` Auger Eric
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2019-04-24  6:19 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker, Yi Liu, Tian, Kevin,
	Raj Ashok, Christoph Hellwig, Lu Baolu, Andriy Shevchenko

On Tue, Apr 23, 2019 at 04:31:06PM -0700, Jacob Pan wrote:
> The allocator doesn't really belong in drivers/iommu because some
> drivers would like to allocate PASIDs for devices that aren't managed by
> an IOMMU, using the same ID space as IOMMU. It doesn't really belong in
> drivers/pci either since platform device also support PASID. Add the
> allocator in drivers/base.

I'd still add it to drivers/iommu, just selectable separately from the
core iommu code..

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID
  2019-04-23 23:31 ` [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
@ 2019-04-24 17:27   ` Auger Eric
  2019-04-26 20:11     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-24 17:27 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> When VT-d driver runs in the guest, PASID allocation must be
> performed via virtual command interface. This patch register a
registers
> custom IOASID allocator which takes precedence over the default
> IDR based allocator.
nit: s/IDR based// . It is xarray based now.
 The resulting IOASID allocation will always
> come from the host. This ensures that PASID namespace is system-
> wide.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |  2 ++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index d93c4bd..ec6f22d 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1711,6 +1711,8 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
>  		if (ecap_prs(iommu->ecap))
>  			intel_svm_finish_prq(iommu);
>  	}
> +	ioasid_unregister_allocator(&iommu->pasid_allocator);
> +
>  #endif
>  }
>  
> @@ -4811,6 +4813,46 @@ static int __init platform_optin_force_iommu(void)
>  	return 1;
>  }
>  
> +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
> +{
> +	struct intel_iommu *iommu = data;
> +	ioasid_t ioasid;
> +
> +	/*
> +	 * VT-d virtual command interface always uses the full 20 bit
> +	 * PASID range. Host can partition guest PASID range based on
> +	 * policies but it is out of guest's control.
> +	 */
The above comment does not exactly relate to the check below
> +	if (min < PASID_MIN || max > PASID_MAX)
> +		return -EINVAL;
> +
> +	if (vcmd_alloc_pasid(iommu, &ioasid))
> +		return INVALID_IOASID;
> +
> +	return ioasid;
> +}
> +
> +static int intel_ioasid_free(ioasid_t ioasid, void *data)
> +{
> +	struct iommu_pasid_alloc_info *svm;
> +	struct intel_iommu *iommu = data;
> +
> +	if (!iommu || !cap_caching_mode(iommu->cap))
> +		return -EINVAL;
can !cap_caching_mode(iommu->cap) be true as the allocator only is set
if CM?
> +	/*
> +	 * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> +	 * We can only free the PASID when all the devices are unbond.
> +	 */
> +	svm = ioasid_find(NULL, ioasid, NULL);
> +	if (!svm) {
you can avoid using the local svm variable.
> +		pr_warn("Freeing unbond IOASID %d\n", ioasid);
unbound
> +		return -EBUSY;
-EINVAL?
> +	}
> +	vcmd_free_pasid(iommu, ioasid);
> +
> +	return 0;
> +}
> +
>  int __init intel_iommu_init(void)
>  {
>  	int ret = -ENODEV;
> @@ -4912,6 +4954,22 @@ int __init intel_iommu_init(void)
>  				       "%s", iommu->name);
>  		iommu_device_set_ops(&iommu->iommu, &intel_iommu_ops);
>  		iommu_device_register(&iommu->iommu);
> +		if (cap_caching_mode(iommu->cap) && sm_supported(iommu)) {
so shouldn't you test VCCAP_REG as well?
> +			/*
> +			 * Register a custom ASID allocator if we are running
> +			 * in a guest, the purpose is to have a system wide PASID
> +			 * namespace among all PASID users.
> +			 * There can be multiple vIOMMUs in each guest but only
> +			 * one allocator is active. All vIOMMU allocators will
> +			 * eventually be calling the same host allocator.
> +			 */
> +			iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> +			iommu->pasid_allocator.free = intel_ioasid_free;
> +			iommu->pasid_allocator.pdata = (void *)iommu;
> +			ret = ioasid_register_allocator(&iommu->pasid_allocator);
> +			if (ret)
> +				pr_warn("Custom PASID allocator registeration failed\n");
registration
> +		}
>  	}
>  
>  	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index bff907b..c24c8aa 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -31,6 +31,7 @@
>  #include <linux/iommu.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/dmar.h>
> +#include <linux/ioasid.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/iommu.h>
> @@ -549,6 +550,7 @@ struct intel_iommu {
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	struct page_req_dsc *prq;
>  	unsigned char prq_name[16];    /* Name for PRQ interrupt */
> +	struct ioasid_allocator pasid_allocator; /* Custom allocator for PASIDs */
>  #endif
>  	struct q_inval  *qi;            /* Queued invalidation info */
>  	u32 *iommu_state; /* Store iommu states between suspend and resume.*/
> 

Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 12/19] iommu/vt-d: Move domain helper to header
  2019-04-23 23:31 ` [PATCH v2 12/19] iommu/vt-d: Move domain helper to header Jacob Pan
@ 2019-04-24 17:27   ` Auger Eric
  0 siblings, 0 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-24 17:27 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob

On 4/24/19 1:31 AM, Jacob Pan wrote:
> Move domainer helper to header to be used by SVA code.
s/domainer/domain
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  drivers/iommu/intel-iommu.c | 6 ------
>  include/linux/intel-iommu.h | 6 ++++++
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 785330a..77bbe1b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -427,12 +427,6 @@ static void init_translation_status(struct intel_iommu *iommu)
>  		iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
>  }
>  
> -/* Convert generic 'struct iommu_domain to private struct dmar_domain */
> -static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
> -{
> -	return container_of(dom, struct dmar_domain, domain);
> -}
> -
>  static int __init intel_iommu_setup(char *str)
>  {
>  	if (!str)
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index c24c8aa..48fa164 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -597,6 +597,12 @@ static inline void __iommu_flush_cache(
>  		clflush_cache_range(addr, size);
>  }
>  
> +/* Convert generic 'struct iommu_domain to private struct dmar_domain */
> +static inline struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
> +{
> +	return container_of(dom, struct dmar_domain, domain);
> +}
> +
>  /*
>   * 0: readable
>   * 1: writable
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-23 23:31 ` [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation Jacob Pan
@ 2019-04-24 17:27   ` Auger Eric
  2019-04-25  7:12     ` Liu, Yi L
  2019-04-25 23:40     ` Jacob Pan
  0 siblings, 2 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-24 17:27 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the
> IOMMU driver should rely on the emulation software to allocate
> and free PASID IDs.
Do we make the decision depending on the CM or depending on the VCCAP_REG?

VCCAP_REG description says:

If Set, software must use Virtual Command Register interface to
allocate and free PASIDs.

 The Intel vt-d spec revision 3.0 defines a
> register set to support this. This includes a capability register,
> a virtual command register and a virtual response register. Refer
> to section 10.4.42, 10.4.43, 10.4.44 for more information.
> 
> This patch adds the enlightened PASID allocation/free interfaces
For mu curiosity why is it called "enlightened"?
> via the virtual command register.
> 
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel-pasid.c | 70 +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel-pasid.h | 13 ++++++++-
>  include/linux/intel-iommu.h |  2 ++
>  3 files changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 03b12d2..5b1d3be 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -63,6 +63,76 @@ void *intel_pasid_lookup_id(int pasid)
>  	return p;
>  }
>  
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
> +{
> +	u64 res;
> +	u64 cap;
> +	u8 err_code;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	if (!ecap_vcs(iommu->ecap)) {
> +		pr_warn("IOMMU: %s: Hardware doesn't support virtual command\n",
> +			iommu->name);
nit: other pr_* messages don't have the "IOMMU: %s:" prefix.
> +		return -ENODEV;
> +	}
> +
> +	cap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
> +	if (!(cap & DMA_VCS_PAS)) {
> +		pr_warn("IOMMU: %s: Emulation software doesn't support PASID allocation\n",
> +			iommu->name);
> +		return -ENODEV;
> +	}
> +
> +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
> +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +		      !(res & VCMD_VRSP_IP), res);
> +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +
> +	err_code = VCMD_VRSP_EC(res);
> +	switch (err_code) {
> +	case VCMD_VRSP_EC_SUCCESS:
> +		*pasid = VCMD_VRSP_RESULE(res);
> +		break;
> +	case VCMD_VRSP_EC_UNAVAIL:
> +		pr_info("IOMMU: %s: No PASID available\n", iommu->name);
> +		ret = -ENOMEM;
> +		break;
> +	default:
> +		ret = -ENODEV;
> +		pr_warn("IOMMU: %s: Unkonwn error code %d\n",
unknown
> +			iommu->name, err_code);
> +	}
> +
> +	return ret;
> +}
> +
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
> +{
> +	u64 res;
> +	u8 err_code;
> +	unsigned long flags;
Shall we check as well the cap is set?
> +
> +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, (pasid << 8) | VCMD_CMD_FREE);
> +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +		      !(res & VCMD_VRSP_IP), res);
> +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +
> +	err_code = VCMD_VRSP_EC(res);
> +	switch (err_code) {
> +	case VCMD_VRSP_EC_SUCCESS:
> +		break;
> +	case VCMD_VRSP_EC_INVAL:
> +		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
> +		break;
> +	default:
> +		pr_warn("IOMMU: %s: Unkonwn error code %d\n",
unknown
> +			iommu->name, err_code);
> +	}
> +}
> +
>  /*
>   * Per device pasid table management:
>   */
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 23537b3..0999dfe 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -19,6 +19,16 @@
>  #define PASID_PDE_SHIFT			6
>  #define MAX_NR_PASID_BITS		20
>  
> +/* Virtual command interface for enlightened pasid management. */
> +#define VCMD_CMD_ALLOC			0x1
> +#define VCMD_CMD_FREE			0x2
> +#define VCMD_VRSP_IP			0x1
> +#define VCMD_VRSP_EC(e)			(((e) >> 1) & 0x3)
s/EC/SC? for Status Code and below
> +#define VCMD_VRSP_EC_SUCCESS		0
> +#define VCMD_VRSP_EC_UNAVAIL		1
nit: _NO_VALID_PASID
> +#define VCMD_VRSP_EC_INVAL		1
nit: _INVALID_PASID
> +#define VCMD_VRSP_RESULE(e)		(((e) >> 8) & 0xfffff)
nit: s/RESULE/RSLT?
> +
>  /*
>   * Domain ID reserved for pasid entries programmed for first-level
>   * only and pass-through transfer modes.
> @@ -69,5 +79,6 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct device *dev, int pasid);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>  				 struct device *dev, int pasid);
> -
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid);
>  #endif /* __INTEL_PASID_H */
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 6925a18..bff907b 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -173,6 +173,7 @@
>  #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
>  #define ecap_flts(e)		(((e) >> 47) & 0x1)
>  #define ecap_slts(e)		(((e) >> 46) & 0x1)
> +#define ecap_vcs(e)		(((e) >> 44) & 0x1)
>  #define ecap_smts(e)		(((e) >> 43) & 0x1)
>  #define ecap_dit(e)		((e >> 41) & 0x1)
>  #define ecap_pasid(e)		((e >> 40) & 0x1)
> @@ -289,6 +290,7 @@
>  
>  /* PRS_REG */
>  #define DMA_PRS_PPR	((u32)1)
> +#define DMA_VCS_PAS	((u64)1)
>  
>  #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts)			\
>  do {									\
> 

Thanks

Eric


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-24 17:27   ` Auger Eric
@ 2019-04-25  7:12     ` Liu, Yi L
  2019-04-25  7:40       ` Auger Eric
  2019-04-25 23:40     ` Jacob Pan
  1 sibling, 1 reply; 74+ messages in thread
From: Liu, Yi L @ 2019-04-25  7:12 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj, Ashok, Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Eric,

> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Thursday, April 25, 2019 1:28 AM
> To: Jacob Pan <jacob.jun.pan@linux.intel.com>; iommu@lists.linux-foundation.org;
> Subject: Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
> 
> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the IOMMU
> > driver should rely on the emulation software to allocate and free
> > PASID IDs.
> Do we make the decision depending on the CM or depending on the VCCAP_REG?
> 
> VCCAP_REG description says:
> 
> If Set, software must use Virtual Command Register interface to allocate and free
> PASIDs.

The answer is it depends on the ECAP.VCS and then the PASID allocation bit in
VCCAP_REG. But VCS bit implies the iommu is a software implementation
(vIOMMU) of vt-d architecture. Pls refer to the descriptions of "Virtual
Command Support" in vt-d 3.0 spec.

"Hardware implementations of this architecture report a value of 0
in this field. Software implementations (emulation) of this
architecture may report VCS=1."

Thanks,
Yi Liu


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-25  7:12     ` Liu, Yi L
@ 2019-04-25  7:40       ` Auger Eric
  2019-04-25 23:01         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-25  7:40 UTC (permalink / raw)
  To: Liu, Yi L, Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Tian, Kevin, Raj, Ashok, Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Liu,

On 4/25/19 9:12 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric [mailto:eric.auger@redhat.com]
>> Sent: Thursday, April 25, 2019 1:28 AM
>> To: Jacob Pan <jacob.jun.pan@linux.intel.com>; iommu@lists.linux-foundation.org;
>> Subject: Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
>>
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>>
>>> If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the IOMMU
>>> driver should rely on the emulation software to allocate and free
>>> PASID IDs.
>> Do we make the decision depending on the CM or depending on the VCCAP_REG?
>>
>> VCCAP_REG description says:
>>
>> If Set, software must use Virtual Command Register interface to allocate and free
>> PASIDs.
> 
> The answer is it depends on the ECAP.VCS and then the PASID allocation bit in
> VCCAP_REG. But VCS bit implies the iommu is a software implementation
> (vIOMMU) of vt-d architecture. Pls refer to the descriptions of "Virtual
> Command Support" in vt-d 3.0 spec.
> 
> "Hardware implementations of this architecture report a value of 0
> in this field. Software implementations (emulation) of this
> architecture may report VCS=1."

OK I understand. But strictly speaking a vIOMMU may not implement CM.
But that's nitpicking ;-)

Thanks

Eric
> 
> Thanks,
> Yi Liu
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 08/19] ioasid: Add custom IOASID allocator
  2019-04-23 23:31 ` [PATCH v2 08/19] ioasid: Add custom IOASID allocator Jacob Pan
@ 2019-04-25 10:03   ` Auger Eric
  2019-04-25 21:29     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-25 10:03 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> Sometimes, IOASID allocation must be handled by platform specific
> code. The use cases are guest vIOMMU and pvIOMMU where IOASIDs need
> to be allocated by the host via enlightened or paravirt interfaces.
> 
> This patch adds an extension to the IOASID allocator APIs such that
> platform drivers can register a custom allocator, possibly at boot
> time, to take over the allocation. Xarray is still used for tracking
> and searching purposes internal to the IOASID code. Private data of
> an IOASID can also be set after the allocation.
> 
> There can be multiple custom allocators registered but only one is
> used at a time. In case of hot removal of devices that provides the
> allocator, all IOASIDs must be freed prior to unregistering the
> allocator. Default XArray based allocator cannot be mixed with
> custom allocators, i.e. custom allocators will not be used if there
> are outstanding IOASIDs allocated by the default XA allocator.

What's the exact use case behind allowing several custom IOASID
allocators to be registered?
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/base/ioasid.c  | 182 ++++++++++++++++++++++++++++++++++++++++++++++---
>  include/linux/ioasid.h |  15 +++-
>  2 files changed, 187 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
> index c4012aa..5cb36a4 100644
> --- a/drivers/base/ioasid.c
> +++ b/drivers/base/ioasid.c
> @@ -17,6 +17,120 @@ struct ioasid_data {
>  };
>  
>  static DEFINE_XARRAY_ALLOC(ioasid_xa);
> +static DEFINE_MUTEX(ioasid_allocator_lock);
> +static struct ioasid_allocator *ioasid_allocator;
A more explicit name may be chosen. If I understand correctly that's the
active_custom_allocator
> +
> +static LIST_HEAD(custom_allocators);
> +/*
> + * A flag to track if ioasid default allocator already been used, this will
is already in use?
> + * prevent custom allocator from being used. The reason is that custom allocator
s/The reason is that custom allocator/The reason is that custom allocators
> + * must have unadulterated space to track private data with xarray, there cannot
> + * be a mix been default and custom allocated IOASIDs.
> + */
> +static int default_allocator_used;
> +
> +/**
> + * ioasid_register_allocator - register a custom allocator
> + * @allocator: the custom allocator to be registered
> + *
> + * Custom allocator take precedence over the default xarray based allocator.
> + * Private data associated with the ASID are managed by ASID common code
> + * similar to data stored in xa.
> + *
> + * There can be multiple allocators registered but only one is active. In case
> + * of runtime removal of an custom allocator, the next one is activated based
> + * on the registration ordering.
This last sentence may be moved to the unregister() kerneldoc
> + */
> +int ioasid_register_allocator(struct ioasid_allocator *allocator)
> +{
> +	struct ioasid_allocator *pallocator;
> +	int ret = 0;
> +
> +	if (!allocator)
> +		return -EINVAL;
> +
> +	mutex_lock(&ioasid_allocator_lock);
> +	if (list_empty(&custom_allocators))
> +		ioasid_allocator = allocator;
The fact the first registered custom allocator gets automatically active
was not obvious to me and may deserve a comment.
> +	else {
> +		/* Check if the allocator is already registered */
> +		list_for_each_entry(pallocator, &custom_allocators, list) {
> +			if (pallocator == allocator) {
> +				pr_err("IOASID allocator already exist\n");
s/exist/registered?
> +				ret = -EEXIST;
> +				goto out_unlock;
> +			}
> +		}
> +	}
> +	list_add_tail(&allocator->list, &custom_allocators);
> +
> +out_unlock:
> +	mutex_unlock(&ioasid_allocator_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_allocator);
> +
> +/**
> + * ioasid_unregister_allocator - Remove a custom IOASID allocator
> + * @allocator: the custom allocator to be removed
> + *
> + * Remove an allocator from the list, activate the next allocator in
> + * the order it was  registration.
> + */
> +void ioasid_unregister_allocator(struct ioasid_allocator *allocator)
> +{
> +	if (!allocator)
> +		return;
> +
> +	if (list_empty(&custom_allocators)) {
> +		pr_warn("No custom IOASID allocators active!\n");
s/active/registered?
> +		return;
> +	}
> +
> +	mutex_lock(&ioasid_allocator_lock);
> +	list_del(&allocator->list);
> +	if (list_empty(&custom_allocators)) {
> +		pr_info("No custom IOASID allocators\n");
> +		/*
> +		 * All IOASIDs should have been freed before the last allocator
> +		 * is unregistered.
> +		 */
> +		BUG_ON(!xa_empty(&ioasid_xa));
At this stage it is difficult to assess whether using a BUG_ON() is safe
here. Who is responsible for freeing the IOASIDs?
> +		ioasid_allocator = NULL;
> +	} else if (allocator == ioasid_allocator) {
> +		ioasid_allocator = list_entry(&custom_allocators, struct ioasid_allocator, list);
> +		pr_info("IOASID allocator changed");
> +	}
> +	mutex_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
> +
> +/**
> + * ioasid_set_data - Set private data for an allocated ioasid
> + * @ioasid: the ID to set data
> + * @data:   the private data
> + *
> + * For IOASID that is already allocated, private data can be set
> + * via this API. Future lookup can be done via ioasid_find.
> + */
> +int ioasid_set_data(ioasid_t ioasid, void *data)
> +{
> +	struct ioasid_data *ioasid_data;
> +	int ret = 0;
> +
> +	ioasid_data = xa_load(&ioasid_xa, ioasid);
> +	if (ioasid_data)
> +		ioasid_data->private = data;
> +	else
> +		ret = -ENOENT;
> +
> +	/* getter may use the private data */
> +	synchronize_rcu();
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_data);
> +
>  /**
>   * ioasid_alloc - Allocate an IOASID
>   * @set: the IOASID set
> @@ -31,7 +145,7 @@ static DEFINE_XARRAY_ALLOC(ioasid_xa);
>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>  		      void *private)
>  {
> -	int id = -1;
> +	int id = INVALID_IOASID;
>  	struct ioasid_data *data;
>  
>  	data = kzalloc(sizeof(*data), GFP_KERNEL);
> @@ -40,14 +154,37 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>  
>  	data->set = set;
>  	data->private = private;
> +
> +	/*
> +	 * Use custom allocator if available, otherwise use default.
> +	 * However, if there are active IOASIDs already been allocated by default
> +	 * allocator, custom allocator cannot be used.
> +	 */
> +	if (!default_allocator_used && ioasid_allocator) {
> +		mutex_lock(&ioasid_allocator_lock);
> +		id = ioasid_allocator->alloc(min, max, ioasid_allocator->pdata);
> +		mutex_unlock(&ioasid_allocator_lock);
> +		if (id == INVALID_IOASID) {
> +			pr_err("Failed ASID allocation by custom allocator\n");
> +			goto exit_free;
> +		}
> +		/*
> +		 * Use XA to manage private data also sanitiy check custom> +		 * allocator for duplicates.
s/data also sanitiy check/data, also sanity check
> +		 */
> +		min = id;
> +		max = id + 1;
> +	} else
> +		default_allocator_used = 1;
shouldn't default_allocator_used be protected as well?
> +
>  	if (xa_alloc(&ioasid_xa, &id, data, XA_LIMIT(min, max), GFP_KERNEL)) {
>  		pr_err("Failed to alloc ioasid from %d to %d\n", min, max);
>  		goto exit_free;
>  	}
> -
>  	data->id = id;
wouldn't it be possible to integrate the default io asid allocator as
any custom allocator, ie. implement an alloc callback using xa_alloc.
Then the active io allocator could be either a custom or a default one.
> +
>  exit_free:
> -	if (id < 0) {
> +	if (id < 0 || id == INVALID_IOASID) {
>  		kfree(data);
>  		return INVALID_IOASID;
>  	}
> @@ -59,12 +196,29 @@ EXPORT_SYMBOL_GPL(ioasid_alloc);
>   * ioasid_free - Free an IOASID
>   * @ioasid: the ID to remove
>   */
> -void ioasid_free(ioasid_t ioasid)
> +int ioasid_free(ioasid_t ioasid)
>  {
>  	struct ioasid_data *ioasid_data;
> +	int ret = 0;
> +
> +	if (ioasid_allocator) {
> +		mutex_lock(&ioasid_allocator_lock);
> +		ret = ioasid_allocator->free(ioasid, ioasid_allocator->pdata);
> +		mutex_unlock(&ioasid_allocator_lock);
> +	}
> +	if (ret) {
> +		pr_err("ioasid %d custom allocator free failed\n", ioasid);
> +		return ret;
> +	}
>  
>  	ioasid_data = xa_erase(&ioasid_xa, ioasid);
> +
>  	kfree_rcu(ioasid_data, rcu);
> +
> +	if (xa_empty(&ioasid_xa))
> +		default_allocator_used = 0;
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(ioasid_free);
>  
> @@ -79,7 +233,8 @@ EXPORT_SYMBOL_GPL(ioasid_free);
>   * if @getter returns false, then the object is invalid and NULL is returned.
>   *
>   * If the IOASID has been allocated for this set, return the private pointer
> - * passed to ioasid_alloc. Otherwise return NULL.
> + * passed to ioasid_alloc. Private data can be NULL if not set. Return an error
> + * if the IOASID is not found or not belong to the set.
s/not belong/does not belong
>   */
>  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  		  bool (*getter)(void *))
> @@ -89,11 +244,20 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  
>  	rcu_read_lock();
>  	ioasid_data = xa_load(&ioasid_xa, ioasid);
> -	if (ioasid_data && ioasid_data->set == set) {
> -		priv = ioasid_data->private;
> -		if (getter && !getter(priv))
> -			priv = NULL;
> +	if (!ioasid_data) {
> +		priv = ERR_PTR(-ENOENT);
> +		goto unlock;
> +	}
> +	if (set && ioasid_data->set != set) {
> +		/* data found but does not belong to the set */
> +		priv = ERR_PTR(-EACCES);
> +		goto unlock;
>  	}
> +	/* Now IOASID and its set is verified, we can return the private data */
> +	priv = ioasid_data->private;
> +	if (getter && !getter(priv))
> +		priv = NULL;
> +unlock:
>  	rcu_read_unlock();
>  
>  	return priv;
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f3655a..e773c13 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -5,20 +5,33 @@
>  #define INVALID_IOASID ((ioasid_t)-1)
>  typedef unsigned int ioasid_t;
>  typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void *data);
> +typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
> +typedef int (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
>  
>  struct ioasid_set {
>  	int dummy;
>  };
>  
> +struct ioasid_allocator {
> +	ioasid_alloc_fn_t alloc;
> +	ioasid_free_fn_t free;
> +	void *pdata;
> +	struct list_head list;
> +};
> +
>  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>  
>  #ifdef CONFIG_IOASID
>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>  		      void *private);
> -void ioasid_free(ioasid_t ioasid);
> +int ioasid_free(ioasid_t ioasid);
you need to change the definition for the !CONFIG_IOASID case too
>  
>  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  		  bool (*getter)(void *));
> +int ioasid_register_allocator(struct ioasid_allocator *allocator);
> +void ioasid_unregister_allocator(struct ioasid_allocator *allocator);
> +
> +int ioasid_set_data(ioasid_t ioasid, void *data);
>  
>  #else /* !CONFIG_IOASID */
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
Just to make sure, don't you need to define the new functions if
!CONFIG_IOASID?

Thanks

Eric
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID
  2019-04-23 23:31 ` [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID Jacob Pan
@ 2019-04-25 10:04   ` Auger Eric
       [not found]     ` <20190426140133.6d445315@jacob-builder>
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-25 10:04 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> Make use of generic IOASID code to manage PASID allocation,
> free, and lookup.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/Kconfig       |  1 +
>  drivers/iommu/intel-iommu.c |  9 ++++-----
>  drivers/iommu/intel-pasid.c | 36 ------------------------------------
>  drivers/iommu/intel-svm.c   | 41 ++++++++++++++++++++++++-----------------
>  4 files changed, 29 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 6f07f3b..7f92009 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -204,6 +204,7 @@ config INTEL_IOMMU_SVM
>  	bool "Support for Shared Virtual Memory with Intel IOMMU"
>  	depends on INTEL_IOMMU && X86
>  	select PCI_PASID
> +	select IOASID
>  	select MMU_NOTIFIER
>  	help
>  	  Shared Virtual Memory (SVM) provides a facility for devices
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ec6f22d..785330a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5153,7 +5153,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
>  	domain->auxd_refcnt--;
>  
>  	if (!domain->auxd_refcnt && domain->default_pasid > 0)
> -		intel_pasid_free_id(domain->default_pasid);
> +		ioasid_free(domain->default_pasid);
>  }
>  
>  static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5171,9 +5171,8 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
>  	if (domain->default_pasid <= 0) {
>  		int pasid;
>  
> -		pasid = intel_pasid_alloc_id(domain, PASID_MIN,
> -					     pci_max_pasids(to_pci_dev(dev)),
> -					     GFP_KERNEL);
> +		pasid = ioasid_alloc(NULL, PASID_MIN, pci_max_pasids(to_pci_dev(dev)) - 1,
> +				domain);
>  		if (pasid <= 0) {
ioasid_t is a uint and returns INVALID_IOASID on error. Wouldn't it be
simpler to make ioasid_alloc return an int?
>  			pr_err("Can't allocate default pasid\n");
>  			return -ENODEV;
> @@ -5210,7 +5209,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
>  	spin_unlock(&iommu->lock);
>  	spin_unlock_irqrestore(&device_domain_lock, flags);
>  	if (!domain->auxd_refcnt && domain->default_pasid > 0)
> -		intel_pasid_free_id(domain->default_pasid);
> +		ioasid_free(domain->default_pasid);
>  
>  	return ret;
>  }
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 5b1d3be..d339e8f 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -26,42 +26,6 @@
>   */
>  static DEFINE_SPINLOCK(pasid_lock);
>  u32 intel_pasid_max_id = PASID_MAX;
> -static DEFINE_IDR(pasid_idr);
> -
> -int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp)
> -{
> -	int ret, min, max;
> -
> -	min = max_t(int, start, PASID_MIN);
> -	max = min_t(int, end, intel_pasid_max_id);
> -
> -	WARN_ON(in_interrupt());
> -	idr_preload(gfp);
> -	spin_lock(&pasid_lock);
> -	ret = idr_alloc(&pasid_idr, ptr, min, max, GFP_ATOMIC);
> -	spin_unlock(&pasid_lock);
> -	idr_preload_end();
> -
> -	return ret;
> -}
> -
> -void intel_pasid_free_id(int pasid)
> -{
> -	spin_lock(&pasid_lock);
> -	idr_remove(&pasid_idr, pasid);
> -	spin_unlock(&pasid_lock);
> -}
> -
> -void *intel_pasid_lookup_id(int pasid)
> -{
> -	void *p;
> -
> -	spin_lock(&pasid_lock);
> -	p = idr_find(&pasid_idr, pasid);
> -	spin_unlock(&pasid_lock);
> -
> -	return p;
> -}
>  
>  int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
>  {
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 8f87304..8fff212 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -25,6 +25,7 @@
>  #include <linux/dmar.h>
>  #include <linux/interrupt.h>
>  #include <linux/mm_types.h>
> +#include <linux/ioasid.h>
>  #include <asm/page.h>
>  
>  #include "intel-pasid.h"
> @@ -211,7 +212,9 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(sdev, &svm->devs, list) {
>  		intel_pasid_tear_down_entry(svm->iommu, sdev->dev, svm->pasid);
> -		intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm)> +		/* for emulated iommu, PASID cache invalidation implies IOTLB/DTLB */
> +		if (!cap_caching_mode(svm->iommu->cap))
> +			intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
This change is not documented in the commit message. Isn't it a separate
fix?
>  	}
>  	rcu_read_unlock();
>  
> @@ -332,16 +335,15 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
>  		if (pasid_max > intel_pasid_max_id)
>  			pasid_max = intel_pasid_max_id;
>  
> -		/* Do not use PASID 0 in caching mode (virtualised IOMMU) */
> -		ret = intel_pasid_alloc_id(svm,
> -					   !!cap_caching_mode(iommu->cap),
> -					   pasid_max - 1, GFP_KERNEL);
> -		if (ret < 0) {
> +		/* Do not use PASID 0, reserved for RID to PASID */
> +		svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> +					pasid_max - 1, svm);
the fact the max is not decremented compared to intel_pasid_alloc_id
looks suspicious to me (exclusive to inclusive move). I guess it is a
fix in which case this may be documented in the commit msg?
> +		if (svm->pasid == INVALID_IOASID) {
>  			kfree(svm);
>  			kfree(sdev);
> +			ret = ENOSPC;
-ENOSPC
>  			goto out;
>  		}
> -		svm->pasid = ret;
>  		svm->notifier.ops = &intel_mmuops;
>  		svm->mm = mm;
>  		svm->flags = flags;
> @@ -351,7 +353,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
>  		if (mm) {
>  			ret = mmu_notifier_register(&svm->notifier, mm);
>  			if (ret) {
> -				intel_pasid_free_id(svm->pasid);
> +				ioasid_free(svm->pasid);
>  				kfree(svm);
>  				kfree(sdev);
>  				goto out;
> @@ -367,7 +369,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
>  		if (ret) {
>  			if (mm)
>  				mmu_notifier_unregister(&svm->notifier, mm);
> -			intel_pasid_free_id(svm->pasid);
> +			ioasid_free(svm->pasid);
the ioasid_free returned value never is tested. Is it useful?
>  			kfree(svm);
>  			kfree(sdev);
>  			goto out;
> @@ -400,7 +402,12 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
>  	if (!iommu)
>  		goto out;
>  
> -	svm = intel_pasid_lookup_id(pasid);
> +	svm = ioasid_find(NULL, pasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
>  	if (!svm)
>  		goto out;
>  
> @@ -422,7 +429,7 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
>  				kfree_rcu(sdev, rcu);
>  
>  				if (list_empty(&svm->devs)) {
> -					intel_pasid_free_id(svm->pasid);
> +					ioasid_free(svm->pasid);
>  					if (svm->mm)
>  						mmu_notifier_unregister(&svm->notifier, svm->mm);
>  
> @@ -457,10 +464,11 @@ int intel_svm_is_pasid_valid(struct device *dev, int pasid)
>  	if (!iommu)
>  		goto out;
>  
> -	svm = intel_pasid_lookup_id(pasid);
> -	if (!svm)
> +	svm = ioasid_find(NULL, pasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
>  		goto out;
> -
> +	}
>  	/* init_mm is used in this case */
>  	if (!svm->mm)
>  		ret = 1;
> @@ -567,13 +575,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>  
>  		if (!svm || svm->pasid != req->pasid) {
>  			rcu_read_lock();
> -			svm = intel_pasid_lookup_id(req->pasid);
> +			svm = ioasid_find(NULL, req->pasid, NULL);
>  			/* It *can't* go away, because the driver is not permitted
>  			 * to unbind the mm while any page faults are outstanding.
>  			 * So we only need RCU to protect the internal idr code. */
>  			rcu_read_unlock();
> -
> -			if (!svm) {
> +			if (IS_ERR(svm) || !svm) {
>  				pr_err("%s: Page request for invalid PASID %d: %08llx %08llx\n",
>  				       iommu->name, req->pasid, ((unsigned long long *)req)[0],
>  				       ((unsigned long long *)req)[1]);
> 

Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-23 23:31 ` [PATCH v2 06/19] drivers core: Add I/O ASID allocator Jacob Pan
  2019-04-24  6:19   ` Christoph Hellwig
@ 2019-04-25 10:17   ` Auger Eric
  2019-04-25 10:41     ` Jean-Philippe Brucker
  1 sibling, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-25 10:17 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jean-Philippe, Jacob,
On 4/24/19 1:31 AM, Jacob Pan wrote:
> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> Some devices might support multiple DMA address spaces, in particular
> those that have the PCI PASID feature. PASID (Process Address Space ID)
> allows to share process address spaces with devices (SVA), partition a
> device into VM-assignable entities (VFIO mdev) or simply provide
> multiple DMA address space to kernel drivers. Add a global PASID
> allocator usable by different drivers at the same time. Name it I/O ASID
> to avoid confusion with ASIDs allocated by arch code, which are usually
> a separate ID space.
> 
> The IOASID space is global. Each device can have its own PASID space,
> but by convention the IOMMU ended up having a global PASID space, so
> that with SVA, each mm_struct is associated to a single PASID.
> 
> The allocator doesn't really belong in drivers/iommu because some
> drivers would like to allocate PASIDs for devices that aren't managed by
> an IOMMU, using the same ID space as IOMMU. It doesn't really belong in
> drivers/pci either since platform device also support PASID. Add the
> allocator in drivers/base.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/base/Kconfig   |   6 +++
>  drivers/base/Makefile  |   1 +
>  drivers/base/ioasid.c  | 106 +++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/ioasid.h |  40 +++++++++++++++++++
>  4 files changed, 153 insertions(+)
>  create mode 100644 drivers/base/ioasid.c
>  create mode 100644 include/linux/ioasid.h
> 
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 059700e..47c1348 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -182,6 +182,12 @@ config DMA_SHARED_BUFFER
>  	  APIs extension; the file's descriptor can then be passed on to other
>  	  driver.
>  
> +config IOASID
> +	bool
> +	help
> +	  Enable the I/O Address Space ID allocator. A single ID space shared
> +	  between different users.
> +
>  config DMA_FENCE_TRACE
>  	bool "Enable verbose DMA_FENCE_TRACE messages"
>  	depends on DMA_SHARED_BUFFER
> diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> index 1574520..aafa2ac 100644
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -23,6 +23,7 @@ obj-$(CONFIG_PINCTRL) += pinctrl.o
>  obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
>  obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
>  obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
> +obj-$(CONFIG_IOASID) += ioasid.o
>  
>  obj-y			+= test/
>  
> diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
> new file mode 100644
> index 0000000..cf122b2
> --- /dev/null
> +++ b/drivers/base/ioasid.c
> @@ -0,0 +1,106 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * I/O Address Space ID allocator. There is one global IOASID space, split into
> + * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
> + * free IOASIDs with ioasid_alloc and ioasid_free.
> + */
> +#include <linux/idr.h>
> +#include <linux/ioasid.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +struct ioasid_data {
> +	ioasid_t id;
> +	struct ioasid_set *set;
> +	void *private;
> +	struct rcu_head rcu;
> +};
> +
> +static DEFINE_IDR(ioasid_idr);
> +
> +/**
> + * ioasid_alloc - Allocate an IOASID
> + * @set: the IOASID set
> + * @min: the minimum ID (inclusive)
> + * @max: the maximum ID (exclusive)
> + * @private: data private to the caller
> + *
> + * Allocate an ID between @min and @max (or %0 and %INT_MAX). Return the
I would remove "(or %0 and %INT_MAX)".
> + * allocated ID on success, or INVALID_IOASID on failure. The @private pointer
> + * is stored internally and can be retrieved with ioasid_find().
> + */
> +ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> +		      void *private)
> +{
> +	int id = -1;
> +	struct ioasid_data *data;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return INVALID_IOASID;
> +
> +	data->set = set;
> +	data->private = private;
> +
> +	idr_preload(GFP_KERNEL);
> +	idr_lock(&ioasid_idr);
> +	data->id = id = idr_alloc(&ioasid_idr, data, min, max, GFP_ATOMIC);
> +	idr_unlock(&ioasid_idr);
> +	idr_preload_end();
> +
> +	if (id < 0) {
> +		kfree(data);
> +		return INVALID_IOASID;
> +	}
> +	return id;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_alloc);
> +
> +/**
> + * ioasid_free - Free an IOASID
> + * @ioasid: the ID to remove
> + */
> +void ioasid_free(ioasid_t ioasid)
> +{
> +	struct ioasid_data *ioasid_data;
> +
> +	idr_lock(&ioasid_idr);
> +	ioasid_data = idr_remove(&ioasid_idr, ioasid);
> +	idr_unlock(&ioasid_idr);
> +
> +	if (ioasid_data)
> +		kfree_rcu(ioasid_data, rcu);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_free);
> +
> +/**
> + * ioasid_find - Find IOASID data
> + * @set: the IOASID set
> + * @ioasid: the IOASID to find
> + * @getter: function to call on the found object
> + *
> + * The optional getter function allows to take a reference to the found object
> + * under the rcu lock. The function can also check if the object is still valid:
> + * if @getter returns false, then the object is invalid and NULL is returned.
> + *
> + * If the IOASID has been allocated for this set, return the private pointer
> + * passed to ioasid_alloc. Otherwise return NULL.
> + */
> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> +		  bool (*getter)(void *))
> +{
> +	void *priv = NULL;
> +	struct ioasid_data *ioasid_data;
> +
> +	rcu_read_lock();
> +	ioasid_data = idr_find(&ioasid_idr, ioasid);
> +	if (ioasid_data && ioasid_data->set == set) {
> +		priv = ioasid_data->private;
> +		if (getter && !getter(priv))
> +			priv = NULL;
> +	}
> +	rcu_read_unlock();
> +
> +	return priv;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_find);
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> new file mode 100644
> index 0000000..6f3655a
> --- /dev/null
> +++ b/include/linux/ioasid.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_IOASID_H
> +#define __LINUX_IOASID_H
> +
> +#define INVALID_IOASID ((ioasid_t)-1)
> +typedef unsigned int ioasid_t;
> +typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void *data);
I don't see it used in this series.
> +
> +struct ioasid_set {
> +	int dummy;
> +};
> +
> +#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> +
> +#ifdef CONFIG_IOASID
> +ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> +		      void *private);
> +void ioasid_free(ioasid_t ioasid);
> +
> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> +		  bool (*getter)(void *));
> +
> +#else /* !CONFIG_IOASID */
> +static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> +				    ioasid_t max, void *private)
> +{
> +	return INVALID_IOASID;
> +}
> +
> +static inline void ioasid_free(ioasid_t ioasid)
> +{
> +}
> +
> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> +				bool (*getter)(void *))
> +{
> +	return NULL;
> +}
> +#endif /* CONFIG_IOASID */
> +#endif /* __LINUX_IOASID_H */
> 

Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-25 10:17   ` Auger Eric
@ 2019-04-25 10:41     ` Jean-Philippe Brucker
  2019-04-30 20:24       ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-25 10:41 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson
  Cc: Tian, Kevin, Raj Ashok, Andriy Shevchenko

On 25/04/2019 11:17, Auger Eric wrote:
>> +/**
>> + * ioasid_alloc - Allocate an IOASID
>> + * @set: the IOASID set
>> + * @min: the minimum ID (inclusive)
>> + * @max: the maximum ID (exclusive)
>> + * @private: data private to the caller
>> + *
>> + * Allocate an ID between @min and @max (or %0 and %INT_MAX). Return the
> I would remove "(or %0 and %INT_MAX)".

Agreed, those where the default values of idr, but the xarray doesn't
define a default max value. By the way, I do think squashing patches 6
and 7 would be better (keeping my SOB but you can change the author).

>> +typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void *data);
> I don't see it used in this series.

There used to be a "ioasid_for_each()", which isn't needed by anyone at
the moment. This can be removed.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 02/19] iommu: introduce device fault data
  2019-04-23 23:31 ` [PATCH v2 02/19] iommu: introduce device fault data Jacob Pan
@ 2019-04-25 12:46   ` Jean-Philippe Brucker
  2019-04-25 13:21     ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-25 12:46 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Eric Auger, Alex Williamson
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Yi L

On 24/04/2019 00:31, Jacob Pan wrote:
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index 0000000..edcc0dd
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,115 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +#include <linux/types.h>
> +
> +#define IOMMU_FAULT_PERM_WRITE	(1 << 0) /* write */
> +#define IOMMU_FAULT_PERM_EXEC	(1 << 1) /* exec */
> +#define IOMMU_FAULT_PERM_PRIV	(1 << 2) /* privileged */

Could we add IOMMU_FAULT_PERM_READ back? The PRI Page Request has both R
and W fields, and R=W=0 encodes the PASID Stop Markers. Even though the
IOMMU drivers currently filter out the Stop Markers, we may want to
inject them into guests at some point in the future, which wouldn't be
possible with the current API. We could add a
IOMMU_FAULT_PAGE_REQUEST_PERM_VALID bit instead, but I still find it
weird to denote the validity of a bitfield using a separate bit.

Given that three different series now rely on this, how about we send
the fault patches separately for v5.2? I pushed the recoverable fault
support applied on top of this, with the PERM_READ bit and cleaned up
kernel doc, to git://linux-arm.org/linux-jpb.git sva/api

Thanks,
Jean

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 02/19] iommu: introduce device fault data
  2019-04-25 12:46   ` Jean-Philippe Brucker
@ 2019-04-25 13:21     ` Auger Eric
  2019-04-25 14:33       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-25 13:21 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Yi L

Hi Jean-Philippe,

On 4/25/19 2:46 PM, Jean-Philippe Brucker wrote:
> On 24/04/2019 00:31, Jacob Pan wrote:
>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>> new file mode 100644
>> index 0000000..edcc0dd
>> --- /dev/null
>> +++ b/include/uapi/linux/iommu.h
>> @@ -0,0 +1,115 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * IOMMU user API definitions
>> + */
>> +
>> +#ifndef _UAPI_IOMMU_H
>> +#define _UAPI_IOMMU_H
>> +
>> +#include <linux/types.h>
>> +
>> +#define IOMMU_FAULT_PERM_WRITE	(1 << 0) /* write */
>> +#define IOMMU_FAULT_PERM_EXEC	(1 << 1) /* exec */
>> +#define IOMMU_FAULT_PERM_PRIV	(1 << 2) /* privileged */
> 
> Could we add IOMMU_FAULT_PERM_READ back? The PRI Page Request has both R
> and W fields, and R=W=0 encodes the PASID Stop Markers. Even though the
> IOMMU drivers currently filter out the Stop Markers, we may want to
> inject them into guests at some point in the future, which wouldn't be
> possible with the current API.

OK for me.

 We could add a
> IOMMU_FAULT_PAGE_REQUEST_PERM_VALID bit instead, but I still find it
> weird to denote the validity of a bitfield using a separate bit.
> 
> Given that three different series now rely on this, how about we send
> the fault patches separately for v5.2? I pushed the recoverable fault
> support applied on top of this, with the PERM_READ bit and cleaned up
> kernel doc, to git://linux-arm.org/linux-jpb.git sva/api

my only concern is is it likely to be upstreamed without any actual
user? In the positive, of course, I don't have any objection.

Thanks

Eric
> 
> Thanks,
> Jean
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 02/19] iommu: introduce device fault data
  2019-04-25 13:21     ` Auger Eric
@ 2019-04-25 14:33       ` Jean-Philippe Brucker
  2019-04-25 18:07         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-25 14:33 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Alex Williamson
  Cc: Yi L, Tian, Kevin, Raj Ashok, Andriy Shevchenko

On 25/04/2019 14:21, Auger Eric wrote:
  We could add a
>> IOMMU_FAULT_PAGE_REQUEST_PERM_VALID bit instead, but I still find it
>> weird to denote the validity of a bitfield using a separate bit.
>>
>> Given that three different series now rely on this, how about we send
>> the fault patches separately for v5.2?

Sorry I meant v5.3 - after the merge window

>> I pushed the recoverable fault
>> support applied on top of this, with the PERM_READ bit and cleaned up
>> kernel doc, to git://linux-arm.org/linux-jpb.git sva/api
> 
> my only concern is is it likely to be upstreamed without any actual
> user? In the positive, of course, I don't have any objection.

Possibly, I don't think my I/O page fault stuff for SVA is likely to get
in v5.3, it depends on one or two more patch sets. But your nested work
and Jacob's one may be in good shape for next version? I find it
difficult to keep track of the same patches in three different series.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 02/19] iommu: introduce device fault data
  2019-04-25 14:33       ` Jean-Philippe Brucker
@ 2019-04-25 18:07         ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-25 18:07 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Auger Eric, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Yi L, Tian, Kevin, Raj Ashok, Andriy Shevchenko,
	jacob.jun.pan

On Thu, 25 Apr 2019 15:33:17 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 25/04/2019 14:21, Auger Eric wrote:
>   We could add a
> >> IOMMU_FAULT_PAGE_REQUEST_PERM_VALID bit instead, but I still find
> >> it weird to denote the validity of a bitfield using a separate bit.
> >>
> >> Given that three different series now rely on this, how about we
> >> send the fault patches separately for v5.2?  
> 
> Sorry I meant v5.3 - after the merge window
> 
> >> I pushed the recoverable fault
> >> support applied on top of this, with the PERM_READ bit and cleaned
> >> up kernel doc, to git://linux-arm.org/linux-jpb.git sva/api  
> > 
Sounds good to me. We need th READ perm. I will pick the fault reporting
patches from this tree for my next rev. My plan is to add PRQ support
for vSVA after the current series.
> > my only concern is is it likely to be upstreamed without any actual
> > user? In the positive, of course, I don't have any objection.  
> 
> Possibly, I don't think my I/O page fault stuff for SVA is likely to
> get in v5.3, it depends on one or two more patch sets. But your
> nested work and Jacob's one may be in good shape for next version? I
> find it difficult to keep track of the same patches in three
> different series.
Same here, hard to track especially for minor tweaks. I am working
towards the next version for vSVA page fault. Then I will look into
converting VT-d native IO page fault to yours.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-24  6:19   ` Christoph Hellwig
@ 2019-04-25 18:19     ` Jacob Pan
  2019-04-26 11:47       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-25 18:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Eric Auger,
	Alex Williamson, Jean-Philippe Brucker, Yi Liu, Tian, Kevin,
	Raj Ashok, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

Hi Christoph,

On Tue, 23 Apr 2019 23:19:03 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Apr 23, 2019 at 04:31:06PM -0700, Jacob Pan wrote:
> > The allocator doesn't really belong in drivers/iommu because some
> > drivers would like to allocate PASIDs for devices that aren't
> > managed by an IOMMU, using the same ID space as IOMMU. It doesn't
> > really belong in drivers/pci either since platform device also
> > support PASID. Add the allocator in drivers/base.  
> 
> I'd still add it to drivers/iommu, just selectable separately from the
> core iommu code..
Perhaps I misunderstood. If a driver wants to use IOASIDs w/o iommu
subsystem even turned on, how could selecting from the core iommu code
help? Could you elaborate on "selectable"?

From VT-d's perspective, PASIDs are only used with IOMMU on. Jean
knows other use cases.

Jacob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 08/19] ioasid: Add custom IOASID allocator
  2019-04-25 10:03   ` Auger Eric
@ 2019-04-25 21:29     ` Jacob Pan
  2019-04-26  9:06       ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-25 21:29 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

Hi Eric,

Thanks for the review.

On Thu, 25 Apr 2019 12:03:42 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > Sometimes, IOASID allocation must be handled by platform specific
> > code. The use cases are guest vIOMMU and pvIOMMU where IOASIDs need
> > to be allocated by the host via enlightened or paravirt interfaces.
> > 
> > This patch adds an extension to the IOASID allocator APIs such that
> > platform drivers can register a custom allocator, possibly at boot
> > time, to take over the allocation. Xarray is still used for tracking
> > and searching purposes internal to the IOASID code. Private data of
> > an IOASID can also be set after the allocation.
> > 
> > There can be multiple custom allocators registered but only one is
> > used at a time. In case of hot removal of devices that provides the
> > allocator, all IOASIDs must be freed prior to unregistering the
> > allocator. Default XArray based allocator cannot be mixed with
> > custom allocators, i.e. custom allocators will not be used if there
> > are outstanding IOASIDs allocated by the default XA allocator.  
> 
> What's the exact use case behind allowing several custom IOASID
> allocators to be registered?
It is mainly for supporting multiple PCI segments thus multiple
vIOMMUs. Even though, all allocators will end up calling the host to
allocate PASIDs. QEMU does not support multiple PCI segments/domains
afaik but others might.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/base/ioasid.c  | 182
> > ++++++++++++++++++++++++++++++++++++++++++++++---
> > include/linux/ioasid.h |  15 +++- 2 files changed, 187
> > insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
> > index c4012aa..5cb36a4 100644
> > --- a/drivers/base/ioasid.c
> > +++ b/drivers/base/ioasid.c
> > @@ -17,6 +17,120 @@ struct ioasid_data {
> >  };
> >  
> >  static DEFINE_XARRAY_ALLOC(ioasid_xa);
> > +static DEFINE_MUTEX(ioasid_allocator_lock);
> > +static struct ioasid_allocator *ioasid_allocator;  
> A more explicit name may be chosen. If I understand correctly that's
> the active_custom_allocator
Yes, more clear this way.

> > +
> > +static LIST_HEAD(custom_allocators);
> > +/*
> > + * A flag to track if ioasid default allocator already been used,
> > this will  
> is already in use?
> > + * prevent custom allocator from being used. The reason is that
> > custom allocator  
> s/The reason is that custom allocator/The reason is that custom
> allocators
> > + * must have unadulterated space to track private data with
> > xarray, there cannot
> > + * be a mix been default and custom allocated IOASIDs.
> > + */
> > +static int default_allocator_used;
> > +
> > +/**
> > + * ioasid_register_allocator - register a custom allocator
> > + * @allocator: the custom allocator to be registered
> > + *
> > + * Custom allocator take precedence over the default xarray based
> > allocator.
> > + * Private data associated with the ASID are managed by ASID
> > common code
> > + * similar to data stored in xa.
> > + *
> > + * There can be multiple allocators registered but only one is
> > active. In case
> > + * of runtime removal of an custom allocator, the next one is
> > activated based
> > + * on the registration ordering.  
> This last sentence may be moved to the unregister() kerneldoc
> > + */
> > +int ioasid_register_allocator(struct ioasid_allocator *allocator)
> > +{
> > +	struct ioasid_allocator *pallocator;
> > +	int ret = 0;
> > +
> > +	if (!allocator)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&ioasid_allocator_lock);
> > +	if (list_empty(&custom_allocators))
> > +		ioasid_allocator = allocator;  
> The fact the first registered custom allocator gets automatically
> active was not obvious to me and may deserve a comment.
Will do. I will add:
"No particular preference since all custom allocators end up calling
the host to allocate IOASIDs. We activate the first allocator and keep
the later ones in a list in case the first one gets removed due to
hotplug."

> > +	else {
> > +		/* Check if the allocator is already registered */
> > +		list_for_each_entry(pallocator,
> > &custom_allocators, list) {
> > +			if (pallocator == allocator) {
> > +				pr_err("IOASID allocator already
> > exist\n");  
> s/exist/registered?
make sense.
> > +				ret = -EEXIST;
> > +				goto out_unlock;
> > +			}
> > +		}
> > +	}
> > +	list_add_tail(&allocator->list, &custom_allocators);
> > +
> > +out_unlock:
> > +	mutex_unlock(&ioasid_allocator_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_register_allocator);
> > +
> > +/**
> > + * ioasid_unregister_allocator - Remove a custom IOASID allocator
> > + * @allocator: the custom allocator to be removed
> > + *
> > + * Remove an allocator from the list, activate the next allocator
> > in
> > + * the order it was  registration.
> > + */
> > +void ioasid_unregister_allocator(struct ioasid_allocator
> > *allocator) +{
> > +	if (!allocator)
> > +		return;
> > +
> > +	if (list_empty(&custom_allocators)) {
> > +		pr_warn("No custom IOASID allocators active!\n");  
> s/active/registered?
> > +		return;
> > +	}
> > +
> > +	mutex_lock(&ioasid_allocator_lock);
> > +	list_del(&allocator->list);
> > +	if (list_empty(&custom_allocators)) {
> > +		pr_info("No custom IOASID allocators\n");
> > +		/*
> > +		 * All IOASIDs should have been freed before the
> > last allocator
> > +		 * is unregistered.
> > +		 */
> > +		BUG_ON(!xa_empty(&ioasid_xa));  
> At this stage it is difficult to assess whether using a BUG_ON() is
> safe here. Who is responsible for freeing the IOASIDs?
Who ever allocates IOASIDs are responsible for freeing. This could be
the IOMMU driver running in the guest. In the very unlikely scenario
below:
1. vIOMMU1 register a custom allocator1
2. vIOMMU2 register a custom allocator2
3. sva_bind() called to bind dev under vIOMMU1, use allocator1 to
allocate ioasid1.
4. vIOMMU 1 hot removed
5. vIOMMU 2 hot removed
BUG_ON() hits because sva_unbind was not called on ioasid1. So even if
we free ioasid1 after BUG_ON, it does not undo the damage.

> > +		ioasid_allocator = NULL;
> > +	} else if (allocator == ioasid_allocator) {
> > +		ioasid_allocator = list_entry(&custom_allocators,
> > struct ioasid_allocator, list);
> > +		pr_info("IOASID allocator changed");
> > +	}
> > +	mutex_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
> > +
> > +/**
> > + * ioasid_set_data - Set private data for an allocated ioasid
> > + * @ioasid: the ID to set data
> > + * @data:   the private data
> > + *
> > + * For IOASID that is already allocated, private data can be set
> > + * via this API. Future lookup can be done via ioasid_find.
> > + */
> > +int ioasid_set_data(ioasid_t ioasid, void *data)
> > +{
> > +	struct ioasid_data *ioasid_data;
> > +	int ret = 0;
> > +
> > +	ioasid_data = xa_load(&ioasid_xa, ioasid);
> > +	if (ioasid_data)
> > +		ioasid_data->private = data;
> > +	else
> > +		ret = -ENOENT;
> > +
> > +	/* getter may use the private data */
> > +	synchronize_rcu();
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_data);
> > +
> >  /**
> >   * ioasid_alloc - Allocate an IOASID
> >   * @set: the IOASID set
> > @@ -31,7 +145,7 @@ static DEFINE_XARRAY_ALLOC(ioasid_xa);
> >  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private)
> >  {
> > -	int id = -1;
> > +	int id = INVALID_IOASID;
> >  	struct ioasid_data *data;
> >  
> >  	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > @@ -40,14 +154,37 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, 
> >  	data->set = set;
> >  	data->private = private;
> > +
> > +	/*
> > +	 * Use custom allocator if available, otherwise use
> > default.
> > +	 * However, if there are active IOASIDs already been
> > allocated by default
> > +	 * allocator, custom allocator cannot be used.
> > +	 */
> > +	if (!default_allocator_used && ioasid_allocator) {
> > +		mutex_lock(&ioasid_allocator_lock);
> > +		id = ioasid_allocator->alloc(min, max,
> > ioasid_allocator->pdata);
> > +		mutex_unlock(&ioasid_allocator_lock);
> > +		if (id == INVALID_IOASID) {
> > +			pr_err("Failed ASID allocation by custom
> > allocator\n");
> > +			goto exit_free;
> > +		}
> > +		/*
> > +		 * Use XA to manage private data also sanitiy
> > check custom> +		 * allocator for duplicates.  
> s/data also sanitiy check/data, also sanity check
> > +		 */
> > +		min = id;
> > +		max = id + 1;
> > +	} else
> > +		default_allocator_used = 1;  
> shouldn't default_allocator_used be protected as well?
> > +
> >  	if (xa_alloc(&ioasid_xa, &id, data, XA_LIMIT(min, max),
> > GFP_KERNEL)) { pr_err("Failed to alloc ioasid from %d to %d\n",
> > min, max); goto exit_free;
> >  	}
> > -
> >  	data->id = id;  
> wouldn't it be possible to integrate the default io asid allocator as
> any custom allocator, ie. implement an alloc callback using xa_alloc.
> Then the active io allocator could be either a custom or a default
> one.
That is an interesting idea. I think it is possible.
But since default xa allocator is internal to ioasid infrastructure,
why implement it as a callback?

> > +
> >  exit_free:
> > -	if (id < 0) {
> > +	if (id < 0 || id == INVALID_IOASID) {
> >  		kfree(data);
> >  		return INVALID_IOASID;
> >  	}
> > @@ -59,12 +196,29 @@ EXPORT_SYMBOL_GPL(ioasid_alloc);
> >   * ioasid_free - Free an IOASID
> >   * @ioasid: the ID to remove
> >   */
> > -void ioasid_free(ioasid_t ioasid)
> > +int ioasid_free(ioasid_t ioasid)
> >  {
> >  	struct ioasid_data *ioasid_data;
> > +	int ret = 0;
> > +
> > +	if (ioasid_allocator) {
> > +		mutex_lock(&ioasid_allocator_lock);
> > +		ret = ioasid_allocator->free(ioasid,
> > ioasid_allocator->pdata);
> > +		mutex_unlock(&ioasid_allocator_lock);
> > +	}
> > +	if (ret) {
> > +		pr_err("ioasid %d custom allocator free failed\n",
> > ioasid);
> > +		return ret;
> > +	}
> >  
> >  	ioasid_data = xa_erase(&ioasid_xa, ioasid);
> > +
> >  	kfree_rcu(ioasid_data, rcu);
> > +
> > +	if (xa_empty(&ioasid_xa))
> > +		default_allocator_used = 0;
> > +
> > +	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(ioasid_free);
> >  
> > @@ -79,7 +233,8 @@ EXPORT_SYMBOL_GPL(ioasid_free);
> >   * if @getter returns false, then the object is invalid and NULL
> > is returned. *
> >   * If the IOASID has been allocated for this set, return the
> > private pointer
> > - * passed to ioasid_alloc. Otherwise return NULL.
> > + * passed to ioasid_alloc. Private data can be NULL if not set.
> > Return an error
> > + * if the IOASID is not found or not belong to the set.  
> s/not belong/does not belong
> >   */
> >  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> >  		  bool (*getter)(void *))
> > @@ -89,11 +244,20 @@ void *ioasid_find(struct ioasid_set *set,
> > ioasid_t ioasid, 
> >  	rcu_read_lock();
> >  	ioasid_data = xa_load(&ioasid_xa, ioasid);
> > -	if (ioasid_data && ioasid_data->set == set) {
> > -		priv = ioasid_data->private;
> > -		if (getter && !getter(priv))
> > -			priv = NULL;
> > +	if (!ioasid_data) {
> > +		priv = ERR_PTR(-ENOENT);
> > +		goto unlock;
> > +	}
> > +	if (set && ioasid_data->set != set) {
> > +		/* data found but does not belong to the set */
> > +		priv = ERR_PTR(-EACCES);
> > +		goto unlock;
> >  	}
> > +	/* Now IOASID and its set is verified, we can return the
> > private data */
> > +	priv = ioasid_data->private;
> > +	if (getter && !getter(priv))
> > +		priv = NULL;
> > +unlock:
> >  	rcu_read_unlock();
> >  
> >  	return priv;
> > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > index 6f3655a..e773c13 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -5,20 +5,33 @@
> >  #define INVALID_IOASID ((ioasid_t)-1)
> >  typedef unsigned int ioasid_t;
> >  typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void
> > *data); +typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min,
> > ioasid_t max, void *data); +typedef int
> > (*ioasid_free_fn_t)(ioasid_t ioasid, void *data); 
> >  struct ioasid_set {
> >  	int dummy;
> >  };
> >  
> > +struct ioasid_allocator {
> > +	ioasid_alloc_fn_t alloc;
> > +	ioasid_free_fn_t free;
> > +	void *pdata;
> > +	struct list_head list;
> > +};
> > +
> >  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> >  
> >  #ifdef CONFIG_IOASID
> >  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private);
> > -void ioasid_free(ioasid_t ioasid);
> > +int ioasid_free(ioasid_t ioasid);  
> you need to change the definition for the !CONFIG_IOASID case too
Good catch! I am thinking there is no need to check return value of
free (as you pointed out in other comments).

> >  
> >  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> >  		  bool (*getter)(void *));
> > +int ioasid_register_allocator(struct ioasid_allocator *allocator);
> > +void ioasid_unregister_allocator(struct ioasid_allocator
> > *allocator); +
> > +int ioasid_set_data(ioasid_t ioasid, void *data);
> >  
> >  #else /* !CONFIG_IOASID */
> >  static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min,  
> Just to make sure, don't you need to define the new functions if
> !CONFIG_IOASID?
> 
Right, Thanks!

> Thanks
> 
> Eric
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-25  7:40       ` Auger Eric
@ 2019-04-25 23:01         ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-25 23:01 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker, Tian, Kevin, Raj, Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Thu, 25 Apr 2019 09:40:31 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Liu,
> 
> On 4/25/19 9:12 AM, Liu, Yi L wrote:
> > Hi Eric,
> >   
> >> From: Auger Eric [mailto:eric.auger@redhat.com]
> >> Sent: Thursday, April 25, 2019 1:28 AM
> >> To: Jacob Pan <jacob.jun.pan@linux.intel.com>;
> >> iommu@lists.linux-foundation.org; Subject: Re: [PATCH v2 09/19]
> >> iommu/vt-d: Enlightened PASID allocation
> >>
> >> Hi Jacob,
> >>
> >> On 4/24/19 1:31 AM, Jacob Pan wrote:  
> >>> From: Lu Baolu <baolu.lu@linux.intel.com>
> >>>
> >>> If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the
> >>> IOMMU driver should rely on the emulation software to allocate
> >>> and free PASID IDs.  
> >> Do we make the decision depending on the CM or depending on the
> >> VCCAP_REG?
> >>
> >> VCCAP_REG description says:
> >>
> >> If Set, software must use Virtual Command Register interface to
> >> allocate and free PASIDs.  
> > 
> > The answer is it depends on the ECAP.VCS and then the PASID
> > allocation bit in VCCAP_REG. But VCS bit implies the iommu is a
> > software implementation (vIOMMU) of vt-d architecture. Pls refer to
> > the descriptions of "Virtual Command Support" in vt-d 3.0 spec.
> > 
> > "Hardware implementations of this architecture report a value of 0
> > in this field. Software implementations (emulation) of this
> > architecture may report VCS=1."  
> 
> OK I understand. But strictly speaking a vIOMMU may not implement CM.
> But that's nitpicking ;-)
> 
CAP.CM (caching mode) and ECAP.VCS(virtual command support) are
separate. I think we are mixing the two here since both are sufficient
condition to indicate whether we are running in a guest.
> Thanks
> 
> Eric
> > 
> > Thanks,
> > Yi Liu
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-24 17:27   ` Auger Eric
  2019-04-25  7:12     ` Liu, Yi L
@ 2019-04-25 23:40     ` Jacob Pan
  2019-04-26  7:24       ` Auger Eric
  1 sibling, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-25 23:40 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Wed, 24 Apr 2019 19:27:52 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> > 
> > If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the
> > IOMMU driver should rely on the emulation software to allocate
> > and free PASID IDs.  
> Do we make the decision depending on the CM or depending on the
> VCCAP_REG?
> 
> VCCAP_REG description says:
> 
> If Set, software must use Virtual Command Register interface to
> allocate and free PASIDs.
> 
>  The Intel vt-d spec revision 3.0 defines a
> > register set to support this. This includes a capability register,
> > a virtual command register and a virtual response register. Refer
> > to section 10.4.42, 10.4.43, 10.4.44 for more information.
> > 
> > This patch adds the enlightened PASID allocation/free interfaces  
> For mu curiosity why is it called "enlightened"?
I don't know the origin but "enlightened" means guest is tipped with
information that it is not running on real HW.

> > via the virtual command register.
> > 
> > Cc: Ashok Raj <ashok.raj@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-pasid.c | 70
> > +++++++++++++++++++++++++++++++++++++++++++++
> > drivers/iommu/intel-pasid.h | 13 ++++++++-
> > include/linux/intel-iommu.h |  2 ++ 3 files changed, 84
> > insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-pasid.c
> > b/drivers/iommu/intel-pasid.c index 03b12d2..5b1d3be 100644
> > --- a/drivers/iommu/intel-pasid.c
> > +++ b/drivers/iommu/intel-pasid.c
> > @@ -63,6 +63,76 @@ void *intel_pasid_lookup_id(int pasid)
> >  	return p;
> >  }
> >  
> > +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int
> > *pasid) +{
> > +	u64 res;
> > +	u64 cap;
> > +	u8 err_code;
> > +	unsigned long flags;
> > +	int ret = 0;
> > +
> > +	if (!ecap_vcs(iommu->ecap)) {
> > +		pr_warn("IOMMU: %s: Hardware doesn't support
> > virtual command\n",
> > +			iommu->name);  
> nit: other pr_* messages don't have the "IOMMU: %s:" prefix.
Are you suggesting just use the prefix defined in pr_fmt? I guess i can
remove "IOMMU" if Allen is OK with it :).

> > +		return -ENODEV;
> > +	}
> > +
> > +	cap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
> > +	if (!(cap & DMA_VCS_PAS)) {
> > +		pr_warn("IOMMU: %s: Emulation software doesn't
> > support PASID allocation\n",
> > +			iommu->name);
> > +		return -ENODEV;
> > +	}
> > +
> > +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> > +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
> > +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> > +		      !(res & VCMD_VRSP_IP), res);
> > +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> > +
> > +	err_code = VCMD_VRSP_EC(res);
> > +	switch (err_code) {
> > +	case VCMD_VRSP_EC_SUCCESS:
> > +		*pasid = VCMD_VRSP_RESULE(res);
> > +		break;
> > +	case VCMD_VRSP_EC_UNAVAIL:
> > +		pr_info("IOMMU: %s: No PASID available\n",
> > iommu->name);
> > +		ret = -ENOMEM;
> > +		break;
> > +	default:
> > +		ret = -ENODEV;
> > +		pr_warn("IOMMU: %s: Unkonwn error code %d\n",  
> unknown
> > +			iommu->name, err_code);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
> > +{
> > +	u64 res;
> > +	u8 err_code;
> > +	unsigned long flags;  
> Shall we check as well the cap is set?
yes, good point.

> > +
> > +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
> > +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, (pasid << 8) |
> > VCMD_CMD_FREE);
> > +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> > +		      !(res & VCMD_VRSP_IP), res);
> > +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> > +
> > +	err_code = VCMD_VRSP_EC(res);
> > +	switch (err_code) {
> > +	case VCMD_VRSP_EC_SUCCESS:
> > +		break;
> > +	case VCMD_VRSP_EC_INVAL:
> > +		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
> > +		break;
> > +	default:
> > +		pr_warn("IOMMU: %s: Unkonwn error code %d\n",  
> unknown
> > +			iommu->name, err_code);
> > +	}
> > +}
> > +
> >  /*
> >   * Per device pasid table management:
> >   */
> > diff --git a/drivers/iommu/intel-pasid.h
> > b/drivers/iommu/intel-pasid.h index 23537b3..0999dfe 100644
> > --- a/drivers/iommu/intel-pasid.h
> > +++ b/drivers/iommu/intel-pasid.h
> > @@ -19,6 +19,16 @@
> >  #define PASID_PDE_SHIFT			6
> >  #define MAX_NR_PASID_BITS		20
> >  
> > +/* Virtual command interface for enlightened pasid management. */
> > +#define VCMD_CMD_ALLOC			0x1
> > +#define VCMD_CMD_FREE			0x2
> > +#define VCMD_VRSP_IP			0x1
> > +#define VCMD_VRSP_EC(e)			(((e) >> 1) & 0x3)  
> s/EC/SC? for Status Code and below
Good, that would match the spec.

> > +#define VCMD_VRSP_EC_SUCCESS		0
> > +#define VCMD_VRSP_EC_UNAVAIL		1  
> nit: _NO_VALID_PASID
Other than SUCCESS, these codes are PASID command specific. I think it
can be called _NO_PASID_AVAIL to match Spec. Fig 10-87 "No PASID
Available"

> > +#define VCMD_VRSP_EC_INVAL		1  
> nit: _INVALID_PASID
Agreed
> > +#define VCMD_VRSP_RESULE(e)		(((e) >> 8) & 0xfffff)  
> nit: s/RESULE/RSLT?
yes. Also the mask bits should be 8 to 63
s/0xfffff/GENMASK_ULL(63, 8))/

> > +
> >  /*
> >   * Domain ID reserved for pasid entries programmed for first-level
> >   * only and pass-through transfer modes.
> > @@ -69,5 +79,6 @@ int intel_pasid_setup_pass_through(struct
> > intel_iommu *iommu, struct device *dev, int pasid);
> >  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> >  				 struct device *dev, int pasid);
> > -
> > +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int
> > *pasid); +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned
> > int pasid); #endif /* __INTEL_PASID_H */
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 6925a18..bff907b 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -173,6 +173,7 @@
> >  #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
> >  #define ecap_flts(e)		(((e) >> 47) & 0x1)
> >  #define ecap_slts(e)		(((e) >> 46) & 0x1)
> > +#define ecap_vcs(e)		(((e) >> 44) & 0x1)
> >  #define ecap_smts(e)		(((e) >> 43) & 0x1)
> >  #define ecap_dit(e)		((e >> 41) & 0x1)
> >  #define ecap_pasid(e)		((e >> 40) & 0x1)
> > @@ -289,6 +290,7 @@
> >  
> >  /* PRS_REG */
> >  #define DMA_PRS_PPR	((u32)1)
> > +#define DMA_VCS_PAS	((u64)1)
> >  
> >  #define IOMMU_WAIT_OP(iommu, offset, op, cond,
> > sts)			\ do
> > {
> > \ 
> 
> Thanks
> 
> Eric
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-25 23:40     ` Jacob Pan
@ 2019-04-26  7:24       ` Auger Eric
  2019-04-26 15:05         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26  7:24 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko



On 4/26/19 1:40 AM, Jacob Pan wrote:
> On Wed, 24 Apr 2019 19:27:52 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>>
>>> If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the
>>> IOMMU driver should rely on the emulation software to allocate
>>> and free PASID IDs.  
>> Do we make the decision depending on the CM or depending on the
>> VCCAP_REG?
>>
>> VCCAP_REG description says:
>>
>> If Set, software must use Virtual Command Register interface to
>> allocate and free PASIDs.
>>
>>  The Intel vt-d spec revision 3.0 defines a
>>> register set to support this. This includes a capability register,
>>> a virtual command register and a virtual response register. Refer
>>> to section 10.4.42, 10.4.43, 10.4.44 for more information.
>>>
>>> This patch adds the enlightened PASID allocation/free interfaces  
>> For mu curiosity why is it called "enlightened"?
> I don't know the origin but "enlightened" means guest is tipped with
> information that it is not running on real HW.
> 
>>> via the virtual command register.
>>>
>>> Cc: Ashok Raj <ashok.raj@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>>> ---
>>>  drivers/iommu/intel-pasid.c | 70
>>> +++++++++++++++++++++++++++++++++++++++++++++
>>> drivers/iommu/intel-pasid.h | 13 ++++++++-
>>> include/linux/intel-iommu.h |  2 ++ 3 files changed, 84
>>> insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/intel-pasid.c
>>> b/drivers/iommu/intel-pasid.c index 03b12d2..5b1d3be 100644
>>> --- a/drivers/iommu/intel-pasid.c
>>> +++ b/drivers/iommu/intel-pasid.c
>>> @@ -63,6 +63,76 @@ void *intel_pasid_lookup_id(int pasid)
>>>  	return p;
>>>  }
>>>  
>>> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int
>>> *pasid) +{
>>> +	u64 res;
>>> +	u64 cap;
>>> +	u8 err_code;
>>> +	unsigned long flags;
>>> +	int ret = 0;
>>> +
>>> +	if (!ecap_vcs(iommu->ecap)) {
>>> +		pr_warn("IOMMU: %s: Hardware doesn't support
>>> virtual command\n",
>>> +			iommu->name);  
>> nit: other pr_* messages don't have the "IOMMU: %s:" prefix.
> Are you suggesting just use the prefix defined in pr_fmt? I guess i can
> remove "IOMMU" if Allen is OK with it :).
I aimed to signal the trace formats are not homogeneous in this .c file
but that's not a big deal. In the feature you may use the "IOMMU: %s"
prefix for all pr_* traces.

> 
>>> +		return -ENODEV;
>>> +	}
>>> +
>>> +	cap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
>>> +	if (!(cap & DMA_VCS_PAS)) {
>>> +		pr_warn("IOMMU: %s: Emulation software doesn't
>>> support PASID allocation\n",
>>> +			iommu->name);
>>> +		return -ENODEV;
>>> +	}
>>> +
>>> +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
>>> +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
>>> +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
>>> +		      !(res & VCMD_VRSP_IP), res);
>>> +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
>>> +
>>> +	err_code = VCMD_VRSP_EC(res);
>>> +	switch (err_code) {
>>> +	case VCMD_VRSP_EC_SUCCESS:
>>> +		*pasid = VCMD_VRSP_RESULE(res);
>>> +		break;
>>> +	case VCMD_VRSP_EC_UNAVAIL:
>>> +		pr_info("IOMMU: %s: No PASID available\n",
>>> iommu->name);
>>> +		ret = -ENOMEM;
>>> +		break;
>>> +	default:
>>> +		ret = -ENODEV;
>>> +		pr_warn("IOMMU: %s: Unkonwn error code %d\n",  
>> unknown
>>> +			iommu->name, err_code);
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
>>> +{
>>> +	u64 res;
>>> +	u8 err_code;
>>> +	unsigned long flags;  
>> Shall we check as well the cap is set?
> yes, good point.
> 
>>> +
>>> +	raw_spin_lock_irqsave(&iommu->register_lock, flags);
>>> +	dmar_writeq(iommu->reg + DMAR_VCMD_REG, (pasid << 8) |
>>> VCMD_CMD_FREE);
>>> +	IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
>>> +		      !(res & VCMD_VRSP_IP), res);
>>> +	raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
>>> +
>>> +	err_code = VCMD_VRSP_EC(res);
>>> +	switch (err_code) {
>>> +	case VCMD_VRSP_EC_SUCCESS:
>>> +		break;
>>> +	case VCMD_VRSP_EC_INVAL:
>>> +		pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
>>> +		break;
>>> +	default:
>>> +		pr_warn("IOMMU: %s: Unkonwn error code %d\n",  
>> unknown
>>> +			iommu->name, err_code);
>>> +	}
>>> +}
>>> +
>>>  /*
>>>   * Per device pasid table management:
>>>   */
>>> diff --git a/drivers/iommu/intel-pasid.h
>>> b/drivers/iommu/intel-pasid.h index 23537b3..0999dfe 100644
>>> --- a/drivers/iommu/intel-pasid.h
>>> +++ b/drivers/iommu/intel-pasid.h
>>> @@ -19,6 +19,16 @@
>>>  #define PASID_PDE_SHIFT			6
>>>  #define MAX_NR_PASID_BITS		20
>>>  
>>> +/* Virtual command interface for enlightened pasid management. */
>>> +#define VCMD_CMD_ALLOC			0x1
>>> +#define VCMD_CMD_FREE			0x2
>>> +#define VCMD_VRSP_IP			0x1
>>> +#define VCMD_VRSP_EC(e)			(((e) >> 1) & 0x3)  
>> s/EC/SC? for Status Code and below
> Good, that would match the spec.
> 
>>> +#define VCMD_VRSP_EC_SUCCESS		0
>>> +#define VCMD_VRSP_EC_UNAVAIL		1  
>> nit: _NO_VALID_PASID
> Other than SUCCESS, these codes are PASID command specific. I think it
> can be called _NO_PASID_AVAIL to match Spec. Fig 10-87 "No PASID
> Available"
yes that's what I meant actually ;-)
> 
>>> +#define VCMD_VRSP_EC_INVAL		1  
>> nit: _INVALID_PASID
> Agreed
>>> +#define VCMD_VRSP_RESULE(e)		(((e) >> 8) & 0xfffff)  
>> nit: s/RESULE/RSLT?
> yes. Also the mask bits should be 8 to 63
> s/0xfffff/GENMASK_ULL(63, 8))/
Well the macro definition looks correct as 63:28 is RsvdZ

Thanks

Eric
> 
>>> +
>>>  /*
>>>   * Domain ID reserved for pasid entries programmed for first-level
>>>   * only and pass-through transfer modes.
>>> @@ -69,5 +79,6 @@ int intel_pasid_setup_pass_through(struct
>>> intel_iommu *iommu, struct device *dev, int pasid);
>>>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>>>  				 struct device *dev, int pasid);
>>> -
>>> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int
>>> *pasid); +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned
>>> int pasid); #endif /* __INTEL_PASID_H */
>>> diff --git a/include/linux/intel-iommu.h
>>> b/include/linux/intel-iommu.h index 6925a18..bff907b 100644
>>> --- a/include/linux/intel-iommu.h
>>> +++ b/include/linux/intel-iommu.h
>>> @@ -173,6 +173,7 @@
>>>  #define ecap_smpwc(e)		(((e) >> 48) & 0x1)
>>>  #define ecap_flts(e)		(((e) >> 47) & 0x1)
>>>  #define ecap_slts(e)		(((e) >> 46) & 0x1)
>>> +#define ecap_vcs(e)		(((e) >> 44) & 0x1)
>>>  #define ecap_smts(e)		(((e) >> 43) & 0x1)
>>>  #define ecap_dit(e)		((e >> 41) & 0x1)
>>>  #define ecap_pasid(e)		((e >> 40) & 0x1)
>>> @@ -289,6 +290,7 @@
>>>  
>>>  /* PRS_REG */
>>>  #define DMA_PRS_PPR	((u32)1)
>>> +#define DMA_VCS_PAS	((u64)1)
>>>  
>>>  #define IOMMU_WAIT_OP(iommu, offset, op, cond,
>>> sts)			\ do
>>> {
>>> \ 
>>
>> Thanks
>>
>> Eric
>>
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 08/19] ioasid: Add custom IOASID allocator
  2019-04-25 21:29     ` Jacob Pan
@ 2019-04-26  9:06       ` Auger Eric
  2019-04-26 15:19         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26  9:06 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Jacob,

On 4/25/19 11:29 PM, Jacob Pan wrote:
> Hi Eric,
> 
> Thanks for the review.
> 
> On Thu, 25 Apr 2019 12:03:42 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> Sometimes, IOASID allocation must be handled by platform specific
>>> code. The use cases are guest vIOMMU and pvIOMMU where IOASIDs need
>>> to be allocated by the host via enlightened or paravirt interfaces.
>>>
>>> This patch adds an extension to the IOASID allocator APIs such that
>>> platform drivers can register a custom allocator, possibly at boot
>>> time, to take over the allocation. Xarray is still used for tracking
>>> and searching purposes internal to the IOASID code. Private data of
>>> an IOASID can also be set after the allocation.
>>>
>>> There can be multiple custom allocators registered but only one is
>>> used at a time. In case of hot removal of devices that provides the
>>> allocator, all IOASIDs must be freed prior to unregistering the
>>> allocator. Default XArray based allocator cannot be mixed with
>>> custom allocators, i.e. custom allocators will not be used if there
>>> are outstanding IOASIDs allocated by the default XA allocator.  
>>
>> What's the exact use case behind allowing several custom IOASID
>> allocators to be registered?
> It is mainly for supporting multiple PCI segments thus multiple
> vIOMMUs. Even though, all allocators will end up calling the host to
> allocate PASIDs.

Yes that was my understanding actually.

Another question is how do you handle the reserved RID_PASID requirement?

 QEMU does not support multiple PCI segments/domains
> afaik but others might.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
>>>  drivers/base/ioasid.c  | 182
>>> ++++++++++++++++++++++++++++++++++++++++++++++---
>>> include/linux/ioasid.h |  15 +++- 2 files changed, 187
>>> insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/base/ioasid.c b/drivers/base/ioasid.c
>>> index c4012aa..5cb36a4 100644
>>> --- a/drivers/base/ioasid.c
>>> +++ b/drivers/base/ioasid.c
>>> @@ -17,6 +17,120 @@ struct ioasid_data {
>>>  };
>>>  
>>>  static DEFINE_XARRAY_ALLOC(ioasid_xa);
>>> +static DEFINE_MUTEX(ioasid_allocator_lock);
>>> +static struct ioasid_allocator *ioasid_allocator;  
>> A more explicit name may be chosen. If I understand correctly that's
>> the active_custom_allocator
> Yes, more clear this way.
> 
>>> +
>>> +static LIST_HEAD(custom_allocators);
>>> +/*
>>> + * A flag to track if ioasid default allocator already been used,
>>> this will  
>> is already in use?
>>> + * prevent custom allocator from being used. The reason is that
>>> custom allocator  
>> s/The reason is that custom allocator/The reason is that custom
>> allocators
>>> + * must have unadulterated space to track private data with
>>> xarray, there cannot
>>> + * be a mix been default and custom allocated IOASIDs.
>>> + */
>>> +static int default_allocator_used;
>>> +
>>> +/**
>>> + * ioasid_register_allocator - register a custom allocator
>>> + * @allocator: the custom allocator to be registered
>>> + *
>>> + * Custom allocator take precedence over the default xarray based
>>> allocator.
>>> + * Private data associated with the ASID are managed by ASID
>>> common code
>>> + * similar to data stored in xa.
>>> + *
>>> + * There can be multiple allocators registered but only one is
>>> active. In case
>>> + * of runtime removal of an custom allocator, the next one is
>>> activated based
>>> + * on the registration ordering.  
>> This last sentence may be moved to the unregister() kerneldoc
>>> + */
>>> +int ioasid_register_allocator(struct ioasid_allocator *allocator)
>>> +{
>>> +	struct ioasid_allocator *pallocator;
>>> +	int ret = 0;
>>> +
>>> +	if (!allocator)
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&ioasid_allocator_lock);
>>> +	if (list_empty(&custom_allocators))
>>> +		ioasid_allocator = allocator;  
>> The fact the first registered custom allocator gets automatically
>> active was not obvious to me and may deserve a comment.
> Will do. I will add:
> "No particular preference since all custom allocators end up calling
> the host to allocate IOASIDs. We activate the first allocator and keep
> the later ones in a list in case the first one gets removed due to
> hotplug."
> 
>>> +	else {
>>> +		/* Check if the allocator is already registered */
>>> +		list_for_each_entry(pallocator,
>>> &custom_allocators, list) {
>>> +			if (pallocator == allocator) {
>>> +				pr_err("IOASID allocator already
>>> exist\n");  
>> s/exist/registered?
> make sense.
>>> +				ret = -EEXIST;
>>> +				goto out_unlock;
>>> +			}
>>> +		}
>>> +	}
>>> +	list_add_tail(&allocator->list, &custom_allocators);
>>> +
>>> +out_unlock:
>>> +	mutex_unlock(&ioasid_allocator_lock);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_register_allocator);
>>> +
>>> +/**
>>> + * ioasid_unregister_allocator - Remove a custom IOASID allocator
>>> + * @allocator: the custom allocator to be removed
>>> + *
>>> + * Remove an allocator from the list, activate the next allocator
>>> in
>>> + * the order it was  registration.
>>> + */
>>> +void ioasid_unregister_allocator(struct ioasid_allocator
>>> *allocator) +{
>>> +	if (!allocator)
>>> +		return;
>>> +
>>> +	if (list_empty(&custom_allocators)) {
>>> +		pr_warn("No custom IOASID allocators active!\n");  
>> s/active/registered?
>>> +		return;
>>> +	}
>>> +
>>> +	mutex_lock(&ioasid_allocator_lock);
>>> +	list_del(&allocator->list);
>>> +	if (list_empty(&custom_allocators)) {
>>> +		pr_info("No custom IOASID allocators\n");
>>> +		/*
>>> +		 * All IOASIDs should have been freed before the
>>> last allocator
>>> +		 * is unregistered.
>>> +		 */
>>> +		BUG_ON(!xa_empty(&ioasid_xa));  
>> At this stage it is difficult to assess whether using a BUG_ON() is
>> safe here. Who is responsible for freeing the IOASIDs?
> Who ever allocates IOASIDs are responsible for freeing. This could be
> the IOMMU driver running in the guest. In the very unlikely scenario
> below:
> 1. vIOMMU1 register a custom allocator1
> 2. vIOMMU2 register a custom allocator2
> 3. sva_bind() called to bind dev under vIOMMU1, use allocator1 to
> allocate ioasid1.
> 4. vIOMMU 1 hot removed
> 5. vIOMMU 2 hot removed
> BUG_ON() hits because sva_unbind was not called on ioasid1. So even if
> we free ioasid1 after BUG_ON, it does not undo the damage.
> 
>>> +		ioasid_allocator = NULL;
>>> +	} else if (allocator == ioasid_allocator) {
>>> +		ioasid_allocator = list_entry(&custom_allocators,
>>> struct ioasid_allocator, list);
>>> +		pr_info("IOASID allocator changed");
>>> +	}
>>> +	mutex_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
>>> +
>>> +/**
>>> + * ioasid_set_data - Set private data for an allocated ioasid
>>> + * @ioasid: the ID to set data
>>> + * @data:   the private data
>>> + *
>>> + * For IOASID that is already allocated, private data can be set
>>> + * via this API. Future lookup can be done via ioasid_find.
>>> + */
>>> +int ioasid_set_data(ioasid_t ioasid, void *data)
>>> +{
>>> +	struct ioasid_data *ioasid_data;
>>> +	int ret = 0;
>>> +
>>> +	ioasid_data = xa_load(&ioasid_xa, ioasid);
>>> +	if (ioasid_data)
>>> +		ioasid_data->private = data;
>>> +	else
>>> +		ret = -ENOENT;
>>> +
>>> +	/* getter may use the private data */
>>> +	synchronize_rcu();
>>> +
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_data);
>>> +
>>>  /**
>>>   * ioasid_alloc - Allocate an IOASID
>>>   * @set: the IOASID set
>>> @@ -31,7 +145,7 @@ static DEFINE_XARRAY_ALLOC(ioasid_xa);
>>>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>> ioasid_t max, void *private)
>>>  {
>>> -	int id = -1;
>>> +	int id = INVALID_IOASID;
>>>  	struct ioasid_data *data;
>>>  
>>>  	data = kzalloc(sizeof(*data), GFP_KERNEL);
>>> @@ -40,14 +154,37 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max, 
>>>  	data->set = set;
>>>  	data->private = private;
>>> +
>>> +	/*
>>> +	 * Use custom allocator if available, otherwise use
>>> default.
>>> +	 * However, if there are active IOASIDs already been
>>> allocated by default
>>> +	 * allocator, custom allocator cannot be used.
>>> +	 */
>>> +	if (!default_allocator_used && ioasid_allocator) {
>>> +		mutex_lock(&ioasid_allocator_lock);
>>> +		id = ioasid_allocator->alloc(min, max,
>>> ioasid_allocator->pdata);
>>> +		mutex_unlock(&ioasid_allocator_lock);
>>> +		if (id == INVALID_IOASID) {
>>> +			pr_err("Failed ASID allocation by custom
>>> allocator\n");
>>> +			goto exit_free;
>>> +		}
>>> +		/*
>>> +		 * Use XA to manage private data also sanitiy
>>> check custom> +		 * allocator for duplicates.  
>> s/data also sanitiy check/data, also sanity check
>>> +		 */
>>> +		min = id;
>>> +		max = id + 1;
>>> +	} else
>>> +		default_allocator_used = 1;  
>> shouldn't default_allocator_used be protected as well?
>>> +
>>>  	if (xa_alloc(&ioasid_xa, &id, data, XA_LIMIT(min, max),
>>> GFP_KERNEL)) { pr_err("Failed to alloc ioasid from %d to %d\n",
>>> min, max); goto exit_free;
>>>  	}
>>> -
>>>  	data->id = id;  
>> wouldn't it be possible to integrate the default io asid allocator as
>> any custom allocator, ie. implement an alloc callback using xa_alloc.
>> Then the active io allocator could be either a custom or a default
>> one.
> That is an interesting idea. I think it is possible.
> But since default xa allocator is internal to ioasid infrastructure,
> why implement it as a callback?

I mean your could directly define a static const default_allocator in
ioasid.c and assign it by default. Do I miss something?

Thanks

Eric
> 
>>> +
>>>  exit_free:
>>> -	if (id < 0) {
>>> +	if (id < 0 || id == INVALID_IOASID) {
>>>  		kfree(data);
>>>  		return INVALID_IOASID;
>>>  	}
>>> @@ -59,12 +196,29 @@ EXPORT_SYMBOL_GPL(ioasid_alloc);
>>>   * ioasid_free - Free an IOASID
>>>   * @ioasid: the ID to remove
>>>   */
>>> -void ioasid_free(ioasid_t ioasid)
>>> +int ioasid_free(ioasid_t ioasid)
>>>  {
>>>  	struct ioasid_data *ioasid_data;
>>> +	int ret = 0;
>>> +
>>> +	if (ioasid_allocator) {
>>> +		mutex_lock(&ioasid_allocator_lock);
>>> +		ret = ioasid_allocator->free(ioasid,
>>> ioasid_allocator->pdata);
>>> +		mutex_unlock(&ioasid_allocator_lock);
>>> +	}
>>> +	if (ret) {
>>> +		pr_err("ioasid %d custom allocator free failed\n",
>>> ioasid);
>>> +		return ret;
>>> +	}
>>>  
>>>  	ioasid_data = xa_erase(&ioasid_xa, ioasid);
>>> +
>>>  	kfree_rcu(ioasid_data, rcu);
>>> +
>>> +	if (xa_empty(&ioasid_xa))
>>> +		default_allocator_used = 0;
>>> +
>>> +	return ret;
>>>  }
>>>  EXPORT_SYMBOL_GPL(ioasid_free);
>>>  
>>> @@ -79,7 +233,8 @@ EXPORT_SYMBOL_GPL(ioasid_free);
>>>   * if @getter returns false, then the object is invalid and NULL
>>> is returned. *
>>>   * If the IOASID has been allocated for this set, return the
>>> private pointer
>>> - * passed to ioasid_alloc. Otherwise return NULL.
>>> + * passed to ioasid_alloc. Private data can be NULL if not set.
>>> Return an error
>>> + * if the IOASID is not found or not belong to the set.  
>> s/not belong/does not belong
>>>   */
>>>  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>>  		  bool (*getter)(void *))
>>> @@ -89,11 +244,20 @@ void *ioasid_find(struct ioasid_set *set,
>>> ioasid_t ioasid, 
>>>  	rcu_read_lock();
>>>  	ioasid_data = xa_load(&ioasid_xa, ioasid);
>>> -	if (ioasid_data && ioasid_data->set == set) {
>>> -		priv = ioasid_data->private;
>>> -		if (getter && !getter(priv))
>>> -			priv = NULL;
>>> +	if (!ioasid_data) {
>>> +		priv = ERR_PTR(-ENOENT);
>>> +		goto unlock;
>>> +	}
>>> +	if (set && ioasid_data->set != set) {
>>> +		/* data found but does not belong to the set */
>>> +		priv = ERR_PTR(-EACCES);
>>> +		goto unlock;
>>>  	}
>>> +	/* Now IOASID and its set is verified, we can return the
>>> private data */
>>> +	priv = ioasid_data->private;
>>> +	if (getter && !getter(priv))
>>> +		priv = NULL;
>>> +unlock:
>>>  	rcu_read_unlock();
>>>  
>>>  	return priv;
>>> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
>>> index 6f3655a..e773c13 100644
>>> --- a/include/linux/ioasid.h
>>> +++ b/include/linux/ioasid.h
>>> @@ -5,20 +5,33 @@
>>>  #define INVALID_IOASID ((ioasid_t)-1)
>>>  typedef unsigned int ioasid_t;
>>>  typedef int (*ioasid_iter_t)(ioasid_t ioasid, void *private, void
>>> *data); +typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min,
>>> ioasid_t max, void *data); +typedef int
>>> (*ioasid_free_fn_t)(ioasid_t ioasid, void *data); 
>>>  struct ioasid_set {
>>>  	int dummy;
>>>  };
>>>  
>>> +struct ioasid_allocator {
>>> +	ioasid_alloc_fn_t alloc;
>>> +	ioasid_free_fn_t free;
>>> +	void *pdata;
>>> +	struct list_head list;
>>> +};
>>> +
>>>  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>>>  
>>>  #ifdef CONFIG_IOASID
>>>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>> ioasid_t max, void *private);
>>> -void ioasid_free(ioasid_t ioasid);
>>> +int ioasid_free(ioasid_t ioasid);  
>> you need to change the definition for the !CONFIG_IOASID case too
> Good catch! I am thinking there is no need to check return value of
> free (as you pointed out in other comments).
> 
>>>  
>>>  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>>  		  bool (*getter)(void *));
>>> +int ioasid_register_allocator(struct ioasid_allocator *allocator);
>>> +void ioasid_unregister_allocator(struct ioasid_allocator
>>> *allocator); +
>>> +int ioasid_set_data(ioasid_t ioasid, void *data);
>>>  
>>>  #else /* !CONFIG_IOASID */
>>>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min,  
>> Just to make sure, don't you need to define the new functions if
>> !CONFIG_IOASID?
>>
> Right, Thanks!
> 
>> Thanks
>>
>> Eric
>>>   
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-25 18:19     ` Jacob Pan
@ 2019-04-26 11:47       ` Jean-Philippe Brucker
  2019-04-26 12:21         ` Christoph Hellwig
  0 siblings, 1 reply; 74+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-26 11:47 UTC (permalink / raw)
  To: Jacob Pan, Christoph Hellwig
  Cc: Tian, Kevin, Raj Ashok, iommu, LKML, Alex Williamson,
	Andriy Shevchenko, David Woodhouse, christian.koenig

On 25/04/2019 19:19, Jacob Pan wrote:
> Hi Christoph,
> 
> On Tue, 23 Apr 2019 23:19:03 -0700
> Christoph Hellwig <hch@infradead.org> wrote:
> 
>> On Tue, Apr 23, 2019 at 04:31:06PM -0700, Jacob Pan wrote:
>>> The allocator doesn't really belong in drivers/iommu because some
>>> drivers would like to allocate PASIDs for devices that aren't
>>> managed by an IOMMU, using the same ID space as IOMMU. It doesn't
>>> really belong in drivers/pci either since platform device also
>>> support PASID. Add the allocator in drivers/base.  
>>
>> I'd still add it to drivers/iommu, just selectable separately from the
>> core iommu code..
> Perhaps I misunderstood. If a driver wants to use IOASIDs w/o iommu
> subsystem even turned on, how could selecting from the core iommu code
> help? Could you elaborate on "selectable"?

How about doing the same as CONFIG_IOMMU_IOVA? The code is in
drivers/iommu but can be selected by non-IOMMU_API users, independently
of CONFIG_IOMMU_SUPPORT. It's true that this allocator will mostly be
used by IOMMU drivers.

> From VT-d's perspective, PASIDs are only used with IOMMU on. Jean
> knows other use cases.

I know of one: the AMD GPU driver may use IOASID for context IDs, even
if IOMMU is disabled. As I understand it, if IOMMU is enabled they need
to use the same allocator as IOMMU since it's the same ID space. And I
think it's more convenient to use the same allocation code in the GPU
driver regardless of CONFIG_IOMMU_SUPPORT.

See the previous discussion at
https://www.spinics.net/lists/iommu/msg31200.html

Thanks,
Jean

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-26 11:47       ` Jean-Philippe Brucker
@ 2019-04-26 12:21         ` Christoph Hellwig
  2019-04-26 16:58           ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2019-04-26 12:21 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, Christoph Hellwig, Tian, Kevin, Raj Ashok, iommu,
	LKML, Alex Williamson, Andriy Shevchenko, David Woodhouse,
	christian.koenig

On Fri, Apr 26, 2019 at 12:47:43PM +0100, Jean-Philippe Brucker wrote:
> >> On Tue, Apr 23, 2019 at 04:31:06PM -0700, Jacob Pan wrote:
> >>> The allocator doesn't really belong in drivers/iommu because some
> >>> drivers would like to allocate PASIDs for devices that aren't
> >>> managed by an IOMMU, using the same ID space as IOMMU. It doesn't
> >>> really belong in drivers/pci either since platform device also
> >>> support PASID. Add the allocator in drivers/base.  
> >>
> >> I'd still add it to drivers/iommu, just selectable separately from the
> >> core iommu code..
> > Perhaps I misunderstood. If a driver wants to use IOASIDs w/o iommu
> > subsystem even turned on, how could selecting from the core iommu code
> > help? Could you elaborate on "selectable"?
> 
> How about doing the same as CONFIG_IOMMU_IOVA? The code is in
> drivers/iommu but can be selected by non-IOMMU_API users, independently
> of CONFIG_IOMMU_SUPPORT. It's true that this allocator will mostly be
> used by IOMMU drivers.

That is exactly what I meant!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation
  2019-04-26  7:24       ` Auger Eric
@ 2019-04-26 15:05         ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-26 15:05 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 09:24:29 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> > Agreed  
> >>> +#define VCMD_VRSP_RESULE(e)		(((e) >> 8) &
> >>> 0xfffff)    
> >> nit: s/RESULE/RSLT?  
> > yes. Also the mask bits should be 8 to 63
> > s/0xfffff/GENMASK_ULL(63, 8))/  
> Well the macro definition looks correct as 63:28 is RsvdZ

you are right, I misread the spec.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 08/19] ioasid: Add custom IOASID allocator
  2019-04-26  9:06       ` Auger Eric
@ 2019-04-26 15:19         ` Jacob Pan
  2019-05-06 17:59           ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-26 15:19 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 11:06:54 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/25/19 11:29 PM, Jacob Pan wrote:
> > Hi Eric,
> > 
> > Thanks for the review.
> > 
> > On Thu, 25 Apr 2019 12:03:42 +0200
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Jacob,
> >>
> >> On 4/24/19 1:31 AM, Jacob Pan wrote:  
> >>> Sometimes, IOASID allocation must be handled by platform specific
> >>> code. The use cases are guest vIOMMU and pvIOMMU where IOASIDs
> >>> need to be allocated by the host via enlightened or paravirt
> >>> interfaces.
> >>>
> >>> This patch adds an extension to the IOASID allocator APIs such
> >>> that platform drivers can register a custom allocator, possibly
> >>> at boot time, to take over the allocation. Xarray is still used
> >>> for tracking and searching purposes internal to the IOASID code.
> >>> Private data of an IOASID can also be set after the allocation.
> >>>
> >>> There can be multiple custom allocators registered but only one is
> >>> used at a time. In case of hot removal of devices that provides
> >>> the allocator, all IOASIDs must be freed prior to unregistering
> >>> the allocator. Default XArray based allocator cannot be mixed with
> >>> custom allocators, i.e. custom allocators will not be used if
> >>> there are outstanding IOASIDs allocated by the default XA
> >>> allocator.    
> >>
> >> What's the exact use case behind allowing several custom IOASID
> >> allocators to be registered?  
> > It is mainly for supporting multiple PCI segments thus multiple
> > vIOMMUs. Even though, all allocators will end up calling the host to
> > allocate PASIDs.  
> 
> Yes that was my understanding actually.
> 
> Another question is how do you handle the reserved RID_PASID
> requirement?
> 
We always use PASID 0 for request w/o PASID, so it does not go through
the allocator.
 #define PASID_RID2PASID			0x0

>  QEMU does not support multiple PCI segments/domains
> > afaik but others might.  
>  [...]  
>  [...]  
> > Yes, more clear this way.
> >   
>  [...]  
> >> is already in use?  
>  [...]  
> >> s/The reason is that custom allocator/The reason is that custom
> >> allocators  
>  [...]  
> >> This last sentence may be moved to the unregister() kerneldoc  
>  [...]  
> >> The fact the first registered custom allocator gets automatically
> >> active was not obvious to me and may deserve a comment.  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >> At this stage it is difficult to assess whether using a BUG_ON() is
> >> safe here. Who is responsible for freeing the IOASIDs?  
>  [...]  
>  [...]  
> >> s/data also sanitiy check/data, also sanity check  
> >>> +		 */
> >>> +		min = id;
> >>> +		max = id + 1;
> >>> +	} else
> >>> +		default_allocator_used = 1;    
> >> shouldn't default_allocator_used be protected as well?  
>  [...]  
> >> wouldn't it be possible to integrate the default io asid allocator
> >> as any custom allocator, ie. implement an alloc callback using
> >> xa_alloc. Then the active io allocator could be either a custom or
> >> a default one.  
> > That is an interesting idea. I think it is possible.
> > But since default xa allocator is internal to ioasid infrastructure,
> > why implement it as a callback?  
> 
> I mean your could directly define a static const default_allocator in
> ioasid.c and assign it by default. Do I miss something?
> 
got it, seems cleaner. let me give it a try.

Thanks

Jacob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 13/19] iommu/vt-d: Add nested translation support
  2019-04-23 23:31 ` [PATCH v2 13/19] iommu/vt-d: Add nested translation support Jacob Pan
@ 2019-04-26 15:42   ` Auger Eric
  2019-04-26 21:57     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26 15:42 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko, Yi L

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> With PASID granular translation type set to 0x11b, translation
> result from the first level(FL) also subject to a second level(SL)
> page table translation. This mode is used for SVA virtualization,
> where FL performs guest virtual to guest physical translation and
> SL performs guest physical to host physical translation.

The title of the patch sounds a bit misleading to me as this patch
"just" adds a helper to set the PASID table entry in nested mode. There
is no caller yet.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-pasid.c | 101 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel-pasid.h |  11 +++++
>  2 files changed, 112 insertions(+)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index d339e8f..04127cf 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -688,3 +688,104 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  
>  	return 0;
>  }
> +
> +/**
> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation
> + * which is used for vSVA. The first level page tables are used for
> + * GVA-GPA translation in the guest, second level page tables are used
> + * for GPA to HPA translation.
> + *
> + * @iommu:      Iommu which the device belong to
> + * @dev:        Device to be set up for translation
> + * @pgd:        First level PGD, treated as GPA
nit: @gpgd

spec naming could be used as well: FLPTPTR: First Level Page
Translation Pointer
> + * @pasid:      PASID to be programmed in the device PASID table
> + * @flags:      Additional info such as supervisor PASID
> + * @domain:     Domain info for setting up second level page tables
> + * @addr_width: Address width of the first level (guest)
> + */
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *gpgd,
> +			int pasid, int flags,
> +			struct dmar_domain *domain,
> +			int addr_width)
> +{
> +	struct pasid_entry *pte;
> +	struct dma_pte *pgd;
> +	u64 pgd_val;
> +	int agaw;
> +	u16 did;
> +
> +	if (!ecap_nest(iommu->ecap)) {
> +		pr_err("No nested translation support on %s\n",
> +		       iommu->name);
IOMMU: %s: ;-)
> +		return -EINVAL;
> +	}
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	pasid_clear_entry(pte);
> +
> +	/* Sanity checking performed by caller to make sure address
> +	 * width matching in two dimensions:
> +	 * 1. CPU vs. IOMMU
> +	 * 2. Guest vs. Host.
> +	 */
> +	switch (addr_width) {
> +	case 57:
> +		pasid_set_flpm(pte, 1);
> +		break;
> +	case 48:
> +		pasid_set_flpm(pte, 0);
> +		break;
> +	default:
> +		dev_err(dev, "Invalid paging mode %d\n", addr_width);
> +		return -EINVAL;
> +	}
> +
> +	/* Setup the first level page table pointer in GPA */
> +	pasid_set_flptr(pte, (u64)gpgd);
> +	if (flags & PASID_FLAG_SUPERVISOR_MODE) {
> +		if (!ecap_srs(iommu->ecap)) {
> +			pr_err("No supervisor request support on %s\n",
> +			       iommu->name);
> +			return -EINVAL;
> +		}
> +		pasid_set_sre(pte);
> +	}
> +
> +	/* Setup the second level based on the given domain */
> +	pgd = domain->pgd;
> +
> +	for (agaw = domain->agaw; agaw != iommu->agaw; agaw--) {
> +		pgd = phys_to_virt(dma_pte_addr(pgd));
> +		if (!dma_pte_present(pgd)) {
> +			dev_err(dev, "Invalid domain page table\n");
> +			return -EINVAL;
> +		}
> +	}
> +	pgd_val = virt_to_phys(pgd);
> +	pasid_set_slptr(pte, pgd_val);
> +	pasid_set_fault_enable(pte);
> +
> +	did = domain->iommu_did[iommu->seq_id];
> +	pasid_set_domain_id(pte, did);
> +
> +	pasid_set_address_width(pte, agaw);
> +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> +
> +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> +	pasid_set_present(pte);
> +
> +	if (!ecap_coherent(iommu->ecap))
> +		clflush_cache_range(pte, sizeof(*pte));
> +
> +	if (cap_caching_mode(iommu->cap)) {
> +		pasid_cache_invalidation_with_pasid(iommu, did, pasid);
> +		iotlb_invalidation_with_pasid(iommu, did, pasid);
> +	} else
> +		iommu_flush_write_buffer(iommu);
a bunch of that code is duplicated from
intel_pasid_setup_second_level(). I wonder if you could devise a common
helper function?

Thanks

Eric
> +
> +	return 0;
> +}
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index 0999dfe..c4fc1af 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -42,6 +42,7 @@
>   * to vmalloc or even module mappings.
>   */
>  #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> +#define PASID_FLAG_NESTED		BIT(1)
>  
>  struct pasid_dir_entry {
>  	u64 val;
> @@ -51,6 +52,11 @@ struct pasid_entry {
>  	u64 val[8];
>  };
>  
> +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> +#define PASID_ENTRY_PGTT_NESTED		(3)
> +#define PASID_ENTRY_PGTT_PT		(4)
> +
>  /* The representative of a PASID table */
>  struct pasid_table {
>  	void			*table;		/* pasid table pointer */
> @@ -77,6 +83,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, int pasid);
> +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> +			struct device *dev, pgd_t *pgd,
> +			int pasid, int flags,
> +			struct dmar_domain *domain,
> +			int addr_width);
>  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>  				 struct device *dev, int pasid);
>  int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid);
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 14/19] iommu: Add guest PASID bind function
  2019-04-23 23:31 ` [PATCH v2 14/19] iommu: Add guest PASID bind function Jacob Pan
@ 2019-04-26 15:53   ` Auger Eric
  2019-04-26 22:11     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26 15:53 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> Guest shared virtual address (SVA) may require host to shadow guest
> PASID tables. Guest PASID can also be allocated from the host via
> enlightened interfaces. In this case, guest needs to bind the guest
> mm, i.e. cr3 in guest phisical address to the actual PASID table in
physical
> the host IOMMU. Nesting will be turned on such that guest virtual
> address can go through a two level translation:
> - 1st level translates GVA to GPA
> - 2nd level translates GPA to HPA
> This patch introduces APIs to bind guest PASID data to the assigned
> device entry in the physical IOMMU. See the diagram below for usage
> explaination.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process mm, FL only |
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |
>     '-------------'
> Guest
> ------| Shadow |--------------------------|------------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.---------------------.
>     |             |   |Set SL to GPA-HPA    |
>     |             |   '---------------------'
>     '-------------'
> 
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommu.c      | 20 ++++++++++++++++++++
>  include/linux/iommu.h      | 10 ++++++++++
>  include/uapi/linux/iommu.h | 15 ++++++++++++++-
>  3 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 498c28a..072f8f3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1561,6 +1561,26 @@ int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
>  }
>  EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
>  
> +int iommu_sva_bind_gpasid(struct iommu_domain *domain,
> +			struct device *dev, struct gpasid_bind_data *data)
> +{
> +	if (unlikely(!domain->ops->sva_bind_gpasid))
> +		return -ENODEV;
> +
> +	return domain->ops->sva_bind_gpasid(domain, dev, data);
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_bind_gpasid);
> +
> +int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev,
> +			int pasid)
> +{
> +	if (unlikely(!domain->ops->sva_unbind_gpasid))
> +		return -ENODEV;
> +
> +	return domain->ops->sva_unbind_gpasid(dev, pasid);
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 4b92e4b..611388e 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -231,6 +231,8 @@ struct iommu_sva_ops {
>   * @detach_pasid_table: detach the pasid table
>   * @cache_invalidate: invalidate translation caches
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @sva_bind_gpasid: bind guest pasid and mm
> + * @sva_unbind_gpasid: unbind guest pasid and mm
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -295,6 +297,10 @@ struct iommu_ops {
>  
>  	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>  				struct iommu_cache_invalidate_info *inv_info);
> +	int (*sva_bind_gpasid)(struct iommu_domain *domain,
> +			struct device *dev, struct gpasid_bind_data *data);
> +
> +	int (*sva_unbind_gpasid)(struct device *dev, int pasid);
So I am confused now. As the scalable mode PASID table entry contains
both the FL and SL PT pointers, will you ever use the
attach/detach_pasid_table or are we the only known users on ARM?

>  
>  	unsigned long pgsize_bitmap;
>  };
> @@ -409,6 +415,10 @@ extern void iommu_detach_pasid_table(struct iommu_domain *domain);
>  extern int iommu_cache_invalidate(struct iommu_domain *domain,
>  				  struct device *dev,
>  				  struct iommu_cache_invalidate_info *inv_info);
> +extern int iommu_sva_bind_gpasid(struct iommu_domain *domain,
> +		struct device *dev, struct gpasid_bind_data *data);
> +extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
> +				struct device *dev, int pasid);
definition in !CONFIG_IOMMU_API case?
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 61a3fb7..5c95905 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -235,6 +235,19 @@ struct iommu_cache_invalidate_info {
>  		struct iommu_inv_addr_info addr_info;
>  	};
>  };
> -
> +/**
> + * struct gpasid_bind_data - Information about device and guest PASID binding
> + * @gcr3:	Guest CR3 value from guest mm
> + * @pasid:	Process address space ID used for the guest mm
> + * @addr_width:	Guest address width. Paging mode can also be derived.
> + */
> +struct gpasid_bind_data {
> +	__u64 gcr3;
> +	__u32 pasid;
> +	__u32 addr_width;
> +	__u32 flags;
> +#define	IOMMU_SVA_GPASID_SRE	BIT(0) /* supervisor request */
> +	__u8 padding[4];
> +};


Thanks

Eric
>  
>  #endif /* _UAPI_IOMMU_H */
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support
  2019-04-23 23:31 ` [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support Jacob Pan
@ 2019-04-26 16:15   ` Auger Eric
  2019-04-29 15:25     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26 16:15 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> When supporting guest SVA with emulated IOMMU, the guest PASID
> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> will result in PASID cache flush which will be passed down to
> the host as bind guest PASID calls.
> 
> For the SL page tables, it will be harvested from device's
> default domain (request w/o PASID), or aux domain in case of
> mediated device.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c |   4 +
>  drivers/iommu/intel-svm.c   | 174 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h |  10 ++-
>  include/linux/intel-svm.h   |   7 ++
>  4 files changed, 193 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 77bbe1b..89989b5 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5768,6 +5768,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.dev_enable_feat	= intel_iommu_dev_enable_feat,
>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> +#endif
>  };
>  
>  static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 8fff212..0a973c2 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -227,6 +227,180 @@ static const struct mmu_notifier_ops intel_mmuops = {
>  
>  static DEFINE_MUTEX(pasid_mutex);
>  static LIST_HEAD(global_svm_list);
> +#define for_each_svm_dev() \
> +	list_for_each_entry(sdev, &svm->devs, list)	\
> +	if (dev == sdev->dev)				\
> +
> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +			struct device *dev,
> +			struct gpasid_bind_data *data)
> +{
> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> +	struct intel_svm_dev *sdev;
> +	struct intel_svm *svm = NULL;
> +	struct dmar_domain *ddomain;
> +	int pasid_max;
> +	int ret = 0;
> +
> +	if (WARN_ON(!iommu) || !data)
> +		return -EINVAL;
> +
> +	if (dev_is_pci(dev)) {
> +		pasid_max = pci_max_pasids(to_pci_dev(dev));
> +		if (pasid_max < 0)
> +			return -EINVAL;
> +	} else
> +		pasid_max = 1 << 20;
> +
> +	if (data->pasid <= 0 || data->pasid >= pasid_max)
> +		return -EINVAL;
> +
> +	ddomain = to_dmar_domain(domain);
> +	/* REVISIT:
> +	 * Sanity check adddress width and paging mode support
> +	 * width matching in two dimensions:
> +	 * 1. paging mode CPU <= IOMMU
> +	 * 2. address width Guest <= Host.
> +	 */
> +	mutex_lock(&pasid_mutex);
> +	svm = ioasid_find(NULL, data->pasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +	if (svm) {
> +		if (list_empty(&svm->devs)) {
> +			dev_err(dev, "GPASID %d has no devices bond but SVA is allocated\n",
> +				data->pasid);
> +			ret = -ENODEV; /*
> +					* If we found svm for the PASID, there must be at
> +					* least one device bond, otherwise svm should be freed.
> +					*/
comment should be put after list_empty I think. In which circumstances
can it happen, I mean, isn't it a BUG_ON case?
> +			goto out;
> +		}
> +		for_each_svm_dev() {
> +			/* In case of multiple sub-devices of the same pdev assigned, we should
> +			 * allow multiple bind calls with the same PASID and pdev.
> +			 */
> +			sdev->users++;
> +			goto out;
> +		}
> +	} else {
> +		/* We come here when PASID has never been bond to a device. */
> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> +		if (!svm) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		/* REVISIT: upper layer/VFIO can track host process that bind the PASID.
> +		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
> +		 * ownership.
> +		 */
> +		svm->mm = get_task_mm(current);
> +		svm->pasid = data->pasid;
> +		refcount_set(&svm->refs, 0);
> +		ioasid_set_data(data->pasid, svm);
> +		INIT_LIST_HEAD_RCU(&svm->devs);
> +		INIT_LIST_HEAD(&svm->list);
> +
> +		mmput(svm->mm);
> +	}
> +	svm->flags |= SVM_FLAG_GUEST_MODE;
> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> +	if (!sdev) {
> +		ret = -ENOMEM;
in case of failure what is the state of svm (you added the
SVM_FLAG_GUEST_MODE bit typically, is it safe to leave it?)
> +		goto out;
> +	}
> +	sdev->dev = dev;
> +	sdev->users = 1;
> +
> +	/* Set up device context entry for PASID if not enabled already */
> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> +	if (ret) {
> +		dev_err(dev, "Failed to enable PASID capability\n");
> +		kfree(sdev);
same here
> +		goto out;
> +	}
> +
> +	/*
> +	 * For guest bind, we need to set up PASID table entry as follows:
> +	 * - FLPM matches guest paging mode
> +	 * - turn on nested mode
> +	 * - SL guest address width matching
> +	 */
> +	ret = intel_pasid_setup_nested(iommu,
> +				dev,
> +				(pgd_t *)data->gcr3,
> +				data->pasid,
> +				data->flags,
> +				ddomain,
> +				data->addr_width);
> +	if (ret) {
> +		dev_err(dev, "Failed to set up PASID %d in nested mode, Err %d\n",
> +			data->pasid, ret);
> +		kfree(sdev);
> +		goto out;
> +	}
> +
> +	init_rcu_head(&sdev->rcu);
> +	refcount_inc(&svm->refs);
> +	list_add_rcu(&sdev->list, &svm->devs);
> + out:
> +	mutex_unlock(&pasid_mutex);
> +	return ret;
> +}
> +
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> +{
> +	struct intel_svm_dev *sdev;
> +	struct intel_iommu *iommu;
> +	struct intel_svm *svm;
> +	int ret = -EINVAL;
> +
> +	mutex_lock(&pasid_mutex);
> +	iommu = intel_svm_device_to_iommu(dev);
> +	if (!iommu)
> +		goto out;
> +
> +	svm = ioasid_find(NULL, pasid, NULL);
> +	if (IS_ERR(svm)) {
> +		ret = PTR_ERR(svm);
> +		goto out;
> +	}
> +
> +	if (!svm)
> +		goto out;
> +
> +	for_each_svm_dev() {
> +		ret = 0;
> +		sdev->users--;
> +		if (!sdev->users) {
> +			list_del_rcu(&sdev->list);
> +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> +			/* TODO: Drain in flight PRQ for the PASID since it
> +			 * may get reused soon, we don't want to
> +			 * confuse with its previous live.
> +			 * intel_svm_drain_prq(dev, pasid);
> +			 */
> +			kfree_rcu(sdev, rcu);
> +
> +			if (list_empty(&svm->devs)) {
> +				list_del(&svm->list);
> +				kfree(svm);
> +				/*
> +				 * We do not free PASID here until explicit call
> +				 * from the guest to free.
can you be confident in the guest?
> +				 */
> +				ioasid_set_data(pasid, NULL);
> +			}
> +		}
> +		break;
> +	}
> + out:
> +	mutex_unlock(&pasid_mutex);
> +
> +	return ret;
> +}
>  
>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
>  {
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 48fa164..5d67d0d4 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -677,7 +677,9 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
>  int intel_svm_init(struct intel_iommu *iommu);
>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> -
> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> +		struct device *dev, struct gpasid_bind_data *data);
> +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>  struct svm_dev_ops;
>  
>  struct intel_svm_dev {
> @@ -693,12 +695,16 @@ struct intel_svm_dev {
>  
>  struct intel_svm {
>  	struct mmu_notifier notifier;
> -	struct mm_struct *mm;
> +	union {
> +		struct mm_struct *mm;
> +		u64 gcr3;
> +	};
>  	struct intel_iommu *iommu;
>  	int flags;
>  	int pasid;
>  	struct list_head devs;
>  	struct list_head list;
> +	refcount_t refs; /* # of devs bond to the PASID */
number of devices sharing the same PASID?
>  };
>  
>  extern struct intel_iommu *intel_svm_device_to_iommu(struct device *dev);
> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> index e3f7631..34b0a3b 100644
> --- a/include/linux/intel-svm.h
> +++ b/include/linux/intel-svm.h
> @@ -52,6 +52,13 @@ struct svm_dev_ops {
>   * do such IOTLB flushes automatically.
>   */
>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> +/*
> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
binds
> + * In this case the mm_struct is in the guest kernel or userspace, its life
> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
> + * means to bind/unbind guest CR3 with PASIDs allocated for a device.
> + */
> +#define SVM_FLAG_GUEST_MODE	(1<<2)
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  
> 

Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 16/19] iommu/vtd: Clean up for SVM device list
  2019-04-23 23:31 ` [PATCH v2 16/19] iommu/vtd: Clean up for SVM device list Jacob Pan
@ 2019-04-26 16:19   ` Auger Eric
  0 siblings, 0 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-26 16:19 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi

On 4/24/19 1:31 AM, Jacob Pan wrote:
> Use combined macro for_each_svm_dev() to simplify SVM device iteration.
> 
> Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> ---
>  drivers/iommu/intel-svm.c | 76 ++++++++++++++++++++++-------------------------
>  1 file changed, 36 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 0a973c2..39dfb2e 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -447,15 +447,13 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
>  				goto out;
>  			}
>  
> -			list_for_each_entry(sdev, &svm->devs, list) {
> -				if (dev == sdev->dev) {
> -					if (sdev->ops != ops) {
> -						ret = -EBUSY;
> -						goto out;
> -					}
> -					sdev->users++;
> -					goto success;
> +			for_each_svm_dev() {
> +				if (sdev->ops != ops) {
> +					ret = -EBUSY;
> +					goto out;
>  				}
> +				sdev->users++;
> +				goto success;
>  			}
>  
>  			break;
> @@ -585,40 +583,38 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
>  	if (!svm)
>  		goto out;
>  
> -	list_for_each_entry(sdev, &svm->devs, list) {
> -		if (dev == sdev->dev) {
> -			ret = 0;
> -			sdev->users--;
> -			if (!sdev->users) {
> -				list_del_rcu(&sdev->list);
> -				/* Flush the PASID cache and IOTLB for this device.
> -				 * Note that we do depend on the hardware *not* using
> -				 * the PASID any more. Just as we depend on other
> -				 * devices never using PASIDs that they have no right
> -				 * to use. We have a *shared* PASID table, because it's
> -				 * large and has to be physically contiguous. So it's
> -				 * hard to be as defensive as we might like. */
> -				intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> -				intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
> -				kfree_rcu(sdev, rcu);
> -
> -				if (list_empty(&svm->devs)) {
> -					ioasid_free(svm->pasid);
> -					if (svm->mm)
> -						mmu_notifier_unregister(&svm->notifier, svm->mm);
> -
> -					list_del(&svm->list);
> -
> -					/* We mandate that no page faults may be outstanding
> -					 * for the PASID when intel_svm_unbind_mm() is called.
> -					 * If that is not obeyed, subtle errors will happen.
> -					 * Let's make them less subtle... */
> -					memset(svm, 0x6b, sizeof(*svm));
> -					kfree(svm);
> -				}
> +	for_each_svm_dev() {
> +		ret = 0;
> +		sdev->users--;
> +		if (!sdev->users) {
> +			list_del_rcu(&sdev->list);
> +			/* Flush the PASID cache and IOTLB for this device.
> +			 * Note that we do depend on the hardware *not* using
> +			 * the PASID any more. Just as we depend on other
> +			 * devices never using PASIDs that they have no right
> +			 * to use. We have a *shared* PASID table, because it's
> +			 * large and has to be physically contiguous. So it's
> +			 * hard to be as defensive as we might like. */
> +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> +			intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
> +			kfree_rcu(sdev, rcu);
> +
> +			if (list_empty(&svm->devs)) {
> +				ioasid_free(svm->pasid);
> +				if (svm->mm)
> +					mmu_notifier_unregister(&svm->notifier, svm->mm);
> +
> +				list_del(&svm->list);
> +
> +				/* We mandate that no page faults may be outstanding
> +				 * for the PASID when intel_svm_unbind_mm() is called.
> +				 * If that is not obeyed, subtle errors will happen.
> +				 * Let's make them less subtle... */
> +				memset(svm, 0x6b, sizeof(*svm));
> +				kfree(svm);
>  			}
> -			break;
>  		}
> +		break;
>  	}
>   out:
>  	mutex_unlock(&pasid_mutex);
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 17/19] iommu: Add max num of cache and granu types
  2019-04-23 23:31 ` [PATCH v2 17/19] iommu: Add max num of cache and granu types Jacob Pan
@ 2019-04-26 16:22   ` Auger Eric
  2019-04-29 16:17     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26 16:22 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> To convert to/from cache types and granularities between generic and
> VT-d specific counterparts, a 2D arrary is used. Introduce the limits
array
> to help define the converstion array size.
conversion
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  include/uapi/linux/iommu.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 5c95905..2d8fac8 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -197,6 +197,7 @@ struct iommu_inv_addr_info {
>  	__u64	granule_size;
>  	__u64	nb_granules;
>  };
> +#define NR_IOMMU_CACHE_INVAL_GRANU	(3)
>  
>  /**
>   * First level/stage invalidation information
> @@ -235,6 +236,7 @@ struct iommu_cache_invalidate_info {
>  		struct iommu_inv_addr_info addr_info;
>  	};
>  };
> +#define NR_IOMMU_CACHE_TYPE		(3)
>  /**
>   * struct gpasid_bind_data - Information about device and guest PASID binding
>   * @gcr3:	Guest CR3 value from guest mm
> 
Is it really something that needs to be exposed in the uapi?

Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-26 12:21         ` Christoph Hellwig
@ 2019-04-26 16:58           ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-26 16:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jean-Philippe Brucker, Tian, Kevin, Raj Ashok, iommu, LKML,
	Alex Williamson, Andriy Shevchenko, David Woodhouse,
	christian.koenig, jacob.jun.pan

On Fri, 26 Apr 2019 05:21:15 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Apr 26, 2019 at 12:47:43PM +0100, Jean-Philippe Brucker wrote:
> > >> On Tue, Apr 23, 2019 at 04:31:06PM -0700, Jacob Pan wrote:  
> > >>> The allocator doesn't really belong in drivers/iommu because
> > >>> some drivers would like to allocate PASIDs for devices that
> > >>> aren't managed by an IOMMU, using the same ID space as IOMMU.
> > >>> It doesn't really belong in drivers/pci either since platform
> > >>> device also support PASID. Add the allocator in
> > >>> drivers/base.    
> > >>
> > >> I'd still add it to drivers/iommu, just selectable separately
> > >> from the core iommu code..  
> > > Perhaps I misunderstood. If a driver wants to use IOASIDs w/o
> > > iommu subsystem even turned on, how could selecting from the core
> > > iommu code help? Could you elaborate on "selectable"?  
> > 
> > How about doing the same as CONFIG_IOMMU_IOVA? The code is in
> > drivers/iommu but can be selected by non-IOMMU_API users,
> > independently of CONFIG_IOMMU_SUPPORT. It's true that this
> > allocator will mostly be used by IOMMU drivers.  
> 
> That is exactly what I meant!

Make sense, will do that in the next round. Thanks!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function
  2019-04-23 23:31 ` [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
@ 2019-04-26 17:23   ` Auger Eric
  2019-04-29 22:41     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-26 17:23 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,
On 4/24/19 1:31 AM, Jacob Pan wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
> 
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
> 
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 159 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 159 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 89989b5..54a3d22 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5338,6 +5338,164 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
>  	aux_domain_remove_dev(to_dmar_domain(domain), dev);
>  }
>  
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of DMA or VFIO unmap. However, for assigned device where guest
> + * could own the first level page tables without being shadowed by QEMU. In
> + * this case there is no pass down unmap to the host IOMMU as a result of unmap
> + * in the guest. Only invalidations are trapped and passed down.
> + * In all cases, only first level TLB invalidation (request with PASID) can be
> + * passed down, therefore we do not include IOTLB granularity for request
> + * without PASID (second level).
> + *
> + * For an example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
> + *
> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> + *
> + */
> +const static int inv_type_granu_map[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU] = {
The size is frozen for a given uapi version so I guess you can hardcode
the limits for a given version.
> +	/* PASID based IOTLB, support PASID selective and page selective */
> +	{0, 1, 1},
> +	/* PASID based dev TLBs, only support all PASIDs or single PASID */
> +	{1, 1, 0},
> +	/* PASID cache */
> +	{1, 1, 0}
> +};
> +
> +const static u64 inv_type_granu_table[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU] = {
> +	/* PASID based IOTLB */
> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> +	/* PASID based dev TLBs */
> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> +	/* PASID cache */
> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> +};
Can't you use a single matrix instead, ie. inv_type_granu_table

> +
> +static inline int to_vtd_granularity(int type, int granu, u64 *vtd_granu)
> +{
> +	if (type >= NR_IOMMU_CACHE_TYPE || granu >= NR_IOMMU_CACHE_INVAL_GRANU ||
> +		!inv_type_granu_map[type][granu])
> +		return -EINVAL;
> +
> +	*vtd_granu = inv_type_granu_table[type][granu];
> +
> +	return 0;
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> +	u64 nr_pages;
direct initialization?
> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> +	 * IOMMU cache invalidate API passes granu_size in bytes, and number of
> +	 * granu size in contiguous memory.
> +	 */
> +
> +	nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +	return order_base_2(nr_pages);
> +}
> +
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct intel_iommu *iommu;
> +	unsigned long flags;
> +	int cache_type;
> +	u8 bus, devfn;
> +	u16 did, sid;
> +	int ret = 0;
> +	u64 granu;
> +	u64 size;
> +
> +	if (!inv_info || !dmar_domain ||
> +		inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> +		return -EINVAL;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	spin_lock(&iommu->lock);
> +	spin_lock_irqsave(&device_domain_lock, flags);
mix of _irqsave and non _irqsave looks suspicious to me.
> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> +	if (!info) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	sid = PCI_DEVID(bus, devfn);
> +	size = to_vtd_size(inv_info->addr_info.granule_size, inv_info->addr_info.nb_granules);
> +
> +	for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, NR_IOMMU_CACHE_TYPE) {
> +
> +		ret = to_vtd_granularity(cache_type, inv_info->granularity, &granu);
> +		if (ret) {
> +			pr_err("Invalid range type %d, granu %d\n", cache_type,
s/Invalid range type %d, granu %d/Invalid cache type/granu combination
(%d/%d)
> +				inv_info->granularity);
> +			break;
> +		}
> +
> +		switch (BIT(cache_type)) {
> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> +			if (size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> +					inv_info->addr_info.addr, size);
> +				ret = -ERANGE;
> +				goto out_unlock;
> +			}
> +
> +			qi_flush_piotlb(iommu, did, mm_to_dma_pfn(inv_info->addr_info.addr),
> +					inv_info->addr_info.pasid,
> +					size, granu);
> +
> +			/*
> +			 * Always flush device IOTLB if ATS is enabled since guest
> +			 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> +			 * down. REVISIT: cannot assume Linux guest
> +			 */
> +			if (info->ats_enabled) {
> +				qi_flush_dev_piotlb(iommu, sid, info->pfsid,
> +						inv_info->addr_info.pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
> +						granu);
> +			}
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> +			if (info->ats_enabled) {
> +				qi_flush_dev_piotlb(iommu, sid, info->pfsid,
> +						inv_info->addr_info.pasid, info->ats_qdep,
> +						inv_info->addr_info.addr, size,
> +						granu);
> +			} else
> +				pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> +
> +			break;
> +		case IOMMU_CACHE_INV_TYPE_PASID:
> +			qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid);
> +
> +			break;
> +		default:
> +			dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> +				cache_type);
> +			ret = -EINVAL;
> +		}
> +	}
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
I would expect the opposite order
> +
> +	return ret;
> +}
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot)
> @@ -5769,6 +5927,7 @@ const struct iommu_ops intel_iommu_ops = {
>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>  #endif
> 
Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID
  2019-04-24 17:27   ` Auger Eric
@ 2019-04-26 20:11     ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-26 20:11 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Wed, 24 Apr 2019 19:27:26 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > When VT-d driver runs in the guest, PASID allocation must be
> > performed via virtual command interface. This patch register a  
> registers
will fix
> > custom IOASID allocator which takes precedence over the default
> > IDR based allocator.  
> nit: s/IDR based// . It is xarray based now.
right, changed code but not commit message :)
>  The resulting IOASID allocation will always
> > come from the host. This ensures that PASID namespace is system-
> > wide.
> > 
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c | 58
> > +++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h |  2 ++ 2 files changed, 60
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index d93c4bd..ec6f22d 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -1711,6 +1711,8 @@ static void free_dmar_iommu(struct
> > intel_iommu *iommu) if (ecap_prs(iommu->ecap))
> >  			intel_svm_finish_prq(iommu);
> >  	}
> > +	ioasid_unregister_allocator(&iommu->pasid_allocator);
> > +
> >  #endif
> >  }
> >  
> > @@ -4811,6 +4813,46 @@ static int __init
> > platform_optin_force_iommu(void) return 1;
> >  }
> >  
> > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max,
> > void *data) +{
> > +	struct intel_iommu *iommu = data;
> > +	ioasid_t ioasid;
> > +
> > +	/*
> > +	 * VT-d virtual command interface always uses the full 20
> > bit
> > +	 * PASID range. Host can partition guest PASID range based
> > on
> > +	 * policies but it is out of guest's control.
> > +	 */  
> The above comment does not exactly relate to the check below
> > +	if (min < PASID_MIN || max > PASID_MAX)
> > +		return -EINVAL;
> > +
> > +	if (vcmd_alloc_pasid(iommu, &ioasid))
> > +		return INVALID_IOASID;
> > +
> > +	return ioasid;
> > +}
> > +
> > +static int intel_ioasid_free(ioasid_t ioasid, void *data)
> > +{
> > +	struct iommu_pasid_alloc_info *svm;
> > +	struct intel_iommu *iommu = data;
> > +
> > +	if (!iommu || !cap_caching_mode(iommu->cap))
> > +		return -EINVAL;  
> can !cap_caching_mode(iommu->cap) be true as the allocator only is set
> if CM?
right, should never happen.
> > +	/*
> > +	 * Sanity check the ioasid owner is done at upper layer,
> > e.g. VFIO
> > +	 * We can only free the PASID when all the devices are
> > unbond.
> > +	 */
> > +	svm = ioasid_find(NULL, ioasid, NULL);
> > +	if (!svm) {  
> you can avoid using the local svm variable.
> > +		pr_warn("Freeing unbond IOASID %d\n", ioasid);  
> unbound
> > +		return -EBUSY;  
> -EINVAL?
It meant the PASID is still being used, bond to a device doing DMA etc.
thus -EBUSY. But I will make the free a void function.
> > +	}
> > +	vcmd_free_pasid(iommu, ioasid);
> > +
> > +	return 0;
> > +}
> > +
> >  int __init intel_iommu_init(void)
> >  {
> >  	int ret = -ENODEV;
> > @@ -4912,6 +4954,22 @@ int __init intel_iommu_init(void)
> >  				       "%s", iommu->name);
> >  		iommu_device_set_ops(&iommu->iommu,
> > &intel_iommu_ops); iommu_device_register(&iommu->iommu);
> > +		if (cap_caching_mode(iommu->cap) &&
> > sm_supported(iommu)) {  
> so shouldn't you test VCCAP_REG as well?
> > +			/*
> > +			 * Register a custom ASID allocator if we
> > are running
> > +			 * in a guest, the purpose is to have a
> > system wide PASID
> > +			 * namespace among all PASID users.
> > +			 * There can be multiple vIOMMUs in each
> > guest but only
> > +			 * one allocator is active. All vIOMMU
> > allocators will
> > +			 * eventually be calling the same host
> > allocator.
> > +			 */
> > +			iommu->pasid_allocator.alloc =
> > intel_ioasid_alloc;
> > +			iommu->pasid_allocator.free =
> > intel_ioasid_free;
> > +			iommu->pasid_allocator.pdata = (void
> > *)iommu;
> > +			ret =
> > ioasid_register_allocator(&iommu->pasid_allocator);
> > +			if (ret)
> > +				pr_warn("Custom PASID allocator
> > registeration failed\n");  
> registration
> > +		}
> >  	}
> >  
> >  	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index bff907b..c24c8aa 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -31,6 +31,7 @@
> >  #include <linux/iommu.h>
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/dmar.h>
> > +#include <linux/ioasid.h>
> >  
> >  #include <asm/cacheflush.h>
> >  #include <asm/iommu.h>
> > @@ -549,6 +550,7 @@ struct intel_iommu {
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  	struct page_req_dsc *prq;
> >  	unsigned char prq_name[16];    /* Name for PRQ interrupt */
> > +	struct ioasid_allocator pasid_allocator; /* Custom
> > allocator for PASIDs */ #endif
> >  	struct q_inval  *qi;            /* Queued invalidation
> > info */ u32 *iommu_state; /* Store iommu states between suspend and
> > resume.*/ 
> 
> Thanks
> 
> Eric

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 13/19] iommu/vt-d: Add nested translation support
  2019-04-26 15:42   ` Auger Eric
@ 2019-04-26 21:57     ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-26 21:57 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, Yi L,
	jacob.jun.pan

On Fri, 26 Apr 2019 17:42:05 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> > With PASID granular translation type set to 0x11b, translation
> > result from the first level(FL) also subject to a second level(SL)
> > page table translation. This mode is used for SVA virtualization,
> > where FL performs guest virtual to guest physical translation and
> > SL performs guest physical to host physical translation.  
> 
> The title of the patch sounds a bit misleading to me as this patch
> "just" adds a helper to set the PASID table entry in nested mode.
> There is no caller yet.
right, will rename to "Add nested translation helper function"
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-pasid.c | 101
> > ++++++++++++++++++++++++++++++++++++++++++++
> > drivers/iommu/intel-pasid.h |  11 +++++ 2 files changed, 112
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-pasid.c
> > b/drivers/iommu/intel-pasid.c index d339e8f..04127cf 100644
> > --- a/drivers/iommu/intel-pasid.c
> > +++ b/drivers/iommu/intel-pasid.c
> > @@ -688,3 +688,104 @@ int intel_pasid_setup_pass_through(struct
> > intel_iommu *iommu, 
> >  	return 0;
> >  }
> > +
> > +/**
> > + * intel_pasid_setup_nested() - Set up PASID entry for nested
> > translation
> > + * which is used for vSVA. The first level page tables are used for
> > + * GVA-GPA translation in the guest, second level page tables are
> > used
> > + * for GPA to HPA translation.
> > + *
> > + * @iommu:      Iommu which the device belong to
> > + * @dev:        Device to be set up for translation
> > + * @pgd:        First level PGD, treated as GPA  
> nit: @gpgd
> 
> spec naming could be used as well: FLPTPTR: First Level Page
> Translation Pointer
more precise. sounds good

> > + * @pasid:      PASID to be programmed in the device PASID table
> > + * @flags:      Additional info such as supervisor PASID
> > + * @domain:     Domain info for setting up second level page tables
> > + * @addr_width: Address width of the first level (guest)
> > + */
> > +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> > +			struct device *dev, pgd_t *gpgd,
> > +			int pasid, int flags,
> > +			struct dmar_domain *domain,
> > +			int addr_width)
> > +{
> > +	struct pasid_entry *pte;
> > +	struct dma_pte *pgd;
> > +	u64 pgd_val;
> > +	int agaw;
> > +	u16 did;
> > +
> > +	if (!ecap_nest(iommu->ecap)) {
> > +		pr_err("No nested translation support on %s\n",
> > +		       iommu->name);  
> IOMMU: %s: ;-)
will do

> > +		return -EINVAL;
> > +	}
> > +
> > +	pte = intel_pasid_get_entry(dev, pasid);
> > +	if (WARN_ON(!pte))
> > +		return -EINVAL;
> > +
> > +	pasid_clear_entry(pte);
> > +
> > +	/* Sanity checking performed by caller to make sure address
> > +	 * width matching in two dimensions:
> > +	 * 1. CPU vs. IOMMU
> > +	 * 2. Guest vs. Host.
> > +	 */
> > +	switch (addr_width) {
> > +	case 57:
> > +		pasid_set_flpm(pte, 1);
> > +		break;
> > +	case 48:
> > +		pasid_set_flpm(pte, 0);
> > +		break;
> > +	default:
> > +		dev_err(dev, "Invalid paging mode %d\n",
> > addr_width);
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Setup the first level page table pointer in GPA */
> > +	pasid_set_flptr(pte, (u64)gpgd);
> > +	if (flags & PASID_FLAG_SUPERVISOR_MODE) {
> > +		if (!ecap_srs(iommu->ecap)) {
> > +			pr_err("No supervisor request support on
> > %s\n",
> > +			       iommu->name);
> > +			return -EINVAL;
> > +		}
> > +		pasid_set_sre(pte);
> > +	}
> > +
> > +	/* Setup the second level based on the given domain */
> > +	pgd = domain->pgd;
> > +
> > +	for (agaw = domain->agaw; agaw != iommu->agaw; agaw--) {
> > +		pgd = phys_to_virt(dma_pte_addr(pgd));
> > +		if (!dma_pte_present(pgd)) {
> > +			dev_err(dev, "Invalid domain page
> > table\n");
> > +			return -EINVAL;
> > +		}
> > +	}
> > +	pgd_val = virt_to_phys(pgd);
> > +	pasid_set_slptr(pte, pgd_val);
> > +	pasid_set_fault_enable(pte);
> > +
> > +	did = domain->iommu_did[iommu->seq_id];
> > +	pasid_set_domain_id(pte, did);
> > +
> > +	pasid_set_address_width(pte, agaw);
> > +	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > +
> > +	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> > +	pasid_set_present(pte);
> > +
> > +	if (!ecap_coherent(iommu->ecap))
> > +		clflush_cache_range(pte, sizeof(*pte));
> > +
> > +	if (cap_caching_mode(iommu->cap)) {
> > +		pasid_cache_invalidation_with_pasid(iommu, did,
> > pasid);
> > +		iotlb_invalidation_with_pasid(iommu, did, pasid);
> > +	} else
> > +		iommu_flush_write_buffer(iommu);  
> a bunch of that code is duplicated from
> intel_pasid_setup_second_level(). I wonder if you could devise a
> common helper function?
> 
indeed, duplicated code. will do.
> Thanks
> 
> Eric
> > +
> > +	return 0;
> > +}
> > diff --git a/drivers/iommu/intel-pasid.h
> > b/drivers/iommu/intel-pasid.h index 0999dfe..c4fc1af 100644
> > --- a/drivers/iommu/intel-pasid.h
> > +++ b/drivers/iommu/intel-pasid.h
> > @@ -42,6 +42,7 @@
> >   * to vmalloc or even module mappings.
> >   */
> >  #define PASID_FLAG_SUPERVISOR_MODE	BIT(0)
> > +#define PASID_FLAG_NESTED		BIT(1)
> >  
> >  struct pasid_dir_entry {
> >  	u64 val;
> > @@ -51,6 +52,11 @@ struct pasid_entry {
> >  	u64 val[8];
> >  };
> >  
> > +#define PASID_ENTRY_PGTT_FL_ONLY	(1)
> > +#define PASID_ENTRY_PGTT_SL_ONLY	(2)
> > +#define PASID_ENTRY_PGTT_NESTED		(3)
> > +#define PASID_ENTRY_PGTT_PT		(4)
> > +
> >  /* The representative of a PASID table */
> >  struct pasid_table {
> >  	void			*table;		/*
> > pasid table pointer */ @@ -77,6 +83,11 @@ int
> > intel_pasid_setup_second_level(struct intel_iommu *iommu, int
> > intel_pasid_setup_pass_through(struct intel_iommu *iommu, struct
> > dmar_domain *domain, struct device *dev, int pasid);
> > +int intel_pasid_setup_nested(struct intel_iommu *iommu,
> > +			struct device *dev, pgd_t *pgd,
> > +			int pasid, int flags,
> > +			struct dmar_domain *domain,
> > +			int addr_width);
> >  void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> >  				 struct device *dev, int pasid);
> >  int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int
> > *pasid); 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 14/19] iommu: Add guest PASID bind function
  2019-04-26 15:53   ` Auger Eric
@ 2019-04-26 22:11     ` Jacob Pan
  2019-04-27  8:37       ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-26 22:11 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 17:53:43 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > Guest shared virtual address (SVA) may require host to shadow guest
> > PASID tables. Guest PASID can also be allocated from the host via
> > enlightened interfaces. In this case, guest needs to bind the guest
> > mm, i.e. cr3 in guest phisical address to the actual PASID table
> > in  
> physical
got it

> > the host IOMMU. Nesting will be turned on such that guest virtual
> > address can go through a two level translation:
> > - 1st level translates GVA to GPA
> > - 2nd level translates GPA to HPA
> > This patch introduces APIs to bind guest PASID data to the assigned
> > device entry in the physical IOMMU. See the diagram below for usage
> > explaination.
> > 
> >     .-------------.  .---------------------------.
> >     |   vIOMMU    |  | Guest process mm, FL only |
> >     |             |  '---------------------------'
> >     .----------------/
> >     | PASID Entry |--- PASID cache flush -
> >     '-------------'                       |
> >     |             |                       V
> >     |             |
> >     '-------------'
> > Guest
> > ------| Shadow |--------------------------|------------
> >       v        v                          v
> > Host
> >     .-------------.  .----------------------.
> >     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >     |             |  '----------------------'
> >     .----------------/  |
> >     | PASID Entry |     V (Nested xlate)
> >     '----------------\.---------------------.
> >     |             |   |Set SL to GPA-HPA    |
> >     |             |   '---------------------'
> >     '-------------'
> > 
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/iommu.c      | 20 ++++++++++++++++++++
> >  include/linux/iommu.h      | 10 ++++++++++
> >  include/uapi/linux/iommu.h | 15 ++++++++++++++-
> >  3 files changed, 44 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 498c28a..072f8f3 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1561,6 +1561,26 @@ int iommu_cache_invalidate(struct
> > iommu_domain *domain, struct device *dev, }
> >  EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
> >  
> > +int iommu_sva_bind_gpasid(struct iommu_domain *domain,
> > +			struct device *dev, struct
> > gpasid_bind_data *data) +{
> > +	if (unlikely(!domain->ops->sva_bind_gpasid))
> > +		return -ENODEV;
> > +
> > +	return domain->ops->sva_bind_gpasid(domain, dev, data);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_sva_bind_gpasid);
> > +
> > +int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct
> > device *dev,
> > +			int pasid)
> > +{
> > +	if (unlikely(!domain->ops->sva_unbind_gpasid))
> > +		return -ENODEV;
> > +
> > +	return domain->ops->sva_unbind_gpasid(dev, pasid);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 4b92e4b..611388e 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -231,6 +231,8 @@ struct iommu_sva_ops {
> >   * @detach_pasid_table: detach the pasid table
> >   * @cache_invalidate: invalidate translation caches
> >   * @pgsize_bitmap: bitmap of all possible supported page sizes
> > + * @sva_bind_gpasid: bind guest pasid and mm
> > + * @sva_unbind_gpasid: unbind guest pasid and mm
> >   */
> >  struct iommu_ops {
> >  	bool (*capable)(enum iommu_cap);
> > @@ -295,6 +297,10 @@ struct iommu_ops {
> >  
> >  	int (*cache_invalidate)(struct iommu_domain *domain,
> > struct device *dev, struct iommu_cache_invalidate_info *inv_info);
> > +	int (*sva_bind_gpasid)(struct iommu_domain *domain,
> > +			struct device *dev, struct
> > gpasid_bind_data *data); +
> > +	int (*sva_unbind_gpasid)(struct device *dev, int pasid);  
> So I am confused now. As the scalable mode PASID table entry contains
> both the FL and SL PT pointers, will you ever use the
> attach/detach_pasid_table or are we the only known users on ARM?
> 
In scalable mode, we will not use attach pasid table. So ARM will be
the only user for now.
Guest PASID table is shadowed, PASID cache flush in the guest will
trigger sva_bind_gpasid.
I introduced bind PASID table for the previous VT-d spec that has
extend context mode (deprecated), where SL is shared by all PASIDs on 
the same device.
> >  
> >  	unsigned long pgsize_bitmap;
> >  };
> > @@ -409,6 +415,10 @@ extern void iommu_detach_pasid_table(struct
> > iommu_domain *domain); extern int iommu_cache_invalidate(struct
> > iommu_domain *domain, struct device *dev,
> >  				  struct
> > iommu_cache_invalidate_info *inv_info); +extern int
> > iommu_sva_bind_gpasid(struct iommu_domain *domain,
> > +		struct device *dev, struct gpasid_bind_data *data);
> > +extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
> > +				struct device *dev, int pasid);  
> definition in !CONFIG_IOMMU_API case?
right

> >  extern struct iommu_domain *iommu_get_domain_for_dev(struct device
> > *dev); extern struct iommu_domain *iommu_get_dma_domain(struct
> > device *dev); extern int iommu_map(struct iommu_domain *domain,
> > unsigned long iova, diff --git a/include/uapi/linux/iommu.h
> > b/include/uapi/linux/iommu.h index 61a3fb7..5c95905 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -235,6 +235,19 @@ struct iommu_cache_invalidate_info {
> >  		struct iommu_inv_addr_info addr_info;
> >  	};
> >  };
> > -
> > +/**
> > + * struct gpasid_bind_data - Information about device and guest
> > PASID binding
> > + * @gcr3:	Guest CR3 value from guest mm
> > + * @pasid:	Process address space ID used for the guest mm
> > + * @addr_width:	Guest address width. Paging mode can also
> > be derived.
> > + */
> > +struct gpasid_bind_data {
> > +	__u64 gcr3;
> > +	__u32 pasid;
> > +	__u32 addr_width;
> > +	__u32 flags;
> > +#define	IOMMU_SVA_GPASID_SRE	BIT(0) /* supervisor
> > request */
> > +	__u8 padding[4];
> > +};  
> 
> 
> Thanks
> 
> Eric
> >  
> >  #endif /* _UAPI_IOMMU_H */
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 14/19] iommu: Add guest PASID bind function
  2019-04-26 22:11     ` Jacob Pan
@ 2019-04-27  8:37       ` Auger Eric
  0 siblings, 0 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-27  8:37 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Jacob,

On 4/27/19 12:11 AM, Jacob Pan wrote:
> On Fri, 26 Apr 2019 17:53:43 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> Guest shared virtual address (SVA) may require host to shadow guest
>>> PASID tables. Guest PASID can also be allocated from the host via
>>> enlightened interfaces. In this case, guest needs to bind the guest
>>> mm, i.e. cr3 in guest phisical address to the actual PASID table
>>> in  
>> physical
> got it
> 
>>> the host IOMMU. Nesting will be turned on such that guest virtual
>>> address can go through a two level translation:
>>> - 1st level translates GVA to GPA
>>> - 2nd level translates GPA to HPA
>>> This patch introduces APIs to bind guest PASID data to the assigned
>>> device entry in the physical IOMMU. See the diagram below for usage
>>> explaination.
>>>
>>>     .-------------.  .---------------------------.
>>>     |   vIOMMU    |  | Guest process mm, FL only |
>>>     |             |  '---------------------------'
>>>     .----------------/
>>>     | PASID Entry |--- PASID cache flush -
>>>     '-------------'                       |
>>>     |             |                       V
>>>     |             |
>>>     '-------------'
>>> Guest
>>> ------| Shadow |--------------------------|------------
>>>       v        v                          v
>>> Host
>>>     .-------------.  .----------------------.
>>>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>>>     |             |  '----------------------'
>>>     .----------------/  |
>>>     | PASID Entry |     V (Nested xlate)
>>>     '----------------\.---------------------.
>>>     |             |   |Set SL to GPA-HPA    |
>>>     |             |   '---------------------'
>>>     '-------------'
>>>
>>> Where:
>>>  - FL = First level/stage one page tables
>>>  - SL = Second level/stage two page tables
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  drivers/iommu/iommu.c      | 20 ++++++++++++++++++++
>>>  include/linux/iommu.h      | 10 ++++++++++
>>>  include/uapi/linux/iommu.h | 15 ++++++++++++++-
>>>  3 files changed, 44 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 498c28a..072f8f3 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1561,6 +1561,26 @@ int iommu_cache_invalidate(struct
>>> iommu_domain *domain, struct device *dev, }
>>>  EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
>>>  
>>> +int iommu_sva_bind_gpasid(struct iommu_domain *domain,
>>> +			struct device *dev, struct
>>> gpasid_bind_data *data) +{
>>> +	if (unlikely(!domain->ops->sva_bind_gpasid))
>>> +		return -ENODEV;
>>> +
>>> +	return domain->ops->sva_bind_gpasid(domain, dev, data);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_sva_bind_gpasid);
>>> +
>>> +int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct
>>> device *dev,
>>> +			int pasid)
>>> +{
>>> +	if (unlikely(!domain->ops->sva_unbind_gpasid))
>>> +		return -ENODEV;
>>> +
>>> +	return domain->ops->sva_unbind_gpasid(dev, pasid);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
>>> +
>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>  				  struct device *dev)
>>>  {
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index 4b92e4b..611388e 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -231,6 +231,8 @@ struct iommu_sva_ops {
>>>   * @detach_pasid_table: detach the pasid table
>>>   * @cache_invalidate: invalidate translation caches
>>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>> + * @sva_bind_gpasid: bind guest pasid and mm
>>> + * @sva_unbind_gpasid: unbind guest pasid and mm
>>>   */
>>>  struct iommu_ops {
>>>  	bool (*capable)(enum iommu_cap);
>>> @@ -295,6 +297,10 @@ struct iommu_ops {
>>>  
>>>  	int (*cache_invalidate)(struct iommu_domain *domain,
>>> struct device *dev, struct iommu_cache_invalidate_info *inv_info);
>>> +	int (*sva_bind_gpasid)(struct iommu_domain *domain,
>>> +			struct device *dev, struct
>>> gpasid_bind_data *data); +
>>> +	int (*sva_unbind_gpasid)(struct device *dev, int pasid);  
>> So I am confused now. As the scalable mode PASID table entry contains
>> both the FL and SL PT pointers, will you ever use the
>> attach/detach_pasid_table or are we the only known users on ARM?
>>
> In scalable mode, we will not use attach pasid table. So ARM will be
> the only user for now.
> Guest PASID table is shadowed, PASID cache flush in the guest will
> trigger sva_bind_gpasid.
> I introduced bind PASID table for the previous VT-d spec that has
> extend context mode (deprecated), where SL is shared by all PASIDs on 
> the same device.
OK. Thank you for the confirmation.

Thanks

Eric

>>>  
>>>  	unsigned long pgsize_bitmap;
>>>  };
>>> @@ -409,6 +415,10 @@ extern void iommu_detach_pasid_table(struct
>>> iommu_domain *domain); extern int iommu_cache_invalidate(struct
>>> iommu_domain *domain, struct device *dev,
>>>  				  struct
>>> iommu_cache_invalidate_info *inv_info); +extern int
>>> iommu_sva_bind_gpasid(struct iommu_domain *domain,
>>> +		struct device *dev, struct gpasid_bind_data *data);
>>> +extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
>>> +				struct device *dev, int pasid);  
>> definition in !CONFIG_IOMMU_API case?
> right
> 
>>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device
>>> *dev); extern struct iommu_domain *iommu_get_dma_domain(struct
>>> device *dev); extern int iommu_map(struct iommu_domain *domain,
>>> unsigned long iova, diff --git a/include/uapi/linux/iommu.h
>>> b/include/uapi/linux/iommu.h index 61a3fb7..5c95905 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -235,6 +235,19 @@ struct iommu_cache_invalidate_info {
>>>  		struct iommu_inv_addr_info addr_info;
>>>  	};
>>>  };
>>> -
>>> +/**
>>> + * struct gpasid_bind_data - Information about device and guest
>>> PASID binding
>>> + * @gcr3:	Guest CR3 value from guest mm
>>> + * @pasid:	Process address space ID used for the guest mm
>>> + * @addr_width:	Guest address width. Paging mode can also
>>> be derived.
>>> + */
>>> +struct gpasid_bind_data {
>>> +	__u64 gcr3;
>>> +	__u32 pasid;
>>> +	__u32 addr_width;
>>> +	__u32 flags;
>>> +#define	IOMMU_SVA_GPASID_SRE	BIT(0) /* supervisor
>>> request */
>>> +	__u8 padding[4];
>>> +};  
>>
>>
>> Thanks
>>
>> Eric
>>>  
>>>  #endif /* _UAPI_IOMMU_H */
>>>   
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID
       [not found]     ` <20190426140133.6d445315@jacob-builder>
@ 2019-04-27  8:38       ` Auger Eric
  2019-04-29 10:00         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-27  8:38 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko



On 4/26/19 11:01 PM, Jacob Pan wrote:
> On Thu, 25 Apr 2019 12:04:01 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> Make use of generic IOASID code to manage PASID allocation,
>>> free, and lookup.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
>>>  drivers/iommu/Kconfig       |  1 +
>>>  drivers/iommu/intel-iommu.c |  9 ++++-----
>>>  drivers/iommu/intel-pasid.c | 36
>>> ------------------------------------ drivers/iommu/intel-svm.c   |
>>> 41 ++++++++++++++++++++++++----------------- 4 files changed, 29
>>> insertions(+), 58 deletions(-)
>>>
>>> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>>> index 6f07f3b..7f92009 100644
>>> --- a/drivers/iommu/Kconfig
>>> +++ b/drivers/iommu/Kconfig
>>> @@ -204,6 +204,7 @@ config INTEL_IOMMU_SVM
>>>  	bool "Support for Shared Virtual Memory with Intel IOMMU"
>>>  	depends on INTEL_IOMMU && X86
>>>  	select PCI_PASID
>>> +	select IOASID
>>>  	select MMU_NOTIFIER
>>>  	help
>>>  	  Shared Virtual Memory (SVM) provides a facility for
>>> devices diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index ec6f22d..785330a 100644
>>> --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -5153,7 +5153,7 @@ static void auxiliary_unlink_device(struct
>>> dmar_domain *domain, domain->auxd_refcnt--;
>>>  
>>>  	if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>> -		intel_pasid_free_id(domain->default_pasid);
>>> +		ioasid_free(domain->default_pasid);
>>>  }
>>>  
>>>  static int aux_domain_add_dev(struct dmar_domain *domain,
>>> @@ -5171,9 +5171,8 @@ static int aux_domain_add_dev(struct
>>> dmar_domain *domain, if (domain->default_pasid <= 0) {
>>>  		int pasid;
>>>  
>>> -		pasid = intel_pasid_alloc_id(domain, PASID_MIN,
>>> -
>>> pci_max_pasids(to_pci_dev(dev)),
>>> -					     GFP_KERNEL);
>>> +		pasid = ioasid_alloc(NULL, PASID_MIN,
>>> pci_max_pasids(to_pci_dev(dev)) - 1,
>>> +				domain);
>>>  		if (pasid <= 0) {  
>> ioasid_t is a uint and returns INVALID_IOASID on error. Wouldn't it be
>> simpler to make ioasid_alloc return an int?
> Well, I think we still want the full uint range - 1(INVALID_IOASID).
> Intel uses 20bit but I think SMMUs use 32 bits for streamID? I
> should just check 
> 	if (pasid == INVALID_IOASID) {
Jean-Philippe may correct me but SMMU uses 20b SubstreamId which is a
superset of PASIDs. StreamId is 32b.
> 
>>>  			pr_err("Can't allocate default pasid\n");
>>>  			return -ENODEV;
>>> @@ -5210,7 +5209,7 @@ static int aux_domain_add_dev(struct
>>> dmar_domain *domain, spin_unlock(&iommu->lock);
>>>  	spin_unlock_irqrestore(&device_domain_lock, flags);
>>>  	if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>> -		intel_pasid_free_id(domain->default_pasid);
>>> +		ioasid_free(domain->default_pasid);
>>>  
>>>  	return ret;
>>>  }
>>> diff --git a/drivers/iommu/intel-pasid.c
>>> b/drivers/iommu/intel-pasid.c index 5b1d3be..d339e8f 100644
>>> --- a/drivers/iommu/intel-pasid.c
>>> +++ b/drivers/iommu/intel-pasid.c
>>> @@ -26,42 +26,6 @@
>>>   */
>>>  static DEFINE_SPINLOCK(pasid_lock);
>>>  u32 intel_pasid_max_id = PASID_MAX;
>>> -static DEFINE_IDR(pasid_idr);
>>> -
>>> -int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp)
>>> -{
>>> -	int ret, min, max;
>>> -
>>> -	min = max_t(int, start, PASID_MIN);
>>> -	max = min_t(int, end, intel_pasid_max_id);
>>> -
>>> -	WARN_ON(in_interrupt());
>>> -	idr_preload(gfp);
>>> -	spin_lock(&pasid_lock);
>>> -	ret = idr_alloc(&pasid_idr, ptr, min, max, GFP_ATOMIC);
>>> -	spin_unlock(&pasid_lock);
>>> -	idr_preload_end();
>>> -
>>> -	return ret;
>>> -}
>>> -
>>> -void intel_pasid_free_id(int pasid)
>>> -{
>>> -	spin_lock(&pasid_lock);
>>> -	idr_remove(&pasid_idr, pasid);
>>> -	spin_unlock(&pasid_lock);
>>> -}
>>> -
>>> -void *intel_pasid_lookup_id(int pasid)
>>> -{
>>> -	void *p;
>>> -
>>> -	spin_lock(&pasid_lock);
>>> -	p = idr_find(&pasid_idr, pasid);
>>> -	spin_unlock(&pasid_lock);
>>> -
>>> -	return p;
>>> -}
>>>  
>>>  int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int
>>> *pasid) {
>>> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
>>> index 8f87304..8fff212 100644
>>> --- a/drivers/iommu/intel-svm.c
>>> +++ b/drivers/iommu/intel-svm.c
>>> @@ -25,6 +25,7 @@
>>>  #include <linux/dmar.h>
>>>  #include <linux/interrupt.h>
>>>  #include <linux/mm_types.h>
>>> +#include <linux/ioasid.h>
>>>  #include <asm/page.h>
>>>  
>>>  #include "intel-pasid.h"
>>> @@ -211,7 +212,9 @@ static void intel_mm_release(struct
>>> mmu_notifier *mn, struct mm_struct *mm) rcu_read_lock();
>>>  	list_for_each_entry_rcu(sdev, &svm->devs, list) {
>>>  		intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
>>> svm->pasid);
>>> -		intel_flush_svm_range_dev(svm, sdev, 0, -1,
>>> 0, !svm->mm)> +		/* for emulated iommu, PASID cache
>>> invalidation implies IOTLB/DTLB */
>>> +		if (!cap_caching_mode(svm->iommu->cap))
>>> +			intel_flush_svm_range_dev(svm, sdev, 0,
>>> -1, 0, !svm->mm);  
>> This change is not documented in the commit message. Isn't it a
>> separate fix?
> right, should be separate.
> 
>>>  	}
>>>  	rcu_read_unlock();
>>>  
>>> @@ -332,16 +335,15 @@ int intel_svm_bind_mm(struct device *dev, int
>>> *pasid, int flags, struct svm_dev_ if (pasid_max >
>>> intel_pasid_max_id) pasid_max = intel_pasid_max_id;
>>>  
>>> -		/* Do not use PASID 0 in caching mode (virtualised
>>> IOMMU) */
>>> -		ret = intel_pasid_alloc_id(svm,
>>> -					   !!cap_caching_mode(iommu->cap),
>>> -					   pasid_max - 1,
>>> GFP_KERNEL);
>>> -		if (ret < 0) {
>>> +		/* Do not use PASID 0, reserved for RID to PASID */
>>> +		svm->pasid = ioasid_alloc(NULL, PASID_MIN,
>>> +					pasid_max - 1, svm);  
>> the fact the max is not decremented compared to intel_pasid_alloc_id
>> looks suspicious to me (exclusive to inclusive move). I guess it is a
>> fix in which case this may be documented in the commit msg?
> Yes, it should be in separate patch.
> 
> VT-d will always support device with full 20bit PASID range only, we
> should fail svm_bind if device pasid_max < 20. It has to be
> included in this series otherwise we could have assigned a device with <
> 20 PASID and our virtual command can only handle full 20bit.
OK

Thanks

Eric
> 
>>> +		if (svm->pasid == INVALID_IOASID) {
>>>  			kfree(svm);
>>>  			kfree(sdev);
>>> +			ret = ENOSPC;  
>> -ENOSPC
>>>  			goto out;
>>>  		}
>>> -		svm->pasid = ret;
>>>  		svm->notifier.ops = &intel_mmuops;
>>>  		svm->mm = mm;
>>>  		svm->flags = flags;
>>> @@ -351,7 +353,7 @@ int intel_svm_bind_mm(struct device *dev, int
>>> *pasid, int flags, struct svm_dev_ if (mm) {
>>>  			ret =
>>> mmu_notifier_register(&svm->notifier, mm); if (ret) {
>>> -				intel_pasid_free_id(svm->pasid);
>>> +				ioasid_free(svm->pasid);
>>>  				kfree(svm);
>>>  				kfree(sdev);
>>>  				goto out;
>>> @@ -367,7 +369,7 @@ int intel_svm_bind_mm(struct device *dev, int
>>> *pasid, int flags, struct svm_dev_ if (ret) {
>>>  			if (mm)
>>>  				mmu_notifier_unregister(&svm->notifier,
>>> mm);
>>> -			intel_pasid_free_id(svm->pasid);
>>> +			ioasid_free(svm->pasid);  
>> the ioasid_free returned value never is tested. Is it useful?
>>>  			kfree(svm);
>>>  			kfree(sdev);
>>>  			goto out;
>>> @@ -400,7 +402,12 @@ int intel_svm_unbind_mm(struct device *dev,
>>> int pasid) if (!iommu)
>>>  		goto out;
>>>  
>>> -	svm = intel_pasid_lookup_id(pasid);
>>> +	svm = ioasid_find(NULL, pasid, NULL);
>>> +	if (IS_ERR(svm)) {
>>> +		ret = PTR_ERR(svm);
>>> +		goto out;
>>> +	}
>>> +
>>>  	if (!svm)
>>>  		goto out;
>>>  
>>> @@ -422,7 +429,7 @@ int intel_svm_unbind_mm(struct device *dev, int
>>> pasid) kfree_rcu(sdev, rcu);
>>>  
>>>  				if (list_empty(&svm->devs)) {
>>> -
>>> intel_pasid_free_id(svm->pasid);
>>> +					ioasid_free(svm->pasid);
>>>  					if (svm->mm)
>>>  						mmu_notifier_unregister(&svm->notifier,
>>> svm->mm); 
>>> @@ -457,10 +464,11 @@ int intel_svm_is_pasid_valid(struct device
>>> *dev, int pasid) if (!iommu)
>>>  		goto out;
>>>  
>>> -	svm = intel_pasid_lookup_id(pasid);
>>> -	if (!svm)
>>> +	svm = ioasid_find(NULL, pasid, NULL);
>>> +	if (IS_ERR(svm)) {
>>> +		ret = PTR_ERR(svm);
>>>  		goto out;
>>> -
>>> +	}
>>>  	/* init_mm is used in this case */
>>>  	if (!svm->mm)
>>>  		ret = 1;
>>> @@ -567,13 +575,12 @@ static irqreturn_t prq_event_thread(int irq,
>>> void *d) 
>>>  		if (!svm || svm->pasid != req->pasid) {
>>>  			rcu_read_lock();
>>> -			svm = intel_pasid_lookup_id(req->pasid);
>>> +			svm = ioasid_find(NULL, req->pasid, NULL);
>>>  			/* It *can't* go away, because the driver
>>> is not permitted
>>>  			 * to unbind the mm while any page faults
>>> are outstanding.
>>>  			 * So we only need RCU to protect the
>>> internal idr code. */ rcu_read_unlock();
>>> -
>>> -			if (!svm) {
>>> +			if (IS_ERR(svm) || !svm) {
>>>  				pr_err("%s: Page request for
>>> invalid PASID %d: %08llx %08llx\n", iommu->name, req->pasid,
>>> ((unsigned long long *)req)[0], ((unsigned long long *)req)[1]);
>>>   
>>
>> Thanks
>>
>> Eric
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types
  2019-04-23 23:31 ` [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types Jacob Pan
@ 2019-04-27  9:04   ` Auger Eric
  2019-04-29 21:29     ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-27  9:04 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

Hi Jacob,

On 4/24/19 1:31 AM, Jacob Pan wrote:
> When Shared Virtual Memory is exposed to a guest via vIOMMU, extended
> IOTLB invalidation may be passed down from outside IOMMU subsystems.
> This patch adds invalidation functions that can be used for additional
> translation cache types.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/dmar.c        | 48 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h | 21 ++++++++++++++++----
>  2 files changed, 65 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 9c49300..680894e 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1357,6 +1357,20 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
/* PASID-based IOTLB Invalidate */
> +void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64 addr, u32 pasid,
> +		unsigned int size_order, u64 granu)
> +{
> +	struct qi_desc desc;
> +
> +	desc.qw0 = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
> +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
> +	desc.qw1 = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_IH(0) |
> +		QI_EIOTLB_AM(size_order);
I see IH it hardcoded to 0. Don't you envision to cascade the IH. On ARM
this was needed for perf sake.
> +	desc.qw2 = 0;
> +	desc.qw3 = 0;
> +	qi_submit_sync(&desc, iommu);
> +}
> +
>  void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask)
>  {
> @@ -1380,6 +1394,40 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
/* Pasid-based Device-TLB Invalidation */
> +void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64 granu)
> +{
> +	struct qi_desc desc;
> +
> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> +		QI_DEV_IOTLB_PFSID(pfsid);
> +	desc.qw1 |= QI_DEV_EIOTLB_GLOB(granu);
> +
> +	/* If S bit is 0, we only flush a single page. If S bit is set,
> +	 * The least significant zero bit indicates the size. VT-d spec
> +	 * 6.5.2.6
> +	 */
> +	if (!size)
> +		desc.qw0 = QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
desc.q1 |= ?
> +	else {
> +		unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
> +
> +		desc.qw1 = QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
desc.q1 |=
> +	}
> +	qi_submit_sync(&desc, iommu);
> +}
> +
/* PASID-cache invalidation */
> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> +{
> +	struct qi_desc desc;
> +
> +	desc.qw0 = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
> +	desc.qw1 = 0;
> +	desc.qw2 = 0;
> +	desc.qw3 = 0;
> +	qi_submit_sync(&desc, iommu);
> +}
>  /*
>   * Disable Queued Invalidation interface.
>   */
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 5d67d0d4..38e5efb 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -339,7 +339,7 @@ enum {
>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
>  #define QI_IOTLB_ADDR(addr)	(((u64)addr) & VTD_PAGE_MASK)
>  #define QI_IOTLB_IH(ih)		(((u64)ih) << 6)
> -#define QI_IOTLB_AM(am)		(((u8)am))
> +#define QI_IOTLB_AM(am)		(((u8)am) & 0x3f)
>  
>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> @@ -357,17 +357,22 @@ enum {
>  #define QI_PC_DID(did)		(((u64)did) << 16)
>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>  
> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> +/* PASID cache invalidation granu */
> +#define QI_PC_ALL_PASIDS	0
> +#define QI_PC_PASID_SEL		1
>  
>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> -#define QI_EIOTLB_AM(am)	(((u64)am))
> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>  
> +/* QI Dev-IOTLB inv granu */
> +#define QI_DEV_IOTLB_GRAN_ALL		1
> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> +
>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> @@ -658,8 +663,16 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
>  			     u8 fm, u64 type);
>  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  			  unsigned int size_order, u64 type);
> +extern void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64 addr,
> +			u32 pasid, unsigned int size_order, u64 type);
>  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask);
> +
> +extern void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +			u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
> +
> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
> +
>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
>  
>  extern int dmar_ir_support(void);
> 

Thanks

Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID
  2019-04-27  8:38       ` Auger Eric
@ 2019-04-29 10:00         ` Jean-Philippe Brucker
  0 siblings, 0 replies; 74+ messages in thread
From: Jean-Philippe Brucker @ 2019-04-29 10:00 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Yi Liu, Tian, Kevin, Raj Ashok, Christoph Hellwig, Lu Baolu,
	Andriy Shevchenko

On 27/04/2019 09:38, Auger Eric wrote:
>>>> --- a/drivers/iommu/intel-iommu.c
>>>> +++ b/drivers/iommu/intel-iommu.c
>>>> @@ -5153,7 +5153,7 @@ static void auxiliary_unlink_device(struct
>>>> dmar_domain *domain, domain->auxd_refcnt--;
>>>>  
>>>>  	if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>>> -		intel_pasid_free_id(domain->default_pasid);
>>>> +		ioasid_free(domain->default_pasid);
>>>>  }
>>>>  
>>>>  static int aux_domain_add_dev(struct dmar_domain *domain,
>>>> @@ -5171,9 +5171,8 @@ static int aux_domain_add_dev(struct
>>>> dmar_domain *domain, if (domain->default_pasid <= 0) {
>>>>  		int pasid;
>>>>  
>>>> -		pasid = intel_pasid_alloc_id(domain, PASID_MIN,
>>>> -
>>>> pci_max_pasids(to_pci_dev(dev)),
>>>> -					     GFP_KERNEL);
>>>> +		pasid = ioasid_alloc(NULL, PASID_MIN,
>>>> pci_max_pasids(to_pci_dev(dev)) - 1,
>>>> +				domain);
>>>>  		if (pasid <= 0) {  
>>> ioasid_t is a uint and returns INVALID_IOASID on error. Wouldn't it be
>>> simpler to make ioasid_alloc return an int?
>> Well, I think we still want the full uint range - 1(INVALID_IOASID).
>> Intel uses 20bit but I think SMMUs use 32 bits for streamID? I
>> should just check 
>> 	if (pasid == INVALID_IOASID) {
> Jean-Philippe may correct me but SMMU uses 20b SubstreamId which is a
> superset of PASIDs. StreamId is 32b.

Right, we use 20 bits for PASIDs (== SubstreamID really). Given the
choices that vendors are making for PASIDs (a global namespace rather
than per-VM), I wouldn't be surprised if they extend the size of PASIDs
in a couple of years, so I added the typedef ioasid_t to ease a possible
change from 32-bit to 64 in the future.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support
  2019-04-26 16:15   ` Auger Eric
@ 2019-04-29 15:25     ` Jacob Pan
  2019-04-30  7:05       ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-29 15:25 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 18:15:27 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > When supporting guest SVA with emulated IOMMU, the guest PASID
> > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > will result in PASID cache flush which will be passed down to
> > the host as bind guest PASID calls.
> > 
> > For the SL page tables, it will be harvested from device's
> > default domain (request w/o PASID), or aux domain in case of
> > mediated device.
> > 
> >     .-------------.  .---------------------------.
> >     |   vIOMMU    |  | Guest process CR3, FL only|
> >     |             |  '---------------------------'
> >     .----------------/
> >     | PASID Entry |--- PASID cache flush -
> >     '-------------'                       |
> >     |             |                       V
> >     |             |                CR3 in GPA
> >     '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> >     .-------------.  .----------------------.
> >     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >     |             |  '----------------------'
> >     .----------------/  |
> >     | PASID Entry |     V (Nested xlate)
> >     '----------------\.------------------------------.
> >     |             |   |SL for GPA-HPA, default domain|
> >     |             |   '------------------------------'
> >     '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 174
> > ++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h |  10 ++- include/linux/intel-svm.h
> > |   7 ++ 4 files changed, 193 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index 77bbe1b..89989b5 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5768,6 +5768,10 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.dev_enable_feat	= intel_iommu_dev_enable_feat,
> >  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > +#endif
> >  };
> >  
> >  static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index 8fff212..0a973c2 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -227,6 +227,180 @@ static const struct mmu_notifier_ops
> > intel_mmuops = { 
> >  static DEFINE_MUTEX(pasid_mutex);
> >  static LIST_HEAD(global_svm_list);
> > +#define for_each_svm_dev() \
> > +	list_for_each_entry(sdev, &svm->devs, list)	\
> > +	if (dev == sdev->dev)				\
> > +
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +			struct device *dev,
> > +			struct gpasid_bind_data *data)
> > +{
> > +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_svm *svm = NULL;
> > +	struct dmar_domain *ddomain;
> > +	int pasid_max;
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(!iommu) || !data)
> > +		return -EINVAL;
> > +
> > +	if (dev_is_pci(dev)) {
> > +		pasid_max = pci_max_pasids(to_pci_dev(dev));
> > +		if (pasid_max < 0)
> > +			return -EINVAL;
> > +	} else
> > +		pasid_max = 1 << 20;
> > +
> > +	if (data->pasid <= 0 || data->pasid >= pasid_max)
> > +		return -EINVAL;
> > +
> > +	ddomain = to_dmar_domain(domain);
> > +	/* REVISIT:
> > +	 * Sanity check adddress width and paging mode support
> > +	 * width matching in two dimensions:
> > +	 * 1. paging mode CPU <= IOMMU
> > +	 * 2. address width Guest <= Host.
> > +	 */
> > +	mutex_lock(&pasid_mutex);
> > +	svm = ioasid_find(NULL, data->pasid, NULL);
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +	if (svm) {
> > +		if (list_empty(&svm->devs)) {
> > +			dev_err(dev, "GPASID %d has no devices
> > bond but SVA is allocated\n",
> > +				data->pasid);
> > +			ret = -ENODEV; /*
> > +					* If we found svm for the
> > PASID, there must be at
> > +					* least one device bond,
> > otherwise svm should be freed.
> > +					*/  
> comment should be put after list_empty I think. In which circumstances
> can it happen, I mean, isn't it a BUG_ON case?
Well, I think failing to bind guest PASID is not severe enough to the
host to use BUG_ON. It has to be something more catastrophic to use
BUG_ON right? I will relocate the comments.
> > +			goto out;
> > +		}
> > +		for_each_svm_dev() {
> > +			/* In case of multiple sub-devices of the
> > same pdev assigned, we should
> > +			 * allow multiple bind calls with the same
> > PASID and pdev.
> > +			 */
> > +			sdev->users++;
> > +			goto out;
> > +		}
> > +	} else {
> > +		/* We come here when PASID has never been bond to
> > a device. */
> > +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > +		if (!svm) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +		/* REVISIT: upper layer/VFIO can track host
> > process that bind the PASID.
> > +		 * ioasid_set = mm might be sufficient for vfio to
> > check pasid VMM
> > +		 * ownership.
> > +		 */
> > +		svm->mm = get_task_mm(current);
> > +		svm->pasid = data->pasid;
> > +		refcount_set(&svm->refs, 0);
> > +		ioasid_set_data(data->pasid, svm);
> > +		INIT_LIST_HEAD_RCU(&svm->devs);
> > +		INIT_LIST_HEAD(&svm->list);
> > +
> > +		mmput(svm->mm);
> > +	}
> > +	svm->flags |= SVM_FLAG_GUEST_MODE;
> > +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > +	if (!sdev) {
> > +		ret = -ENOMEM;  
> in case of failure what is the state of svm (you added the
> SVM_FLAG_GUEST_MODE bit typically, is it safe to leave it?)
The SVM_FLAG_GUEST_MODE flag is used for fault reporting where faults
such as PRQ need to be injected into the guest. If this kzalloc()
fails, the nested translation would not be setup for this PASID. So
there shouldn't be any user of the flag. But I think it is better to
move svm->flags |= SVM_FLAG_GUEST_MODE; to the end when everything is
setup for nesting.

> > +		goto out;
> > +	}
> > +	sdev->dev = dev;
> > +	sdev->users = 1;
> > +
> > +	/* Set up device context entry for PASID if not enabled
> > already */
> > +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > +	if (ret) {
> > +		dev_err(dev, "Failed to enable PASID
> > capability\n");
> > +		kfree(sdev);  
> same here
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * For guest bind, we need to set up PASID table entry as
> > follows:
> > +	 * - FLPM matches guest paging mode
> > +	 * - turn on nested mode
> > +	 * - SL guest address width matching
> > +	 */
> > +	ret = intel_pasid_setup_nested(iommu,
> > +				dev,
> > +				(pgd_t *)data->gcr3,
> > +				data->pasid,
> > +				data->flags,
> > +				ddomain,
> > +				data->addr_width);
> > +	if (ret) {
> > +		dev_err(dev, "Failed to set up PASID %d in nested
> > mode, Err %d\n",
> > +			data->pasid, ret);
> > +		kfree(sdev);
> > +		goto out;
> > +	}
> > +
> > +	init_rcu_head(&sdev->rcu);
> > +	refcount_inc(&svm->refs);
> > +	list_add_rcu(&sdev->list, &svm->devs);
> > + out:
> > +	mutex_unlock(&pasid_mutex);
> > +	return ret;
> > +}
> > +
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +{
> > +	struct intel_svm_dev *sdev;
> > +	struct intel_iommu *iommu;
> > +	struct intel_svm *svm;
> > +	int ret = -EINVAL;
> > +
> > +	mutex_lock(&pasid_mutex);
> > +	iommu = intel_svm_device_to_iommu(dev);
> > +	if (!iommu)
> > +		goto out;
> > +
> > +	svm = ioasid_find(NULL, pasid, NULL);
> > +	if (IS_ERR(svm)) {
> > +		ret = PTR_ERR(svm);
> > +		goto out;
> > +	}
> > +
> > +	if (!svm)
> > +		goto out;
> > +
> > +	for_each_svm_dev() {
> > +		ret = 0;
> > +		sdev->users--;
> > +		if (!sdev->users) {
> > +			list_del_rcu(&sdev->list);
> > +			intel_pasid_tear_down_entry(iommu, dev,
> > svm->pasid);
> > +			/* TODO: Drain in flight PRQ for the PASID
> > since it
> > +			 * may get reused soon, we don't want to
> > +			 * confuse with its previous live.
> > +			 * intel_svm_drain_prq(dev, pasid);
> > +			 */
> > +			kfree_rcu(sdev, rcu);
> > +
> > +			if (list_empty(&svm->devs)) {
> > +				list_del(&svm->list);
> > +				kfree(svm);
> > +				/*
> > +				 * We do not free PASID here until
> > explicit call
> > +				 * from the guest to free.  
> can you be confident in the guest?
No. But I have confident in the kernel VFIO code to manage guest life
cycle :)
I assume when a guest doesn't do unbind when it dies or unload a
assigned device, I expect VFIO to free all the PASIDs. VFIO needs to
police the PASID ownership anyway in order to make sure a PASID
assigned to guest A cannot be used to bind from guest B.
This is the flow I worked out with Yi, who is doing the VFIO part. Any
particular concerns?

> > +				 */
> > +				ioasid_set_data(pasid, NULL);
> > +			}
> > +		}
> > +		break;
> > +	}
> > + out:
> > +	mutex_unlock(&pasid_mutex);
> > +
> > +	return ret;
> > +}
> >  
> >  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > struct svm_dev_ops *ops) {
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 48fa164..5d67d0d4 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -677,7 +677,9 @@ int intel_iommu_enable_pasid(struct intel_iommu
> > *iommu, struct device *dev); int intel_svm_init(struct intel_iommu
> > *iommu); extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> >  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > -
> > +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +		struct device *dev, struct gpasid_bind_data *data);
> > +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> >  struct svm_dev_ops;
> >  
> >  struct intel_svm_dev {
> > @@ -693,12 +695,16 @@ struct intel_svm_dev {
> >  
> >  struct intel_svm {
> >  	struct mmu_notifier notifier;
> > -	struct mm_struct *mm;
> > +	union {
> > +		struct mm_struct *mm;
> > +		u64 gcr3;
> > +	};
> >  	struct intel_iommu *iommu;
> >  	int flags;
> >  	int pasid;
> >  	struct list_head devs;
> >  	struct list_head list;
> > +	refcount_t refs; /* # of devs bond to the PASID */  
> number of devices sharing the same PASID?
more clear wording, thanks.
> >  };
> >  
> >  extern struct intel_iommu *intel_svm_device_to_iommu(struct device
> > *dev); diff --git a/include/linux/intel-svm.h
> > b/include/linux/intel-svm.h index e3f7631..34b0a3b 100644
> > --- a/include/linux/intel-svm.h
> > +++ b/include/linux/intel-svm.h
> > @@ -52,6 +52,13 @@ struct svm_dev_ops {
> >   * do such IOTLB flushes automatically.
> >   */
> >  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> > +/*
> > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
> > to a device.  
> binds
will fix

> > + * In this case the mm_struct is in the guest kernel or userspace,
> > its life
> > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > API provides
> > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > device.
> > + */
> > +#define SVM_FLAG_GUEST_MODE	(1<<2)
> >  
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  
> >   
> 
> Thanks
> 
> Eric

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 17/19] iommu: Add max num of cache and granu types
  2019-04-26 16:22   ` Auger Eric
@ 2019-04-29 16:17     ` Jacob Pan
  2019-04-30  5:15       ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-29 16:17 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 18:22:46 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > To convert to/from cache types and granularities between generic and
> > VT-d specific counterparts, a 2D arrary is used. Introduce the
> > limits  
> array
> > to help define the converstion array size.  
> conversion
> > 
will fix, thanks
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  include/uapi/linux/iommu.h | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 5c95905..2d8fac8 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -197,6 +197,7 @@ struct iommu_inv_addr_info {
> >  	__u64	granule_size;
> >  	__u64	nb_granules;
> >  };
> > +#define NR_IOMMU_CACHE_INVAL_GRANU	(3)
> >  
> >  /**
> >   * First level/stage invalidation information
> > @@ -235,6 +236,7 @@ struct iommu_cache_invalidate_info {
> >  		struct iommu_inv_addr_info addr_info;
> >  	};
> >  };
> > +#define NR_IOMMU_CACHE_TYPE		(3)
> >  /**
> >   * struct gpasid_bind_data - Information about device and guest
> > PASID binding
> >   * @gcr3:	Guest CR3 value from guest mm
> >   
> Is it really something that needs to be exposed in the uapi?
> 
I put it in uapi since the related definitions for granularity and
cache type are in the same file.
Maybe putting them close together like this? I was thinking you can just
fold it into your next series as one patch for introducing cache
invalidation.
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 2d8fac8..4ff6929 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -164,6 +164,7 @@ enum iommu_inv_granularity {
        IOMMU_INV_GRANU_DOMAIN, /* domain-selective invalidation */
        IOMMU_INV_GRANU_PASID,  /* pasid-selective invalidation */
        IOMMU_INV_GRANU_ADDR,   /* page-selective invalidation */
+       NR_IOMMU_INVAL_GRANU,   /* number of invalidation granularities
*/ };
 
 /**
@@ -228,6 +229,7 @@ struct iommu_cache_invalidate_info {
 #define IOMMU_CACHE_INV_TYPE_IOTLB     (1 << 0) /* IOMMU IOTLB */
 #define IOMMU_CACHE_INV_TYPE_DEV_IOTLB (1 << 1) /* Device IOTLB */
 #define IOMMU_CACHE_INV_TYPE_PASID     (1 << 2) /* PASID cache */
+#define NR_IOMMU_CACHE_TYPE            (3)
        __u8    cache;
        __u8    granularity;

> Thanks
> 
> Eric

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types
  2019-04-27  9:04   ` Auger Eric
@ 2019-04-29 21:29     ` Jacob Pan
  2019-04-30  4:41       ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-29 21:29 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Sat, 27 Apr 2019 11:04:04 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > When Shared Virtual Memory is exposed to a guest via vIOMMU,
> > extended IOTLB invalidation may be passed down from outside IOMMU
> > subsystems. This patch adds invalidation functions that can be used
> > for additional translation cache types.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/iommu/dmar.c        | 48
> > +++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h | 21 ++++++++++++++++---- 2 files
> > changed, 65 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index 9c49300..680894e 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1357,6 +1357,20 @@ void qi_flush_iotlb(struct intel_iommu
> > *iommu, u16 did, u64 addr, qi_submit_sync(&desc, iommu);
> >  }
> >    
> /* PASID-based IOTLB Invalidate */
> > +void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64 addr,
> > u32 pasid,
> > +		unsigned int size_order, u64 granu)
> > +{
> > +	struct qi_desc desc;
> > +
> > +	desc.qw0 = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
> > +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
> > +	desc.qw1 = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_IH(0) |
> > +		QI_EIOTLB_AM(size_order);  
> I see IH it hardcoded to 0. Don't you envision to cascade the IH. On
> ARM this was needed for perf sake.
Right, we should cascade IH based on IOMMU_INV_ADDR_FLAGS_LEAF. Just
curious how do you deduce the IH information on ARM? I guess you need
to get the non-leaf page directory info?
I will add an argument for IH.
> > +	desc.qw2 = 0;
> > +	desc.qw3 = 0;
> > +	qi_submit_sync(&desc, iommu);
> > +}
> > +
> >  void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16
> > pfsid, u16 qdep, u64 addr, unsigned mask)
> >  {
> > @@ -1380,6 +1394,40 @@ void qi_flush_dev_iotlb(struct intel_iommu
> > *iommu, u16 sid, u16 pfsid, qi_submit_sync(&desc, iommu);
> >  }
> >    
> /* Pasid-based Device-TLB Invalidation */
yes, better to explain piotlb :), same for iotlb.
> > +void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16
> > pfsid,
> > +		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64
> > granu) +{
> > +	struct qi_desc desc;
> > +
> > +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) |
> > QI_DEV_EIOTLB_SID(sid) |
> > +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> > +		QI_DEV_IOTLB_PFSID(pfsid);
> > +	desc.qw1 |= QI_DEV_EIOTLB_GLOB(granu);
should be desc.qw1 =
> > +
> > +	/* If S bit is 0, we only flush a single page. If S bit is
> > set,
> > +	 * The least significant zero bit indicates the size. VT-d
> > spec
> > +	 * 6.5.2.6
> > +	 */
> > +	if (!size)
> > +		desc.qw0 = QI_DEV_EIOTLB_ADDR(addr) &
> > ~QI_DEV_EIOTLB_SIZE;  
> desc.q1 |= ?
Right, I also missed previous qw1 assignment.
> > +	else {
> > +		unsigned long mask = 1UL << (VTD_PAGE_SHIFT +
> > size); +
> > +		desc.qw1 = QI_DEV_EIOTLB_ADDR(addr & ~mask) |
> > QI_DEV_EIOTLB_SIZE;  
> desc.q1 |=
right, thanks
> > +	}
> > +	qi_submit_sync(&desc, iommu);
> > +}
> > +  
> /* PASID-cache invalidation */
> > +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64
> > granu, int pasid) +{
> > +	struct qi_desc desc;
> > +
> > +	desc.qw0 = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu)
> > | QI_PC_PASID(pasid);
> > +	desc.qw1 = 0;
> > +	desc.qw2 = 0;
> > +	desc.qw3 = 0;
> > +	qi_submit_sync(&desc, iommu);
> > +}
> >  /*
> >   * Disable Queued Invalidation interface.
> >   */
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 5d67d0d4..38e5efb 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -339,7 +339,7 @@ enum {
> >  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >>
> > (DMA_TLB_FLUSH_GRANU_OFFSET-4)) #define QI_IOTLB_ADDR(addr)
> > (((u64)addr) & VTD_PAGE_MASK) #define
> > QI_IOTLB_IH(ih)		(((u64)ih) << 6) -#define
> > QI_IOTLB_AM(am)		(((u8)am)) +#define
> > QI_IOTLB_AM(am)		(((u8)am) & 0x3f) 
> >  #define QI_CC_FM(fm)		(((u64)fm) << 48)
> >  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> > @@ -357,17 +357,22 @@ enum {
> >  #define QI_PC_DID(did)		(((u64)did) << 16)
> >  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
> >  
> > -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> > -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> > +/* PASID cache invalidation granu */
> > +#define QI_PC_ALL_PASIDS	0
> > +#define QI_PC_PASID_SEL		1
> >  
> >  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> >  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
> >  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> > -#define QI_EIOTLB_AM(am)	(((u64)am))
> > +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
> >  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
> >  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
> >  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
> >  
> > +/* QI Dev-IOTLB inv granu */
> > +#define QI_DEV_IOTLB_GRAN_ALL		1
> > +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> > +
> >  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
> >  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
> >  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> > @@ -658,8 +663,16 @@ extern void qi_flush_context(struct
> > intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type);
> >  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64
> > addr, unsigned int size_order, u64 type);
> > +extern void qi_flush_piotlb(struct intel_iommu *iommu, u16 did,
> > u64 addr,
> > +			u32 pasid, unsigned int size_order, u64
> > type); extern void qi_flush_dev_iotlb(struct intel_iommu *iommu,
> > u16 sid, u16 pfsid, u16 qdep, u64 addr, unsigned mask);
> > +
> > +extern void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16
> > sid, u16 pfsid,
> > +			u32 pasid, u16 qdep, u64 addr, unsigned
> > size, u64 granu); +
> > +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16
> > did, u64 granu, int pasid); +
> >  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu
> > *iommu); 
> >  extern int dmar_ir_support(void);
> >   
> 
> Thanks
> 
> Eric

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function
  2019-04-26 17:23   ` Auger Eric
@ 2019-04-29 22:41     ` Jacob Pan
  2019-04-30  6:57       ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-29 22:41 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 19:23:03 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> On 4/24/19 1:31 AM, Jacob Pan wrote:
> > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > vIOMMU, we need to provide invalidation support at IOMMU API and
> > driver level. This patch adds Intel VT-d specific function to
> > implement iommu passdown invalidate API for shared virtual address.
> > 
> > The use case is for supporting caching structure invalidation
> > of assigned SVM capable devices. Emulated IOMMU exposes queue
> > invalidation capability and passes down all descriptors from the
> > guest to the physical IOMMU.
> > 
> > The assumption is that guest to host device ID mapping should be
> > resolved prior to calling IOMMU driver. Based on the device handle,
> > host IOMMU driver can replace certain fields before submit to the
> > invalidation queue.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c | 159
> > ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 159
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index 89989b5..54a3d22 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5338,6 +5338,164 @@ static void
> > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > aux_domain_remove_dev(to_dmar_domain(domain), dev); }
> >  
> > +/*
> > + * 2D array for converting and sanitizing IOMMU generic TLB
> > granularity to
> > + * VT-d granularity. Invalidation is typically included in the
> > unmap operation
> > + * as a result of DMA or VFIO unmap. However, for assigned device
> > where guest
> > + * could own the first level page tables without being shadowed by
> > QEMU. In
> > + * this case there is no pass down unmap to the host IOMMU as a
> > result of unmap
> > + * in the guest. Only invalidations are trapped and passed down.
> > + * In all cases, only first level TLB invalidation (request with
> > PASID) can be
> > + * passed down, therefore we do not include IOTLB granularity for
> > request
> > + * without PASID (second level).
> > + *
> > + * For an example, to find the VT-d granularity encoding for IOTLB
> > + * type and page selective granularity within PASID:
> > + * X: indexed by iommu cache type
> > + * Y: indexed by enum iommu_inv_granularity
> > + * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
> > + *
> > + * Granu_map array indicates validity of the table. 1: valid, 0:
> > invalid
> > + *
> > + */
> > +const static int
> > inv_type_granu_map[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
> > = {  
> The size is frozen for a given uapi version so I guess you can
> hardcode the limits for a given version.
I guess I could, I just felt more readable this way.
> > +	/* PASID based IOTLB, support PASID selective and page
> > selective */
> > +	{0, 1, 1},
> > +	/* PASID based dev TLBs, only support all PASIDs or single
> > PASID */
> > +	{1, 1, 0},
> > +	/* PASID cache */
> > +	{1, 1, 0}
> > +};
> > +
> > +const static u64
> > inv_type_granu_table[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
> > = {
> > +	/* PASID based IOTLB */
> > +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > +	/* PASID based dev TLBs */
> > +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> > +	/* PASID cache */
> > +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> > +};  
> Can't you use a single matrix instead, ie. inv_type_granu_table
> 
The reason i have an additional inv_type_granu_map[] matrix is that
some of fields can be 0 but still valid. A single matrix would not be
able to tell the difference between a valid 0 or invalid field.
> > +
> > +static inline int to_vtd_granularity(int type, int granu, u64
> > *vtd_granu) +{
> > +	if (type >= NR_IOMMU_CACHE_TYPE || granu >=
> > NR_IOMMU_CACHE_INVAL_GRANU ||
> > +		!inv_type_granu_map[type][granu])
> > +		return -EINVAL;
> > +
> > +	*vtd_granu = inv_type_granu_table[type][granu];
> > +
> > +	return 0;
> > +}
> > +
> > +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > +{
> > +	u64 nr_pages;  
> direct initialization?
will do, thanks
> > +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
> > for 2MB, etc.
> > +	 * IOMMU cache invalidate API passes granu_size in bytes,
> > and number of
> > +	 * granu size in contiguous memory.
> > +	 */
> > +
> > +	nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> > +	return order_base_2(nr_pages);
> > +}
> > +
> > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > +		struct device *dev, struct
> > iommu_cache_invalidate_info *inv_info) +{
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	struct device_domain_info *info;
> > +	struct intel_iommu *iommu;
> > +	unsigned long flags;
> > +	int cache_type;
> > +	u8 bus, devfn;
> > +	u16 did, sid;
> > +	int ret = 0;
> > +	u64 granu;
> > +	u64 size;
> > +
> > +	if (!inv_info || !dmar_domain ||
> > +		inv_info->version !=
> > IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > +		return -EINVAL;
> > +
> > +	if (!dev || !dev_is_pci(dev))
> > +		return -ENODEV;
> > +
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +
> > +	spin_lock(&iommu->lock);
> > +	spin_lock_irqsave(&device_domain_lock, flags);  
> mix of _irqsave and non _irqsave looks suspicious to me.
It should be in reverse order. Any other concerns?
> > +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> > devfn);
> > +	if (!info) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +	did = dmar_domain->iommu_did[iommu->seq_id];
> > +	sid = PCI_DEVID(bus, devfn);
> > +	size = to_vtd_size(inv_info->addr_info.granule_size,
> > inv_info->addr_info.nb_granules); +
> > +	for_each_set_bit(cache_type, (unsigned long
> > *)&inv_info->cache, NR_IOMMU_CACHE_TYPE) { +
> > +		ret = to_vtd_granularity(cache_type,
> > inv_info->granularity, &granu);
> > +		if (ret) {
> > +			pr_err("Invalid range type %d, granu
> > %d\n", cache_type,  
> s/Invalid range type %d, granu %d/Invalid cache type/granu combination
> (%d/%d)
sounds good, indeed it is the combination that is invalid.
> > +				inv_info->granularity);
> > +			break;
> > +		}
> > +
> > +		switch (BIT(cache_type)) {
> > +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> > +			if (size && (inv_info->addr_info.addr &
> > ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > +				pr_err("Address out of range,
> > 0x%llx, size order %llu\n",
> > +					inv_info->addr_info.addr,
> > size);
> > +				ret = -ERANGE;
> > +				goto out_unlock;
> > +			}
> > +
> > +			qi_flush_piotlb(iommu, did,
> > mm_to_dma_pfn(inv_info->addr_info.addr),
> > +					inv_info->addr_info.pasid,
> > +					size, granu);
> > +
> > +			/*
> > +			 * Always flush device IOTLB if ATS is
> > enabled since guest
> > +			 * vIOMMU exposes CM = 1, no device IOTLB
> > flush will be passed
> > +			 * down. REVISIT: cannot assume Linux guest
> > +			 */
> > +			if (info->ats_enabled) {
> > +				qi_flush_dev_piotlb(iommu, sid,
> > info->pfsid,
> > +
> > inv_info->addr_info.pasid, info->ats_qdep,
> > +
> > inv_info->addr_info.addr, size,
> > +						granu);
> > +			}
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > +			if (info->ats_enabled) {
> > +				qi_flush_dev_piotlb(iommu, sid,
> > info->pfsid,
> > +
> > inv_info->addr_info.pasid, info->ats_qdep,
> > +
> > inv_info->addr_info.addr, size,
> > +						granu);
> > +			} else
> > +				pr_warn("Passdown device IOTLB
> > flush w/o ATS!\n"); +
> > +			break;
> > +		case IOMMU_CACHE_INV_TYPE_PASID:
> > +			qi_flush_pasid_cache(iommu, did, granu,
> > inv_info->pasid); +
> > +			break;
> > +		default:
> > +			dev_err(dev, "Unsupported IOMMU
> > invalidation type %d\n",
> > +				cache_type);
> > +			ret = -EINVAL;
> > +		}
> > +	}
> > +out_unlock:
> > +	spin_unlock(&iommu->lock);
> > +	spin_unlock_irqrestore(&device_domain_lock, flags);  
> I would expect the opposite order
yes, i reversed in the lock order such that irq is disabled.
> > +
> > +	return ret;
> > +}
> > +
> >  static int intel_iommu_map(struct iommu_domain *domain,
> >  			   unsigned long iova, phys_addr_t hpa,
> >  			   size_t size, int iommu_prot)
> > @@ -5769,6 +5927,7 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >  #endif
> >   
> Thanks
> 
> Eric

Thank you so much for your review. I will roll up the next version
soon, hopefully this week.

Jacob

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types
  2019-04-29 21:29     ` Jacob Pan
@ 2019-04-30  4:41       ` Auger Eric
  2019-04-30 17:15         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-30  4:41 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Jacob,

On 4/29/19 11:29 PM, Jacob Pan wrote:
> On Sat, 27 Apr 2019 11:04:04 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> When Shared Virtual Memory is exposed to a guest via vIOMMU,
>>> extended IOTLB invalidation may be passed down from outside IOMMU
>>> subsystems. This patch adds invalidation functions that can be used
>>> for additional translation cache types.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
>>>  drivers/iommu/dmar.c        | 48
>>> +++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/intel-iommu.h | 21 ++++++++++++++++---- 2 files
>>> changed, 65 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
>>> index 9c49300..680894e 100644
>>> --- a/drivers/iommu/dmar.c
>>> +++ b/drivers/iommu/dmar.c
>>> @@ -1357,6 +1357,20 @@ void qi_flush_iotlb(struct intel_iommu
>>> *iommu, u16 did, u64 addr, qi_submit_sync(&desc, iommu);
>>>  }
>>>    
>> /* PASID-based IOTLB Invalidate */
>>> +void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>>> u32 pasid,
>>> +		unsigned int size_order, u64 granu)
>>> +{
>>> +	struct qi_desc desc;
>>> +
>>> +	desc.qw0 = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
>>> +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
>>> +	desc.qw1 = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_IH(0) |
>>> +		QI_EIOTLB_AM(size_order);  
>> I see IH it hardcoded to 0. Don't you envision to cascade the IH. On
>> ARM this was needed for perf sake.
> Right, we should cascade IH based on IOMMU_INV_ADDR_FLAGS_LEAF. Just
> curious how do you deduce the IH information on ARM? I guess you need
> to get the non-leaf page directory info?
> I will add an argument for IH.
On ARM we have the "Leaf" field in the stage1 TLB invalidation command.
"When Leaf==1, only cached entries for the last level of translation
table walk are required to be invalidated".

Thanks

Eric
>>> +	desc.qw2 = 0;
>>> +	desc.qw3 = 0;
>>> +	qi_submit_sync(&desc, iommu);
>>> +}
>>> +
>>>  void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16
>>> pfsid, u16 qdep, u64 addr, unsigned mask)
>>>  {
>>> @@ -1380,6 +1394,40 @@ void qi_flush_dev_iotlb(struct intel_iommu
>>> *iommu, u16 sid, u16 pfsid, qi_submit_sync(&desc, iommu);
>>>  }
>>>    
>> /* Pasid-based Device-TLB Invalidation */
> yes, better to explain piotlb :), same for iotlb.
>>> +void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16
>>> pfsid,
>>> +		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64
>>> granu) +{
>>> +	struct qi_desc desc;
>>> +
>>> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) |
>>> QI_DEV_EIOTLB_SID(sid) |
>>> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
>>> +		QI_DEV_IOTLB_PFSID(pfsid);
>>> +	desc.qw1 |= QI_DEV_EIOTLB_GLOB(granu);
> should be desc.qw1 =
>>> +
>>> +	/* If S bit is 0, we only flush a single page. If S bit is
>>> set,
>>> +	 * The least significant zero bit indicates the size. VT-d
>>> spec
>>> +	 * 6.5.2.6
>>> +	 */
>>> +	if (!size)
>>> +		desc.qw0 = QI_DEV_EIOTLB_ADDR(addr) &
>>> ~QI_DEV_EIOTLB_SIZE;  
>> desc.q1 |= ?
> Right, I also missed previous qw1 assignment.
>>> +	else {
>>> +		unsigned long mask = 1UL << (VTD_PAGE_SHIFT +
>>> size); +
>>> +		desc.qw1 = QI_DEV_EIOTLB_ADDR(addr & ~mask) |
>>> QI_DEV_EIOTLB_SIZE;  
>> desc.q1 |=
> right, thanks
>>> +	}
>>> +	qi_submit_sync(&desc, iommu);
>>> +}
>>> +  
>> /* PASID-cache invalidation */
>>> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64
>>> granu, int pasid) +{
>>> +	struct qi_desc desc;
>>> +
>>> +	desc.qw0 = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu)
>>> | QI_PC_PASID(pasid);
>>> +	desc.qw1 = 0;
>>> +	desc.qw2 = 0;
>>> +	desc.qw3 = 0;
>>> +	qi_submit_sync(&desc, iommu);
>>> +}
>>>  /*
>>>   * Disable Queued Invalidation interface.
>>>   */
>>> diff --git a/include/linux/intel-iommu.h
>>> b/include/linux/intel-iommu.h index 5d67d0d4..38e5efb 100644
>>> --- a/include/linux/intel-iommu.h
>>> +++ b/include/linux/intel-iommu.h
>>> @@ -339,7 +339,7 @@ enum {
>>>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >>
>>> (DMA_TLB_FLUSH_GRANU_OFFSET-4)) #define QI_IOTLB_ADDR(addr)
>>> (((u64)addr) & VTD_PAGE_MASK) #define
>>> QI_IOTLB_IH(ih)		(((u64)ih) << 6) -#define
>>> QI_IOTLB_AM(am)		(((u8)am)) +#define
>>> QI_IOTLB_AM(am)		(((u8)am) & 0x3f) 
>>>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
>>>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
>>> @@ -357,17 +357,22 @@ enum {
>>>  #define QI_PC_DID(did)		(((u64)did) << 16)
>>>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>>>  
>>> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
>>> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
>>> +/* PASID cache invalidation granu */
>>> +#define QI_PC_ALL_PASIDS	0
>>> +#define QI_PC_PASID_SEL		1
>>>  
>>>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>>>  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
>>>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
>>> -#define QI_EIOTLB_AM(am)	(((u64)am))
>>> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
>>>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
>>>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>>>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>>>  
>>> +/* QI Dev-IOTLB inv granu */
>>> +#define QI_DEV_IOTLB_GRAN_ALL		1
>>> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
>>> +
>>>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>>>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>>>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
>>> @@ -658,8 +663,16 @@ extern void qi_flush_context(struct
>>> intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type);
>>>  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64
>>> addr, unsigned int size_order, u64 type);
>>> +extern void qi_flush_piotlb(struct intel_iommu *iommu, u16 did,
>>> u64 addr,
>>> +			u32 pasid, unsigned int size_order, u64
>>> type); extern void qi_flush_dev_iotlb(struct intel_iommu *iommu,
>>> u16 sid, u16 pfsid, u16 qdep, u64 addr, unsigned mask);
>>> +
>>> +extern void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16
>>> sid, u16 pfsid,
>>> +			u32 pasid, u16 qdep, u64 addr, unsigned
>>> size, u64 granu); +
>>> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16
>>> did, u64 granu, int pasid); +
>>>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu
>>> *iommu); 
>>>  extern int dmar_ir_support(void);
>>>   
>>
>> Thanks
>>
>> Eric
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 17/19] iommu: Add max num of cache and granu types
  2019-04-29 16:17     ` Jacob Pan
@ 2019-04-30  5:15       ` Auger Eric
  0 siblings, 0 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-30  5:15 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Jacob,

On 4/29/19 6:17 PM, Jacob Pan wrote:
> On Fri, 26 Apr 2019 18:22:46 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> To convert to/from cache types and granularities between generic and
>>> VT-d specific counterparts, a 2D arrary is used. Introduce the
>>> limits  
>> array
>>> to help define the converstion array size.  
>> conversion
>>>
> will fix, thanks
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
>>>  include/uapi/linux/iommu.h | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 5c95905..2d8fac8 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -197,6 +197,7 @@ struct iommu_inv_addr_info {
>>>  	__u64	granule_size;
>>>  	__u64	nb_granules;
>>>  };
>>> +#define NR_IOMMU_CACHE_INVAL_GRANU	(3)
>>>  
>>>  /**
>>>   * First level/stage invalidation information
>>> @@ -235,6 +236,7 @@ struct iommu_cache_invalidate_info {
>>>  		struct iommu_inv_addr_info addr_info;
>>>  	};
>>>  };
>>> +#define NR_IOMMU_CACHE_TYPE		(3)
>>>  /**
>>>   * struct gpasid_bind_data - Information about device and guest
>>> PASID binding
>>>   * @gcr3:	Guest CR3 value from guest mm
>>>   
>> Is it really something that needs to be exposed in the uapi?
>>
> I put it in uapi since the related definitions for granularity and
> cache type are in the same file.
> Maybe putting them close together like this? I was thinking you can just
> fold it into your next series as one patch for introducing cache
> invalidation.
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 2d8fac8..4ff6929 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -164,6 +164,7 @@ enum iommu_inv_granularity {
>         IOMMU_INV_GRANU_DOMAIN, /* domain-selective invalidation */
>         IOMMU_INV_GRANU_PASID,  /* pasid-selective invalidation */
>         IOMMU_INV_GRANU_ADDR,   /* page-selective invalidation */
> +       NR_IOMMU_INVAL_GRANU,   /* number of invalidation granularities
> */ };
>  
>  /**
> @@ -228,6 +229,7 @@ struct iommu_cache_invalidate_info {
>  #define IOMMU_CACHE_INV_TYPE_IOTLB     (1 << 0) /* IOMMU IOTLB */
>  #define IOMMU_CACHE_INV_TYPE_DEV_IOTLB (1 << 1) /* Device IOTLB */
>  #define IOMMU_CACHE_INV_TYPE_PASID     (1 << 2) /* PASID cache */
> +#define NR_IOMMU_CACHE_TYPE            (3)

OK I will add this.

Thanks

Eric
>         __u8    cache;
>         __u8    granularity;
> 
>> Thanks
>>
>> Eric
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function
  2019-04-29 22:41     ` Jacob Pan
@ 2019-04-30  6:57       ` Auger Eric
  2019-04-30 17:22         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-30  6:57 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko



On 4/30/19 12:41 AM, Jacob Pan wrote:
> On Fri, 26 Apr 2019 19:23:03 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>>> vIOMMU, we need to provide invalidation support at IOMMU API and
>>> driver level. This patch adds Intel VT-d specific function to
>>> implement iommu passdown invalidate API for shared virtual address.
>>>
>>> The use case is for supporting caching structure invalidation
>>> of assigned SVM capable devices. Emulated IOMMU exposes queue
>>> invalidation capability and passes down all descriptors from the
>>> guest to the physical IOMMU.
>>>
>>> The assumption is that guest to host device ID mapping should be
>>> resolved prior to calling IOMMU driver. Based on the device handle,
>>> host IOMMU driver can replace certain fields before submit to the
>>> invalidation queue.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> ---
>>>  drivers/iommu/intel-iommu.c | 159
>>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 159
>>> insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index 89989b5..54a3d22 100644
>>> --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -5338,6 +5338,164 @@ static void
>>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
>>>  
>>> +/*
>>> + * 2D array for converting and sanitizing IOMMU generic TLB
>>> granularity to
>>> + * VT-d granularity. Invalidation is typically included in the
>>> unmap operation
>>> + * as a result of DMA or VFIO unmap. However, for assigned device
>>> where guest
>>> + * could own the first level page tables without being shadowed by
>>> QEMU. In
>>> + * this case there is no pass down unmap to the host IOMMU as a
>>> result of unmap
>>> + * in the guest. Only invalidations are trapped and passed down.
>>> + * In all cases, only first level TLB invalidation (request with
>>> PASID) can be
>>> + * passed down, therefore we do not include IOTLB granularity for
>>> request
>>> + * without PASID (second level).
>>> + *
>>> + * For an example, to find the VT-d granularity encoding for IOTLB
>>> + * type and page selective granularity within PASID:
>>> + * X: indexed by iommu cache type
>>> + * Y: indexed by enum iommu_inv_granularity
>>> + * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
>>> + *
>>> + * Granu_map array indicates validity of the table. 1: valid, 0:
>>> invalid
>>> + *
>>> + */
>>> +const static int
>>> inv_type_granu_map[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
>>> = {  
>> The size is frozen for a given uapi version so I guess you can
>> hardcode the limits for a given version.
> I guess I could, I just felt more readable this way.
>>> +	/* PASID based IOTLB, support PASID selective and page
>>> selective */
>>> +	{0, 1, 1},
>>> +	/* PASID based dev TLBs, only support all PASIDs or single
>>> PASID */
>>> +	{1, 1, 0},
>>> +	/* PASID cache */
>>> +	{1, 1, 0}
>>> +};
>>> +
>>> +const static u64
>>> inv_type_granu_table[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
>>> = {
>>> +	/* PASID based IOTLB */
>>> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
>>> +	/* PASID based dev TLBs */
>>> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
>>> +	/* PASID cache */
>>> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
>>> +};  
>> Can't you use a single matrix instead, ie. inv_type_granu_table
>>
> The reason i have an additional inv_type_granu_map[] matrix is that
> some of fields can be 0 but still valid. A single matrix would not be
> able to tell the difference between a valid 0 or invalid field.
Ah OK sorry I missed that.
>>> +
>>> +static inline int to_vtd_granularity(int type, int granu, u64
>>> *vtd_granu) +{
>>> +	if (type >= NR_IOMMU_CACHE_TYPE || granu >=
>>> NR_IOMMU_CACHE_INVAL_GRANU ||
>>> +		!inv_type_granu_map[type][granu])
>>> +		return -EINVAL;
>>> +
>>> +	*vtd_granu = inv_type_granu_table[type][granu];
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
>>> +{
>>> +	u64 nr_pages;  
>> direct initialization?
> will do, thanks
>>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
>>> for 2MB, etc.
>>> +	 * IOMMU cache invalidate API passes granu_size in bytes,
>>> and number of
>>> +	 * granu size in contiguous memory.
>>> +	 */
>>> +
>>> +	nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
>>> +	return order_base_2(nr_pages);
>>> +}
>>> +
>>> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
>>> +		struct device *dev, struct
>>> iommu_cache_invalidate_info *inv_info) +{
>>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>>> +	struct device_domain_info *info;
>>> +	struct intel_iommu *iommu;
>>> +	unsigned long flags;
>>> +	int cache_type;
>>> +	u8 bus, devfn;
>>> +	u16 did, sid;
>>> +	int ret = 0;
>>> +	u64 granu;
>>> +	u64 size;
>>> +
>>> +	if (!inv_info || !dmar_domain ||
>>> +		inv_info->version !=
>>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
>>> +		return -EINVAL;
>>> +
>>> +	if (!dev || !dev_is_pci(dev))
>>> +		return -ENODEV;
>>> +
>>> +	iommu = device_to_iommu(dev, &bus, &devfn);
>>> +	if (!iommu)
>>> +		return -ENODEV;
>>> +
>>> +	spin_lock(&iommu->lock);
>>> +	spin_lock_irqsave(&device_domain_lock, flags);  
>> mix of _irqsave and non _irqsave looks suspicious to me.
> It should be in reverse order. Any other concerns?
I understand both locks are likely to be taken in ISR context so
_irqsave should be called on the first call.
>>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
>>> devfn);
>>> +	if (!info) {
>>> +		ret = -EINVAL;
>>> +		goto out_unlock;
>>> +	}
>>> +	did = dmar_domain->iommu_did[iommu->seq_id];
>>> +	sid = PCI_DEVID(bus, devfn);
>>> +	size = to_vtd_size(inv_info->addr_info.granule_size,
>>> inv_info->addr_info.nb_granules); +
>>> +	for_each_set_bit(cache_type, (unsigned long
>>> *)&inv_info->cache, NR_IOMMU_CACHE_TYPE) { +
>>> +		ret = to_vtd_granularity(cache_type,
>>> inv_info->granularity, &granu);
>>> +		if (ret) {
>>> +			pr_err("Invalid range type %d, granu
>>> %d\n", cache_type,  
>> s/Invalid range type %d, granu %d/Invalid cache type/granu combination
>> (%d/%d)
> sounds good, indeed it is the combination that is invalid.
>>> +				inv_info->granularity);
>>> +			break;
>>> +		}
>>> +
>>> +		switch (BIT(cache_type)) {
>>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
>>> +			if (size && (inv_info->addr_info.addr &
>>> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
>>> +				pr_err("Address out of range,
>>> 0x%llx, size order %llu\n",
>>> +					inv_info->addr_info.addr,
>>> size);
>>> +				ret = -ERANGE;
>>> +				goto out_unlock;
>>> +			}
>>> +
>>> +			qi_flush_piotlb(iommu, did,
>>> mm_to_dma_pfn(inv_info->addr_info.addr),
>>> +					inv_info->addr_info.pasid,
>>> +					size, granu);
>>> +
>>> +			/*
>>> +			 * Always flush device IOTLB if ATS is
>>> enabled since guest
>>> +			 * vIOMMU exposes CM = 1, no device IOTLB
>>> flush will be passed
>>> +			 * down. REVISIT: cannot assume Linux guest
>>> +			 */
>>> +			if (info->ats_enabled) {
>>> +				qi_flush_dev_piotlb(iommu, sid,
>>> info->pfsid,
>>> +
>>> inv_info->addr_info.pasid, info->ats_qdep,
>>> +
>>> inv_info->addr_info.addr, size,
>>> +						granu);
>>> +			}
>>> +			break;
>>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
>>> +			if (info->ats_enabled) {
>>> +				qi_flush_dev_piotlb(iommu, sid,
>>> info->pfsid,
>>> +
>>> inv_info->addr_info.pasid, info->ats_qdep,
>>> +
>>> inv_info->addr_info.addr, size,
>>> +						granu);
>>> +			} else
>>> +				pr_warn("Passdown device IOTLB
>>> flush w/o ATS!\n"); +
>>> +			break;
>>> +		case IOMMU_CACHE_INV_TYPE_PASID:
>>> +			qi_flush_pasid_cache(iommu, did, granu,
>>> inv_info->pasid); +
>>> +			break;
>>> +		default:
>>> +			dev_err(dev, "Unsupported IOMMU
>>> invalidation type %d\n",
>>> +				cache_type);
>>> +			ret = -EINVAL;
>>> +		}
>>> +	}
>>> +out_unlock:
>>> +	spin_unlock(&iommu->lock);
>>> +	spin_unlock_irqrestore(&device_domain_lock, flags);  
>> I would expect the opposite order
> yes, i reversed in the lock order such that irq is disabled.
spin_unlock_irqsave(&iommu->lock, flags);
spin_lock(&device_domain_lock);
../..
spin_unlock_irqrestore(&device_domain_lock);
spin_unlock_irqrestore(&iommu->lock);
?

Thanks

Eric
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  static int intel_iommu_map(struct iommu_domain *domain,
>>>  			   unsigned long iova, phys_addr_t hpa,
>>>  			   size_t size, int iommu_prot)
>>> @@ -5769,6 +5927,7 @@ const struct iommu_ops intel_iommu_ops = {
>>>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>>  #endif
>>>   
>> Thanks
>>
>> Eric
> 
> Thank you so much for your review. I will roll up the next version
> soon, hopefully this week.
> 
> Jacob
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support
  2019-04-29 15:25     ` Jacob Pan
@ 2019-04-30  7:05       ` Auger Eric
  2019-04-30 17:49         ` Jacob Pan
  0 siblings, 1 reply; 74+ messages in thread
From: Auger Eric @ 2019-04-30  7:05 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko



On 4/29/19 5:25 PM, Jacob Pan wrote:
> On Fri, 26 Apr 2019 18:15:27 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/24/19 1:31 AM, Jacob Pan wrote:
>>> When supporting guest SVA with emulated IOMMU, the guest PASID
>>> table is shadowed in VMM. Updates to guest vIOMMU PASID table
>>> will result in PASID cache flush which will be passed down to
>>> the host as bind guest PASID calls.
>>>
>>> For the SL page tables, it will be harvested from device's
>>> default domain (request w/o PASID), or aux domain in case of
>>> mediated device.
>>>
>>>     .-------------.  .---------------------------.
>>>     |   vIOMMU    |  | Guest process CR3, FL only|
>>>     |             |  '---------------------------'
>>>     .----------------/
>>>     | PASID Entry |--- PASID cache flush -
>>>     '-------------'                       |
>>>     |             |                       V
>>>     |             |                CR3 in GPA
>>>     '-------------'
>>> Guest
>>> ------| Shadow |--------------------------|--------
>>>       v        v                          v
>>> Host
>>>     .-------------.  .----------------------.
>>>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>>>     |             |  '----------------------'
>>>     .----------------/  |
>>>     | PASID Entry |     V (Nested xlate)
>>>     '----------------\.------------------------------.
>>>     |             |   |SL for GPA-HPA, default domain|
>>>     |             |   '------------------------------'
>>>     '-------------'
>>> Where:
>>>  - FL = First level/stage one page tables
>>>  - SL = Second level/stage two page tables
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> ---
>>>  drivers/iommu/intel-iommu.c |   4 +
>>>  drivers/iommu/intel-svm.c   | 174
>>> ++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/intel-iommu.h |  10 ++- include/linux/intel-svm.h
>>> |   7 ++ 4 files changed, 193 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index 77bbe1b..89989b5 100644
>>> --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -5768,6 +5768,10 @@ const struct iommu_ops intel_iommu_ops = {
>>>  	.dev_enable_feat	= intel_iommu_dev_enable_feat,
>>>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>>> +#ifdef CONFIG_INTEL_IOMMU_SVM
>>> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>> +#endif
>>>  };
>>>  
>>>  static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
>>> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
>>> index 8fff212..0a973c2 100644
>>> --- a/drivers/iommu/intel-svm.c
>>> +++ b/drivers/iommu/intel-svm.c
>>> @@ -227,6 +227,180 @@ static const struct mmu_notifier_ops
>>> intel_mmuops = { 
>>>  static DEFINE_MUTEX(pasid_mutex);
>>>  static LIST_HEAD(global_svm_list);
>>> +#define for_each_svm_dev() \
>>> +	list_for_each_entry(sdev, &svm->devs, list)	\
>>> +	if (dev == sdev->dev)				\
>>> +
>>> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
>>> +			struct device *dev,
>>> +			struct gpasid_bind_data *data)
>>> +{
>>> +	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> +	struct intel_svm_dev *sdev;
>>> +	struct intel_svm *svm = NULL;
>>> +	struct dmar_domain *ddomain;
>>> +	int pasid_max;
>>> +	int ret = 0;
>>> +
>>> +	if (WARN_ON(!iommu) || !data)
>>> +		return -EINVAL;
>>> +
>>> +	if (dev_is_pci(dev)) {
>>> +		pasid_max = pci_max_pasids(to_pci_dev(dev));
>>> +		if (pasid_max < 0)
>>> +			return -EINVAL;
>>> +	} else
>>> +		pasid_max = 1 << 20;
>>> +
>>> +	if (data->pasid <= 0 || data->pasid >= pasid_max)
>>> +		return -EINVAL;
>>> +
>>> +	ddomain = to_dmar_domain(domain);
>>> +	/* REVISIT:
>>> +	 * Sanity check adddress width and paging mode support
>>> +	 * width matching in two dimensions:
>>> +	 * 1. paging mode CPU <= IOMMU
>>> +	 * 2. address width Guest <= Host.
>>> +	 */
>>> +	mutex_lock(&pasid_mutex);
>>> +	svm = ioasid_find(NULL, data->pasid, NULL);
>>> +	if (IS_ERR(svm)) {
>>> +		ret = PTR_ERR(svm);
>>> +		goto out;
>>> +	}
>>> +	if (svm) {
>>> +		if (list_empty(&svm->devs)) {
>>> +			dev_err(dev, "GPASID %d has no devices
>>> bond but SVA is allocated\n",
>>> +				data->pasid);
>>> +			ret = -ENODEV; /*
>>> +					* If we found svm for the
>>> PASID, there must be at
>>> +					* least one device bond,
>>> otherwise svm should be freed.
>>> +					*/  
>> comment should be put after list_empty I think. In which circumstances
>> can it happen, I mean, isn't it a BUG_ON case?
> Well, I think failing to bind guest PASID is not severe enough to the
> host to use BUG_ON. It has to be something more catastrophic to use
> BUG_ON right? I will relocate the comments.
When the error is due to a programming error at kernel error (not
induced by any userspace call) I guess it is acceptable to put a BUG_ON.
However the usage of BUG_ON() is generally frown upon so my question
rather was to understand if this can really happen and why?
>>> +			goto out;
>>> +		}
>>> +		for_each_svm_dev() {
>>> +			/* In case of multiple sub-devices of the
>>> same pdev assigned, we should
>>> +			 * allow multiple bind calls with the same
>>> PASID and pdev.
>>> +			 */
>>> +			sdev->users++;
>>> +			goto out;
>>> +		}
>>> +	} else {
>>> +		/* We come here when PASID has never been bond to
>>> a device. */
>>> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
>>> +		if (!svm) {
>>> +			ret = -ENOMEM;
>>> +			goto out;
>>> +		}
>>> +		/* REVISIT: upper layer/VFIO can track host
>>> process that bind the PASID.
>>> +		 * ioasid_set = mm might be sufficient for vfio to
>>> check pasid VMM
>>> +		 * ownership.
>>> +		 */
>>> +		svm->mm = get_task_mm(current);
>>> +		svm->pasid = data->pasid;
>>> +		refcount_set(&svm->refs, 0);
>>> +		ioasid_set_data(data->pasid, svm);
>>> +		INIT_LIST_HEAD_RCU(&svm->devs);
>>> +		INIT_LIST_HEAD(&svm->list);
>>> +
>>> +		mmput(svm->mm);
>>> +	}
>>> +	svm->flags |= SVM_FLAG_GUEST_MODE;
>>> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
>>> +	if (!sdev) {
>>> +		ret = -ENOMEM;  
>> in case of failure what is the state of svm (you added the
>> SVM_FLAG_GUEST_MODE bit typically, is it safe to leave it?)
> The SVM_FLAG_GUEST_MODE flag is used for fault reporting where faults
> such as PRQ need to be injected into the guest. If this kzalloc()
> fails, the nested translation would not be setup for this PASID. So
> there shouldn't be any user of the flag. But I think it is better to
> move svm->flags |= SVM_FLAG_GUEST_MODE; to the end when everything is
> setup for nesting.
ok
> 
>>> +		goto out;
>>> +	}
>>> +	sdev->dev = dev;
>>> +	sdev->users = 1;
>>> +
>>> +	/* Set up device context entry for PASID if not enabled
>>> already */
>>> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
>>> +	if (ret) {
>>> +		dev_err(dev, "Failed to enable PASID
>>> capability\n");
>>> +		kfree(sdev);  
>> same here
>>> +		goto out;
>>> +	}
>>> +
>>> +	/*
>>> +	 * For guest bind, we need to set up PASID table entry as
>>> follows:
>>> +	 * - FLPM matches guest paging mode
>>> +	 * - turn on nested mode
>>> +	 * - SL guest address width matching
>>> +	 */
>>> +	ret = intel_pasid_setup_nested(iommu,
>>> +				dev,
>>> +				(pgd_t *)data->gcr3,
>>> +				data->pasid,
>>> +				data->flags,
>>> +				ddomain,
>>> +				data->addr_width);
>>> +	if (ret) {
>>> +		dev_err(dev, "Failed to set up PASID %d in nested
>>> mode, Err %d\n",
>>> +			data->pasid, ret);
>>> +		kfree(sdev);
>>> +		goto out;
>>> +	}
>>> +
>>> +	init_rcu_head(&sdev->rcu);
>>> +	refcount_inc(&svm->refs);
>>> +	list_add_rcu(&sdev->list, &svm->devs);
>>> + out:
>>> +	mutex_unlock(&pasid_mutex);
>>> +	return ret;
>>> +}
>>> +
>>> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>>> +{
>>> +	struct intel_svm_dev *sdev;
>>> +	struct intel_iommu *iommu;
>>> +	struct intel_svm *svm;
>>> +	int ret = -EINVAL;
>>> +
>>> +	mutex_lock(&pasid_mutex);
>>> +	iommu = intel_svm_device_to_iommu(dev);
>>> +	if (!iommu)
>>> +		goto out;
>>> +
>>> +	svm = ioasid_find(NULL, pasid, NULL);
>>> +	if (IS_ERR(svm)) {
>>> +		ret = PTR_ERR(svm);
>>> +		goto out;
>>> +	}
>>> +
>>> +	if (!svm)
>>> +		goto out;
>>> +
>>> +	for_each_svm_dev() {
>>> +		ret = 0;
>>> +		sdev->users--;
>>> +		if (!sdev->users) {
>>> +			list_del_rcu(&sdev->list);
>>> +			intel_pasid_tear_down_entry(iommu, dev,
>>> svm->pasid);
>>> +			/* TODO: Drain in flight PRQ for the PASID
>>> since it
>>> +			 * may get reused soon, we don't want to
>>> +			 * confuse with its previous live.
>>> +			 * intel_svm_drain_prq(dev, pasid);
>>> +			 */
>>> +			kfree_rcu(sdev, rcu);
>>> +
>>> +			if (list_empty(&svm->devs)) {
>>> +				list_del(&svm->list);
>>> +				kfree(svm);
>>> +				/*
>>> +				 * We do not free PASID here until
>>> explicit call
>>> +				 * from the guest to free.  
>> can you be confident in the guest?
> No. But I have confident in the kernel VFIO code to manage guest life
> cycle :)
> I assume when a guest doesn't do unbind when it dies or unload a
> assigned device, I expect VFIO to free all the PASIDs. VFIO needs to
> police the PASID ownership anyway in order to make sure a PASID
> assigned to guest A cannot be used to bind from guest B.
> This is the flow I worked out with Yi, who is doing the VFIO part. Any
> particular concerns?
No I just wanted to make sure someone is going to take care of the final
tear down even if the userspace fails to do things as expected. Maybe
adding a comment to explain who has the ownership of the final tear down
would help here.

Thanks

Eric
> 
>>> +				 */
>>> +				ioasid_set_data(pasid, NULL);
>>> +			}
>>> +		}
>>> +		break;
>>> +	}
>>> + out:
>>> +	mutex_unlock(&pasid_mutex);
>>> +
>>> +	return ret;
>>> +}
>>>  
>>>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
>>> struct svm_dev_ops *ops) {
>>> diff --git a/include/linux/intel-iommu.h
>>> b/include/linux/intel-iommu.h index 48fa164..5d67d0d4 100644
>>> --- a/include/linux/intel-iommu.h
>>> +++ b/include/linux/intel-iommu.h
>>> @@ -677,7 +677,9 @@ int intel_iommu_enable_pasid(struct intel_iommu
>>> *iommu, struct device *dev); int intel_svm_init(struct intel_iommu
>>> *iommu); extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>>>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
>>> -
>>> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
>>> +		struct device *dev, struct gpasid_bind_data *data);
>>> +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>>>  struct svm_dev_ops;
>>>  
>>>  struct intel_svm_dev {
>>> @@ -693,12 +695,16 @@ struct intel_svm_dev {
>>>  
>>>  struct intel_svm {
>>>  	struct mmu_notifier notifier;
>>> -	struct mm_struct *mm;
>>> +	union {
>>> +		struct mm_struct *mm;
>>> +		u64 gcr3;
>>> +	};
>>>  	struct intel_iommu *iommu;
>>>  	int flags;
>>>  	int pasid;
>>>  	struct list_head devs;
>>>  	struct list_head list;
>>> +	refcount_t refs; /* # of devs bond to the PASID */  
>> number of devices sharing the same PASID?
> more clear wording, thanks.
>>>  };
>>>  
>>>  extern struct intel_iommu *intel_svm_device_to_iommu(struct device
>>> *dev); diff --git a/include/linux/intel-svm.h
>>> b/include/linux/intel-svm.h index e3f7631..34b0a3b 100644
>>> --- a/include/linux/intel-svm.h
>>> +++ b/include/linux/intel-svm.h
>>> @@ -52,6 +52,13 @@ struct svm_dev_ops {
>>>   * do such IOTLB flushes automatically.
>>>   */
>>>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
>>> +/*
>>> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
>>> to a device.  
>> binds
> will fix
> 
>>> + * In this case the mm_struct is in the guest kernel or userspace,
>>> its life
>>> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
>>> API provides
>>> + * means to bind/unbind guest CR3 with PASIDs allocated for a
>>> device.
>>> + */
>>> +#define SVM_FLAG_GUEST_MODE	(1<<2)
>>>  
>>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>>>  
>>>   
>>
>> Thanks
>>
>> Eric
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types
  2019-04-30  4:41       ` Auger Eric
@ 2019-04-30 17:15         ` Jacob Pan
  2019-04-30 17:41           ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-30 17:15 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Tue, 30 Apr 2019 06:41:13 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 4/29/19 11:29 PM, Jacob Pan wrote:
> > On Sat, 27 Apr 2019 11:04:04 +0200
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Jacob,
> >>
> >> On 4/24/19 1:31 AM, Jacob Pan wrote:  
> >>> When Shared Virtual Memory is exposed to a guest via vIOMMU,
> >>> extended IOTLB invalidation may be passed down from outside IOMMU
> >>> subsystems. This patch adds invalidation functions that can be
> >>> used for additional translation cache types.
> >>>
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> ---
> >>>  drivers/iommu/dmar.c        | 48
> >>> +++++++++++++++++++++++++++++++++++++++++++++
> >>> include/linux/intel-iommu.h | 21 ++++++++++++++++---- 2 files
> >>> changed, 65 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> >>> index 9c49300..680894e 100644
> >>> --- a/drivers/iommu/dmar.c
> >>> +++ b/drivers/iommu/dmar.c
> >>> @@ -1357,6 +1357,20 @@ void qi_flush_iotlb(struct intel_iommu
> >>> *iommu, u16 did, u64 addr, qi_submit_sync(&desc, iommu);
> >>>  }
> >>>      
> >> /* PASID-based IOTLB Invalidate */  
> >>> +void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64
> >>> addr, u32 pasid,
> >>> +		unsigned int size_order, u64 granu)
> >>> +{
> >>> +	struct qi_desc desc;
> >>> +
> >>> +	desc.qw0 = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
> >>> +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
> >>> +	desc.qw1 = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_IH(0) |
> >>> +		QI_EIOTLB_AM(size_order);    
> >> I see IH it hardcoded to 0. Don't you envision to cascade the IH.
> >> On ARM this was needed for perf sake.  
> > Right, we should cascade IH based on IOMMU_INV_ADDR_FLAGS_LEAF. Just
> > curious how do you deduce the IH information on ARM? I guess you
> > need to get the non-leaf page directory info?
> > I will add an argument for IH.  
> On ARM we have the "Leaf" field in the stage1 TLB invalidation
> command. "When Leaf==1, only cached entries for the last level of
> translation table walk are required to be invalidated".
> 
Thanks for explaining, I guess I didn't ask the right question. I was
wondering how SMMU driver determines when to set the Leaf bit. I guess
it is this function? It is not apparent to me whether the sharing of
non-leaf TLBs are considered.
io_pgtable_tlb_add_flush(iop, iova, blk_size, blk_size, true);

> Thanks
> 
> Eric
>  [...]  
> >> /* Pasid-based Device-TLB Invalidation */  
>  [...]  
> >>> +void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16
> >>> pfsid,
> >>> +		u32 pasid,  u16 qdep, u64 addr, unsigned size,
> >>> u64 granu) +{
> >>> +	struct qi_desc desc;
> >>> +
> >>> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) |
> >>> QI_DEV_EIOTLB_SID(sid) |
> >>> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> >>> +		QI_DEV_IOTLB_PFSID(pfsid);
> >>> +	desc.qw1 |= QI_DEV_EIOTLB_GLOB(granu);  
> > should be desc.qw1 =  
> >>> +
> >>> +	/* If S bit is 0, we only flush a single page. If S bit
> >>> is set,
> >>> +	 * The least significant zero bit indicates the size.
> >>> VT-d spec
> >>> +	 * 6.5.2.6
> >>> +	 */
> >>> +	if (!size)
> >>> +		desc.qw0 = QI_DEV_EIOTLB_ADDR(addr) &
> >>> ~QI_DEV_EIOTLB_SIZE;    
> >> desc.q1 |= ?  
> > Right, I also missed previous qw1 assignment.  
> >>> +	else {
> >>> +		unsigned long mask = 1UL << (VTD_PAGE_SHIFT +
> >>> size); +
> >>> +		desc.qw1 = QI_DEV_EIOTLB_ADDR(addr & ~mask) |
> >>> QI_DEV_EIOTLB_SIZE;    
> >> desc.q1 |=  
> > right, thanks  
> >>> +	}
> >>> +	qi_submit_sync(&desc, iommu);
> >>> +}
> >>> +    
> >> /* PASID-cache invalidation */  
> >>> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64
> >>> granu, int pasid) +{
> >>> +	struct qi_desc desc;
> >>> +
> >>> +	desc.qw0 = QI_PC_TYPE | QI_PC_DID(did) |
> >>> QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
> >>> +	desc.qw1 = 0;
> >>> +	desc.qw2 = 0;
> >>> +	desc.qw3 = 0;
> >>> +	qi_submit_sync(&desc, iommu);
> >>> +}
> >>>  /*
> >>>   * Disable Queued Invalidation interface.
> >>>   */
> >>> diff --git a/include/linux/intel-iommu.h
> >>> b/include/linux/intel-iommu.h index 5d67d0d4..38e5efb 100644
> >>> --- a/include/linux/intel-iommu.h
> >>> +++ b/include/linux/intel-iommu.h
> >>> @@ -339,7 +339,7 @@ enum {
> >>>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >>
> >>> (DMA_TLB_FLUSH_GRANU_OFFSET-4)) #define QI_IOTLB_ADDR(addr)
> >>> (((u64)addr) & VTD_PAGE_MASK) #define
> >>> QI_IOTLB_IH(ih)		(((u64)ih) << 6) -#define
> >>> QI_IOTLB_AM(am)		(((u8)am)) +#define
> >>> QI_IOTLB_AM(am)		(((u8)am) & 0x3f) 
> >>>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
> >>>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
> >>> @@ -357,17 +357,22 @@ enum {
> >>>  #define QI_PC_DID(did)		(((u64)did) << 16)
> >>>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
> >>>  
> >>> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> >>> -#define QI_PC_PASID_SEL		(QI_PC_TYPE |
> >>> QI_PC_GRAN(1)) +/* PASID cache invalidation granu */
> >>> +#define QI_PC_ALL_PASIDS	0
> >>> +#define QI_PC_PASID_SEL		1
> >>>  
> >>>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> >>>  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
> >>>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
> >>> -#define QI_EIOTLB_AM(am)	(((u64)am))
> >>> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
> >>>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
> >>>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
> >>>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
> >>>  
> >>> +/* QI Dev-IOTLB inv granu */
> >>> +#define QI_DEV_IOTLB_GRAN_ALL		1
> >>> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> >>> +
> >>>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
> >>>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
> >>>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> >>> @@ -658,8 +663,16 @@ extern void qi_flush_context(struct
> >>> intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type);
> >>>  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did,
> >>> u64 addr, unsigned int size_order, u64 type);
> >>> +extern void qi_flush_piotlb(struct intel_iommu *iommu, u16 did,
> >>> u64 addr,
> >>> +			u32 pasid, unsigned int size_order, u64
> >>> type); extern void qi_flush_dev_iotlb(struct intel_iommu *iommu,
> >>> u16 sid, u16 pfsid, u16 qdep, u64 addr, unsigned mask);
> >>> +
> >>> +extern void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16
> >>> sid, u16 pfsid,
> >>> +			u32 pasid, u16 qdep, u64 addr, unsigned
> >>> size, u64 granu); +
> >>> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16
> >>> did, u64 granu, int pasid); +
> >>>  extern int qi_submit_sync(struct qi_desc *desc, struct
> >>> intel_iommu *iommu); 
> >>>  extern int dmar_ir_support(void);
> >>>     
> >>
> >> Thanks
> >>
> >> Eric  
> > 
> > [Jacob Pan]
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function
  2019-04-30  6:57       ` Auger Eric
@ 2019-04-30 17:22         ` Jacob Pan
  2019-04-30 17:36           ` Auger Eric
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-30 17:22 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Tue, 30 Apr 2019 08:57:30 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> On 4/30/19 12:41 AM, Jacob Pan wrote:
> > On Fri, 26 Apr 2019 19:23:03 +0200
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Jacob,
> >> On 4/24/19 1:31 AM, Jacob Pan wrote:  
> >>> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >>> vIOMMU, we need to provide invalidation support at IOMMU API and
> >>> driver level. This patch adds Intel VT-d specific function to
> >>> implement iommu passdown invalidate API for shared virtual
> >>> address.
> >>>
> >>> The use case is for supporting caching structure invalidation
> >>> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >>> invalidation capability and passes down all descriptors from the
> >>> guest to the physical IOMMU.
> >>>
> >>> The assumption is that guest to host device ID mapping should be
> >>> resolved prior to calling IOMMU driver. Based on the device
> >>> handle, host IOMMU driver can replace certain fields before
> >>> submit to the invalidation queue.
> >>>
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> >>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> >>> ---
> >>>  drivers/iommu/intel-iommu.c | 159
> >>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 159
> >>> insertions(+)
> >>>
> >>> diff --git a/drivers/iommu/intel-iommu.c
> >>> b/drivers/iommu/intel-iommu.c index 89989b5..54a3d22 100644
> >>> --- a/drivers/iommu/intel-iommu.c
> >>> +++ b/drivers/iommu/intel-iommu.c
> >>> @@ -5338,6 +5338,164 @@ static void
> >>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
> >>>  
> >>> +/*
> >>> + * 2D array for converting and sanitizing IOMMU generic TLB
> >>> granularity to
> >>> + * VT-d granularity. Invalidation is typically included in the
> >>> unmap operation
> >>> + * as a result of DMA or VFIO unmap. However, for assigned device
> >>> where guest
> >>> + * could own the first level page tables without being shadowed
> >>> by QEMU. In
> >>> + * this case there is no pass down unmap to the host IOMMU as a
> >>> result of unmap
> >>> + * in the guest. Only invalidations are trapped and passed down.
> >>> + * In all cases, only first level TLB invalidation (request with
> >>> PASID) can be
> >>> + * passed down, therefore we do not include IOTLB granularity for
> >>> request
> >>> + * without PASID (second level).
> >>> + *
> >>> + * For an example, to find the VT-d granularity encoding for
> >>> IOTLB
> >>> + * type and page selective granularity within PASID:
> >>> + * X: indexed by iommu cache type
> >>> + * Y: indexed by enum iommu_inv_granularity
> >>> + * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
> >>> + *
> >>> + * Granu_map array indicates validity of the table. 1: valid, 0:
> >>> invalid
> >>> + *
> >>> + */
> >>> +const static int
> >>> inv_type_granu_map[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
> >>> = {    
> >> The size is frozen for a given uapi version so I guess you can
> >> hardcode the limits for a given version.  
> > I guess I could, I just felt more readable this way.  
> >>> +	/* PASID based IOTLB, support PASID selective and page
> >>> selective */
> >>> +	{0, 1, 1},
> >>> +	/* PASID based dev TLBs, only support all PASIDs or
> >>> single PASID */
> >>> +	{1, 1, 0},
> >>> +	/* PASID cache */
> >>> +	{1, 1, 0}
> >>> +};
> >>> +
> >>> +const static u64
> >>> inv_type_granu_table[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
> >>> = {
> >>> +	/* PASID based IOTLB */
> >>> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >>> +	/* PASID based dev TLBs */
> >>> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> >>> +	/* PASID cache */
> >>> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> >>> +};    
> >> Can't you use a single matrix instead, ie. inv_type_granu_table
> >>  
> > The reason i have an additional inv_type_granu_map[] matrix is that
> > some of fields can be 0 but still valid. A single matrix would not
> > be able to tell the difference between a valid 0 or invalid field.  
> Ah OK sorry I missed that.
> >>> +
> >>> +static inline int to_vtd_granularity(int type, int granu, u64
> >>> *vtd_granu) +{
> >>> +	if (type >= NR_IOMMU_CACHE_TYPE || granu >=
> >>> NR_IOMMU_CACHE_INVAL_GRANU ||
> >>> +		!inv_type_granu_map[type][granu])
> >>> +		return -EINVAL;
> >>> +
> >>> +	*vtd_granu = inv_type_granu_table[type][granu];
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> >>> +{
> >>> +	u64 nr_pages;    
> >> direct initialization?  
> > will do, thanks  
> >>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k,
> >>> 9 for 2MB, etc.
> >>> +	 * IOMMU cache invalidate API passes granu_size in bytes,
> >>> and number of
> >>> +	 * granu size in contiguous memory.
> >>> +	 */
> >>> +
> >>> +	nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> >>> +	return order_base_2(nr_pages);
> >>> +}
> >>> +
> >>> +static int intel_iommu_sva_invalidate(struct iommu_domain
> >>> *domain,
> >>> +		struct device *dev, struct
> >>> iommu_cache_invalidate_info *inv_info) +{
> >>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> >>> +	struct device_domain_info *info;
> >>> +	struct intel_iommu *iommu;
> >>> +	unsigned long flags;
> >>> +	int cache_type;
> >>> +	u8 bus, devfn;
> >>> +	u16 did, sid;
> >>> +	int ret = 0;
> >>> +	u64 granu;
> >>> +	u64 size;
> >>> +
> >>> +	if (!inv_info || !dmar_domain ||
> >>> +		inv_info->version !=
> >>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (!dev || !dev_is_pci(dev))
> >>> +		return -ENODEV;
> >>> +
> >>> +	iommu = device_to_iommu(dev, &bus, &devfn);
> >>> +	if (!iommu)
> >>> +		return -ENODEV;
> >>> +
> >>> +	spin_lock(&iommu->lock);
> >>> +	spin_lock_irqsave(&device_domain_lock, flags);    
> >> mix of _irqsave and non _irqsave looks suspicious to me.  
> > It should be in reverse order. Any other concerns?  
> I understand both locks are likely to be taken in ISR context so
> _irqsave should be called on the first call.
Yes, that is what i meant in reverse order.
	spin_lock_irqsave(&device_domain_lock, flags); 
	spin_lock(&iommu->lock);

then the unlocking part will remain the same.

> >>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> >>> devfn);
> >>> +	if (!info) {
> >>> +		ret = -EINVAL;
> >>> +		goto out_unlock;
> >>> +	}
> >>> +	did = dmar_domain->iommu_did[iommu->seq_id];
> >>> +	sid = PCI_DEVID(bus, devfn);
> >>> +	size = to_vtd_size(inv_info->addr_info.granule_size,
> >>> inv_info->addr_info.nb_granules); +
> >>> +	for_each_set_bit(cache_type, (unsigned long
> >>> *)&inv_info->cache, NR_IOMMU_CACHE_TYPE) { +
> >>> +		ret = to_vtd_granularity(cache_type,
> >>> inv_info->granularity, &granu);
> >>> +		if (ret) {
> >>> +			pr_err("Invalid range type %d, granu
> >>> %d\n", cache_type,    
> >> s/Invalid range type %d, granu %d/Invalid cache type/granu
> >> combination (%d/%d)  
> > sounds good, indeed it is the combination that is invalid.  
> >>> +				inv_info->granularity);
> >>> +			break;
> >>> +		}
> >>> +
> >>> +		switch (BIT(cache_type)) {
> >>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> >>> +			if (size && (inv_info->addr_info.addr &
> >>> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> >>> +				pr_err("Address out of range,
> >>> 0x%llx, size order %llu\n",
> >>> +					inv_info->addr_info.addr,
> >>> size);
> >>> +				ret = -ERANGE;
> >>> +				goto out_unlock;
> >>> +			}
> >>> +
> >>> +			qi_flush_piotlb(iommu, did,
> >>> mm_to_dma_pfn(inv_info->addr_info.addr),
> >>> +
> >>> inv_info->addr_info.pasid,
> >>> +					size, granu);
> >>> +
> >>> +			/*
> >>> +			 * Always flush device IOTLB if ATS is
> >>> enabled since guest
> >>> +			 * vIOMMU exposes CM = 1, no device IOTLB
> >>> flush will be passed
> >>> +			 * down. REVISIT: cannot assume Linux
> >>> guest
> >>> +			 */
> >>> +			if (info->ats_enabled) {
> >>> +				qi_flush_dev_piotlb(iommu, sid,
> >>> info->pfsid,
> >>> +
> >>> inv_info->addr_info.pasid, info->ats_qdep,
> >>> +
> >>> inv_info->addr_info.addr, size,
> >>> +						granu);
> >>> +			}
> >>> +			break;
> >>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> >>> +			if (info->ats_enabled) {
> >>> +				qi_flush_dev_piotlb(iommu, sid,
> >>> info->pfsid,
> >>> +
> >>> inv_info->addr_info.pasid, info->ats_qdep,
> >>> +
> >>> inv_info->addr_info.addr, size,
> >>> +						granu);
> >>> +			} else
> >>> +				pr_warn("Passdown device IOTLB
> >>> flush w/o ATS!\n"); +
> >>> +			break;
> >>> +		case IOMMU_CACHE_INV_TYPE_PASID:
> >>> +			qi_flush_pasid_cache(iommu, did, granu,
> >>> inv_info->pasid); +
> >>> +			break;
> >>> +		default:
> >>> +			dev_err(dev, "Unsupported IOMMU
> >>> invalidation type %d\n",
> >>> +				cache_type);
> >>> +			ret = -EINVAL;
> >>> +		}
> >>> +	}
> >>> +out_unlock:
> >>> +	spin_unlock(&iommu->lock);
> >>> +	spin_unlock_irqrestore(&device_domain_lock, flags);    
> >> I would expect the opposite order  
> > yes, i reversed in the lock order such that irq is disabled.  
> spin_unlock_irqsave(&iommu->lock, flags);
> spin_lock(&device_domain_lock);
> ../..
> spin_unlock_irqrestore(&device_domain_lock);
> spin_unlock_irqrestore(&iommu->lock);
> ?
> 
I meant this:

	spin_lock_irqsave(&device_domain_lock, flags); 
	spin_lock(&iommu->lock);

...

	spin_unlock(&iommu->lock);
	spin_unlock_irqrestore(&device_domain_lock, flags);

> Thanks
> 
> Eric
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +
> >>>  static int intel_iommu_map(struct iommu_domain *domain,
> >>>  			   unsigned long iova, phys_addr_t hpa,
> >>>  			   size_t size, int iommu_prot)
> >>> @@ -5769,6 +5927,7 @@ const struct iommu_ops intel_iommu_ops = {
> >>>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >>>  #ifdef CONFIG_INTEL_IOMMU_SVM
> >>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >>>  #endif
> >>>     
> >> Thanks
> >>
> >> Eric  
> > 
> > Thank you so much for your review. I will roll up the next version
> > soon, hopefully this week.
> > 
> > Jacob
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function
  2019-04-30 17:22         ` Jacob Pan
@ 2019-04-30 17:36           ` Auger Eric
  0 siblings, 0 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-30 17:36 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Jacob,

On 4/30/19 7:22 PM, Jacob Pan wrote:
> On Tue, 30 Apr 2019 08:57:30 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> On 4/30/19 12:41 AM, Jacob Pan wrote:
>>> On Fri, 26 Apr 2019 19:23:03 +0200
>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>   
>>>> Hi Jacob,
>>>> On 4/24/19 1:31 AM, Jacob Pan wrote:  
>>>>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>>>>> vIOMMU, we need to provide invalidation support at IOMMU API and
>>>>> driver level. This patch adds Intel VT-d specific function to
>>>>> implement iommu passdown invalidate API for shared virtual
>>>>> address.
>>>>>
>>>>> The use case is for supporting caching structure invalidation
>>>>> of assigned SVM capable devices. Emulated IOMMU exposes queue
>>>>> invalidation capability and passes down all descriptors from the
>>>>> guest to the physical IOMMU.
>>>>>
>>>>> The assumption is that guest to host device ID mapping should be
>>>>> resolved prior to calling IOMMU driver. Based on the device
>>>>> handle, host IOMMU driver can replace certain fields before
>>>>> submit to the invalidation queue.
>>>>>
>>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>>>> ---
>>>>>  drivers/iommu/intel-iommu.c | 159
>>>>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 159
>>>>> insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/intel-iommu.c
>>>>> b/drivers/iommu/intel-iommu.c index 89989b5..54a3d22 100644
>>>>> --- a/drivers/iommu/intel-iommu.c
>>>>> +++ b/drivers/iommu/intel-iommu.c
>>>>> @@ -5338,6 +5338,164 @@ static void
>>>>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>>>>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
>>>>>  
>>>>> +/*
>>>>> + * 2D array for converting and sanitizing IOMMU generic TLB
>>>>> granularity to
>>>>> + * VT-d granularity. Invalidation is typically included in the
>>>>> unmap operation
>>>>> + * as a result of DMA or VFIO unmap. However, for assigned device
>>>>> where guest
>>>>> + * could own the first level page tables without being shadowed
>>>>> by QEMU. In
>>>>> + * this case there is no pass down unmap to the host IOMMU as a
>>>>> result of unmap
>>>>> + * in the guest. Only invalidations are trapped and passed down.
>>>>> + * In all cases, only first level TLB invalidation (request with
>>>>> PASID) can be
>>>>> + * passed down, therefore we do not include IOTLB granularity for
>>>>> request
>>>>> + * without PASID (second level).
>>>>> + *
>>>>> + * For an example, to find the VT-d granularity encoding for
>>>>> IOTLB
>>>>> + * type and page selective granularity within PASID:
>>>>> + * X: indexed by iommu cache type
>>>>> + * Y: indexed by enum iommu_inv_granularity
>>>>> + * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
>>>>> + *
>>>>> + * Granu_map array indicates validity of the table. 1: valid, 0:
>>>>> invalid
>>>>> + *
>>>>> + */
>>>>> +const static int
>>>>> inv_type_granu_map[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
>>>>> = {    
>>>> The size is frozen for a given uapi version so I guess you can
>>>> hardcode the limits for a given version.  
>>> I guess I could, I just felt more readable this way.  
>>>>> +	/* PASID based IOTLB, support PASID selective and page
>>>>> selective */
>>>>> +	{0, 1, 1},
>>>>> +	/* PASID based dev TLBs, only support all PASIDs or
>>>>> single PASID */
>>>>> +	{1, 1, 0},
>>>>> +	/* PASID cache */
>>>>> +	{1, 1, 0}
>>>>> +};
>>>>> +
>>>>> +const static u64
>>>>> inv_type_granu_table[NR_IOMMU_CACHE_TYPE][NR_IOMMU_CACHE_INVAL_GRANU]
>>>>> = {
>>>>> +	/* PASID based IOTLB */
>>>>> +	{0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
>>>>> +	/* PASID based dev TLBs */
>>>>> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
>>>>> +	/* PASID cache */
>>>>> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
>>>>> +};    
>>>> Can't you use a single matrix instead, ie. inv_type_granu_table
>>>>  
>>> The reason i have an additional inv_type_granu_map[] matrix is that
>>> some of fields can be 0 but still valid. A single matrix would not
>>> be able to tell the difference between a valid 0 or invalid field.  
>> Ah OK sorry I missed that.
>>>>> +
>>>>> +static inline int to_vtd_granularity(int type, int granu, u64
>>>>> *vtd_granu) +{
>>>>> +	if (type >= NR_IOMMU_CACHE_TYPE || granu >=
>>>>> NR_IOMMU_CACHE_INVAL_GRANU ||
>>>>> +		!inv_type_granu_map[type][granu])
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	*vtd_granu = inv_type_granu_table[type][granu];
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
>>>>> +{
>>>>> +	u64 nr_pages;    
>>>> direct initialization?  
>>> will do, thanks  
>>>>> +	/* VT-d size is encoded as 2^size of 4K pages, 0 for 4k,
>>>>> 9 for 2MB, etc.
>>>>> +	 * IOMMU cache invalidate API passes granu_size in bytes,
>>>>> and number of
>>>>> +	 * granu size in contiguous memory.
>>>>> +	 */
>>>>> +
>>>>> +	nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
>>>>> +	return order_base_2(nr_pages);
>>>>> +}
>>>>> +
>>>>> +static int intel_iommu_sva_invalidate(struct iommu_domain
>>>>> *domain,
>>>>> +		struct device *dev, struct
>>>>> iommu_cache_invalidate_info *inv_info) +{
>>>>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>>>>> +	struct device_domain_info *info;
>>>>> +	struct intel_iommu *iommu;
>>>>> +	unsigned long flags;
>>>>> +	int cache_type;
>>>>> +	u8 bus, devfn;
>>>>> +	u16 did, sid;
>>>>> +	int ret = 0;
>>>>> +	u64 granu;
>>>>> +	u64 size;
>>>>> +
>>>>> +	if (!inv_info || !dmar_domain ||
>>>>> +		inv_info->version !=
>>>>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	if (!dev || !dev_is_pci(dev))
>>>>> +		return -ENODEV;
>>>>> +
>>>>> +	iommu = device_to_iommu(dev, &bus, &devfn);
>>>>> +	if (!iommu)
>>>>> +		return -ENODEV;
>>>>> +
>>>>> +	spin_lock(&iommu->lock);
>>>>> +	spin_lock_irqsave(&device_domain_lock, flags);    
>>>> mix of _irqsave and non _irqsave looks suspicious to me.  
>>> It should be in reverse order. Any other concerns?  
>> I understand both locks are likely to be taken in ISR context so
>> _irqsave should be called on the first call.
> Yes, that is what i meant in reverse order.
> 	spin_lock_irqsave(&device_domain_lock, flags); 
> 	spin_lock(&iommu->lock);
> 
> then the unlocking part will remain the same.
> 
>>>>> +	info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
>>>>> devfn);
>>>>> +	if (!info) {
>>>>> +		ret = -EINVAL;
>>>>> +		goto out_unlock;
>>>>> +	}
>>>>> +	did = dmar_domain->iommu_did[iommu->seq_id];
>>>>> +	sid = PCI_DEVID(bus, devfn);
>>>>> +	size = to_vtd_size(inv_info->addr_info.granule_size,
>>>>> inv_info->addr_info.nb_granules); +
>>>>> +	for_each_set_bit(cache_type, (unsigned long
>>>>> *)&inv_info->cache, NR_IOMMU_CACHE_TYPE) { +
>>>>> +		ret = to_vtd_granularity(cache_type,
>>>>> inv_info->granularity, &granu);
>>>>> +		if (ret) {
>>>>> +			pr_err("Invalid range type %d, granu
>>>>> %d\n", cache_type,    
>>>> s/Invalid range type %d, granu %d/Invalid cache type/granu
>>>> combination (%d/%d)  
>>> sounds good, indeed it is the combination that is invalid.  
>>>>> +				inv_info->granularity);
>>>>> +			break;
>>>>> +		}
>>>>> +
>>>>> +		switch (BIT(cache_type)) {
>>>>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
>>>>> +			if (size && (inv_info->addr_info.addr &
>>>>> ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
>>>>> +				pr_err("Address out of range,
>>>>> 0x%llx, size order %llu\n",
>>>>> +					inv_info->addr_info.addr,
>>>>> size);
>>>>> +				ret = -ERANGE;
>>>>> +				goto out_unlock;
>>>>> +			}
>>>>> +
>>>>> +			qi_flush_piotlb(iommu, did,
>>>>> mm_to_dma_pfn(inv_info->addr_info.addr),
>>>>> +
>>>>> inv_info->addr_info.pasid,
>>>>> +					size, granu);
>>>>> +
>>>>> +			/*
>>>>> +			 * Always flush device IOTLB if ATS is
>>>>> enabled since guest
>>>>> +			 * vIOMMU exposes CM = 1, no device IOTLB
>>>>> flush will be passed
>>>>> +			 * down. REVISIT: cannot assume Linux
>>>>> guest
>>>>> +			 */
>>>>> +			if (info->ats_enabled) {
>>>>> +				qi_flush_dev_piotlb(iommu, sid,
>>>>> info->pfsid,
>>>>> +
>>>>> inv_info->addr_info.pasid, info->ats_qdep,
>>>>> +
>>>>> inv_info->addr_info.addr, size,
>>>>> +						granu);
>>>>> +			}
>>>>> +			break;
>>>>> +		case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
>>>>> +			if (info->ats_enabled) {
>>>>> +				qi_flush_dev_piotlb(iommu, sid,
>>>>> info->pfsid,
>>>>> +
>>>>> inv_info->addr_info.pasid, info->ats_qdep,
>>>>> +
>>>>> inv_info->addr_info.addr, size,
>>>>> +						granu);
>>>>> +			} else
>>>>> +				pr_warn("Passdown device IOTLB
>>>>> flush w/o ATS!\n"); +
>>>>> +			break;
>>>>> +		case IOMMU_CACHE_INV_TYPE_PASID:
>>>>> +			qi_flush_pasid_cache(iommu, did, granu,
>>>>> inv_info->pasid); +
>>>>> +			break;
>>>>> +		default:
>>>>> +			dev_err(dev, "Unsupported IOMMU
>>>>> invalidation type %d\n",
>>>>> +				cache_type);
>>>>> +			ret = -EINVAL;
>>>>> +		}
>>>>> +	}
>>>>> +out_unlock:
>>>>> +	spin_unlock(&iommu->lock);
>>>>> +	spin_unlock_irqrestore(&device_domain_lock, flags);    
>>>> I would expect the opposite order  
>>> yes, i reversed in the lock order such that irq is disabled.  
>> spin_unlock_irqsave(&iommu->lock, flags);
>> spin_lock(&device_domain_lock);
>> ../..
>> spin_unlock_irqrestore(&device_domain_lock);
>> spin_unlock_irqrestore(&iommu->lock);
>> ?
>>
> I meant this:
> 
> 	spin_lock_irqsave(&device_domain_lock, flags); 
> 	spin_lock(&iommu->lock);
> 
> ...
> 
> 	spin_unlock(&iommu->lock);
> 	spin_unlock_irqrestore(&device_domain_lock, flags);
Yes that's the proper lock hierarchy as seen in dmar_insert_one_dev_info().

Thanks

Eric
> 
>> Thanks
>>
>> Eric
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>>  static int intel_iommu_map(struct iommu_domain *domain,
>>>>>  			   unsigned long iova, phys_addr_t hpa,
>>>>>  			   size_t size, int iommu_prot)
>>>>> @@ -5769,6 +5927,7 @@ const struct iommu_ops intel_iommu_ops = {
>>>>>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
>>>>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>>>>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>>>>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
>>>>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
>>>>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
>>>>>  #endif
>>>>>     
>>>> Thanks
>>>>
>>>> Eric  
>>>
>>> Thank you so much for your review. I will roll up the next version
>>> soon, hopefully this week.
>>>
>>> Jacob
>>>   
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types
  2019-04-30 17:15         ` Jacob Pan
@ 2019-04-30 17:41           ` Auger Eric
  0 siblings, 0 replies; 74+ messages in thread
From: Auger Eric @ 2019-04-30 17:41 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko

Hi Jacob,

On 4/30/19 7:15 PM, Jacob Pan wrote:
> On Tue, 30 Apr 2019 06:41:13 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 4/29/19 11:29 PM, Jacob Pan wrote:
>>> On Sat, 27 Apr 2019 11:04:04 +0200
>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>   
>>>> Hi Jacob,
>>>>
>>>> On 4/24/19 1:31 AM, Jacob Pan wrote:  
>>>>> When Shared Virtual Memory is exposed to a guest via vIOMMU,
>>>>> extended IOTLB invalidation may be passed down from outside IOMMU
>>>>> subsystems. This patch adds invalidation functions that can be
>>>>> used for additional translation cache types.
>>>>>
>>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>> ---
>>>>>  drivers/iommu/dmar.c        | 48
>>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>> include/linux/intel-iommu.h | 21 ++++++++++++++++---- 2 files
>>>>> changed, 65 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
>>>>> index 9c49300..680894e 100644
>>>>> --- a/drivers/iommu/dmar.c
>>>>> +++ b/drivers/iommu/dmar.c
>>>>> @@ -1357,6 +1357,20 @@ void qi_flush_iotlb(struct intel_iommu
>>>>> *iommu, u16 did, u64 addr, qi_submit_sync(&desc, iommu);
>>>>>  }
>>>>>      
>>>> /* PASID-based IOTLB Invalidate */  
>>>>> +void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u64
>>>>> addr, u32 pasid,
>>>>> +		unsigned int size_order, u64 granu)
>>>>> +{
>>>>> +	struct qi_desc desc;
>>>>> +
>>>>> +	desc.qw0 = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
>>>>> +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
>>>>> +	desc.qw1 = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_IH(0) |
>>>>> +		QI_EIOTLB_AM(size_order);    
>>>> I see IH it hardcoded to 0. Don't you envision to cascade the IH.
>>>> On ARM this was needed for perf sake.  
>>> Right, we should cascade IH based on IOMMU_INV_ADDR_FLAGS_LEAF. Just
>>> curious how do you deduce the IH information on ARM? I guess you
>>> need to get the non-leaf page directory info?
>>> I will add an argument for IH.  
>> On ARM we have the "Leaf" field in the stage1 TLB invalidation
>> command. "When Leaf==1, only cached entries for the last level of
>> translation table walk are required to be invalidated".
>>
> Thanks for explaining, I guess I didn't ask the right question. I was
> wondering how SMMU driver determines when to set the Leaf bit. I guess
> it is this function? It is not apparent to me whether the sharing of
> non-leaf TLBs are considered.
> io_pgtable_tlb_add_flush(iop, iova, blk_size, blk_size, true);

the leaf value is passed as arg to
tlb_sync cb = arm_smmu_tlb_inv_range_nosync so the actual decision is
made in io-pgtable-arm.c, see io_pgtable_tlb_sync call sites.

Thanks

Eric
> 
>> Thanks
>>
>> Eric
>>  [...]  
>>>> /* Pasid-based Device-TLB Invalidation */  
>>  [...]  
>>>>> +void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16 sid, u16
>>>>> pfsid,
>>>>> +		u32 pasid,  u16 qdep, u64 addr, unsigned size,
>>>>> u64 granu) +{
>>>>> +	struct qi_desc desc;
>>>>> +
>>>>> +	desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) |
>>>>> QI_DEV_EIOTLB_SID(sid) |
>>>>> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
>>>>> +		QI_DEV_IOTLB_PFSID(pfsid);
>>>>> +	desc.qw1 |= QI_DEV_EIOTLB_GLOB(granu);  
>>> should be desc.qw1 =  
>>>>> +
>>>>> +	/* If S bit is 0, we only flush a single page. If S bit
>>>>> is set,
>>>>> +	 * The least significant zero bit indicates the size.
>>>>> VT-d spec
>>>>> +	 * 6.5.2.6
>>>>> +	 */
>>>>> +	if (!size)
>>>>> +		desc.qw0 = QI_DEV_EIOTLB_ADDR(addr) &
>>>>> ~QI_DEV_EIOTLB_SIZE;    
>>>> desc.q1 |= ?  
>>> Right, I also missed previous qw1 assignment.  
>>>>> +	else {
>>>>> +		unsigned long mask = 1UL << (VTD_PAGE_SHIFT +
>>>>> size); +
>>>>> +		desc.qw1 = QI_DEV_EIOTLB_ADDR(addr & ~mask) |
>>>>> QI_DEV_EIOTLB_SIZE;    
>>>> desc.q1 |=  
>>> right, thanks  
>>>>> +	}
>>>>> +	qi_submit_sync(&desc, iommu);
>>>>> +}
>>>>> +    
>>>> /* PASID-cache invalidation */  
>>>>> +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64
>>>>> granu, int pasid) +{
>>>>> +	struct qi_desc desc;
>>>>> +
>>>>> +	desc.qw0 = QI_PC_TYPE | QI_PC_DID(did) |
>>>>> QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
>>>>> +	desc.qw1 = 0;
>>>>> +	desc.qw2 = 0;
>>>>> +	desc.qw3 = 0;
>>>>> +	qi_submit_sync(&desc, iommu);
>>>>> +}
>>>>>  /*
>>>>>   * Disable Queued Invalidation interface.
>>>>>   */
>>>>> diff --git a/include/linux/intel-iommu.h
>>>>> b/include/linux/intel-iommu.h index 5d67d0d4..38e5efb 100644
>>>>> --- a/include/linux/intel-iommu.h
>>>>> +++ b/include/linux/intel-iommu.h
>>>>> @@ -339,7 +339,7 @@ enum {
>>>>>  #define QI_IOTLB_GRAN(gran) 	(((u64)gran) >>
>>>>> (DMA_TLB_FLUSH_GRANU_OFFSET-4)) #define QI_IOTLB_ADDR(addr)
>>>>> (((u64)addr) & VTD_PAGE_MASK) #define
>>>>> QI_IOTLB_IH(ih)		(((u64)ih) << 6) -#define
>>>>> QI_IOTLB_AM(am)		(((u8)am)) +#define
>>>>> QI_IOTLB_AM(am)		(((u8)am) & 0x3f) 
>>>>>  #define QI_CC_FM(fm)		(((u64)fm) << 48)
>>>>>  #define QI_CC_SID(sid)		(((u64)sid) << 32)
>>>>> @@ -357,17 +357,22 @@ enum {
>>>>>  #define QI_PC_DID(did)		(((u64)did) << 16)
>>>>>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>>>>>  
>>>>> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
>>>>> -#define QI_PC_PASID_SEL		(QI_PC_TYPE |
>>>>> QI_PC_GRAN(1)) +/* PASID cache invalidation granu */
>>>>> +#define QI_PC_ALL_PASIDS	0
>>>>> +#define QI_PC_PASID_SEL		1
>>>>>  
>>>>>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>>>>>  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
>>>>>  #define QI_EIOTLB_IH(ih)	(((u64)ih) << 6)
>>>>> -#define QI_EIOTLB_AM(am)	(((u64)am))
>>>>> +#define QI_EIOTLB_AM(am)	(((u64)am) & 0x3f)
>>>>>  #define QI_EIOTLB_PASID(pasid) 	(((u64)pasid) << 32)
>>>>>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>>>>>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>>>>>  
>>>>> +/* QI Dev-IOTLB inv granu */
>>>>> +#define QI_DEV_IOTLB_GRAN_ALL		1
>>>>> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
>>>>> +
>>>>>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>>>>>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>>>>>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
>>>>> @@ -658,8 +663,16 @@ extern void qi_flush_context(struct
>>>>> intel_iommu *iommu, u16 did, u16 sid, u8 fm, u64 type);
>>>>>  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did,
>>>>> u64 addr, unsigned int size_order, u64 type);
>>>>> +extern void qi_flush_piotlb(struct intel_iommu *iommu, u16 did,
>>>>> u64 addr,
>>>>> +			u32 pasid, unsigned int size_order, u64
>>>>> type); extern void qi_flush_dev_iotlb(struct intel_iommu *iommu,
>>>>> u16 sid, u16 pfsid, u16 qdep, u64 addr, unsigned mask);
>>>>> +
>>>>> +extern void qi_flush_dev_piotlb(struct intel_iommu *iommu, u16
>>>>> sid, u16 pfsid,
>>>>> +			u32 pasid, u16 qdep, u64 addr, unsigned
>>>>> size, u64 granu); +
>>>>> +extern void qi_flush_pasid_cache(struct intel_iommu *iommu, u16
>>>>> did, u64 granu, int pasid); +
>>>>>  extern int qi_submit_sync(struct qi_desc *desc, struct
>>>>> intel_iommu *iommu); 
>>>>>  extern int dmar_ir_support(void);
>>>>>     
>>>>
>>>> Thanks
>>>>
>>>> Eric  
>>>
>>> [Jacob Pan]
>>>   
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support
  2019-04-30  7:05       ` Auger Eric
@ 2019-04-30 17:49         ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-04-30 17:49 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Tue, 30 Apr 2019 09:05:01 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> On 4/29/19 5:25 PM, Jacob Pan wrote:
> > On Fri, 26 Apr 2019 18:15:27 +0200
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Jacob,
> >>
> >> On 4/24/19 1:31 AM, Jacob Pan wrote:  
> >>> When supporting guest SVA with emulated IOMMU, the guest PASID
> >>> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> >>> will result in PASID cache flush which will be passed down to
> >>> the host as bind guest PASID calls.
> >>>
> >>> For the SL page tables, it will be harvested from device's
> >>> default domain (request w/o PASID), or aux domain in case of
> >>> mediated device.
> >>>
> >>>     .-------------.  .---------------------------.
> >>>     |   vIOMMU    |  | Guest process CR3, FL only|
> >>>     |             |  '---------------------------'
> >>>     .----------------/
> >>>     | PASID Entry |--- PASID cache flush -
> >>>     '-------------'                       |
> >>>     |             |                       V
> >>>     |             |                CR3 in GPA
> >>>     '-------------'
> >>> Guest
> >>> ------| Shadow |--------------------------|--------
> >>>       v        v                          v
> >>> Host
> >>>     .-------------.  .----------------------.
> >>>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >>>     |             |  '----------------------'
> >>>     .----------------/  |
> >>>     | PASID Entry |     V (Nested xlate)
> >>>     '----------------\.------------------------------.
> >>>     |             |   |SL for GPA-HPA, default domain|
> >>>     |             |   '------------------------------'
> >>>     '-------------'
> >>> Where:
> >>>  - FL = First level/stage one page tables
> >>>  - SL = Second level/stage two page tables
> >>>
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> >>> ---
> >>>  drivers/iommu/intel-iommu.c |   4 +
> >>>  drivers/iommu/intel-svm.c   | 174
> >>> ++++++++++++++++++++++++++++++++++++++++++++
> >>> include/linux/intel-iommu.h |  10 ++- include/linux/intel-svm.h
> >>> |   7 ++ 4 files changed, 193 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/intel-iommu.c
> >>> b/drivers/iommu/intel-iommu.c index 77bbe1b..89989b5 100644
> >>> --- a/drivers/iommu/intel-iommu.c
> >>> +++ b/drivers/iommu/intel-iommu.c
> >>> @@ -5768,6 +5768,10 @@ const struct iommu_ops intel_iommu_ops = {
> >>>  	.dev_enable_feat	= intel_iommu_dev_enable_feat,
> >>>  	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> >>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >>> +#ifdef CONFIG_INTEL_IOMMU_SVM
> >>> +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >>> +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >>> +#endif
> >>>  };
> >>>  
> >>>  static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
> >>> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> >>> index 8fff212..0a973c2 100644
> >>> --- a/drivers/iommu/intel-svm.c
> >>> +++ b/drivers/iommu/intel-svm.c
> >>> @@ -227,6 +227,180 @@ static const struct mmu_notifier_ops
> >>> intel_mmuops = { 
> >>>  static DEFINE_MUTEX(pasid_mutex);
> >>>  static LIST_HEAD(global_svm_list);
> >>> +#define for_each_svm_dev() \
> >>> +	list_for_each_entry(sdev, &svm->devs, list)	\
> >>> +	if (dev == sdev->dev)				\
> >>> +
> >>> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> >>> +			struct device *dev,
> >>> +			struct gpasid_bind_data *data)
> >>> +{
> >>> +	struct intel_iommu *iommu =
> >>> intel_svm_device_to_iommu(dev);
> >>> +	struct intel_svm_dev *sdev;
> >>> +	struct intel_svm *svm = NULL;
> >>> +	struct dmar_domain *ddomain;
> >>> +	int pasid_max;
> >>> +	int ret = 0;
> >>> +
> >>> +	if (WARN_ON(!iommu) || !data)
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (dev_is_pci(dev)) {
> >>> +		pasid_max = pci_max_pasids(to_pci_dev(dev));
> >>> +		if (pasid_max < 0)
> >>> +			return -EINVAL;
> >>> +	} else
> >>> +		pasid_max = 1 << 20;
> >>> +
> >>> +	if (data->pasid <= 0 || data->pasid >= pasid_max)
> >>> +		return -EINVAL;
> >>> +
> >>> +	ddomain = to_dmar_domain(domain);
> >>> +	/* REVISIT:
> >>> +	 * Sanity check adddress width and paging mode support
> >>> +	 * width matching in two dimensions:
> >>> +	 * 1. paging mode CPU <= IOMMU
> >>> +	 * 2. address width Guest <= Host.
> >>> +	 */
> >>> +	mutex_lock(&pasid_mutex);
> >>> +	svm = ioasid_find(NULL, data->pasid, NULL);
> >>> +	if (IS_ERR(svm)) {
> >>> +		ret = PTR_ERR(svm);
> >>> +		goto out;
> >>> +	}
> >>> +	if (svm) {
> >>> +		if (list_empty(&svm->devs)) {
> >>> +			dev_err(dev, "GPASID %d has no devices
> >>> bond but SVA is allocated\n",
> >>> +				data->pasid);
> >>> +			ret = -ENODEV; /*
> >>> +					* If we found svm for the
> >>> PASID, there must be at
> >>> +					* least one device bond,
> >>> otherwise svm should be freed.
> >>> +					*/    
> >> comment should be put after list_empty I think. In which
> >> circumstances can it happen, I mean, isn't it a BUG_ON case?  
> > Well, I think failing to bind guest PASID is not severe enough to
> > the host to use BUG_ON. It has to be something more catastrophic to
> > use BUG_ON right? I will relocate the comments.  
> When the error is due to a programming error at kernel error (not
> induced by any userspace call) I guess it is acceptable to put a
> BUG_ON. However the usage of BUG_ON() is generally frown upon so my
> question rather was to understand if this can really happen and why?
Indeed this should never happen unless some future programming error. I
guess I can add a BUG_ON() or ignore the check.
> >>> +			goto out;
> >>> +		}
> >>> +		for_each_svm_dev() {
> >>> +			/* In case of multiple sub-devices of the
> >>> same pdev assigned, we should
> >>> +			 * allow multiple bind calls with the
> >>> same PASID and pdev.
> >>> +			 */
> >>> +			sdev->users++;
> >>> +			goto out;
> >>> +		}
> >>> +	} else {
> >>> +		/* We come here when PASID has never been bond to
> >>> a device. */
> >>> +		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> >>> +		if (!svm) {
> >>> +			ret = -ENOMEM;
> >>> +			goto out;
> >>> +		}
> >>> +		/* REVISIT: upper layer/VFIO can track host
> >>> process that bind the PASID.
> >>> +		 * ioasid_set = mm might be sufficient for vfio
> >>> to check pasid VMM
> >>> +		 * ownership.
> >>> +		 */
> >>> +		svm->mm = get_task_mm(current);
> >>> +		svm->pasid = data->pasid;
> >>> +		refcount_set(&svm->refs, 0);
> >>> +		ioasid_set_data(data->pasid, svm);
> >>> +		INIT_LIST_HEAD_RCU(&svm->devs);
> >>> +		INIT_LIST_HEAD(&svm->list);
> >>> +
> >>> +		mmput(svm->mm);
> >>> +	}
> >>> +	svm->flags |= SVM_FLAG_GUEST_MODE;
> >>> +	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> >>> +	if (!sdev) {
> >>> +		ret = -ENOMEM;    
> >> in case of failure what is the state of svm (you added the
> >> SVM_FLAG_GUEST_MODE bit typically, is it safe to leave it?)  
> > The SVM_FLAG_GUEST_MODE flag is used for fault reporting where
> > faults such as PRQ need to be injected into the guest. If this
> > kzalloc() fails, the nested translation would not be setup for this
> > PASID. So there shouldn't be any user of the flag. But I think it
> > is better to move svm->flags |= SVM_FLAG_GUEST_MODE; to the end
> > when everything is setup for nesting.  
> ok
> >   
> >>> +		goto out;
> >>> +	}
> >>> +	sdev->dev = dev;
> >>> +	sdev->users = 1;
> >>> +
> >>> +	/* Set up device context entry for PASID if not enabled
> >>> already */
> >>> +	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> >>> +	if (ret) {
> >>> +		dev_err(dev, "Failed to enable PASID
> >>> capability\n");
> >>> +		kfree(sdev);    
> >> same here  
> >>> +		goto out;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * For guest bind, we need to set up PASID table entry as
> >>> follows:
> >>> +	 * - FLPM matches guest paging mode
> >>> +	 * - turn on nested mode
> >>> +	 * - SL guest address width matching
> >>> +	 */
> >>> +	ret = intel_pasid_setup_nested(iommu,
> >>> +				dev,
> >>> +				(pgd_t *)data->gcr3,
> >>> +				data->pasid,
> >>> +				data->flags,
> >>> +				ddomain,
> >>> +				data->addr_width);
> >>> +	if (ret) {
> >>> +		dev_err(dev, "Failed to set up PASID %d in nested
> >>> mode, Err %d\n",
> >>> +			data->pasid, ret);
> >>> +		kfree(sdev);
> >>> +		goto out;
> >>> +	}
> >>> +
> >>> +	init_rcu_head(&sdev->rcu);
> >>> +	refcount_inc(&svm->refs);
> >>> +	list_add_rcu(&sdev->list, &svm->devs);
> >>> + out:
> >>> +	mutex_unlock(&pasid_mutex);
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> >>> +{
> >>> +	struct intel_svm_dev *sdev;
> >>> +	struct intel_iommu *iommu;
> >>> +	struct intel_svm *svm;
> >>> +	int ret = -EINVAL;
> >>> +
> >>> +	mutex_lock(&pasid_mutex);
> >>> +	iommu = intel_svm_device_to_iommu(dev);
> >>> +	if (!iommu)
> >>> +		goto out;
> >>> +
> >>> +	svm = ioasid_find(NULL, pasid, NULL);
> >>> +	if (IS_ERR(svm)) {
> >>> +		ret = PTR_ERR(svm);
> >>> +		goto out;
> >>> +	}
> >>> +
> >>> +	if (!svm)
> >>> +		goto out;
> >>> +
> >>> +	for_each_svm_dev() {
> >>> +		ret = 0;
> >>> +		sdev->users--;
> >>> +		if (!sdev->users) {
> >>> +			list_del_rcu(&sdev->list);
> >>> +			intel_pasid_tear_down_entry(iommu, dev,
> >>> svm->pasid);
> >>> +			/* TODO: Drain in flight PRQ for the
> >>> PASID since it
> >>> +			 * may get reused soon, we don't want to
> >>> +			 * confuse with its previous live.
> >>> +			 * intel_svm_drain_prq(dev, pasid);
> >>> +			 */
> >>> +			kfree_rcu(sdev, rcu);
> >>> +
> >>> +			if (list_empty(&svm->devs)) {
> >>> +				list_del(&svm->list);
> >>> +				kfree(svm);
> >>> +				/*
> >>> +				 * We do not free PASID here
> >>> until explicit call
> >>> +				 * from the guest to free.    
> >> can you be confident in the guest?  
> > No. But I have confident in the kernel VFIO code to manage guest
> > life cycle :)
> > I assume when a guest doesn't do unbind when it dies or unload a
> > assigned device, I expect VFIO to free all the PASIDs. VFIO needs to
> > police the PASID ownership anyway in order to make sure a PASID
> > assigned to guest A cannot be used to bind from guest B.
> > This is the flow I worked out with Yi, who is doing the VFIO part.
> > Any particular concerns?  
> No I just wanted to make sure someone is going to take care of the
> final tear down even if the userspace fails to do things as expected.
> Maybe adding a comment to explain who has the ownership of the final
> tear down would help here.
> 
I will add comments as follows:
/*
 * We do not free PASID here until explicit call
 * from VFIO to free. The PASID life cycle
 * management is largely tied to VFIO management
 * of assigned device life cycles. In case of
 * guest exit without a explicit free PASID call,
 * the responsibility lies in VFIO layer to free
 * the PASIDs allocated for the guest.
 * For security reasons, VFIO has to track the
 * PASID ownership per guest anyway to ensure
 * that PASID allocated by one guest cannot be
 * used by another.
 */

> Thanks
> 
> Eric
> >   
> >>> +				 */
> >>> +				ioasid_set_data(pasid, NULL);
> >>> +			}
> >>> +		}
> >>> +		break;
> >>> +	}
> >>> + out:
> >>> +	mutex_unlock(&pasid_mutex);
> >>> +
> >>> +	return ret;
> >>> +}
> >>>  
> >>>  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> >>> struct svm_dev_ops *ops) {
> >>> diff --git a/include/linux/intel-iommu.h
> >>> b/include/linux/intel-iommu.h index 48fa164..5d67d0d4 100644
> >>> --- a/include/linux/intel-iommu.h
> >>> +++ b/include/linux/intel-iommu.h
> >>> @@ -677,7 +677,9 @@ int intel_iommu_enable_pasid(struct
> >>> intel_iommu *iommu, struct device *dev); int
> >>> intel_svm_init(struct intel_iommu *iommu); extern int
> >>> intel_svm_enable_prq(struct intel_iommu *iommu); extern int
> >>> intel_svm_finish_prq(struct intel_iommu *iommu); -
> >>> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> >>> +		struct device *dev, struct gpasid_bind_data
> >>> *data); +extern int intel_svm_unbind_gpasid(struct device *dev,
> >>> int pasid); struct svm_dev_ops;
> >>>  
> >>>  struct intel_svm_dev {
> >>> @@ -693,12 +695,16 @@ struct intel_svm_dev {
> >>>  
> >>>  struct intel_svm {
> >>>  	struct mmu_notifier notifier;
> >>> -	struct mm_struct *mm;
> >>> +	union {
> >>> +		struct mm_struct *mm;
> >>> +		u64 gcr3;
> >>> +	};
> >>>  	struct intel_iommu *iommu;
> >>>  	int flags;
> >>>  	int pasid;
> >>>  	struct list_head devs;
> >>>  	struct list_head list;
> >>> +	refcount_t refs; /* # of devs bond to the PASID */    
> >> number of devices sharing the same PASID?  
> > more clear wording, thanks.  
> >>>  };
> >>>  
> >>>  extern struct intel_iommu *intel_svm_device_to_iommu(struct
> >>> device *dev); diff --git a/include/linux/intel-svm.h
> >>> b/include/linux/intel-svm.h index e3f7631..34b0a3b 100644
> >>> --- a/include/linux/intel-svm.h
> >>> +++ b/include/linux/intel-svm.h
> >>> @@ -52,6 +52,13 @@ struct svm_dev_ops {
> >>>   * do such IOTLB flushes automatically.
> >>>   */
> >>>  #define SVM_FLAG_SUPERVISOR_MODE	(1<<1)
> >>> +/*
> >>> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
> >>> to a device.    
> >> binds  
> > will fix
> >   
> >>> + * In this case the mm_struct is in the guest kernel or
> >>> userspace, its life
> >>> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> >>> API provides
> >>> + * means to bind/unbind guest CR3 with PASIDs allocated for a
> >>> device.
> >>> + */
> >>> +#define SVM_FLAG_GUEST_MODE	(1<<2)
> >>>  
> >>>  #ifdef CONFIG_INTEL_IOMMU_SVM
> >>>  
> >>>     
> >>
> >> Thanks
> >>
> >> Eric  
> > 
> > [Jacob Pan]
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-25 10:41     ` Jean-Philippe Brucker
@ 2019-04-30 20:24       ` Jacob Pan
  2019-05-01 17:40         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 74+ messages in thread
From: Jacob Pan @ 2019-04-30 20:24 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Auger Eric, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Tian, Kevin, Raj Ashok, Andriy Shevchenko,
	jacob.jun.pan

On Thu, 25 Apr 2019 11:41:05 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 25/04/2019 11:17, Auger Eric wrote:
> >> +/**
> >> + * ioasid_alloc - Allocate an IOASID
> >> + * @set: the IOASID set
> >> + * @min: the minimum ID (inclusive)
> >> + * @max: the maximum ID (exclusive)
> >> + * @private: data private to the caller
> >> + *
> >> + * Allocate an ID between @min and @max (or %0 and %INT_MAX).
> >> Return the  
> > I would remove "(or %0 and %INT_MAX)".  
> 
> Agreed, those where the default values of idr, but the xarray doesn't
> define a default max value. By the way, I do think squashing patches 6
> and 7 would be better (keeping my SOB but you can change the author).
> 
I will squash 6 and 7 in v3. I will just add my SOB but keep the
author if that is OK.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 06/19] drivers core: Add I/O ASID allocator
  2019-04-30 20:24       ` Jacob Pan
@ 2019-05-01 17:40         ` Jean-Philippe Brucker
  0 siblings, 0 replies; 74+ messages in thread
From: Jean-Philippe Brucker @ 2019-05-01 17:40 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Auger Eric, iommu, LKML, Joerg Roedel, David Woodhouse,
	Alex Williamson, Tian, Kevin, Raj Ashok, Andriy Shevchenko

On 30/04/2019 21:24, Jacob Pan wrote:
> On Thu, 25 Apr 2019 11:41:05 +0100
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
>> On 25/04/2019 11:17, Auger Eric wrote:
>>>> +/**
>>>> + * ioasid_alloc - Allocate an IOASID
>>>> + * @set: the IOASID set
>>>> + * @min: the minimum ID (inclusive)
>>>> + * @max: the maximum ID (exclusive)
>>>> + * @private: data private to the caller
>>>> + *
>>>> + * Allocate an ID between @min and @max (or %0 and %INT_MAX).
>>>> Return the  
>>> I would remove "(or %0 and %INT_MAX)".  
>>
>> Agreed, those where the default values of idr, but the xarray doesn't
>> define a default max value. By the way, I do think squashing patches 6
>> and 7 would be better (keeping my SOB but you can change the author).
>>
> I will squash 6 and 7 in v3. I will just add my SOB but keep the
> author if that is OK.

Sure, that works

Thanks,
Jean

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 08/19] ioasid: Add custom IOASID allocator
  2019-04-26 15:19         ` Jacob Pan
@ 2019-05-06 17:59           ` Jacob Pan
  0 siblings, 0 replies; 74+ messages in thread
From: Jacob Pan @ 2019-05-06 17:59 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Alex Williamson,
	Jean-Philippe Brucker, Yi Liu, Tian, Kevin, Raj Ashok,
	Christoph Hellwig, Lu Baolu, Andriy Shevchenko, jacob.jun.pan

On Fri, 26 Apr 2019 08:19:03 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> > >>> +		default_allocator_used = 1;      
> > >> shouldn't default_allocator_used be protected as well?    
> >  [...]    
> > >> wouldn't it be possible to integrate the default io asid
> > >> allocator as any custom allocator, ie. implement an alloc
> > >> callback using xa_alloc. Then the active io allocator could be
> > >> either a custom or a default one.    
> > > That is an interesting idea. I think it is possible.
> > > But since default xa allocator is internal to ioasid
> > > infrastructure, why implement it as a callback?    
> > 
> > I mean your could directly define a static const default_allocator
> > in ioasid.c and assign it by default. Do I miss something?
> >   
> got it, seems cleaner. let me give it a try.

Hi Eric,

Just sent out v3 last week. I did look into this but could not find a
clean way of making the default allocator as another custom allocator.
The reason is that default allocator is not interchangeable with
other custom allocators, XArray is shared. So it ends up having lots of
special cases anyway. Feel free to change this.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2019-05-06 17:57 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-23 23:31 [PATCH v2 00/19] Shared virtual address IOMMU and VT-d support Jacob Pan
2019-04-23 23:31 ` [PATCH v2 01/19] driver core: add per device iommu param Jacob Pan
2019-04-23 23:31 ` [PATCH v2 02/19] iommu: introduce device fault data Jacob Pan
2019-04-25 12:46   ` Jean-Philippe Brucker
2019-04-25 13:21     ` Auger Eric
2019-04-25 14:33       ` Jean-Philippe Brucker
2019-04-25 18:07         ` Jacob Pan
2019-04-23 23:31 ` [PATCH v2 03/19] iommu: introduce device fault report API Jacob Pan
2019-04-23 23:31 ` [PATCH v2 04/19] iommu: Introduce attach/detach_pasid_table API Jacob Pan
2019-04-23 23:31 ` [PATCH v2 05/19] iommu: Introduce cache_invalidate API Jacob Pan
2019-04-23 23:31 ` [PATCH v2 06/19] drivers core: Add I/O ASID allocator Jacob Pan
2019-04-24  6:19   ` Christoph Hellwig
2019-04-25 18:19     ` Jacob Pan
2019-04-26 11:47       ` Jean-Philippe Brucker
2019-04-26 12:21         ` Christoph Hellwig
2019-04-26 16:58           ` Jacob Pan
2019-04-25 10:17   ` Auger Eric
2019-04-25 10:41     ` Jean-Philippe Brucker
2019-04-30 20:24       ` Jacob Pan
2019-05-01 17:40         ` Jean-Philippe Brucker
2019-04-23 23:31 ` [PATCH v2 07/19] ioasid: Convert ioasid_idr to XArray Jacob Pan
2019-04-23 23:31 ` [PATCH v2 08/19] ioasid: Add custom IOASID allocator Jacob Pan
2019-04-25 10:03   ` Auger Eric
2019-04-25 21:29     ` Jacob Pan
2019-04-26  9:06       ` Auger Eric
2019-04-26 15:19         ` Jacob Pan
2019-05-06 17:59           ` Jacob Pan
2019-04-23 23:31 ` [PATCH v2 09/19] iommu/vt-d: Enlightened PASID allocation Jacob Pan
2019-04-24 17:27   ` Auger Eric
2019-04-25  7:12     ` Liu, Yi L
2019-04-25  7:40       ` Auger Eric
2019-04-25 23:01         ` Jacob Pan
2019-04-25 23:40     ` Jacob Pan
2019-04-26  7:24       ` Auger Eric
2019-04-26 15:05         ` Jacob Pan
2019-04-23 23:31 ` [PATCH v2 10/19] iommu/vt-d: Add custom allocator for IOASID Jacob Pan
2019-04-24 17:27   ` Auger Eric
2019-04-26 20:11     ` Jacob Pan
2019-04-23 23:31 ` [PATCH v2 11/19] iommu/vt-d: Replace Intel specific PASID allocator with IOASID Jacob Pan
2019-04-25 10:04   ` Auger Eric
     [not found]     ` <20190426140133.6d445315@jacob-builder>
2019-04-27  8:38       ` Auger Eric
2019-04-29 10:00         ` Jean-Philippe Brucker
2019-04-23 23:31 ` [PATCH v2 12/19] iommu/vt-d: Move domain helper to header Jacob Pan
2019-04-24 17:27   ` Auger Eric
2019-04-23 23:31 ` [PATCH v2 13/19] iommu/vt-d: Add nested translation support Jacob Pan
2019-04-26 15:42   ` Auger Eric
2019-04-26 21:57     ` Jacob Pan
2019-04-23 23:31 ` [PATCH v2 14/19] iommu: Add guest PASID bind function Jacob Pan
2019-04-26 15:53   ` Auger Eric
2019-04-26 22:11     ` Jacob Pan
2019-04-27  8:37       ` Auger Eric
2019-04-23 23:31 ` [PATCH v2 15/19] iommu/vt-d: Add bind guest PASID support Jacob Pan
2019-04-26 16:15   ` Auger Eric
2019-04-29 15:25     ` Jacob Pan
2019-04-30  7:05       ` Auger Eric
2019-04-30 17:49         ` Jacob Pan
2019-04-23 23:31 ` [PATCH v2 16/19] iommu/vtd: Clean up for SVM device list Jacob Pan
2019-04-26 16:19   ` Auger Eric
2019-04-23 23:31 ` [PATCH v2 17/19] iommu: Add max num of cache and granu types Jacob Pan
2019-04-26 16:22   ` Auger Eric
2019-04-29 16:17     ` Jacob Pan
2019-04-30  5:15       ` Auger Eric
2019-04-23 23:31 ` [PATCH v2 18/19] iommu/vt-d: Support flushing more translation cache types Jacob Pan
2019-04-27  9:04   ` Auger Eric
2019-04-29 21:29     ` Jacob Pan
2019-04-30  4:41       ` Auger Eric
2019-04-30 17:15         ` Jacob Pan
2019-04-30 17:41           ` Auger Eric
2019-04-23 23:31 ` [PATCH v2 19/19] iommu/vt-d: Add svm/sva invalidate function Jacob Pan
2019-04-26 17:23   ` Auger Eric
2019-04-29 22:41     ` Jacob Pan
2019-04-30  6:57       ` Auger Eric
2019-04-30 17:22         ` Jacob Pan
2019-04-30 17:36           ` Auger Eric

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).