LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
@ 2021-09-19  6:38 Liu Yi L
  2021-09-19  6:38 ` [RFC 01/20] iommu/iommufd: Add /dev/iommu core Liu Yi L
                   ` (21 more replies)
  0 siblings, 22 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
vDPA) to manage secure device access from the userspace. One critical task
of those frameworks is to put the assigned device in a secure, IOMMU-
protected context so user-initiated DMAs are prevented from doing harm to
the rest of the system.

Currently those frameworks implement their own logic for managing I/O page
tables to isolate user-initiated DMAs. This doesn't scale to support many
new IOMMU features, such as PASID-granular DMA remapping, nested translation,
I/O page fault, IOMMU dirty bit, etc.

/dev/iommu is introduced as an unified interface for managing I/O address
spaces and DMA isolation for passthrough devices. It's originated from the
upstream discussion for the vSVA enabling work[1].

This RFC aims to provide a basic skeleton for above proposal, w/o adding
any new feature beyond what vfio type1 provides today. For an overview of
future extensions, please refer to the full design proposal [2].

The core concepts in /dev/iommu are iommufd and ioasid. iommufd (by opening
/dev/iommu) is the container holding multiple I/O address spaces, while
ioasid is the fd-local software handle representing an I/O address space and
associated with a single I/O page table. User manages those address spaces
through fd operations, e.g. by using vfio type1v2 mapping semantics to manage
respective I/O page tables.

An I/O address space takes effect in the iommu only after it is attached by
a device. One I/O address space can be attached by multiple devices. One
device can be only attached to a single I/O address space in this RFC, to
match vfio type1 behavior as the starting point.

Device must be bound to an iommufd before attach operation can be conducted.
The binding operation builds the connection between the devicefd (opened via
device-passthrough framework) and iommufd. Most importantly, the entire
/dev/iommu framework adopts a device-centric model w/o carrying any container/
group legacy as current vfio does. This requires the binding operation also
establishes a security context which prevents the bound device from accessing
the rest of the system, as the contract for vfio to grant user access to the
assigned device. Detail explanation of this aspect can be found in patch 06.

Last, the format of an I/O page table must be compatible to the attached 
devices (or more specifically to the IOMMU which serves the DMA from the
attached devices). User is responsible for specifying the format when
allocating an IOASID, according to one or multiple devices which will be
attached right after. The device IOMMU format can be queried via iommufd
once a device is successfully bound to the iommufd. Attaching a device to
an IOASID with incompatible format is simply rejected.

The skeleton is mostly implemented in iommufd, except that bind_iommufd/
ioasid_attach operations are initiated via device-passthrough framework
specific uAPIs. This RFC only changes vfio to work with iommufd. vdpa
support can be added in a later stage.

Basically iommufd provides following uAPIs and helper functions:

- IOMMU_DEVICE_GET_INFO, for querying per-device iommu capability/format;
- IOMMU_IOASID_ALLOC/FREE, as the name stands;
- IOMMU_[UN]MAP_DMA, providing vfio type1v2 semantics for managing a
  specific I/O page table;
- helper functions for vfio to bind_iommufd/attach_ioasid with devices;

vfio extensions include:
- A new interface for user to open a device w/o using container/group uAPI;
- VFIO_DEVICE_BIND_IOMMUFD, for binding a vfio device to an iommufd;
  * unbind is automatically done when devicefd is closed;
- VFIO_DEVICE_[DE]ATTACH_IOASID, for attaching/detaching a vfio device
  to/from an ioasid in the specified iommufd;

[TODO in RFC v2]

We did one temporary hack in v1 by reusing vfio_iommu_type1.c to implement
IOMMU_[UN]MAP_DMA. This leads to some dirty code in patch 16/17/18. We
estimated almost 80% of the current type1 code are related to map/unmap.
It needs non-trivial effort for either duplicating it in iommufd or making
it shared by both vfio and iommufd. We hope this hack doesn't affect the
review of the overall skeleton, since the  role of this part is very clear.
Based on the received feedback we will make a clean implementation in v2.

For userspace our time doesn't afford a clean implementation in Qemu.
Instead, we just wrote a simple application (similar to the example in
iommufd.rst) and verified the basic work flow (bind/unbind, alloc/free
ioasid, attach/detach, map/unmap, multi-devices group, etc.). We did
verify the I/O page table mappings established as expected, though no
DMA is conducted. We plan to have a clean implementation in Qemu and
provide a public link for reference when v2 is sending out.

[TODO out of this RFC]

The entire /dev/iommu project involves lots of tasks. It has to grow in
a staging approach. Below is a rough list of TODO features. Most of them
can be developed in parallel after this skeleton is accepted. For more
detail please refer to the design proposal [2]:

1. Move more vfio device types to iommufd:
    * device which does no-snoop DMA
    * software mdev
    * PPC device
    * platform device

2. New vfio device type
    * hardware mdev/subdev (with PASID)

3. vDPA adoption

4. User-managed I/O page table
    * ioasid nesting (hardware)
    * ioasid nesting (software)
    * pasid virtualization
        o pdev (arm/amd)
        o pdev/mdev which doesn't support enqcmd (intel)
        o pdev/mdev which supports enqcmd (intel)
    * I/O page fault (stage-1)

5. Miscellaneous
    * I/O page fault (stage-2), for on-demand paging
    * IOMMU dirty bit, for hardware-assisted dirty page tracking
    * shared I/O page table (mm, ept, etc.)
    * vfio/vdpa shim to avoid code duplication for legacy uAPI
    * hardware-assisted vIOMMU

[1] https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
[2] https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/

[Series Overview]
* Basic skeleton:
  0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
  0002-vfio-Add-vfio-device-class-for-device-nodes.patch
  0003-vfio-Add-vfio_-un-register_device.patch
  0004-iommu-Add-iommu_device_get_info-interface.patch
  0005-vfio-pci-Register-device-centric-interface.patch

* Bind device fd with iommufd:
  0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
  0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
  0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
  0009-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
  0010-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch

* IOASID [de]attach:
  0011-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
  0012-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
  0013-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* /dev/iommu DMA (un)map:
  0014-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
  0015-iommu-iommufd-Report-iova-range-to-userspace.patch
  0016-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info:
  0017-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
  0018-Doc-Add-documentation-for-dev-iommu.patch
 
* Basic skeleton:
  0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
  0002-vfio-Add-device-class-for-dev-vfio-devices.patch
  0003-vfio-Add-vfio_-un-register_device.patch
  0004-iommu-Add-iommu_device_get_info-interface.patch
  0005-vfio-pci-Register-device-to-dev-vfio-devices.patch

* Bind device fd with iommufd:
  0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
  0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
  0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
  0009-iommu-Add-page-size-and-address-width-attributes.patch
  0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
  0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
  0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch

* IOASID [de]attach:
  0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
  0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
  0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* DMA (un)map:
  0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
  0017-iommu-iommufd-Report-iova-range-to-userspace.patch
  0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info in vt-d driver to enable whole series:
  0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
  0020-Doc-Add-documentation-for-dev-iommu.patch

Complete code can be found in:
https://github.com/luxis1999/dev-iommu/commits/dev-iommu-5.14-rfcv1

Thanks for your time!

Regards,
Yi Liu
---

Liu Yi L (15):
  iommu/iommufd: Add /dev/iommu core
  vfio: Add device class for /dev/vfio/devices
  vfio: Add vfio_[un]register_device()
  vfio/pci: Register device to /dev/vfio/devices
  iommu/iommufd: Add iommufd_[un]bind_device()
  vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  vfio/type1: Export symbols for dma [un]map code sharing
  iommu/iommufd: Report iova range to userspace
  iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
  Doc: Add documentation for /dev/iommu

Lu Baolu (5):
  iommu: Add iommu_device_get_info interface
  iommu: Add iommu_device_init[exit]_user_dma interfaces
  iommu: Add page size and address width attributes
  iommu: Extend iommu_at[de]tach_device() for multiple devices group
  iommu/vt-d: Implement device_info iommu_ops callback

 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 183 ++++++
 drivers/iommu/Kconfig                   |   1 +
 drivers/iommu/Makefile                  |   1 +
 drivers/iommu/intel/iommu.c             |  35 +
 drivers/iommu/iommu.c                   | 188 +++++-
 drivers/iommu/iommufd/Kconfig           |  11 +
 drivers/iommu/iommufd/Makefile          |   2 +
 drivers/iommu/iommufd/iommufd.c         | 832 ++++++++++++++++++++++++
 drivers/vfio/pci/Kconfig                |   1 +
 drivers/vfio/pci/vfio_pci.c             | 179 ++++-
 drivers/vfio/pci/vfio_pci_private.h     |  10 +
 drivers/vfio/vfio.c                     | 366 ++++++++++-
 drivers/vfio/vfio_iommu_type1.c         | 246 ++++++-
 include/linux/iommu.h                   |  35 +
 include/linux/iommufd.h                 |  71 ++
 include/linux/vfio.h                    |  27 +
 include/uapi/linux/iommu.h              | 162 +++++
 include/uapi/linux/vfio.h               |  56 ++
 19 files changed, 2358 insertions(+), 49 deletions(-)
 create mode 100644 Documentation/userspace-api/iommufd.rst
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd.c
 create mode 100644 include/linux/iommufd.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 15:41   ` Jason Gunthorpe
  2021-09-19  6:38 ` [RFC 02/20] vfio: Add device class for /dev/vfio/devices Liu Yi L
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

/dev/iommu aims to provide a unified interface for managing I/O address
spaces for devices assigned to userspace. This patch adds the initial
framework to create a /dev/iommu node. Each open of this node returns an
iommufd. And this fd is the handle for userspace to initiate its I/O
address space management.

One open:
- We call this feature as IOMMUFD in Kconfig in this RFC. However this
  name is not clear enough to indicate its purpose to user. Back to 2010
  vfio even introduced a /dev/uiommu [1] as the predecessor of its
  container concept. Is that a better name? Appreciate opinions here.

[1] https://lore.kernel.org/kvm/4c0eb470.1HMjondO00NIvFM6%25pugs@cisco.com/

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/Kconfig           |   1 +
 drivers/iommu/Makefile          |   1 +
 drivers/iommu/iommufd/Kconfig   |  11 ++++
 drivers/iommu/iommufd/Makefile  |   2 +
 drivers/iommu/iommufd/iommufd.c | 112 ++++++++++++++++++++++++++++++++
 5 files changed, 127 insertions(+)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 07b7c25cbed8..a83ce0acd09d 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -136,6 +136,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
 	bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..719c799f23ad 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
 obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
+obj-$(CONFIG_IOMMUFD) += iommufd/
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 000000000000..9fb7769a815d
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+	tristate "I/O Address Space management framework for passthrough devices"
+	select IOMMU_API
+	default n
+	help
+	  provides unified I/O address space management framework for
+	  isolating untrusted DMAs via devices which are passed through
+	  to userspace drivers.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index 000000000000..54381a01d003
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
new file mode 100644
index 000000000000..710b7e62988b
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * I/O Address Space Management for passthrough devices
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L <yi.l.liu@intel.com>
+ */
+
+#define pr_fmt(fmt)    "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/iommu.h>
+
+/* Per iommufd */
+struct iommufd_ctx {
+	refcount_t refs;
+};
+
+static int iommufd_fops_open(struct inode *inode, struct file *filep)
+{
+	struct iommufd_ctx *ictx;
+	int ret = 0;
+
+	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL);
+	if (!ictx)
+		return -ENOMEM;
+
+	refcount_set(&ictx->refs, 1);
+	filep->private_data = ictx;
+
+	return ret;
+}
+
+static void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+	if (refcount_dec_and_test(&ictx->refs))
+		kfree(ictx);
+}
+
+static int iommufd_fops_release(struct inode *inode, struct file *filep)
+{
+	struct iommufd_ctx *ictx = filep->private_data;
+
+	filep->private_data = NULL;
+
+	iommufd_ctx_put(ictx);
+
+	return 0;
+}
+
+static long iommufd_fops_unl_ioctl(struct file *filep,
+				   unsigned int cmd, unsigned long arg)
+{
+	struct iommufd_ctx *ictx = filep->private_data;
+	long ret = -EINVAL;
+
+	if (!ictx)
+		return ret;
+
+	switch (cmd) {
+	default:
+		pr_err_ratelimited("unsupported cmd %u\n", cmd);
+		break;
+	}
+	return ret;
+}
+
+static const struct file_operations iommufd_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iommufd_fops_open,
+	.release	= iommufd_fops_release,
+	.unlocked_ioctl	= iommufd_fops_unl_ioctl,
+};
+
+static struct miscdevice iommu_misc_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "iommu",
+	.fops = &iommufd_fops,
+	.nodename = "iommu",
+	.mode = 0666,
+};
+
+static int __init iommufd_init(void)
+{
+	int ret;
+
+	ret = misc_register(&iommu_misc_dev);
+	if (ret) {
+		pr_err("failed to register misc device\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit iommufd_exit(void)
+{
+	misc_deregister(&iommu_misc_dev);
+}
+
+module_init(iommufd_init);
+module_exit(iommufd_exit);
+
+MODULE_AUTHOR("Liu Yi L <yi.l.liu@intel.com>");
+MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
+MODULE_LICENSE("GPL v2");
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
  2021-09-19  6:38 ` [RFC 01/20] iommu/iommufd: Add /dev/iommu core Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 15:57   ` Jason Gunthorpe
                     ` (2 more replies)
  2021-09-19  6:38 ` [RFC 03/20] vfio: Add vfio_[un]register_device() Liu Yi L
                   ` (19 subsequent siblings)
  21 siblings, 3 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
userspace to directly open a vfio device w/o relying on container/group
(/dev/vfio/$GROUP). Anything related to group is now hidden behind
iommufd (more specifically in iommu core by this RFC) in a device-centric
manner.

In case a device is exposed in both legacy and new interfaces (see next
patch for how to decide it), this patch also ensures that when the device
is already opened via one interface then the other one must be blocked.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
 include/linux/vfio.h |   2 +
 2 files changed, 213 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 02cc51ce6891..84436d7abedd 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -46,6 +46,12 @@ static struct vfio {
 	struct mutex			group_lock;
 	struct cdev			group_cdev;
 	dev_t				group_devt;
+	/* Fields for /dev/vfio/devices interface */
+	struct class			*device_class;
+	struct cdev			device_cdev;
+	dev_t				device_devt;
+	struct mutex			device_lock;
+	struct idr			device_idr;
 } vfio;
 
 struct vfio_iommu_driver {
@@ -81,9 +87,11 @@ struct vfio_group {
 	struct list_head		container_next;
 	struct list_head		unbound_list;
 	struct mutex			unbound_lock;
-	atomic_t			opened;
-	wait_queue_head_t		container_q;
+	struct mutex			opened_lock;
+	u32				opened;
+	bool				opened_by_nongroup_dev;
 	bool				noiommu;
+	wait_queue_head_t		container_q;
 	unsigned int			dev_counter;
 	struct kvm			*kvm;
 	struct blocking_notifier_head	notifier;
@@ -327,7 +335,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
 	INIT_LIST_HEAD(&group->unbound_list);
 	mutex_init(&group->unbound_lock);
 	atomic_set(&group->container_users, 0);
-	atomic_set(&group->opened, 0);
+	mutex_init(&group->opened_lock);
 	init_waitqueue_head(&group->container_q);
 	group->iommu_group = iommu_group;
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -1489,10 +1497,53 @@ static long vfio_group_fops_unl_ioctl(struct file *filep,
 	return ret;
 }
 
+/*
+ * group->opened is used to ensure that the group can be opened only via
+ * one of the two interfaces (/dev/vfio/$GROUP and /dev/vfio/devices/
+ * $DEVICE) instead of both.
+ *
+ * We also introduce a new group flag to indicate whether this group is
+ * opened via /dev/vfio/devices/$DEVICE. For multi-devices group,
+ * group->opened also tracks how many devices have been opened in the
+ * group if the new flag is true.
+ *
+ * Also add a new lock since two flags are operated here.
+ */
+static int vfio_group_try_open(struct vfio_group *group, bool nongroup_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&group->opened_lock);
+	if (group->opened) {
+		if (nongroup_dev && group->opened_by_nongroup_dev)
+			group->opened++;
+		else
+			ret = -EBUSY;
+		goto out;
+	}
+
+	/*
+	 * Is something still in use from a previous open? Should
+	 * not allow new open if it is such case.
+	 */
+	if (group->container) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	group->opened = 1;
+	group->opened_by_nongroup_dev = nongroup_dev;
+
+out:
+	mutex_unlock(&group->opened_lock);
+
+	return ret;
+}
+
 static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 {
 	struct vfio_group *group;
-	int opened;
+	int ret;
 
 	group = vfio_group_get_from_minor(iminor(inode));
 	if (!group)
@@ -1503,18 +1554,10 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 		return -EPERM;
 	}
 
-	/* Do we need multiple instances of the group open?  Seems not. */
-	opened = atomic_cmpxchg(&group->opened, 0, 1);
-	if (opened) {
-		vfio_group_put(group);
-		return -EBUSY;
-	}
-
-	/* Is something still in use from a previous open? */
-	if (group->container) {
-		atomic_dec(&group->opened);
+	ret = vfio_group_try_open(group, false);
+	if (ret) {
 		vfio_group_put(group);
-		return -EBUSY;
+		return ret;
 	}
 
 	/* Warn if previous user didn't cleanup and re-init to drop them */
@@ -1534,7 +1577,9 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 
 	vfio_group_try_dissolve_container(group);
 
-	atomic_dec(&group->opened);
+	mutex_lock(&group->opened_lock);
+	group->opened--;
+	mutex_unlock(&group->opened_lock);
 
 	vfio_group_put(group);
 
@@ -1552,6 +1597,92 @@ static const struct file_operations vfio_group_fops = {
 /**
  * VFIO Device fd
  */
+static struct vfio_device *vfio_device_get_from_minor(int minor)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&vfio.device_lock);
+	device = idr_find(&vfio.device_idr, minor);
+	if (!device || !vfio_device_try_get(device)) {
+		mutex_unlock(&vfio.device_lock);
+		return NULL;
+	}
+	mutex_unlock(&vfio.device_lock);
+
+	return device;
+}
+
+static int vfio_device_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device;
+	struct vfio_group *group;
+	int ret, opened;
+
+	device = vfio_device_get_from_minor(iminor(inode));
+	if (!device)
+		return -ENODEV;
+
+	/*
+	 * Check whether the user has opened this device via the legacy
+	 * container/group interface. If yes, then prevent the user from
+	 * opening it via device node in /dev/vfio/devices. Otherwise,
+	 * mark the group as opened to block the group interface. either
+	 * way, we must ensure only one interface is used to open the
+	 * device when it supports both legacy and new interfaces.
+	 */
+	group = vfio_group_try_get(device->group);
+	if (group) {
+		ret = vfio_group_try_open(group, true);
+		if (ret)
+			goto err_group_try_open;
+	}
+
+	/*
+	 * No support of multiple instances of the device open, similar to
+	 * the policy on the group open.
+	 */
+	opened = atomic_cmpxchg(&device->opened, 0, 1);
+	if (opened) {
+		ret = -EBUSY;
+		goto err_device_try_open;
+	}
+
+	if (!try_module_get(device->dev->driver->owner)) {
+		ret = -ENODEV;
+		goto err_module_get;
+	}
+
+	ret = device->ops->open(device);
+	if (ret)
+		goto err_device_open;
+
+	filep->private_data = device;
+
+	if (group)
+		vfio_group_put(group);
+	return 0;
+err_device_open:
+	module_put(device->dev->driver->owner);
+err_module_get:
+	atomic_dec(&device->opened);
+err_device_try_open:
+	if (group) {
+		mutex_lock(&group->opened_lock);
+		group->opened--;
+		mutex_unlock(&group->opened_lock);
+	}
+err_group_try_open:
+	if (group)
+		vfio_group_put(group);
+	vfio_device_put(device);
+	return ret;
+}
+
+static bool vfio_device_in_container(struct vfio_device *device)
+{
+	return !!(device->group && device->group->container);
+}
+
 static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 {
 	struct vfio_device *device = filep->private_data;
@@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
 	module_put(device->dev->driver->owner);
 
-	vfio_group_try_dissolve_container(device->group);
+	if (vfio_device_in_container(device)) {
+		vfio_group_try_dissolve_container(device->group);
+	} else {
+		atomic_dec(&device->opened);
+		if (device->group) {
+			mutex_lock(&device->group->opened_lock);
+			device->group->opened--;
+			mutex_unlock(&device->group->opened_lock);
+		}
+	}
 
 	vfio_device_put(device);
 
@@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 
 static const struct file_operations vfio_device_fops = {
 	.owner		= THIS_MODULE,
+	.open		= vfio_device_fops_open,
 	.release	= vfio_device_fops_release,
 	.read		= vfio_device_fops_read,
 	.write		= vfio_device_fops_write,
@@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
 	.mode = S_IRUGO | S_IWUGO,
 };
 
+static char *vfio_device_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
+}
+
+static int vfio_init_device_class(void)
+{
+	int ret;
+
+	mutex_init(&vfio.device_lock);
+	idr_init(&vfio.device_idr);
+
+	/* /dev/vfio/devices/$DEVICE */
+	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
+	if (IS_ERR(vfio.device_class))
+		return PTR_ERR(vfio.device_class);
+
+	vfio.device_class->devnode = vfio_device_devnode;
+
+	ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1, "vfio-device");
+	if (ret)
+		goto err_alloc_chrdev;
+
+	cdev_init(&vfio.device_cdev, &vfio_device_fops);
+	ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK + 1);
+	if (ret)
+		goto err_cdev_add;
+	return 0;
+
+err_cdev_add:
+	unregister_chrdev_region(vfio.device_devt, MINORMASK + 1);
+err_alloc_chrdev:
+	class_destroy(vfio.device_class);
+	vfio.device_class = NULL;
+	return ret;
+}
+
+static void vfio_destroy_device_class(void)
+{
+	cdev_del(&vfio.device_cdev);
+	unregister_chrdev_region(vfio.device_devt, MINORMASK + 1);
+	class_destroy(vfio.device_class);
+	vfio.device_class = NULL;
+	idr_destroy(&vfio.device_idr);
+}
+
 static int __init vfio_init(void)
 {
 	int ret;
@@ -2329,6 +2516,10 @@ static int __init vfio_init(void)
 	if (ret)
 		goto err_cdev_add;
 
+	ret = vfio_init_device_class();
+	if (ret)
+		goto err_init_device_class;
+
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -2336,6 +2527,8 @@ static int __init vfio_init(void)
 #endif
 	return 0;
 
+err_init_device_class:
+	cdev_del(&vfio.group_cdev);
 err_cdev_add:
 	unregister_chrdev_region(vfio.group_devt, MINORMASK + 1);
 err_alloc_chrdev:
@@ -2358,6 +2551,7 @@ static void __exit vfio_cleanup(void)
 	unregister_chrdev_region(vfio.group_devt, MINORMASK + 1);
 	class_destroy(vfio.class);
 	vfio.class = NULL;
+	vfio_destroy_device_class();
 	misc_deregister(&vfio_dev);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a2c5b30e1763..4a5f3f99eab2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -24,6 +24,8 @@ struct vfio_device {
 	refcount_t refcount;
 	struct completion comp;
 	struct list_head group_next;
+	int minor;
+	atomic_t opened;
 };
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
  2021-09-19  6:38 ` [RFC 01/20] iommu/iommufd: Add /dev/iommu core Liu Yi L
  2021-09-19  6:38 ` [RFC 02/20] vfio: Add device class for /dev/vfio/devices Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 16:01   ` Jason Gunthorpe
  2021-09-29  2:43   ` David Gibson
  2021-09-19  6:38 ` [RFC 04/20] iommu: Add iommu_device_get_info interface Liu Yi L
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

With /dev/vfio/devices introduced, now a vfio device driver has three
options to expose its device to userspace:

a)  only legacy group interface, for devices which haven't been moved to
    iommufd (e.g. platform devices, sw mdev, etc.);

b)  both legacy group interface and new device-centric interface, for
    devices which supports iommufd but also wants to keep backward
    compatibility (e.g. pci devices in this RFC);

c)  only new device-centric interface, for new devices which don't carry
    backward compatibility burden (e.g. hw mdev/subdev with pasid);

This patch introduces vfio_[un]register_device() helpers for the device
drivers to specify the device exposure policy to vfio core. Hence the
existing vfio_[un]register_group_dev() become the wrapper of the new
helper functions. The new device-centric interface is described as
'nongroup' to differentiate from existing 'group' stuff.

TBD: this patch needs to rebase on top of below series from Christoph in
next version.

	"cleanup vfio iommu_group creation"

Legacy userspace continues to follow the legacy group interface.

Newer userspace can first try the new device-centric interface if the
device is present under /dev/vfio/devices. Otherwise fall back to the
group interface.

One open about how to organize the device nodes under /dev/vfio/devices/.
This RFC adopts a simple policy by keeping a flat layout with mixed devname
from all kinds of devices. The prerequisite of this model is that devnames
from different bus types are unique formats:

	/dev/vfio/devices/0000:00:14.2 (pci)
	/dev/vfio/devices/PNP0103:00 (platform)
	/dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)

One alternative option is to arrange device nodes in sub-directories based
on the device type. But doing so also adds one trouble to userspace. The
current vfio uAPI is designed to have the user query device type via
VFIO_DEVICE_GET_INFO after opening the device. With this option the user
instead needs to figure out the device type before opening the device, to
identify the sub-directory. Another tricky thing is that "pdev. vs. mdev"
and "pci vs. platform vs. ccw,..." are orthogonal categorizations. Need
more thoughts on whether both or just one category should be used to define
the sub-directories.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio.c  | 137 +++++++++++++++++++++++++++++++++++++++----
 include/linux/vfio.h |   9 +++
 2 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 84436d7abedd..1e87b25962f1 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -51,6 +51,7 @@ static struct vfio {
 	struct cdev			device_cdev;
 	dev_t				device_devt;
 	struct mutex			device_lock;
+	struct list_head		device_list;
 	struct idr			device_idr;
 } vfio;
 
@@ -757,7 +758,7 @@ void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(vfio_init_group_dev);
 
-int vfio_register_group_dev(struct vfio_device *device)
+static int __vfio_register_group_dev(struct vfio_device *device)
 {
 	struct vfio_device *existing_device;
 	struct iommu_group *iommu_group;
@@ -794,8 +795,13 @@ int vfio_register_group_dev(struct vfio_device *device)
 	/* Our reference on group is moved to the device */
 	device->group = group;
 
-	/* Refcounting can't start until the driver calls register */
-	refcount_set(&device->refcount, 1);
+	/*
+	 * Refcounting can't start until the driver call register. Don’t
+	 * start twice when the device is exposed in both group and nongroup
+	 * interfaces.
+	 */
+	if (!refcount_read(&device->refcount))
+		refcount_set(&device->refcount, 1);
 
 	mutex_lock(&group->device_lock);
 	list_add(&device->group_next, &group->device_list);
@@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device *device)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+static int __vfio_register_nongroup_dev(struct vfio_device *device)
+{
+	struct vfio_device *existing_device;
+	struct device *dev;
+	int ret = 0, minor;
+
+	mutex_lock(&vfio.device_lock);
+	list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
+		if (existing_device == device) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	}
+
+	minor = idr_alloc(&vfio.device_idr, device, 0, MINORMASK + 1, GFP_KERNEL);
+	pr_debug("%s - mnior: %d\n", __func__, minor);
+	if (minor < 0) {
+		ret = minor;
+		goto out_unlock;
+	}
+
+	dev = device_create(vfio.device_class, NULL,
+			    MKDEV(MAJOR(vfio.device_devt), minor),
+			    device, "%s", dev_name(device->dev));
+	if (IS_ERR(dev)) {
+		idr_remove(&vfio.device_idr, minor);
+		ret = PTR_ERR(dev);
+		goto out_unlock;
+	}
+
+	/*
+	 * Refcounting can't start until the driver call register. Don’t
+	 * start twice when the device is exposed in both group and nongroup
+	 * interfaces.
+	 */
+	if (!refcount_read(&device->refcount))
+		refcount_set(&device->refcount, 1);
+
+	device->minor = minor;
+	list_add(&device->vfio_next, &vfio.device_list);
+	dev_info(device->dev, "Creates Device interface successfully!\n");
+out_unlock:
+	mutex_unlock(&vfio.device_lock);
+	return ret;
+}
+
+int vfio_register_device(struct vfio_device *device, u32 flags)
+{
+	int ret = -EINVAL;
+
+	device->minor = -1;
+	device->group = NULL;
+	atomic_set(&device->opened, 0);
+
+	if (flags & ~(VFIO_DEVNODE_GROUP | VFIO_DEVNODE_NONGROUP))
+		return ret;
+
+	if (flags & VFIO_DEVNODE_GROUP) {
+		ret = __vfio_register_group_dev(device);
+		if (ret)
+			return ret;
+	}
+
+	if (flags & VFIO_DEVNODE_NONGROUP) {
+		ret = __vfio_register_nongroup_dev(device);
+		if (ret && device->group)
+			vfio_unregister_device(device);
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_register_device);
 
 /**
  * Get a reference to the vfio_device for a device.  Even if the
@@ -861,13 +938,14 @@ static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
 /*
  * Decrement the device reference count and wait for the device to be
  * removed.  Open file descriptors for the device... */
-void vfio_unregister_group_dev(struct vfio_device *device)
+void vfio_unregister_device(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
 	struct vfio_unbound_dev *unbound;
 	unsigned int i = 0;
 	bool interrupted = false;
 	long rc;
+	int minor = device->minor;
 
 	/*
 	 * When the device is removed from the group, the group suddenly
@@ -878,14 +956,20 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	 * solve this, we track such devices on the unbound_list to bridge
 	 * the gap until they're fully unbound.
 	 */
-	unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
-	if (unbound) {
-		unbound->dev = device->dev;
-		mutex_lock(&group->unbound_lock);
-		list_add(&unbound->unbound_next, &group->unbound_list);
-		mutex_unlock(&group->unbound_lock);
+	if (group) {
+		/*
+		 * If caller hasn't called vfio_register_group_dev(), this
+		 * branch is not necessary.
+		 */
+		unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
+		if (unbound) {
+			unbound->dev = device->dev;
+			mutex_lock(&group->unbound_lock);
+			list_add(&unbound->unbound_next, &group->unbound_list);
+			mutex_unlock(&group->unbound_lock);
+		}
+		WARN_ON(!unbound);
 	}
-	WARN_ON(!unbound);
 
 	vfio_device_put(device);
 	rc = try_wait_for_completion(&device->comp);
@@ -910,6 +994,21 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 		}
 	}
 
+	/* nongroup interface related cleanup */
+	if (minor >= 0) {
+		mutex_lock(&vfio.device_lock);
+		list_del(&device->vfio_next);
+		device->minor = -1;
+		device_destroy(vfio.device_class,
+			       MKDEV(MAJOR(vfio.device_devt), minor));
+		idr_remove(&vfio.device_idr, minor);
+		mutex_unlock(&vfio.device_lock);
+	}
+
+	/* No need go further if no group. */
+	if (!group)
+		return;
+
 	mutex_lock(&group->device_lock);
 	list_del(&device->group_next);
 	group->dev_counter--;
@@ -935,6 +1034,18 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	/* Matches the get in vfio_register_group_dev() */
 	vfio_group_put(group);
 }
+EXPORT_SYMBOL_GPL(vfio_unregister_device);
+
+int vfio_register_group_dev(struct vfio_device *device)
+{
+	return vfio_register_device(device, VFIO_DEVNODE_GROUP);
+}
+EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+void vfio_unregister_group_dev(struct vfio_device *device)
+{
+	vfio_unregister_device(device);
+}
 EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
 
 /**
@@ -2447,6 +2558,7 @@ static int vfio_init_device_class(void)
 
 	mutex_init(&vfio.device_lock);
 	idr_init(&vfio.device_idr);
+	INIT_LIST_HEAD(&vfio.device_list);
 
 	/* /dev/vfio/devices/$DEVICE */
 	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
@@ -2542,6 +2654,7 @@ static int __init vfio_init(void)
 static void __exit vfio_cleanup(void)
 {
 	WARN_ON(!list_empty(&vfio.group_list));
+	WARN_ON(!list_empty(&vfio.device_list));
 
 #ifdef CONFIG_VFIO_NOIOMMU
 	vfio_unregister_iommu_driver(&vfio_noiommu_ops);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 4a5f3f99eab2..9448b751b663 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -26,6 +26,7 @@ struct vfio_device {
 	struct list_head group_next;
 	int minor;
 	atomic_t opened;
+	struct list_head vfio_next;
 };
 
 /**
@@ -73,6 +74,14 @@ enum vfio_iommu_notify_type {
 	VFIO_IOMMU_CONTAINER_CLOSE = 0,
 };
 
+/* The device can be opened via VFIO_GROUP_GET_DEVICE_FD */
+#define VFIO_DEVNODE_GROUP	BIT(0)
+/* The device can be opened via /dev/sys/devices/${DEVICE} */
+#define VFIO_DEVNODE_NONGROUP	BIT(1)
+
+extern int vfio_register_device(struct vfio_device *device, u32 flags);
+extern void vfio_unregister_device(struct vfio_device *device);
+
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
  */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (2 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 03/20] vfio: Add vfio_[un]register_device() Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 16:19   ` Jason Gunthorpe
  2021-09-29  2:52   ` David Gibson
  2021-09-19  6:38 ` [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices Liu Yi L
                   ` (17 subsequent siblings)
  21 siblings, 2 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

This provides an interface for upper layers to get the per-device iommu
attributes.

    int iommu_device_get_info(struct device *dev,
                              enum iommu_devattr attr, void *data);

The first attribute (IOMMU_DEV_INFO_FORCE_SNOOP) is added. It tells if
the iommu can force DMA to snoop cache. At this stage, only PCI devices
which have this attribute set could use the iommufd, this is due to
supporting no-snoop DMA requires additional refactoring work on the
current kvm-vfio contract. The following patch will have vfio check this
attribute to decide whether a pci device can be exposed through
/dev/vfio/devices.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 16 ++++++++++++++++
 include/linux/iommu.h | 19 +++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 63f0af10c403..5ea3a007fd7c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3260,3 +3260,19 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 
 	return ret;
 }
+
+/* Expose per-device iommu attributes. */
+int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data)
+{
+	const struct iommu_ops *ops;
+
+	if (!dev->bus || !dev->bus->iommu_ops)
+		return -EINVAL;
+
+	ops = dev->bus->iommu_ops;
+	if (unlikely(!ops->device_info))
+		return -ENODEV;
+
+	return ops->device_info(dev, attr, data);
+}
+EXPORT_SYMBOL_GPL(iommu_device_get_info);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..52a6d33c82dc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -150,6 +150,14 @@ enum iommu_dev_features {
 	IOMMU_DEV_FEAT_IOPF,
 };
 
+/**
+ * enum iommu_devattr - Per device IOMMU attributes
+ * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
+ */
+enum iommu_devattr {
+	IOMMU_DEV_INFO_FORCE_SNOOP,
+};
+
 #define IOMMU_PASID_INVALID	(-1U)
 
 #ifdef CONFIG_IOMMU_API
@@ -215,6 +223,7 @@ struct iommu_iotlb_gather {
  *		- IOMMU_DOMAIN_IDENTITY: must use an identity domain
  *		- IOMMU_DOMAIN_DMA: must use a dma domain
  *		- 0: use the default setting
+ * @device_info: query per-device iommu attributes
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @owner: Driver module providing these ops
  */
@@ -283,6 +292,8 @@ struct iommu_ops {
 
 	int (*def_domain_type)(struct device *dev);
 
+	int (*device_info)(struct device *dev, enum iommu_devattr attr, void *data);
+
 	unsigned long pgsize_bitmap;
 	struct module *owner;
 };
@@ -604,6 +615,8 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
+int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -999,6 +1012,12 @@ static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
 {
 	return NULL;
 }
+
+static inline int iommu_device_get_info(struct device *dev,
+					enum iommu_devattr type, void *data)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (3 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 04/20] iommu: Add iommu_device_get_info interface Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 16:40   ` Jason Gunthorpe
  2021-09-19  6:38 ` [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces Liu Yi L
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch exposes the device-centric interface for vfio-pci devices. To
be compatiable with existing users, vfio-pci exposes both legacy group
interface and device-centric interface.

As explained in last patch, this change doesn't apply to devices which
cannot be forced to snoop cache by their upstream iommu. Such devices
are still expected to be opened via the legacy group interface.

When the device is opened via /dev/vfio/devices, vfio-pci should prevent
the user from accessing the assigned device because the device is still
attached to the default domain which may allow user-initiated DMAs to
touch arbitrary place. The user access must be blocked until the device
is later bound to an iommufd (see patch 08). The binding acts as the
contract for putting the device in a security context which ensures user-
initiated DMAs via this device cannot harm the rest of the system.

This patch introduces a vdev->block_access flag for this purpose. It's set
when the device is opened via /dev/vfio/devices and cleared after binding
to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
user access should be blocked or not.

An alternative option is to use a dummy fops when the device is opened and
then switch to the real fops (replace_fops()) after binding. Appreciate
inputs on which option is better.

The legacy group interface doesn't have this problem. Its uAPI requires the
user to first put the device into a security context via container/group
attaching process, before opening the device through the groupfd.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 25 +++++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 drivers/vfio/vfio.c                 |  3 ++-
 include/linux/vfio.h                |  1 +
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..145addde983b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -572,6 +572,10 @@ static int vfio_pci_open(struct vfio_device *core_vdev)
 
 		vfio_spapr_pci_eeh_open(vdev->pdev);
 		vfio_pci_vf_token_user_add(vdev, 1);
+		if (!vfio_device_in_container(core_vdev))
+			atomic_set(&vdev->block_access, 1);
+		else
+			atomic_set(&vdev->block_access, 0);
 	}
 	vdev->refcnt++;
 error:
@@ -1374,6 +1378,9 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
 
+	if (atomic_read(&vdev->block_access))
+		return -ENODEV;
+
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
 
@@ -1640,6 +1647,9 @@ static int vfio_pci_mmap(struct vfio_device *core_vdev, struct vm_area_struct *v
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
 
+	if (atomic_read(&vdev->block_access))
+		return -ENODEV;
+
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
@@ -1978,6 +1988,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct vfio_pci_device *vdev;
 	struct iommu_group *group;
 	int ret;
+	u32 flags;
+	bool snoop = false;
 
 	if (vfio_pci_is_denylisted(pdev))
 		return -EINVAL;
@@ -2046,9 +2058,18 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	ret = vfio_register_group_dev(&vdev->vdev);
-	if (ret)
+	flags = VFIO_DEVNODE_GROUP;
+	ret = iommu_device_get_info(&pdev->dev,
+				    IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+	if (!ret && snoop)
+		flags |= VFIO_DEVNODE_NONGROUP;
+
+	ret = vfio_register_device(&vdev->vdev, flags);
+	if (ret) {
+		pr_debug("Failed to register device interface\n");
 		goto out_power;
+	}
+
 	dev_set_drvdata(&pdev->dev, vdev);
 	return 0;
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5a36272cecbf..f12012e30b53 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -143,6 +143,7 @@ struct vfio_pci_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	atomic_t		block_access;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1e87b25962f1..22851747e92c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1789,10 +1789,11 @@ static int vfio_device_fops_open(struct inode *inode, struct file *filep)
 	return ret;
 }
 
-static bool vfio_device_in_container(struct vfio_device *device)
+bool vfio_device_in_container(struct vfio_device *device)
 {
 	return !!(device->group && device->group->container);
 }
+EXPORT_SYMBOL_GPL(vfio_device_in_container);
 
 static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 9448b751b663..fd0629acb948 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -81,6 +81,7 @@ enum vfio_iommu_notify_type {
 
 extern int vfio_register_device(struct vfio_device *device, u32 flags);
 extern void vfio_unregister_device(struct vfio_device *device);
+extern bool vfio_device_in_container(struct vfio_device *device);
 
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (4 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 17:09   ` Jason Gunthorpe
  2021-09-29  4:55   ` David Gibson
  2021-09-19  6:38 ` [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device() Liu Yi L
                   ` (15 subsequent siblings)
  21 siblings, 2 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

This extends iommu core to manage security context for passthrough
devices. Please bear a long explanation for how we reach this design
instead of managing it solely in iommufd like what vfio does today.

Devices which cannot be isolated from each other are organized into an
iommu group. When a device is assigned to the user space, the entire
group must be put in a security context so that user-initiated DMAs via
the assigned device cannot harm the rest of the system. No user access
should be granted on a device before the security context is established
for the group which the device belongs to.

Managing the security context must meet below criteria:

1)  The group is viable for user-initiated DMAs. This implies that the
    devices in the group must be either bound to a device-passthrough
    framework, or driver-less, or bound to a driver which is known safe
    (not do DMA).

2)  The security context should only allow DMA to the user's memory and
    devices in this group;

3)  After the security context is established for the group, the group
    viability must be continuously monitored before the user relinquishes
    all devices belonging to the group. The viability might be broken e.g.
    when a driver-less device is later bound to a driver which does DMA.

4)  The security context should not be destroyed before user access
    permission is withdrawn.

Existing vfio introduces explicit container/group semantics in its uAPI
to meet above requirements. A single security context (iommu domain)
is created per container. Attaching group to container moves the entire
group into the associated security context, and vice versa. The user can
open the device only after group attach. A group can be detached only
after all devices in the group are closed. Group viability is monitored
by listening to iommu group events.

Unlike vfio, iommufd adopts a device-centric design with all group
logistics hidden behind the fd. Binding a device to iommufd serves
as the contract to get security context established (and vice versa
for unbinding). One additional requirement in iommufd is to manage the
switch between multiple security contexts due to decoupled bind/attach:

1)  Open a device in "/dev/vfio/devices" with user access blocked;

2)  Bind the device to an iommufd with an initial security context
    (an empty iommu domain which blocks dma) established for its
    group, with user access unblocked;

3)  Attach the device to a user-specified ioasid (shared by all devices
    attached to this ioasid). Before attaching, the device should be first
    detached from the initial context;

4)  Detach the device from the ioasid and switch it back to the initial
    security context;

5)  Unbind the device from the iommufd, back to access blocked state and
    move its group out of the initial security context if it's the last
    unbound device in the group;

(multiple attach/detach could happen between 2 and 5).

However existing iommu core has problem with above transition. Detach
in step 3/4 makes the device/group re-attached to the default domain
automatically, which opens the door for user-initiated DMAs to attack
the rest of the system. The existing vfio doesn't have this problem as
it combines 2/3 in one step (so does 4/5).

Fixing this problem requires the iommu core to also participate in the
security context management. Following this direction we also move group
viability check into the iommu core, which allows iommufd to stay fully
device-centric w/o keeping any group knowledge (combining with the
extension to iommu_at[de]tach_device() in a latter patch).

Basically two new interfaces are provided:

        int iommu_device_init_user_dma(struct device *dev,
                        unsigned long owner);
        void iommu_device_exit_user_dma(struct device *dev);

iommufd calls them respectively when handling device binding/unbinding
requests.

The init_user_dma() for the 1st device in a group marks the entire group
for user-dma and establishes the initial security context (dma blocked)
according to aforementioned criteria. As long as the group is marked for
user-dma, auto-reattaching to default domain is disabled. Instead, upon
detaching the group is moved back to the initial security context.

The caller also provides an owner id to mark the ownership so inadvertent
attempt from another caller on the same device can be captured. In this
RFC iommufd will use the fd context pointer as the owner id.

The exit_user_dma() for the last device in the group clears the user-dma
mark and moves the group back to the default domain.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h |  12 ++++
 2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5ea3a007fd7c..bffd84e978fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,8 @@ struct iommu_group {
 	struct iommu_domain *default_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
+	unsigned long user_dma_owner_id;
+	refcount_t owner_cnt;
 };
 
 struct group_device {
@@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static bool iommu_group_user_dma_viable(struct iommu_group *group);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
 	 */
 	iommu_alloc_default_domain(group, dev);
 
-	if (group->default_domain) {
+	/*
+	 * If any device in the group has been initialized for user dma,
+	 * avoid attaching the default domain.
+	 */
+	if (group->default_domain && !group->user_dma_owner_id) {
 		ret = __iommu_attach_device(group->default_domain, dev);
 		if (ret) {
 			iommu_group_put(group);
@@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
 		break;
 	case BUS_NOTIFY_BOUND_DRIVER:
+		/*
+		 * FIXME: Alternatively the attached drivers could generically
+		 * indicate to the iommu layer that they are safe for keeping
+		 * the iommu group user viable by calling some function around
+		 * probe(). We could eliminate this gross BUG_ON() by denying
+		 * probe to non-iommu-safe driver.
+		 */
+		mutex_lock(&group->mutex);
+		if (group->user_dma_owner_id)
+			BUG_ON(!iommu_group_user_dma_viable(group));
+		mutex_unlock(&group->mutex);
 		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
 		break;
 	case BUS_NOTIFY_UNBIND_DRIVER:
@@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (group->default_domain && group->domain != group->default_domain)
+	/*
+	 * group->domain could be NULL when a domain is detached from the
+	 * group but the default_domain is not re-attached.
+	 */
+	if (group->domain && group->domain != group->default_domain)
 		return -EBUSY;
 
 	ret = __iommu_group_for_each_dev(group, domain,
@@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (!group->default_domain) {
+	/*
+	 * If any device in the group has been initialized for user dma,
+	 * avoid re-attaching the default domain.
+	 */
+	if (!group->default_domain || group->user_dma_owner_id) {
 		__iommu_group_for_each_dev(group, domain,
 					   iommu_group_do_detach_device);
 		group->domain = NULL;
@@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
 	return ops->device_info(dev, attr, data);
 }
 EXPORT_SYMBOL_GPL(iommu_device_get_info);
+
+/*
+ * IOMMU core interfaces for iommufd.
+ */
+
+/*
+ * FIXME: We currently simply follow vifo policy to mantain the group's
+ * viability to user. Eventually, we should avoid below hard-coded list
+ * by letting drivers indicate to the iommu layer that they are safe for
+ * keeping the iommu group's user aviability.
+ */
+static const char * const iommu_driver_allowed[] = {
+	"vfio-pci",
+	"pci-stub"
+};
+
+/*
+ * An iommu group is viable for use by userspace if all devices are in
+ * one of the following states:
+ *  - driver-less
+ *  - bound to an allowed driver
+ *  - a PCI interconnect device
+ */
+static int device_user_dma_viable(struct device *dev, void *data)
+{
+	struct device_driver *drv = READ_ONCE(dev->driver);
+
+	if (!drv)
+		return 0;
+
+	if (dev_is_pci(dev)) {
+		struct pci_dev *pdev = to_pci_dev(dev);
+
+		if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+			return 0;
+	}
+
+	return match_string(iommu_driver_allowed,
+			    ARRAY_SIZE(iommu_driver_allowed),
+			    drv->name) < 0;
+}
+
+static bool iommu_group_user_dma_viable(struct iommu_group *group)
+{
+	return !__iommu_group_for_each_dev(group, NULL, device_user_dma_viable);
+}
+
+static int iommu_group_init_user_dma(struct iommu_group *group,
+				     unsigned long owner)
+{
+	if (group->user_dma_owner_id) {
+		if (group->user_dma_owner_id != owner)
+			return -EBUSY;
+
+		refcount_inc(&group->owner_cnt);
+		return 0;
+	}
+
+	if (group->domain && group->domain != group->default_domain)
+		return -EBUSY;
+
+	if (!iommu_group_user_dma_viable(group))
+		return -EINVAL;
+
+	group->user_dma_owner_id = owner;
+	refcount_set(&group->owner_cnt, 1);
+
+	/* default domain is unsafe for user-initiated dma */
+	if (group->domain == group->default_domain)
+		__iommu_detach_group(group->default_domain, group);
+
+	return 0;
+}
+
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+	int ret;
+
+	if (!group || !owner)
+		return -ENODEV;
+
+	mutex_lock(&group->mutex);
+	ret = iommu_group_init_user_dma(group, owner);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_device_init_user_dma);
+
+static void iommu_group_exit_user_dma(struct iommu_group *group)
+{
+	if (refcount_dec_and_test(&group->owner_cnt)) {
+		group->user_dma_owner_id = 0;
+		if (group->default_domain)
+			__iommu_attach_group(group->default_domain, group);
+	}
+}
+
+void iommu_device_exit_user_dma(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (WARN_ON(!group || !group->user_dma_owner_id))
+		return;
+
+	mutex_lock(&group->mutex);
+	iommu_group_exit_user_dma(group);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_device_exit_user_dma);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 52a6d33c82dc..943de6897f56 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -617,6 +617,9 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
 int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);
 
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner);
+void iommu_device_exit_user_dma(struct device *dev);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -1018,6 +1021,15 @@ static inline int iommu_device_get_info(struct device *dev,
 {
 	return -ENODEV;
 }
+
+static inline int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_device_exit_user_dma(struct device *dev)
+{
+}
 #endif /* CONFIG_IOMMU_API */
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (5 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 17:14   ` Jason Gunthorpe
  2021-09-29  5:25   ` David Gibson
  2021-09-19  6:38 ` [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD Liu Yi L
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

Under the /dev/iommu model, iommufd provides the interface for I/O page
tables management such as dma map/unmap. However, it cannot work
independently since the device is still owned by the device-passthrough
frameworks (VFIO, vDPA, etc.) and vice versa. Device-passthrough frameworks
should build a connection between its device and the iommufd to delegate
the I/O page table management affairs to iommufd.

This patch introduces iommufd_[un]bind_device() helpers for the device-
passthrough framework to build such connection. The helper functions then
invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit
security context for the bound device. Each successfully bound device is
internally tracked by an iommufd_device object. This object is returned
to the caller for subsequent attaching operations on the device as well.

The caller should pass a user-provided cookie to mark the device in the
iommufd. Later this cookie will be used to represent the device in iommufd
uAPI, e.g. when querying device capabilities or handling per-device I/O
page faults. One alternative is to have iommufd allocate a device label
and return to the user. Either way works, but cookie is slightly preferred
per earlier discussion as it may allow the user to inject faults slightly
faster without ID->vRID lookup.

iommu_[un]bind_device() functions are only used for physical devices. Other
variants will be introduced in the future, e.g.:

-  iommu_[un]bind_device_pasid() for mdev/subdev which requires pasid granular
   DMA isolation;
-  iommu_[un]bind_sw_mdev() for sw mdev which relies on software measures
   instead of iommu to isolate DMA;

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 160 +++++++++++++++++++++++++++++++-
 include/linux/iommufd.h         |  38 ++++++++
 2 files changed, 196 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 710b7e62988b..e16ca21e4534 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -16,10 +16,30 @@
 #include <linux/miscdevice.h>
 #include <linux/mutex.h>
 #include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/xarray.h>
+#include <asm-generic/bug.h>
 
 /* Per iommufd */
 struct iommufd_ctx {
 	refcount_t refs;
+	struct mutex lock;
+	struct xarray device_xa; /* xarray of bound devices */
+};
+
+/*
+ * A iommufd_device object represents the binding relationship
+ * between iommufd and device. It is created per a successful
+ * binding request from device driver. The bound device must be
+ * a physical device so far. Subdevice will be supported later
+ * (with additional PASID information). An user-assigned cookie
+ * is also recorded to mark the device in the /dev/iommu uAPI.
+ */
+struct iommufd_device {
+	unsigned int id;
+	struct iommufd_ctx *ictx;
+	struct device *dev; /* always be the physical device */
+	u64 dev_cookie;
 };
 
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
 		return -ENOMEM;
 
 	refcount_set(&ictx->refs, 1);
+	mutex_init(&ictx->lock);
+	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
 	filep->private_data = ictx;
 
 	return ret;
 }
 
+static void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+	refcount_inc(&ictx->refs);
+}
+
+static const struct file_operations iommufd_fops;
+
+/**
+ * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
+ * @fd: [in] iommufd file descriptor.
+ *
+ * Returns a pointer to the iommufd context, otherwise NULL;
+ *
+ */
+static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
+{
+	struct fd f = fdget(fd);
+	struct file *file = f.file;
+	struct iommufd_ctx *ictx;
+
+	if (!file)
+		return NULL;
+
+	if (file->f_op != &iommufd_fops)
+		return NULL;
+
+	ictx = file->private_data;
+	if (ictx)
+		iommufd_ctx_get(ictx);
+	fdput(f);
+	return ictx;
+}
+
+/**
+ * iommufd_ctx_put - Releases a reference to the internal iommufd context.
+ * @ictx: [in] Pointer to iommufd context.
+ *
+ */
 static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
-	if (refcount_dec_and_test(&ictx->refs))
-		kfree(ictx);
+	if (!refcount_dec_and_test(&ictx->refs))
+		return;
+
+	WARN_ON(!xa_empty(&ictx->device_xa));
+	kfree(ictx);
 }
 
 static int iommufd_fops_release(struct inode *inode, struct file *filep)
@@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
 	.mode = 0666,
 };
 
+/**
+ * iommufd_bind_device - Bind a physical device marked by a device
+ *			 cookie to an iommu fd.
+ * @fd:		[in] iommufd file descriptor.
+ * @dev:	[in] Pointer to a physical device struct.
+ * @dev_cookie:	[in] A cookie to mark the device in /dev/iommu uAPI.
+ *
+ * A successful bind establishes a security context for the device
+ * and returns struct iommufd_device pointer. Otherwise returns
+ * error pointer.
+ *
+ */
+struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
+					   u64 dev_cookie)
+{
+	struct iommufd_ctx *ictx;
+	struct iommufd_device *idev;
+	unsigned long index;
+	unsigned int id;
+	int ret;
+
+	ictx = iommufd_ctx_fdget(fd);
+	if (!ictx)
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&ictx->lock);
+
+	/* check duplicate registration */
+	xa_for_each(&ictx->device_xa, index, idev) {
+		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
+			idev = ERR_PTR(-EBUSY);
+			goto out_unlock;
+		}
+	}
+
+	idev = kzalloc(sizeof(*idev), GFP_KERNEL);
+	if (!idev) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	/* Establish the security context */
+	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
+	if (ret)
+		goto out_free;
+
+	ret = xa_alloc(&ictx->device_xa, &id, idev,
+		       XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
+		       GFP_KERNEL);
+	if (ret) {
+		idev = ERR_PTR(ret);
+		goto out_user_dma;
+	}
+
+	idev->ictx = ictx;
+	idev->dev = dev;
+	idev->dev_cookie = dev_cookie;
+	idev->id = id;
+	mutex_unlock(&ictx->lock);
+
+	return idev;
+out_user_dma:
+	iommu_device_exit_user_dma(idev->dev);
+out_free:
+	kfree(idev);
+out_unlock:
+	mutex_unlock(&ictx->lock);
+	iommufd_ctx_put(ictx);
+
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(iommufd_bind_device);
+
+/**
+ * iommufd_unbind_device - Unbind a physical device from iommufd
+ *
+ * @idev: [in] Pointer to the internal iommufd_device struct.
+ *
+ */
+void iommufd_unbind_device(struct iommufd_device *idev)
+{
+	struct iommufd_ctx *ictx = idev->ictx;
+
+	mutex_lock(&ictx->lock);
+	xa_erase(&ictx->device_xa, idev->id);
+	mutex_unlock(&ictx->lock);
+	/* Exit the security context */
+	iommu_device_exit_user_dma(idev->dev);
+	kfree(idev);
+	iommufd_ctx_put(ictx);
+}
+EXPORT_SYMBOL_GPL(iommufd_unbind_device);
+
 static int __init iommufd_init(void)
 {
 	int ret;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 000000000000..1603a13937e9
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * IOMMUFD API definition
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L <yi.l.liu@intel.com>
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/device.h>
+
+#define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_DEVID_MIN	0
+
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif /* __LINUX_IOMMUFD_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (6 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device() Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 17:29   ` Jason Gunthorpe
  2021-09-29  6:00   ` David Gibson
  2021-09-19  6:38 ` [RFC 09/20] iommu: Add page size and address width attributes Liu Yi L
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
because it's implicitly done when the device fd is closed.

In concept a vfio device can be bound to multiple iommufds, each hosting
a subset of I/O address spaces attached by this device. However as a
starting point (matching current vfio), only one I/O address space is
supported per vfio device. It implies one device can only be attached
to one iommufd at this point.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Kconfig            |  1 +
 drivers/vfio/pci/vfio_pci.c         | 72 ++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |  8 ++++
 include/uapi/linux/vfio.h           | 30 ++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5e2e1b9a9fd3..3abfb098b4dc 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -5,6 +5,7 @@ config VFIO_PCI
 	depends on MMU
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
+	select IOMMUFD
 	help
 	  Support for the PCI VFIO bus driver.  This is required to make
 	  use of PCI drivers using the VFIO framework.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 145addde983b..20006bb66430 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
 			vdev->req_trigger = NULL;
 		}
 		mutex_unlock(&vdev->igate);
+
+		mutex_lock(&vdev->videv_lock);
+		if (vdev->videv) {
+			struct vfio_iommufd_device *videv = vdev->videv;
+
+			vdev->videv = NULL;
+			iommufd_unbind_device(videv->idev);
+			kfree(videv);
+		}
+		mutex_unlock(&vdev->videv_lock);
 	}
 
 	mutex_unlock(&vdev->reflck->lock);
@@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
 		container_of(core_vdev, struct vfio_pci_device, vdev);
 	unsigned long minsz;
 
-	if (cmd == VFIO_DEVICE_GET_INFO) {
+	if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {
+		struct vfio_device_iommu_bind_data bind_data;
+		unsigned long minsz;
+		struct iommufd_device *idev;
+		struct vfio_iommufd_device *videv;
+
+		/*
+		 * Reject the request if the device is already opened and
+		 * attached to a container.
+		 */
+		if (vfio_device_in_container(core_vdev))
+			return -ENOTTY;
+
+		minsz = offsetofend(struct vfio_device_iommu_bind_data, dev_cookie);
+
+		if (copy_from_user(&bind_data, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (bind_data.argsz < minsz ||
+		    bind_data.flags || bind_data.iommu_fd < 0)
+			return -EINVAL;
+
+		mutex_lock(&vdev->videv_lock);
+		/*
+		 * Allow only one iommufd per device until multiple
+		 * address spaces (e.g. vSVA) support is introduced
+		 * in the future.
+		 */
+		if (vdev->videv) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EBUSY;
+		}
+
+		idev = iommufd_bind_device(bind_data.iommu_fd,
+					   &vdev->pdev->dev,
+					   bind_data.dev_cookie);
+		if (IS_ERR(idev)) {
+			mutex_unlock(&vdev->videv_lock);
+			return PTR_ERR(idev);
+		}
+
+		videv = kzalloc(sizeof(*videv), GFP_KERNEL);
+		if (!videv) {
+			iommufd_unbind_device(idev);
+			mutex_unlock(&vdev->videv_lock);
+			return -ENOMEM;
+		}
+		videv->idev = idev;
+		videv->iommu_fd = bind_data.iommu_fd;
+		/*
+		 * A security context has been established. Unblock
+		 * user access.
+		 */
+		if (atomic_read(&vdev->block_access))
+			atomic_set(&vdev->block_access, 0);
+		vdev->videv = videv;
+		mutex_unlock(&vdev->videv_lock);
+
+		return 0;
+	} else if (cmd == VFIO_DEVICE_GET_INFO) {
 		struct vfio_device_info info;
 		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
 		unsigned long capsz;
@@ -2031,6 +2100,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	mutex_init(&vdev->vma_lock);
 	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
+	mutex_init(&vdev->videv_lock);
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index f12012e30b53..bd784accac35 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 #include <linux/uuid.h>
 #include <linux/notifier.h>
+#include <linux/iommufd.h>
 
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
@@ -99,6 +100,11 @@ struct vfio_pci_mmap_vma {
 	struct list_head	vma_next;
 };
 
+struct vfio_iommufd_device {
+	struct iommufd_device *idev;
+	int iommu_fd;
+};
+
 struct vfio_pci_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
@@ -144,6 +150,8 @@ struct vfio_pci_device {
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
 	atomic_t		block_access;
+	struct mutex		videv_lock;
+	struct vfio_iommufd_device *videv;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..c902abd60339 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -190,6 +190,36 @@ struct vfio_group_status {
 
 /* --------------- IOCTLs for DEVICE file descriptors --------------- */
 
+/*
+ * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
+ *				struct vfio_device_iommu_bind_data)
+ *
+ * Bind a vfio_device to the specified iommufd
+ *
+ * The user should provide a device cookie when calling this ioctl. The
+ * cookie is later used in iommufd for capability query, iotlb invalidation
+ * and I/O fault handling.
+ *
+ * User is not allowed to access the device before the binding operation
+ * is completed.
+ *
+ * Unbind is automatically conducted when device fd is closed.
+ *
+ * Input parameters:
+ *	- iommu_fd;
+ *	- dev_cookie;
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_iommu_bind_data {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommu_fd;
+	__u64	dev_cookie;
+};
+
+#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 09/20] iommu: Add page size and address width attributes
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (7 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-22 13:42   ` Eric Auger
  2021-09-19  6:38 ` [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO Liu Yi L
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

This exposes PAGE_SIZE and ADDR_WIDTH attributes. The iommufd could use
them to define the IOAS.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 943de6897f56..86d34e4ce05e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -153,9 +153,13 @@ enum iommu_dev_features {
 /**
  * enum iommu_devattr - Per device IOMMU attributes
  * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
+ * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu supports.
+ * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
  */
 enum iommu_devattr {
 	IOMMU_DEV_INFO_FORCE_SNOOP,
+	IOMMU_DEV_INFO_PAGE_SIZE,
+	IOMMU_DEV_INFO_ADDR_WIDTH,
 };
 
 #define IOMMU_PASID_INVALID	(-1U)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (8 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 09/20] iommu: Add page size and address width attributes Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 17:40   ` Jason Gunthorpe
                     ` (2 more replies)
  2021-09-19  6:38 ` [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE Liu Yi L
                   ` (11 subsequent siblings)
  21 siblings, 3 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

After a device is bound to the iommufd, userspace can use this interface
to query the underlying iommu capability and format info for this device.
Based on this information the user then creates I/O address space in a
compatible format with the to-be-attached devices.

Device cookie which is registered at binding time is used to mark the
device which is being queried here.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
 include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e16ca21e4534..641f199f2d41 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+static struct device *
+iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
+{
+	struct iommufd_device *idev;
+	struct device *dev = NULL;
+	unsigned long index;
+
+	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->device_xa, index, idev) {
+		if (idev->dev_cookie == dev_cookie) {
+			dev = idev->dev;
+			break;
+		}
+	}
+	mutex_unlock(&ictx->lock);
+
+	return dev;
+}
+
+static void iommu_device_build_info(struct device *dev,
+				    struct iommu_device_info *info)
+{
+	bool snoop;
+	u64 awidth, pgsizes;
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
+		info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
+		info->pgsize_bitmap = pgsizes;
+		info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
+	}
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
+		info->addr_width = awidth;
+		info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
+	}
+}
+
+static int iommufd_get_device_info(struct iommufd_ctx *ictx,
+				   unsigned long arg)
+{
+	struct iommu_device_info info;
+	unsigned long minsz;
+	struct device *dev;
+
+	minsz = offsetofend(struct iommu_device_info, addr_width);
+
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	info.flags = 0;
+
+	dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
+	if (!dev)
+		return -EINVAL;
+
+	iommu_device_build_info(dev, &info);
+
+	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 		return ret;
 
 	switch (cmd) {
+	case IOMMU_DEVICE_GET_INFO:
+		ret = iommufd_get_device_info(ictx, arg);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 59178fc229ca..76b71f9d6b34 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -7,6 +7,55 @@
 #define _UAPI_IOMMU_H
 
 #include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* -------- IOCTLs for IOMMU file descriptor (/dev/iommu) -------- */
+
+#define IOMMU_TYPE	(';')
+#define IOMMU_BASE	100
+
+/*
+ * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
+ *				struct iommu_device_info)
+ *
+ * Check IOMMU capabilities and format information on a bound device.
+ *
+ * The device is identified by device cookie (registered when binding
+ * this device).
+ *
+ * @argsz:	   user filled size of this data.
+ * @flags:	   tells userspace which capability info is available
+ * @dev_cookie:	   user assinged cookie.
+ * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
+ *		   bit in pgsize_bitmap[63:12] indicates a supported
+ *		   page size. Details as below table:
+ *
+ *		   +===============+============+
+ *		   |  Bit[index]   |  Page Size |
+ *		   +---------------+------------+
+ *		   |  12           |  4 KB      |
+ *		   +---------------+------------+
+ *		   |  13           |  8 KB      |
+ *		   +---------------+------------+
+ *		   |  14           |  16 KB     |
+ *		   +---------------+------------+
+ *		   ...
+ * @addr_width:    the address width of supported I/O address spaces.
+ *
+ * Availability: after device is bound to iommufd
+ */
+struct iommu_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
+#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
+#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_wdith field valid */
+	__u64	dev_cookie;
+	__u64   pgsize_bitmap;
+	__u32	addr_width;
+};
+
+#define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
 
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (9 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 17:44   ` Jason Gunthorpe
                     ` (2 more replies)
  2021-09-19  6:38 ` [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION Liu Yi L
                   ` (10 subsequent siblings)
  21 siblings, 3 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch adds IOASID allocation/free interface per iommufd. When
allocating an IOASID, userspace is expected to specify the type and
format information for the target I/O page table.

This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
implying a kernel-managed I/O page table with vfio type1v2 mapping
semantics. For this type the user should specify the addr_width of
the I/O address space and whether the I/O page table is created in
an iommu enfore_snoop format. enforce_snoop must be true at this point,
as the false setting requires additional contract with KVM on handling
WBINVD emulation, which can be added later.

Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
for what formats can be specified when allocating an IOASID.

Open:
- Devices on PPC platform currently use a different iommu driver in vfio.
  Per previous discussion they can also use vfio type1v2 as long as there
  is a way to claim a specific iova range from a system-wide address space.
  This requirement doesn't sound PPC specific, as addr_width for pci devices
  can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
  adopted this design yet. We hope to have formal alignment in v1 discussion
  and then decide how to incorporate it in v2.

- Currently ioasid term has already been used in the kernel (drivers/iommu/
  ioasid.c) to represent the hardware I/O address space ID in the wire. It
  covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
  ID). We need find a way to resolve the naming conflict between the hardware
  ID and software handle. One option is to rename the existing ioasid to be
  pasid or ssid, given their full names still sound generic. Appreciate more
  thoughts on this open!

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
 include/linux/iommufd.h         |   3 +
 include/uapi/linux/iommu.h      |  54 ++++++++++++++
 3 files changed, 177 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 641f199f2d41..4839f128b24a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -24,6 +24,7 @@
 struct iommufd_ctx {
 	refcount_t refs;
 	struct mutex lock;
+	struct xarray ioasid_xa; /* xarray of ioasids */
 	struct xarray device_xa; /* xarray of bound devices */
 };
 
@@ -42,6 +43,16 @@ struct iommufd_device {
 	u64 dev_cookie;
 };
 
+/* Represent an I/O address space */
+struct iommufd_ioas {
+	int ioasid;
+	u32 type;
+	u32 addr_width;
+	bool enforce_snoop;
+	struct iommufd_ctx *ictx;
+	refcount_t refs;
+};
+
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
 {
 	struct iommufd_ctx *ictx;
@@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
 
 	refcount_set(&ictx->refs, 1);
 	mutex_init(&ictx->lock);
+	xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
 	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
 	filep->private_data = ictx;
 
@@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 	if (!refcount_dec_and_test(&ictx->refs))
 		return;
 
+	WARN_ON(!xa_empty(&ictx->ioasid_xa));
 	WARN_ON(!xa_empty(&ictx->device_xa));
 	kfree(ictx);
 }
 
+/* Caller should hold ictx->lock */
+static void ioas_put_locked(struct iommufd_ioas *ioas)
+{
+	struct iommufd_ctx *ictx = ioas->ictx;
+	int ioasid = ioas->ioasid;
+
+	if (!refcount_dec_and_test(&ioas->refs))
+		return;
+
+	xa_erase(&ictx->ioasid_xa, ioasid);
+	iommufd_ctx_put(ictx);
+	kfree(ioas);
+}
+
+static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
+{
+	struct iommu_ioasid_alloc req;
+	struct iommufd_ioas *ioas;
+	unsigned long minsz;
+	int ioasid, ret;
+
+	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
+
+	if (copy_from_user(&req, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (req.argsz < minsz || !req.addr_width ||
+	    req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
+	    req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
+		return -EINVAL;
+
+	ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
+	if (!ioas)
+		return -ENOMEM;
+
+	mutex_lock(&ictx->lock);
+	ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
+		       XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
+		       GFP_KERNEL);
+	mutex_unlock(&ictx->lock);
+	if (ret) {
+		pr_err_ratelimited("Failed to alloc ioasid\n");
+		kfree(ioas);
+		return ret;
+	}
+
+	ioas->ioasid = ioasid;
+
+	/* only supports kernel managed I/O page table so far */
+	ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
+
+	ioas->addr_width = req.addr_width;
+
+	/* only supports enforce snoop today */
+	ioas->enforce_snoop = true;
+
+	iommufd_ctx_get(ictx);
+	ioas->ictx = ictx;
+
+	refcount_set(&ioas->refs, 1);
+
+	return ioasid;
+}
+
+static int iommufd_ioasid_free(struct iommufd_ctx *ictx, unsigned long arg)
+{
+	struct iommufd_ioas *ioas = NULL;
+	int ioasid, ret;
+
+	if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
+		return -EFAULT;
+
+	if (ioasid < 0)
+		return -EINVAL;
+
+	mutex_lock(&ictx->lock);
+	ioas = xa_load(&ictx->ioasid_xa, ioasid);
+	if (IS_ERR(ioas)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Disallow free if refcount is not 1 */
+	if (refcount_read(&ioas->refs) > 1) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	ioas_put_locked(ioas);
+out_unlock:
+	mutex_unlock(&ictx->lock);
+	return ret;
+};
+
 static int iommufd_fops_release(struct inode *inode, struct file *filep)
 {
 	struct iommufd_ctx *ictx = filep->private_data;
+	struct iommufd_ioas *ioas;
+	unsigned long index;
 
 	filep->private_data = NULL;
 
+	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->ioasid_xa, index, ioas)
+		ioas_put_locked(ioas);
+	mutex_unlock(&ictx->lock);
+
 	iommufd_ctx_put(ictx);
 
 	return 0;
@@ -195,6 +309,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 	case IOMMU_DEVICE_GET_INFO:
 		ret = iommufd_get_device_info(ictx, arg);
 		break;
+	case IOMMU_IOASID_ALLOC:
+		ret = iommufd_ioasid_alloc(ictx, arg);
+		break;
+	case IOMMU_IOASID_FREE:
+		ret = iommufd_ioasid_free(ictx, arg);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1603a13937e9..1dd6515e7816 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -14,6 +14,9 @@
 #include <linux/err.h>
 #include <linux/device.h>
 
+#define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_IOASID_MIN	0
+
 #define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_DEVID_MIN	0
 
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 76b71f9d6b34..5cbd300eb0ee 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -57,6 +57,60 @@ struct iommu_device_info {
 
 #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
 
+/*
+ * IOMMU_IOASID_ALLOC	- _IOWR(IOMMU_TYPE, IOMMU_BASE + 2,
+ *				struct iommu_ioasid_alloc)
+ *
+ * Allocate an IOASID.
+ *
+ * IOASID is the FD-local software handle representing an I/O address
+ * space. Each IOASID is associated with a single I/O page table. User
+ * must call this ioctl to get an IOASID for every I/O address space
+ * that is intended to be tracked by the kernel.
+ *
+ * User needs to specify the attributes of the IOASID and associated
+ * I/O page table format information according to one or multiple devices
+ * which will be attached to this IOASID right after. The I/O page table
+ * is activated in the IOMMU when it's attached by a device. Incompatible
+ * format between device and IOASID will lead to attaching failure in
+ * device side.
+ *
+ * Currently only one flag (IOMMU_IOASID_ENFORCE_SNOOP) is supported and
+ * must be always set.
+ *
+ * Only one I/O page table type (kernel-managed) is supported, with vfio
+ * type1v2 mapping semantics.
+ *
+ * User should call IOMMU_CHECK_EXTENSION for future extensions.
+ *
+ * @argsz:	    user filled size of this data.
+ * @flags:	    additional information for IOASID allocation.
+ * @type:	    I/O address space page table type.
+ * @addr_width:    address width of the I/O address space.
+ *
+ * Return: allocated ioasid on success, -errno on failure.
+ */
+struct iommu_ioasid_alloc {
+	__u32	argsz;
+	__u32	flags;
+#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
+	__u32	type;
+#define IOMMU_IOASID_TYPE_KERNEL_TYPE1V2	1
+	__u32	addr_width;
+};
+
+#define IOMMU_IOASID_ALLOC		_IO(IOMMU_TYPE, IOMMU_BASE + 2)
+
+/**
+ * IOMMU_IOASID_FREE - _IOWR(IOMMU_TYPE, IOMMU_BASE + 3, int)
+ *
+ * Free an IOASID.
+ *
+ * returns: 0 on success, -errno on failure
+ */
+
+#define IOMMU_IOASID_FREE		_IO(IOMMU_TYPE, IOMMU_BASE + 3)
+
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
 #define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (10 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 17:47   ` Jason Gunthorpe
  2021-09-19  6:38 ` [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group Liu Yi L
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

As aforementioned, userspace should check extension for what formats
can be specified when allocating an IOASID. This patch adds such
interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
support and no no-snoop support yet.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c |  7 +++++++
 include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 4839f128b24a..e45d76359e34 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 		return ret;
 
 	switch (cmd) {
+	case IOMMU_CHECK_EXTENSION:
+		switch (arg) {
+		case EXT_MAP_TYPE1V2:
+			return 1;
+		default:
+			return 0;
+		}
 	case IOMMU_DEVICE_GET_INFO:
 		ret = iommufd_get_device_info(ictx, arg);
 		break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 5cbd300eb0ee..49731be71213 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -14,6 +14,33 @@
 #define IOMMU_TYPE	(';')
 #define IOMMU_BASE	100
 
+/*
+ * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
+ *
+ * Check whether an uAPI extension is supported.
+ *
+ * It's unlikely that all planned capabilities in IOMMU fd will be ready
+ * in one breath. User should check which uAPI extension is supported
+ * according to its intended usage.
+ *
+ * A rough list of possible extensions may include:
+ *
+ *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
+ *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
+ *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
+ *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
+ *	- EXT_IOASID_NESTING for what the name stands;
+ *	- EXT_USER_PAGE_TABLE for user managed page table;
+ *	- EXT_USER_PASID_TABLE for user managed PASID table;
+ *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
+ *	- ...
+ *
+ * Return: 0 if not supported, 1 if supported.
+ */
+#define EXT_MAP_TYPE1V2		1
+#define EXT_DMA_NO_SNOOP	2
+#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)
+
 /*
  * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
  *				struct iommu_device_info)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (11 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-10-14  5:24   ` David Gibson
  2021-09-19  6:38 ` [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid() Liu Yi L
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

These two helpers could be used when 1) the iommu group is singleton,
or 2) the upper layer has put the iommu group into the secure state by
calling iommu_device_init_user_dma().

As we want the iommufd design to be a device-centric model, we want to
remove any group knowledge in iommufd. Given that we already have
iommu_at[de]tach_device() interface, we could extend it for iommufd
simply by doing below:

 - first device in a group does group attach;
 - last device in a group does group detach.

as long as the group has been put into the secure context.

The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device to
device with their own group") deliberately restricts the two interfaces
to single-device group. To avoid the conflict with existing usages, we
keep this policy and put the new extension only when the group has been
marked for user_dma.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bffd84e978fb..b6178997aef1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -47,6 +47,7 @@ struct iommu_group {
 	struct list_head entry;
 	unsigned long user_dma_owner_id;
 	refcount_t owner_cnt;
+	refcount_t attach_cnt;
 };
 
 struct group_device {
@@ -1994,7 +1995,7 @@ static int __iommu_attach_device(struct iommu_domain *domain,
 int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 {
 	struct iommu_group *group;
-	int ret;
+	int ret = 0;
 
 	group = iommu_group_get(dev);
 	if (!group)
@@ -2005,11 +2006,23 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 	 * change while we are attaching
 	 */
 	mutex_lock(&group->mutex);
-	ret = -EINVAL;
-	if (iommu_group_device_count(group) != 1)
+	if (group->user_dma_owner_id) {
+		if (group->domain) {
+			if (group->domain != domain)
+				ret = -EBUSY;
+			else
+				refcount_inc(&group->attach_cnt);
+
+			goto out_unlock;
+		}
+	} else if (iommu_group_device_count(group) != 1) {
+		ret = -EINVAL;
 		goto out_unlock;
+	}
 
 	ret = __iommu_attach_group(domain, group);
+	if (!ret && group->user_dma_owner_id)
+		refcount_set(&group->attach_cnt, 1);
 
 out_unlock:
 	mutex_unlock(&group->mutex);
@@ -2261,7 +2274,10 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 		return;
 
 	mutex_lock(&group->mutex);
-	if (iommu_group_device_count(group) != 1) {
+	if (group->user_dma_owner_id) {
+		if (!refcount_dec_and_test(&group->attach_cnt))
+			goto out_unlock;
+	} else if (iommu_group_device_count(group) != 1) {
 		WARN_ON(1);
 		goto out_unlock;
 	}
@@ -3368,6 +3384,7 @@ static int iommu_group_init_user_dma(struct iommu_group *group,
 
 	group->user_dma_owner_id = owner;
 	refcount_set(&group->owner_cnt, 1);
+	refcount_set(&group->attach_cnt, 0);
 
 	/* default domain is unsafe for user-initiated dma */
 	if (group->domain == group->default_domain)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (12 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 18:02   ` Jason Gunthorpe
  2021-09-19  6:38 ` [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID Liu Yi L
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

An I/O address space takes effect in the iommu only after it's attached
by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
helpers for this purpose. One device can be only attached to one ioasid
at this point, but one ioasid can be attached by multiple devices.

The caller specifies the iommufd_device (returned at binding time) and
the target ioasid when calling the helper function. Upon request, iommufd
installs the specified I/O page table to the correct place in the IOMMU,
according to the routing information (struct device* which represents
RID) recorded in iommufd_device. Future variants could allow the caller
to specify additional routing information (e.g. pasid/ssid) when multiple
I/O address spaces are supported per device.

Open:
Per Jason's comment in below link, bus-specific wrappers are recommended.
This RFC implements one wrapper for pci device. But it looks that struct
pci_device is not used at all since iommufd_ device already carries all
necessary info. So want to have another discussion on its necessity, e.g.
whether making more sense to have bus-specific wrappers for binding, while
leaving a common attaching helper per iommufd_device.
https://lore.kernel.org/linux-iommu/20210528233649.GB3816344@nvidia.com/

TODO:
When multiple devices are attached to a same ioasid, the permitted iova
ranges and supported pgsize bitmap on this ioasid should be a common
subset of all attached devices. iommufd needs to track such info per
ioasid and update it every time when a new device is attached to the
ioasid. This has not been done in this version yet, due to the temporary
hack adopted in patch 16-18. The hack reuses vfio type1 driver which
already includes the necessary logic for iova ranges and pgsize bitmap.
Once we get a clear direction for those patches, that logic will be moved
to this patch.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 226 ++++++++++++++++++++++++++++++++
 include/linux/iommufd.h         |  29 ++++
 2 files changed, 255 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e45d76359e34..25373a0e037a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -51,6 +51,19 @@ struct iommufd_ioas {
 	bool enforce_snoop;
 	struct iommufd_ctx *ictx;
 	refcount_t refs;
+	struct mutex lock;
+	struct list_head device_list;
+	struct iommu_domain *domain;
+};
+
+/*
+ * An ioas_device_info object is created per each successful attaching
+ * request. A list of objects are maintained per ioas when the address
+ * space is shared by multiple devices.
+ */
+struct ioas_device_info {
+	struct iommufd_device *idev;
+	struct list_head next;
 };
 
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -119,6 +132,21 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 	kfree(ictx);
 }
 
+static struct iommufd_ioas *ioasid_get_ioas(struct iommufd_ctx *ictx, int ioasid)
+{
+	struct iommufd_ioas *ioas;
+
+	if (ioasid < 0)
+		return NULL;
+
+	mutex_lock(&ictx->lock);
+	ioas = xa_load(&ictx->ioasid_xa, ioasid);
+	if (ioas)
+		refcount_inc(&ioas->refs);
+	mutex_unlock(&ictx->lock);
+	return ioas;
+}
+
 /* Caller should hold ictx->lock */
 static void ioas_put_locked(struct iommufd_ioas *ioas)
 {
@@ -128,11 +156,28 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
 	if (!refcount_dec_and_test(&ioas->refs))
 		return;
 
+	WARN_ON(!list_empty(&ioas->device_list));
 	xa_erase(&ictx->ioasid_xa, ioasid);
 	iommufd_ctx_put(ictx);
 	kfree(ioas);
 }
 
+/*
+ * Caller should hold a ictx reference when calling this function
+ * otherwise ictx might be freed in ioas_put_locked() then the last
+ * unlock becomes problematic. Alternatively we could have a fresh
+ * implementation of ioas_put instead of calling the locked function.
+ * In this case it can ensure ictx is freed after mutext_unlock().
+ */
+static void ioas_put(struct iommufd_ioas *ioas)
+{
+	struct iommufd_ctx *ictx = ioas->ictx;
+
+	mutex_lock(&ictx->lock);
+	ioas_put_locked(ioas);
+	mutex_unlock(&ictx->lock);
+}
+
 static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 {
 	struct iommu_ioasid_alloc req;
@@ -178,6 +223,9 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 	iommufd_ctx_get(ictx);
 	ioas->ictx = ictx;
 
+	mutex_init(&ioas->lock);
+	INIT_LIST_HEAD(&ioas->device_list);
+
 	refcount_set(&ioas->refs, 1);
 
 	return ioasid;
@@ -344,6 +392,166 @@ static struct miscdevice iommu_misc_dev = {
 	.mode = 0666,
 };
 
+/* Caller should hold ioas->lock */
+static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
+						 struct iommufd_device *idev)
+{
+	struct ioas_device_info *ioas_dev;
+
+	list_for_each_entry(ioas_dev, &ioas->device_list, next) {
+		if (ioas_dev->idev == idev)
+			return ioas_dev;
+	}
+
+	return NULL;
+}
+
+static void ioas_free_domain_if_empty(struct iommufd_ioas *ioas)
+{
+	if (list_empty(&ioas->device_list)) {
+		iommu_domain_free(ioas->domain);
+		ioas->domain = NULL;
+	}
+}
+
+static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
+					   struct device *dev)
+{
+	bool snoop = false;
+	u32 addr_width;
+	int ret;
+
+	/*
+	 * currently we only support I/O page table with iommu enforce-snoop
+	 * format. Attaching a device which doesn't support this format in its
+	 * upstreaming iommu is rejected.
+	 */
+	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+	if (ret || !snoop)
+		return -EINVAL;
+
+	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
+	if (ret || addr_width < ioas->addr_width)
+		return -EINVAL;
+
+	/* TODO: also need to check permitted iova ranges and pgsize bitmap */
+
+	return 0;
+}
+
+/**
+ * iommufd_device_attach_ioasid - attach device to an ioasid
+ * @idev: [in] Pointer to struct iommufd_device.
+ * @ioasid: [in] ioasid points to an I/O address space.
+ *
+ * Returns 0 for successful attach, otherwise returns error.
+ *
+ */
+int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)
+{
+	struct iommufd_ioas *ioas;
+	struct ioas_device_info *ioas_dev;
+	struct iommu_domain *domain;
+	int ret;
+
+	ioas = ioasid_get_ioas(idev->ictx, ioasid);
+	if (!ioas) {
+		pr_err_ratelimited("Trying to attach illegal or unkonwn IOASID %u\n", ioasid);
+		return -EINVAL;
+	}
+
+	mutex_lock(&ioas->lock);
+
+	/* Check for duplicates */
+	if (ioas_find_device(ioas, idev)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = ioas_check_device_compatibility(ioas, idev->dev);
+	if (ret)
+		goto out_unlock;
+
+	ioas_dev = kzalloc(sizeof(*ioas_dev), GFP_KERNEL);
+	if (!ioas_dev) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	/*
+	 * Each ioas is backed by an iommu domain, which is allocated
+	 * when the ioas is attached for the first time and then shared
+	 * by following devices.
+	 */
+	if (list_empty(&ioas->device_list)) {
+		struct iommu_domain *d;
+
+		d = iommu_domain_alloc(idev->dev->bus);
+		if (!d) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		ioas->domain = d;
+	}
+	domain = ioas->domain;
+
+	/* Install the I/O page table to the iommu for this device */
+	ret = iommu_attach_device(domain, idev->dev);
+	if (ret)
+		goto out_domain;
+
+	ioas_dev->idev = idev;
+	list_add(&ioas_dev->next, &ioas->device_list);
+	mutex_unlock(&ioas->lock);
+
+	return 0;
+out_domain:
+	ioas_free_domain_if_empty(ioas);
+out_free:
+	kfree(ioas_dev);
+out_unlock:
+	mutex_unlock(&ioas->lock);
+	ioas_put(ioas);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommufd_device_attach_ioasid);
+
+/**
+ * iommufd_device_detach_ioasid - Detach an ioasid from a device.
+ * @idev: [in] Pointer to struct iommufd_device.
+ * @ioasid: [in] ioasid points to an I/O address space.
+ *
+ */
+void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid)
+{
+	struct iommufd_ioas *ioas;
+	struct ioas_device_info *ioas_dev;
+
+	ioas = ioasid_get_ioas(idev->ictx, ioasid);
+	if (!ioas)
+		return;
+
+	mutex_lock(&ioas->lock);
+	ioas_dev = ioas_find_device(ioas, idev);
+	if (!ioas_dev) {
+		mutex_unlock(&ioas->lock);
+		goto out;
+	}
+
+	list_del(&ioas_dev->next);
+	iommu_detach_device(ioas->domain, idev->dev);
+	ioas_free_domain_if_empty(ioas);
+	kfree(ioas_dev);
+	mutex_unlock(&ioas->lock);
+
+	/* release the reference acquired at the start of this function */
+	ioas_put(ioas);
+out:
+	ioas_put(ioas);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_detach_ioasid);
+
 /**
  * iommufd_bind_device - Bind a physical device marked by a device
  *			 cookie to an iommu fd.
@@ -426,8 +634,26 @@ EXPORT_SYMBOL_GPL(iommufd_bind_device);
 void iommufd_unbind_device(struct iommufd_device *idev)
 {
 	struct iommufd_ctx *ictx = idev->ictx;
+	struct iommufd_ioas *ioas;
+	unsigned long index;
 
 	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->ioasid_xa, index, ioas) {
+		struct ioas_device_info *ioas_dev;
+
+		mutex_lock(&ioas->lock);
+		ioas_dev = ioas_find_device(ioas, idev);
+		if (!ioas_dev) {
+			mutex_unlock(&ioas->lock);
+			continue;
+		}
+		list_del(&ioas_dev->next);
+		iommu_detach_device(ioas->domain, idev->dev);
+		ioas_free_domain_if_empty(ioas);
+		kfree(ioas_dev);
+		mutex_unlock(&ioas->lock);
+		ioas_put_locked(ioas);
+	}
 	xa_erase(&ictx->device_xa, idev->id);
 	mutex_unlock(&ictx->lock);
 	/* Exit the security context */
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1dd6515e7816..01a4fe934143 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/device.h>
+#include <linux/pci.h>
 
 #define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_IOASID_MIN	0
@@ -27,6 +28,16 @@ struct iommufd_device *
 iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
 void iommufd_unbind_device(struct iommufd_device *idev);
 
+int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
+void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
+
+static inline int
+__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
+				   struct iommufd_device *idev, int ioasid)
+{
+	return iommufd_device_attach_ioasid(idev, ioasid);
+}
+
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_device *
 iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
@@ -37,5 +48,23 @@ iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
 static inline void iommufd_unbind_device(struct iommufd_device *idev)
 {
 }
+
+static inline int iommufd_device_attach_ioasid(struct iommufd_device *idev,
+					       int ioasid)
+{
+	return -ENODEV;
+}
+
+static inline void iommufd_device_detach_ioasid(struct iommufd_device *idev,
+						int ioasid)
+{
+}
+
+static inline int
+__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
+				   struct iommufd_device *idev, int ioasid)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif /* __LINUX_IOMMUFD_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (13 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid() Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 18:04   ` Jason Gunthorpe
  2021-09-19  6:38 ` [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing Liu Yi L
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch adds interface for userspace to attach device to specified
IOASID.

Note:
One device can only be attached to one IOASID in this version. This is
on par with what vfio provides today. In the future this restriction can
be relaxed when multiple I/O address spaces are supported per device

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 82 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 include/linux/iommufd.h             |  1 +
 include/uapi/linux/vfio.h           | 26 +++++++++
 4 files changed, 110 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 20006bb66430..5b1fda333122 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -557,6 +557,11 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
 		if (vdev->videv) {
 			struct vfio_iommufd_device *videv = vdev->videv;
 
+			if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+				iommufd_device_detach_ioasid(videv->idev,
+							     videv->ioasid);
+				videv->ioasid = IOMMUFD_INVALID_IOASID;
+			}
 			vdev->videv = NULL;
 			iommufd_unbind_device(videv->idev);
 			kfree(videv);
@@ -839,6 +844,7 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
 		}
 		videv->idev = idev;
 		videv->iommu_fd = bind_data.iommu_fd;
+		videv->ioasid = IOMMUFD_INVALID_IOASID;
 		/*
 		 * A security context has been established. Unblock
 		 * user access.
@@ -848,6 +854,82 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
 		vdev->videv = videv;
 		mutex_unlock(&vdev->videv_lock);
 
+		return 0;
+	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
+		struct vfio_device_attach_ioasid attach;
+		unsigned long minsz;
+		struct vfio_iommufd_device *videv;
+		int ret = 0;
+
+		/* not allowed if the device is opened in legacy interface */
+		if (vfio_device_in_container(core_vdev))
+			return -ENOTTY;
+
+		minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+		if (copy_from_user(&attach, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (attach.argsz < minsz || attach.flags ||
+		    attach.iommu_fd < 0 || attach.ioasid < 0)
+			return -EINVAL;
+
+		mutex_lock(&vdev->videv_lock);
+
+		videv = vdev->videv;
+		if (!videv || videv->iommu_fd != attach.iommu_fd) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EINVAL;
+		}
+
+		/* Currently only allows one IOASID attach */
+		if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EBUSY;
+		}
+
+		ret = __pci_iommufd_device_attach_ioasid(vdev->pdev,
+							 videv->idev,
+							 attach.ioasid);
+		if (!ret)
+			videv->ioasid = attach.ioasid;
+		mutex_unlock(&vdev->videv_lock);
+
+		return ret;
+	} else if (cmd == VFIO_DEVICE_DETACH_IOASID) {
+		struct vfio_device_attach_ioasid attach;
+		unsigned long minsz;
+		struct vfio_iommufd_device *videv;
+
+		/* not allowed if the device is opened in legacy interface */
+		if (vfio_device_in_container(core_vdev))
+			return -ENOTTY;
+
+		minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+		if (copy_from_user(&attach, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (attach.argsz < minsz || attach.flags ||
+		    attach.iommu_fd < 0 || attach.ioasid < 0)
+			return -EINVAL;
+
+		mutex_lock(&vdev->videv_lock);
+
+		videv = vdev->videv;
+		if (!videv || videv->iommu_fd != attach.iommu_fd) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EINVAL;
+		}
+
+		if (videv->ioasid == IOMMUFD_INVALID_IOASID ||
+		    videv->ioasid != attach.ioasid) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EINVAL;
+		}
+
+		videv->ioasid = IOMMUFD_INVALID_IOASID;
+		iommufd_device_detach_ioasid(videv->idev, attach.ioasid);
+		mutex_unlock(&vdev->videv_lock);
+
 		return 0;
 	} else if (cmd == VFIO_DEVICE_GET_INFO) {
 		struct vfio_device_info info;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index bd784accac35..daa0f08ac835 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -103,6 +103,7 @@ struct vfio_pci_mmap_vma {
 struct vfio_iommufd_device {
 	struct iommufd_device *idev;
 	int iommu_fd;
+	int ioasid;
 };
 
 struct vfio_pci_device {
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 01a4fe934143..36d8d2fd22bb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -17,6 +17,7 @@
 
 #define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_IOASID_MIN	0
+#define IOMMUFD_INVALID_IOASID	-1
 
 #define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_DEVID_MIN	0
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c902abd60339..61493ab03038 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -220,6 +220,32 @@ struct vfio_device_iommu_bind_data {
 
 #define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
 
+/*
+ * VFIO_DEVICE_ATTACH_IOASID - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *				struct vfio_device_attach_ioasid)
+ *
+ * Attach a vfio device to the specified IOASID
+ *
+ * Multiple vfio devices can be attached to the same IOASID. One device can
+ * be attached to only one ioasid at this point.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @iommu_fd:	iommufd where the ioasid comes from.
+ * @ioasid:	target I/O address space.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_ioasid {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommu_fd;
+	__s32	ioasid;
+};
+
+#define VFIO_DEVICE_ATTACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 20)
+#define VFIO_DEVICE_DETACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (14 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-21 18:14   ` Jason Gunthorpe
  2021-09-19  6:38 ` [RFC 17/20] iommu/iommufd: Report iova range to userspace Liu Yi L
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

[HACK. will fix in v2]

There are two options to impelement vfio type1v2 mapping semantics in
/dev/iommu.

One is to duplicate the related code from vfio as the starting point,
and then merge with vfio type1 at a later time. However vfio_iommu_type1.c
has over 3000LOC with ~80% related to dma management logic, including:

- the dma map/unmap metadata management
- page pinning, and related accounting
- iova range reporting
- dirty bitmap retrieving
- dynamic vaddr update, etc.

Not sure whether duplicating such amount of code in the transition phase
is acceptable.

The alternative is to consolidate type1v2 logic in /dev/iommu immediately,
which requires converting vfio_iommu_type1 to be a shim driver. The upside
is no code duplication and it is anyway the long-term goal even with the
first approach. The downside is that more effort is required for the
'initial' skeleton thus all new iommu features will be blocked for a longer
time. Main task is to figure out how to handle the remaining 20% code (tied
with group) in vfio_iommu_type1 with device-centric model in iommufd (with
group managed by iommu core). It also implies that no-snoop DMA must be
handled now with extra work on reworked kvm-vfio contract. and also need
to support external page pinning as required by sw mdev.

Due to limited time, we choose a hacky approach in this RFC by directly
calling vfio_iommu_type1 functions in iommufd and raising this open for
discussion. This should not impact the review on other key aspects of the
new framework. Once we reach consensus, we'll follow it to do a clean
implementation 'in' next version.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 199 +++++++++++++++++++++++++++++++-
 include/linux/vfio.h            |  13 +++
 2 files changed, 206 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4f7c174c7a..c1c6bc803d94 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -115,6 +115,7 @@ struct vfio_iommu_group {
 	struct list_head	next;
 	bool			mdev_group;	/* An mdev group */
 	bool			pinned_page_dirty_scope;
+	int			attach_cnt;
 };
 
 struct vfio_iova {
@@ -2240,6 +2241,135 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 	list_splice_tail(iova_copy, iova);
 }
 
+/* HACK: called by /dev/iommu core to init group to vfio_iommu_type1 */
+int vfio_iommu_add_group(struct vfio_iommu *iommu,
+			 struct iommu_group *iommu_group,
+			 struct iommu_domain *iommu_domain)
+{
+	struct vfio_iommu_group *group;
+	struct vfio_domain *domain = NULL;
+	struct bus_type *bus = NULL;
+	int ret = 0;
+	bool resv_msi, msi_remap;
+	phys_addr_t resv_msi_base = 0;
+	struct iommu_domain_geometry *geo;
+	LIST_HEAD(iova_copy);
+	LIST_HEAD(group_resv_regions);
+
+	/* Determine bus_type */
+	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
+	if (ret)
+		return ret;
+
+	mutex_lock(&iommu->lock);
+
+	/* Check for duplicates */
+	group = vfio_iommu_find_iommu_group(iommu, iommu_group);
+	if (group) {
+		group->attach_cnt++;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
+	/* Get aperture info */
+	geo = &iommu_domain->geometry;
+	if (vfio_iommu_aper_conflict(iommu, geo->aperture_start,
+				     geo->aperture_end)) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	ret = iommu_get_group_resv_regions(iommu_group, &group_resv_regions);
+	if (ret)
+		goto out_free;
+
+	if (vfio_iommu_resv_conflict(iommu, &group_resv_regions)) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	/*
+	 * We don't want to work on the original iova list as the list
+	 * gets modified and in case of failure we have to retain the
+	 * original list. Get a copy here.
+	 */
+	ret = vfio_iommu_iova_get_copy(iommu, &iova_copy);
+	if (ret)
+		goto out_free;
+
+	ret = vfio_iommu_aper_resize(&iova_copy, geo->aperture_start,
+				     geo->aperture_end);
+	if (ret)
+		goto out_free;
+
+	ret = vfio_iommu_resv_exclude(&iova_copy, &group_resv_regions);
+	if (ret)
+		goto out_free;
+
+	resv_msi = vfio_iommu_has_sw_msi(&group_resv_regions, &resv_msi_base);
+
+	msi_remap = irq_domain_check_msi_remap() ||
+		    iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
+
+	if (!allow_unsafe_interrupts && !msi_remap) {
+		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+		       __func__);
+		ret = -EPERM;
+		goto out_free;
+	}
+
+	if (resv_msi) {
+		ret = iommu_get_msi_cookie(iommu_domain, resv_msi_base);
+		if (ret && ret != -ENODEV)
+			goto out_free;
+	}
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	group->iommu_group = iommu_group;
+
+	if (!list_empty(&iommu->domain_list)) {
+		domain = list_first_entry(&iommu->domain_list,
+					  struct vfio_domain, next);
+	} else {
+		domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+		if (!domain) {
+			kfree(group);
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		domain->domain = iommu_domain;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&domain->next, &iommu->domain_list);
+	}
+
+	list_add(&group->next, &domain->group_list);
+
+	vfio_test_domain_fgsp(domain);
+
+	vfio_update_pgsize_bitmap(iommu);
+
+	/* Delete the old one and insert new iova list */
+	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
+
+	group->attach_cnt = 1;
+	mutex_unlock(&iommu->lock);
+	vfio_iommu_resv_free(&group_resv_regions);
+
+	return 0;
+
+out_free:
+	vfio_iommu_iova_free(&iova_copy);
+	vfio_iommu_resv_free(&group_resv_regions);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_add_group);
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
@@ -2557,6 +2687,59 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
 	return ret;
 }
 
+/* HACK: called by /dev/iommu core to remove group to vfio_iommu_type1 */
+void vfio_iommu_remove_group(struct vfio_iommu *iommu,
+			     struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_group *group;
+	struct vfio_domain *domain = NULL;
+	LIST_HEAD(iova_copy);
+
+	mutex_lock(&iommu->lock);
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+	group = find_iommu_group(domain, iommu_group);
+	if (!group) {
+		mutex_unlock(&iommu->lock);
+		return;
+	}
+
+	if (!--group->attach_cnt) {
+		mutex_unlock(&iommu->lock);
+		return;
+	}
+
+	/*
+	 * Get a copy of iova list. This will be used to update
+	 * and to replace the current one later. Please note that
+	 * we will leave the original list as it is if update fails.
+	 */
+	vfio_iommu_iova_get_copy(iommu, &iova_copy);
+
+	list_del(&group->next);
+	kfree(group);
+	/*
+	 * Group ownership provides privilege, if the device list is
+	 * empty, the domain goes away.
+	 */
+	if (list_empty(&domain->group_list)) {
+		WARN_ON(iommu->notifier.head);
+		vfio_iommu_unmap_unpin_all(iommu);
+		list_del(&domain->next);
+		kfree(domain);
+		vfio_iommu_aper_expand(iommu, &iova_copy);
+		vfio_update_pgsize_bitmap(iommu);
+	}
+
+	if (!vfio_iommu_resv_refresh(iommu, &iova_copy))
+		vfio_iommu_iova_insert_copy(iommu, &iova_copy);
+	else
+		vfio_iommu_iova_free(&iova_copy);
+
+	mutex_unlock(&iommu->lock);
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_remove_group);
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -2647,7 +2830,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 	mutex_unlock(&iommu->lock);
 }
 
-static void *vfio_iommu_type1_open(unsigned long arg)
+void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
 
@@ -2680,6 +2863,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 
 	return iommu;
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_open);
 
 static void vfio_release_domain(struct vfio_domain *domain, bool external)
 {
@@ -2697,7 +2881,7 @@ static void vfio_release_domain(struct vfio_domain *domain, bool external)
 		iommu_domain_free(domain->domain);
 }
 
-static void vfio_iommu_type1_release(void *iommu_data)
+void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
@@ -2720,6 +2904,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 
 	kfree(iommu);
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_release);
 
 static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
 {
@@ -2913,8 +3098,8 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
 			-EFAULT : 0;
 }
 
-static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
-				    unsigned long arg)
+int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
+			     unsigned long arg)
 {
 	struct vfio_iommu_type1_dma_map map;
 	unsigned long minsz;
@@ -2931,9 +3116,10 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
 
 	return vfio_dma_do_map(iommu, &map);
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_map_dma);
 
-static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
-				      unsigned long arg)
+int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
+			       unsigned long arg)
 {
 	struct vfio_iommu_type1_dma_unmap unmap;
 	struct vfio_bitmap bitmap = { 0 };
@@ -2984,6 +3170,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 	return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_unmap_dma);
 
 static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 					unsigned long arg)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index fd0629acb948..d904ee5a68cc 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -158,6 +158,19 @@ extern int vfio_dma_rw(struct vfio_group *group, dma_addr_t user_iova,
 
 extern struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group);
 
+struct vfio_iommu;
+extern void *vfio_iommu_type1_open(unsigned long arg);
+extern void vfio_iommu_type1_release(void *iommu_data);
+extern int vfio_iommu_add_group(struct vfio_iommu *iommu,
+				struct iommu_group *iommu_group,
+				struct iommu_domain *iommu_domain);
+extern void vfio_iommu_remove_group(struct vfio_iommu *iommu,
+				    struct iommu_group *iommu_group);
+extern int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
+				      unsigned long arg);
+extern int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
+				    unsigned long arg);
+
 /* each type has independent events */
 enum vfio_notify_type {
 	VFIO_IOMMU_NOTIFY = 0,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (15 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-22 14:49   ` Jean-Philippe Brucker
  2021-09-19  6:38 ` [RFC 18/20] iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID Liu Yi L
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

[HACK. will fix in v2]

IOVA range is critical info for userspace to manage DMA for an I/O address
space. This patch reports the valid iova range info of a given device.

Due to aforementioned hack, this info comes from the hacked vfio type1
driver. To follow the same format in vfio, we also introduce a cap chain
format in IOMMU_DEVICE_GET_INFO to carry the iova range info.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommu.c           |  2 ++
 drivers/iommu/iommufd/iommufd.c | 41 +++++++++++++++++++++++++++-
 drivers/vfio/vfio_iommu_type1.c | 47 ++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |  2 ++
 include/uapi/linux/iommu.h      |  3 +++
 5 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b6178997aef1..44bba346ab52 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2755,6 +2755,7 @@ void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 	if (ops && ops->get_resv_regions)
 		ops->get_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_get_resv_regions);
 
 void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 {
@@ -2763,6 +2764,7 @@ void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 	if (ops && ops->put_resv_regions)
 		ops->put_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_put_resv_regions);
 
 /**
  * generic_iommu_put_resv_regions - Reserved region driver helper
diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 25373a0e037a..cbf5e30062a6 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -19,6 +19,7 @@
 #include <linux/iommufd.h>
 #include <linux/xarray.h>
 #include <asm-generic/bug.h>
+#include <linux/vfio.h>
 
 /* Per iommufd */
 struct iommufd_ctx {
@@ -298,6 +299,38 @@ iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
 	return dev;
 }
 
+static int iommu_device_add_cap_chain(struct device *dev, unsigned long arg,
+				      struct iommu_device_info *info)
+{
+	struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+	int ret;
+
+	ret = vfio_device_add_iova_cap(dev, &caps);
+	if (ret)
+		return ret;
+
+	if (caps.size) {
+		info->flags |= IOMMU_DEVICE_INFO_CAPS;
+
+		if (info->argsz < sizeof(*info) + caps.size) {
+			info->argsz = sizeof(*info) + caps.size;
+		} else {
+			vfio_info_cap_shift(&caps, sizeof(*info));
+			if (copy_to_user((void __user *)arg +
+					sizeof(*info), caps.buf,
+					caps.size)) {
+				kfree(caps.buf);
+				info->flags &= ~IOMMU_DEVICE_INFO_CAPS;
+				return -EFAULT;
+			}
+			info->cap_offset = sizeof(*info);
+		}
+
+		kfree(caps.buf);
+	}
+	return 0;
+}
+
 static void iommu_device_build_info(struct device *dev,
 				    struct iommu_device_info *info)
 {
@@ -324,8 +357,9 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 	struct iommu_device_info info;
 	unsigned long minsz;
 	struct device *dev;
+	int ret;
 
-	minsz = offsetofend(struct iommu_device_info, addr_width);
+	minsz = offsetofend(struct iommu_device_info, cap_offset);
 
 	if (copy_from_user(&info, (void __user *)arg, minsz))
 		return -EFAULT;
@@ -341,6 +375,11 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 
 	iommu_device_build_info(dev, &info);
 
+	info.cap_offset = 0;
+	ret = iommu_device_add_cap_chain(dev, arg, &info);
+	if (ret)
+		pr_info_ratelimited("No cap chain added, error %d\n", ret);
+
 	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c1c6bc803d94..28c1699aed6b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2963,15 +2963,15 @@ static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
 	return 0;
 }
 
-static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
-				      struct vfio_info_cap *caps)
+static int vfio_iova_list_build_caps(struct list_head *iova_list,
+				     struct vfio_info_cap *caps)
 {
 	struct vfio_iommu_type1_info_cap_iova_range *cap_iovas;
 	struct vfio_iova *iova;
 	size_t size;
 	int iovas = 0, i = 0, ret;
 
-	list_for_each_entry(iova, &iommu->iova_list, list)
+	list_for_each_entry(iova, iova_list, list)
 		iovas++;
 
 	if (!iovas) {
@@ -2990,7 +2990,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 
 	cap_iovas->nr_iovas = iovas;
 
-	list_for_each_entry(iova, &iommu->iova_list, list) {
+	list_for_each_entry(iova, iova_list, list) {
 		cap_iovas->iova_ranges[i].start = iova->start;
 		cap_iovas->iova_ranges[i].end = iova->end;
 		i++;
@@ -3002,6 +3002,45 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
+				      struct vfio_info_cap *caps)
+{
+	return vfio_iova_list_build_caps(&iommu->iova_list, caps);
+}
+
+/* HACK: called by /dev/iommu core to build iova range cap for a device */
+int vfio_device_add_iova_cap(struct device *dev, struct vfio_info_cap *caps)
+{
+	u64 awidth;
+	dma_addr_t aperture_end;
+	LIST_HEAD(iova);
+	LIST_HEAD(dev_resv_regions);
+	int ret;
+
+	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth);
+	if (ret)
+		return ret;
+
+	/* FIXME: needs to use geometry info reported by iommu core. */
+	aperture_end = ((dma_addr_t)1) << awidth;
+
+	ret = vfio_iommu_iova_insert(&iova, 0, aperture_end);
+	if (ret)
+		return ret;
+
+	iommu_get_resv_regions(dev, &dev_resv_regions);
+	ret = vfio_iommu_resv_exclude(&iova, &dev_resv_regions);
+	if (ret)
+		goto out;
+
+	ret = vfio_iova_list_build_caps(&iova, caps);
+out:
+	vfio_iommu_iova_free(&iova);
+	iommu_put_resv_regions(dev, &dev_resv_regions);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_device_add_iova_cap);
+
 static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
 					   struct vfio_info_cap *caps)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index d904ee5a68cc..605b8e828be4 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -212,6 +212,8 @@ extern int vfio_info_add_capability(struct vfio_info_cap *caps,
 extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
 					      int num_irqs, int max_irq_type,
 					      size_t *data_size);
+extern int vfio_device_add_iova_cap(struct device *dev,
+				    struct vfio_info_cap *caps);
 
 struct pci_dev;
 #if IS_ENABLED(CONFIG_VFIO_SPAPR_EEH)
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 49731be71213..f408ad3c8ade 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -68,6 +68,7 @@
  *		   +---------------+------------+
  *		   ...
  * @addr_width:    the address width of supported I/O address spaces.
+ * @cap_offset:	   Offset within info struct of first cap
  *
  * Availability: after device is bound to iommufd
  */
@@ -77,9 +78,11 @@ struct iommu_device_info {
 #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
 #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
 #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_wdith field valid */
+#define IOMMU_DEVICE_INFO_CAPS		(1 << 3) /* info supports cap chain */
 	__u64	dev_cookie;
 	__u64   pgsize_bitmap;
 	__u32	addr_width;
+	__u32   cap_offset;
 };
 
 #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 18/20] iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (16 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 17/20] iommu/iommufd: Report iova range to userspace Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-19  6:38 ` [RFC 19/20] iommu/vt-d: Implement device_info iommu_ops callback Liu Yi L
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

[HACK. will fix in v2]

This patch introduces vfio type1v2-equivalent interface to userspace. Due
to aforementioned hack, iommufd currently calls exported vfio symbols to
handle map/unmap requests from the user.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 104 ++++++++++++++++++++++++++++++++
 include/uapi/linux/iommu.h      |  29 +++++++++
 2 files changed, 133 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index cbf5e30062a6..f5f2274d658c 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -55,6 +55,7 @@ struct iommufd_ioas {
 	struct mutex lock;
 	struct list_head device_list;
 	struct iommu_domain *domain;
+	struct vfio_iommu *vfio_iommu; /* FIXME: added for reusing vfio_iommu_type1 code */
 };
 
 /*
@@ -158,6 +159,7 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
 		return;
 
 	WARN_ON(!list_empty(&ioas->device_list));
+	vfio_iommu_type1_release(ioas->vfio_iommu); /* FIXME: reused vfio code */
 	xa_erase(&ictx->ioasid_xa, ioasid);
 	iommufd_ctx_put(ictx);
 	kfree(ioas);
@@ -185,6 +187,7 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 	struct iommufd_ioas *ioas;
 	unsigned long minsz;
 	int ioasid, ret;
+	struct vfio_iommu *vfio_iommu;
 
 	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
 
@@ -211,6 +214,18 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 		return ret;
 	}
 
+	/* FIXME: get a vfio_iommu object for dma map/unmap management */
+	vfio_iommu = vfio_iommu_type1_open(VFIO_TYPE1v2_IOMMU);
+	if (IS_ERR(vfio_iommu)) {
+		pr_err_ratelimited("Failed to get vfio_iommu object\n");
+		mutex_lock(&ictx->lock);
+		xa_erase(&ictx->ioasid_xa, ioasid);
+		mutex_unlock(&ictx->lock);
+		kfree(ioas);
+		return PTR_ERR(vfio_iommu);
+	}
+	ioas->vfio_iommu = vfio_iommu;
+
 	ioas->ioasid = ioasid;
 
 	/* only supports kernel managed I/O page table so far */
@@ -383,6 +398,49 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
+static int iommufd_process_dma_op(struct iommufd_ctx *ictx,
+				  unsigned long arg, bool map)
+{
+	struct iommu_ioasid_dma_op dma;
+	unsigned long minsz;
+	struct iommufd_ioas *ioas = NULL;
+	int ret;
+
+	minsz = offsetofend(struct iommu_ioasid_dma_op, padding);
+
+	if (copy_from_user(&dma, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (dma.argsz < minsz || dma.flags || dma.ioasid < 0)
+		return -EINVAL;
+
+	ioas = ioasid_get_ioas(ictx, dma.ioasid);
+	if (!ioas) {
+		pr_err_ratelimited("unkonwn IOASID %u\n", dma.ioasid);
+		return -EINVAL;
+	}
+
+	mutex_lock(&ioas->lock);
+
+	/*
+	 * Needs to block map/unmap request from userspace before IOASID
+	 * is attached to any device.
+	 */
+	if (list_empty(&ioas->device_list)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (map)
+		ret = vfio_iommu_type1_map_dma(ioas->vfio_iommu, arg + minsz);
+	else
+		ret = vfio_iommu_type1_unmap_dma(ioas->vfio_iommu, arg + minsz);
+out:
+	mutex_unlock(&ioas->lock);
+	ioas_put(ioas);
+	return ret;
+};
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -409,6 +467,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 	case IOMMU_IOASID_FREE:
 		ret = iommufd_ioasid_free(ictx, arg);
 		break;
+	case IOMMU_MAP_DMA:
+		ret = iommufd_process_dma_op(ictx, arg, true);
+		break;
+	case IOMMU_UNMAP_DMA:
+		ret = iommufd_process_dma_op(ictx, arg, false);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
@@ -478,6 +542,39 @@ static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
 	return 0;
 }
 
+/* HACK:
+ * vfio_iommu_add/remove_device() is hacky implementation for
+ * this version to add the device/group to vfio iommu type1.
+ */
+static int vfio_iommu_add_device(struct vfio_iommu *vfio_iommu,
+				 struct device *dev,
+				 struct iommu_domain *domain)
+{
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -EINVAL;
+
+	ret = vfio_iommu_add_group(vfio_iommu, group, domain);
+	iommu_group_put(group);
+	return ret;
+}
+
+static void vfio_iommu_remove_device(struct vfio_iommu *vfio_iommu,
+				     struct device *dev)
+{
+	struct iommu_group *group;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return;
+
+	vfio_iommu_remove_group(vfio_iommu, group);
+	iommu_group_put(group);
+}
+
 /**
  * iommufd_device_attach_ioasid - attach device to an ioasid
  * @idev: [in] Pointer to struct iommufd_device.
@@ -539,11 +636,17 @@ int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)
 	if (ret)
 		goto out_domain;
 
+	ret = vfio_iommu_add_device(ioas->vfio_iommu, idev->dev, domain);
+	if (ret)
+		goto out_detach;
+
 	ioas_dev->idev = idev;
 	list_add(&ioas_dev->next, &ioas->device_list);
 	mutex_unlock(&ioas->lock);
 
 	return 0;
+out_detach:
+	iommu_detach_device(domain, idev->dev);
 out_domain:
 	ioas_free_domain_if_empty(ioas);
 out_free:
@@ -579,6 +682,7 @@ void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid)
 	}
 
 	list_del(&ioas_dev->next);
+	vfio_iommu_remove_device(ioas->vfio_iommu, idev->dev);
 	iommu_detach_device(ioas->domain, idev->dev);
 	ioas_free_domain_if_empty(ioas);
 	kfree(ioas_dev);
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index f408ad3c8ade..fe815cc1f665 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -141,6 +141,35 @@ struct iommu_ioasid_alloc {
 
 #define IOMMU_IOASID_FREE		_IO(IOMMU_TYPE, IOMMU_BASE + 3)
 
+/*
+ * Map/unmap process virtual addresses to I/O virtual addresses.
+ *
+ * Provide VFIO type1 equivalent semantics. Start with the same
+ * restriction e.g. the unmap size should match those used in the
+ * original mapping call.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @ioasid:	the handle of target I/O address space.
+ * @data:	the operation payload, refer to vfio_iommu_type1_dma_{un}map.
+ *
+ * FIXME:
+ *	userspace needs to include uapi/vfio.h as well as interface reuses
+ *	the map/unmap logic from vfio iommu type1.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct iommu_ioasid_dma_op {
+	__u32	argsz;
+	__u32	flags;
+	__s32	ioasid;
+	__u32	padding;
+	__u8	data[];
+};
+
+#define IOMMU_MAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 4)
+#define IOMMU_UNMAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 5)
+
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
 #define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 19/20] iommu/vt-d: Implement device_info iommu_ops callback
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (17 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 18/20] iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-09-19  6:38 ` [RFC 20/20] Doc: Add documentation for /dev/iommu Liu Yi L
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

Expose per-device IOMMU attributes to the upper layers.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index dd22fc7d5176..d531ea44f418 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5583,6 +5583,40 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 	}
 }
 
+static int
+intel_iommu_device_info(struct device *dev, enum iommu_devattr type, void *data)
+{
+	struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
+	int ret = 0;
+
+	if (!iommu)
+		return -ENODEV;
+
+	switch (type) {
+	case IOMMU_DEV_INFO_PAGE_SIZE:
+		*(u64 *)data = SZ_4K |
+			(cap_super_page_val(iommu->cap) & BIT(0) ? SZ_2M : 0) |
+			(cap_super_page_val(iommu->cap) & BIT(1) ? SZ_1G : 0);
+		break;
+	case IOMMU_DEV_INFO_FORCE_SNOOP:
+		/*
+		 * Force snoop is always supported in the scalable mode. For the legacy
+		 * mode, check the capability register.
+		 */
+		*(bool *)data = sm_supported(iommu) || ecap_sc_support(iommu->ecap);
+		break;
+	case IOMMU_DEV_INFO_ADDR_WIDTH:
+		*(u32 *)data = min_t(u32, agaw_to_width(iommu->agaw),
+				     cap_mgaw(iommu->cap));
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.domain_alloc		= intel_iommu_domain_alloc,
@@ -5621,6 +5655,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.sva_get_pasid		= intel_svm_get_pasid,
 	.page_response		= intel_svm_page_response,
 #endif
+	.device_info		= intel_iommu_device_info,
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* [RFC 20/20] Doc: Add documentation for /dev/iommu
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (18 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 19/20] iommu/vt-d: Implement device_info iommu_ops callback Liu Yi L
@ 2021-09-19  6:38 ` Liu Yi L
  2021-10-29  0:15   ` David Gibson
  2021-09-19  6:45 ` [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu, Yi L
  2021-09-21 13:45 ` Jason Gunthorpe
  21 siblings, 1 reply; 280+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

Document the /dev/iommu framework for user.

Open:
Do we want to document /dev/iommu in Documentation/userspace-api/iommu.rst?
Existing iommu.rst is for the vSVA interfaces, honestly, may need to rewrite
this doc entirely.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 183 ++++++++++++++++++++++++
 2 files changed, 184 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..54df5a278023 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
    ebpf/index
    ioctl/index
    iommu
+   iommufd
    media/index
    sysfs-platform_profile
 
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 000000000000..abffbb47dc02
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. iommu:
+
+===================
+IOMMU Userspace API
+===================
+
+Direct device access from userspace has been a crtical feature in
+high performance computing and virtualization usages. Linux now
+includes multiple device-passthrough frameworks (e.g. VFIO and vDPA)
+to manage secure device access from the userspace. One critical
+task of those frameworks is to put the assigned device in a secure,
+IOMMU-protected context so the device is prevented from doing harm
+to the rest of the system.
+
+Currently those frameworks implement their own logic for managing
+I/O page tables to isolate user-initiated DMAs. This doesn't scale
+to support many new IOMMU features, such as PASID-granular DMA
+remapping, nested translation, I/O page fault, IOMMU dirty bit, etc.
+
+The /dev/iommu framework provides an unified interface for managing
+I/O page tables for passthrough devices. Existing passthrough
+frameworks are expected to use this interface instead of continuing
+their ad-hoc implementations.
+
+IOMMUFDs, IOASIDs, Devices and Groups
+-------------------------------------
+
+The core concepts in /dev/iommu are IOMMUFDs and IOASIDs. IOMMUFD (by
+opening /dev/iommu) is the container holding multiple I/O address
+spaces for a user, while IOASID is the fd-local software handle
+representing an I/O address space and associated with a single I/O
+page table. User manages those address spaces through fd operations,
+e.g. by using vfio type1v2 mapping semantics to manage respective
+I/O page tables.
+
+IOASID is comparable to the conatiner concept in VFIO. The latter
+is also associated to a single I/O address space. A main difference
+between them is that multiple IOASIDs in the same IOMMUFD can be
+nested together (not supported yet) to allow centralized accounting
+of locked pages, while multiple containers are disconnected thus
+duplicated accounting is incurred. Typically one IOMMUFD is
+sufficient for all intended IOMMU usages for a user.
+
+An I/O address space takes effect in the IOMMU only after it is
+attached by a device. One I/O address space can be attached by
+multiple devices. One device can be only attached to a single I/O
+address space at this point (on par with current vfio behavior).
+
+Device must be bound to an iommufd before the attach operation can
+be conducted. The binding operation builds the connection between
+the devicefd (opened via device-passthrough framework) and IOMMUFD.
+IOMMU-protected security context is esbliashed when the binding
+operation is completed. The passthrough framework must block user
+access to the assigned device until bind() returns success.
+
+The entire /dev/iommu framework adopts a device-centric model w/o
+carrying any container/group legacy as current vfio does. However
+the group is the minimum granularity that must be used to ensure
+secure user access (refer to vfio.rst). This framework relies on
+the IOMMU core layer to map device-centric model into group-granular
+isolation.
+
+Managing I/O Address Spaces
+---------------------------
+
+When creating an I/O address space (by allocating IOASID), the user
+must specify the type of underlying I/O page table. Currently only
+one type (kernel-managed) is supported. In the future other types
+will be introduced, e.g. to support user-managed I/O page table or
+a shared I/O page table which is managed by another kernel sub-
+system (mm, ept, etc.). Kernel-managed I/O page table is currently
+managed via vfio type1v2 equivalent mapping semantics.
+
+The user also needs to specify the format of the I/O page table
+when allocating an IOASID. The format must be compatible to the
+attached devices (or more specifically to the IOMMU which serves
+the DMA from the attached devices). User can query the device IOMMU
+format via IOMMUFD once a device is successfully bound. Attaching a
+device to an IOASID with incompatible format is simply rejected.
+
+Currently no-snoop DMA is not supported yet. This implies that
+IOASID must be created in an enforce-snoop format and only devices
+which can be forced to snoop cache by IOMMU are allowed to be
+attached to IOASID. The user should check uAPI extension and get
+device info via IOMMUFD to handle such restriction.
+
+Usage Example
+-------------
+
+Assume user wants to access PCI device 0000:06:0d.0, which is
+exposed under the new /dev/vfio/devices directory by VFIO:
+
+	/* Open device-centric interface and /dev/iommu interface */
+	device_fd = open("/dev/vfio/devices/0000:06:0d.0", O_RDWR);
+	iommu_fd = open("/dev/iommu", O_RDWR);
+
+	/* Bind device to IOMMUFD */
+	bind_data = { .iommu_fd = iommu_fd, .dev_cookie = cookie };
+	ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind_data);
+
+	/* Query per-device IOMMU capability/format */
+	info = { .dev_cookie = cookie, };
+	ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);
+
+	if (!(info.flags & IOMMU_DEVICE_INFO_ENFORCE_SNOOP)) {
+		if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION,
+				EXT_DMA_NO_SNOOP))
+			/* No support of no-snoop DMA */
+	}
+
+	if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2))
+		/* No support of vfio type1v2 mapping semantics */
+
+	/* Decides IOASID alloc fields based on info */
+	alloc_data = { .type = IOMMU_IOASID_TYPE_KERNEL,
+		       .flags = IOMMU_IOASID_ENFORCE_SNOOP,
+		       .addr_width = info.addr_width, };
+
+	/* Allocate IOASID */
+	gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
+
+	/* Attach device to an IOASID */
+	at_data = { .iommu_fd = iommu_fd; .ioasid = gpa_ioasid};
+	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at_data);
+
+	/* Setup GPA mapping [0 - 1GB] */
+	dma_map = {
+		.ioasid	= gpa_ioasid,
+		.data {
+			.flags  = R/W		/* permission */
+			.iova	= 0,		/* GPA */
+			.vaddr	= 0x40000000,	/* HVA */
+			.size	= 1GB,
+		},
+	};
+	ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
+
+	/* DMA */
+
+	/* Unmap GPA mapping [0 - 1GB] */
+	dma_unmap = {
+		.ioasid	= gpa_ioasid,
+		.data {
+			.iova	= 0,		/* GPA */
+			.size	= 1GB,
+		},
+	};
+	ioctl(iommu_fd, IOMMU_UNMAP_DMA, &dma_unmap);
+
+	/* Detach device from an IOASID */
+	dt_data = { .iommu_fd = iommu_fd; .ioasid = gpa_ioasid};
+	ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID, &dt_data);
+
+	/* Free IOASID */
+	ioctl(iommu_fd, IOMMU_IOASID_FREE, gpa_ioasid);
+
+	close(device_fd);
+	close(iommu_fd);
+
+API for device-passthrough frameworks
+-------------------------------------
+
+iommufd binding and IOASID attach/detach are initiated via the device-
+passthrough framework uAPI.
+
+When a binding operation is requested by the user, the passthrough
+framework should call iommufd_bind_device(). When the device fd is
+closed by the user, iommufd_unbind_device() should be called
+automatically::
+
+	struct iommufd_device *
+	iommufd_bind_device(int fd, struct device *dev,
+			   u64 dev_cookie);
+	void iommufd_unbind_device(struct iommufd_device *idev);
+
+IOASID attach/detach operations are per iommufd_device which is
+returned by iommufd_bind_device():
+
+	int iommufd_device_attach_ioasid(struct iommufd_device *idev,
+					int ioasid);
+	void iommufd_device_detach_ioasid(struct iommufd_device *idev,
+					int ioasid);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (19 preceding siblings ...)
  2021-09-19  6:38 ` [RFC 20/20] Doc: Add documentation for /dev/iommu Liu Yi L
@ 2021-09-19  6:45 ` Liu, Yi L
  2021-09-21 13:45 ` Jason Gunthorpe
  21 siblings, 0 replies; 280+ messages in thread
From: Liu, Yi L @ 2021-09-19  6:45 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, Tian, Kevin, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao,
	Jiang, Dave, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, baolu.lu, david, nicolinc

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, September 19, 2021 2:38 PM
[...]
> [Series Overview]
>
> * Basic skeleton:
>   0001-iommu-iommufd-Add-dev-iommu-core.patch
> 
> * VFIO PCI creates device-centric interface:
>   0002-vfio-Add-device-class-for-dev-vfio-devices.patch
>   0003-vfio-Add-vfio_-un-register_device.patch
>   0004-iommu-Add-iommu_device_get_info-interface.patch
>   0005-vfio-pci-Register-device-to-dev-vfio-devices.patch
> 
> * Bind device fd with iommufd:
>   0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
>   0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
>   0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch
> 
> * IOASID allocation:
>   0009-iommu-Add-page-size-and-address-width-attributes.patch
>   0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
>   0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
>   0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch
> 
> * IOASID [de]attach:
>   0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
>   0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
>   0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch
> 
> * DMA (un)map:
>   0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
>   0017-iommu-iommufd-Report-iova-range-to-userspace.patch
>   0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch
> 
> * Report the device info in vt-d driver to enable whole series:
>   0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch
> 
> * Add doc:
>   0020-Doc-Add-documentation-for-dev-iommu.patch

Please refer to the above patch overview. sorry for the duplicated contents.

thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
  2021-09-19  6:38 [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu Yi L
                   ` (20 preceding siblings ...)
  2021-09-19  6:45 ` [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management Liu, Yi L
@ 2021-09-21 13:45 ` Jason Gunthorpe
  2021-09-22  3:25   ` Liu, Yi L
  21 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 13:45 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
> vDPA) to manage secure device access from the userspace. One critical task
> of those frameworks is to put the assigned device in a secure, IOMMU-
> protected context so user-initiated DMAs are prevented from doing harm to
> the rest of the system.

Some bot will probably send this too, but it has compile warnings and
needs to be rebased to 5.15-rc1

drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
        if (refcount_read(&ioas->refs) > 1) {
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs here
        return ret;
               ^~~
drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its condition is always true
        if (refcount_read(&ioas->refs) > 1) {
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret' to silence this warning
        int ioasid, ret;
                       ^
                        = 0
drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
        return ERR_PTR(ret);
                       ^~~
drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its condition is always false
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                    ^~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
        return ERR_PTR(ret);
                       ^~~
drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its condition is always false
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                    ^~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret' to silence this warning
        int ret;
               ^
                = 0

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-19  6:38 ` [RFC 01/20] iommu/iommufd: Add /dev/iommu core Liu Yi L
@ 2021-09-21 15:41   ` Jason Gunthorpe
  2021-09-22  1:51     ` Tian, Kevin
  2021-10-15  9:18     ` Liu, Yi L
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 15:41 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:29PM +0800, Liu Yi L wrote:
> /dev/iommu aims to provide a unified interface for managing I/O address
> spaces for devices assigned to userspace. This patch adds the initial
> framework to create a /dev/iommu node. Each open of this node returns an
> iommufd. And this fd is the handle for userspace to initiate its I/O
> address space management.
> 
> One open:
> - We call this feature as IOMMUFD in Kconfig in this RFC. However this
>   name is not clear enough to indicate its purpose to user. Back to 2010
>   vfio even introduced a /dev/uiommu [1] as the predecessor of its
>   container concept. Is that a better name? Appreciate opinions here.
> 
> [1] https://lore.kernel.org/kvm/4c0eb470.1HMjondO00NIvFM6%25pugs@cisco.com/
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/Kconfig           |   1 +
>  drivers/iommu/Makefile          |   1 +
>  drivers/iommu/iommufd/Kconfig   |  11 ++++
>  drivers/iommu/iommufd/Makefile  |   2 +
>  drivers/iommu/iommufd/iommufd.c | 112 ++++++++++++++++++++++++++++++++
>  5 files changed, 127 insertions(+)
>  create mode 100644 drivers/iommu/iommufd/Kconfig
>  create mode 100644 drivers/iommu/iommufd/Makefile
>  create mode 100644 drivers/iommu/iommufd/iommufd.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 07b7c25cbed8..a83ce0acd09d 100644
> +++ b/drivers/iommu/Kconfig
> @@ -136,6 +136,7 @@ config MSM_IOMMU
>  
>  source "drivers/iommu/amd/Kconfig"
>  source "drivers/iommu/intel/Kconfig"
> +source "drivers/iommu/iommufd/Kconfig"
>  
>  config IRQ_REMAP
>  	bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index c0fb0ba88143..719c799f23ad 100644
> +++ b/drivers/iommu/Makefile
> @@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
>  obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
> +obj-$(CONFIG_IOMMUFD) += iommufd/
> diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> new file mode 100644
> index 000000000000..9fb7769a815d
> +++ b/drivers/iommu/iommufd/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config IOMMUFD
> +	tristate "I/O Address Space management framework for passthrough devices"
> +	select IOMMU_API
> +	default n
> +	help
> +	  provides unified I/O address space management framework for
> +	  isolating untrusted DMAs via devices which are passed through
> +	  to userspace drivers.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> new file mode 100644
> index 000000000000..54381a01d003
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_IOMMUFD) += iommufd.o
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> new file mode 100644
> index 000000000000..710b7e62988b
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * I/O Address Space Management for passthrough devices
> + *
> + * Copyright (C) 2021 Intel Corporation
> + *
> + * Author: Liu Yi L <yi.l.liu@intel.com>
> + */
> +
> +#define pr_fmt(fmt)    "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/iommu.h>
> +
> +/* Per iommufd */
> +struct iommufd_ctx {
> +	refcount_t refs;
> +};

A private_data of a struct file should avoid having a refcount (and
this should have been a kref anyhow)

Use the refcount on the struct file instead.

In general the lifetime models look overly convoluted to me with
refcounts being used as locks and going in all manner of directions.

- No refcount on iommufd_ctx, this should use the fget on the fd.
  The driver facing version of the API has the driver holds a fget
  inside the iommufd_device.

- Put a rwlock inside the iommufd_ioas that is a
  'destroying_lock'. The rwlock starts out unlocked.
  
  Acquire from the xarray is
   rcu_lock()
   ioas = xa_load()
   if (ioas)
      if (down_read_trylock(&ioas->destroying_lock))
           // success
  Unacquire is just up_read()

  Do down_write when the ioas is to be destroyed, do not return ebusy.

 - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
   not need locking (order it properly too, it is in the wrong order), and
   don't check for duplicate devices or dev_cookie duplication, that
   is user error and is harmless to the kernel.
  
> +static int iommufd_fops_release(struct inode *inode, struct file *filep)
> +{
> +	struct iommufd_ctx *ictx = filep->private_data;
> +
> +	filep->private_data = NULL;

unnecessary

> +	iommufd_ctx_put(ictx);
> +
> +	return 0;
> +}
> +
> +static long iommufd_fops_unl_ioctl(struct file *filep,
> +				   unsigned int cmd, unsigned long arg)
> +{
> +	struct iommufd_ctx *ictx = filep->private_data;
> +	long ret = -EINVAL;
> +
> +	if (!ictx)
> +		return ret;

impossible

> +
> +	switch (cmd) {
> +	default:
> +		pr_err_ratelimited("unsupported cmd %u\n", cmd);

don't log user triggerable events

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38 ` [RFC 02/20] vfio: Add device class for /dev/vfio/devices Liu Yi L
@ 2021-09-21 15:57   ` Jason Gunthorpe
  2021-09-21 23:56     ` Tian, Kevin
  2021-09-21 19:56   ` Alex Williamson
  2021-09-29  2:08   ` David Gibson
  2 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 15:57 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> userspace to directly open a vfio device w/o relying on container/group
> (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> iommufd (more specifically in iommu core by this RFC) in a device-centric
> manner.
> 
> In case a device is exposed in both legacy and new interfaces (see next
> patch for how to decide it), this patch also ensures that when the device
> is already opened via one interface then the other one must be blocked.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h |   2 +
>  2 files changed, 213 insertions(+), 17 deletions(-)

> +static int vfio_init_device_class(void)
> +{
> +	int ret;
> +
> +	mutex_init(&vfio.device_lock);
> +	idr_init(&vfio.device_idr);
> +
> +	/* /dev/vfio/devices/$DEVICE */
> +	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> +	if (IS_ERR(vfio.device_class))
> +		return PTR_ERR(vfio.device_class);
> +
> +	vfio.device_class->devnode = vfio_device_devnode;
> +
> +	ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1, "vfio-device");
> +	if (ret)
> +		goto err_alloc_chrdev;
> +
> +	cdev_init(&vfio.device_cdev, &vfio_device_fops);
> +	ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK + 1);
> +	if (ret)
> +		goto err_cdev_add;

Huh? This is not how cdevs are used. This patch needs rewriting.

The struct vfio_device should gain a 'struct device' and 'struct cdev'
as non-pointer members

vfio register path should end up doing cdev_device_add() for each
vfio_device

vfio_unregister path should do cdev_device_del()

No idr should be needed, an ida is used to allocate minor numbers

The struct device release function should trigger a kfree which
requires some reworking of the callers

vfio_init_group_dev() should do a device_initialize()
vfio_uninit_group_dev() should do a device_put()

The opened atomic is aweful. A newly created fd should start in a
state where it has a disabled fops

The only thing the disabled fops can do is register the device to the
iommu fd. When successfully registered the device gets the normal fops.

The registration steps should be done under a normal lock inside the
vfio_device. If a vfio_device is already registered then further
registration should fail.

Getting the device fd via the group fd triggers the same sequence as
above.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-19  6:38 ` [RFC 03/20] vfio: Add vfio_[un]register_device() Liu Yi L
@ 2021-09-21 16:01   ` Jason Gunthorpe
  2021-09-21 23:10     ` Tian, Kevin
  2021-09-22  0:54     ` Tian, Kevin
  2021-09-29  2:43   ` David Gibson
  1 sibling, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 16:01 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> With /dev/vfio/devices introduced, now a vfio device driver has three
> options to expose its device to userspace:
> 
> a)  only legacy group interface, for devices which haven't been moved to
>     iommufd (e.g. platform devices, sw mdev, etc.);
> 
> b)  both legacy group interface and new device-centric interface, for
>     devices which supports iommufd but also wants to keep backward
>     compatibility (e.g. pci devices in this RFC);
> 
> c)  only new device-centric interface, for new devices which don't carry
>     backward compatibility burden (e.g. hw mdev/subdev with pasid);

We shouldn't have 'b'? Where does it come from?

> This patch introduces vfio_[un]register_device() helpers for the device
> drivers to specify the device exposure policy to vfio core. Hence the
> existing vfio_[un]register_group_dev() become the wrapper of the new
> helper functions. The new device-centric interface is described as
> 'nongroup' to differentiate from existing 'group' stuff.

Detect what the driver supports based on the ops it declares. There
should be a function provided through the ops for the driver to bind
to the iommufd.

>  One open about how to organize the device nodes under /dev/vfio/devices/.
> This RFC adopts a simple policy by keeping a flat layout with mixed devname
> from all kinds of devices. The prerequisite of this model is that devnames
> from different bus types are unique formats:

This isn't reliable, the devname should just be vfio0, vfio1, etc

The userspace can learn the correct major/minor by inspecting the
sysfs.

This whole concept should disappear into the prior patch that adds the
struct device in the first place, and I think most of the code here
can be deleted once the struct device is used properly.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-19  6:38 ` [RFC 04/20] iommu: Add iommu_device_get_info interface Liu Yi L
@ 2021-09-21 16:19   ` Jason Gunthorpe
  2021-09-22  2:31     ` Lu Baolu
  2021-09-29  2:52   ` David Gibson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 16:19 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> This provides an interface for upper layers to get the per-device iommu
> attributes.
> 
>     int iommu_device_get_info(struct device *dev,
>                               enum iommu_devattr attr, void *data);

Can't we use properly typed ops and functions here instead of a void
*data?

get_snoop()
get_page_size()
get_addr_width()

?

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-19  6:38 ` [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices Liu Yi L
@ 2021-09-21 16:40   ` Jason Gunthorpe
  2021-09-21 21:09     ` Alex Williamson
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 16:40 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> This patch exposes the device-centric interface for vfio-pci devices. To
> be compatiable with existing users, vfio-pci exposes both legacy group
> interface and device-centric interface.
> 
> As explained in last patch, this change doesn't apply to devices which
> cannot be forced to snoop cache by their upstream iommu. Such devices
> are still expected to be opened via the legacy group interface.
> 
> When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> the user from accessing the assigned device because the device is still
> attached to the default domain which may allow user-initiated DMAs to
> touch arbitrary place. The user access must be blocked until the device
> is later bound to an iommufd (see patch 08). The binding acts as the
> contract for putting the device in a security context which ensures user-
> initiated DMAs via this device cannot harm the rest of the system.
> 
> This patch introduces a vdev->block_access flag for this purpose. It's set
> when the device is opened via /dev/vfio/devices and cleared after binding
> to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> user access should be blocked or not.

This should not be in vfio_pci.

AFAIK there is no condition where a vfio driver can work without being
connected to some kind of iommu back end, so the core code should
handle this interlock globally. A vfio driver's ops should not be
callable until the iommu is connected.

The only vfio_pci patch in this series should be adding a new callback
op to take in an iommufd and register the pci_device as a iommufd
device.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-19  6:38 ` [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces Liu Yi L
@ 2021-09-21 17:09   ` Jason Gunthorpe
  2021-09-22  1:47     ` Tian, Kevin
  2021-09-29  4:55   ` David Gibson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:09 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> This extends iommu core to manage security context for passthrough
> devices. Please bear a long explanation for how we reach this design
> instead of managing it solely in iommufd like what vfio does today.
> 
> Devices which cannot be isolated from each other are organized into an
> iommu group. When a device is assigned to the user space, the entire
> group must be put in a security context so that user-initiated DMAs via
> the assigned device cannot harm the rest of the system. No user access
> should be granted on a device before the security context is established
> for the group which the device belongs to.

> Managing the security context must meet below criteria:
> 
> 1)  The group is viable for user-initiated DMAs. This implies that the
>     devices in the group must be either bound to a device-passthrough

s/a/the same/

>     framework, or driver-less, or bound to a driver which is known safe
>     (not do DMA).
> 
> 2)  The security context should only allow DMA to the user's memory and
>     devices in this group;
> 
> 3)  After the security context is established for the group, the group
>     viability must be continuously monitored before the user relinquishes
>     all devices belonging to the group. The viability might be broken e.g.
>     when a driver-less device is later bound to a driver which does DMA.
> 
> 4)  The security context should not be destroyed before user access
>     permission is withdrawn.
> 
> Existing vfio introduces explicit container/group semantics in its uAPI
> to meet above requirements. A single security context (iommu domain)
> is created per container. Attaching group to container moves the entire
> group into the associated security context, and vice versa. The user can
> open the device only after group attach. A group can be detached only
> after all devices in the group are closed. Group viability is monitored
> by listening to iommu group events.
> 
> Unlike vfio, iommufd adopts a device-centric design with all group
> logistics hidden behind the fd. Binding a device to iommufd serves
> as the contract to get security context established (and vice versa
> for unbinding). One additional requirement in iommufd is to manage the
> switch between multiple security contexts due to decoupled bind/attach:

This should be a precursor series that actually does clean things up
properly. There is no reason for vfio and iommufd to differ here, if
we are implementing this logic into the iommu layer then it should be
deleted from the VFIO layer, not left duplicated like this.

IIRC in VFIO the container is the IOAS and when the group goes to
create the device fd it should simply do the
iommu_device_init_user_dma() followed immediately by a call to bind
the container IOAS as your #3.

Then delete all the group viability stuff from vfio, relying on the
iommu to do it.

It should have full symmetry with the iommufd.

> @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
>  		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
>  		break;
>  	case BUS_NOTIFY_BOUND_DRIVER:
> +		/*
> +		 * FIXME: Alternatively the attached drivers could generically
> +		 * indicate to the iommu layer that they are safe for keeping
> +		 * the iommu group user viable by calling some function around
> +		 * probe(). We could eliminate this gross BUG_ON() by denying
> +		 * probe to non-iommu-safe driver.
> +		 */
> +		mutex_lock(&group->mutex);
> +		if (group->user_dma_owner_id)
> +			BUG_ON(!iommu_group_user_dma_viable(group));
> +		mutex_unlock(&group->mutex);

And the mini-series should fix this BUG_ON properly by interlocking
with the driver core to simply refuse to bind a driver under these
conditions instead of allowing userspace to crash the kernel.

That alone would be justification enough to merge this work.

> +
> +/*
> + * IOMMU core interfaces for iommufd.
> + */
> +
> +/*
> + * FIXME: We currently simply follow vifo policy to mantain the group's
> + * viability to user. Eventually, we should avoid below hard-coded list
> + * by letting drivers indicate to the iommu layer that they are safe for
> + * keeping the iommu group's user aviability.
> + */
> +static const char * const iommu_driver_allowed[] = {
> +	"vfio-pci",
> +	"pci-stub"
> +};

Yuk. This should be done with some callback in those drivers
'iomm_allow_user_dma()"

Ie the basic flow would see the driver core doing some:

 ret = iommu_doing_kernel_dma()
 if (ret) do not bind
 driver_bind
  pci_stub_probe()
     iommu_allow_user_dma()

And the various functions are manipulating some atomic.
 0 = nothing happening
 1 = kernel DMA
 2 = user DMA

No BUG_ON.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-19  6:38 ` [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device() Liu Yi L
@ 2021-09-21 17:14   ` Jason Gunthorpe
  2021-10-15  9:21     ` Liu, Yi L
  2021-09-29  5:25   ` David Gibson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:14 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:

> +/*
> + * A iommufd_device object represents the binding relationship
> + * between iommufd and device. It is created per a successful
> + * binding request from device driver. The bound device must be
> + * a physical device so far. Subdevice will be supported later
> + * (with additional PASID information). An user-assigned cookie
> + * is also recorded to mark the device in the /dev/iommu uAPI.
> + */
> +struct iommufd_device {
> +	unsigned int id;
> +	struct iommufd_ctx *ictx;
> +	struct device *dev; /* always be the physical device */
> +	u64 dev_cookie;
>  };
>  
>  static int iommufd_fops_open(struct inode *inode, struct file *filep)
> @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
>  		return -ENOMEM;
>  
>  	refcount_set(&ictx->refs, 1);
> +	mutex_init(&ictx->lock);
> +	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
>  	filep->private_data = ictx;
>  
>  	return ret;
>  }
>  
> +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> +	refcount_inc(&ictx->refs);
> +}

See my earlier remarks about how to structure the lifetime logic, this
ref isn't necessary.

> +static const struct file_operations iommufd_fops;
> +
> +/**
> + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> + * @fd: [in] iommufd file descriptor.
> + *
> + * Returns a pointer to the iommufd context, otherwise NULL;
> + *
> + */
> +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> +{
> +	struct fd f = fdget(fd);
> +	struct file *file = f.file;
> +	struct iommufd_ctx *ictx;
> +
> +	if (!file)
> +		return NULL;
> +
> +	if (file->f_op != &iommufd_fops)
> +		return NULL;

Leaks the fdget

> +
> +	ictx = file->private_data;
> +	if (ictx)
> +		iommufd_ctx_get(ictx);

Use success oriented flow

> +	fdput(f);
> +	return ictx;
> +}

> + */
> +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> +					   u64 dev_cookie)
> +{
> +	struct iommufd_ctx *ictx;
> +	struct iommufd_device *idev;
> +	unsigned long index;
> +	unsigned int id;
> +	int ret;
> +
> +	ictx = iommufd_ctx_fdget(fd);
> +	if (!ictx)
> +		return ERR_PTR(-EINVAL);
> +
> +	mutex_lock(&ictx->lock);
> +
> +	/* check duplicate registration */
> +	xa_for_each(&ictx->device_xa, index, idev) {
> +		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> +			idev = ERR_PTR(-EBUSY);
> +			goto out_unlock;
> +		}

I can't think of a reason why this expensive check is needed.

> +	}
> +
> +	idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> +	if (!idev) {
> +		ret = -ENOMEM;
> +		goto out_unlock;
> +	}
> +
> +	/* Establish the security context */
> +	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> +	if (ret)
> +		goto out_free;
> +
> +	ret = xa_alloc(&ictx->device_xa, &id, idev,
> +		       XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
> +		       GFP_KERNEL);

idev should be fully initialized before being placed in the xarray, so
this should be the last thing done.

Why not just use the standard xa_limit_32b instead of special single
use constants?

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-19  6:38 ` [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD Liu Yi L
@ 2021-09-21 17:29   ` Jason Gunthorpe
  2021-09-22 21:01     ` Alex Williamson
  2021-09-29  6:00   ` David Gibson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:29 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
> device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
> because it's implicitly done when the device fd is closed.
> 
> In concept a vfio device can be bound to multiple iommufds, each hosting
> a subset of I/O address spaces attached by this device. However as a
> starting point (matching current vfio), only one I/O address space is
> supported per vfio device. It implies one device can only be attached
> to one iommufd at this point.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/vfio/pci/Kconfig            |  1 +
>  drivers/vfio/pci/vfio_pci.c         | 72 ++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_private.h |  8 ++++
>  include/uapi/linux/vfio.h           | 30 ++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 5e2e1b9a9fd3..3abfb098b4dc 100644
> +++ b/drivers/vfio/pci/Kconfig
> @@ -5,6 +5,7 @@ config VFIO_PCI
>  	depends on MMU
>  	select VFIO_VIRQFD
>  	select IRQ_BYPASS_MANAGER
> +	select IOMMUFD
>  	help
>  	  Support for the PCI VFIO bus driver.  This is required to make
>  	  use of PCI drivers using the VFIO framework.
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 145addde983b..20006bb66430 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
>  			vdev->req_trigger = NULL;
>  		}
>  		mutex_unlock(&vdev->igate);
> +
> +		mutex_lock(&vdev->videv_lock);
> +		if (vdev->videv) {
> +			struct vfio_iommufd_device *videv = vdev->videv;
> +
> +			vdev->videv = NULL;
> +			iommufd_unbind_device(videv->idev);
> +			kfree(videv);
> +		}
> +		mutex_unlock(&vdev->videv_lock);
>  	}
>  
>  	mutex_unlock(&vdev->reflck->lock);
> @@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
>  		container_of(core_vdev, struct vfio_pci_device, vdev);
>  	unsigned long minsz;
>  
> -	if (cmd == VFIO_DEVICE_GET_INFO) {
> +	if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {

Choosing to implement this through the ioctl multiplexor is what is
causing so much ugly gyration in the previous patches

This should be a straightforward new function and ops:

struct iommufd_device *vfio_pci_bind_iommufd(struct vfio_device *)
{
		iommu_dev = iommufd_bind_device(bind_data.iommu_fd,
					   &vdev->pdev->dev,
					   bind_data.dev_cookie);
                if (!iommu_dev) return ERR
                vdev->iommu_dev = iommu_dev;
}
static const struct vfio_device_ops vfio_pci_ops = {
   .bind_iommufd = &*vfio_pci_bind_iommufd

If you do the other stuff I said then you'll notice that the
iommufd_bind_device() will provide automatic exclusivity.

The thread that sees ops->bind_device succeed will know it is the only
thread that can see that (by definition, the iommu enable user stuff
has to be exclusive and race free) thus it can go ahead and store the
iommu pointer.

The other half of the problem '&vdev->block_access' is solved by
manipulating the filp->f_ops. Start with a fops that can ONLY call the
above op. When the above op succeeds switch the fops to the normal
full ops. .

The same flow happens when the group fd spawns the device fd, just
parts of iommfd_bind_device are open coded into the vfio code, but the
whole flow and sequence should be the same.

> +		/*
> +		 * Reject the request if the device is already opened and
> +		 * attached to a container.
> +		 */
> +		if (vfio_device_in_container(core_vdev))
> +			return -ENOTTY;

This is wrongly locked

> +
> +		minsz = offsetofend(struct vfio_device_iommu_bind_data, dev_cookie);
> +
> +		if (copy_from_user(&bind_data, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind_data.argsz < minsz ||
> +		    bind_data.flags || bind_data.iommu_fd < 0)
> +			return -EINVAL;
> +
> +		mutex_lock(&vdev->videv_lock);
> +		/*
> +		 * Allow only one iommufd per device until multiple
> +		 * address spaces (e.g. vSVA) support is introduced
> +		 * in the future.
> +		 */
> +		if (vdev->videv) {
> +			mutex_unlock(&vdev->videv_lock);
> +			return -EBUSY;
> +		}
> +
> +		idev = iommufd_bind_device(bind_data.iommu_fd,
> +					   &vdev->pdev->dev,
> +					   bind_data.dev_cookie);
> +		if (IS_ERR(idev)) {
> +			mutex_unlock(&vdev->videv_lock);
> +			return PTR_ERR(idev);
> +		}
> +
> +		videv = kzalloc(sizeof(*videv), GFP_KERNEL);
> +		if (!videv) {
> +			iommufd_unbind_device(idev);
> +			mutex_unlock(&vdev->videv_lock);
> +			return -ENOMEM;
> +		}
> +		videv->idev = idev;
> +		videv->iommu_fd = bind_data.iommu_fd;

No need for more memory, a struct vfio_device can be attached to a
single iommu context. If idev then the context and all the other
information is valid.

> +		if (atomic_read(&vdev->block_access))
> +			atomic_set(&vdev->block_access, 0);

I'm sure I'll tell you this is all wrongly locked too if I look
closely.

> +/*
> + * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
> + *				struct vfio_device_iommu_bind_data)
> + *
> + * Bind a vfio_device to the specified iommufd
> + *
> + * The user should provide a device cookie when calling this ioctl. The
> + * cookie is later used in iommufd for capability query, iotlb invalidation
> + * and I/O fault handling.
> + *
> + * User is not allowed to access the device before the binding operation
> + * is completed.
> + *
> + * Unbind is automatically conducted when device fd is closed.
> + *
> + * Input parameters:
> + *	- iommu_fd;
> + *	- dev_cookie;
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_iommu_bind_data {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	iommu_fd;
> +	__u64	dev_cookie;

Missing explicit padding

Always use __aligned_u64 in uapi headers, fix all the patches.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38 ` [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO Liu Yi L
@ 2021-09-21 17:40   ` Jason Gunthorpe
  2021-09-22  3:30     ` Tian, Kevin
  2021-09-22 21:24   ` Alex Williamson
  2021-09-29  6:23   ` David Gibson
  2 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:40 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> After a device is bound to the iommufd, userspace can use this interface
> to query the underlying iommu capability and format info for this device.
> Based on this information the user then creates I/O address space in a
> compatible format with the to-be-attached devices.
> 
> Device cookie which is registered at binding time is used to mark the
> device which is being queried here.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e16ca21e4534..641f199f2d41 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>  
> +static struct device *
> +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> +{

We have an xarray ID for the device, why are we allowing userspace to
use the dev_cookie as input?

Userspace should always pass in the ID. The only place dev_cookie
should appear is if the kernel generates an event back to
userspace. Then the kernel should return both the ID and the
dev_cookie in the event to allow userspace to correlate it.

> +static void iommu_device_build_info(struct device *dev,
> +				    struct iommu_device_info *info)
> +{
> +	bool snoop;
> +	u64 awidth, pgsizes;
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
> +		info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
> +		info->pgsize_bitmap = pgsizes;
> +		info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
> +	}
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
> +		info->addr_width = awidth;
> +		info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
> +	}

Another good option is to push the iommu_device_info uAPI struct down
through to the iommu driver to fill it in and forget about the crazy
enum.

A big part of thinking of this iommu interface is a way to bind the HW
IOMMU driver to a uAPI and allow the HW driver to expose its unique
functionalities.

> +static int iommufd_get_device_info(struct iommufd_ctx *ictx,
> +				   unsigned long arg)
> +{
> +	struct iommu_device_info info;
> +	unsigned long minsz;
> +	struct device *dev;
> +
> +	minsz = offsetofend(struct iommu_device_info, addr_width);
> +
> +	if (copy_from_user(&info, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz)
> +		return -EINVAL;

All of these patterns everywhere are wrongly coded for forward/back
compatibility.

static int iommufd_get_device_info(struct iommufd_ctx *ictx,
                   struct iommu_device_info __user *arg, size_t usize)
{
	struct iommu_device_info info;
	int ret;

	if (usize < offsetofend(struct iommu_device_info, addr_flags))
           return -EINVAL;

        ret = copy_struct_from_user(&info, sizeof(info), arg, usize);
        if (ret)
	      return ret;

'usize' should be in a 'common' header extracted by the main ioctl handler.

> +struct iommu_device_info {
> +	__u32	argsz;
> +	__u32	flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
> +#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_wdith field valid */
> +	__u64	dev_cookie;
> +	__u64   pgsize_bitmap;
> +	__u32	addr_width;
> +};

Be explicit with padding here too.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38 ` [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE Liu Yi L
@ 2021-09-21 17:44   ` Jason Gunthorpe
  2021-09-22  3:40     ` Tian, Kevin
                       ` (2 more replies)
  2021-09-22 13:45   ` Jean-Philippe Brucker
  2021-10-01  6:11   ` David Gibson
  2 siblings, 3 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:44 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enfore_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

I think the request was to include a start/end IO address hint when
creating the ios. When the kernel creates it then it can return the
actual geometry including any holes via a query.

> - Currently ioasid term has already been used in the kernel (drivers/iommu/
>   ioasid.c) to represent the hardware I/O address space ID in the wire. It
>   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
>   ID). We need find a way to resolve the naming conflict between the hardware
>   ID and software handle. One option is to rename the existing ioasid to be
>   pasid or ssid, given their full names still sound generic. Appreciate more
>   thoughts on this open!

ioas works well here I think. Use ioas_id to refer to the xarray
index.

> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h         |   3 +
>  include/uapi/linux/iommu.h      |  54 ++++++++++++++
>  3 files changed, 177 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 641f199f2d41..4839f128b24a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -24,6 +24,7 @@
>  struct iommufd_ctx {
>  	refcount_t refs;
>  	struct mutex lock;
> +	struct xarray ioasid_xa; /* xarray of ioasids */
>  	struct xarray device_xa; /* xarray of bound devices */
>  };
>  
> @@ -42,6 +43,16 @@ struct iommufd_device {
>  	u64 dev_cookie;
>  };
>  
> +/* Represent an I/O address space */
> +struct iommufd_ioas {
> +	int ioasid;

xarray id's should consistently be u32s everywhere.

Many of the same prior comments repeated here

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-19  6:38 ` [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION Liu Yi L
@ 2021-09-21 17:47   ` Jason Gunthorpe
  2021-09-22  3:41     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:47 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> As aforementioned, userspace should check extension for what formats
> can be specified when allocating an IOASID. This patch adds such
> interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
> support and no no-snoop support yet.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c |  7 +++++++
>  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 4839f128b24a..e45d76359e34 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
>  		return ret;
>  
>  	switch (cmd) {
> +	case IOMMU_CHECK_EXTENSION:
> +		switch (arg) {
> +		case EXT_MAP_TYPE1V2:
> +			return 1;
> +		default:
> +			return 0;
> +		}
>  	case IOMMU_DEVICE_GET_INFO:
>  		ret = iommufd_get_device_info(ictx, arg);
>  		break;
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 5cbd300eb0ee..49731be71213 100644
> +++ b/include/uapi/linux/iommu.h
> @@ -14,6 +14,33 @@
>  #define IOMMU_TYPE	(';')
>  #define IOMMU_BASE	100
>  
> +/*
> + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> + *
> + * Check whether an uAPI extension is supported.
> + *
> + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> + * in one breath. User should check which uAPI extension is supported
> + * according to its intended usage.
> + *
> + * A rough list of possible extensions may include:
> + *
> + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> + *	- EXT_IOASID_NESTING for what the name stands;
> + *	- EXT_USER_PAGE_TABLE for user managed page table;
> + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> + *	- ...
> + *
> + * Return: 0 if not supported, 1 if supported.
> + */
> +#define EXT_MAP_TYPE1V2		1
> +#define EXT_DMA_NO_SNOOP	2
> +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)

I generally advocate for a 'try and fail' approach to discovering
compatibility.

If that doesn't work for the userspace then a query to return a
generic capability flag is the next best idea. Each flag should
clearly define what 'try and fail' it is talking about

Eg dma_no_snoop is about creating an IOS with flag NO SNOOP set

TYPE1V2 seems like nonsense

Not sure about the others.

IOW, this should recast to a generic 'query capabilities' IOCTL

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-19  6:38 ` [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid() Liu Yi L
@ 2021-09-21 18:02   ` Jason Gunthorpe
  2021-09-22  3:53     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 18:02 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:42PM +0800, Liu Yi L wrote:
> An I/O address space takes effect in the iommu only after it's attached
> by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
> helpers for this purpose. One device can be only attached to one ioasid
> at this point, but one ioasid can be attached by multiple devices.
> 
> The caller specifies the iommufd_device (returned at binding time) and
> the target ioasid when calling the helper function. Upon request, iommufd
> installs the specified I/O page table to the correct place in the IOMMU,
> according to the routing information (struct device* which represents
> RID) recorded in iommufd_device. Future variants could allow the caller
> to specify additional routing information (e.g. pasid/ssid) when multiple
> I/O address spaces are supported per device.
> 
> Open:
> Per Jason's comment in below link, bus-specific wrappers are recommended.
> This RFC implements one wrapper for pci device. But it looks that struct
> pci_device is not used at all since iommufd_ device already carries all
> necessary info. So want to have another discussion on its necessity, e.g.
> whether making more sense to have bus-specific wrappers for binding, while
> leaving a common attaching helper per iommufd_device.
> https://lore.kernel.org/linux-iommu/20210528233649.GB3816344@nvidia.com/
> 
> TODO:
> When multiple devices are attached to a same ioasid, the permitted iova
> ranges and supported pgsize bitmap on this ioasid should be a common
> subset of all attached devices. iommufd needs to track such info per
> ioasid and update it every time when a new device is attached to the
> ioasid. This has not been done in this version yet, due to the temporary
> hack adopted in patch 16-18. The hack reuses vfio type1 driver which
> already includes the necessary logic for iova ranges and pgsize bitmap.
> Once we get a clear direction for those patches, that logic will be moved
> to this patch.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c | 226 ++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h         |  29 ++++
>  2 files changed, 255 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e45d76359e34..25373a0e037a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -51,6 +51,19 @@ struct iommufd_ioas {
>  	bool enforce_snoop;
>  	struct iommufd_ctx *ictx;
>  	refcount_t refs;
> +	struct mutex lock;
> +	struct list_head device_list;
> +	struct iommu_domain *domain;

This should just be another xarray indexed by the device id

> +/* Caller should hold ioas->lock */
> +static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
> +						 struct iommufd_device *idev)
> +{
> +	struct ioas_device_info *ioas_dev;
> +
> +	list_for_each_entry(ioas_dev, &ioas->device_list, next) {
> +		if (ioas_dev->idev == idev)
> +			return ioas_dev;
> +	}

Which eliminates this search. xarray with tightly packed indexes is
generally more efficient than linked lists..

> +static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
> +					   struct device *dev)
> +{
> +	bool snoop = false;
> +	u32 addr_width;
> +	int ret;
> +
> +	/*
> +	 * currently we only support I/O page table with iommu enforce-snoop
> +	 * format. Attaching a device which doesn't support this format in its
> +	 * upstreaming iommu is rejected.
> +	 */
> +	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
> +	if (ret || !snoop)
> +		return -EINVAL;
> +
> +	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
> +	if (ret || addr_width < ioas->addr_width)
> +		return -EINVAL;
> +
> +	/* TODO: also need to check permitted iova ranges and pgsize bitmap */
> +
> +	return 0;
> +}

This seems kind of weird..

I expect the iommufd to hold a SW copy of the IO page table and each
time a new domain is to be created it should push the SW copy into the
domain. If the domain cannot support it then the domain driver should
naturally fail a request.

When the user changes the IO page table the SW copy is updated then
all of the domains are updated too - again if any domain cannot
support the change then it fails and the change is rolled back.

It seems like this is a side effect of roughly hacking in the vfio
code?

> +
> +/**
> + * iommufd_device_attach_ioasid - attach device to an ioasid
> + * @idev: [in] Pointer to struct iommufd_device.
> + * @ioasid: [in] ioasid points to an I/O address space.
> + *
> + * Returns 0 for successful attach, otherwise returns error.
> + *
> + */
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)

Types for the ioas_id again..

> +{
> +	struct iommufd_ioas *ioas;
> +	struct ioas_device_info *ioas_dev;
> +	struct iommu_domain *domain;
> +	int ret;
> +
> +	ioas = ioasid_get_ioas(idev->ictx, ioasid);
> +	if (!ioas) {
> +		pr_err_ratelimited("Trying to attach illegal or unkonwn IOASID %u\n", ioasid);
> +		return -EINVAL;

No prints triggered by bad userspace

> +	}
> +
> +	mutex_lock(&ioas->lock);
> +
> +	/* Check for duplicates */
> +	if (ioas_find_device(ioas, idev)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

just xa_cmpxchg NULL, XA_ZERO_ENTRY

> +	/*
> +	 * Each ioas is backed by an iommu domain, which is allocated
> +	 * when the ioas is attached for the first time and then shared
> +	 * by following devices.
> +	 */
> +	if (list_empty(&ioas->device_list)) {

Seems strange, what if the devices are forced to have different
domains? We don't want to model that in the SW layer..

> +	/* Install the I/O page table to the iommu for this device */
> +	ret = iommu_attach_device(domain, idev->dev);
> +	if (ret)
> +		goto out_domain;

This is where things start to get confusing when you talk about PASID
as the above call needs to be some PASID centric API.

> @@ -27,6 +28,16 @@ struct iommufd_device *
>  iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
>  void iommufd_unbind_device(struct iommufd_device *idev);
>  
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
> +void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
> +
> +static inline int
> +__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
> +				   struct iommufd_device *idev, int ioasid)
> +{
> +	return iommufd_device_attach_ioasid(idev, ioasid);
> +}

If think sis taking in the iommfd_device then there isn't a logical
place to signal the PCIness

But, I think the API should at least have a kdoc that this is
capturing the entire device and specify that for PCI this means all
TLPs with the RID.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-19  6:38 ` [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID Liu Yi L
@ 2021-09-21 18:04   ` Jason Gunthorpe
  2021-09-22  3:56     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 18:04 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> This patch adds interface for userspace to attach device to specified
> IOASID.
> 
> Note:
> One device can only be attached to one IOASID in this version. This is
> on par with what vfio provides today. In the future this restriction can
> be relaxed when multiple I/O address spaces are supported per device

?? In VFIO the container is the IOS and the container can be shared
with multiple devices. This needs to start at about the same
functionality.

> +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {

This should be in the core code, right? There is nothing PCI specific
here.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing
  2021-09-19  6:38 ` [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing Liu Yi L
@ 2021-09-21 18:14   ` Jason Gunthorpe
  2021-09-22  3:57     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 18:14 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:44PM +0800, Liu Yi L wrote:
> [HACK. will fix in v2]
> 
> There are two options to impelement vfio type1v2 mapping semantics in
> /dev/iommu.
> 
> One is to duplicate the related code from vfio as the starting point,
> and then merge with vfio type1 at a later time. However vfio_iommu_type1.c
> has over 3000LOC with ~80% related to dma management logic, including:

I can't really see a way forward like this. I think some scheme to
move the vfio datastructure is going to be necessary.

> - the dma map/unmap metadata management
> - page pinning, and related accounting
> - iova range reporting
> - dirty bitmap retrieving
> - dynamic vaddr update, etc.

All of this needs to be part of the iommufd anyhow..

> The alternative is to consolidate type1v2 logic in /dev/iommu immediately,
> which requires converting vfio_iommu_type1 to be a shim driver. 

Another choice is the the datastructure coulde move and the two
drivers could share its code and continue to exist more independently

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38 ` [RFC 02/20] vfio: Add device class for /dev/vfio/devices Liu Yi L
  2021-09-21 15:57   ` Jason Gunthorpe
@ 2021-09-21 19:56   ` Alex Williamson
  2021-09-22  0:56     ` Tian, Kevin
  2021-09-29  2:08   ` David Gibson
  2 siblings, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-21 19:56 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jgg, hch, jasowang, joro, jean-philippe, kevin.tian, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, ashok.raj, yi.l.liu,
	jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

On Sun, 19 Sep 2021 14:38:30 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> userspace to directly open a vfio device w/o relying on container/group
> (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> iommufd (more specifically in iommu core by this RFC) in a device-centric
> manner.
> 
> In case a device is exposed in both legacy and new interfaces (see next
> patch for how to decide it), this patch also ensures that when the device
> is already opened via one interface then the other one must be blocked.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h |   2 +
>  2 files changed, 213 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 02cc51ce6891..84436d7abedd 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
...
> @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
>  	.mode = S_IRUGO | S_IWUGO,
>  };
>  
> +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
> +}

dev_name() doesn't provide us with any uniqueness guarantees, so this
could potentially generate naming conflicts.  The similar scheme for
devices within an iommu group appends an instance number if a conflict
occurs, but that solution doesn't work here where the name isn't just a
link to the actual device.  Devices within an iommu group are also
likely associated within a bus_type, so the potential for conflict is
pretty negligible, that's not the case as vfio is adopted for new
device types.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 16:40   ` Jason Gunthorpe
@ 2021-09-21 21:09     ` Alex Williamson
  2021-09-21 21:58       ` Jason Gunthorpe
  2021-09-22  1:19       ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Alex Williamson @ 2021-09-21 21:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Tue, 21 Sep 2021 13:40:01 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > This patch exposes the device-centric interface for vfio-pci devices. To
> > be compatiable with existing users, vfio-pci exposes both legacy group
> > interface and device-centric interface.
> > 
> > As explained in last patch, this change doesn't apply to devices which
> > cannot be forced to snoop cache by their upstream iommu. Such devices
> > are still expected to be opened via the legacy group interface.

This doesn't make much sense to me.  The previous patch indicates
there's work to be done in updating the kvm-vfio contract to understand
DMA coherency, so you're trying to limit use cases to those where the
IOMMU enforces coherency, but there's QEMU work to be done to support
the iommufd uAPI at all.  Isn't part of that work to understand how KVM
will be told about non-coherent devices rather than "meh, skip it in the
kernel"?  Also let's not forget that vfio is not only for KVM.
 
> > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > the user from accessing the assigned device because the device is still
> > attached to the default domain which may allow user-initiated DMAs to
> > touch arbitrary place. The user access must be blocked until the device
> > is later bound to an iommufd (see patch 08). The binding acts as the
> > contract for putting the device in a security context which ensures user-
> > initiated DMAs via this device cannot harm the rest of the system.
> > 
> > This patch introduces a vdev->block_access flag for this purpose. It's set
> > when the device is opened via /dev/vfio/devices and cleared after binding
> > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > user access should be blocked or not.  
> 
> This should not be in vfio_pci.
> 
> AFAIK there is no condition where a vfio driver can work without being
> connected to some kind of iommu back end, so the core code should
> handle this interlock globally. A vfio driver's ops should not be
> callable until the iommu is connected.
> 
> The only vfio_pci patch in this series should be adding a new callback
> op to take in an iommufd and register the pci_device as a iommufd
> device.

Couldn't the same argument be made that registering a $bus device as an
iommufd device is a common interface that shouldn't be the
responsibility of the vfio device driver?  Is userspace opening the
non-group device anything more than a reservation of that device if
access is withheld until iommu isolation?  I also don't really want to
predict how ioctls might evolve to guess whether only blocking .read,
.write, and .mmap callbacks are sufficient.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 21:09     ` Alex Williamson
@ 2021-09-21 21:58       ` Jason Gunthorpe
  2021-09-22  1:24         ` Tian, Kevin
  2021-09-22  1:19       ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 21:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Tue, Sep 21, 2021 at 03:09:29PM -0600, Alex Williamson wrote:

> the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> will be told about non-coherent devices rather than "meh, skip it in the
> kernel"?  Also let's not forget that vfio is not only for KVM.

vfio is not only for KVM, but AFIACT the wbinv stuff is only for
KVM... But yes, I agree this should be sorted out at this stage

> > > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > > the user from accessing the assigned device because the device is still
> > > attached to the default domain which may allow user-initiated DMAs to
> > > touch arbitrary place. The user access must be blocked until the device
> > > is later bound to an iommufd (see patch 08). The binding acts as the
> > > contract for putting the device in a security context which ensures user-
> > > initiated DMAs via this device cannot harm the rest of the system.
> > > 
> > > This patch introduces a vdev->block_access flag for this purpose. It's set
> > > when the device is opened via /dev/vfio/devices and cleared after binding
> > > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > > user access should be blocked or not.  
> > 
> > This should not be in vfio_pci.
> > 
> > AFAIK there is no condition where a vfio driver can work without being
> > connected to some kind of iommu back end, so the core code should
> > handle this interlock globally. A vfio driver's ops should not be
> > callable until the iommu is connected.
> > 
> > The only vfio_pci patch in this series should be adding a new callback
> > op to take in an iommufd and register the pci_device as a iommufd
> > device.
> 
> Couldn't the same argument be made that registering a $bus device as an
> iommufd device is a common interface that shouldn't be the
> responsibility of the vfio device driver? 

The driver needs enough involvment to signal what kind of IOMMU
connection it wants, eg attaching to a physical device will use the
iofd_attach_device() path, but attaching to a SW page table should use
a different API call. PASID should also be different.

Possibly a good arrangement is to have the core provide some generic
ioctl ops functions 'vfio_all_device_iommufd_bind' that everything
except mdev drivers can use so the code is all duplicated.

> non-group device anything more than a reservation of that device if
> access is withheld until iommu isolation?  I also don't really want to
> predict how ioctls might evolve to guess whether only blocking .read,
> .write, and .mmap callbacks are sufficient.  Thanks,

This is why I said the entire fops should be blocked in a dummy fops
so the core code the vfio_device FD parked and userspace unable to
access the ops until device attachment and thus IOMMU ioslation is
completed.

Simple and easy to reason about, a parked FD is very similar to a
closed FD.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-21 16:01   ` Jason Gunthorpe
@ 2021-09-21 23:10     ` Tian, Kevin
  2021-09-22  0:53       ` Jason Gunthorpe
  2021-09-22  0:54     ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-21 23:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 12:01 AM
> 
> On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > With /dev/vfio/devices introduced, now a vfio device driver has three
> > options to expose its device to userspace:
> >
> > a)  only legacy group interface, for devices which haven't been moved to
> >     iommufd (e.g. platform devices, sw mdev, etc.);
> >
> > b)  both legacy group interface and new device-centric interface, for
> >     devices which supports iommufd but also wants to keep backward
> >     compatibility (e.g. pci devices in this RFC);
> >
> > c)  only new device-centric interface, for new devices which don't carry
> >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> 
> We shouldn't have 'b'? Where does it come from?

a vfio-pci device can be opened via the existing group interface. if no b) it 
means legacy vfio userspace can never use vfio-pci device any more
once the latter is moved to iommufd.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-21 15:57   ` Jason Gunthorpe
@ 2021-09-21 23:56     ` Tian, Kevin
  2021-09-22  0:55       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-21 23:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 21, 2021 11:57 PM
> 
> On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> >
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
> >  include/linux/vfio.h |   2 +
> >  2 files changed, 213 insertions(+), 17 deletions(-)
> 
> > +static int vfio_init_device_class(void)
> > +{
> > +	int ret;
> > +
> > +	mutex_init(&vfio.device_lock);
> > +	idr_init(&vfio.device_idr);
> > +
> > +	/* /dev/vfio/devices/$DEVICE */
> > +	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> > +	if (IS_ERR(vfio.device_class))
> > +		return PTR_ERR(vfio.device_class);
> > +
> > +	vfio.device_class->devnode = vfio_device_devnode;
> > +
> > +	ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1,
> "vfio-device");
> > +	if (ret)
> > +		goto err_alloc_chrdev;
> > +
> > +	cdev_init(&vfio.device_cdev, &vfio_device_fops);
> > +	ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK +
> 1);
> > +	if (ret)
> > +		goto err_cdev_add;
> 
> Huh? This is not how cdevs are used. This patch needs rewriting.
> 
> The struct vfio_device should gain a 'struct device' and 'struct cdev'
> as non-pointer members
> 
> vfio register path should end up doing cdev_device_add() for each
> vfio_device
> 
> vfio_unregister path should do cdev_device_del()
> 
> No idr should be needed, an ida is used to allocate minor numbers
> 
> The struct device release function should trigger a kfree which
> requires some reworking of the callers
> 
> vfio_init_group_dev() should do a device_initialize()
> vfio_uninit_group_dev() should do a device_put()

All above are good suggestions!

> 
> The opened atomic is aweful. A newly created fd should start in a
> state where it has a disabled fops
> 
> The only thing the disabled fops can do is register the device to the
> iommu fd. When successfully registered the device gets the normal fops.
> 
> The registration steps should be done under a normal lock inside the
> vfio_device. If a vfio_device is already registered then further
> registration should fail.
> 
> Getting the device fd via the group fd triggers the same sequence as
> above.
> 

Above works if the group interface is also connected to iommufd, i.e.
making vfio type1 as a shim. In this case we can use the registration
status as the exclusive switch. But if we keep vfio type1 separate as
today, then a new atomic is still necessary. This all depends on how
we want to deal with vfio type1 and iommufd, and possibly what's
discussed here just adds another pound to the shim option...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-21 23:10     ` Tian, Kevin
@ 2021-09-22  0:53       ` Jason Gunthorpe
  2021-09-22  0:59         ` Tian, Kevin
  2021-09-22  9:23         ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22  0:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 12:01 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > options to expose its device to userspace:
> > >
> > > a)  only legacy group interface, for devices which haven't been moved to
> > >     iommufd (e.g. platform devices, sw mdev, etc.);
> > >
> > > b)  both legacy group interface and new device-centric interface, for
> > >     devices which supports iommufd but also wants to keep backward
> > >     compatibility (e.g. pci devices in this RFC);
> > >
> > > c)  only new device-centric interface, for new devices which don't carry
> > >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > 
> > We shouldn't have 'b'? Where does it come from?
> 
> a vfio-pci device can be opened via the existing group interface. if no b) it 
> means legacy vfio userspace can never use vfio-pci device any more
> once the latter is moved to iommufd.

Sorry, I think I ment a, which I guess you will say is SW mdev devices

But even so, I think the way forward here is to still always expose
the device /dev/vfio/devices/X and some devices may not allow iommufd
usage initially.

Providing an ioctl to bind to a normal VFIO container or group might
allow a reasonable fallback in userspace..

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-21 16:01   ` Jason Gunthorpe
  2021-09-21 23:10     ` Tian, Kevin
@ 2021-09-22  0:54     ` Tian, Kevin
  2021-09-22  1:00       ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  0:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 12:01 AM
> 
> >  One open about how to organize the device nodes under
> /dev/vfio/devices/.
> > This RFC adopts a simple policy by keeping a flat layout with mixed
> devname
> > from all kinds of devices. The prerequisite of this model is that devnames
> > from different bus types are unique formats:
> 
> This isn't reliable, the devname should just be vfio0, vfio1, etc
> 
> The userspace can learn the correct major/minor by inspecting the
> sysfs.
> 
> This whole concept should disappear into the prior patch that adds the
> struct device in the first place, and I think most of the code here
> can be deleted once the struct device is used properly.
> 

Can you help elaborate above flow? This is one area where we need
more guidance.

When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
how does Qemu identify which vifo0/1/... is associated with the specified 
DDDD:BB:DD.F? 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-21 23:56     ` Tian, Kevin
@ 2021-09-22  0:55       ` Jason Gunthorpe
  2021-09-22  1:07         ` Tian, Kevin
  2021-09-22  3:22         ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22  0:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > The opened atomic is aweful. A newly created fd should start in a
> > state where it has a disabled fops
> > 
> > The only thing the disabled fops can do is register the device to the
> > iommu fd. When successfully registered the device gets the normal fops.
> > 
> > The registration steps should be done under a normal lock inside the
> > vfio_device. If a vfio_device is already registered then further
> > registration should fail.
> > 
> > Getting the device fd via the group fd triggers the same sequence as
> > above.
> > 
> 
> Above works if the group interface is also connected to iommufd, i.e.
> making vfio type1 as a shim. In this case we can use the registration
> status as the exclusive switch. But if we keep vfio type1 separate as
> today, then a new atomic is still necessary. This all depends on how
> we want to deal with vfio type1 and iommufd, and possibly what's
> discussed here just adds another pound to the shim option...

No, it works the same either way, the group FD path is identical to
the normal FD path, it just triggers some of the state transitions
automatically internally instead of requiring external ioctls.

The device FDs starts disabled, an internal API binds it to the iommu
via open coding with the group API, and then the rest of the APIs can
be enabled. Same as today.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-21 19:56   ` Alex Williamson
@ 2021-09-22  0:56     ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  0:56 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: jgg, hch, jasowang, joro, jean-philippe, parav, lkml, pbonzini,
	lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian,
	Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 22, 2021 3:56 AM
> 
> On Sun, 19 Sep 2021 14:38:30 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> >
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
> >  include/linux/vfio.h |   2 +
> >  2 files changed, 213 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 02cc51ce6891..84436d7abedd 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> ...
> > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
> >  	.mode = S_IRUGO | S_IWUGO,
> >  };
> >
> > +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> > +{
> > +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
> > +}
> 
> dev_name() doesn't provide us with any uniqueness guarantees, so this
> could potentially generate naming conflicts.  The similar scheme for
> devices within an iommu group appends an instance number if a conflict
> occurs, but that solution doesn't work here where the name isn't just a
> link to the actual device.  Devices within an iommu group are also
> likely associated within a bus_type, so the potential for conflict is
> pretty negligible, that's not the case as vfio is adopted for new
> device types.  Thanks,
> 

This is also our concern. Thanks for confirming it. Appreciate if you
can help think out some better alternative to deal with it.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  0:53       ` Jason Gunthorpe
@ 2021-09-22  0:59         ` Tian, Kevin
  2021-09-22  9:23         ` Tian, Kevin
  1 sibling, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  0:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:54 AM
> 
> On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > > options to expose its device to userspace:
> > > >
> > > > a)  only legacy group interface, for devices which haven't been moved
> to
> > > >     iommufd (e.g. platform devices, sw mdev, etc.);
> > > >
> > > > b)  both legacy group interface and new device-centric interface, for
> > > >     devices which supports iommufd but also wants to keep backward
> > > >     compatibility (e.g. pci devices in this RFC);
> > > >
> > > > c)  only new device-centric interface, for new devices which don't carry
> > > >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > >
> > > We shouldn't have 'b'? Where does it come from?
> >
> > a vfio-pci device can be opened via the existing group interface. if no b) it
> > means legacy vfio userspace can never use vfio-pci device any more
> > once the latter is moved to iommufd.
> 
> Sorry, I think I ment a, which I guess you will say is SW mdev devices

We listed a) here in case we don't want to move all vfio device types to
use iommufd in one breath. It's supposed to be a type valid only in this
transition phase. In the end only b) and c) are valid.

> 
> But even so, I think the way forward here is to still always expose
> the device /dev/vfio/devices/X and some devices may not allow iommufd
> usage initially.
> 
> Providing an ioctl to bind to a normal VFIO container or group might
> allow a reasonable fallback in userspace..
> 

but doesn't a new ioctl still imply breaking existing vfio userspace?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  0:54     ` Tian, Kevin
@ 2021-09-22  1:00       ` Jason Gunthorpe
  2021-09-22  1:02         ` Tian, Kevin
                           ` (2 more replies)
  0 siblings, 3 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22  1:00 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 12:01 AM
> > 
> > >  One open about how to organize the device nodes under
> > /dev/vfio/devices/.
> > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > devname
> > > from all kinds of devices. The prerequisite of this model is that devnames
> > > from different bus types are unique formats:
> > 
> > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > 
> > The userspace can learn the correct major/minor by inspecting the
> > sysfs.
> > 
> > This whole concept should disappear into the prior patch that adds the
> > struct device in the first place, and I think most of the code here
> > can be deleted once the struct device is used properly.
> > 
> 
> Can you help elaborate above flow? This is one area where we need
> more guidance.
> 
> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> how does Qemu identify which vifo0/1/... is associated with the specified 
> DDDD:BB:DD.F? 

When done properly in the kernel the file:

/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev

Will contain the major:minor of the VFIO device.

Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
that the major:minor matches.

in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
to qemu.

You can look at this for some general over engineered code to handle
opening from a sysfs handle like above:

https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  1:00       ` Jason Gunthorpe
@ 2021-09-22  1:02         ` Tian, Kevin
  2021-09-23  7:25         ` Eric Auger
  2021-09-29  2:46         ` david
  2 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 9:00 AM
> 
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > >  One open about how to organize the device nodes under
> > > /dev/vfio/devices/.
> > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > devname
> > > > from all kinds of devices. The prerequisite of this model is that
> devnames
> > > > from different bus types are unique formats:
> > >
> > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > >
> > > The userspace can learn the correct major/minor by inspecting the
> > > sysfs.
> > >
> > > This whole concept should disappear into the prior patch that adds the
> > > struct device in the first place, and I think most of the code here
> > > can be deleted once the struct device is used properly.
> > >
> >
> > Can you help elaborate above flow? This is one area where we need
> > more guidance.
> >
> > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > how does Qemu identify which vifo0/1/... is associated with the specified
> > DDDD:BB:DD.F?
> 
> When done properly in the kernel the file:
> 
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> 
> Will contain the major:minor of the VFIO device.
> 
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.

ah, that's the trick.

> 
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.
> 
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
> 
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> 

will check. Thanks for suggestion.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  0:55       ` Jason Gunthorpe
@ 2021-09-22  1:07         ` Tian, Kevin
  2021-09-22 12:31           ` Jason Gunthorpe
  2021-09-22  3:22         ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:55 AM
> 
> On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > The opened atomic is aweful. A newly created fd should start in a
> > > state where it has a disabled fops
> > >
> > > The only thing the disabled fops can do is register the device to the
> > > iommu fd. When successfully registered the device gets the normal fops.
> > >
> > > The registration steps should be done under a normal lock inside the
> > > vfio_device. If a vfio_device is already registered then further
> > > registration should fail.
> > >
> > > Getting the device fd via the group fd triggers the same sequence as
> > > above.
> > >
> >
> > Above works if the group interface is also connected to iommufd, i.e.
> > making vfio type1 as a shim. In this case we can use the registration
> > status as the exclusive switch. But if we keep vfio type1 separate as
> > today, then a new atomic is still necessary. This all depends on how
> > we want to deal with vfio type1 and iommufd, and possibly what's
> > discussed here just adds another pound to the shim option...
> 
> No, it works the same either way, the group FD path is identical to
> the normal FD path, it just triggers some of the state transitions
> automatically internally instead of requiring external ioctls.
> 
> The device FDs starts disabled, an internal API binds it to the iommu
> via open coding with the group API, and then the rest of the APIs can
> be enabled. Same as today.
> 

Still a bit confused. if vfio type1 also connects to iommufd, whether 
the device is registered can be centrally checked based on whether
an iommu_ctx is recorded. But if type1 doesn't talk to iommufd at
all, don't we still need introduce a new state (calling it 'opened' or
'registered') to protect the two interfaces? In this case what is the
point of keeping device FD disabled even for the group path?

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 21:09     ` Alex Williamson
  2021-09-21 21:58       ` Jason Gunthorpe
@ 2021-09-22  1:19       ` Tian, Kevin
  2021-09-22 21:17         ` Alex Williamson
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:19 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 22, 2021 5:09 AM
> 
> On Tue, 21 Sep 2021 13:40:01 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > be compatiable with existing users, vfio-pci exposes both legacy group
> > > interface and device-centric interface.
> > >
> > > As explained in last patch, this change doesn't apply to devices which
> > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > are still expected to be opened via the legacy group interface.
> 
> This doesn't make much sense to me.  The previous patch indicates
> there's work to be done in updating the kvm-vfio contract to understand
> DMA coherency, so you're trying to limit use cases to those where the
> IOMMU enforces coherency, but there's QEMU work to be done to support
> the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> will be told about non-coherent devices rather than "meh, skip it in the
> kernel"?  Also let's not forget that vfio is not only for KVM.

The policy here is that VFIO will not expose such devices (no enforce-snoop)
in the new device hierarchy at all. In this case QEMU will fall back to the
group interface automatically and then rely on the existing contract to connect 
vfio and QEMU. It doesn't need to care about the whatever new contract
until such devices are exposed in the new interface.

yes, vfio is not only for KVM. But here it's more a task split based on staging
consideration. imo it's not necessary to further split task into supporting
non-snoop device for userspace driver and then for kvm.


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 21:58       ` Jason Gunthorpe
@ 2021-09-22  1:24         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 5:59 AM
> 
> On Tue, Sep 21, 2021 at 03:09:29PM -0600, Alex Williamson wrote:
> 
> > the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> > will be told about non-coherent devices rather than "meh, skip it in the
> > kernel"?  Also let's not forget that vfio is not only for KVM.
> 
> vfio is not only for KVM, but AFIACT the wbinv stuff is only for
> KVM... But yes, I agree this should be sorted out at this stage

If such devices are even not exposed in the new hierarchy at this stage,
suppose sorting it out later should be fine?

> 
> > > > When the device is opened via /dev/vfio/devices, vfio-pci should
> prevent
> > > > the user from accessing the assigned device because the device is still
> > > > attached to the default domain which may allow user-initiated DMAs to
> > > > touch arbitrary place. The user access must be blocked until the device
> > > > is later bound to an iommufd (see patch 08). The binding acts as the
> > > > contract for putting the device in a security context which ensures user-
> > > > initiated DMAs via this device cannot harm the rest of the system.
> > > >
> > > > This patch introduces a vdev->block_access flag for this purpose. It's set
> > > > when the device is opened via /dev/vfio/devices and cleared after
> binding
> > > > to iommufd succeeds. mmap and r/w handlers check this flag to decide
> whether
> > > > user access should be blocked or not.
> > >
> > > This should not be in vfio_pci.
> > >
> > > AFAIK there is no condition where a vfio driver can work without being
> > > connected to some kind of iommu back end, so the core code should
> > > handle this interlock globally. A vfio driver's ops should not be
> > > callable until the iommu is connected.
> > >
> > > The only vfio_pci patch in this series should be adding a new callback
> > > op to take in an iommufd and register the pci_device as a iommufd
> > > device.
> >
> > Couldn't the same argument be made that registering a $bus device as an
> > iommufd device is a common interface that shouldn't be the
> > responsibility of the vfio device driver?
> 
> The driver needs enough involvment to signal what kind of IOMMU
> connection it wants, eg attaching to a physical device will use the
> iofd_attach_device() path, but attaching to a SW page table should use
> a different API call. PASID should also be different.

Exactly

> 
> Possibly a good arrangement is to have the core provide some generic
> ioctl ops functions 'vfio_all_device_iommufd_bind' that everything
> except mdev drivers can use so the code is all duplicated.

Could this be an future enhancement when we have multiple device
types supporting iommufd?

> 
> > non-group device anything more than a reservation of that device if
> > access is withheld until iommu isolation?  I also don't really want to
> > predict how ioctls might evolve to guess whether only blocking .read,
> > .write, and .mmap callbacks are sufficient.  Thanks,
> 
> This is why I said the entire fops should be blocked in a dummy fops
> so the core code the vfio_device FD parked and userspace unable to
> access the ops until device attachment and thus IOMMU ioslation is
> completed.
> 
> Simple and easy to reason about, a parked FD is very similar to a
> closed FD.
> 

This rationale makes sense. Just the open how to handle exclusive
open between group and nongroup interfaces still needs some
more clarification here, especially about what a parked FD means
for the group interface (where parking is unnecessary since the 
security context is already established before the device is opened)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-21 17:09   ` Jason Gunthorpe
@ 2021-09-22  1:47     ` Tian, Kevin
  2021-09-22 12:39       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:10 AM
> 
> On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > This extends iommu core to manage security context for passthrough
> > devices. Please bear a long explanation for how we reach this design
> > instead of managing it solely in iommufd like what vfio does today.
> >
> > Devices which cannot be isolated from each other are organized into an
> > iommu group. When a device is assigned to the user space, the entire
> > group must be put in a security context so that user-initiated DMAs via
> > the assigned device cannot harm the rest of the system. No user access
> > should be granted on a device before the security context is established
> > for the group which the device belongs to.
> 
> > Managing the security context must meet below criteria:
> >
> > 1)  The group is viable for user-initiated DMAs. This implies that the
> >     devices in the group must be either bound to a device-passthrough
> 
> s/a/the same/
> 
> >     framework, or driver-less, or bound to a driver which is known safe
> >     (not do DMA).
> >
> > 2)  The security context should only allow DMA to the user's memory and
> >     devices in this group;
> >
> > 3)  After the security context is established for the group, the group
> >     viability must be continuously monitored before the user relinquishes
> >     all devices belonging to the group. The viability might be broken e.g.
> >     when a driver-less device is later bound to a driver which does DMA.
> >
> > 4)  The security context should not be destroyed before user access
> >     permission is withdrawn.
> >
> > Existing vfio introduces explicit container/group semantics in its uAPI
> > to meet above requirements. A single security context (iommu domain)
> > is created per container. Attaching group to container moves the entire
> > group into the associated security context, and vice versa. The user can
> > open the device only after group attach. A group can be detached only
> > after all devices in the group are closed. Group viability is monitored
> > by listening to iommu group events.
> >
> > Unlike vfio, iommufd adopts a device-centric design with all group
> > logistics hidden behind the fd. Binding a device to iommufd serves
> > as the contract to get security context established (and vice versa
> > for unbinding). One additional requirement in iommufd is to manage the
> > switch between multiple security contexts due to decoupled bind/attach:
> 
> This should be a precursor series that actually does clean things up
> properly. There is no reason for vfio and iommufd to differ here, if
> we are implementing this logic into the iommu layer then it should be
> deleted from the VFIO layer, not left duplicated like this.

make sense

> 
> IIRC in VFIO the container is the IOAS and when the group goes to
> create the device fd it should simply do the
> iommu_device_init_user_dma() followed immediately by a call to bind
> the container IOAS as your #3.

a slight correction.

to meet vfio semantics we could do init_user_dma() at group attach
time and then call binding to container IOAS when the device fd
is created. This is because vfio requires the group in a security context
before the device is opened. 

> 
> Then delete all the group viability stuff from vfio, relying on the
> iommu to do it.
> 
> It should have full symmetry with the iommufd.

agree

> 
> > @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct
> notifier_block *nb,
> >  		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
> >  		break;
> >  	case BUS_NOTIFY_BOUND_DRIVER:
> > +		/*
> > +		 * FIXME: Alternatively the attached drivers could generically
> > +		 * indicate to the iommu layer that they are safe for keeping
> > +		 * the iommu group user viable by calling some function
> around
> > +		 * probe(). We could eliminate this gross BUG_ON() by
> denying
> > +		 * probe to non-iommu-safe driver.
> > +		 */
> > +		mutex_lock(&group->mutex);
> > +		if (group->user_dma_owner_id)
> > +			BUG_ON(!iommu_group_user_dma_viable(group));
> > +		mutex_unlock(&group->mutex);
> 
> And the mini-series should fix this BUG_ON properly by interlocking
> with the driver core to simply refuse to bind a driver under these
> conditions instead of allowing userspace to crash the kernel.
> 
> That alone would be justification enough to merge this work.

yes

> 
> > +
> > +/*
> > + * IOMMU core interfaces for iommufd.
> > + */
> > +
> > +/*
> > + * FIXME: We currently simply follow vifo policy to mantain the group's
> > + * viability to user. Eventually, we should avoid below hard-coded list
> > + * by letting drivers indicate to the iommu layer that they are safe for
> > + * keeping the iommu group's user aviability.
> > + */
> > +static const char * const iommu_driver_allowed[] = {
> > +	"vfio-pci",
> > +	"pci-stub"
> > +};
> 
> Yuk. This should be done with some callback in those drivers
> 'iomm_allow_user_dma()"
> 
> Ie the basic flow would see the driver core doing some:

Just double confirm. Is there concern on having the driver core to
call iommu functions? 

> 
>  ret = iommu_doing_kernel_dma()
>  if (ret) do not bind
>  driver_bind
>   pci_stub_probe()
>      iommu_allow_user_dma()
> 
> And the various functions are manipulating some atomic.
>  0 = nothing happening
>  1 = kernel DMA
>  2 = user DMA
> 
> No BUG_ON.
> 
> Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-21 15:41   ` Jason Gunthorpe
@ 2021-09-22  1:51     ` Tian, Kevin
  2021-09-22 12:40       ` Jason Gunthorpe
  2021-10-15  9:18     ` Liu, Yi L
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 21, 2021 11:42 PM
> 
>  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
>    not need locking (order it properly too, it is in the wrong order), and
>    don't check for duplicate devices or dev_cookie duplication, that
>    is user error and is harmless to the kernel.
> 

I'm confused here. yes it's user error, but we check so many user errors
and then return -EINVAL, -EBUSY, etc. Why is this one special?

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-21 16:19   ` Jason Gunthorpe
@ 2021-09-22  2:31     ` Lu Baolu
  2021-09-22  5:07       ` Christoph Hellwig
  0 siblings, 1 reply; 280+ messages in thread
From: Lu Baolu @ 2021-09-22  2:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu Yi L
  Cc: baolu.lu, alex.williamson, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

Hi Jason,

On 9/22/21 12:19 AM, Jason Gunthorpe wrote:
> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>
>> This provides an interface for upper layers to get the per-device iommu
>> attributes.
>>
>>      int iommu_device_get_info(struct device *dev,
>>                                enum iommu_devattr attr, void *data);
> 
> Can't we use properly typed ops and functions here instead of a void
> *data?
> 
> get_snoop()
> get_page_size()
> get_addr_width()

Yeah! Above are more friendly to the upper layer callers.

> 
> ?
> 
> Jason
> 

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  0:55       ` Jason Gunthorpe
  2021-09-22  1:07         ` Tian, Kevin
@ 2021-09-22  3:22         ` Tian, Kevin
  2021-09-22 12:50           ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Tian, Kevin
> Sent: Wednesday, September 22, 2021 9:07 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:55 AM
> >
> > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > The opened atomic is aweful. A newly created fd should start in a
> > > > state where it has a disabled fops
> > > >
> > > > The only thing the disabled fops can do is register the device to the
> > > > iommu fd. When successfully registered the device gets the normal fops.
> > > >
> > > > The registration steps should be done under a normal lock inside the
> > > > vfio_device. If a vfio_device is already registered then further
> > > > registration should fail.
> > > >
> > > > Getting the device fd via the group fd triggers the same sequence as
> > > > above.
> > > >
> > >
> > > Above works if the group interface is also connected to iommufd, i.e.
> > > making vfio type1 as a shim. In this case we can use the registration
> > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > today, then a new atomic is still necessary. This all depends on how
> > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > discussed here just adds another pound to the shim option...
> >
> > No, it works the same either way, the group FD path is identical to
> > the normal FD path, it just triggers some of the state transitions
> > automatically internally instead of requiring external ioctls.
> >
> > The device FDs starts disabled, an internal API binds it to the iommu
> > via open coding with the group API, and then the rest of the APIs can
> > be enabled. Same as today.
> >

After reading your comments on patch08, I may have a clearer picture
on your suggestion. The key is to handle exclusive access at the binding
time (based on vdev->iommu_dev). Please see whether below makes 
sense:

Shared sequence:

1)  initialize the device with a parked fops;
2)  need binding (explicit or implicit) to move away from parked fops;
3)  switch to normal fops after successful binding;

1) happens at device probe.

for nongroup 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:

  - 2) is done by calling .bind_iommufd() callback;
  - 3) could be done within .bind_iommufd(), or via a new callback e.g.
    .finalize_device(). The latter may be preferred for the group interface;
  - Two threads may open the same device simultaneously, with exclusive 
    access guaranteed by iommufd_bind_device();
  - Open() after successful binding is rejected, since normal fops has been
    activated. This is checked upon vdev->iommu_dev;

for group 2/3) are done together in VFIO_GROUP_GET_DEVICE_FD:

  - 2) is done by open coding bind_iommufd + attach_ioas. Create an 
    iommufd_device object and record it to vdev->iommu_dev
  - 3) is done by calling .finalize_device();
  - open() after a valid vdev->iommu_dev is rejected. this also ensures
    exclusive ownership with the nongroup path.

If Alex also agrees with it, this might be another mini-series to be merged
(just for group path) before this one. Doing so sort of nullifies the existing
group/container attaching process, where attach_ioas will be skipped and
now the security context is established when the device is opened.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
  2021-09-21 13:45 ` Jason Gunthorpe
@ 2021-09-22  3:25   ` Liu, Yi L
  0 siblings, 0 replies; 280+ messages in thread
From: Liu, Yi L @ 2021-09-22  3:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 21, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> > Linux now includes multiple device-passthrough frameworks (e.g. VFIO
> and
> > vDPA) to manage secure device access from the userspace. One critical
> task
> > of those frameworks is to put the assigned device in a secure, IOMMU-
> > protected context so user-initiated DMAs are prevented from doing harm
> to
> > the rest of the system.
> 
> Some bot will probably send this too, but it has compile warnings and
> needs to be rebased to 5.15-rc1

thanks Jason, will fix the warnings. yeah, I was using 5.14 in the test, will
rebase to 5.15-rc# in next version.

Regards,
Yi Liu

> drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used
> uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
>         if (refcount_read(&ioas->refs) > 1) {
>             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs
> here
>         return ret;
>                ^~~
> drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its
> condition is always true
>         if (refcount_read(&ioas->refs) > 1) {
>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret'
> to silence this warning
>         int ioasid, ret;
>                        ^
>                         = 0
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used
> uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
>                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs
> here
>         return ERR_PTR(ret);
>                        ^~~
> drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its
> condition is always false
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> 
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used
> uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
>                     ^~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs
> here
>         return ERR_PTR(ret);
>                        ^~~
> drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its
> condition is always false
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
>                     ^~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret'
> to silence this warning
>         int ret;
>                ^
>                 = 0
> 
> Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-21 17:40   ` Jason Gunthorpe
@ 2021-09-22  3:30     ` Tian, Kevin
  2021-09-22 12:41       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:41 AM
> 
> On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > After a device is bound to the iommufd, userspace can use this interface
> > to query the underlying iommu capability and format info for this device.
> > Based on this information the user then creates I/O address space in a
> > compatible format with the to-be-attached devices.
> >
> > Device cookie which is registered at binding time is used to mark the
> > device which is being queried here.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >  drivers/iommu/iommufd/iommufd.c | 68
> +++++++++++++++++++++++++++++++++
> >  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/iommufd.c
> b/drivers/iommu/iommufd/iommufd.c
> > index e16ca21e4534..641f199f2d41 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode
> *inode, struct file *filep)
> >  	return 0;
> >  }
> >
> > +static struct device *
> > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64
> dev_cookie)
> > +{
> 
> We have an xarray ID for the device, why are we allowing userspace to
> use the dev_cookie as input?
> 
> Userspace should always pass in the ID. The only place dev_cookie
> should appear is if the kernel generates an event back to
> userspace. Then the kernel should return both the ID and the
> dev_cookie in the event to allow userspace to correlate it.
> 

A little background.

In earlier design proposal we discussed two options. One is to return
an kernel-allocated ID (label) to userspace. The other is to have user
register a cookie and use it in iommufd uAPI. At that time the two
options were discussed exclusively and the cookie one is preferred.

Now you instead recommended a mixed option. We can follow it for
sure if nobody objects.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-21 17:44   ` Jason Gunthorpe
@ 2021-09-22  3:40     ` Tian, Kevin
  2021-09-22 14:09       ` Jason Gunthorpe
  2021-10-01  6:15       ` david
  2021-09-22 12:51     ` Liu, Yi L
  2021-10-01  6:13     ` David Gibson
  2 siblings, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci
> devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1
> discussion
> >   and then decide how to incorporate it in v2.
> 
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the

is the hint single-range or could be multiple-ranges?

> actual geometry including any holes via a query.

I'd like to see a detail flow from David on how the uAPI works today with
existing spapr driver and what exact changes he'd like to make on this
proposed interface. Above info is still insufficient for us to think about the
right solution.

> 
> > - Currently ioasid term has already been used in the kernel
> (drivers/iommu/
> >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-
> Stream
> >   ID). We need find a way to resolve the naming conflict between the
> hardware
> >   ID and software handle. One option is to rename the existing ioasid to be
> >   pasid or ssid, given their full names still sound generic. Appreciate more
> >   thoughts on this open!
> 
> ioas works well here I think. Use ioas_id to refer to the xarray
> index.

What about when introducing pasid to this uAPI? Then use ioas_id
for the xarray index and ioasid to represent pasid/ssid? At this point
the software handle and hardware id are mixed together thus need
a clear terminology to differentiate them.


Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-21 17:47   ` Jason Gunthorpe
@ 2021-09-22  3:41     ` Tian, Kevin
  2021-09-22 12:55       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:47 AM
> 
> On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> > As aforementioned, userspace should check extension for what formats
> > can be specified when allocating an IOASID. This patch adds such
> > interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
> > support and no no-snoop support yet.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >  drivers/iommu/iommufd/iommufd.c |  7 +++++++
> >  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
> >  2 files changed, 34 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/iommufd.c
> b/drivers/iommu/iommufd/iommufd.c
> > index 4839f128b24a..e45d76359e34 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file
> *filep,
> >  		return ret;
> >
> >  	switch (cmd) {
> > +	case IOMMU_CHECK_EXTENSION:
> > +		switch (arg) {
> > +		case EXT_MAP_TYPE1V2:
> > +			return 1;
> > +		default:
> > +			return 0;
> > +		}
> >  	case IOMMU_DEVICE_GET_INFO:
> >  		ret = iommufd_get_device_info(ictx, arg);
> >  		break;
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 5cbd300eb0ee..49731be71213 100644
> > +++ b/include/uapi/linux/iommu.h
> > @@ -14,6 +14,33 @@
> >  #define IOMMU_TYPE	(';')
> >  #define IOMMU_BASE	100
> >
> > +/*
> > + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> > + *
> > + * Check whether an uAPI extension is supported.
> > + *
> > + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> > + * in one breath. User should check which uAPI extension is supported
> > + * according to its intended usage.
> > + *
> > + * A rough list of possible extensions may include:
> > + *
> > + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> > + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> > + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> > + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> > + *	- EXT_IOASID_NESTING for what the name stands;
> > + *	- EXT_USER_PAGE_TABLE for user managed page table;
> > + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> > + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> > + *	- ...
> > + *
> > + * Return: 0 if not supported, 1 if supported.
> > + */
> > +#define EXT_MAP_TYPE1V2		1
> > +#define EXT_DMA_NO_SNOOP	2
> > +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE,
> IOMMU_BASE + 0)
> 
> I generally advocate for a 'try and fail' approach to discovering
> compatibility.
> 
> If that doesn't work for the userspace then a query to return a
> generic capability flag is the next best idea. Each flag should
> clearly define what 'try and fail' it is talking about

We don't have strong preference here. Just follow what vfio does
today. So Alex's opinion is appreciated here. 😊

> 
> Eg dma_no_snoop is about creating an IOS with flag NO SNOOP set
> 
> TYPE1V2 seems like nonsense

just in case other mapping protocols are introduced in the future

> 
> Not sure about the others.
> 
> IOW, this should recast to a generic 'query capabilities' IOCTL
> 
> Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-21 18:02   ` Jason Gunthorpe
@ 2021-09-22  3:53     ` Tian, Kevin
  2021-09-22 12:57       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 2:02 AM
> 
> > +static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
> > +					   struct device *dev)
> > +{
> > +	bool snoop = false;
> > +	u32 addr_width;
> > +	int ret;
> > +
> > +	/*
> > +	 * currently we only support I/O page table with iommu enforce-
> snoop
> > +	 * format. Attaching a device which doesn't support this format in its
> > +	 * upstreaming iommu is rejected.
> > +	 */
> > +	ret = iommu_device_get_info(dev,
> IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
> > +	if (ret || !snoop)
> > +		return -EINVAL;
> > +
> > +	ret = iommu_device_get_info(dev,
> IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
> > +	if (ret || addr_width < ioas->addr_width)
> > +		return -EINVAL;
> > +
> > +	/* TODO: also need to check permitted iova ranges and pgsize
> bitmap */
> > +
> > +	return 0;
> > +}
> 
> This seems kind of weird..
> 
> I expect the iommufd to hold a SW copy of the IO page table and each
> time a new domain is to be created it should push the SW copy into the
> domain. If the domain cannot support it then the domain driver should
> naturally fail a request.
> 
> When the user changes the IO page table the SW copy is updated then
> all of the domains are updated too - again if any domain cannot
> support the change then it fails and the change is rolled back.
> 
> It seems like this is a side effect of roughly hacking in the vfio
> code?

Actually this was one open we closed in previous design proposal, but
looks you have a different thought now.

vfio maintains one ioas per container. Devices in the container
can be attached to different domains (e.g. due to snoop format). Every
time when the ioas is updated, every attached domain is updated
in accordance. 

You recommended one-ioas-one-domain model instead, i.e. any device 
with a format incompatible with the one currently used in ioas has to 
be attached to a new ioas, even if the two ioas's have the same mapping.
This leads to compatibility check at attaching time.

Now you want returning back to the vfio model?

> 
> > +	/*
> > +	 * Each ioas is backed by an iommu domain, which is allocated
> > +	 * when the ioas is attached for the first time and then shared
> > +	 * by following devices.
> > +	 */
> > +	if (list_empty(&ioas->device_list)) {
> 
> Seems strange, what if the devices are forced to have different
> domains? We don't want to model that in the SW layer..

this is due to above background

> 
> > +	/* Install the I/O page table to the iommu for this device */
> > +	ret = iommu_attach_device(domain, idev->dev);
> > +	if (ret)
> > +		goto out_domain;
> 
> This is where things start to get confusing when you talk about PASID
> as the above call needs to be some PASID centric API.

yes, for pasid new api (e.g. iommu_attach_device_pasid()) will be added.

but here we only talk about physical device, and iommu_attach_device()
is only for physical device.

> 
> > @@ -27,6 +28,16 @@ struct iommufd_device *
> >  iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
> >  void iommufd_unbind_device(struct iommufd_device *idev);
> >
> > +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int
> ioasid);
> > +void iommufd_device_detach_ioasid(struct iommufd_device *idev, int
> ioasid);
> > +
> > +static inline int
> > +__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
> > +				   struct iommufd_device *idev, int ioasid)
> > +{
> > +	return iommufd_device_attach_ioasid(idev, ioasid);
> > +}
> 
> If think sis taking in the iommfd_device then there isn't a logical
> place to signal the PCIness

can you elaborate?

> 
> But, I think the API should at least have a kdoc that this is
> capturing the entire device and specify that for PCI this means all
> TLPs with the RID.
> 

yes, this should be documented.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-21 18:04   ` Jason Gunthorpe
@ 2021-09-22  3:56     ` Tian, Kevin
  2021-09-22 12:58       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 2:04 AM
> 
> On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > This patch adds interface for userspace to attach device to specified
> > IOASID.
> >
> > Note:
> > One device can only be attached to one IOASID in this version. This is
> > on par with what vfio provides today. In the future this restriction can
> > be relaxed when multiple I/O address spaces are supported per device
> 
> ?? In VFIO the container is the IOS and the container can be shared
> with multiple devices. This needs to start at about the same
> functionality.

a device can be only attached to one container. One container can be
shared by multiple devices.

a device can be only attached to one IOASID. One IOASID can be shared
by multiple devices.

it does start at the same functionality.

> 
> > +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> 
> This should be in the core code, right? There is nothing PCI specific
> here.
> 

but if you insist on a pci-wrapper attach function, we still need something
here (e.g. with .attach_ioasid() callback)?

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing
  2021-09-21 18:14   ` Jason Gunthorpe
@ 2021-09-22  3:57     ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 2:15 AM
> 
> On Sun, Sep 19, 2021 at 02:38:44PM +0800, Liu Yi L wrote:
> > [HACK. will fix in v2]
> >
> > There are two options to impelement vfio type1v2 mapping semantics in
> > /dev/iommu.
> >
> > One is to duplicate the related code from vfio as the starting point,
> > and then merge with vfio type1 at a later time. However
> vfio_iommu_type1.c
> > has over 3000LOC with ~80% related to dma management logic, including:
> 
> I can't really see a way forward like this. I think some scheme to
> move the vfio datastructure is going to be necessary.
> 
> > - the dma map/unmap metadata management
> > - page pinning, and related accounting
> > - iova range reporting
> > - dirty bitmap retrieving
> > - dynamic vaddr update, etc.
> 
> All of this needs to be part of the iommufd anyhow..

yes

> 
> > The alternative is to consolidate type1v2 logic in /dev/iommu immediately,
> > which requires converting vfio_iommu_type1 to be a shim driver.
> 
> Another choice is the the datastructure coulde move and the two
> drivers could share its code and continue to exist more independently
> 

where to put the shared code?

btw this is one major open that I plan to discuss in LPC. 😊

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-22  2:31     ` Lu Baolu
@ 2021-09-22  5:07       ` Christoph Hellwig
  0 siblings, 0 replies; 280+ messages in thread
From: Christoph Hellwig @ 2021-09-22  5:07 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Jason Gunthorpe, Liu Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu,
	dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, david, nicolinc

On Wed, Sep 22, 2021 at 10:31:47AM +0800, Lu Baolu wrote:
> Hi Jason,
>
> On 9/22/21 12:19 AM, Jason Gunthorpe wrote:
>> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>>
>>> This provides an interface for upper layers to get the per-device iommu
>>> attributes.
>>>
>>>      int iommu_device_get_info(struct device *dev,
>>>                                enum iommu_devattr attr, void *data);
>>
>> Can't we use properly typed ops and functions here instead of a void
>> *data?
>>
>> get_snoop()
>> get_page_size()
>> get_addr_width()
>
> Yeah! Above are more friendly to the upper layer callers.

The other option would be a struct with all the attributes.  Still
type safe, but not as many methods.  It'll require a little boilerplate
in the callers, though.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  0:53       ` Jason Gunthorpe
  2021-09-22  0:59         ` Tian, Kevin
@ 2021-09-22  9:23         ` Tian, Kevin
  2021-09-22 12:22           ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22  9:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:54 AM
> 
> On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > > options to expose its device to userspace:
> > > >
> > > > a)  only legacy group interface, for devices which haven't been moved
> to
> > > >     iommufd (e.g. platform devices, sw mdev, etc.);
> > > >
> > > > b)  both legacy group interface and new device-centric interface, for
> > > >     devices which supports iommufd but also wants to keep backward
> > > >     compatibility (e.g. pci devices in this RFC);
> > > >
> > > > c)  only new device-centric interface, for new devices which don't carry
> > > >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > >
> > > We shouldn't have 'b'? Where does it come from?
> >
> > a vfio-pci device can be opened via the existing group interface. if no b) it
> > means legacy vfio userspace can never use vfio-pci device any more
> > once the latter is moved to iommufd.
> 
> Sorry, I think I ment a, which I guess you will say is SW mdev devices
> 
> But even so, I think the way forward here is to still always expose
> the device /dev/vfio/devices/X and some devices may not allow iommufd
> usage initially.

After another thought this should work. Following your comments in
other places, we'll move the handling of BIND_IOMMUFD to vfio core
which then invoke .bind_iommufd() from the driver. For devices which
don't allow iommufd now, the callback is null thus an error is returned.

This leaves the userspace in a try-and-fail mode. It first opens the device
fd and iommufd, and then try to connect the two together. If failed then
fallback to the legacy group interface.

Then we don't need a) at all. and we can even avoid introducing new
vfio_[un]register_device() at this point. Just leverage existing 
vfio_[un]register_group_dev() to cover b). new helpers can be introduced
later when c) is supported.

> 
> Providing an ioctl to bind to a normal VFIO container or group might
> allow a reasonable fallback in userspace..
> 

I didn't get this point though. An error in binding already allows the
user to fall back to the group path. Why do we need introduce another
ioctl to explicitly bind to container via the nongroup interface? 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  9:23         ` Tian, Kevin
@ 2021-09-22 12:22           ` Jason Gunthorpe
  2021-09-22 13:44             ` Tian, Kevin
  2021-09-22 20:10             ` Alex Williamson
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:

> > Providing an ioctl to bind to a normal VFIO container or group might
> > allow a reasonable fallback in userspace..
> 
> I didn't get this point though. An error in binding already allows the
> user to fall back to the group path. Why do we need introduce another
> ioctl to explicitly bind to container via the nongroup interface? 

New userspace still needs a fallback path if it hits the 'try and
fail'. Keeping the device FD open and just using a different ioctl to
bind to a container/group FD, which new userspace can then obtain as a
fallback, might be OK.

Hard to see without going through the qemu parts, so maybe just keep
it in mind

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  1:07         ` Tian, Kevin
@ 2021-09-22 12:31           ` Jason Gunthorpe
  0 siblings, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:07:11AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:55 AM
> > 
> > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > The opened atomic is aweful. A newly created fd should start in a
> > > > state where it has a disabled fops
> > > >
> > > > The only thing the disabled fops can do is register the device to the
> > > > iommu fd. When successfully registered the device gets the normal fops.
> > > >
> > > > The registration steps should be done under a normal lock inside the
> > > > vfio_device. If a vfio_device is already registered then further
> > > > registration should fail.
> > > >
> > > > Getting the device fd via the group fd triggers the same sequence as
> > > > above.
> > > >
> > >
> > > Above works if the group interface is also connected to iommufd, i.e.
> > > making vfio type1 as a shim. In this case we can use the registration
> > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > today, then a new atomic is still necessary. This all depends on how
> > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > discussed here just adds another pound to the shim option...
> > 
> > No, it works the same either way, the group FD path is identical to
> > the normal FD path, it just triggers some of the state transitions
> > automatically internally instead of requiring external ioctls.
> > 
> > The device FDs starts disabled, an internal API binds it to the iommu
> > via open coding with the group API, and then the rest of the APIs can
> > be enabled. Same as today.
> > 
> 
> Still a bit confused. if vfio type1 also connects to iommufd, whether 
> the device is registered can be centrally checked based on whether
> an iommu_ctx is recorded. But if type1 doesn't talk to iommufd at
> all, don't we still need introduce a new state (calling it 'opened' or
> 'registered') to protect the two interfaces? 

The "new state" is if the fops are pointing at the real fops or the
pre-fops, which in turn protects everything. You could imagine this as
some state in front of every fop call if you want.

> In this case what is the point of keeping device FD disabled even
> for the group path?

I have a feeling when you go through the APIs it will make sense to
have some symmetry here.

eg creating a device FD should have basically the same flow no matter
what triggers it, not confusing special cases where the group code
skips steps

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-22  1:47     ` Tian, Kevin
@ 2021-09-22 12:39       ` Jason Gunthorpe
  2021-09-22 13:56         ` Tian, Kevin
  2021-09-27  9:42         ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:47:05AM +0000, Tian, Kevin wrote:

> > IIRC in VFIO the container is the IOAS and when the group goes to
> > create the device fd it should simply do the
> > iommu_device_init_user_dma() followed immediately by a call to bind
> > the container IOAS as your #3.
> 
> a slight correction.
> 
> to meet vfio semantics we could do init_user_dma() at group attach
> time and then call binding to container IOAS when the device fd
> is created. This is because vfio requires the group in a security context
> before the device is opened. 

Is it? Until a device FD is opened the group fd is kind of idle, right?

> > Ie the basic flow would see the driver core doing some:
> 
> Just double confirm. Is there concern on having the driver core to
> call iommu functions? 

It is always an interesting question, but I'd say iommu is
foundantional to Linux and if it needs driver core help it shouldn't
be any different from PM, pinctl, or other subsystems that have
inserted themselves into the driver core.

Something kind of like the below.

If I recall, once it is done like this then the entire iommu notifier
infrastructure can be ripped out which is a lot of code.


diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f949daa90..e39612c99c6123 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
                goto done;
        }
 
+       ret = iommu_set_kernel_ownership(dev);
+       if (ret)
+               return ret;
+
 re_probe:
        dev->driver = drv;
 
@@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
                dev->pm_domain->dismiss(dev);
        pm_runtime_reinit(dev);
        dev_pm_set_driver_flags(dev, 0);
+       iommu_release_kernel_ownership(dev);
 done:
        return ret;
 }
@@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
                        dev->pm_domain->dismiss(dev);
                pm_runtime_reinit(dev);
                dev_pm_set_driver_flags(dev, 0);
+               iommu_release_kernel_ownership(dev);
 
                klist_remove(&dev->p->knode_driver);
                device_pm_check_callbacks(dev);

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-22  1:51     ` Tian, Kevin
@ 2021-09-22 12:40       ` Jason Gunthorpe
  2021-09-22 13:59         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:51:03AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, September 21, 2021 11:42 PM
> > 
> >  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
> >    not need locking (order it properly too, it is in the wrong order), and
> >    don't check for duplicate devices or dev_cookie duplication, that
> >    is user error and is harmless to the kernel.
> > 
> 
> I'm confused here. yes it's user error, but we check so many user errors
> and then return -EINVAL, -EBUSY, etc. Why is this one special?

Because it is expensive to calculate and forces a complicated locking
scheme into the kernel. Without this check you don't need the locking
that spans so much code, and simple RCU becomes acceptable.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-22  3:30     ` Tian, Kevin
@ 2021-09-22 12:41       ` Jason Gunthorpe
  2021-09-29  6:18         ` david
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:30:09AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:41 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > > After a device is bound to the iommufd, userspace can use this interface
> > > to query the underlying iommu capability and format info for this device.
> > > Based on this information the user then creates I/O address space in a
> > > compatible format with the to-be-attached devices.
> > >
> > > Device cookie which is registered at binding time is used to mark the
> > > device which is being queried here.
> > >
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > >  drivers/iommu/iommufd/iommufd.c | 68
> > +++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
> > >  2 files changed, 117 insertions(+)
> > >
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index e16ca21e4534..641f199f2d41 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode
> > *inode, struct file *filep)
> > >  	return 0;
> > >  }
> > >
> > > +static struct device *
> > > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64
> > dev_cookie)
> > > +{
> > 
> > We have an xarray ID for the device, why are we allowing userspace to
> > use the dev_cookie as input?
> > 
> > Userspace should always pass in the ID. The only place dev_cookie
> > should appear is if the kernel generates an event back to
> > userspace. Then the kernel should return both the ID and the
> > dev_cookie in the event to allow userspace to correlate it.
> > 
> 
> A little background.
> 
> In earlier design proposal we discussed two options. One is to return
> an kernel-allocated ID (label) to userspace. The other is to have user
> register a cookie and use it in iommufd uAPI. At that time the two
> options were discussed exclusively and the cookie one is preferred.
> 
> Now you instead recommended a mixed option. We can follow it for
> sure if nobody objects.

Either or for the return is fine, I'd return both just because it is
more flexable

But the cookie should never be an input from userspace, and the kernel
should never search for it. Locating the kernel object is what the ID
and xarray is for.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  3:22         ` Tian, Kevin
@ 2021-09-22 12:50           ` Jason Gunthorpe
  2021-09-22 14:09             ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:22:42AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Wednesday, September 22, 2021 9:07 AM
> > 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 8:55 AM
> > >
> > > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > > The opened atomic is aweful. A newly created fd should start in a
> > > > > state where it has a disabled fops
> > > > >
> > > > > The only thing the disabled fops can do is register the device to the
> > > > > iommu fd. When successfully registered the device gets the normal fops.
> > > > >
> > > > > The registration steps should be done under a normal lock inside the
> > > > > vfio_device. If a vfio_device is already registered then further
> > > > > registration should fail.
> > > > >
> > > > > Getting the device fd via the group fd triggers the same sequence as
> > > > > above.
> > > > >
> > > >
> > > > Above works if the group interface is also connected to iommufd, i.e.
> > > > making vfio type1 as a shim. In this case we can use the registration
> > > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > > today, then a new atomic is still necessary. This all depends on how
> > > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > > discussed here just adds another pound to the shim option...
> > >
> > > No, it works the same either way, the group FD path is identical to
> > > the normal FD path, it just triggers some of the state transitions
> > > automatically internally instead of requiring external ioctls.
> > >
> > > The device FDs starts disabled, an internal API binds it to the iommu
> > > via open coding with the group API, and then the rest of the APIs can
> > > be enabled. Same as today.
> > >
> 
> After reading your comments on patch08, I may have a clearer picture
> on your suggestion. The key is to handle exclusive access at the binding
> time (based on vdev->iommu_dev). Please see whether below makes 
> sense:
> 
> Shared sequence:
> 
> 1)  initialize the device with a parked fops;
> 2)  need binding (explicit or implicit) to move away from parked fops;
> 3)  switch to normal fops after successful binding;
> 
> 1) happens at device probe.

1 happens when the cdev is setup with the parked fops, yes. I'd say it
happens at fd open time.

> for nongroup 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:
> 
>   - 2) is done by calling .bind_iommufd() callback;
>   - 3) could be done within .bind_iommufd(), or via a new callback e.g.
>     .finalize_device(). The latter may be preferred for the group interface;
>   - Two threads may open the same device simultaneously, with exclusive 
>     access guaranteed by iommufd_bind_device();
>   - Open() after successful binding is rejected, since normal fops has been
>     activated. This is checked upon vdev->iommu_dev;

Almost, open is always successful, what fails is
VFIO_DEVICE_GET_IOMMUFD (or the group equivilant). The user ends up
with a FD that is useless, cannot reach the ops and thus cannot impact
the device it doesn't own in any way.

It is similar to opening a group FD

> for group 2/3) are done together in VFIO_GROUP_GET_DEVICE_FD:
> 
>   - 2) is done by open coding bind_iommufd + attach_ioas. Create an 
>     iommufd_device object and record it to vdev->iommu_dev
>   - 3) is done by calling .finalize_device();
>   - open() after a valid vdev->iommu_dev is rejected. this also ensures
>     exclusive ownership with the nongroup path.

Same comment as above, groups should go through the same sequence of
steps, create a FD, attempt to bind, if successuful make the FD
operational.

The only difference is that failure in these steps does not call
fd_install(). For this reason alone the FD could start out with
operational fops, but it feels like a needless optimization.

> If Alex also agrees with it, this might be another mini-series to be merged
> (just for group path) before this one. Doing so sort of nullifies the existing
> group/container attaching process, where attach_ioas will be skipped and
> now the security context is established when the device is opened.

I think it is really important to unify DMA exclusion model and lower
to the core iommu code. If there is a reason the exclusion must be
triggered on group fd open then the iommu core code should provide an
API to do that which interworks with the device API iommufd will work.

But I would start here because it is much simpler to understand..

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-21 17:44   ` Jason Gunthorpe
  2021-09-22  3:40     ` Tian, Kevin
@ 2021-09-22 12:51     ` Liu, Yi L
  2021-09-22 13:32       ` Jason Gunthorpe
  2021-10-01  6:13     ` David Gibson
  2 siblings, 1 reply; 280+ messages in thread
From: Liu, Yi L @ 2021-09-22 12:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
[...]
> > diff --git a/drivers/iommu/iommufd/iommufd.c
> b/drivers/iommu/iommufd/iommufd.c
> > index 641f199f2d41..4839f128b24a 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -24,6 +24,7 @@
> >  struct iommufd_ctx {
> >  	refcount_t refs;
> >  	struct mutex lock;
> > +	struct xarray ioasid_xa; /* xarray of ioasids */
> >  	struct xarray device_xa; /* xarray of bound devices */
> >  };
> >
> > @@ -42,6 +43,16 @@ struct iommufd_device {
> >  	u64 dev_cookie;
> >  };
> >
> > +/* Represent an I/O address space */
> > +struct iommufd_ioas {
> > +	int ioasid;
> 
> xarray id's should consistently be u32s everywhere.

sure. just one more check, this id is supposed to be returned to
userspace as the return value of ioctl(IOASID_ALLOC). That's why
I chose to use "int" as its prototype to make it aligned with the
return type of ioctl(). Based on this, do you think it's still better
to use "u32" here?

Regards,
Yi Liu

> Many of the same prior comments repeated here
>
> Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-22  3:41     ` Tian, Kevin
@ 2021-09-22 12:55       ` Jason Gunthorpe
  2021-09-22 14:13         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:41:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:47 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> > > As aforementioned, userspace should check extension for what formats
> > > can be specified when allocating an IOASID. This patch adds such
> > > interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
> > > support and no no-snoop support yet.
> > >
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > >  drivers/iommu/iommufd/iommufd.c |  7 +++++++
> > >  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
> > >  2 files changed, 34 insertions(+)
> > >
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index 4839f128b24a..e45d76359e34 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file
> > *filep,
> > >  		return ret;
> > >
> > >  	switch (cmd) {
> > > +	case IOMMU_CHECK_EXTENSION:
> > > +		switch (arg) {
> > > +		case EXT_MAP_TYPE1V2:
> > > +			return 1;
> > > +		default:
> > > +			return 0;
> > > +		}
> > >  	case IOMMU_DEVICE_GET_INFO:
> > >  		ret = iommufd_get_device_info(ictx, arg);
> > >  		break;
> > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > > index 5cbd300eb0ee..49731be71213 100644
> > > +++ b/include/uapi/linux/iommu.h
> > > @@ -14,6 +14,33 @@
> > >  #define IOMMU_TYPE	(';')
> > >  #define IOMMU_BASE	100
> > >
> > > +/*
> > > + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> > > + *
> > > + * Check whether an uAPI extension is supported.
> > > + *
> > > + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> > > + * in one breath. User should check which uAPI extension is supported
> > > + * according to its intended usage.
> > > + *
> > > + * A rough list of possible extensions may include:
> > > + *
> > > + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> > > + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> > > + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> > > + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> > > + *	- EXT_IOASID_NESTING for what the name stands;
> > > + *	- EXT_USER_PAGE_TABLE for user managed page table;
> > > + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> > > + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> > > + *	- ...
> > > + *
> > > + * Return: 0 if not supported, 1 if supported.
> > > + */
> > > +#define EXT_MAP_TYPE1V2		1
> > > +#define EXT_DMA_NO_SNOOP	2
> > > +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE,
> > IOMMU_BASE + 0)
> > 
> > I generally advocate for a 'try and fail' approach to discovering
> > compatibility.
> > 
> > If that doesn't work for the userspace then a query to return a
> > generic capability flag is the next best idea. Each flag should
> > clearly define what 'try and fail' it is talking about
> 
> We don't have strong preference here. Just follow what vfio does
> today. So Alex's opinion is appreciated here. 😊

This is a uAPI design, it should follow the current mainstream
thinking on how to build these things. There is a lot of old stuff in
vfio that doesn't match the modern thinking. IMHO.

> > TYPE1V2 seems like nonsense
> 
> just in case other mapping protocols are introduced in the future

Well, we should never, ever do that. Allowing PPC and evrything else
to split in VFIO has created a compelte disaster in userspace. HW
specific extensions should be modeled as extensions not a wholesale
replacement of everything.

I'd say this is part of the modern thinking on uAPI design.

What I want to strive for is the basic API is usable with all HW - and
is what something like DPDK can exclusively use.

An extended API with HW specific facets exists for qemu to use to
build a HW backed accelereated and featureful vIOMMU emulation.

The needs of qmeu should not trump the requirement for a universal
basic API.

Eg if we can't figure out a basic API version of the PPC range issue
then that should be punted to a PPC specific API.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-22  3:53     ` Tian, Kevin
@ 2021-09-22 12:57       ` Jason Gunthorpe
  2021-09-22 14:16         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:53:52AM +0000, Tian, Kevin wrote:

> Actually this was one open we closed in previous design proposal, but
> looks you have a different thought now.
> 
> vfio maintains one ioas per container. Devices in the container
> can be attached to different domains (e.g. due to snoop format). Every
> time when the ioas is updated, every attached domain is updated
> in accordance. 
> 
> You recommended one-ioas-one-domain model instead, i.e. any device 
> with a format incompatible with the one currently used in ioas has to 
> be attached to a new ioas, even if the two ioas's have the same mapping.
> This leads to compatibility check at attaching time.
> 
> Now you want returning back to the vfio model?

Oh, I thought we circled back again.. If we are all OK with one ioas
one domain then great.

> > If think sis taking in the iommfd_device then there isn't a logical
> > place to signal the PCIness
> 
> can you elaborate?

I mean just drop it and document it.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-22  3:56     ` Tian, Kevin
@ 2021-09-22 12:58       ` Jason Gunthorpe
  2021-09-22 14:17         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:56:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 2:04 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > > This patch adds interface for userspace to attach device to specified
> > > IOASID.
> > >
> > > Note:
> > > One device can only be attached to one IOASID in this version. This is
> > > on par with what vfio provides today. In the future this restriction can
> > > be relaxed when multiple I/O address spaces are supported per device
> > 
> > ?? In VFIO the container is the IOS and the container can be shared
> > with multiple devices. This needs to start at about the same
> > functionality.
> 
> a device can be only attached to one container. One container can be
> shared by multiple devices.
> 
> a device can be only attached to one IOASID. One IOASID can be shared
> by multiple devices.
> 
> it does start at the same functionality.
> 
> > 
> > > +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> > 
> > This should be in the core code, right? There is nothing PCI specific
> > here.
> > 
> 
> but if you insist on a pci-wrapper attach function, we still need something
> here (e.g. with .attach_ioasid() callback)?

I would like to stop adding ioctls to this switch, the core code
should decode the ioctl and call an per-ioctl op like every other
subsystem does..

If you do that then you could have an op

 .attach_ioasid = vfio_full_device_attach,

And that is it for driver changes.

Every driver that use type1 today should be updated to have the above
line and will work with iommufd. mdevs will not be updated and won't
work.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22 12:51     ` Liu, Yi L
@ 2021-09-22 13:32       ` Jason Gunthorpe
  2021-09-23  6:26         ` Liu, Yi L
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 13:32 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 12:51:38PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> [...]
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index 641f199f2d41..4839f128b24a 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -24,6 +24,7 @@
> > >  struct iommufd_ctx {
> > >  	refcount_t refs;
> > >  	struct mutex lock;
> > > +	struct xarray ioasid_xa; /* xarray of ioasids */
> > >  	struct xarray device_xa; /* xarray of bound devices */
> > >  };
> > >
> > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > >  	u64 dev_cookie;
> > >  };
> > >
> > > +/* Represent an I/O address space */
> > > +struct iommufd_ioas {
> > > +	int ioasid;
> > 
> > xarray id's should consistently be u32s everywhere.
> 
> sure. just one more check, this id is supposed to be returned to
> userspace as the return value of ioctl(IOASID_ALLOC). That's why
> I chose to use "int" as its prototype to make it aligned with the
> return type of ioctl(). Based on this, do you think it's still better
> to use "u32" here?

I suggest not using the return code from ioctl to exchange data.. The
rest of the uAPI uses an in/out struct, everything should do
that consistently.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 09/20] iommu: Add page size and address width attributes
  2021-09-19  6:38 ` [RFC 09/20] iommu: Add page size and address width attributes Liu Yi L
@ 2021-09-22 13:42   ` Eric Auger
  2021-09-22 14:19     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Eric Auger @ 2021-09-22 13:42 UTC (permalink / raw)
  To: Liu Yi L, alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

Hi,

On 9/19/21 8:38 AM, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
>
> This exposes PAGE_SIZE and ADDR_WIDTH attributes. The iommufd could use
> them to define the IOAS.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  include/linux/iommu.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 943de6897f56..86d34e4ce05e 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -153,9 +153,13 @@ enum iommu_dev_features {
>  /**
>   * enum iommu_devattr - Per device IOMMU attributes
>   * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
> + * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu supports.
> + * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
I think this deserves additional info. What address width do we talk
about, input, output, what stage if the IOMMU does support multiple stages

Thanks

Eric
>   */
>  enum iommu_devattr {
>  	IOMMU_DEV_INFO_FORCE_SNOOP,
> +	IOMMU_DEV_INFO_PAGE_SIZE,
> +	IOMMU_DEV_INFO_ADDR_WIDTH,
>  };
>  
>  #define IOMMU_PASID_INVALID	(-1U)


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 12:22           ` Jason Gunthorpe
@ 2021-09-22 13:44             ` Tian, Kevin
  2021-09-22 20:10             ` Alex Williamson
  1 sibling, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 13:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:23 PM
> 
> On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> 
> > > Providing an ioctl to bind to a normal VFIO container or group might
> > > allow a reasonable fallback in userspace..
> >
> > I didn't get this point though. An error in binding already allows the
> > user to fall back to the group path. Why do we need introduce another
> > ioctl to explicitly bind to container via the nongroup interface?
> 
> New userspace still needs a fallback path if it hits the 'try and
> fail'. Keeping the device FD open and just using a different ioctl to
> bind to a container/group FD, which new userspace can then obtain as a
> fallback, might be OK.
> 
> Hard to see without going through the qemu parts, so maybe just keep
> it in mind
> 

sure. will figure it out when working on the qemu part.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38 ` [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE Liu Yi L
  2021-09-21 17:44   ` Jason Gunthorpe
@ 2021-09-22 13:45   ` Jean-Philippe Brucker
  2021-09-29 10:47     ` Liu, Yi L
  2021-10-01  6:11   ` David Gibson
  2 siblings, 1 reply; 280+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-22 13:45 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enfore_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.

Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
what it's used for or why it's mandatory. But for PPC it sounds like it
should be an address range instead of an upper limit?

Thanks,
Jean

>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.
> 
> - Currently ioasid term has already been used in the kernel (drivers/iommu/
>   ioasid.c) to represent the hardware I/O address space ID in the wire. It
>   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
>   ID). We need find a way to resolve the naming conflict between the hardware
>   ID and software handle. One option is to rename the existing ioasid to be
>   pasid or ssid, given their full names still sound generic. Appreciate more
>   thoughts on this open!

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-22 12:39       ` Jason Gunthorpe
@ 2021-09-22 13:56         ` Tian, Kevin
  2021-09-27  9:42         ` Tian, Kevin
  1 sibling, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 13:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:40 PM
> 
> On Wed, Sep 22, 2021 at 01:47:05AM +0000, Tian, Kevin wrote:
> 
> > > IIRC in VFIO the container is the IOAS and when the group goes to
> > > create the device fd it should simply do the
> > > iommu_device_init_user_dma() followed immediately by a call to bind
> > > the container IOAS as your #3.
> >
> > a slight correction.
> >
> > to meet vfio semantics we could do init_user_dma() at group attach
> > time and then call binding to container IOAS when the device fd
> > is created. This is because vfio requires the group in a security context
> > before the device is opened.
> 
> Is it? Until a device FD is opened the group fd is kind of idle, right?

yes, then there is no user-tangible difference between init_user_dma()
at group attach time vs. doing it when opening fd(). But the latter does
require more change than the former, as it also needs the vfio iommu 
driver to provide a .device_attach callback. 

What's in my mind now is to keep existing group attach sequence 
which further calls a group-version init_user_dma(). Then when 
device fd is created, just create a iommu_dev object and switch to
normal fops. 

> 
> > > Ie the basic flow would see the driver core doing some:
> >
> > Just double confirm. Is there concern on having the driver core to
> > call iommu functions?
> 
> It is always an interesting question, but I'd say iommu is
> foundantional to Linux and if it needs driver core help it shouldn't
> be any different from PM, pinctl, or other subsystems that have
> inserted themselves into the driver core.
> 
> Something kind of like the below.
> 
> If I recall, once it is done like this then the entire iommu notifier
> infrastructure can be ripped out which is a lot of code.

thanks for the guidance. will think more along this direction...

> 
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f949daa90..e39612c99c6123 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
>                 goto done;
>         }
> 
> +       ret = iommu_set_kernel_ownership(dev);
> +       if (ret)
> +               return ret;
> +
>  re_probe:
>         dev->driver = drv;
> 
> @@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
>                 dev->pm_domain->dismiss(dev);
>         pm_runtime_reinit(dev);
>         dev_pm_set_driver_flags(dev, 0);
> +       iommu_release_kernel_ownership(dev);
>  done:
>         return ret;
>  }
> @@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device
> *dev, struct device *parent)
>                         dev->pm_domain->dismiss(dev);
>                 pm_runtime_reinit(dev);
>                 dev_pm_set_driver_flags(dev, 0);
> +               iommu_release_kernel_ownership(dev);
> 
>                 klist_remove(&dev->p->knode_driver);
>                 device_pm_check_callbacks(dev);

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-22 12:40       ` Jason Gunthorpe
@ 2021-09-22 13:59         ` Tian, Kevin
  2021-09-22 14:10           ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 13:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:41 PM
> 
> On Wed, Sep 22, 2021 at 01:51:03AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, September 21, 2021 11:42 PM
> > >
> > >  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc
> does
> > >    not need locking (order it properly too, it is in the wrong order), and
> > >    don't check for duplicate devices or dev_cookie duplication, that
> > >    is user error and is harmless to the kernel.
> > >
> >
> > I'm confused here. yes it's user error, but we check so many user errors
> > and then return -EINVAL, -EBUSY, etc. Why is this one special?
> 
> Because it is expensive to calculate and forces a complicated locking
> scheme into the kernel. Without this check you don't need the locking
> that spans so much code, and simple RCU becomes acceptable.
> 

In case of duplication the kernel just uses the first entry which matches
the device when sending an event to userspace?

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22  3:40     ` Tian, Kevin
@ 2021-09-22 14:09       ` Jason Gunthorpe
  2021-09-23  9:14         ` Tian, Kevin
  2021-10-01  6:19         ` david
  2021-10-01  6:15       ` david
  1 sibling, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 14:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci
> > devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1
> > discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> 
> is the hint single-range or could be multiple-ranges?

David explained it here:

https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

qeumu needs to be able to chooose if it gets the 32 bit range or 64
bit range.

So a 'range hint' will do the job

David also suggested this:

https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/

So I like this better:

struct iommu_ioasid_alloc {
	__u32	argsz;

	__u32	flags;
#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA	(1 << 1)

	__aligned_u64 max_iova_hint;
	__aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA

	// For creating nested page tables
	__u32 parent_ios_id;
	__u32 format;
#define IOMMU_FORMAT_KERNEL 0
#define IOMMU_FORMAT_PPC_XXX 2
#define IOMMU_FORMAT_[..]
	u32 format_flags; // Layout depends on format above

	__aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
};

Again 'type' as an overall API indicator should not exist, feature
flags need to have clear narrow meanings.

This does both of David's suggestions at once. If quemu wants the 1G
limited region it could specify max_iova_hint = 1G, if it wants the
extend 64bit region with the hole it can give either the high base or
a large max_iova_hint. format/format_flags allows a further
device-specific escape if more specific customization is needed and is
needed to specify user space page tables anyhow.

> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
> 
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index

Yes, ioas_id should always be the xarray index.

PASID needs to be called out as PASID or as a generic "hw description"
blob.

kvm's API to program the vPASID translation table should probably take
in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
information using an in-kernel API. Userspace shouldn't have to
shuttle it around.

I'm starting to feel like the struct approach for describing this uAPI
might not scale well, but lets see..

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22 12:50           ` Jason Gunthorpe
@ 2021-09-22 14:09             ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:51 PM
> 
> On Wed, Sep 22, 2021 at 03:22:42AM +0000, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Wednesday, September 22, 2021 9:07 AM
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, September 22, 2021 8:55 AM
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > > > The opened atomic is aweful. A newly created fd should start in a
> > > > > > state where it has a disabled fops
> > > > > >
> > > > > > The only thing the disabled fops can do is register the device to the
> > > > > > iommu fd. When successfully registered the device gets the normal
> fops.
> > > > > >
> > > > > > The registration steps should be done under a normal lock inside
> the
> > > > > > vfio_device. If a vfio_device is already registered then further
> > > > > > registration should fail.
> > > > > >
> > > > > > Getting the device fd via the group fd triggers the same sequence as
> > > > > > above.
> > > > > >
> > > > >
> > > > > Above works if the group interface is also connected to iommufd, i.e.
> > > > > making vfio type1 as a shim. In this case we can use the registration
> > > > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > > > today, then a new atomic is still necessary. This all depends on how
> > > > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > > > discussed here just adds another pound to the shim option...
> > > >
> > > > No, it works the same either way, the group FD path is identical to
> > > > the normal FD path, it just triggers some of the state transitions
> > > > automatically internally instead of requiring external ioctls.
> > > >
> > > > The device FDs starts disabled, an internal API binds it to the iommu
> > > > via open coding with the group API, and then the rest of the APIs can
> > > > be enabled. Same as today.
> > > >
> >
> > After reading your comments on patch08, I may have a clearer picture
> > on your suggestion. The key is to handle exclusive access at the binding
> > time (based on vdev->iommu_dev). Please see whether below makes
> > sense:
> >
> > Shared sequence:
> >
> > 1)  initialize the device with a parked fops;
> > 2)  need binding (explicit or implicit) to move away from parked fops;
> > 3)  switch to normal fops after successful binding;
> >
> > 1) happens at device probe.
> 
> 1 happens when the cdev is setup with the parked fops, yes. I'd say it
> happens at fd open time.
> 
> > for nongroup 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:
> >
> >   - 2) is done by calling .bind_iommufd() callback;
> >   - 3) could be done within .bind_iommufd(), or via a new callback e.g.
> >     .finalize_device(). The latter may be preferred for the group interface;
> >   - Two threads may open the same device simultaneously, with exclusive
> >     access guaranteed by iommufd_bind_device();
> >   - Open() after successful binding is rejected, since normal fops has been
> >     activated. This is checked upon vdev->iommu_dev;
> 
> Almost, open is always successful, what fails is
> VFIO_DEVICE_GET_IOMMUFD (or the group equivilant). The user ends up
> with a FD that is useless, cannot reach the ops and thus cannot impact
> the device it doesn't own in any way.

make sense. I had an wrong impression that once a normal fops is
activated it is also visible to other threads. But in concept this fops
replacement should be local to each thread thus another thread
opening the device always gets a parked fops.

> 
> It is similar to opening a group FD
> 
> > for group 2/3) are done together in VFIO_GROUP_GET_DEVICE_FD:
> >
> >   - 2) is done by open coding bind_iommufd + attach_ioas. Create an
> >     iommufd_device object and record it to vdev->iommu_dev
> >   - 3) is done by calling .finalize_device();
> >   - open() after a valid vdev->iommu_dev is rejected. this also ensures
> >     exclusive ownership with the nongroup path.
> 
> Same comment as above, groups should go through the same sequence of
> steps, create a FD, attempt to bind, if successuful make the FD
> operational.
> 
> The only difference is that failure in these steps does not call
> fd_install(). For this reason alone the FD could start out with
> operational fops, but it feels like a needless optimization.
> 
> > If Alex also agrees with it, this might be another mini-series to be merged
> > (just for group path) before this one. Doing so sort of nullifies the existing
> > group/container attaching process, where attach_ioas will be skipped and
> > now the security context is established when the device is opened.
> 
> I think it is really important to unify DMA exclusion model and lower
> to the core iommu code. If there is a reason the exclusion must be
> triggered on group fd open then the iommu core code should provide an
> API to do that which interworks with the device API iommufd will work.
> 
> But I would start here because it is much simpler to understand..
> 

Let's work on this task first and figure out what's the cleaner way to unify
it. My current impression is that having an iommu api for group fd open
might be simpler. Currently vfio iommu drivers are coupled with container
with group-granular operations. Adapting them to device fd open will 
require more changes to handle device<->group. anyway we'll see...

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-22 13:59         ` Tian, Kevin
@ 2021-09-22 14:10           ` Jason Gunthorpe
  0 siblings, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 14:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:59:39PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:41 PM
> > 
> > On Wed, Sep 22, 2021 at 01:51:03AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, September 21, 2021 11:42 PM
> > > >
> > > >  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc
> > does
> > > >    not need locking (order it properly too, it is in the wrong order), and
> > > >    don't check for duplicate devices or dev_cookie duplication, that
> > > >    is user error and is harmless to the kernel.
> > > >
> > >
> > > I'm confused here. yes it's user error, but we check so many user errors
> > > and then return -EINVAL, -EBUSY, etc. Why is this one special?
> > 
> > Because it is expensive to calculate and forces a complicated locking
> > scheme into the kernel. Without this check you don't need the locking
> > that spans so much code, and simple RCU becomes acceptable.
> 
> In case of duplication the kernel just uses the first entry which matches
> the device when sending an event to userspace?

Sure

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-22 12:55       ` Jason Gunthorpe
@ 2021-09-22 14:13         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

> From: Jason Gunthorpe
> Sent: Wednesday, September 22, 2021 8:55 PM
> 
> On Wed, Sep 22, 2021 at 03:41:50AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 1:47 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> > > > As aforementioned, userspace should check extension for what formats
> > > > can be specified when allocating an IOASID. This patch adds such
> > > > interface for userspace. In this RFC, iommufd reports
> EXT_MAP_TYPE1V2
> > > > support and no no-snoop support yet.
> > > >
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > >  drivers/iommu/iommufd/iommufd.c |  7 +++++++
> > > >  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
> > > >  2 files changed, 34 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > > b/drivers/iommu/iommufd/iommufd.c
> > > > index 4839f128b24a..e45d76359e34 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file
> > > *filep,
> > > >  		return ret;
> > > >
> > > >  	switch (cmd) {
> > > > +	case IOMMU_CHECK_EXTENSION:
> > > > +		switch (arg) {
> > > > +		case EXT_MAP_TYPE1V2:
> > > > +			return 1;
> > > > +		default:
> > > > +			return 0;
> > > > +		}
> > > >  	case IOMMU_DEVICE_GET_INFO:
> > > >  		ret = iommufd_get_device_info(ictx, arg);
> > > >  		break;
> > > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > > > index 5cbd300eb0ee..49731be71213 100644
> > > > +++ b/include/uapi/linux/iommu.h
> > > > @@ -14,6 +14,33 @@
> > > >  #define IOMMU_TYPE	(';')
> > > >  #define IOMMU_BASE	100
> > > >
> > > > +/*
> > > > + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> > > > + *
> > > > + * Check whether an uAPI extension is supported.
> > > > + *
> > > > + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> > > > + * in one breath. User should check which uAPI extension is supported
> > > > + * according to its intended usage.
> > > > + *
> > > > + * A rough list of possible extensions may include:
> > > > + *
> > > > + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> > > > + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> > > > + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> > > > + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> > > > + *	- EXT_IOASID_NESTING for what the name stands;
> > > > + *	- EXT_USER_PAGE_TABLE for user managed page table;
> > > > + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> > > > + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> > > > + *	- ...
> > > > + *
> > > > + * Return: 0 if not supported, 1 if supported.
> > > > + */
> > > > +#define EXT_MAP_TYPE1V2		1
> > > > +#define EXT_DMA_NO_SNOOP	2
> > > > +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE,
> > > IOMMU_BASE + 0)
> > >
> > > I generally advocate for a 'try and fail' approach to discovering
> > > compatibility.
> > >
> > > If that doesn't work for the userspace then a query to return a
> > > generic capability flag is the next best idea. Each flag should
> > > clearly define what 'try and fail' it is talking about
> >
> > We don't have strong preference here. Just follow what vfio does
> > today. So Alex's opinion is appreciated here. 😊
> 
> This is a uAPI design, it should follow the current mainstream
> thinking on how to build these things. There is a lot of old stuff in
> vfio that doesn't match the modern thinking. IMHO.
> 
> > > TYPE1V2 seems like nonsense
> >
> > just in case other mapping protocols are introduced in the future
> 
> Well, we should never, ever do that. Allowing PPC and evrything else
> to split in VFIO has created a compelte disaster in userspace. HW
> specific extensions should be modeled as extensions not a wholesale
> replacement of everything.
> 
> I'd say this is part of the modern thinking on uAPI design.
> 
> What I want to strive for is the basic API is usable with all HW - and
> is what something like DPDK can exclusively use.
> 
> An extended API with HW specific facets exists for qemu to use to
> build a HW backed accelereated and featureful vIOMMU emulation.
> 
> The needs of qmeu should not trump the requirement for a universal
> basic API.
> 
> Eg if we can't figure out a basic API version of the PPC range issue
> then that should be punted to a PPC specific API.
> 

sounds good. I may keep an wrong memory on the multiple mapping
protocols thing. 😊

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-22 12:57       ` Jason Gunthorpe
@ 2021-09-22 14:16         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:57 PM
> 
> On Wed, Sep 22, 2021 at 03:53:52AM +0000, Tian, Kevin wrote:
> 
> > Actually this was one open we closed in previous design proposal, but
> > looks you have a different thought now.
> >
> > vfio maintains one ioas per container. Devices in the container
> > can be attached to different domains (e.g. due to snoop format). Every
> > time when the ioas is updated, every attached domain is updated
> > in accordance.
> >
> > You recommended one-ioas-one-domain model instead, i.e. any device
> > with a format incompatible with the one currently used in ioas has to
> > be attached to a new ioas, even if the two ioas's have the same mapping.
> > This leads to compatibility check at attaching time.
> >
> > Now you want returning back to the vfio model?
> 
> Oh, I thought we circled back again.. If we are all OK with one ioas
> one domain then great.

yes, at least I haven't seen a blocking issue with this assumption. Later
when converting vfio type1 into a shim, it could create multiple ioas's
if container would have a list of domains before the shim.

> 
> > > If think sis taking in the iommfd_device then there isn't a logical
> > > place to signal the PCIness
> >
> > can you elaborate?
> 
> I mean just drop it and document it.
> 

got you

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-22 12:58       ` Jason Gunthorpe
@ 2021-09-22 14:17         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:59 PM
> 
> On Wed, Sep 22, 2021 at 03:56:18AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 2:04 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > > > This patch adds interface for userspace to attach device to specified
> > > > IOASID.
> > > >
> > > > Note:
> > > > One device can only be attached to one IOASID in this version. This is
> > > > on par with what vfio provides today. In the future this restriction can
> > > > be relaxed when multiple I/O address spaces are supported per device
> > >
> > > ?? In VFIO the container is the IOS and the container can be shared
> > > with multiple devices. This needs to start at about the same
> > > functionality.
> >
> > a device can be only attached to one container. One container can be
> > shared by multiple devices.
> >
> > a device can be only attached to one IOASID. One IOASID can be shared
> > by multiple devices.
> >
> > it does start at the same functionality.
> >
> > >
> > > > +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> > >
> > > This should be in the core code, right? There is nothing PCI specific
> > > here.
> > >
> >
> > but if you insist on a pci-wrapper attach function, we still need something
> > here (e.g. with .attach_ioasid() callback)?
> 
> I would like to stop adding ioctls to this switch, the core code
> should decode the ioctl and call an per-ioctl op like every other
> subsystem does..
> 
> If you do that then you could have an op
> 
>  .attach_ioasid = vfio_full_device_attach,
> 
> And that is it for driver changes.
> 
> Every driver that use type1 today should be updated to have the above
> line and will work with iommufd. mdevs will not be updated and won't
> work.
> 

will do. 

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 09/20] iommu: Add page size and address width attributes
  2021-09-22 13:42   ` Eric Auger
@ 2021-09-22 14:19     ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:19 UTC (permalink / raw)
  To: eric.auger, Liu, Yi L, alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, parav, lkml, pbonzini, lushenming, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, September 22, 2021 9:43 PM
> 
> Hi,
> 
> On 9/19/21 8:38 AM, Liu Yi L wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > This exposes PAGE_SIZE and ADDR_WIDTH attributes. The iommufd could
> use
> > them to define the IOAS.
> >
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > ---
> >  include/linux/iommu.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 943de6897f56..86d34e4ce05e 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -153,9 +153,13 @@ enum iommu_dev_features {
> >  /**
> >   * enum iommu_devattr - Per device IOMMU attributes
> >   * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to
> be snooped.
> > + * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu
> supports.
> > + * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
> I think this deserves additional info. What address width do we talk
> about, input, output, what stage if the IOMMU does support multiple stages
> 

it describes the address space width, thus is about input.

when multiple stages are supported, each stage is represented by a separate
ioasid, each with its own addr_width

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-19  6:38 ` [RFC 17/20] iommu/iommufd: Report iova range to userspace Liu Yi L
@ 2021-09-22 14:49   ` Jean-Philippe Brucker
  2021-09-29 10:44     ` Liu, Yi L
  0 siblings, 1 reply; 280+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-22 14:49 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> [HACK. will fix in v2]
> 
> IOVA range is critical info for userspace to manage DMA for an I/O address
> space. This patch reports the valid iova range info of a given device.
> 
> Due to aforementioned hack, this info comes from the hacked vfio type1
> driver. To follow the same format in vfio, we also introduce a cap chain
> format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
[...]
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 49731be71213..f408ad3c8ade 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -68,6 +68,7 @@
>   *		   +---------------+------------+
>   *		   ...
>   * @addr_width:    the address width of supported I/O address spaces.
> + * @cap_offset:	   Offset within info struct of first cap
>   *
>   * Availability: after device is bound to iommufd
>   */
> @@ -77,9 +78,11 @@ struct iommu_device_info {
>  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
>  #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
>  #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_wdith field valid */
> +#define IOMMU_DEVICE_INFO_CAPS		(1 << 3) /* info supports cap chain */
>  	__u64	dev_cookie;
>  	__u64   pgsize_bitmap;
>  	__u32	addr_width;
> +	__u32   cap_offset;

We can also add vendor-specific page table and PASID table properties as
capabilities, otherwise we'll need giant unions in the iommu_device_info
struct. That made me wonder whether pgsize and addr_width should also be
separate capabilities for consistency, but this way might be good enough.
There won't be many more generic capabilities. I have "output address
width" and "PASID width", the rest is specific to Arm and SMMU table
formats.

Thanks,
Jean

>  };
>  
>  #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 12:22           ` Jason Gunthorpe
  2021-09-22 13:44             ` Tian, Kevin
@ 2021-09-22 20:10             ` Alex Williamson
  2021-09-22 22:34               ` Tian, Kevin
  2021-09-22 23:56               ` Jason Gunthorpe
  1 sibling, 2 replies; 280+ messages in thread
From: Alex Williamson @ 2021-09-22 20:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, 22 Sep 2021 09:22:52 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> 
> > > Providing an ioctl to bind to a normal VFIO container or group might
> > > allow a reasonable fallback in userspace..  
> > 
> > I didn't get this point though. An error in binding already allows the
> > user to fall back to the group path. Why do we need introduce another
> > ioctl to explicitly bind to container via the nongroup interface?   
> 
> New userspace still needs a fallback path if it hits the 'try and
> fail'. Keeping the device FD open and just using a different ioctl to
> bind to a container/group FD, which new userspace can then obtain as a
> fallback, might be OK.
> 
> Hard to see without going through the qemu parts, so maybe just keep
> it in mind

If we assume that the container/group/device interface is essentially
deprecated once we have iommufd, it doesn't make a lot of sense to me
to tack on a container/device interface just so userspace can avoid
reverting to the fully legacy interface.

But why would we create vfio device interface files at all if they
can't work?  I'm not really on board with creating a try-and-fail
interface for a mechanism that cannot work for a given device.  The
existence of the device interface should indicate that it's supported.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-21 17:29   ` Jason Gunthorpe
@ 2021-09-22 21:01     ` Alex Williamson
  2021-09-22 23:01       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-22 21:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Tue, 21 Sep 2021 14:29:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > +struct vfio_device_iommu_bind_data {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__s32	iommu_fd;
> > +	__u64	dev_cookie;  
> 
> Missing explicit padding
> 
> Always use __aligned_u64 in uapi headers, fix all the patches.

We don't need padding or explicit alignment if we just swap the order
of iommu_fd and dev_cookie.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-22  1:19       ` Tian, Kevin
@ 2021-09-22 21:17         ` Alex Williamson
  2021-09-22 23:49           ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-22 21:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, 22 Sep 2021 01:19:08 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 22, 2021 5:09 AM
> > 
> > On Tue, 21 Sep 2021 13:40:01 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:  
> > > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > > be compatiable with existing users, vfio-pci exposes both legacy group
> > > > interface and device-centric interface.
> > > >
> > > > As explained in last patch, this change doesn't apply to devices which
> > > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > > are still expected to be opened via the legacy group interface.  
> > 
> > This doesn't make much sense to me.  The previous patch indicates
> > there's work to be done in updating the kvm-vfio contract to understand
> > DMA coherency, so you're trying to limit use cases to those where the
> > IOMMU enforces coherency, but there's QEMU work to be done to support
> > the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> > will be told about non-coherent devices rather than "meh, skip it in the
> > kernel"?  Also let's not forget that vfio is not only for KVM.  
> 
> The policy here is that VFIO will not expose such devices (no enforce-snoop)
> in the new device hierarchy at all. In this case QEMU will fall back to the
> group interface automatically and then rely on the existing contract to connect 
> vfio and QEMU. It doesn't need to care about the whatever new contract
> until such devices are exposed in the new interface.
> 
> yes, vfio is not only for KVM. But here it's more a task split based on staging
> consideration. imo it's not necessary to further split task into supporting
> non-snoop device for userspace driver and then for kvm.

Patch 10 introduces an iommufd interface for QEMU to learn whether the
IOMMU enforces DMA coherency, at that point QEMU could revert to the
legacy interface, or register the iommufd with KVM, or otherwise
establish non-coherent DMA with KVM as necessary.  We're adding cruft
to the kernel here to enforce an unnecessary limitation.

If there are reasons the kernel can't support the device interface,
that's a valid reason not to present the interface, but this seems like
picking a specific gap that userspace is already able to detect from
this series at the expense of other use cases.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38 ` [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO Liu Yi L
  2021-09-21 17:40   ` Jason Gunthorpe
@ 2021-09-22 21:24   ` Alex Williamson
  2021-09-22 23:49     ` Jason Gunthorpe
  2021-09-29  6:23   ` David Gibson
  2 siblings, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-22 21:24 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jgg, hch, jasowang, joro, jean-philippe, kevin.tian, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, ashok.raj, yi.l.liu,
	jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

On Sun, 19 Sep 2021 14:38:38 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> +struct iommu_device_info {
> +	__u32	argsz;
> +	__u32	flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */

Is this too PCI specific, or perhaps too much of the mechanism rather
than the result?  ie. should we just indicate if the IOMMU guarantees
coherent DMA?  Thanks,

Alex

> +#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_wdith field valid */
> +	__u64	dev_cookie;
> +	__u64   pgsize_bitmap;
> +	__u32	addr_width;
> +};
> +
> +#define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
>  
>  #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
>  #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 20:10             ` Alex Williamson
@ 2021-09-22 22:34               ` Tian, Kevin
  2021-09-22 22:45                 ` Alex Williamson
  2021-09-22 23:56               ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 22:34 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, September 23, 2021 4:11 AM
> 
> On Wed, 22 Sep 2021 09:22:52 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> >
> > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > allow a reasonable fallback in userspace..
> > >
> > > I didn't get this point though. An error in binding already allows the
> > > user to fall back to the group path. Why do we need introduce another
> > > ioctl to explicitly bind to container via the nongroup interface?
> >
> > New userspace still needs a fallback path if it hits the 'try and
> > fail'. Keeping the device FD open and just using a different ioctl to
> > bind to a container/group FD, which new userspace can then obtain as a
> > fallback, might be OK.
> >
> > Hard to see without going through the qemu parts, so maybe just keep
> > it in mind
> 
> If we assume that the container/group/device interface is essentially
> deprecated once we have iommufd, it doesn't make a lot of sense to me
> to tack on a container/device interface just so userspace can avoid
> reverting to the fully legacy interface.
> 
> But why would we create vfio device interface files at all if they
> can't work?  I'm not really on board with creating a try-and-fail
> interface for a mechanism that cannot work for a given device.  The
> existence of the device interface should indicate that it's supported.
> Thanks,
> 

Now it's a try-and-fail model even for devices which support iommufd.
Per Jason's suggestion, a device is always opened with a parked fops
which supports only bind. Binding serves as the contract for handling
exclusive ownership on a device and switching to normal fops if
succeed. So the user has to try-and-fail in case multiple threads attempt 
to open a same device. Device which doesn't support iommufd is not
different, except binding request 100% fails (due to missing .bind_iommufd
in kernel driver).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 22:34               ` Tian, Kevin
@ 2021-09-22 22:45                 ` Alex Williamson
  2021-09-22 23:45                   ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-22 22:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, 22 Sep 2021 22:34:42 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, September 23, 2021 4:11 AM
> > 
> > On Wed, 22 Sep 2021 09:22:52 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> > >  
> > > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > > allow a reasonable fallback in userspace..  
> > > >
> > > > I didn't get this point though. An error in binding already allows the
> > > > user to fall back to the group path. Why do we need introduce another
> > > > ioctl to explicitly bind to container via the nongroup interface?  
> > >
> > > New userspace still needs a fallback path if it hits the 'try and
> > > fail'. Keeping the device FD open and just using a different ioctl to
> > > bind to a container/group FD, which new userspace can then obtain as a
> > > fallback, might be OK.
> > >
> > > Hard to see without going through the qemu parts, so maybe just keep
> > > it in mind  
> > 
> > If we assume that the container/group/device interface is essentially
> > deprecated once we have iommufd, it doesn't make a lot of sense to me
> > to tack on a container/device interface just so userspace can avoid
> > reverting to the fully legacy interface.
> > 
> > But why would we create vfio device interface files at all if they
> > can't work?  I'm not really on board with creating a try-and-fail
> > interface for a mechanism that cannot work for a given device.  The
> > existence of the device interface should indicate that it's supported.
> > Thanks,
> >   
> 
> Now it's a try-and-fail model even for devices which support iommufd.
> Per Jason's suggestion, a device is always opened with a parked fops
> which supports only bind. Binding serves as the contract for handling
> exclusive ownership on a device and switching to normal fops if
> succeed. So the user has to try-and-fail in case multiple threads attempt 
> to open a same device. Device which doesn't support iommufd is not
> different, except binding request 100% fails (due to missing .bind_iommufd
> in kernel driver).

That's a rather important difference.  I don't really see how that's
comparable to the mutually exclusive nature of the legacy vs device
interface.  We're not going to present a vfio device interface for SW
mdevs that can't participate in iommufd, right?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-22 21:01     ` Alex Williamson
@ 2021-09-22 23:01       ` Jason Gunthorpe
  0 siblings, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 23:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:01:01PM -0600, Alex Williamson wrote:
> On Tue, 21 Sep 2021 14:29:39 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > +struct vfio_device_iommu_bind_data {
> > > +	__u32	argsz;
> > > +	__u32	flags;
> > > +	__s32	iommu_fd;
> > > +	__u64	dev_cookie;  
> > 
> > Missing explicit padding
> > 
> > Always use __aligned_u64 in uapi headers, fix all the patches.
> 
> We don't need padding or explicit alignment if we just swap the order
> of iommu_fd and dev_cookie.  Thanks,

Yes, the padding should all be checked and minimized

But it is always good practice to always use __aligned_u64 in the uapi
headers just in case someone messes it up someday - it prevents small
mistakes from becoming an ABI mess.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 22:45                 ` Alex Williamson
@ 2021-09-22 23:45                   ` Tian, Kevin
  2021-09-22 23:52                     ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 23:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, September 23, 2021 6:45 AM
> 
> On Wed, 22 Sep 2021 22:34:42 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, September 23, 2021 4:11 AM
> > >
> > > On Wed, 22 Sep 2021 09:22:52 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> > > >
> > > > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > > > allow a reasonable fallback in userspace..
> > > > >
> > > > > I didn't get this point though. An error in binding already allows the
> > > > > user to fall back to the group path. Why do we need introduce
> another
> > > > > ioctl to explicitly bind to container via the nongroup interface?
> > > >
> > > > New userspace still needs a fallback path if it hits the 'try and
> > > > fail'. Keeping the device FD open and just using a different ioctl to
> > > > bind to a container/group FD, which new userspace can then obtain as
> a
> > > > fallback, might be OK.
> > > >
> > > > Hard to see without going through the qemu parts, so maybe just keep
> > > > it in mind
> > >
> > > If we assume that the container/group/device interface is essentially
> > > deprecated once we have iommufd, it doesn't make a lot of sense to me
> > > to tack on a container/device interface just so userspace can avoid
> > > reverting to the fully legacy interface.
> > >
> > > But why would we create vfio device interface files at all if they
> > > can't work?  I'm not really on board with creating a try-and-fail
> > > interface for a mechanism that cannot work for a given device.  The
> > > existence of the device interface should indicate that it's supported.
> > > Thanks,
> > >
> >
> > Now it's a try-and-fail model even for devices which support iommufd.
> > Per Jason's suggestion, a device is always opened with a parked fops
> > which supports only bind. Binding serves as the contract for handling
> > exclusive ownership on a device and switching to normal fops if
> > succeed. So the user has to try-and-fail in case multiple threads attempt
> > to open a same device. Device which doesn't support iommufd is not
> > different, except binding request 100% fails (due to missing .bind_iommufd
> > in kernel driver).
> 
> That's a rather important difference.  I don't really see how that's
> comparable to the mutually exclusive nature of the legacy vs device

I didn't get the 'comparable' part. Can you elaborate?

> interface.  We're not going to present a vfio device interface for SW
> mdevs that can't participate in iommufd, right?  Thanks,
> 

Did you see any problem if exposing sw mdev now? Following above
explanation the try-and-fail model should still work...

btw I realized another related piece regarding to the new layout that
Jason suggested, which have sys device node include a link to the vfio
devnode:

	/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev

This for sure requires specific vfio driver support to get the link established.
if we only do it for vfio-pci in the start, then for other devices which don't
support iommufd there is no way for the user to identify the corresponding
vfio devnode even it's still exposed. Then try-and-fail model may not even
been reached for those devices.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-22 21:17         ` Alex Williamson
@ 2021-09-22 23:49           ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-22 23:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, September 23, 2021 5:17 AM
> 
> On Wed, 22 Sep 2021 01:19:08 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 22, 2021 5:09 AM
> > >
> > > On Tue, 21 Sep 2021 13:40:01 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > > > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > > > be compatiable with existing users, vfio-pci exposes both legacy group
> > > > > interface and device-centric interface.
> > > > >
> > > > > As explained in last patch, this change doesn't apply to devices which
> > > > > cannot be forced to snoop cache by their upstream iommu. Such
> devices
> > > > > are still expected to be opened via the legacy group interface.
> > >
> > > This doesn't make much sense to me.  The previous patch indicates
> > > there's work to be done in updating the kvm-vfio contract to understand
> > > DMA coherency, so you're trying to limit use cases to those where the
> > > IOMMU enforces coherency, but there's QEMU work to be done to
> support
> > > the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> > > will be told about non-coherent devices rather than "meh, skip it in the
> > > kernel"?  Also let's not forget that vfio is not only for KVM.
> >
> > The policy here is that VFIO will not expose such devices (no enforce-snoop)
> > in the new device hierarchy at all. In this case QEMU will fall back to the
> > group interface automatically and then rely on the existing contract to
> connect
> > vfio and QEMU. It doesn't need to care about the whatever new contract
> > until such devices are exposed in the new interface.
> >
> > yes, vfio is not only for KVM. But here it's more a task split based on staging
> > consideration. imo it's not necessary to further split task into supporting
> > non-snoop device for userspace driver and then for kvm.
> 
> Patch 10 introduces an iommufd interface for QEMU to learn whether the
> IOMMU enforces DMA coherency, at that point QEMU could revert to the
> legacy interface, or register the iommufd with KVM, or otherwise
> establish non-coherent DMA with KVM as necessary.  We're adding cruft
> to the kernel here to enforce an unnecessary limitation.
> 
> If there are reasons the kernel can't support the device interface,
> that's a valid reason not to present the interface, but this seems like
> picking a specific gap that userspace is already able to detect from
> this series at the expense of other use cases.  Thanks,
> 

I see your point now. Yes I agree that the kernel cruft is unnecessary
limitation here. The user should rely on the device/iommufd capability
to decide whether non-coherent DMA should go through legacy or
new interface.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-22 21:24   ` Alex Williamson
@ 2021-09-22 23:49     ` Jason Gunthorpe
  2021-09-23  3:10       ` Tian, Kevin
       [not found]       ` <BN9PR11MB5433409DF766AAEF1BB2CF258CA39@BN9PR11MB5433.namprd11.prod.outlook.com>
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 23:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:24:07PM -0600, Alex Williamson wrote:
> On Sun, 19 Sep 2021 14:38:38 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > +struct iommu_device_info {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
> 
> Is this too PCI specific, or perhaps too much of the mechanism rather
> than the result?  ie. should we just indicate if the IOMMU guarantees
> coherent DMA?  Thanks,

I think the name of "coherent DMA" for this feature inside the kernel
is very, very confusing. We already have something called coherent dma
and this usage on Intel has nothing at all to do with that.

In fact it looks like this confusing name has already caused
implementation problems as I see dma-iommu, is connecting
dev->dma_coherent to IOMMU_CACHE! eg in dma_info_to_prot(). This is
completely wrong if IOMMU_CACHE is linked to no_snoop.

And ARM seems to have fallen out of step with x86 as the ARM IOMMU
drivers are mapping IOMMU_CACHE to ARM_LPAE_PTE_MEMATTR_OIWB,
ARM_LPAE_MAIR_ATTR_IDX_CACHE

The SMMU spec for ARMv8 is pretty clear:

 13.6.1.1 No_snoop

 Support for No_snoop is system-dependent and, if implemented, No_snoop
 transforms a final access attribute of a Normal cacheable type to
 Normal-iNC-oNC-OSH downstream of (or appearing to be performed
 downstream of) the SMMU. No_snoop does not transform a final access
 attribute of any-Device.

Meaning setting ARM_LPAE_MAIR_ATTR_IDX_CACHE from IOMMU_CACHE does NOT
block non-snoop, in fact it *enables* it - the reverse of what Intel
is doing!

So this is all a mess.

Better to start clear and unambiguous names in the uAPI and someone
can try to clean up the kernel eventually.

The required behavior for iommufd is to have the IOMMU ignore the
no-snoop bit so that Intel HW can disable wbinvd. This bit should be
clearly documented for its exact purpose and if other arches also have
instructions that need to be disabled if snoop TLPs are allowed then
they can re-use this bit. It appears ARM does not have this issue and
does not need the bit.

What ARM is doing with IOMMU_CACHE is unclear to me, and I'm unclear
if/how iommufd should expose it as a controllable PTE flag. The ARM

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 23:45                   ` Tian, Kevin
@ 2021-09-22 23:52                     ` Jason Gunthorpe
  2021-09-23  0:38                       ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 23:52 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 11:45:33PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>

> btw I realized another related piece regarding to the new layout that
> Jason suggested, which have sys device node include a link to the vfio
> devnode:
> 
> 	/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> 
> This for sure requires specific vfio driver support to get the link
> established.

It doesn't. Just set the parent device of the vfio_device's struct
device to the physical struct device that vfio is already tracking -
ie the struct device providing the IOMMU. The driver core takes care
of everything else.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 20:10             ` Alex Williamson
  2021-09-22 22:34               ` Tian, Kevin
@ 2021-09-22 23:56               ` Jason Gunthorpe
  1 sibling, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 23:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 02:10:36PM -0600, Alex Williamson wrote:

> But why would we create vfio device interface files at all if they
> can't work?  I'm not really on board with creating a try-and-fail
> interface for a mechanism that cannot work for a given device.  The
> existence of the device interface should indicate that it's supported.

I'm a little worried about adding a struct device to vfio_device and
then making it optional.. That is a really weird situation.

I suppose you could create the sysfs presence in the struct device but
not create a cdev.

However, if we ever want to use the device fd for something else, like
querying the device driver capabilities or mode, (ie clean the
driver_api thing wrongly placed in mdev sysfs for instance), we are
blocked as the uAPI will be cdev == must support iommufd..

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 23:52                     ` Jason Gunthorpe
@ 2021-09-23  0:38                       ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23  0:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 7:52 AM
> 
> On Wed, Sep 22, 2021 at 11:45:33PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> 
> > btw I realized another related piece regarding to the new layout that
> > Jason suggested, which have sys device node include a link to the vfio
> > devnode:
> >
> > 	/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> >
> > This for sure requires specific vfio driver support to get the link
> > established.
> 
> It doesn't. Just set the parent device of the vfio_device's struct
> device to the physical struct device that vfio is already tracking -
> ie the struct device providing the IOMMU. The driver core takes care
> of everything else.
> 

Thanks for correction. So it's still the same try-and-fail model for both 
devices which support iommufd and which do not.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-22 23:49     ` Jason Gunthorpe
@ 2021-09-23  3:10       ` Tian, Kevin
  2021-09-23 10:15         ` Jean-Philippe Brucker
  2021-09-23 11:36         ` Jason Gunthorpe
       [not found]       ` <BN9PR11MB5433409DF766AAEF1BB2CF258CA39@BN9PR11MB5433.namprd11.prod.outlook.com>
  1 sibling, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23  3:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 7:50 AM
> 
> On Wed, Sep 22, 2021 at 03:24:07PM -0600, Alex Williamson wrote:
> > On Sun, 19 Sep 2021 14:38:38 +0800
> > Liu Yi L <yi.l.liu@intel.com> wrote:
> >
> > > +struct iommu_device_info {
> > > +	__u32	argsz;
> > > +	__u32	flags;
> > > +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU
> enforced snoop */
> >
> > Is this too PCI specific, or perhaps too much of the mechanism rather

Isn't snoop vs. !snoop a general concept not pci specific?

> > than the result?  ie. should we just indicate if the IOMMU guarantees
> > coherent DMA?  Thanks,
> 
> I think the name of "coherent DMA" for this feature inside the kernel
> is very, very confusing. We already have something called coherent dma
> and this usage on Intel has nothing at all to do with that.
> 
> In fact it looks like this confusing name has already caused
> implementation problems as I see dma-iommu, is connecting
> dev->dma_coherent to IOMMU_CACHE! eg in dma_info_to_prot(). This is
> completely wrong if IOMMU_CACHE is linked to no_snoop.
> 
> And ARM seems to have fallen out of step with x86 as the ARM IOMMU
> drivers are mapping IOMMU_CACHE to ARM_LPAE_PTE_MEMATTR_OIWB,
> ARM_LPAE_MAIR_ATTR_IDX_CACHE
> 
> The SMMU spec for ARMv8 is pretty clear:
> 
>  13.6.1.1 No_snoop
> 
>  Support for No_snoop is system-dependent and, if implemented, No_snoop
>  transforms a final access attribute of a Normal cacheable type to
>  Normal-iNC-oNC-OSH downstream of (or appearing to be performed
>  downstream of) the SMMU. No_snoop does not transform a final access
>  attribute of any-Device.
> 
> Meaning setting ARM_LPAE_MAIR_ATTR_IDX_CACHE from IOMMU_CACHE
> does NOT
> block non-snoop, in fact it *enables* it - the reverse of what Intel
> is doing!

Checking the code:

        if (data->iop.fmt == ARM_64_LPAE_S2 ||
            data->iop.fmt == ARM_32_LPAE_S2) {
                if (prot & IOMMU_MMIO)
                        pte |= ARM_LPAE_PTE_MEMATTR_DEV;
                else if (prot & IOMMU_CACHE)
                        pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
                else
                        pte |= ARM_LPAE_PTE_MEMATTR_NC;

It does set attribute to WB for IOMMU_CACHE and then NC (Non-cacheable)
for !IOMMU_CACHE. The main difference between Intel and ARM is that Intel
by default allows both snoop and non-snoop traffic with one additional bit
to enforce snoop, while ARM requires explicit SMMU configuration for snoop
and non-snoop respectively.

        } else {
                if (prot & IOMMU_MMIO)
                        pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
                                << ARM_LPAE_PTE_ATTRINDX_SHIFT);
                else if (prot & IOMMU_CACHE)
                        pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
                                << ARM_LPAE_PTE_ATTRINDX_SHIFT);
        }

same for this one. MAIR_ELx register is programmed to ARM_LPAE_MAIR_
ATTR_WBRWA for IDX_CACHE bit. I'm not sure why it doesn't use 
IDX_NC though, when !IOMMU_CACHE.

> 
> So this is all a mess.
> 
> Better to start clear and unambiguous names in the uAPI and someone
> can try to clean up the kernel eventually.
> 
> The required behavior for iommufd is to have the IOMMU ignore the
> no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> clearly documented for its exact purpose and if other arches also have
> instructions that need to be disabled if snoop TLPs are allowed then
> they can re-use this bit. It appears ARM does not have this issue and
> does not need the bit.

Disabling wbinvd is one purpose. imo the more important intention
is that iommu vendor uses different PTE formats between snoop and
!snoop. As long as we want allow userspace to opt in case of isoch 
performance requirement (unlike current vfio which always choose
snoop format if available), such mechanism is required for all vendors.

When creating an ioas there could be three snoop modes:

1) snoop for all attached devices;
2) non-snoop for all attached devices;
3) device-selected snoop;

Intel supports 1) <enforce-snoop on> and 3) <enforce-snoop off>. snoop
and nonsnoop devices can be attached to a same ioas in 3).

ARM supports 1) <snoop format> and 2) <nonsnoop format>. snoop devices
and nonsnoop devices must be attached to different ioas's in 1) and 2)
respectively.

Then the device info should reports:

/* iommu enforced snoop */
+#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0)
/* iommu enforced nonsnoop */
+#define IOMMU_DEVICE_INFO_ENFORCE_NONSNOOP	(1 << 1)
/* device selected snoop */
+#define IOMMU_DEVICE_INFO_DEVICE_SNOOP	(1 << 2)

> 
> What ARM is doing with IOMMU_CACHE is unclear to me, and I'm unclear
> if/how iommufd should expose it as a controllable PTE flag. The ARM
> 

Based on above analysis I think the ARM usage with IOMMU_CACHE
doesn't change. 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
       [not found]       ` <BN9PR11MB5433409DF766AAEF1BB2CF258CA39@BN9PR11MB5433.namprd11.prod.outlook.com>
@ 2021-09-23  3:38         ` Tian, Kevin
  2021-09-23 11:42           ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23  3:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Tian, Kevin
> Sent: Thursday, September 23, 2021 11:11 AM
> 
> >
> > The required behavior for iommufd is to have the IOMMU ignore the
> > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > clearly documented for its exact purpose and if other arches also have
> > instructions that need to be disabled if snoop TLPs are allowed then
> > they can re-use this bit. It appears ARM does not have this issue and
> > does not need the bit.
> 
> Disabling wbinvd is one purpose. imo the more important intention
> is that iommu vendor uses different PTE formats between snoop and
> !snoop. As long as we want allow userspace to opt in case of isoch
> performance requirement (unlike current vfio which always choose
> snoop format if available), such mechanism is required for all vendors.
> 

btw I'm not sure whether the wbinvd trick is Intel specific. All other
platforms (amd, arm, s390, etc.) currently always claim OMMU_CAP_
CACHE_COHERENCY (the source of IOMMU_CACHE). They didn't hit
this problem because vfio always sets IOMMU_CACHE to force every
DMA to snoop. Will they need to handle similar wbinvd-like trick (plus
necessary memory type virtualization) when non-snoop format is enabled? 
Or are their architectures highly optimized to afford isoch traffic even 
with snoop (then fine to not support user opt-in)?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22 13:32       ` Jason Gunthorpe
@ 2021-09-23  6:26         ` Liu, Yi L
  0 siblings, 0 replies; 280+ messages in thread
From: Liu, Yi L @ 2021-09-23  6:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 9:32 PM
> 
> On Wed, Sep 22, 2021 at 12:51:38PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > [...]
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > > b/drivers/iommu/iommufd/iommufd.c
> > > > index 641f199f2d41..4839f128b24a 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -24,6 +24,7 @@
> > > >  struct iommufd_ctx {
> > > >  	refcount_t refs;
> > > >  	struct mutex lock;
> > > > +	struct xarray ioasid_xa; /* xarray of ioasids */
> > > >  	struct xarray device_xa; /* xarray of bound devices */
> > > >  };
> > > >
> > > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > > >  	u64 dev_cookie;
> > > >  };
> > > >
> > > > +/* Represent an I/O address space */
> > > > +struct iommufd_ioas {
> > > > +	int ioasid;
> > >
> > > xarray id's should consistently be u32s everywhere.
> >
> > sure. just one more check, this id is supposed to be returned to
> > userspace as the return value of ioctl(IOASID_ALLOC). That's why
> > I chose to use "int" as its prototype to make it aligned with the
> > return type of ioctl(). Based on this, do you think it's still better
> > to use "u32" here?
> 
> I suggest not using the return code from ioctl to exchange data.. The
> rest of the uAPI uses an in/out struct, everything should do
> that consistently.

got it.

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  1:00       ` Jason Gunthorpe
  2021-09-22  1:02         ` Tian, Kevin
@ 2021-09-23  7:25         ` Eric Auger
  2021-09-23 11:44           ` Jason Gunthorpe
  2021-09-29  2:46         ` david
  2 siblings, 1 reply; 280+ messages in thread
From: Eric Auger @ 2021-09-23  7:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

Hi,

On 9/22/21 3:00 AM, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Wednesday, September 22, 2021 12:01 AM
>>>
>>>>  One open about how to organize the device nodes under
>>> /dev/vfio/devices/.
>>>> This RFC adopts a simple policy by keeping a flat layout with mixed
>>> devname
>>>> from all kinds of devices. The prerequisite of this model is that devnames
>>>> from different bus types are unique formats:
>>> This isn't reliable, the devname should just be vfio0, vfio1, etc
>>>
>>> The userspace can learn the correct major/minor by inspecting the
>>> sysfs.
>>>
>>> This whole concept should disappear into the prior patch that adds the
>>> struct device in the first place, and I think most of the code here
>>> can be deleted once the struct device is used properly.
>>>
>> Can you help elaborate above flow? This is one area where we need
>> more guidance.
>>
>> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
>> how does Qemu identify which vifo0/1/... is associated with the specified 
>> DDDD:BB:DD.F? 
> When done properly in the kernel the file:
>
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
>
> Will contain the major:minor of the VFIO device.
>
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.
>
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.
I guess this would be the same for platform devices, for instance
/sys/bus/platform/devices/AMDI8001:01/vfio/vfioX/dev, right?

Thanks

Eric
>
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
>
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
>
> Jason
>


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22 14:09       ` Jason Gunthorpe
@ 2021-09-23  9:14         ` Tian, Kevin
  2021-09-23 12:06           ` Jason Gunthorpe
  2021-10-01  6:26           ` david
  2021-10-01  6:19         ` david
  1 sibling, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23  9:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 10:09 PM
> 
> On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type
> (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enfore_snoop format. enforce_snoop must be true at this
> point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next
> patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address
> space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci
> > > devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC
> hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1
> > > discussion
> > > >   and then decide how to incorporate it in v2.
> > >
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> >
> > is the hint single-range or could be multiple-ranges?
> 
> David explained it here:
> 
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> 
> qeumu needs to be able to chooose if it gets the 32 bit range or 64
> bit range.
> 
> So a 'range hint' will do the job
> 
> David also suggested this:
> 
> https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
> 
> So I like this better:
> 
> struct iommu_ioasid_alloc {
> 	__u32	argsz;
> 
> 	__u32	flags;
> #define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
> #define IOMMU_IOASID_HINT_BASE_IOVA	(1 << 1)
> 
> 	__aligned_u64 max_iova_hint;
> 	__aligned_u64 base_iova_hint; // Used only if
> IOMMU_IOASID_HINT_BASE_IOVA
> 
> 	// For creating nested page tables
> 	__u32 parent_ios_id;
> 	__u32 format;
> #define IOMMU_FORMAT_KERNEL 0
> #define IOMMU_FORMAT_PPC_XXX 2
> #define IOMMU_FORMAT_[..]
> 	u32 format_flags; // Layout depends on format above
> 
> 	__aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
> };
> 
> Again 'type' as an overall API indicator should not exist, feature
> flags need to have clear narrow meanings.

currently the type is aimed to differentiate three usages:

- kernel-managed I/O page table
- user-managed I/O page table
- shared I/O page table (e.g. with mm, or ept)

we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
indicator? their difference is not about format.

> 
> This does both of David's suggestions at once. If quemu wants the 1G
> limited region it could specify max_iova_hint = 1G, if it wants the
> extend 64bit region with the hole it can give either the high base or
> a large max_iova_hint. format/format_flags allows a further

Dave's links didn't answer one puzzle from me. Does PPC needs accurate
range information or be ok with a large range including holes (then let
the kernel to figure out where the holes locate)?

> device-specific escape if more specific customization is needed and is
> needed to specify user space page tables anyhow.

and I didn't understand the 2nd link. How does user-managed page
table jump into this range claim problem? I'm getting confused...

> 
> > > ioas works well here I think. Use ioas_id to refer to the xarray
> > > index.
> >
> > What about when introducing pasid to this uAPI? Then use ioas_id
> > for the xarray index
> 
> Yes, ioas_id should always be the xarray index.
> 
> PASID needs to be called out as PASID or as a generic "hw description"
> blob.

ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
kernel. Do we want to clear this confusion? Or possibly it's fine because
ioas_id is never used outside of iommufd and iommufd doesn't directly
call ioasid_alloc() from ioasid.c?

> 
> kvm's API to program the vPASID translation table should probably take
> in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> information using an in-kernel API. Userspace shouldn't have to
> shuttle it around.

the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI. 
when kvm calls iommufd with above tuple, vPASID->pPASID is
returned to kvm. So we still need a generic blob to represent
vPASID in the uAPI.

> 
> I'm starting to feel like the struct approach for describing this uAPI
> might not scale well, but lets see..
> 
> Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23  3:10       ` Tian, Kevin
@ 2021-09-23 10:15         ` Jean-Philippe Brucker
  2021-09-23 11:27           ` Jason Gunthorpe
  2021-09-23 11:36         ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-23 10:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Liu, Yi L, hch, jasowang, joro,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 03:10:47AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, September 23, 2021 7:50 AM
> > 
> > On Wed, Sep 22, 2021 at 03:24:07PM -0600, Alex Williamson wrote:
> > > On Sun, 19 Sep 2021 14:38:38 +0800
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > > +struct iommu_device_info {
> > > > +	__u32	argsz;
> > > > +	__u32	flags;
> > > > +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU
> > enforced snoop */
> > >
> > > Is this too PCI specific, or perhaps too much of the mechanism rather
> 
> Isn't snoop vs. !snoop a general concept not pci specific?
> 
> > > than the result?  ie. should we just indicate if the IOMMU guarantees
> > > coherent DMA?  Thanks,
> > 
> > I think the name of "coherent DMA" for this feature inside the kernel
> > is very, very confusing. We already have something called coherent dma
> > and this usage on Intel has nothing at all to do with that.
> > 
> > In fact it looks like this confusing name has already caused
> > implementation problems as I see dma-iommu, is connecting
> > dev->dma_coherent to IOMMU_CACHE! eg in dma_info_to_prot(). This is
> > completely wrong if IOMMU_CACHE is linked to no_snoop.
> > 
> > And ARM seems to have fallen out of step with x86 as the ARM IOMMU
> > drivers are mapping IOMMU_CACHE to ARM_LPAE_PTE_MEMATTR_OIWB,
> > ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > 
> > The SMMU spec for ARMv8 is pretty clear:
> > 
> >  13.6.1.1 No_snoop
> > 
> >  Support for No_snoop is system-dependent and, if implemented, No_snoop
> >  transforms a final access attribute of a Normal cacheable type to
> >  Normal-iNC-oNC-OSH downstream of (or appearing to be performed
> >  downstream of) the SMMU. No_snoop does not transform a final access
> >  attribute of any-Device.
> > 
> > Meaning setting ARM_LPAE_MAIR_ATTR_IDX_CACHE from IOMMU_CACHE
> > does NOT
> > block non-snoop, in fact it *enables* it - the reverse of what Intel
> > is doing!
> 
> Checking the code:
> 
>         if (data->iop.fmt == ARM_64_LPAE_S2 ||
>             data->iop.fmt == ARM_32_LPAE_S2) {
>                 if (prot & IOMMU_MMIO)
>                         pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>                 else if (prot & IOMMU_CACHE)
>                         pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>                 else
>                         pte |= ARM_LPAE_PTE_MEMATTR_NC;
> 
> It does set attribute to WB for IOMMU_CACHE and then NC (Non-cacheable)
> for !IOMMU_CACHE. The main difference between Intel and ARM is that Intel
> by default allows both snoop and non-snoop traffic with one additional bit
> to enforce snoop, while ARM requires explicit SMMU configuration for snoop
> and non-snoop respectively.
> 
>         } else {
>                 if (prot & IOMMU_MMIO)
>                         pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>                                 << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>                 else if (prot & IOMMU_CACHE)
>                         pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>                                 << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>         }
> 
> same for this one. MAIR_ELx register is programmed to ARM_LPAE_MAIR_
> ATTR_WBRWA for IDX_CACHE bit. I'm not sure why it doesn't use 
> IDX_NC though, when !IOMMU_CACHE.

It is in effect since IDX_NC == 0

> 
> > 
> > So this is all a mess.
> > 
> > Better to start clear and unambiguous names in the uAPI and someone
> > can try to clean up the kernel eventually.
> > 
> > The required behavior for iommufd is to have the IOMMU ignore the
> > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > clearly documented for its exact purpose and if other arches also have
> > instructions that need to be disabled if snoop TLPs are allowed then
> > they can re-use this bit. It appears ARM does not have this issue and
> > does not need the bit.
> 
> Disabling wbinvd is one purpose. imo the more important intention
> is that iommu vendor uses different PTE formats between snoop and
> !snoop. As long as we want allow userspace to opt in case of isoch 
> performance requirement (unlike current vfio which always choose
> snoop format if available), such mechanism is required for all vendors.
> 
> When creating an ioas there could be three snoop modes:
> 
> 1) snoop for all attached devices;
> 2) non-snoop for all attached devices;
> 3) device-selected snoop;
> 
> Intel supports 1) <enforce-snoop on> and 3) <enforce-snoop off>. snoop
> and nonsnoop devices can be attached to a same ioas in 3).
> 
> ARM supports 1) <snoop format> and 2) <nonsnoop format>. snoop devices
> and nonsnoop devices must be attached to different ioas's in 1) and 2)
> respectively.

I think Arm mainly supports 3), ie. No_snoop PCI transactions on pages
mapped cacheable become non-cacheable memory accesses.

But the Arm Base System Architecture 1.0
(https://developer.arm.com/documentation/den0094/a) states that it's
implementation dependent whether the system supports No_snoop.

    In the case where the system has a System MMU translating and
    attributing the transactions from the root complex, the PCI Express
    transactions must keep the memory attributes assigned by the System
    MMU. If the System MMU-assigned attribute is cacheable then it is
    IMPLEMENTATION DEFINED if No_snoop transactions replace the attribute
    with non-cached.

So we can only tell userspace "No_snoop is not supported" (provided we
even want to allow them to enable No_snoop). Users in control of stage-1
tables can create non-cacheable mappings through MAIR attributes.

Thanks,
Jean

> 
> Then the device info should reports:
> 
> /* iommu enforced snoop */
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0)
> /* iommu enforced nonsnoop */
> +#define IOMMU_DEVICE_INFO_ENFORCE_NONSNOOP	(1 << 1)
> /* device selected snoop */
> +#define IOMMU_DEVICE_INFO_DEVICE_SNOOP	(1 << 2)
> 
> > 
> > What ARM is doing with IOMMU_CACHE is unclear to me, and I'm unclear
> > if/how iommufd should expose it as a controllable PTE flag. The ARM
> > 
> 
> Based on above analysis I think the ARM usage with IOMMU_CACHE
> doesn't change. 
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23 10:15         ` Jean-Philippe Brucker
@ 2021-09-23 11:27           ` Jason Gunthorpe
  2021-09-23 12:05             ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 11:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, Alex Williamson, Liu, Yi L, hch, jasowang, joro,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:

> So we can only tell userspace "No_snoop is not supported" (provided we
> even want to allow them to enable No_snoop). Users in control of stage-1
> tables can create non-cacheable mappings through MAIR attributes.

My point is that ARM is using IOMMU_CACHE to control the overall
cachability of the DMA 

ie not specifying IOMMU_CACHE requires using the arch specific DMA
cache flushers. 

Intel never uses arch specifc DMA cache flushers, and instead is
abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that is always
cachable.

These are different things and need different bits. Since the ARM path
has a lot more code supporting it, I'd suggest Intel should change
their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.

Which clarifies what to do here as uAPI - these things need to have
different bits and Intel's should still have NO SNOOP in the
name. What the no-snoop bit is called on other busses can be clarified
in comments if that case ever arises.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23  3:10       ` Tian, Kevin
  2021-09-23 10:15         ` Jean-Philippe Brucker
@ 2021-09-23 11:36         ` Jason Gunthorpe
  1 sibling, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 11:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 03:10:47AM +0000, Tian, Kevin wrote:

> Disabling wbinvd is one purpose. imo the more important intention
> is that iommu vendor uses different PTE formats between snoop and
> !snoop. 

The PTE format for userspace is communicated through the format input,
not through random flags. If Intel has two different PTE formats then
userspace must negotiate which to use via the format input.

If the kernel controls the PTE then the format doesn't matter and the
kernel should configure things to match the requested behavior

> When creating an ioas there could be three snoop modes:
> 
> 1) snoop for all attached devices;
> 2) non-snoop for all attached devices;
> 3) device-selected snoop;

I'd express the three cases like this:

 0 
    ARM can avoid cache shooping, must use arch cache flush helpers
 IOMMU_CACHE
     Normal DMAs get cache coherence, do not need arch cache flush helpers
 IOMMU_CACHE | IOMMU_BLOCK_NO_SNOOP 
     All DMAs get cache coherence, not supported on ARM

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23  3:38         ` Tian, Kevin
@ 2021-09-23 11:42           ` Jason Gunthorpe
  2021-09-30  9:35             ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 11:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 03:38:10AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Thursday, September 23, 2021 11:11 AM
> > 
> > >
> > > The required behavior for iommufd is to have the IOMMU ignore the
> > > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > > clearly documented for its exact purpose and if other arches also have
> > > instructions that need to be disabled if snoop TLPs are allowed then
> > > they can re-use this bit. It appears ARM does not have this issue and
> > > does not need the bit.
> > 
> > Disabling wbinvd is one purpose. imo the more important intention
> > is that iommu vendor uses different PTE formats between snoop and
> > !snoop. As long as we want allow userspace to opt in case of isoch
> > performance requirement (unlike current vfio which always choose
> > snoop format if available), such mechanism is required for all vendors.
> > 
> 
> btw I'm not sure whether the wbinvd trick is Intel specific. All other
> platforms (amd, arm, s390, etc.) currently always claim OMMU_CAP_
> CACHE_COHERENCY (the source of IOMMU_CACHE). 

This only means they don't need to use the arch cache flush
helpers. It has nothing to do with no-snoop on those platforms.

> They didn't hit this problem because vfio always sets IOMMU_CACHE to
> force every DMA to snoop. Will they need to handle similar
> wbinvd-like trick (plus necessary memory type virtualization) when
> non-snoop format is enabled?  Or are their architectures highly
> optimized to afford isoch traffic even with snoop (then fine to not
> support user opt-in)?

In other arches the question is:
 - Do they allow non-coherent DMA to exist in a VM?
 - Can the VM issue cache maintaince ops to fix the decoherence?

The Intel functional issue is that Intel blocks the cache maintaince
ops from the VM and the VM has no way to self-discover that the cache
maintaince ops don't work. 

Other arches don't seem to have this specific problem...

The other warped part of this is that Linux doesn't actually support
no-snoop DMA through the DMA API. The users in Intel GPU drivers are
all hacking it up, so it may well be that on other arches Linux never
ask devices to issue no-snoop DMA because there is no portable way
for the driver to restore coherence on a DMA by DMA basis..

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-23  7:25         ` Eric Auger
@ 2021-09-23 11:44           ` Jason Gunthorpe
  0 siblings, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 11:44 UTC (permalink / raw)
  To: Eric Auger
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 09:25:27AM +0200, Eric Auger wrote:
> Hi,
> 
> On 9/22/21 3:00 AM, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> >>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>> Sent: Wednesday, September 22, 2021 12:01 AM
> >>>
> >>>>  One open about how to organize the device nodes under
> >>> /dev/vfio/devices/.
> >>>> This RFC adopts a simple policy by keeping a flat layout with mixed
> >>> devname
> >>>> from all kinds of devices. The prerequisite of this model is that devnames
> >>>> from different bus types are unique formats:
> >>> This isn't reliable, the devname should just be vfio0, vfio1, etc
> >>>
> >>> The userspace can learn the correct major/minor by inspecting the
> >>> sysfs.
> >>>
> >>> This whole concept should disappear into the prior patch that adds the
> >>> struct device in the first place, and I think most of the code here
> >>> can be deleted once the struct device is used properly.
> >>>
> >> Can you help elaborate above flow? This is one area where we need
> >> more guidance.
> >>
> >> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> >> how does Qemu identify which vifo0/1/... is associated with the specified 
> >> DDDD:BB:DD.F? 
> > When done properly in the kernel the file:
> >
> > /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> >
> > Will contain the major:minor of the VFIO device.
> >
> > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > that the major:minor matches.
> >
> > in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> > to qemu.
> I guess this would be the same for platform devices, for instance
> /sys/bus/platform/devices/AMDI8001:01/vfio/vfioX/dev, right?

Yes, it is the general driver core pattern for creating cdevs below a
parent device

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23 11:27           ` Jason Gunthorpe
@ 2021-09-23 12:05             ` Tian, Kevin
  2021-09-23 12:22               ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23 12:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Jean-Philippe Brucker
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 7:27 PM
> 
> On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> 
> > So we can only tell userspace "No_snoop is not supported" (provided we
> > even want to allow them to enable No_snoop). Users in control of stage-1
> > tables can create non-cacheable mappings through MAIR attributes.
> 
> My point is that ARM is using IOMMU_CACHE to control the overall
> cachability of the DMA
> 
> ie not specifying IOMMU_CACHE requires using the arch specific DMA
> cache flushers.
> 
> Intel never uses arch specifc DMA cache flushers, and instead is
> abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that
> is always
> cachable.

it uses IOMMU_CACHE to force all DMAs to snoop, including those which
has non_snoop flag and wouldn't snoop cache if iommu is disabled. Nothing
is blocked.

but why do you call it abuse? IOMMU_CACHE was first introduced for
Intel platform:

commit 9cf0669746be19a4906a6c48920060bcf54c708b
Author: Sheng Yang <sheng@linux.intel.com>
Date:   Wed Mar 18 15:33:07 2009 +0800

    intel-iommu: VT-d page table to support snooping control bit

    The user can request to enable snooping control through VT-d page table.

    Signed-off-by: Sheng Yang <sheng@linux.intel.com>
    Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

> 
> These are different things and need different bits. Since the ARM path
> has a lot more code supporting it, I'd suggest Intel should change
> their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.

I didn't fully get this point. The end result is same, i.e. making the DMA
cache-coherent when IOMMU_CACHE is set. Or if you help define the
behavior of IOMMU_CACHE, what will you define now?

> 
> Which clarifies what to do here as uAPI - these things need to have
> different bits and Intel's should still have NO SNOOP in the
> name. What the no-snoop bit is called on other busses can be clarified
> in comments if that case ever arises.
> 
> Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23  9:14         ` Tian, Kevin
@ 2021-09-23 12:06           ` Jason Gunthorpe
  2021-09-23 12:22             ` Tian, Kevin
  2021-10-01  6:26           ` david
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 12:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:

> currently the type is aimed to differentiate three usages:
> 
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)

Creating a shared ios is something that should probably be a different
command.

> we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> indicator? their difference is not about format.

Format should be

FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

> Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> range information or be ok with a large range including holes (then let
> the kernel to figure out where the holes locate)?

My impression was it only needed a way to select between the two
different cases as they are exclusive. I'd see this API as being a
hint and userspace should query the exact ranges to learn what was
actually created.
 
> > device-specific escape if more specific customization is needed and is
> > needed to specify user space page tables anyhow.
> 
> and I didn't understand the 2nd link. How does user-managed page
> table jump into this range claim problem? I'm getting confused...

PPC could also model it using a FORMAT_KERNEL_PPC_X, FORMAT_KERNEL_PPC_Y
though it is less nice..

> > Yes, ioas_id should always be the xarray index.
> > 
> > PASID needs to be called out as PASID or as a generic "hw description"
> > blob.
> 
> ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
and it MUST be exposed in that format to be programmed into the PCI
device itself.

All of this should be able to support a userspace, like DPDK, creating
a PASID on its own without any special VFIO drivers.

- Open iommufd
- Attach the vfio device FD
- Request a PASID device id
- Create an ios against the pasid device id
- Query the ios for the PCI PASID #
- Program the HW to issue TLPs with the PASID

> and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> kernel. Do we want to clear this confusion? Or possibly it's fine because
> ioas_id is never used outside of iommufd and iommufd doesn't directly
> call ioasid_alloc() from ioasid.c?

As long as it is ioas_id and ioasid it is probably fine..

> > kvm's API to program the vPASID translation table should probably take
> > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > information using an in-kernel API. Userspace shouldn't have to
> > shuttle it around.
> 
> the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> when kvm calls iommufd with above tuple, vPASID->pPASID is
> returned to kvm. So we still need a generic blob to represent
> vPASID in the uAPI.

I think you have to be clear about what the value is being used
for. Is it an IOMMU page table handle or is it a PCI PASID value?

AFAICT I think it is the former in the Intel scheme as the "vPASID" is
really about presenting a consistent IOMMU handle to the guest across
migration, it is not the value that shows up on the PCI bus.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23 12:05             ` Tian, Kevin
@ 2021-09-23 12:22               ` Jason Gunthorpe
  2021-09-29  8:48                 ` Tian, Kevin
  2021-09-30  8:49                 ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 12:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Alex Williamson, Liu, Yi L, hch, jasowang,
	joro, parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 12:05:29PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, September 23, 2021 7:27 PM
> > 
> > On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> > 
> > > So we can only tell userspace "No_snoop is not supported" (provided we
> > > even want to allow them to enable No_snoop). Users in control of stage-1
> > > tables can create non-cacheable mappings through MAIR attributes.
> > 
> > My point is that ARM is using IOMMU_CACHE to control the overall
> > cachability of the DMA
> > 
> > ie not specifying IOMMU_CACHE requires using the arch specific DMA
> > cache flushers.
> > 
> > Intel never uses arch specifc DMA cache flushers, and instead is
> > abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that
> > is always
> > cachable.
> 
> it uses IOMMU_CACHE to force all DMAs to snoop, including those which
> has non_snoop flag and wouldn't snoop cache if iommu is disabled. Nothing
> is blocked.

I see it differently, on Intel the only way to bypass the cache with
DMA is to specify the no-snoop bit in the TLP. The IOMMU PTE flag we
are talking about tells the IOMMU to ignore the no snoop bit.

Again, Intel arch in the kernel does not support the DMA cache flush
arch API and *DOES NOT* support incoherent DMA at all.

ARM *does* implement the DMA cache flush arch API and is using
IOMMU_CACHE to control if the caller will, or will not call the cache
flushes.

This is fundamentally different from what Intel is using it for.

> but why do you call it abuse? IOMMU_CACHE was first introduced for
> Intel platform:

IMHO ARM changed the meaning when Robin linked IOMMU_CACHE to
dma_is_coherent stuff. At that point it became linked to 'do I need to
call arch cache flushers or not'.

> > These are different things and need different bits. Since the ARM path
> > has a lot more code supporting it, I'd suggest Intel should change
> > their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.
> 
> I didn't fully get this point. The end result is same, i.e. making the DMA
> cache-coherent when IOMMU_CACHE is set. Or if you help define the
> behavior of IOMMU_CACHE, what will you define now?

It is clearly specifying how the kernel API works:

 !IOMMU_CACHE
   must call arch cache flushers
 IOMMU_CACHE -
   do not call arch cache flushers
 IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP - 
   dot not arch cache flushers, and ignore the no snoop bit.

On Intel it should refuse to create a !IOMMU_CACHE since the HW can't
do that. All IOMMU formats can support IOMMU_CACHE. Only the special
no-snoop IOPTE format can support the final one, and it is only useful
for iommufd/vfio users that are interacting with VMs and wbvind.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 12:06           ` Jason Gunthorpe
@ 2021-09-23 12:22             ` Tian, Kevin
  2021-09-23 12:31               ` Jason Gunthorpe
  2021-10-01  6:30               ` david
  0 siblings, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23 12:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 8:07 PM
> 
> On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> 
> > currently the type is aimed to differentiate three usages:
> >
> > - kernel-managed I/O page table
> > - user-managed I/O page table
> > - shared I/O page table (e.g. with mm, or ept)
> 
> Creating a shared ios is something that should probably be a different
> command.

why? I didn't understand the criteria here...

> 
> > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > indicator? their difference is not about format.
> 
> Format should be
> 
> FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?

> 
> > Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> > range information or be ok with a large range including holes (then let
> > the kernel to figure out where the holes locate)?
> 
> My impression was it only needed a way to select between the two
> different cases as they are exclusive. I'd see this API as being a
> hint and userspace should query the exact ranges to learn what was
> actually created.

yes, the user can query the permitted range using DEVICE_GET_INFO.
But in the end if the user wants two separate regions, I'm afraid that 
the underlying iommu driver wants to know the exact info. iirc PPC
has one global system address space shared by all devices. It is possible
that the user may want to claim range-A and range-C, with range-B
in-between but claimed by another user. Then simply using one hint
range [A-lowend, C-highend] might not work.

> 
> > > device-specific escape if more specific customization is needed and is
> > > needed to specify user space page tables anyhow.
> >
> > and I didn't understand the 2nd link. How does user-managed page
> > table jump into this range claim problem? I'm getting confused...
> 
> PPC could also model it using a FORMAT_KERNEL_PPC_X,
> FORMAT_KERNEL_PPC_Y
> though it is less nice..

yes PPC can use different format, but I didn't understand why it is 
related user-managed page table which further requires nesting. sound
disconnected topics here...

> 
> > > Yes, ioas_id should always be the xarray index.
> > >
> > > PASID needs to be called out as PASID or as a generic "hw description"
> > > blob.
> >
> > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?
> 
> ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> and it MUST be exposed in that format to be programmed into the PCI
> device itself.

In the entire discussion in previous design RFC, I kept an impression that
ARM-equivalent PASID is called SSID. If we can use PASID as a general
term in iommufd context, definitely it's much better!

> 
> All of this should be able to support a userspace, like DPDK, creating
> a PASID on its own without any special VFIO drivers.
> 
> - Open iommufd
> - Attach the vfio device FD
> - Request a PASID device id
> - Create an ios against the pasid device id
> - Query the ios for the PCI PASID #
> - Program the HW to issue TLPs with the PASID

this all makes me very confused, and completely different from what
we agreed in previous v2 design proposal:

- open iommufd
- create an ioas
- attach vfio device to ioasid, with vPASID info
	* vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
	* the latter then installs ioas to the IOMMU with RID/PASID

> 
> > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> > kernel. Do we want to clear this confusion? Or possibly it's fine because
> > ioas_id is never used outside of iommufd and iommufd doesn't directly
> > call ioasid_alloc() from ioasid.c?
> 
> As long as it is ioas_id and ioasid it is probably fine..

let's align with others in a few hours.

> 
> > > kvm's API to program the vPASID translation table should probably take
> > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > > information using an in-kernel API. Userspace shouldn't have to
> > > shuttle it around.
> >
> > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> > when kvm calls iommufd with above tuple, vPASID->pPASID is
> > returned to kvm. So we still need a generic blob to represent
> > vPASID in the uAPI.
> 
> I think you have to be clear about what the value is being used
> for. Is it an IOMMU page table handle or is it a PCI PASID value?
> 
> AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> really about presenting a consistent IOMMU handle to the guest across
> migration, it is not the value that shows up on the PCI bus.
> 

It's the former. But vfio driver needs to maintain vPASID->pPASID
translation in the mediation path, since what guest programs is vPASID.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 12:22             ` Tian, Kevin
@ 2021-09-23 12:31               ` Jason Gunthorpe
  2021-09-23 12:45                 ` Tian, Kevin
  2021-10-01  6:30               ` david
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 12:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, September 23, 2021 8:07 PM
> > 
> > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > 
> > > currently the type is aimed to differentiate three usages:
> > >
> > > - kernel-managed I/O page table
> > > - user-managed I/O page table
> > > - shared I/O page table (e.g. with mm, or ept)
> > 
> > Creating a shared ios is something that should probably be a different
> > command.
> 
> why? I didn't understand the criteria here...

I suspect the input args will be very different, no?

> > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > indicator? their difference is not about format.
> > 
> > Format should be
> > 
> > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> 
> INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?

So long as we are using structs we need to have values then the field
isn't being used. FORMAT_KERNEL is a reasonable value to have when we
are not creating a userspace page table.

Alternatively a userspace page table could have a different API

> yes, the user can query the permitted range using DEVICE_GET_INFO.
> But in the end if the user wants two separate regions, I'm afraid that 
> the underlying iommu driver wants to know the exact info. iirc PPC
> has one global system address space shared by all devices. It is possible
> that the user may want to claim range-A and range-C, with range-B
> in-between but claimed by another user. Then simply using one hint
> range [A-lowend, C-highend] might not work.

I don't know, that sounds strange.. In any event hint is a hint, it
can be ignored, the only information the kernel needs to extract is
low/high bank?

> yes PPC can use different format, but I didn't understand why it is 
> related user-managed page table which further requires nesting. sound
> disconnected topics here...

It is just a way to feed through more information if we get stuck
someday.

> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > and it MUST be exposed in that format to be programmed into the PCI
> > device itself.
> 
> In the entire discussion in previous design RFC, I kept an impression that
> ARM-equivalent PASID is called SSID. If we can use PASID as a general
> term in iommufd context, definitely it's much better!

SSID is inside the chip and part of the IOMMU. PASID is part of the
PCI spec.

iommufd should keep these things distinct. 

If we are talking about a PCI TLP then the name to use is PASID.

> > All of this should be able to support a userspace, like DPDK, creating
> > a PASID on its own without any special VFIO drivers.
> > 
> > - Open iommufd
> > - Attach the vfio device FD
> > - Request a PASID device id
> > - Create an ios against the pasid device id
> > - Query the ios for the PCI PASID #
> > - Program the HW to issue TLPs with the PASID
> 
> this all makes me very confused, and completely different from what
> we agreed in previous v2 design proposal:
>
> - open iommufd
> - create an ioas
> - attach vfio device to ioasid, with vPASID info
> 	* vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
> 	* the latter then installs ioas to the IOMMU with RID/PASID

This was your flow for mdev's, I've always been talking about wanting
to see this supported for all use cases, including physical PCI
devices w/ PASID support.

A normal vfio_pci userspace should be able to create PASIDs unrelated
to the mdev stuff.

> > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > really about presenting a consistent IOMMU handle to the guest across
> > migration, it is not the value that shows up on the PCI bus.
> 
> It's the former. But vfio driver needs to maintain vPASID->pPASID
> translation in the mediation path, since what guest programs is vPASID.

The pPASID definately is a PASID as it goes out on the PCIe wire

Suggest you come up with a more general name for vPASID?

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 12:31               ` Jason Gunthorpe
@ 2021-09-23 12:45                 ` Tian, Kevin
  2021-09-23 13:01                   ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23 12:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 8:31 PM
> 
> On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, September 23, 2021 8:07 PM
> > >
> > > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > >
> > > > currently the type is aimed to differentiate three usages:
> > > >
> > > > - kernel-managed I/O page table
> > > > - user-managed I/O page table
> > > > - shared I/O page table (e.g. with mm, or ept)
> > >
> > > Creating a shared ios is something that should probably be a different
> > > command.
> >
> > why? I didn't understand the criteria here...
> 
> I suspect the input args will be very different, no?

yes, but can't the structure be extended to incorporate it? 

> 
> > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > > indicator? their difference is not about format.
> > >
> > > Format should be
> > >
> > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> >
> > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> 
> So long as we are using structs we need to have values then the field
> isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> are not creating a userspace page table.
> 
> Alternatively a userspace page table could have a different API

I don't know. Your comments really confused me on what's the right
way to design the uAPI. If you still remember, the original v1 proposal
introduced different uAPIs for kernel/user-managed cases. Then you
recommended to consolidate everything related to ioas in one allocation
command.

Can you help articulate the criteria first?

> 
> > yes, the user can query the permitted range using DEVICE_GET_INFO.
> > But in the end if the user wants two separate regions, I'm afraid that
> > the underlying iommu driver wants to know the exact info. iirc PPC
> > has one global system address space shared by all devices. It is possible
> > that the user may want to claim range-A and range-C, with range-B
> > in-between but claimed by another user. Then simply using one hint
> > range [A-lowend, C-highend] might not work.
> 
> I don't know, that sounds strange.. In any event hint is a hint, it
> can be ignored, the only information the kernel needs to extract is
> low/high bank?

iirc Dave said that the user needs to claim a range explicitly. 'claim'
sounds not a hint to me. Possibly it's time for Dave to chime in. 

> 
> > yes PPC can use different format, but I didn't understand why it is
> > related user-managed page table which further requires nesting. sound
> > disconnected topics here...
> 
> It is just a way to feed through more information if we get stuck
> someday.

You mean that we should define uAPI for all future possible extensions
now to minimize the frequency of changing it?

> 
> > > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > > and it MUST be exposed in that format to be programmed into the PCI
> > > device itself.
> >
> > In the entire discussion in previous design RFC, I kept an impression that
> > ARM-equivalent PASID is called SSID. If we can use PASID as a general
> > term in iommufd context, definitely it's much better!
> 
> SSID is inside the chip and part of the IOMMU. PASID is part of the
> PCI spec.
> 
> iommufd should keep these things distinct.
> 
> If we are talking about a PCI TLP then the name to use is PASID.

If Jean doesn't object...

> 
> > > All of this should be able to support a userspace, like DPDK, creating
> > > a PASID on its own without any special VFIO drivers.
> > >
> > > - Open iommufd
> > > - Attach the vfio device FD
> > > - Request a PASID device id
> > > - Create an ios against the pasid device id
> > > - Query the ios for the PCI PASID #
> > > - Program the HW to issue TLPs with the PASID
> >
> > this all makes me very confused, and completely different from what
> > we agreed in previous v2 design proposal:
> >
> > - open iommufd
> > - create an ioas
> > - attach vfio device to ioasid, with vPASID info
> > 	* vfio converts vPASID to pPASID and then call
> iommufd_device_attach_ioasid()
> > 	* the latter then installs ioas to the IOMMU with RID/PASID
> 
> This was your flow for mdev's, I've always been talking about wanting
> to see this supported for all use cases, including physical PCI
> devices w/ PASID support.

this is not a flow for mdev. It's also required for pdev on Intel platform,
because the pasid table is in HPA space thus must be managed by host 
kernel. Even no translation we still need the user to provide the pasid info.

> 
> A normal vfio_pci userspace should be able to create PASIDs unrelated
> to the mdev stuff.
> 
> > > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > > really about presenting a consistent IOMMU handle to the guest across
> > > migration, it is not the value that shows up on the PCI bus.
> >
> > It's the former. But vfio driver needs to maintain vPASID->pPASID
> > translation in the mediation path, since what guest programs is vPASID.
> 
> The pPASID definately is a PASID as it goes out on the PCIe wire
> 
> Suggest you come up with a more general name for vPASID?
> 

as explained earlier, on Intel platform the user always needs to provide 
a PASID in the attaching call. whether it's directly used (for pdev)
or translated (for mdev) is the underlying driver thing. From kernel
p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
in the uAPI.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 12:45                 ` Tian, Kevin
@ 2021-09-23 13:01                   ` Jason Gunthorpe
  2021-09-23 13:20                     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 13:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 12:45:17PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, September 23, 2021 8:31 PM
> > 
> > On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > >
> > > > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > > >
> > > > > currently the type is aimed to differentiate three usages:
> > > > >
> > > > > - kernel-managed I/O page table
> > > > > - user-managed I/O page table
> > > > > - shared I/O page table (e.g. with mm, or ept)
> > > >
> > > > Creating a shared ios is something that should probably be a different
> > > > command.
> > >
> > > why? I didn't understand the criteria here...
> > 
> > I suspect the input args will be very different, no?
> 
> yes, but can't the structure be extended to incorporate it? 

You need to be thoughtful, giant structures with endless combinations
of optional fields turn out very hard. I haven't even seen what args
this shared thing will need, but I'm guessing it is almost none, so
maybe a new call is OK?

If it is literally just 'give me an ioas for current mm' then it has
no args or complexity at all.

> > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > > > indicator? their difference is not about format.
> > > >
> > > > Format should be
> > > >
> > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > >
> > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> > 
> > So long as we are using structs we need to have values then the field
> > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > are not creating a userspace page table.
> > 
> > Alternatively a userspace page table could have a different API
> 
> I don't know. Your comments really confused me on what's the right
> way to design the uAPI. If you still remember, the original v1 proposal
> introduced different uAPIs for kernel/user-managed cases. Then you
> recommended to consolidate everything related to ioas in one allocation
> command.

This is because you had almost completely duplicated the input args
between the two calls.

If it turns out they have very different args, then they should have
different calls.

> > > - open iommufd
> > > - create an ioas
> > > - attach vfio device to ioasid, with vPASID info
> > > 	* vfio converts vPASID to pPASID and then call
> > iommufd_device_attach_ioasid()
> > > 	* the latter then installs ioas to the IOMMU with RID/PASID
> > 
> > This was your flow for mdev's, I've always been talking about wanting
> > to see this supported for all use cases, including physical PCI
> > devices w/ PASID support.
> 
> this is not a flow for mdev. It's also required for pdev on Intel platform,
> because the pasid table is in HPA space thus must be managed by host 
> kernel. Even no translation we still need the user to provide the pasid info.

There should be no mandatory vPASID stuff in most of these flows, that
is just a special thing ENQCMD virtualization needs. If userspace
isn't doing ENQCMD virtualization it shouldn't need to touch this
stuff.

> as explained earlier, on Intel platform the user always needs to provide 
> a PASID in the attaching call. whether it's directly used (for pdev)
> or translated (for mdev) is the underlying driver thing. From kernel
> p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> in the uAPI.

I've always disagreed with this. There should be an option for the
kernel to pick an appropriate PASID for portability to other IOMMUs
and simplicity of the interface.

You need to keep it clear what is in the minimum basic path and what
is needed for special cases, like ENQCMD virtualization.

Not every user of iommufd is doing virtualization.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 13:01                   ` Jason Gunthorpe
@ 2021-09-23 13:20                     ` Tian, Kevin
  2021-09-23 13:30                       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23 13:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 9:02 PM
> 
> On Thu, Sep 23, 2021 at 12:45:17PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, September 23, 2021 8:31 PM
> > >
> > > On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > > >
> > > > > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > currently the type is aimed to differentiate three usages:
> > > > > >
> > > > > > - kernel-managed I/O page table
> > > > > > - user-managed I/O page table
> > > > > > - shared I/O page table (e.g. with mm, or ept)
> > > > >
> > > > > Creating a shared ios is something that should probably be a different
> > > > > command.
> > > >
> > > > why? I didn't understand the criteria here...
> > >
> > > I suspect the input args will be very different, no?
> >
> > yes, but can't the structure be extended to incorporate it?
> 
> You need to be thoughtful, giant structures with endless combinations
> of optional fields turn out very hard. I haven't even seen what args
> this shared thing will need, but I'm guessing it is almost none, so
> maybe a new call is OK?

To judge this looks we may have to do some practice on this front
e.g. coming up an example structure for future intended usages and
then see whether one structure can fit? 

> 
> If it is literally just 'give me an ioas for current mm' then it has
> no args or complexity at all.

for mm, yes, should be simple. for ept it might be more complex e.g.
requiring a handle in kvm and some other format info to match ept
page table.

> 
> > > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > > > > indicator? their difference is not about format.
> > > > >
> > > > > Format should be
> > > > >
> > > > >
> FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > > >
> > > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> > >
> > > So long as we are using structs we need to have values then the field
> > > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > > are not creating a userspace page table.
> > >
> > > Alternatively a userspace page table could have a different API
> >
> > I don't know. Your comments really confused me on what's the right
> > way to design the uAPI. If you still remember, the original v1 proposal
> > introduced different uAPIs for kernel/user-managed cases. Then you
> > recommended to consolidate everything related to ioas in one allocation
> > command.
> 
> This is because you had almost completely duplicated the input args
> between the two calls.
> 
> If it turns out they have very different args, then they should have
> different calls.
> 
> > > > - open iommufd
> > > > - create an ioas
> > > > - attach vfio device to ioasid, with vPASID info
> > > > 	* vfio converts vPASID to pPASID and then call
> > > iommufd_device_attach_ioasid()
> > > > 	* the latter then installs ioas to the IOMMU with RID/PASID
> > >
> > > This was your flow for mdev's, I've always been talking about wanting
> > > to see this supported for all use cases, including physical PCI
> > > devices w/ PASID support.
> >
> > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > because the pasid table is in HPA space thus must be managed by host
> > kernel. Even no translation we still need the user to provide the pasid info.
> 
> There should be no mandatory vPASID stuff in most of these flows, that
> is just a special thing ENQCMD virtualization needs. If userspace
> isn't doing ENQCMD virtualization it shouldn't need to touch this
> stuff.

No. for one, we also support SVA w/o using ENQCMD. For two, the key
is that the PASID table cannot be delegated to the userspace like ARM
or AMD. This implies that for any pasid that the userspace wants to
enable, it must be configured via the kernel.

> 
> > as explained earlier, on Intel platform the user always needs to provide
> > a PASID in the attaching call. whether it's directly used (for pdev)
> > or translated (for mdev) is the underlying driver thing. From kernel
> > p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> > in the uAPI.
> 
> I've always disagreed with this. There should be an option for the
> kernel to pick an appropriate PASID for portability to other IOMMUs
> and simplicity of the interface.
> 
> You need to keep it clear what is in the minimum basic path and what
> is needed for special cases, like ENQCMD virtualization.
> 
> Not every user of iommufd is doing virtualization.
> 

just for a short summary of PASID model from previous design RFC:

for arm/amd:
	- pasid space delegated to userspace
	- pasid table delegated to userspace
	- just one call to bind pasid_table() then pasids are fully managed by user

for intel:
	- pasid table is always managed by kernel
	- for pdev,
		- pasid space is delegated to userspace
		- attach_ioasid(dev, ioasid, pasid) so the kernel can setup the pasid entry
	- for mdev,
		- pasid space is managed by userspace
		- attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to ppasid. iommufd setups the ppasid entry
		- additional a contract to kvm for setup CPU pasid translation if enqcmd is used
	- to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let underlying driver to figure out whether vpasid should be translated.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 13:20                     ` Tian, Kevin
@ 2021-09-23 13:30                       ` Jason Gunthorpe
  2021-09-23 13:41                         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 13:30 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 23, 2021 at 01:20:55PM +0000, Tian, Kevin wrote:

> > > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > > because the pasid table is in HPA space thus must be managed by host
> > > kernel. Even no translation we still need the user to provide the pasid info.
> > 
> > There should be no mandatory vPASID stuff in most of these flows, that
> > is just a special thing ENQCMD virtualization needs. If userspace
> > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > stuff.
> 
> No. for one, we also support SVA w/o using ENQCMD. For two, the key
> is that the PASID table cannot be delegated to the userspace like ARM
> or AMD. This implies that for any pasid that the userspace wants to
> enable, it must be configured via the kernel.

Yes, configured through the kernel, but the simplified flow should
have the kernel handle everything and just emit a PASID for userspace
to use.


> just for a short summary of PASID model from previous design RFC:
> 
> for arm/amd:
> 	- pasid space delegated to userspace
> 	- pasid table delegated to userspace
> 	- just one call to bind pasid_table() then pasids are fully managed by user
> 
> for intel:
> 	- pasid table is always managed by kernel
> 	- for pdev,
> 		- pasid space is delegated to userspace
> 		- attach_ioasid(dev, ioasid, pasid) so the kernel can setup the pasid entry
> 	- for mdev,
> 		- pasid space is managed by userspace
> 		- attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to ppasid. iommufd setups the ppasid entry
> 		- additional a contract to kvm for setup CPU pasid translation if enqcmd is used
> 	- to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let underlying driver to figure out whether vpasid should be translated.

All cases should support a kernel owned ioas associated with a
PASID. This is the universal basic API that all PASID supporting
IOMMUs need to implement.

I should not need to write generic users space that has to know how to
setup architecture specific nested userspace page tables just to use
PASID!

All of the above is qemu accelerated vIOMMU stuff. It is a good idea
to keep the two areas seperate as it greatly informs what is general
code and what is HW specific code.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-23 13:30                       ` Jason Gunthorpe
@ 2021-09-23 13:41                         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-23 13:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 9:31 PM
> 
> On Thu, Sep 23, 2021 at 01:20:55PM +0000, Tian, Kevin wrote:
> 
> > > > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > > > because the pasid table is in HPA space thus must be managed by host
> > > > kernel. Even no translation we still need the user to provide the pasid
> info.
> > >
> > > There should be no mandatory vPASID stuff in most of these flows, that
> > > is just a special thing ENQCMD virtualization needs. If userspace
> > > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > > stuff.
> >
> > No. for one, we also support SVA w/o using ENQCMD. For two, the key
> > is that the PASID table cannot be delegated to the userspace like ARM
> > or AMD. This implies that for any pasid that the userspace wants to
> > enable, it must be configured via the kernel.
> 
> Yes, configured through the kernel, but the simplified flow should
> have the kernel handle everything and just emit a PASID for userspace
> to use.
> 
> 
> > just for a short summary of PASID model from previous design RFC:
> >
> > for arm/amd:
> > 	- pasid space delegated to userspace
> > 	- pasid table delegated to userspace
> > 	- just one call to bind pasid_table() then pasids are fully managed by
> user
> >
> > for intel:
> > 	- pasid table is always managed by kernel
> > 	- for pdev,
> > 		- pasid space is delegated to userspace
> > 		- attach_ioasid(dev, ioasid, pasid) so the kernel can setup the
> pasid entry
> > 	- for mdev,
> > 		- pasid space is managed by userspace
> > 		- attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to
> ppasid. iommufd setups the ppasid entry
> > 		- additional a contract to kvm for setup CPU pasid translation
> if enqcmd is used
> > 	- to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let
> underlying driver to figure out whether vpasid should be translated.
> 
> All cases should support a kernel owned ioas associated with a
> PASID. This is the universal basic API that all PASID supporting
> IOMMUs need to implement.
> 
> I should not need to write generic users space that has to know how to
> setup architecture specific nested userspace page tables just to use
> PASID!

ah, got you! I have to admit that my previous thoughts are all from
VM p.o.v, with true userspace application ignored...

> 
> All of the above is qemu accelerated vIOMMU stuff. It is a good idea
> to keep the two areas seperate as it greatly informs what is general
> code and what is HW specific code.
> 

Agree. will think more along this direction. possibly this discussion 
deviated a lot from what this skeleton series provide. We still have 
plenty of time to figure it out when starting the pasid support. For now
at least the minimal output is that PASID might be a good candidate to 
be used in iommufd. 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-22 12:39       ` Jason Gunthorpe
  2021-09-22 13:56         ` Tian, Kevin
@ 2021-09-27  9:42         ` Tian, Kevin
  2021-09-27 11:34           ` Lu Baolu
                             ` (2 more replies)
  1 sibling, 3 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-27  9:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:40 PM
> 
> > > Ie the basic flow would see the driver core doing some:
> >
> > Just double confirm. Is there concern on having the driver core to
> > call iommu functions?
> 
> It is always an interesting question, but I'd say iommu is
> foundantional to Linux and if it needs driver core help it shouldn't
> be any different from PM, pinctl, or other subsystems that have
> inserted themselves into the driver core.
> 
> Something kind of like the below.
> 
> If I recall, once it is done like this then the entire iommu notifier
> infrastructure can be ripped out which is a lot of code.

Currently vfio is the only user of this notifier mechanism. Now 
three events are handled in vfio_iommu_group_notifier():

NOTIFY_ADD_DEVICE: this is basically for some sanity check. suppose
not required once we handle it cleanly in the iommu/driver core.

NOTIFY_BOUND_DRIVER: the BUG_ON() logic to be fixed by this change.

NOTIFY_UNBOUND_DRIVER: still needs some thoughts. Based on
the comments the group->unbound_list is used to avoid breaking
group viability check between vfio_unregister_group_dev() and 
final dev/drv teardown. within that small window the device is
not tracked by vfio group but is still bound to a driver (e.g. vfio-pci
itself), while an external group user may hold a reference to the
group. Possibly it's not required now with the new mechanism as 
we rely on init/exit_user_dma() as the single switch to claim/
withdraw the group ownership. As long as exit_user_dma() is not 
called until vfio_group_release(), above small window is covered
thus no need to maintain a unbound_list.

But anyway since this corner case is tricky, will think more in case
of any oversight.

> 
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f949daa90..e39612c99c6123 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
>                 goto done;
>         }
> 
> +       ret = iommu_set_kernel_ownership(dev);
> +       if (ret)
> +               return ret;
> +
>  re_probe:
>         dev->driver = drv;
> 
> @@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
>                 dev->pm_domain->dismiss(dev);
>         pm_runtime_reinit(dev);
>         dev_pm_set_driver_flags(dev, 0);
> +       iommu_release_kernel_ownership(dev);
>  done:
>         return ret;
>  }
> @@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device
> *dev, struct device *parent)
>                         dev->pm_domain->dismiss(dev);
>                 pm_runtime_reinit(dev);
>                 dev_pm_set_driver_flags(dev, 0);
> +               iommu_release_kernel_ownership(dev);
> 
>                 klist_remove(&dev->p->knode_driver);
>                 device_pm_check_callbacks(dev);

I expanded above into below conceptual draft. Please help check whether
it matches your thought:

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f9..826a651 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 		goto done;
 	}
 
+	ret = iommu_device_set_dma_hint(dev, drv->dma_hint);
+	if (ret)
+		return ret;
+
 re_probe:
 	dev->driver = drv;
 
@@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 		dev->pm_domain->dismiss(dev);
 	pm_runtime_reinit(dev);
 	dev_pm_set_driver_flags(dev, 0);
+	iommu_device_clear_dma_hint(dev);
 done:
 	return ret;
 }
@@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 			dev->pm_domain->dismiss(dev);
 		pm_runtime_reinit(dev);
 		dev_pm_set_driver_flags(dev, 0);
+		iommu_device_clear_dma_hint(dev);
 
 		klist_remove(&dev->p->knode_driver);
 		device_pm_check_callbacks(dev);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3303d70..b12f335 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1064,6 +1064,104 @@ void iommu_group_put(struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_group_put);
 
+static int iommu_dev_viable(struct device *dev, void *data)
+{
+	enum dma_hint hint = *data;
+	struct device_driver *drv = READ_ONCE(dev->driver);
+
+	/* no conflict if the new device doesn't do DMA */
+	if (hint == DMA_FOR_NONE)
+		return 0;
+
+	/* no conflict if this device is driver-less, or doesn't do DMA */
+	if (!drv || (drv->dma_hint == DMA_FOR_NONE))
+		return 0;
+
+	/* kernel dma and user dma are exclusive */
+	if (hint != drv->dma_hint)
+		return -EINVAL;
+
+	/*
+	 * devices in the group could be bound to different user-dma
+	 * drivers (e.g. vfio-pci, vdpa, etc.), or even bound to the
+	 * same driver but eventually opened via different mechanisms
+	 * (e.g. vfio group vs. nongroup interfaces). We rely on 
+	 * iommu_{group/device}_init_user_dma() to ensure exclusive
+	 * user-dma ownership (iommufd ctx, vfio container ctx, etc.)
+	 * in such scenario.
+	 */
+	return 0;
+}
+
+static int __iommu_group_viable(struct iommu_group *group, enum dma_hint hint)
+{
+	return (__iommu_group_for_each_dev(group, &hint,
+					   iommu_dev_viable) == 0);
+}
+
+int iommu_device_set_dma_hint(struct device *dev, enum dma_hint hint)
+{
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get(dev);
+	/* not an iommu-probed device */
+	if (!group)
+		return 0;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_viable(group, hint);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+}
+
+/* any housekeeping? */
+void iommu_device_clear_dma_hint(struct device *dev) {}
+
+/* claim group ownership for user-dma */
+int __iommu_group_init_user_dma(struct iommu_group *group,
+				unsigned long owner)
+{
+	int ret;
+
+	ret = __iommu_group_viable(group, DMA_FOR_USER);
+	if (ret)
+		goto out;
+
+	/* other logic for exclusive user_dma ownership and refcounting */
+out:
+	return ret;
+}
+
+int iommu_group_init_user_dma(struct iommu_group *group,
+			      unsigned long owner)
+{
+	int ret;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_init_user_dma(group, owner);
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+
+int iommu_device_init_user_dma(struct device *dev,
+			      unsigned long owner)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+	int ret;
+
+	if (!group)
+		return -ENODEV;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_init_user_dma(group, owner);
+	mutex_unlock(&group->mutex);
+	iommu_grou_put(group);
+	return ret;
+}
+
 /**
  * iommu_group_register_notifier - Register a notifier for group changes
  * @group: the group to watch
diff --git a/drivers/pci/pci-stub.c b/drivers/pci/pci-stub.c
index e408099..4568811 100644
--- a/drivers/pci/pci-stub.c
+++ b/drivers/pci/pci-stub.c
@@ -36,6 +36,9 @@ static int pci_stub_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	.name		= "pci-stub",
 	.id_table	= NULL,	/* only dynamic id's */
 	.probe		= pci_stub_probe,
+	.driver = {
+		.dma_hint	= DMA_FOR_NONE,
+	},
 };
 
 static int __init pci_stub_init(void)
diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index dcd648e..a613b78 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -678,6 +678,9 @@ static void ifcvf_remove(struct pci_dev *pdev)
 	.id_table = ifcvf_pci_ids,
 	.probe    = ifcvf_probe,
 	.remove   = ifcvf_remove,
+	.driver = {
+		.dma_hint	= DMA_FOR_USER,
+	},
 };
 
 module_pci_driver(ifcvf_driver);
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a5ce92b..61c422d 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -193,6 +193,9 @@ static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 	.remove			= vfio_pci_remove,
 	.sriov_configure	= vfio_pci_sriov_configure,
 	.err_handler		= &vfio_pci_core_err_handlers,
+	.driver = {
+		.dma_hint	= DMA_FOR_USER,
+	},
 };
 
 static void __init vfio_pci_fill_ids(void)
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index a498ebc..6bddfd2 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -48,6 +48,17 @@ enum probe_type {
 };
 
 /**
+ * enum dma_hint - device driver dma hint
+ *	Device drivers may provide hints for whether dma is
+ *	intended for kernel driver, user driver, not not required.
+ */
+enum dma_hint {
+	DMA_FOR_KERNEL,
+	DMA_FOR_USER,
+	DMA_FOR_NONE,
+};
+
+/**
  * struct device_driver - The basic device driver structure
  * @name:	Name of the device driver.
  * @bus:	The bus which the device of this driver belongs to.
@@ -101,6 +112,7 @@ struct device_driver {
 
 	bool suppress_bind_attrs;	/* disables bind/unbind via sysfs */
 	enum probe_type probe_type;
+	enum dma_type dma_type;
 
 	const struct of_device_id	*of_match_table;
 	const struct acpi_device_id	*acpi_match_table;

Thanks
Kevin



^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27  9:42         ` Tian, Kevin
@ 2021-09-27 11:34           ` Lu Baolu
  2021-09-27 13:08             ` Tian, Kevin
  2021-09-27 11:53           ` Jason Gunthorpe
  2021-09-27 15:09           ` Jason Gunthorpe
  2 siblings, 1 reply; 280+ messages in thread
From: Lu Baolu @ 2021-09-27 11:34 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: baolu.lu, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

On 2021/9/27 17:42, Tian, Kevin wrote:
> +int iommu_device_set_dma_hint(struct device *dev, enum dma_hint hint)
> +{
> +	struct iommu_group *group;
> +	int ret;
> +
> +	group = iommu_group_get(dev);
> +	/* not an iommu-probed device */
> +	if (!group)
> +		return 0;
> +
> +	mutex_lock(&group->mutex);
> +	ret = __iommu_group_viable(group, hint);
> +	mutex_unlock(&group->mutex);
> +
> +	iommu_group_put(group);
> +	return ret;
> +}

Conceptually, we could also move iommu_deferred_attach() from
iommu_dma_ops here to save unnecessary checks in the hot DMA API
paths?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27  9:42         ` Tian, Kevin
  2021-09-27 11:34           ` Lu Baolu
@ 2021-09-27 11:53           ` Jason Gunthorpe
  2021-09-27 13:00             ` Tian, Kevin
  2021-09-27 15:09           ` Jason Gunthorpe
  2 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-27 11:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:40 PM
> > 
> > > > Ie the basic flow would see the driver core doing some:
> > >
> > > Just double confirm. Is there concern on having the driver core to
> > > call iommu functions?
> > 
> > It is always an interesting question, but I'd say iommu is
> > foundantional to Linux and if it needs driver core help it shouldn't
> > be any different from PM, pinctl, or other subsystems that have
> > inserted themselves into the driver core.
> > 
> > Something kind of like the below.
> > 
> > If I recall, once it is done like this then the entire iommu notifier
> > infrastructure can be ripped out which is a lot of code.
> 
> Currently vfio is the only user of this notifier mechanism. Now 
> three events are handled in vfio_iommu_group_notifier():
> 
> NOTIFY_ADD_DEVICE: this is basically for some sanity check. suppose
> not required once we handle it cleanly in the iommu/driver core.
> 
> NOTIFY_BOUND_DRIVER: the BUG_ON() logic to be fixed by this change.
> 
> NOTIFY_UNBOUND_DRIVER: still needs some thoughts. Based on
> the comments the group->unbound_list is used to avoid breaking

I have a patch series to delete the unbound_list, the scenario you
describe is handled by the device_lock()

> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f9..826a651 100644
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
>  		goto done;
>  	}
>  
> +	ret = iommu_device_set_dma_hint(dev, drv->dma_hint);
> +	if (ret)
> +		return ret;

I think for such a narrow usage you should not change the struct
device_driver. Just have pci_stub call a function to flip back to user
mode.

> +static int iommu_dev_viable(struct device *dev, void *data)
> +{
> +	enum dma_hint hint = *data;
> +	struct device_driver *drv = READ_ONCE(dev->driver);

Especially since this isn't locked properly or safe.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 11:53           ` Jason Gunthorpe
@ 2021-09-27 13:00             ` Tian, Kevin
  2021-09-27 13:09               ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-27 13:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, September 27, 2021 7:54 PM
> 
> On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 8:40 PM
> > >
> > > > > Ie the basic flow would see the driver core doing some:
> > > >
> > > > Just double confirm. Is there concern on having the driver core to
> > > > call iommu functions?
> > >
> > > It is always an interesting question, but I'd say iommu is
> > > foundantional to Linux and if it needs driver core help it shouldn't
> > > be any different from PM, pinctl, or other subsystems that have
> > > inserted themselves into the driver core.
> > >
> > > Something kind of like the below.
> > >
> > > If I recall, once it is done like this then the entire iommu notifier
> > > infrastructure can be ripped out which is a lot of code.
> >
> > Currently vfio is the only user of this notifier mechanism. Now
> > three events are handled in vfio_iommu_group_notifier():
> >
> > NOTIFY_ADD_DEVICE: this is basically for some sanity check. suppose
> > not required once we handle it cleanly in the iommu/driver core.
> >
> > NOTIFY_BOUND_DRIVER: the BUG_ON() logic to be fixed by this change.
> >
> > NOTIFY_UNBOUND_DRIVER: still needs some thoughts. Based on
> > the comments the group->unbound_list is used to avoid breaking
> 
> I have a patch series to delete the unbound_list, the scenario you
> describe is handled by the device_lock()

that's great!

> 
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index 68ea1f9..826a651 100644
> > +++ b/drivers/base/dd.c
> > @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
> >  		goto done;
> >  	}
> >
> > +	ret = iommu_device_set_dma_hint(dev, drv->dma_hint);
> > +	if (ret)
> > +		return ret;
> 
> I think for such a narrow usage you should not change the struct
> device_driver. Just have pci_stub call a function to flip back to user
> mode.

Here we want to ensure that kernel dma should be blocked
if the group is already marked for user-dma. If we just blindly
do it for any driver at this point (as you commented earlier):

+       ret = iommu_set_kernel_ownership(dev);
+       if (ret)
+               return ret;

how would pci-stub reach its function to indicate that it doesn't 
do dma and flip back?

Do you envision a simpler policy that no driver can be bound
to the group if it's already set for user-dma? what about vfio-pci
itself?

> 
> > +static int iommu_dev_viable(struct device *dev, void *data)
> > +{
> > +	enum dma_hint hint = *data;
> > +	struct device_driver *drv = READ_ONCE(dev->driver);
> 
> Especially since this isn't locked properly or safe.

I have the same worry when copying from vfio. Not sure how
vfio gets safe with this approach...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 11:34           ` Lu Baolu
@ 2021-09-27 13:08             ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-27 13:08 UTC (permalink / raw)
  To: Lu Baolu, Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Monday, September 27, 2021 7:34 PM
> 
> On 2021/9/27 17:42, Tian, Kevin wrote:
> > +int iommu_device_set_dma_hint(struct device *dev, enum dma_hint hint)
> > +{
> > +	struct iommu_group *group;
> > +	int ret;
> > +
> > +	group = iommu_group_get(dev);
> > +	/* not an iommu-probed device */
> > +	if (!group)
> > +		return 0;
> > +
> > +	mutex_lock(&group->mutex);
> > +	ret = __iommu_group_viable(group, hint);
> > +	mutex_unlock(&group->mutex);
> > +
> > +	iommu_group_put(group);
> > +	return ret;
> > +}
> 
> Conceptually, we could also move iommu_deferred_attach() from
> iommu_dma_ops here to save unnecessary checks in the hot DMA API
> paths?
> 

Yes, it's possible. But just be curious, why doesn't iommu core 
manage deferred_attach when receiving BOUND_DRIVER event?
Is there other implication that deferred attach cannot be done
at driver binding time?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 13:00             ` Tian, Kevin
@ 2021-09-27 13:09               ` Jason Gunthorpe
  2021-09-27 13:32                 ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-27 13:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:

> > I think for such a narrow usage you should not change the struct
> > device_driver. Just have pci_stub call a function to flip back to user
> > mode.
> 
> Here we want to ensure that kernel dma should be blocked
> if the group is already marked for user-dma. If we just blindly
> do it for any driver at this point (as you commented earlier):
> 
> +       ret = iommu_set_kernel_ownership(dev);
> +       if (ret)
> +               return ret;
> 
> how would pci-stub reach its function to indicate that it doesn't 
> do dma and flip back?

> Do you envision a simpler policy that no driver can be bound
> to the group if it's already set for user-dma? what about vfio-pci
> itself?

Yes.. I'm not sure there is a good use case to allow the stub drivers
to load/unload while a VFIO is running. At least, not a strong enough
one to justify a global change to the driver core..

> > > +static int iommu_dev_viable(struct device *dev, void *data)
> > > +{
> > > +	enum dma_hint hint = *data;
> > > +	struct device_driver *drv = READ_ONCE(dev->driver);
> > 
> > Especially since this isn't locked properly or safe.
> 
> I have the same worry when copying from vfio. Not sure how
> vfio gets safe with this approach...

Fixing the locking in vfio_dev_viable is part of deleting the unbound
list. Once it properly uses the device_lock and doesn't race with the
driver core like this things are much better. Don't copy this stuff
into the iommu core without fixing it.

https://github.com/jgunthorpe/linux/commit/fa6abb318ccca114da12c0b5b123c99131ace926
https://github.com/jgunthorpe/linux/commit/45980bd90b023d1eea56df70d1c395bdf4cc7cf1

I can't remember if the above is contingent on some of the mdev
cleanups or not.. Have to get back to it.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 13:09               ` Jason Gunthorpe
@ 2021-09-27 13:32                 ` Tian, Kevin
  2021-09-27 14:39                   ` Jason Gunthorpe
  2021-09-27 19:19                   ` Alex Williamson
  0 siblings, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-27 13:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

> From: Jason Gunthorpe
> Sent: Monday, September 27, 2021 9:10 PM
> 
> On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> 
> > > I think for such a narrow usage you should not change the struct
> > > device_driver. Just have pci_stub call a function to flip back to user
> > > mode.
> >
> > Here we want to ensure that kernel dma should be blocked
> > if the group is already marked for user-dma. If we just blindly
> > do it for any driver at this point (as you commented earlier):
> >
> > +       ret = iommu_set_kernel_ownership(dev);
> > +       if (ret)
> > +               return ret;
> >
> > how would pci-stub reach its function to indicate that it doesn't
> > do dma and flip back?
> 
> > Do you envision a simpler policy that no driver can be bound
> > to the group if it's already set for user-dma? what about vfio-pci
> > itself?
> 
> Yes.. I'm not sure there is a good use case to allow the stub drivers
> to load/unload while a VFIO is running. At least, not a strong enough
> one to justify a global change to the driver core..

I'm fine with not loading pci-stub. From the very 1st commit msg
looks pci-stub was introduced before vfio to prevent host driver 
loading when doing device assignment with KVM. I'm not sure 
whether other usages are built on pci-stub later, but in general it's 
not good to position devices in a same group into different usages.

but I'm little worried that even vfio-pci itself cannot be bound now,
which implies that all devices in a group which are intended to be
used by the user must be bound to vfio-pci in a breath before the 
user attempts to open any of them, i.e. late-binding and device-
hotplug is disallowed after the initial open. I'm not sure how 
important such an usage would be, but it does cause user-tangible
semantics change.

Alex?

> 
> > > > +static int iommu_dev_viable(struct device *dev, void *data)
> > > > +{
> > > > +	enum dma_hint hint = *data;
> > > > +	struct device_driver *drv = READ_ONCE(dev->driver);
> > >
> > > Especially since this isn't locked properly or safe.
> >
> > I have the same worry when copying from vfio. Not sure how
> > vfio gets safe with this approach...
> 
> Fixing the locking in vfio_dev_viable is part of deleting the unbound
> list. Once it properly uses the device_lock and doesn't race with the
> driver core like this things are much better. Don't copy this stuff
> into the iommu core without fixing it.

sure. Above was just a quickly-baked sample code to match your 
thought.

> 
> https://github.com/jgunthorpe/linux/commit/fa6abb318ccca114da12c0b5b1
> 23c99131ace926
> https://github.com/jgunthorpe/linux/commit/45980bd90b023d1eea56df70d
> 1c395bdf4cc7cf1
> 
> I can't remember if the above is contingent on some of the mdev
> cleanups or not.. Have to get back to it.
> 

my home network has some problem to access above links. Will check it
tomorrow and follow the fix when working on the formal change in 
iommu core.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 13:32                 ` Tian, Kevin
@ 2021-09-27 14:39                   ` Jason Gunthorpe
  2021-09-28  7:13                     ` Tian, Kevin
  2021-09-27 19:19                   ` Alex Williamson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-27 14:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:

> but I'm little worried that even vfio-pci itself cannot be bound now,
> which implies that all devices in a group which are intended to be
> used by the user must be bound to vfio-pci in a breath before the 
> user attempts to open any of them, i.e. late-binding and device-
> hotplug is disallowed after the initial open. I'm not sure how 
> important such an usage would be, but it does cause user-tangible
> semantics change.

Oh, that's bad..

I guess your approach is the only way forward, it will have to be
extensively justified in the commit message for Greg et al.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27  9:42         ` Tian, Kevin
  2021-09-27 11:34           ` Lu Baolu
  2021-09-27 11:53           ` Jason Gunthorpe
@ 2021-09-27 15:09           ` Jason Gunthorpe
  2021-09-28  7:30             ` Tian, Kevin
  2 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-27 15:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:

> +static int iommu_dev_viable(struct device *dev, void *data)
> +{
> +	enum dma_hint hint = *data;
> +	struct device_driver *drv = READ_ONCE(dev->driver);
> +
> +	/* no conflict if the new device doesn't do DMA */
> +	if (hint == DMA_FOR_NONE)
> +		return 0;
> +
> +	/* no conflict if this device is driver-less, or doesn't do DMA */
> +	if (!drv || (drv->dma_hint == DMA_FOR_NONE))
> +		return 0;

While it is kind of clever to fetch this in the drv like this, the
locking just doesn't work right.

The group itself needs to have an atomic that encodes what state it is
in. You can read the initial state from the drv, under the
device_lock, and update the atomic state

Also, don't call it "hint", there is nothing hinty about this, it has
definitive functional impacts.

Greg will want to see a definiate benefit from this extra global code,
so be sure to explain about why the BUG_ON is bad, and how driver core
involvement is needed to fix it properly.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 13:32                 ` Tian, Kevin
  2021-09-27 14:39                   ` Jason Gunthorpe
@ 2021-09-27 19:19                   ` Alex Williamson
  2021-09-28  7:43                     ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-27 19:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, kvm, jasowang, kwankhede, hch, jean-philippe,
	Jiang, Dave, Raj, Ashok, corbet, parav, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

On Mon, 27 Sep 2021 13:32:34 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe
> > Sent: Monday, September 27, 2021 9:10 PM
> > 
> > On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> >   
> > > > I think for such a narrow usage you should not change the struct
> > > > device_driver. Just have pci_stub call a function to flip back to user
> > > > mode.  
> > >
> > > Here we want to ensure that kernel dma should be blocked
> > > if the group is already marked for user-dma. If we just blindly
> > > do it for any driver at this point (as you commented earlier):
> > >
> > > +       ret = iommu_set_kernel_ownership(dev);
> > > +       if (ret)
> > > +               return ret;
> > >
> > > how would pci-stub reach its function to indicate that it doesn't
> > > do dma and flip back?  
> >   
> > > Do you envision a simpler policy that no driver can be bound
> > > to the group if it's already set for user-dma? what about vfio-pci
> > > itself?  
> > 
> > Yes.. I'm not sure there is a good use case to allow the stub drivers
> > to load/unload while a VFIO is running. At least, not a strong enough
> > one to justify a global change to the driver core..  
> 
> I'm fine with not loading pci-stub. From the very 1st commit msg
> looks pci-stub was introduced before vfio to prevent host driver 
> loading when doing device assignment with KVM. I'm not sure 
> whether other usages are built on pci-stub later, but in general it's 
> not good to position devices in a same group into different usages.

IIRC, pci-stub was invented for legacy KVM device assignment because
KVM was never an actual device driver, it just latched onto and started
using the device.  If there was an existing driver for the device then
KVM would fail to get device resources.  Therefore the device needed to
be unbound from its standard host driver, but that left it susceptible
to driver loads usurping the device.  Therefore pci-stub came along to
essentially claim the device on behalf of KVM.

With vfio, there are a couple use cases of pci-stub that can be
interesting.  The first is that pci-stub is generally built into the
kernel, not as a module, which provides users the ability to specify a
list of ids for pci-stub to claim on the kernel command line with
higher priority than loadable modules.  This can prevent default driver
bindings to devices until tools like driverctl or boot time scripting
gets a shot to load the user designated driver for a device.

The other use case, is that if a group is composed of multiple devices
and all those devices are bound to vfio drivers, then the user can gain
direct access to each of those devices.  If we wanted to insert a
barrier to restrict user access to certain devices within a group, we'd
suggest binding those devices to pci-stub.  Obviously within a group, it
may still be possible to manipulate the device via p2p DMA, but the
barrier is much higher and device, if not platform, specific to
manipulate such devices.  An example use case might be a chipset
Ethernet controller grouped among system management function in a
multi-function root complex integrated endpoint.

> but I'm little worried that even vfio-pci itself cannot be bound now,
> which implies that all devices in a group which are intended to be
> used by the user must be bound to vfio-pci in a breath before the 
> user attempts to open any of them, i.e. late-binding and device-
> hotplug is disallowed after the initial open. I'm not sure how 
> important such an usage would be, but it does cause user-tangible
> semantics change.

Yep, a high potential to break userspace, especially as pci-stub has
been recommended for the cases noted above.  I don't expect that tools
like libvirt manage unassigned devices within a group, but that
probably means that there are all sorts of ad-hoc user mechanisms
beyond simply assigning all the devices.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 14:39                   ` Jason Gunthorpe
@ 2021-09-28  7:13                     ` Tian, Kevin
  2021-09-28 11:54                       ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-28  7:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, September 27, 2021 10:40 PM
> 
> On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:
> 
> > but I'm little worried that even vfio-pci itself cannot be bound now,
> > which implies that all devices in a group which are intended to be
> > used by the user must be bound to vfio-pci in a breath before the
> > user attempts to open any of them, i.e. late-binding and device-
> > hotplug is disallowed after the initial open. I'm not sure how
> > important such an usage would be, but it does cause user-tangible
> > semantics change.
> 
> Oh, that's bad..
> 
> I guess your approach is the only way forward, it will have to be
> extensively justified in the commit message for Greg et al.
> 

Just thought about another alternative. What about having driver
core to call iommu after call_driver_probe()?

call_driver_probe()
	pci_stub_probe()
		iommu_set_dma_mode(dev, DMA_NONE);
iommu_check_dma_mode(dev);

The default dma mode is DMA_KERNEL. Above allows driver to opt
for DMA_NONE or DMA_USER w/o changing the device_driver
structure. Right after probe() is completed, we check whether dma 
mode of this device is allowed by the iommu core (based on recorded 
dma mode info of sibling devices in the same group). If not, then fail 
the binding.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 15:09           ` Jason Gunthorpe
@ 2021-09-28  7:30             ` Tian, Kevin
  2021-09-28 11:57               ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-28  7:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, September 27, 2021 11:09 PM
> 
> On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:
> 
> > +static int iommu_dev_viable(struct device *dev, void *data)
> > +{
> > +	enum dma_hint hint = *data;
> > +	struct device_driver *drv = READ_ONCE(dev->driver);
> > +
> > +	/* no conflict if the new device doesn't do DMA */
> > +	if (hint == DMA_FOR_NONE)
> > +		return 0;
> > +
> > +	/* no conflict if this device is driver-less, or doesn't do DMA */
> > +	if (!drv || (drv->dma_hint == DMA_FOR_NONE))
> > +		return 0;
> 
> While it is kind of clever to fetch this in the drv like this, the
> locking just doesn't work right.
> 
> The group itself needs to have an atomic that encodes what state it is
> in. You can read the initial state from the drv, under the
> device_lock, and update the atomic state

will do. 

> 
> Also, don't call it "hint", there is nothing hinty about this, it has
> definitive functional impacts.

possibly dma_mode (too broad?) or dma_usage

> 
> Greg will want to see a definiate benefit from this extra global code,
> so be sure to explain about why the BUG_ON is bad, and how driver core
> involvement is needed to fix it properly.
> 

Sure. and I plan to at least have the patches aligned in this loop first,
before rolling up to Greg. Better we all confirm it's the right approach 
with all corner cases covered and then involve Greg to help judge
a clean driver core change. 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-27 19:19                   ` Alex Williamson
@ 2021-09-28  7:43                     ` Tian, Kevin
  2021-09-28 16:26                       ` Alex Williamson
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-28  7:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, kvm, jasowang, kwankhede, hch, jean-philippe,
	Jiang, Dave, Raj, Ashok, corbet, parav, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, September 28, 2021 3:20 AM
> 
> On Mon, 27 Sep 2021 13:32:34 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Gunthorpe
> > > Sent: Monday, September 27, 2021 9:10 PM
> > >
> > > On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> > >
> > > > > I think for such a narrow usage you should not change the struct
> > > > > device_driver. Just have pci_stub call a function to flip back to user
> > > > > mode.
> > > >
> > > > Here we want to ensure that kernel dma should be blocked
> > > > if the group is already marked for user-dma. If we just blindly
> > > > do it for any driver at this point (as you commented earlier):
> > > >
> > > > +       ret = iommu_set_kernel_ownership(dev);
> > > > +       if (ret)
> > > > +               return ret;
> > > >
> > > > how would pci-stub reach its function to indicate that it doesn't
> > > > do dma and flip back?
> > >
> > > > Do you envision a simpler policy that no driver can be bound
> > > > to the group if it's already set for user-dma? what about vfio-pci
> > > > itself?
> > >
> > > Yes.. I'm not sure there is a good use case to allow the stub drivers
> > > to load/unload while a VFIO is running. At least, not a strong enough
> > > one to justify a global change to the driver core..
> >
> > I'm fine with not loading pci-stub. From the very 1st commit msg
> > looks pci-stub was introduced before vfio to prevent host driver
> > loading when doing device assignment with KVM. I'm not sure
> > whether other usages are built on pci-stub later, but in general it's
> > not good to position devices in a same group into different usages.
> 
> IIRC, pci-stub was invented for legacy KVM device assignment because
> KVM was never an actual device driver, it just latched onto and started
> using the device.  If there was an existing driver for the device then
> KVM would fail to get device resources.  Therefore the device needed to
> be unbound from its standard host driver, but that left it susceptible
> to driver loads usurping the device.  Therefore pci-stub came along to
> essentially claim the device on behalf of KVM.
> 
> With vfio, there are a couple use cases of pci-stub that can be
> interesting.  The first is that pci-stub is generally built into the
> kernel, not as a module, which provides users the ability to specify a
> list of ids for pci-stub to claim on the kernel command line with
> higher priority than loadable modules.  This can prevent default driver
> bindings to devices until tools like driverctl or boot time scripting
> gets a shot to load the user designated driver for a device.
> 
> The other use case, is that if a group is composed of multiple devices
> and all those devices are bound to vfio drivers, then the user can gain
> direct access to each of those devices.  If we wanted to insert a
> barrier to restrict user access to certain devices within a group, we'd
> suggest binding those devices to pci-stub.  Obviously within a group, it
> may still be possible to manipulate the device via p2p DMA, but the
> barrier is much higher and device, if not platform, specific to
> manipulate such devices.  An example use case might be a chipset
> Ethernet controller grouped among system management function in a
> multi-function root complex integrated endpoint.

Thanks for the background. It perfectly reflects how many tricky things
that vfio has evolved to deal with and we'll dig them out again in this
refactoring process with your help. 😊

just a nit on the last example. If a system management function is 
in such group, isn't the right policy is to disallow assigning any device
in this group? Even the barrier is high, any chance of allowing the guest
to control a system management function is dangerous...

> 
> > but I'm little worried that even vfio-pci itself cannot be bound now,
> > which implies that all devices in a group which are intended to be
> > used by the user must be bound to vfio-pci in a breath before the
> > user attempts to open any of them, i.e. late-binding and device-
> > hotplug is disallowed after the initial open. I'm not sure how
> > important such an usage would be, but it does cause user-tangible
> > semantics change.
> 
> Yep, a high potential to break userspace, especially as pci-stub has
> been recommended for the cases noted above.  I don't expect that tools
> like libvirt manage unassigned devices within a group, but that
> probably means that there are all sorts of ad-hoc user mechanisms
> beyond simply assigning all the devices.  Thanks,
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28  7:13                     ` Tian, Kevin
@ 2021-09-28 11:54                       ` Jason Gunthorpe
  2021-09-28 23:59                         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 11:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

On Tue, Sep 28, 2021 at 07:13:01AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, September 27, 2021 10:40 PM
> > 
> > On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:
> > 
> > > but I'm little worried that even vfio-pci itself cannot be bound now,
> > > which implies that all devices in a group which are intended to be
> > > used by the user must be bound to vfio-pci in a breath before the
> > > user attempts to open any of them, i.e. late-binding and device-
> > > hotplug is disallowed after the initial open. I'm not sure how
> > > important such an usage would be, but it does cause user-tangible
> > > semantics change.
> > 
> > Oh, that's bad..
> > 
> > I guess your approach is the only way forward, it will have to be
> > extensively justified in the commit message for Greg et al.
> > 
> 
> Just thought about another alternative. What about having driver
> core to call iommu after call_driver_probe()?

Then the kernel is now already exposed to an insecure scenario, we
must not do probe if any user device is attached at all.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28  7:30             ` Tian, Kevin
@ 2021-09-28 11:57               ` Jason Gunthorpe
  2021-09-28 13:35                 ` Lu Baolu
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 11:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Tue, Sep 28, 2021 at 07:30:41AM +0000, Tian, Kevin wrote:

> > Also, don't call it "hint", there is nothing hinty about this, it has
> > definitive functional impacts.
> 
> possibly dma_mode (too broad?) or dma_usage

You just need a flag to specify if the driver manages DMA ownership
itself, or if it requires the driver core to setup kernel ownership

DMA_OWNER_KERNEL
DMA_OWNER_DRIVER_CONTROLLED

?

There is a bool 'suprress_bind_attrs' already so it could be done like
this:

 bool suppress_bind_attrs:1;

 /* If set the driver must call iommu_XX as the first action in probe() */
 bool suppress_dma_owner:1;

Which is pretty low cost.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28 11:57               ` Jason Gunthorpe
@ 2021-09-28 13:35                 ` Lu Baolu
  2021-09-28 14:07                   ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Lu Baolu @ 2021-09-28 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: baolu.lu, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

Hi Jason,

On 2021/9/28 19:57, Jason Gunthorpe wrote:
> On Tue, Sep 28, 2021 at 07:30:41AM +0000, Tian, Kevin wrote:
> 
>>> Also, don't call it "hint", there is nothing hinty about this, it has
>>> definitive functional impacts.
>>
>> possibly dma_mode (too broad?) or dma_usage
> 
> You just need a flag to specify if the driver manages DMA ownership
> itself, or if it requires the driver core to setup kernel ownership
> 
> DMA_OWNER_KERNEL
> DMA_OWNER_DRIVER_CONTROLLED
> 
> ?
> 
> There is a bool 'suprress_bind_attrs' already so it could be done like
> this:
> 
>   bool suppress_bind_attrs:1;
> 
>   /* If set the driver must call iommu_XX as the first action in probe() */
>   bool suppress_dma_owner:1;
> 
> Which is pretty low cost.

Yes. Pretty low cost to fix the BUG_ON() issue. Any kernel-DMA driver
binding is blocked if the device's iommu group has been put into user-
dma mode.

Another issue is, when putting a device into user-dma mode, all devices
belonging to the same iommu group shouldn't be bound with a kernel-dma
driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
not lock safe as discussed below,

https://lore.kernel.org/linux-iommu/20210927130935.GZ964074@nvidia.com/

Any guidance on this?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28 13:35                 ` Lu Baolu
@ 2021-09-28 14:07                   ` Jason Gunthorpe
  2021-09-29  0:38                     ` Tian, Kevin
  2021-09-29  2:22                     ` Lu Baolu
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 14:07 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
> Another issue is, when putting a device into user-dma mode, all devices
> belonging to the same iommu group shouldn't be bound with a kernel-dma
> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
> not lock safe as discussed below,
> 
> https://lore.kernel.org/linux-iommu/20210927130935.GZ964074@nvidia.com/
> 
> Any guidance on this?

Something like this?


int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
			       struct file *user_owner)
{
	struct iommu_group *group = group_from_dev(dev);

	spin_lock(&iommu_group->dma_owner_lock);
	switch (mode) {
		case DMA_OWNER_KERNEL:
			if (iommu_group->dma_users[DMA_OWNER_USERSPACE])
				return -EBUSY;
			break;
		case DMA_OWNER_SHARED:
			break;
		case DMA_OWNER_USERSPACE:
			if (iommu_group->dma_users[DMA_OWNER_KERNEL])
				return -EBUSY;
			if (iommu_group->dma_owner_file != user_owner) {
				if (iommu_group->dma_users[DMA_OWNER_USERSPACE])
					return -EPERM;
				get_file(user_owner);
				iommu_group->dma_owner_file = user_owner;
			}
			break;
		default:
			spin_unlock(&iommu_group->dma_owner_lock);
			return -EINVAL;
	}
	iommu_group->dma_users[mode]++;
	spin_unlock(&iommu_group->dma_owner_lock);
	return 0;
}

int iommu_release_device_dma_owner(struct device *dev,
				   enum device_dma_owner mode)
{
	struct iommu_group *group = group_from_dev(dev);

	spin_lock(&iommu_group->dma_owner_lock);
	if (WARN_ON(!iommu_group->dma_users[mode]))
		goto err_unlock;
	if (!iommu_group->dma_users[mode]--) {
		if (mode == DMA_OWNER_USERSPACE) {
			fput(iommu_group->dma_owner_file);
			iommu_group->dma_owner_file = NULL;
		}
	}
err_unlock:
	spin_unlock(&iommu_group->dma_owner_lock);
}


Where, the driver core does before probe:

   iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)

pci_stub/etc does in their probe func:

   iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)

And vfio/iommfd does when a struct vfio_device FD is attached:

   iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28  7:43                     ` Tian, Kevin
@ 2021-09-28 16:26                       ` Alex Williamson
  0 siblings, 0 replies; 280+ messages in thread
From: Alex Williamson @ 2021-09-28 16:26 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, kvm, jasowang, kwankhede, hch, jean-philippe,
	Jiang, Dave, Raj, Ashok, corbet, parav, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

On Tue, 28 Sep 2021 07:43:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, September 28, 2021 3:20 AM
> > 
> > On Mon, 27 Sep 2021 13:32:34 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Jason Gunthorpe
> > > > Sent: Monday, September 27, 2021 9:10 PM
> > > >
> > > > On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> > > >  
> > > > > > I think for such a narrow usage you should not change the struct
> > > > > > device_driver. Just have pci_stub call a function to flip back to user
> > > > > > mode.  
> > > > >
> > > > > Here we want to ensure that kernel dma should be blocked
> > > > > if the group is already marked for user-dma. If we just blindly
> > > > > do it for any driver at this point (as you commented earlier):
> > > > >
> > > > > +       ret = iommu_set_kernel_ownership(dev);
> > > > > +       if (ret)
> > > > > +               return ret;
> > > > >
> > > > > how would pci-stub reach its function to indicate that it doesn't
> > > > > do dma and flip back?  
> > > >  
> > > > > Do you envision a simpler policy that no driver can be bound
> > > > > to the group if it's already set for user-dma? what about vfio-pci
> > > > > itself?  
> > > >
> > > > Yes.. I'm not sure there is a good use case to allow the stub drivers
> > > > to load/unload while a VFIO is running. At least, not a strong enough
> > > > one to justify a global change to the driver core..  
> > >
> > > I'm fine with not loading pci-stub. From the very 1st commit msg
> > > looks pci-stub was introduced before vfio to prevent host driver
> > > loading when doing device assignment with KVM. I'm not sure
> > > whether other usages are built on pci-stub later, but in general it's
> > > not good to position devices in a same group into different usages.  
> > 
> > IIRC, pci-stub was invented for legacy KVM device assignment because
> > KVM was never an actual device driver, it just latched onto and started
> > using the device.  If there was an existing driver for the device then
> > KVM would fail to get device resources.  Therefore the device needed to
> > be unbound from its standard host driver, but that left it susceptible
> > to driver loads usurping the device.  Therefore pci-stub came along to
> > essentially claim the device on behalf of KVM.
> > 
> > With vfio, there are a couple use cases of pci-stub that can be
> > interesting.  The first is that pci-stub is generally built into the
> > kernel, not as a module, which provides users the ability to specify a
> > list of ids for pci-stub to claim on the kernel command line with
> > higher priority than loadable modules.  This can prevent default driver
> > bindings to devices until tools like driverctl or boot time scripting
> > gets a shot to load the user designated driver for a device.
> > 
> > The other use case, is that if a group is composed of multiple devices
> > and all those devices are bound to vfio drivers, then the user can gain
> > direct access to each of those devices.  If we wanted to insert a
> > barrier to restrict user access to certain devices within a group, we'd
> > suggest binding those devices to pci-stub.  Obviously within a group, it
> > may still be possible to manipulate the device via p2p DMA, but the
> > barrier is much higher and device, if not platform, specific to
> > manipulate such devices.  An example use case might be a chipset
> > Ethernet controller grouped among system management function in a
> > multi-function root complex integrated endpoint.  
> 
> Thanks for the background. It perfectly reflects how many tricky things
> that vfio has evolved to deal with and we'll dig them out again in this
> refactoring process with your help. 😊
> 
> just a nit on the last example. If a system management function is 
> in such group, isn't the right policy is to disallow assigning any device
> in this group? Even the barrier is high, any chance of allowing the guest
> to control a system management function is dangerous...

We can advise that it's a risk, but we generally refrain from making
such policy decisions.  Ideally the chipset vendor avoids
configurations that require their users to choose between functionality
and security ;)  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28 11:54                       ` Jason Gunthorpe
@ 2021-09-28 23:59                         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-28 23:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 28, 2021 7:55 PM
> 
> On Tue, Sep 28, 2021 at 07:13:01AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Monday, September 27, 2021 10:40 PM
> > >
> > > On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:
> > >
> > > > but I'm little worried that even vfio-pci itself cannot be bound now,
> > > > which implies that all devices in a group which are intended to be
> > > > used by the user must be bound to vfio-pci in a breath before the
> > > > user attempts to open any of them, i.e. late-binding and device-
> > > > hotplug is disallowed after the initial open. I'm not sure how
> > > > important such an usage would be, but it does cause user-tangible
> > > > semantics change.
> > >
> > > Oh, that's bad..
> > >
> > > I guess your approach is the only way forward, it will have to be
> > > extensively justified in the commit message for Greg et al.
> > >
> >
> > Just thought about another alternative. What about having driver
> > core to call iommu after call_driver_probe()?
> 
> Then the kernel is now already exposed to an insecure scenario, we
> must not do probe if any user device is attached at all.
> 

Originally I thought it's fine as long as the entire probe process
is not completed. Based on your comment I feel your concern is
that no guarantee that the driver won't do any iommu related 
work in its probe function thus it's insecure?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28 14:07                   ` Jason Gunthorpe
@ 2021-09-29  0:38                     ` Tian, Kevin
  2021-09-29 12:59                       ` Jason Gunthorpe
  2021-09-29  2:22                     ` Lu Baolu
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  0:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Lu Baolu
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 28, 2021 10:07 PM
> 
> On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
> > Another issue is, when putting a device into user-dma mode, all devices
> > belonging to the same iommu group shouldn't be bound with a kernel-dma
> > driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
> > not lock safe as discussed below,
> >
> > https://lore.kernel.org/linux-
> iommu/20210927130935.GZ964074@nvidia.com/
> >
> > Any guidance on this?
> 
> Something like this?
> 
> 

yes, with this group level atomics we don't need loop every dev->driver
respectively.

> int iommu_set_device_dma_owner(struct device *dev, enum
> device_dma_owner mode,
> 			       struct file *user_owner)
> {
> 	struct iommu_group *group = group_from_dev(dev);
> 
> 	spin_lock(&iommu_group->dma_owner_lock);
> 	switch (mode) {
> 		case DMA_OWNER_KERNEL:
> 			if (iommu_group-
> >dma_users[DMA_OWNER_USERSPACE])
> 				return -EBUSY;
> 			break;
> 		case DMA_OWNER_SHARED:
> 			break;
> 		case DMA_OWNER_USERSPACE:
> 			if (iommu_group-
> >dma_users[DMA_OWNER_KERNEL])
> 				return -EBUSY;
> 			if (iommu_group->dma_owner_file != user_owner) {
> 				if (iommu_group-
> >dma_users[DMA_OWNER_USERSPACE])
> 					return -EPERM;
> 				get_file(user_owner);
> 				iommu_group->dma_owner_file =
> user_owner;
> 			}
> 			break;
> 		default:
> 			spin_unlock(&iommu_group->dma_owner_lock);
> 			return -EINVAL;
> 	}
> 	iommu_group->dma_users[mode]++;
> 	spin_unlock(&iommu_group->dma_owner_lock);
> 	return 0;
> }
> 
> int iommu_release_device_dma_owner(struct device *dev,
> 				   enum device_dma_owner mode)
> {
> 	struct iommu_group *group = group_from_dev(dev);
> 
> 	spin_lock(&iommu_group->dma_owner_lock);
> 	if (WARN_ON(!iommu_group->dma_users[mode]))
> 		goto err_unlock;
> 	if (!iommu_group->dma_users[mode]--) {
> 		if (mode == DMA_OWNER_USERSPACE) {
> 			fput(iommu_group->dma_owner_file);
> 			iommu_group->dma_owner_file = NULL;
> 		}
> 	}
> err_unlock:
> 	spin_unlock(&iommu_group->dma_owner_lock);
> }
> 
> 
> Where, the driver core does before probe:
> 
>    iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
> 
> pci_stub/etc does in their probe func:
> 
>    iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
> 
> And vfio/iommfd does when a struct vfio_device FD is attached:
> 
>    iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE,
> group_file/iommu_file)
> 

Just a nit. Per your comment in previous mail:

/* If set the driver must call iommu_XX as the first action in probe() */
 bool suppress_dma_owner:1;

Following above logic userspace drivers won't call iommu_XX in probe().
Just want to double confirm whether you see any issue here with this
relaxed behavior. If no problem:

/* If set the driver must call iommu_XX as the first action in probe() or
  * before it attempts to do DMA
  */
 bool suppress_dma_owner:1;

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38 ` [RFC 02/20] vfio: Add device class for /dev/vfio/devices Liu Yi L
  2021-09-21 15:57   ` Jason Gunthorpe
  2021-09-21 19:56   ` Alex Williamson
@ 2021-09-29  2:08   ` David Gibson
  2021-09-29 19:05     ` Alex Williamson
  2021-10-20 12:39     ` Liu, Yi L
  2 siblings, 2 replies; 280+ messages in thread
From: David Gibson @ 2021-09-29  2:08 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2774 bytes --]

On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> userspace to directly open a vfio device w/o relying on container/group
> (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> iommufd (more specifically in iommu core by this RFC) in a device-centric
> manner.
> 
> In case a device is exposed in both legacy and new interfaces (see next
> patch for how to decide it), this patch also ensures that when the device
> is already opened via one interface then the other one must be blocked.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
[snip]

> +static bool vfio_device_in_container(struct vfio_device *device)
> +{
> +	return !!(device->group && device->group->container);

You don't need !! here.  && is already a logical operation, so returns
a valid bool.

> +}
> +
>  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  {
>  	struct vfio_device *device = filep->private_data;
> @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  
>  	module_put(device->dev->driver->owner);
>  
> -	vfio_group_try_dissolve_container(device->group);
> +	if (vfio_device_in_container(device)) {
> +		vfio_group_try_dissolve_container(device->group);
> +	} else {
> +		atomic_dec(&device->opened);
> +		if (device->group) {
> +			mutex_lock(&device->group->opened_lock);
> +			device->group->opened--;
> +			mutex_unlock(&device->group->opened_lock);
> +		}
> +	}
>  
>  	vfio_device_put(device);
>  
> @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
>  
>  static const struct file_operations vfio_device_fops = {
>  	.owner		= THIS_MODULE,
> +	.open		= vfio_device_fops_open,
>  	.release	= vfio_device_fops_release,
>  	.read		= vfio_device_fops_read,
>  	.write		= vfio_device_fops_write,
> @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
>  	.mode = S_IRUGO | S_IWUGO,
>  };
>  
> +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));

Others have pointed out some problems with the use of dev_name()
here.  I'll add that I think you'll make things much easier if instead
of using one huge "devices" subdir, you use a separate subdir for each
vfio sub-driver (so, one for PCI, one for each type of mdev, one for
platform, etc.).  That should make avoiding name conflicts a lot simpler.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-28 14:07                   ` Jason Gunthorpe
  2021-09-29  0:38                     ` Tian, Kevin
@ 2021-09-29  2:22                     ` Lu Baolu
  2021-09-29  2:29                       ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Lu Baolu @ 2021-09-29  2:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang,
	joro, jean-philippe, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao,
	Jiang, Dave, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, david, nicolinc

On 9/28/21 10:07 PM, Jason Gunthorpe wrote:
> On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
>> Another issue is, when putting a device into user-dma mode, all devices
>> belonging to the same iommu group shouldn't be bound with a kernel-dma
>> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
>> not lock safe as discussed below,
>>
>> https://lore.kernel.org/linux-iommu/20210927130935.GZ964074@nvidia.com/
>>
>> Any guidance on this?
> 
> Something like this?
> 
> 
> int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
> 			       struct file *user_owner)
> {
> 	struct iommu_group *group = group_from_dev(dev);
> 
> 	spin_lock(&iommu_group->dma_owner_lock);
> 	switch (mode) {
> 		case DMA_OWNER_KERNEL:
> 			if (iommu_group->dma_users[DMA_OWNER_USERSPACE])
> 				return -EBUSY;
> 			break;
> 		case DMA_OWNER_SHARED:
> 			break;
> 		case DMA_OWNER_USERSPACE:
> 			if (iommu_group->dma_users[DMA_OWNER_KERNEL])
> 				return -EBUSY;
> 			if (iommu_group->dma_owner_file != user_owner) {
> 				if (iommu_group->dma_users[DMA_OWNER_USERSPACE])
> 					return -EPERM;
> 				get_file(user_owner);
> 				iommu_group->dma_owner_file = user_owner;
> 			}
> 			break;
> 		default:
> 			spin_unlock(&iommu_group->dma_owner_lock);
> 			return -EINVAL;
> 	}
> 	iommu_group->dma_users[mode]++;
> 	spin_unlock(&iommu_group->dma_owner_lock);
> 	return 0;
> }
> 
> int iommu_release_device_dma_owner(struct device *dev,
> 				   enum device_dma_owner mode)
> {
> 	struct iommu_group *group = group_from_dev(dev);
> 
> 	spin_lock(&iommu_group->dma_owner_lock);
> 	if (WARN_ON(!iommu_group->dma_users[mode]))
> 		goto err_unlock;
> 	if (!iommu_group->dma_users[mode]--) {
> 		if (mode == DMA_OWNER_USERSPACE) {
> 			fput(iommu_group->dma_owner_file);
> 			iommu_group->dma_owner_file = NULL;
> 		}
> 	}
> err_unlock:
> 	spin_unlock(&iommu_group->dma_owner_lock);
> }
> 
> 
> Where, the driver core does before probe:
> 
>     iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
> 
> pci_stub/etc does in their probe func:
> 
>     iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
> 
> And vfio/iommfd does when a struct vfio_device FD is attached:
> 
>     iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

Really good design. It also helps alleviating some pains elsewhere in
the iommu core.

Just a nit comment, we also need DMA_OWNER_NONE which will be set when
the driver core unbinds the driver from the device.

> 
> Jason
> 

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  2:22                     ` Lu Baolu
@ 2021-09-29  2:29                       ` Tian, Kevin
  2021-09-29  2:38                         ` Lu Baolu
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  2:29 UTC (permalink / raw)
  To: Lu Baolu, Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Wednesday, September 29, 2021 10:22 AM
> 
> On 9/28/21 10:07 PM, Jason Gunthorpe wrote:
> > On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
> >> Another issue is, when putting a device into user-dma mode, all devices
> >> belonging to the same iommu group shouldn't be bound with a kernel-
> dma
> >> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
> >> not lock safe as discussed below,
> >>
> >> https://lore.kernel.org/linux-
> iommu/20210927130935.GZ964074@nvidia.com/
> >>
> >> Any guidance on this?
> >
> > Something like this?
> >
> >
> > int iommu_set_device_dma_owner(struct device *dev, enum
> device_dma_owner mode,
> > 			       struct file *user_owner)
> > {
> > 	struct iommu_group *group = group_from_dev(dev);
> >
> > 	spin_lock(&iommu_group->dma_owner_lock);
> > 	switch (mode) {
> > 		case DMA_OWNER_KERNEL:
> > 			if (iommu_group-
> >dma_users[DMA_OWNER_USERSPACE])
> > 				return -EBUSY;
> > 			break;
> > 		case DMA_OWNER_SHARED:
> > 			break;
> > 		case DMA_OWNER_USERSPACE:
> > 			if (iommu_group-
> >dma_users[DMA_OWNER_KERNEL])
> > 				return -EBUSY;
> > 			if (iommu_group->dma_owner_file != user_owner) {
> > 				if (iommu_group-
> >dma_users[DMA_OWNER_USERSPACE])
> > 					return -EPERM;
> > 				get_file(user_owner);
> > 				iommu_group->dma_owner_file =
> user_owner;
> > 			}
> > 			break;
> > 		default:
> > 			spin_unlock(&iommu_group->dma_owner_lock);
> > 			return -EINVAL;
> > 	}
> > 	iommu_group->dma_users[mode]++;
> > 	spin_unlock(&iommu_group->dma_owner_lock);
> > 	return 0;
> > }
> >
> > int iommu_release_device_dma_owner(struct device *dev,
> > 				   enum device_dma_owner mode)
> > {
> > 	struct iommu_group *group = group_from_dev(dev);
> >
> > 	spin_lock(&iommu_group->dma_owner_lock);
> > 	if (WARN_ON(!iommu_group->dma_users[mode]))
> > 		goto err_unlock;
> > 	if (!iommu_group->dma_users[mode]--) {
> > 		if (mode == DMA_OWNER_USERSPACE) {
> > 			fput(iommu_group->dma_owner_file);
> > 			iommu_group->dma_owner_file = NULL;
> > 		}
> > 	}
> > err_unlock:
> > 	spin_unlock(&iommu_group->dma_owner_lock);
> > }
> >
> >
> > Where, the driver core does before probe:
> >
> >     iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
> >
> > pci_stub/etc does in their probe func:
> >
> >     iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
> >
> > And vfio/iommfd does when a struct vfio_device FD is attached:
> >
> >     iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE,
> group_file/iommu_file)
> 
> Really good design. It also helps alleviating some pains elsewhere in
> the iommu core.
> 
> Just a nit comment, we also need DMA_OWNER_NONE which will be set
> when
> the driver core unbinds the driver from the device.
> 

Not necessarily. NONE is represented by none of dma_user[mode]
is valid.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  2:29                       ` Tian, Kevin
@ 2021-09-29  2:38                         ` Lu Baolu
  0 siblings, 0 replies; 280+ messages in thread
From: Lu Baolu @ 2021-09-29  2:38 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: baolu.lu, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

On 9/29/21 10:29 AM, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Wednesday, September 29, 2021 10:22 AM
>>
>> On 9/28/21 10:07 PM, Jason Gunthorpe wrote:
>>> On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
>>>> Another issue is, when putting a device into user-dma mode, all devices
>>>> belonging to the same iommu group shouldn't be bound with a kernel-
>> dma
>>>> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
>>>> not lock safe as discussed below,
>>>>
>>>> https://lore.kernel.org/linux-
>> iommu/20210927130935.GZ964074@nvidia.com/
>>>>
>>>> Any guidance on this?
>>>
>>> Something like this?
>>>
>>>
>>> int iommu_set_device_dma_owner(struct device *dev, enum
>> device_dma_owner mode,
>>> 			       struct file *user_owner)
>>> {
>>> 	struct iommu_group *group = group_from_dev(dev);
>>>
>>> 	spin_lock(&iommu_group->dma_owner_lock);
>>> 	switch (mode) {
>>> 		case DMA_OWNER_KERNEL:
>>> 			if (iommu_group-
>>> dma_users[DMA_OWNER_USERSPACE])
>>> 				return -EBUSY;
>>> 			break;
>>> 		case DMA_OWNER_SHARED:
>>> 			break;
>>> 		case DMA_OWNER_USERSPACE:
>>> 			if (iommu_group-
>>> dma_users[DMA_OWNER_KERNEL])
>>> 				return -EBUSY;
>>> 			if (iommu_group->dma_owner_file != user_owner) {
>>> 				if (iommu_group-
>>> dma_users[DMA_OWNER_USERSPACE])
>>> 					return -EPERM;
>>> 				get_file(user_owner);
>>> 				iommu_group->dma_owner_file =
>> user_owner;
>>> 			}
>>> 			break;
>>> 		default:
>>> 			spin_unlock(&iommu_group->dma_owner_lock);
>>> 			return -EINVAL;
>>> 	}
>>> 	iommu_group->dma_users[mode]++;
>>> 	spin_unlock(&iommu_group->dma_owner_lock);
>>> 	return 0;
>>> }
>>>
>>> int iommu_release_device_dma_owner(struct device *dev,
>>> 				   enum device_dma_owner mode)
>>> {
>>> 	struct iommu_group *group = group_from_dev(dev);
>>>
>>> 	spin_lock(&iommu_group->dma_owner_lock);
>>> 	if (WARN_ON(!iommu_group->dma_users[mode]))
>>> 		goto err_unlock;
>>> 	if (!iommu_group->dma_users[mode]--) {
>>> 		if (mode == DMA_OWNER_USERSPACE) {
>>> 			fput(iommu_group->dma_owner_file);
>>> 			iommu_group->dma_owner_file = NULL;
>>> 		}
>>> 	}
>>> err_unlock:
>>> 	spin_unlock(&iommu_group->dma_owner_lock);
>>> }
>>>
>>>
>>> Where, the driver core does before probe:
>>>
>>>      iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>>>
>>> pci_stub/etc does in their probe func:
>>>
>>>      iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>>>
>>> And vfio/iommfd does when a struct vfio_device FD is attached:
>>>
>>>      iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE,
>> group_file/iommu_file)
>>
>> Really good design. It also helps alleviating some pains elsewhere in
>> the iommu core.
>>
>> Just a nit comment, we also need DMA_OWNER_NONE which will be set
>> when
>> the driver core unbinds the driver from the device.
>>
> 
> Not necessarily. NONE is represented by none of dma_user[mode]
> is valid.
> 

Fair enough.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-19  6:38 ` [RFC 03/20] vfio: Add vfio_[un]register_device() Liu Yi L
  2021-09-21 16:01   ` Jason Gunthorpe
@ 2021-09-29  2:43   ` David Gibson
  2021-09-29  3:40     ` Tian, Kevin
  2021-09-29  5:30     ` Tian, Kevin
  1 sibling, 2 replies; 280+ messages in thread
From: David Gibson @ 2021-09-29  2:43 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 11882 bytes --]

On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> With /dev/vfio/devices introduced, now a vfio device driver has three
> options to expose its device to userspace:
> 
> a)  only legacy group interface, for devices which haven't been moved to
>     iommufd (e.g. platform devices, sw mdev, etc.);
> 
> b)  both legacy group interface and new device-centric interface, for
>     devices which supports iommufd but also wants to keep backward
>     compatibility (e.g. pci devices in this RFC);
> 
> c)  only new device-centric interface, for new devices which don't carry
>     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> 
> This patch introduces vfio_[un]register_device() helpers for the device
> drivers to specify the device exposure policy to vfio core. Hence the
> existing vfio_[un]register_group_dev() become the wrapper of the new
> helper functions. The new device-centric interface is described as
> 'nongroup' to differentiate from existing 'group' stuff.
> 
> TBD: this patch needs to rebase on top of below series from Christoph in
> next version.
> 
> 	"cleanup vfio iommu_group creation"
> 
> Legacy userspace continues to follow the legacy group interface.
> 
> Newer userspace can first try the new device-centric interface if the
> device is present under /dev/vfio/devices. Otherwise fall back to the
> group interface.
> 
> One open about how to organize the device nodes under /dev/vfio/devices/.
> This RFC adopts a simple policy by keeping a flat layout with mixed devname
> from all kinds of devices. The prerequisite of this model is that devnames
> from different bus types are unique formats:
> 
> 	/dev/vfio/devices/0000:00:14.2 (pci)
> 	/dev/vfio/devices/PNP0103:00 (platform)
> 	/dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)

Oof.  I really don't think this is a good idea.  Ensuring that a
format is "unique" in the sense that it can't collide with any of the
other formats, for *every* value of the parameters on both sides is
actually pretty complicated in general.

I think per-type sub-directories would be helpful here, Jason's
suggestion of just sequential numbers would work as well.

> One alternative option is to arrange device nodes in sub-directories based
> on the device type. But doing so also adds one trouble to userspace. The
> current vfio uAPI is designed to have the user query device type via
> VFIO_DEVICE_GET_INFO after opening the device. With this option the user
> instead needs to figure out the device type before opening the device, to
> identify the sub-directory.

Wouldn't this be up to the operator / configuration, rather than the
actual software though?  I would assume that typically the VFIO
program would be pointed at a specific vfio device node file to use,
e.g.
	my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1

Or more generally, if you're expecting userspace to know a name in a
uniqu pattern, they can equally well know a "type/name" pair.

> Another tricky thing is that "pdev. vs. mdev"
> and "pci vs. platform vs. ccw,..." are orthogonal categorizations. Need
> more thoughts on whether both or just one category should be used to define
> the sub-directories.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio.c  | 137 +++++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h |   9 +++
>  2 files changed, 134 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 84436d7abedd..1e87b25962f1 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -51,6 +51,7 @@ static struct vfio {
>  	struct cdev			device_cdev;
>  	dev_t				device_devt;
>  	struct mutex			device_lock;
> +	struct list_head		device_list;
>  	struct idr			device_idr;
>  } vfio;
>  
> @@ -757,7 +758,7 @@ void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_init_group_dev);
>  
> -int vfio_register_group_dev(struct vfio_device *device)
> +static int __vfio_register_group_dev(struct vfio_device *device)
>  {
>  	struct vfio_device *existing_device;
>  	struct iommu_group *iommu_group;
> @@ -794,8 +795,13 @@ int vfio_register_group_dev(struct vfio_device *device)
>  	/* Our reference on group is moved to the device */
>  	device->group = group;
>  
> -	/* Refcounting can't start until the driver calls register */
> -	refcount_set(&device->refcount, 1);
> +	/*
> +	 * Refcounting can't start until the driver call register. Don’t
> +	 * start twice when the device is exposed in both group and nongroup
> +	 * interfaces.
> +	 */
> +	if (!refcount_read(&device->refcount))

Is there a possible race here with something getting in and
incrementing the refcount between the read and set?

> +		refcount_set(&device->refcount, 1);
>  
>  	mutex_lock(&group->device_lock);
>  	list_add(&device->group_next, &group->device_list);
> @@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device *device)
>  
>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(vfio_register_group_dev);
> +
> +static int __vfio_register_nongroup_dev(struct vfio_device *device)
> +{
> +	struct vfio_device *existing_device;
> +	struct device *dev;
> +	int ret = 0, minor;
> +
> +	mutex_lock(&vfio.device_lock);
> +	list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
> +		if (existing_device == device) {
> +			ret = -EBUSY;
> +			goto out_unlock;

This indicates a bug in the caller, doesn't it?  Should it be a BUG or
WARN instead?

> +		}
> +	}
> +
> +	minor = idr_alloc(&vfio.device_idr, device, 0, MINORMASK + 1, GFP_KERNEL);
> +	pr_debug("%s - mnior: %d\n", __func__, minor);
> +	if (minor < 0) {
> +		ret = minor;
> +		goto out_unlock;
> +	}
> +
> +	dev = device_create(vfio.device_class, NULL,
> +			    MKDEV(MAJOR(vfio.device_devt), minor),
> +			    device, "%s", dev_name(device->dev));
> +	if (IS_ERR(dev)) {
> +		idr_remove(&vfio.device_idr, minor);
> +		ret = PTR_ERR(dev);
> +		goto out_unlock;
> +	}
> +
> +	/*
> +	 * Refcounting can't start until the driver call register. Don’t
> +	 * start twice when the device is exposed in both group and nongroup
> +	 * interfaces.
> +	 */
> +	if (!refcount_read(&device->refcount))
> +		refcount_set(&device->refcount, 1);
> +
> +	device->minor = minor;
> +	list_add(&device->vfio_next, &vfio.device_list);
> +	dev_info(device->dev, "Creates Device interface successfully!\n");
> +out_unlock:
> +	mutex_unlock(&vfio.device_lock);
> +	return ret;
> +}
> +
> +int vfio_register_device(struct vfio_device *device, u32 flags)
> +{
> +	int ret = -EINVAL;
> +
> +	device->minor = -1;
> +	device->group = NULL;
> +	atomic_set(&device->opened, 0);
> +
> +	if (flags & ~(VFIO_DEVNODE_GROUP | VFIO_DEVNODE_NONGROUP))
> +		return ret;
> +
> +	if (flags & VFIO_DEVNODE_GROUP) {
> +		ret = __vfio_register_group_dev(device);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (flags & VFIO_DEVNODE_NONGROUP) {
> +		ret = __vfio_register_nongroup_dev(device);
> +		if (ret && device->group)
> +			vfio_unregister_device(device);
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_register_device);
>  
>  /**
>   * Get a reference to the vfio_device for a device.  Even if the
> @@ -861,13 +938,14 @@ static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
>  /*
>   * Decrement the device reference count and wait for the device to be
>   * removed.  Open file descriptors for the device... */
> -void vfio_unregister_group_dev(struct vfio_device *device)
> +void vfio_unregister_device(struct vfio_device *device)
>  {
>  	struct vfio_group *group = device->group;
>  	struct vfio_unbound_dev *unbound;
>  	unsigned int i = 0;
>  	bool interrupted = false;
>  	long rc;
> +	int minor = device->minor;
>  
>  	/*
>  	 * When the device is removed from the group, the group suddenly
> @@ -878,14 +956,20 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>  	 * solve this, we track such devices on the unbound_list to bridge
>  	 * the gap until they're fully unbound.
>  	 */
> -	unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
> -	if (unbound) {
> -		unbound->dev = device->dev;
> -		mutex_lock(&group->unbound_lock);
> -		list_add(&unbound->unbound_next, &group->unbound_list);
> -		mutex_unlock(&group->unbound_lock);
> +	if (group) {
> +		/*
> +		 * If caller hasn't called vfio_register_group_dev(), this
> +		 * branch is not necessary.
> +		 */
> +		unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
> +		if (unbound) {
> +			unbound->dev = device->dev;
> +			mutex_lock(&group->unbound_lock);
> +			list_add(&unbound->unbound_next, &group->unbound_list);
> +			mutex_unlock(&group->unbound_lock);
> +		}
> +		WARN_ON(!unbound);
>  	}
> -	WARN_ON(!unbound);
>  
>  	vfio_device_put(device);
>  	rc = try_wait_for_completion(&device->comp);
> @@ -910,6 +994,21 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>  		}
>  	}
>  
> +	/* nongroup interface related cleanup */
> +	if (minor >= 0) {
> +		mutex_lock(&vfio.device_lock);
> +		list_del(&device->vfio_next);
> +		device->minor = -1;
> +		device_destroy(vfio.device_class,
> +			       MKDEV(MAJOR(vfio.device_devt), minor));
> +		idr_remove(&vfio.device_idr, minor);
> +		mutex_unlock(&vfio.device_lock);
> +	}
> +
> +	/* No need go further if no group. */
> +	if (!group)
> +		return;
> +
>  	mutex_lock(&group->device_lock);
>  	list_del(&device->group_next);
>  	group->dev_counter--;
> @@ -935,6 +1034,18 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>  	/* Matches the get in vfio_register_group_dev() */
>  	vfio_group_put(group);
>  }
> +EXPORT_SYMBOL_GPL(vfio_unregister_device);
> +
> +int vfio_register_group_dev(struct vfio_device *device)
> +{
> +	return vfio_register_device(device, VFIO_DEVNODE_GROUP);
> +}
> +EXPORT_SYMBOL_GPL(vfio_register_group_dev);
> +
> +void vfio_unregister_group_dev(struct vfio_device *device)
> +{
> +	vfio_unregister_device(device);
> +}
>  EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
>  
>  /**
> @@ -2447,6 +2558,7 @@ static int vfio_init_device_class(void)
>  
>  	mutex_init(&vfio.device_lock);
>  	idr_init(&vfio.device_idr);
> +	INIT_LIST_HEAD(&vfio.device_list);
>  
>  	/* /dev/vfio/devices/$DEVICE */
>  	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> @@ -2542,6 +2654,7 @@ static int __init vfio_init(void)
>  static void __exit vfio_cleanup(void)
>  {
>  	WARN_ON(!list_empty(&vfio.group_list));
> +	WARN_ON(!list_empty(&vfio.device_list));
>  
>  #ifdef CONFIG_VFIO_NOIOMMU
>  	vfio_unregister_iommu_driver(&vfio_noiommu_ops);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 4a5f3f99eab2..9448b751b663 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -26,6 +26,7 @@ struct vfio_device {
>  	struct list_head group_next;
>  	int minor;
>  	atomic_t opened;
> +	struct list_head vfio_next;
>  };
>  
>  /**
> @@ -73,6 +74,14 @@ enum vfio_iommu_notify_type {
>  	VFIO_IOMMU_CONTAINER_CLOSE = 0,
>  };
>  
> +/* The device can be opened via VFIO_GROUP_GET_DEVICE_FD */
> +#define VFIO_DEVNODE_GROUP	BIT(0)
> +/* The device can be opened via /dev/sys/devices/${DEVICE} */
> +#define VFIO_DEVNODE_NONGROUP	BIT(1)
> +
> +extern int vfio_register_device(struct vfio_device *device, u32 flags);
> +extern void vfio_unregister_device(struct vfio_device *device);
> +
>  /**
>   * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
>   */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  1:00       ` Jason Gunthorpe
  2021-09-22  1:02         ` Tian, Kevin
  2021-09-23  7:25         ` Eric Auger
@ 2021-09-29  2:46         ` david
  2021-09-29 12:22           ` Jason Gunthorpe
  2 siblings, 1 reply; 280+ messages in thread
From: david @ 2021-09-29  2:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2484 bytes --]

On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > > 
> > > >  One open about how to organize the device nodes under
> > > /dev/vfio/devices/.
> > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > devname
> > > > from all kinds of devices. The prerequisite of this model is that devnames
> > > > from different bus types are unique formats:
> > > 
> > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > > 
> > > The userspace can learn the correct major/minor by inspecting the
> > > sysfs.
> > > 
> > > This whole concept should disappear into the prior patch that adds the
> > > struct device in the first place, and I think most of the code here
> > > can be deleted once the struct device is used properly.
> > > 
> > 
> > Can you help elaborate above flow? This is one area where we need
> > more guidance.
> > 
> > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > how does Qemu identify which vifo0/1/... is associated with the specified 
> > DDDD:BB:DD.F? 
> 
> When done properly in the kernel the file:
> 
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> 
> Will contain the major:minor of the VFIO device.
> 
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.
> 
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.

I thought part of the appeal of the device centric model was less
grovelling around in sysfs for information.  Using type/address
directly in /dev seems simpler than having to dig around matching
things here.

Note that this doesn't have to be done in kernel: you could have the
kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
that create symlinks from say /dev/vfio/pci/DDDD:BB:SS.F - >
../devices/vfioXX based on the sysfs information.

> 
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
> 
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-19  6:38 ` [RFC 04/20] iommu: Add iommu_device_get_info interface Liu Yi L
  2021-09-21 16:19   ` Jason Gunthorpe
@ 2021-09-29  2:52   ` David Gibson
  2021-09-29  9:25     ` Lu Baolu
  1 sibling, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-09-29  2:52 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 3945 bytes --]

On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> This provides an interface for upper layers to get the per-device iommu
> attributes.
> 
>     int iommu_device_get_info(struct device *dev,
>                               enum iommu_devattr attr, void *data);

That fact that this interface doesn't let you know how to size the
data buffer, other than by just knowing the right size for each attr
concerns me.

> 
> The first attribute (IOMMU_DEV_INFO_FORCE_SNOOP) is added. It tells if
> the iommu can force DMA to snoop cache. At this stage, only PCI devices
> which have this attribute set could use the iommufd, this is due to
> supporting no-snoop DMA requires additional refactoring work on the
> current kvm-vfio contract. The following patch will have vfio check this
> attribute to decide whether a pci device can be exposed through
> /dev/vfio/devices.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/iommu.c | 16 ++++++++++++++++
>  include/linux/iommu.h | 19 +++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 63f0af10c403..5ea3a007fd7c 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3260,3 +3260,19 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
>  
>  	return ret;
>  }
> +
> +/* Expose per-device iommu attributes. */
> +int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data)
> +{
> +	const struct iommu_ops *ops;
> +
> +	if (!dev->bus || !dev->bus->iommu_ops)
> +		return -EINVAL;
> +
> +	ops = dev->bus->iommu_ops;
> +	if (unlikely(!ops->device_info))
> +		return -ENODEV;
> +
> +	return ops->device_info(dev, attr, data);
> +}
> +EXPORT_SYMBOL_GPL(iommu_device_get_info);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 32d448050bf7..52a6d33c82dc 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -150,6 +150,14 @@ enum iommu_dev_features {
>  	IOMMU_DEV_FEAT_IOPF,
>  };
>  
> +/**
> + * enum iommu_devattr - Per device IOMMU attributes
> + * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
> + */
> +enum iommu_devattr {
> +	IOMMU_DEV_INFO_FORCE_SNOOP,
> +};
> +
>  #define IOMMU_PASID_INVALID	(-1U)
>  
>  #ifdef CONFIG_IOMMU_API
> @@ -215,6 +223,7 @@ struct iommu_iotlb_gather {
>   *		- IOMMU_DOMAIN_IDENTITY: must use an identity domain
>   *		- IOMMU_DOMAIN_DMA: must use a dma domain
>   *		- 0: use the default setting
> + * @device_info: query per-device iommu attributes
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>   * @owner: Driver module providing these ops
>   */
> @@ -283,6 +292,8 @@ struct iommu_ops {
>  
>  	int (*def_domain_type)(struct device *dev);
>  
> +	int (*device_info)(struct device *dev, enum iommu_devattr attr, void *data);
> +
>  	unsigned long pgsize_bitmap;
>  	struct module *owner;
>  };
> @@ -604,6 +615,8 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
>  void iommu_sva_unbind_device(struct iommu_sva *handle);
>  u32 iommu_sva_get_pasid(struct iommu_sva *handle);
>  
> +int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);
> +
>  #else /* CONFIG_IOMMU_API */
>  
>  struct iommu_ops {};
> @@ -999,6 +1012,12 @@ static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
>  {
>  	return NULL;
>  }
> +
> +static inline int iommu_device_get_info(struct device *dev,
> +					enum iommu_devattr type, void *data)
> +{
> +	return -ENODEV;
> +}
>  #endif /* CONFIG_IOMMU_API */
>  
>  /**

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-29  2:43   ` David Gibson
@ 2021-09-29  3:40     ` Tian, Kevin
  2021-09-29  5:30     ` Tian, Kevin
  1 sibling, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  3:40 UTC (permalink / raw)
  To: David Gibson, Liu, Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, September 29, 2021 10:44 AM
> 
> >
> > One open about how to organize the device nodes under
> /dev/vfio/devices/.
> > This RFC adopts a simple policy by keeping a flat layout with mixed
> devname
> > from all kinds of devices. The prerequisite of this model is that devnames
> > from different bus types are unique formats:
> >
> > 	/dev/vfio/devices/0000:00:14.2 (pci)
> > 	/dev/vfio/devices/PNP0103:00 (platform)
> > 	/dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)
> 
> Oof.  I really don't think this is a good idea.  Ensuring that a
> format is "unique" in the sense that it can't collide with any of the
> other formats, for *every* value of the parameters on both sides is
> actually pretty complicated in general.
> 
> I think per-type sub-directories would be helpful here, Jason's
> suggestion of just sequential numbers would work as well.

we'll follow Jason's suggestion in next version.

> > +	/*
> > +	 * Refcounting can't start until the driver call register. Don’t
> > +	 * start twice when the device is exposed in both group and
> nongroup
> > +	 * interfaces.
> > +	 */
> > +	if (!refcount_read(&device->refcount))
> 
> Is there a possible race here with something getting in and
> incrementing the refcount between the read and set?

this will not be required in next version, which will always create
both group and nongroup interfaces for every device (then let
driver providing .bind_iommufd() callback for whether nongroup
interface is functional). It will be centrally processed within
existing vfio_[un]register_group_dev(), thus above race is not
a concern any more.

> 
> > +		refcount_set(&device->refcount, 1);
> >
> >  	mutex_lock(&group->device_lock);
> >  	list_add(&device->group_next, &group->device_list);
> > @@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device
> *device)
> >
> >  	return 0;
> >  }
> > -EXPORT_SYMBOL_GPL(vfio_register_group_dev);
> > +
> > +static int __vfio_register_nongroup_dev(struct vfio_device *device)
> > +{
> > +	struct vfio_device *existing_device;
> > +	struct device *dev;
> > +	int ret = 0, minor;
> > +
> > +	mutex_lock(&vfio.device_lock);
> > +	list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
> > +		if (existing_device == device) {
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> 
> This indicates a bug in the caller, doesn't it?  Should it be a BUG or
> WARN instead?

this call is initiated by userspace. Per Jason's suggestion we don't 
even need to check it then no lock is required. 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-19  6:38 ` [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces Liu Yi L
  2021-09-21 17:09   ` Jason Gunthorpe
@ 2021-09-29  4:55   ` David Gibson
  2021-09-29  5:38     ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-09-29  4:55 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 11188 bytes --]

On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> This extends iommu core to manage security context for passthrough
> devices. Please bear a long explanation for how we reach this design
> instead of managing it solely in iommufd like what vfio does today.
> 
> Devices which cannot be isolated from each other are organized into an
> iommu group. When a device is assigned to the user space, the entire
> group must be put in a security context so that user-initiated DMAs via
> the assigned device cannot harm the rest of the system. No user access
> should be granted on a device before the security context is established
> for the group which the device belongs to.
> 
> Managing the security context must meet below criteria:
> 
> 1)  The group is viable for user-initiated DMAs. This implies that the
>     devices in the group must be either bound to a device-passthrough
>     framework, or driver-less, or bound to a driver which is known safe
>     (not do DMA).
> 
> 2)  The security context should only allow DMA to the user's memory and
>     devices in this group;
> 
> 3)  After the security context is established for the group, the group
>     viability must be continuously monitored before the user relinquishes
>     all devices belonging to the group. The viability might be broken e.g.
>     when a driver-less device is later bound to a driver which does DMA.
> 
> 4)  The security context should not be destroyed before user access
>     permission is withdrawn.
> 
> Existing vfio introduces explicit container/group semantics in its uAPI
> to meet above requirements. A single security context (iommu domain)
> is created per container. Attaching group to container moves the entire
> group into the associated security context, and vice versa. The user can
> open the device only after group attach. A group can be detached only
> after all devices in the group are closed. Group viability is monitored
> by listening to iommu group events.
> 
> Unlike vfio, iommufd adopts a device-centric design with all group
> logistics hidden behind the fd. Binding a device to iommufd serves
> as the contract to get security context established (and vice versa
> for unbinding). One additional requirement in iommufd is to manage the
> switch between multiple security contexts due to decoupled bind/attach:
> 
> 1)  Open a device in "/dev/vfio/devices" with user access blocked;

Probably worth clarifying that (1) must happen for *all* devices in
the group before (2) happens for any device in the group.

> 2)  Bind the device to an iommufd with an initial security context
>     (an empty iommu domain which blocks dma) established for its
>     group, with user access unblocked;
> 
> 3)  Attach the device to a user-specified ioasid (shared by all devices
>     attached to this ioasid). Before attaching, the device should be first
>     detached from the initial context;

So, this step can implicitly but observably change the behaviour for
other devices in the group as well.  I don't love that kind of
difficult to predict side effect, which is why I'm *still* not totally
convinced by the device-centric model.

> 4)  Detach the device from the ioasid and switch it back to the initial
>     security context;

Same non-local side effect at this step, of course.

Btw, explicitly naming the "no DMA" context is probably a good idea,
rather than referring to the "initial security context" (it's
"initial" from the PoV of the iommufd, but not from the PoV of the
device fd which was likely bound to the default kernel context before
(2)).

> 5)  Unbind the device from the iommufd, back to access blocked state and
>     move its group out of the initial security context if it's the last
>     unbound device in the group;

Maybe worth clarifying that again (5) must happen for all devices in
the group before rebiding any devices to regular kernel drivers.
> 
> (multiple attach/detach could happen between 2 and 5).
> 
> However existing iommu core has problem with above transition. Detach
> in step 3/4 makes the device/group re-attached to the default domain
> automatically, which opens the door for user-initiated DMAs to attack
> the rest of the system. The existing vfio doesn't have this problem as
> it combines 2/3 in one step (so does 4/5).
> 
> Fixing this problem requires the iommu core to also participate in the
> security context management. Following this direction we also move group
> viability check into the iommu core, which allows iommufd to stay fully
> device-centric w/o keeping any group knowledge (combining with the
> extension to iommu_at[de]tach_device() in a latter patch).
> 
> Basically two new interfaces are provided:
> 
>         int iommu_device_init_user_dma(struct device *dev,
>                         unsigned long owner);
>         void iommu_device_exit_user_dma(struct device *dev);
> 
> iommufd calls them respectively when handling device binding/unbinding
> requests.
> 
> The init_user_dma() for the 1st device in a group marks the entire group
> for user-dma and establishes the initial security context (dma blocked)
> according to aforementioned criteria. As long as the group is marked for
> user-dma, auto-reattaching to default domain is disabled. Instead, upon
> detaching the group is moved back to the initial security context.
> 
> The caller also provides an owner id to mark the ownership so inadvertent
> attempt from another caller on the same device can be captured. In this
> RFC iommufd will use the fd context pointer as the owner id.
> 
> The exit_user_dma() for the last device in the group clears the user-dma
> mark and moves the group back to the default domain.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h |  12 ++++
>  2 files changed, 154 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 5ea3a007fd7c..bffd84e978fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -45,6 +45,8 @@ struct iommu_group {
>  	struct iommu_domain *default_domain;
>  	struct iommu_domain *domain;
>  	struct list_head entry;
> +	unsigned long user_dma_owner_id;

Using an opaque integer doesn't seem like a good idea.  I think you
probably want a pointer to a suitable struct dma_owner or the like
(you could have one embedded in each iommufd struct, plus a global
static kernel_default_owner).

> +	refcount_t owner_cnt;
>  };
>  
>  struct group_device {
> @@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
>  static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  static ssize_t iommu_group_store_type(struct iommu_group *group,
>  				      const char *buf, size_t count);
> +static bool iommu_group_user_dma_viable(struct iommu_group *group);
>  
>  #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
>  struct iommu_group_attribute iommu_group_attr_##_name =		\
> @@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
>  	 */
>  	iommu_alloc_default_domain(group, dev);
>  
> -	if (group->default_domain) {
> +	/*
> +	 * If any device in the group has been initialized for user dma,
> +	 * avoid attaching the default domain.
> +	 */
> +	if (group->default_domain && !group->user_dma_owner_id) {
>  		ret = __iommu_attach_device(group->default_domain, dev);
>  		if (ret) {
>  			iommu_group_put(group);
> @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
>  		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
>  		break;
>  	case BUS_NOTIFY_BOUND_DRIVER:
> +		/*
> +		 * FIXME: Alternatively the attached drivers could generically
> +		 * indicate to the iommu layer that they are safe for keeping
> +		 * the iommu group user viable by calling some function around
> +		 * probe(). We could eliminate this gross BUG_ON() by denying
> +		 * probe to non-iommu-safe driver.
> +		 */
> +		mutex_lock(&group->mutex);
> +		if (group->user_dma_owner_id)
> +			BUG_ON(!iommu_group_user_dma_viable(group));
> +		mutex_unlock(&group->mutex);
>  		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
>  		break;
>  	case BUS_NOTIFY_UNBIND_DRIVER:
> @@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
>  {
>  	int ret;
>  
> -	if (group->default_domain && group->domain != group->default_domain)
> +	/*
> +	 * group->domain could be NULL when a domain is detached from the
> +	 * group but the default_domain is not re-attached.
> +	 */
> +	if (group->domain && group->domain != group->default_domain)
>  		return -EBUSY;
>  
>  	ret = __iommu_group_for_each_dev(group, domain,
> @@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
>  {
>  	int ret;
>  
> -	if (!group->default_domain) {
> +	/*
> +	 * If any device in the group has been initialized for user dma,
> +	 * avoid re-attaching the default domain.
> +	 */
> +	if (!group->default_domain || group->user_dma_owner_id) {
>  		__iommu_group_for_each_dev(group, domain,
>  					   iommu_group_do_detach_device);
>  		group->domain = NULL;
> @@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
>  	return ops->device_info(dev, attr, data);
>  }
>  EXPORT_SYMBOL_GPL(iommu_device_get_info);
> +
> +/*
> + * IOMMU core interfaces for iommufd.
> + */
> +
> +/*
> + * FIXME: We currently simply follow vifo policy to mantain the group's
> + * viability to user. Eventually, we should avoid below hard-coded list
> + * by letting drivers indicate to the iommu layer that they are safe for
> + * keeping the iommu group's user aviability.
> + */
> +static const char * const iommu_driver_allowed[] = {
> +	"vfio-pci",
> +	"pci-stub"
> +};
> +
> +/*
> + * An iommu group is viable for use by userspace if all devices are in
> + * one of the following states:
> + *  - driver-less
> + *  - bound to an allowed driver
> + *  - a PCI interconnect device
> + */
> +static int device_user_dma_viable(struct device *dev, void *data)

I think this wants a "less friendly" more obviously local name.
Really the only safe way to call this is via
iommu_group_user_dma_viable(), which isn't obvious from this name.

> +{
> +	struct device_driver *drv = READ_ONCE(dev->driver);
> +
> +	if (!drv)
> +		return 0;
> +
> +	if (dev_is_pci(dev)) {
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> +			return 0;
> +	}
> +
> +	return match_string(iommu_driver_allowed,
> +			    ARRAY_SIZE(iommu_driver_allowed),
> +			    drv->name) < 0;
> +}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-19  6:38 ` [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device() Liu Yi L
  2021-09-21 17:14   ` Jason Gunthorpe
@ 2021-09-29  5:25   ` David Gibson
  2021-09-29 12:24     ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-09-29  5:25 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 9187 bytes --]

On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:
> Under the /dev/iommu model, iommufd provides the interface for I/O page
> tables management such as dma map/unmap. However, it cannot work
> independently since the device is still owned by the device-passthrough
> frameworks (VFIO, vDPA, etc.) and vice versa. Device-passthrough frameworks
> should build a connection between its device and the iommufd to delegate
> the I/O page table management affairs to iommufd.
> 
> This patch introduces iommufd_[un]bind_device() helpers for the device-
> passthrough framework to build such connection. The helper functions then
> invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit
> security context for the bound device. Each successfully bound device is
> internally tracked by an iommufd_device object. This object is returned
> to the caller for subsequent attaching operations on the device as well.
> 
> The caller should pass a user-provided cookie to mark the device in the
> iommufd. Later this cookie will be used to represent the device in iommufd
> uAPI, e.g. when querying device capabilities or handling per-device I/O
> page faults. One alternative is to have iommufd allocate a device label
> and return to the user. Either way works, but cookie is slightly preferred
> per earlier discussion as it may allow the user to inject faults slightly
> faster without ID->vRID lookup.
> 
> iommu_[un]bind_device() functions are only used for physical devices. Other
> variants will be introduced in the future, e.g.:
> 
> -  iommu_[un]bind_device_pasid() for mdev/subdev which requires pasid granular
>    DMA isolation;
> -  iommu_[un]bind_sw_mdev() for sw mdev which relies on software measures
>    instead of iommu to isolate DMA;
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommufd/iommufd.c | 160 +++++++++++++++++++++++++++++++-
>  include/linux/iommufd.h         |  38 ++++++++
>  2 files changed, 196 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/iommufd.h
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 710b7e62988b..e16ca21e4534 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -16,10 +16,30 @@
>  #include <linux/miscdevice.h>
>  #include <linux/mutex.h>
>  #include <linux/iommu.h>
> +#include <linux/iommufd.h>
> +#include <linux/xarray.h>
> +#include <asm-generic/bug.h>
>  
>  /* Per iommufd */
>  struct iommufd_ctx {
>  	refcount_t refs;
> +	struct mutex lock;
> +	struct xarray device_xa; /* xarray of bound devices */
> +};
> +
> +/*
> + * A iommufd_device object represents the binding relationship
> + * between iommufd and device. It is created per a successful
> + * binding request from device driver. The bound device must be
> + * a physical device so far. Subdevice will be supported later
> + * (with additional PASID information). An user-assigned cookie
> + * is also recorded to mark the device in the /dev/iommu uAPI.
> + */
> +struct iommufd_device {
> +	unsigned int id;
> +	struct iommufd_ctx *ictx;
> +	struct device *dev; /* always be the physical device */
> +	u64 dev_cookie;

Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
unique, couldn't you just use the cookie directly as the index into
the xarray?

>  };
>  
>  static int iommufd_fops_open(struct inode *inode, struct file *filep)
> @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
>  		return -ENOMEM;
>  
>  	refcount_set(&ictx->refs, 1);
> +	mutex_init(&ictx->lock);
> +	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
>  	filep->private_data = ictx;
>  
>  	return ret;
>  }
>  
> +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> +	refcount_inc(&ictx->refs);
> +}
> +
> +static const struct file_operations iommufd_fops;
> +
> +/**
> + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> + * @fd: [in] iommufd file descriptor.
> + *
> + * Returns a pointer to the iommufd context, otherwise NULL;
> + *
> + */
> +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> +{
> +	struct fd f = fdget(fd);
> +	struct file *file = f.file;
> +	struct iommufd_ctx *ictx;
> +
> +	if (!file)
> +		return NULL;
> +
> +	if (file->f_op != &iommufd_fops)
> +		return NULL;
> +
> +	ictx = file->private_data;
> +	if (ictx)
> +		iommufd_ctx_get(ictx);
> +	fdput(f);
> +	return ictx;
> +}
> +
> +/**
> + * iommufd_ctx_put - Releases a reference to the internal iommufd context.
> + * @ictx: [in] Pointer to iommufd context.
> + *
> + */
>  static void iommufd_ctx_put(struct iommufd_ctx *ictx)
>  {
> -	if (refcount_dec_and_test(&ictx->refs))
> -		kfree(ictx);
> +	if (!refcount_dec_and_test(&ictx->refs))
> +		return;
> +
> +	WARN_ON(!xa_empty(&ictx->device_xa));
> +	kfree(ictx);
>  }
>  
>  static int iommufd_fops_release(struct inode *inode, struct file *filep)
> @@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
>  	.mode = 0666,
>  };
>  
> +/**
> + * iommufd_bind_device - Bind a physical device marked by a device
> + *			 cookie to an iommu fd.
> + * @fd:		[in] iommufd file descriptor.
> + * @dev:	[in] Pointer to a physical device struct.
> + * @dev_cookie:	[in] A cookie to mark the device in /dev/iommu uAPI.
> + *
> + * A successful bind establishes a security context for the device
> + * and returns struct iommufd_device pointer. Otherwise returns
> + * error pointer.
> + *
> + */
> +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> +					   u64 dev_cookie)
> +{
> +	struct iommufd_ctx *ictx;
> +	struct iommufd_device *idev;
> +	unsigned long index;
> +	unsigned int id;
> +	int ret;
> +
> +	ictx = iommufd_ctx_fdget(fd);
> +	if (!ictx)
> +		return ERR_PTR(-EINVAL);
> +
> +	mutex_lock(&ictx->lock);
> +
> +	/* check duplicate registration */
> +	xa_for_each(&ictx->device_xa, index, idev) {
> +		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> +			idev = ERR_PTR(-EBUSY);
> +			goto out_unlock;
> +		}
> +	}
> +
> +	idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> +	if (!idev) {
> +		ret = -ENOMEM;
> +		goto out_unlock;
> +	}
> +
> +	/* Establish the security context */
> +	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> +	if (ret)
> +		goto out_free;
> +
> +	ret = xa_alloc(&ictx->device_xa, &id, idev,
> +		       XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
> +		       GFP_KERNEL);
> +	if (ret) {
> +		idev = ERR_PTR(ret);
> +		goto out_user_dma;
> +	}
> +
> +	idev->ictx = ictx;
> +	idev->dev = dev;
> +	idev->dev_cookie = dev_cookie;
> +	idev->id = id;
> +	mutex_unlock(&ictx->lock);
> +
> +	return idev;
> +out_user_dma:
> +	iommu_device_exit_user_dma(idev->dev);
> +out_free:
> +	kfree(idev);
> +out_unlock:
> +	mutex_unlock(&ictx->lock);
> +	iommufd_ctx_put(ictx);
> +
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_bind_device);
> +
> +/**
> + * iommufd_unbind_device - Unbind a physical device from iommufd
> + *
> + * @idev: [in] Pointer to the internal iommufd_device struct.
> + *
> + */
> +void iommufd_unbind_device(struct iommufd_device *idev)
> +{
> +	struct iommufd_ctx *ictx = idev->ictx;
> +
> +	mutex_lock(&ictx->lock);
> +	xa_erase(&ictx->device_xa, idev->id);
> +	mutex_unlock(&ictx->lock);
> +	/* Exit the security context */
> +	iommu_device_exit_user_dma(idev->dev);
> +	kfree(idev);
> +	iommufd_ctx_put(ictx);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_unbind_device);
> +
>  static int __init iommufd_init(void)
>  {
>  	int ret;
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> new file mode 100644
> index 000000000000..1603a13937e9
> --- /dev/null
> +++ b/include/linux/iommufd.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * IOMMUFD API definition
> + *
> + * Copyright (C) 2021 Intel Corporation
> + *
> + * Author: Liu Yi L <yi.l.liu@intel.com>
> + */
> +#ifndef __LINUX_IOMMUFD_H
> +#define __LINUX_IOMMUFD_H
> +
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/device.h>
> +
> +#define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
> +#define IOMMUFD_DEVID_MIN	0
> +
> +struct iommufd_device;
> +
> +#if IS_ENABLED(CONFIG_IOMMUFD)
> +struct iommufd_device *
> +iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
> +void iommufd_unbind_device(struct iommufd_device *idev);
> +
> +#else /* !CONFIG_IOMMUFD */
> +static inline struct iommufd_device *
> +iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
> +{
> +	return ERR_PTR(-ENODEV);
> +}
> +
> +static inline void iommufd_unbind_device(struct iommufd_device *idev)
> +{
> +}
> +#endif /* CONFIG_IOMMUFD */
> +#endif /* __LINUX_IOMMUFD_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-29  2:43   ` David Gibson
  2021-09-29  3:40     ` Tian, Kevin
@ 2021-09-29  5:30     ` Tian, Kevin
  2021-09-29  7:08       ` Cornelia Huck
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  5:30 UTC (permalink / raw)
  To: David Gibson, Liu, Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, September 29, 2021 10:44 AM
> 
> > One alternative option is to arrange device nodes in sub-directories based
> > on the device type. But doing so also adds one trouble to userspace. The
> > current vfio uAPI is designed to have the user query device type via
> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
> > instead needs to figure out the device type before opening the device, to
> > identify the sub-directory.
> 
> Wouldn't this be up to the operator / configuration, rather than the
> actual software though?  I would assume that typically the VFIO
> program would be pointed at a specific vfio device node file to use,
> e.g.
> 	my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1
> 
> Or more generally, if you're expecting userspace to know a name in a
> uniqu pattern, they can equally well know a "type/name" pair.
> 

You are correct. Currently:

-device, vfio-pci,host=DDDD:BB:DD.F
-device, vfio-pci,sysfdev=/sys/bus/pci/devices/ DDDD:BB:DD.F
-device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00

above is definitely type/name information to find the related node. 

Actually even for Jason's proposal we still need such information to
identify the sysfs path.

Then I feel type-based sub-directory does work. Adding another link
to sysfs sounds unnecessary now. But I'm not sure whether we still
want to create /dev/vfio/devices/vfio0 thing and related udev rule
thing that you pointed out in another mail.

Jason?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  4:55   ` David Gibson
@ 2021-09-29  5:38     ` Tian, Kevin
  2021-09-29  6:35       ` David Gibson
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  5:38 UTC (permalink / raw)
  To: David Gibson, Liu, Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, September 29, 2021 12:56 PM
> 
> >
> > Unlike vfio, iommufd adopts a device-centric design with all group
> > logistics hidden behind the fd. Binding a device to iommufd serves
> > as the contract to get security context established (and vice versa
> > for unbinding). One additional requirement in iommufd is to manage the
> > switch between multiple security contexts due to decoupled bind/attach:
> >
> > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> 
> Probably worth clarifying that (1) must happen for *all* devices in
> the group before (2) happens for any device in the group.

No. User access is naturally blocked for other devices as long as they
are not opened yet.

> 
> > 2)  Bind the device to an iommufd with an initial security context
> >     (an empty iommu domain which blocks dma) established for its
> >     group, with user access unblocked;
> >
> > 3)  Attach the device to a user-specified ioasid (shared by all devices
> >     attached to this ioasid). Before attaching, the device should be first
> >     detached from the initial context;
> 
> So, this step can implicitly but observably change the behaviour for
> other devices in the group as well.  I don't love that kind of
> difficult to predict side effect, which is why I'm *still* not totally
> convinced by the device-centric model.

which side-effect is predicted here? The user anyway needs to be
aware of such group restriction regardless whether it uses group
or nongroup interface.

> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 5ea3a007fd7c..bffd84e978fb 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -45,6 +45,8 @@ struct iommu_group {
> >  	struct iommu_domain *default_domain;
> >  	struct iommu_domain *domain;
> >  	struct list_head entry;
> > +	unsigned long user_dma_owner_id;
> 
> Using an opaque integer doesn't seem like a good idea.  I think you
> probably want a pointer to a suitable struct dma_owner or the like
> (you could have one embedded in each iommufd struct, plus a global
> static kernel_default_owner).
> 

For remaining comments you may want to look at the latest discussion
here:

https://lore.kernel.org/kvm/20210928140712.GL964074@nvidia.com/

It relies on driver core change to manage group ownership gracefully.
No BUG_ON() is triggered any more for driver binding. There a fd will
be passed in to mark the ownership.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-19  6:38 ` [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD Liu Yi L
  2021-09-21 17:29   ` Jason Gunthorpe
@ 2021-09-29  6:00   ` David Gibson
  2021-09-29  6:41     ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-09-29  6:00 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 7780 bytes --]

On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
> device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
> because it's implicitly done when the device fd is closed.
> 
> In concept a vfio device can be bound to multiple iommufds, each hosting
> a subset of I/O address spaces attached by this device.

I really feel like this many<->many mapping between devices is going
to be super-confusing, and therefore make it really hard to be
confident we have all the rules right for proper isolation.

That's why I was suggesting a concept like endpoints, to break this
into two many<->one relationships.  I'm ok if that isn't visible in
the user API, but I think this is going to be really hard to keep
track of if it isn't explicit somewhere in the internals.

> However as a
> starting point (matching current vfio), only one I/O address space is
> supported per vfio device. It implies one device can only be attached
> to one iommufd at this point.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/Kconfig            |  1 +
>  drivers/vfio/pci/vfio_pci.c         | 72 ++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_private.h |  8 ++++
>  include/uapi/linux/vfio.h           | 30 ++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 5e2e1b9a9fd3..3abfb098b4dc 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -5,6 +5,7 @@ config VFIO_PCI
>  	depends on MMU
>  	select VFIO_VIRQFD
>  	select IRQ_BYPASS_MANAGER
> +	select IOMMUFD
>  	help
>  	  Support for the PCI VFIO bus driver.  This is required to make
>  	  use of PCI drivers using the VFIO framework.
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 145addde983b..20006bb66430 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
>  			vdev->req_trigger = NULL;
>  		}
>  		mutex_unlock(&vdev->igate);
> +
> +		mutex_lock(&vdev->videv_lock);
> +		if (vdev->videv) {
> +			struct vfio_iommufd_device *videv = vdev->videv;
> +
> +			vdev->videv = NULL;
> +			iommufd_unbind_device(videv->idev);
> +			kfree(videv);
> +		}
> +		mutex_unlock(&vdev->videv_lock);
>  	}
>  
>  	mutex_unlock(&vdev->reflck->lock);
> @@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
>  		container_of(core_vdev, struct vfio_pci_device, vdev);
>  	unsigned long minsz;
>  
> -	if (cmd == VFIO_DEVICE_GET_INFO) {
> +	if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {
> +		struct vfio_device_iommu_bind_data bind_data;
> +		unsigned long minsz;
> +		struct iommufd_device *idev;
> +		struct vfio_iommufd_device *videv;
> +
> +		/*
> +		 * Reject the request if the device is already opened and
> +		 * attached to a container.
> +		 */
> +		if (vfio_device_in_container(core_vdev))

Usually one would do argument sanity checks before checks that
actually depend on machine state.

> +			return -ENOTTY;

This doesn't seem like the right error code.  It's a perfectly valid
operation for this device - just not available right now.

> +
> +		minsz = offsetofend(struct vfio_device_iommu_bind_data, dev_cookie);
> +
> +		if (copy_from_user(&bind_data, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind_data.argsz < minsz ||
> +		    bind_data.flags || bind_data.iommu_fd < 0)
> +			return -EINVAL;
> +
> +		mutex_lock(&vdev->videv_lock);
> +		/*
> +		 * Allow only one iommufd per device until multiple
> +		 * address spaces (e.g. vSVA) support is introduced
> +		 * in the future.
> +		 */
> +		if (vdev->videv) {
> +			mutex_unlock(&vdev->videv_lock);
> +			return -EBUSY;
> +		}
> +
> +		idev = iommufd_bind_device(bind_data.iommu_fd,
> +					   &vdev->pdev->dev,
> +					   bind_data.dev_cookie);
> +		if (IS_ERR(idev)) {
> +			mutex_unlock(&vdev->videv_lock);
> +			return PTR_ERR(idev);
> +		}
> +
> +		videv = kzalloc(sizeof(*videv), GFP_KERNEL);
> +		if (!videv) {
> +			iommufd_unbind_device(idev);
> +			mutex_unlock(&vdev->videv_lock);
> +			return -ENOMEM;
> +		}
> +		videv->idev = idev;
> +		videv->iommu_fd = bind_data.iommu_fd;
> +		/*
> +		 * A security context has been established. Unblock
> +		 * user access.
> +		 */
> +		if (atomic_read(&vdev->block_access))
> +			atomic_set(&vdev->block_access, 0);
> +		vdev->videv = videv;
> +		mutex_unlock(&vdev->videv_lock);
> +
> +		return 0;
> +	} else if (cmd == VFIO_DEVICE_GET_INFO) {
>  		struct vfio_device_info info;
>  		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
>  		unsigned long capsz;
> @@ -2031,6 +2100,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	mutex_init(&vdev->vma_lock);
>  	INIT_LIST_HEAD(&vdev->vma_list);
>  	init_rwsem(&vdev->memory_lock);
> +	mutex_init(&vdev->videv_lock);
>  
>  	ret = vfio_pci_reflck_attach(vdev);
>  	if (ret)
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index f12012e30b53..bd784accac35 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -14,6 +14,7 @@
>  #include <linux/types.h>
>  #include <linux/uuid.h>
>  #include <linux/notifier.h>
> +#include <linux/iommufd.h>
>  
>  #ifndef VFIO_PCI_PRIVATE_H
>  #define VFIO_PCI_PRIVATE_H
> @@ -99,6 +100,11 @@ struct vfio_pci_mmap_vma {
>  	struct list_head	vma_next;
>  };
>  
> +struct vfio_iommufd_device {
> +	struct iommufd_device *idev;

Could this be embedded to avoid multiple layers of pointers?

> +	int iommu_fd;
> +};
> +
>  struct vfio_pci_device {
>  	struct vfio_device	vdev;
>  	struct pci_dev		*pdev;
> @@ -144,6 +150,8 @@ struct vfio_pci_device {
>  	struct list_head	vma_list;
>  	struct rw_semaphore	memory_lock;
>  	atomic_t		block_access;
> +	struct mutex		videv_lock;
> +	struct vfio_iommufd_device *videv;
>  };
>  
>  #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..c902abd60339 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -190,6 +190,36 @@ struct vfio_group_status {
>  
>  /* --------------- IOCTLs for DEVICE file descriptors --------------- */
>  
> +/*
> + * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
> + *				struct vfio_device_iommu_bind_data)
> + *
> + * Bind a vfio_device to the specified iommufd
> + *
> + * The user should provide a device cookie when calling this ioctl. The
> + * cookie is later used in iommufd for capability query, iotlb invalidation
> + * and I/O fault handling.
> + *
> + * User is not allowed to access the device before the binding operation
> + * is completed.
> + *
> + * Unbind is automatically conducted when device fd is closed.
> + *
> + * Input parameters:
> + *	- iommu_fd;
> + *	- dev_cookie;
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_iommu_bind_data {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	iommu_fd;
> +	__u64	dev_cookie;
> +};
> +
> +#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
> +
>  /**
>   * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
>   *						struct vfio_device_info)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-22 12:41       ` Jason Gunthorpe
@ 2021-09-29  6:18         ` david
  0 siblings, 0 replies; 280+ messages in thread
From: david @ 2021-09-29  6:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2884 bytes --]

On Wed, Sep 22, 2021 at 09:41:50AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 03:30:09AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 1:41 AM
> > > 
> > > On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > > > After a device is bound to the iommufd, userspace can use this interface
> > > > to query the underlying iommu capability and format info for this device.
> > > > Based on this information the user then creates I/O address space in a
> > > > compatible format with the to-be-attached devices.
> > > >
> > > > Device cookie which is registered at binding time is used to mark the
> > > > device which is being queried here.
> > > >
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > >  drivers/iommu/iommufd/iommufd.c | 68
> > > +++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
> > > >  2 files changed, 117 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > > b/drivers/iommu/iommufd/iommufd.c
> > > > index e16ca21e4534..641f199f2d41 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode
> > > *inode, struct file *filep)
> > > >  	return 0;
> > > >  }
> > > >
> > > > +static struct device *
> > > > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64
> > > dev_cookie)
> > > > +{
> > > 
> > > We have an xarray ID for the device, why are we allowing userspace to
> > > use the dev_cookie as input?
> > > 
> > > Userspace should always pass in the ID. The only place dev_cookie
> > > should appear is if the kernel generates an event back to
> > > userspace. Then the kernel should return both the ID and the
> > > dev_cookie in the event to allow userspace to correlate it.
> > > 
> > 
> > A little background.
> > 
> > In earlier design proposal we discussed two options. One is to return
> > an kernel-allocated ID (label) to userspace. The other is to have user
> > register a cookie and use it in iommufd uAPI. At that time the two
> > options were discussed exclusively and the cookie one is preferred.
> > 
> > Now you instead recommended a mixed option. We can follow it for
> > sure if nobody objects.
> 
> Either or for the return is fine, I'd return both just because it is
> more flexable
> 
> But the cookie should never be an input from userspace, and the kernel
> should never search for it. Locating the kernel object is what the ID
> and xarray is for.

Why do we need two IDs at all?  Can't we just use the cookie as the
sole ID?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38 ` [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO Liu Yi L
  2021-09-21 17:40   ` Jason Gunthorpe
  2021-09-22 21:24   ` Alex Williamson
@ 2021-09-29  6:23   ` David Gibson
  2 siblings, 0 replies; 280+ messages in thread
From: David Gibson @ 2021-09-29  6:23 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 5894 bytes --]

On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> After a device is bound to the iommufd, userspace can use this interface
> to query the underlying iommu capability and format info for this device.
> Based on this information the user then creates I/O address space in a
> compatible format with the to-be-attached devices.
> 
> Device cookie which is registered at binding time is used to mark the
> device which is being queried here.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e16ca21e4534..641f199f2d41 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>  
> +static struct device *
> +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> +{
> +	struct iommufd_device *idev;
> +	struct device *dev = NULL;
> +	unsigned long index;
> +
> +	mutex_lock(&ictx->lock);
> +	xa_for_each(&ictx->device_xa, index, idev) {
> +		if (idev->dev_cookie == dev_cookie) {
> +			dev = idev->dev;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&ictx->lock);
> +
> +	return dev;
> +}
> +
> +static void iommu_device_build_info(struct device *dev,
> +				    struct iommu_device_info *info)
> +{
> +	bool snoop;
> +	u64 awidth, pgsizes;
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
> +		info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
> +		info->pgsize_bitmap = pgsizes;
> +		info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
> +	}
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
> +		info->addr_width = awidth;
> +		info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
> +	}
> +}
> +
> +static int iommufd_get_device_info(struct iommufd_ctx *ictx,
> +				   unsigned long arg)
> +{
> +	struct iommu_device_info info;
> +	unsigned long minsz;
> +	struct device *dev;
> +
> +	minsz = offsetofend(struct iommu_device_info, addr_width);
> +
> +	if (copy_from_user(&info, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz)
> +		return -EINVAL;
> +
> +	info.flags = 0;
> +
> +	dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
> +	if (!dev)
> +		return -EINVAL;
> +
> +	iommu_device_build_info(dev, &info);
> +
> +	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
> +}
> +
>  static long iommufd_fops_unl_ioctl(struct file *filep,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
>  		return ret;
>  
>  	switch (cmd) {
> +	case IOMMU_DEVICE_GET_INFO:
> +		ret = iommufd_get_device_info(ictx, arg);
> +		break;
>  	default:
>  		pr_err_ratelimited("unsupported cmd %u\n", cmd);
>  		break;
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 59178fc229ca..76b71f9d6b34 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -7,6 +7,55 @@
>  #define _UAPI_IOMMU_H
>  
>  #include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/* -------- IOCTLs for IOMMU file descriptor (/dev/iommu) -------- */
> +
> +#define IOMMU_TYPE	(';')
> +#define IOMMU_BASE	100
> +
> +/*
> + * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
> + *				struct iommu_device_info)
> + *
> + * Check IOMMU capabilities and format information on a bound device.
> + *
> + * The device is identified by device cookie (registered when binding
> + * this device).
> + *
> + * @argsz:	   user filled size of this data.
> + * @flags:	   tells userspace which capability info is available
> + * @dev_cookie:	   user assinged cookie.
> + * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
> + *		   bit in pgsize_bitmap[63:12] indicates a supported
> + *		   page size. Details as below table:
> + *
> + *		   +===============+============+
> + *		   |  Bit[index]   |  Page Size |
> + *		   +---------------+------------+
> + *		   |  12           |  4 KB      |
> + *		   +---------------+------------+
> + *		   |  13           |  8 KB      |
> + *		   +---------------+------------+
> + *		   |  14           |  16 KB     |
> + *		   +---------------+------------+
> + *		   ...
> + * @addr_width:    the address width of supported I/O address spaces.
> + *
> + * Availability: after device is bound to iommufd
> + */
> +struct iommu_device_info {
> +	__u32	argsz;
> +	__u32	flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
> +#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_wdith field valid */
> +	__u64	dev_cookie;
> +	__u64   pgsize_bitmap;
> +	__u32	addr_width;

I think this is where you should be reporting available IOVA windows,
rather than just an address width.  I know that for ppc a real
situation will be to have two different windows of different sizes:
that is the effective address width depends on which IOVA window
you're mapping into.


> +};
> +
> +#define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
>  
>  #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
>  #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  5:38     ` Tian, Kevin
@ 2021-09-29  6:35       ` David Gibson
  2021-09-29  7:31         ` Tian, Kevin
  2021-09-29 12:57         ` Jason Gunthorpe
  0 siblings, 2 replies; 280+ messages in thread
From: David Gibson @ 2021-09-29  6:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, jgg, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2773 bytes --]

On Wed, Sep 29, 2021 at 05:38:56AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Wednesday, September 29, 2021 12:56 PM
> > 
> > >
> > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > as the contract to get security context established (and vice versa
> > > for unbinding). One additional requirement in iommufd is to manage the
> > > switch between multiple security contexts due to decoupled bind/attach:
> > >
> > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > 
> > Probably worth clarifying that (1) must happen for *all* devices in
> > the group before (2) happens for any device in the group.
> 
> No. User access is naturally blocked for other devices as long as they
> are not opened yet.

Uh... my point is that everything in the group has to be removed from
regular kernel drivers before we reach step (2).  Is the plan that you
must do that before you can even open them?  That's a reasonable
choice, but then I think you should show that step in this description
as well.

> > > 2)  Bind the device to an iommufd with an initial security context
> > >     (an empty iommu domain which blocks dma) established for its
> > >     group, with user access unblocked;
> > >
> > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > >     attached to this ioasid). Before attaching, the device should be first
> > >     detached from the initial context;
> > 
> > So, this step can implicitly but observably change the behaviour for
> > other devices in the group as well.  I don't love that kind of
> > difficult to predict side effect, which is why I'm *still* not totally
> > convinced by the device-centric model.
> 
> which side-effect is predicted here? The user anyway needs to be
> aware of such group restriction regardless whether it uses group
> or nongroup interface.

Yes, exactly.  And with a group interface it's obvious it has to
understand it.  With the non-group interface, you can get to this
stage in ignorance of groups.  It will even work as long as you are
lucky enough only to try with singleton-group devices.  Then you try
it with two devices in the one group and doing (3) on device A will
implicitly change the DMA environment of device B.

(or at least, it will if they share a group because they don't have
distinguishable RIDs.  That's not the only multi-device group case,
but it's one of them).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-29  6:00   ` David Gibson
@ 2021-09-29  6:41     ` Tian, Kevin
  2021-09-29 12:28       ` Jason Gunthorpe
  2021-09-30  3:12       ` David Gibson
  0 siblings, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  6:41 UTC (permalink / raw)
  To: David Gibson, Liu, Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, September 29, 2021 2:01 PM
> 
> On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the
> vfio
> > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is
> provided
> > because it's implicitly done when the device fd is closed.
> >
> > In concept a vfio device can be bound to multiple iommufds, each hosting
> > a subset of I/O address spaces attached by this device.
> 
> I really feel like this many<->many mapping between devices is going
> to be super-confusing, and therefore make it really hard to be
> confident we have all the rules right for proper isolation.

Based on new discussion on group ownership part (patch06), I feel this
many<->many relationship will disappear. The context fd (either container
or iommufd) will uniquely mark the ownership on a physical device and
its group. With this design it's impractical to have one device bound
to multiple iommufds. Actually I don't think this is a compelling usage
in reality. The previous rationale was that no need to impose such restriction
if no special reason... and now we have a reason. 😊

Jason, are you OK with this simplification?

> 
> That's why I was suggesting a concept like endpoints, to break this
> into two many<->one relationships.  I'm ok if that isn't visible in
> the user API, but I think this is going to be really hard to keep
> track of if it isn't explicit somewhere in the internals.
> 

I think this endpoint concept is represented by ioas_device_info in
patch14:

+/*
+ * An ioas_device_info object is created per each successful attaching
+ * request. A list of objects are maintained per ioas when the address
+ * space is shared by multiple devices.
+ */
+struct ioas_device_info {
+	struct iommufd_device *idev;
+	struct list_head next;
 };

currently it's 1:1 mapping before this object and iommufd_device, 
because no pasid support yet.

We can rename it to struct ioas_endpoint if it makes you feel better.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-29  5:30     ` Tian, Kevin
@ 2021-09-29  7:08       ` Cornelia Huck
  2021-09-29 12:15         ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Cornelia Huck @ 2021-09-29  7:08 UTC (permalink / raw)
  To: Tian, Kevin, David Gibson, Liu, Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

On Wed, Sep 29 2021, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> From: David Gibson <david@gibson.dropbear.id.au>
>> Sent: Wednesday, September 29, 2021 10:44 AM
>> 
>> > One alternative option is to arrange device nodes in sub-directories based
>> > on the device type. But doing so also adds one trouble to userspace. The
>> > current vfio uAPI is designed to have the user query device type via
>> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
>> > instead needs to figure out the device type before opening the device, to
>> > identify the sub-directory.
>> 
>> Wouldn't this be up to the operator / configuration, rather than the
>> actual software though?  I would assume that typically the VFIO
>> program would be pointed at a specific vfio device node file to use,
>> e.g.
>> 	my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1
>> 
>> Or more generally, if you're expecting userspace to know a name in a
>> uniqu pattern, they can equally well know a "type/name" pair.
>> 
>
> You are correct. Currently:
>
> -device, vfio-pci,host=DDDD:BB:DD.F
> -device, vfio-pci,sysfdev=/sys/bus/pci/devices/ DDDD:BB:DD.F
> -device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00
>
> above is definitely type/name information to find the related node. 
>
> Actually even for Jason's proposal we still need such information to
> identify the sysfs path.
>
> Then I feel type-based sub-directory does work. Adding another link
> to sysfs sounds unnecessary now. But I'm not sure whether we still
> want to create /dev/vfio/devices/vfio0 thing and related udev rule
> thing that you pointed out in another mail.

Still reading through this whole thread, but type-based subdirectories
also make the most sense to me. I don't really see userspace wanting to
grab just any device and then figure out whether it is the device it was
looking for, but rather immediately go to a specific device or at least
a device of a specific type.

Sequentially-numbered devices tend to become really unwieldy in my
experience if you are working on a system with loads of devices.


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  6:35       ` David Gibson
@ 2021-09-29  7:31         ` Tian, Kevin
  2021-09-30  3:05           ` David Gibson
  2021-09-29 12:57         ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  7:31 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, jgg, parav, alex.williamson, lkml, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, iommu, pbonzini, robin.murphy

> From: David Gibson
> Sent: Wednesday, September 29, 2021 2:35 PM
> 
> On Wed, Sep 29, 2021 at 05:38:56AM +0000, Tian, Kevin wrote:
> > > From: David Gibson <david@gibson.dropbear.id.au>
> > > Sent: Wednesday, September 29, 2021 12:56 PM
> > >
> > > >
> > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > as the contract to get security context established (and vice versa
> > > > for unbinding). One additional requirement in iommufd is to manage
> the
> > > > switch between multiple security contexts due to decoupled
> bind/attach:
> > > >
> > > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > >
> > > Probably worth clarifying that (1) must happen for *all* devices in
> > > the group before (2) happens for any device in the group.
> >
> > No. User access is naturally blocked for other devices as long as they
> > are not opened yet.
> 
> Uh... my point is that everything in the group has to be removed from
> regular kernel drivers before we reach step (2).  Is the plan that you
> must do that before you can even open them?  That's a reasonable
> choice, but then I think you should show that step in this description
> as well.

Agree. I think below proposal can meet above requirement and ensure
it's not broken in the whole process when the group is operated by the
userspace:

https://lore.kernel.org/kvm/20210928140712.GL964074@nvidia.com/

and definitely an updated description will be provided when sending out
the new proposal.

> 
> > > > 2)  Bind the device to an iommufd with an initial security context
> > > >     (an empty iommu domain which blocks dma) established for its
> > > >     group, with user access unblocked;
> > > >
> > > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > > >     attached to this ioasid). Before attaching, the device should be first
> > > >     detached from the initial context;
> > >
> > > So, this step can implicitly but observably change the behaviour for
> > > other devices in the group as well.  I don't love that kind of
> > > difficult to predict side effect, which is why I'm *still* not totally
> > > convinced by the device-centric model.
> >
> > which side-effect is predicted here? The user anyway needs to be
> > aware of such group restriction regardless whether it uses group
> > or nongroup interface.
> 
> Yes, exactly.  And with a group interface it's obvious it has to
> understand it.  With the non-group interface, you can get to this
> stage in ignorance of groups.  It will even work as long as you are
> lucky enough only to try with singleton-group devices.  Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

for non-group we can also document it obviously in uAPI that the user
must understand group restriction and violating it will get failure
when attaching to different IOAS's for devices in the same group.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23 12:22               ` Jason Gunthorpe
@ 2021-09-29  8:48                 ` Tian, Kevin
  2021-09-29 12:36                   ` Jason Gunthorpe
  2021-09-30  8:49                 ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29  8:48 UTC (permalink / raw)
  To: Jason Gunthorpe, robin.murphy
  Cc: Jean-Philippe Brucker, Alex Williamson, Liu, Yi L, hch, jasowang,
	joro, parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

+Robin.

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 8:22 PM
> 
> On Thu, Sep 23, 2021 at 12:05:29PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, September 23, 2021 7:27 PM
> > >
> > > On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> > >
> > > > So we can only tell userspace "No_snoop is not supported" (provided
> we
> > > > even want to allow them to enable No_snoop). Users in control of
> stage-1
> > > > tables can create non-cacheable mappings through MAIR attributes.
> > >
> > > My point is that ARM is using IOMMU_CACHE to control the overall
> > > cachability of the DMA
> > >
> > > ie not specifying IOMMU_CACHE requires using the arch specific DMA
> > > cache flushers.
> > >
> > > Intel never uses arch specifc DMA cache flushers, and instead is
> > > abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA
> that
> > > is always
> > > cachable.
> >
> > it uses IOMMU_CACHE to force all DMAs to snoop, including those which
> > has non_snoop flag and wouldn't snoop cache if iommu is disabled.
> Nothing
> > is blocked.
> 
> I see it differently, on Intel the only way to bypass the cache with
> DMA is to specify the no-snoop bit in the TLP. The IOMMU PTE flag we
> are talking about tells the IOMMU to ignore the no snoop bit.
> 
> Again, Intel arch in the kernel does not support the DMA cache flush
> arch API and *DOES NOT* support incoherent DMA at all.
> 
> ARM *does* implement the DMA cache flush arch API and is using
> IOMMU_CACHE to control if the caller will, or will not call the cache
> flushes.

I still didn't fully understand this point after reading the code. Looking
at dma-iommu its cache flush functions are all coded with below as
the first check:

        if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
                return;

dev->dma_coherent is initialized upon firmware info, not decided by 
IOMMU_CACHE.

i.e. it's not IOMMU_CACHE to decide whether cache flushes should
be called.

Probably the confusion comes from __iommu_dma_alloc_noncontiguous:

        if (!(ioprot & IOMMU_CACHE)) {
                struct scatterlist *sg;
                int i;

                for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
                        arch_dma_prep_coherent(sg_page(sg), sg->length);
        }

Here it makes more sense to be if (!coherent) {}.

with above being corrected, I think all iommu drivers do associate 
IOMMU_CACHE to the snoop aspect:

Intel:
    - either force snooping by ignoring snoop bit in TLP (IOMMU_CACHE)
    - or has snoop decided by TLP (!IOMMU_CACHE)

ARM:
    - set to snoop format if IOMMU_CACHE
    - set to nonsnoop format if !IOMMU_CACHE
(in both cases TLP snoop bit is ignored?)

Other archs
    - ignore IOMMU_CACHE as cache is always snooped via their IOMMUs

> 
> This is fundamentally different from what Intel is using it for.
> 
> > but why do you call it abuse? IOMMU_CACHE was first introduced for
> > Intel platform:
> 
> IMHO ARM changed the meaning when Robin linked IOMMU_CACHE to
> dma_is_coherent stuff. At that point it became linked to 'do I need to
> call arch cache flushers or not'.
> 

I didn't identify the exact commit for above meaning change.

Robin, could you help share some thoughts here?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-29  2:52   ` David Gibson
@ 2021-09-29  9:25     ` Lu Baolu
  2021-09-29  9:29       ` Lu Baolu
  0 siblings, 1 reply; 280+ messages in thread
From: Lu Baolu @ 2021-09-29  9:25 UTC (permalink / raw)
  To: David Gibson, Liu Yi L
  Cc: baolu.lu, alex.williamson, jgg, hch, jasowang, joro,
	jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu,
	dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, nicolinc

Hi David,

On 2021/9/29 10:52, David Gibson wrote:
> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
>> From: Lu Baolu<baolu.lu@linux.intel.com>
>>
>> This provides an interface for upper layers to get the per-device iommu
>> attributes.
>>
>>      int iommu_device_get_info(struct device *dev,
>>                                enum iommu_devattr attr, void *data);
> That fact that this interface doesn't let you know how to size the
> data buffer, other than by just knowing the right size for each attr
> concerns me.
> 

We plan to address this by following the comments here.

https://lore.kernel.org/linux-iommu/20210921161930.GP327412@nvidia.com/

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-29  9:25     ` Lu Baolu
@ 2021-09-29  9:29       ` Lu Baolu
  0 siblings, 0 replies; 280+ messages in thread
From: Lu Baolu @ 2021-09-29  9:29 UTC (permalink / raw)
  To: David Gibson, Liu Yi L
  Cc: baolu.lu, alex.williamson, jgg, hch, jasowang, joro,
	jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu,
	dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, nicolinc

On 2021/9/29 17:25, Lu Baolu wrote:
> Hi David,
> 
> On 2021/9/29 10:52, David Gibson wrote:
>> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
>>> From: Lu Baolu<baolu.lu@linux.intel.com>
>>>
>>> This provides an interface for upper layers to get the per-device iommu
>>> attributes.
>>>
>>>      int iommu_device_get_info(struct device *dev,
>>>                                enum iommu_devattr attr, void *data);
>> That fact that this interface doesn't let you know how to size the
>> data buffer, other than by just knowing the right size for each attr
>> concerns me.
>>
> 
> We plan to address this by following the comments here.
> 
> https://lore.kernel.org/linux-iommu/20210921161930.GP327412@nvidia.com/

And Christoph gave another option as well.

https://lore.kernel.org/linux-iommu/20210922050746.GA12921@lst.de/

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-22 14:49   ` Jean-Philippe Brucker
@ 2021-09-29 10:44     ` Liu, Yi L
  2021-09-29 12:07       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 280+ messages in thread
From: Liu, Yi L @ 2021-09-29 10:44 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: alex.williamson, jgg, hch, jasowang, joro, Tian, Kevin, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Wednesday, September 22, 2021 10:49 PM
> 
> On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> > [HACK. will fix in v2]
> >
> > IOVA range is critical info for userspace to manage DMA for an I/O address
> > space. This patch reports the valid iova range info of a given device.
> >
> > Due to aforementioned hack, this info comes from the hacked vfio type1
> > driver. To follow the same format in vfio, we also introduce a cap chain
> > format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
> [...]
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 49731be71213..f408ad3c8ade 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -68,6 +68,7 @@
> >   *		   +---------------+------------+
> >   *		   ...
> >   * @addr_width:    the address width of supported I/O address spaces.
> > + * @cap_offset:	   Offset within info struct of first cap
> >   *
> >   * Availability: after device is bound to iommufd
> >   */
> > @@ -77,9 +78,11 @@ struct iommu_device_info {
> >  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU
> enforced snoop */
> >  #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page
> sizes */
> >  #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /*
> addr_wdith field valid */
> > +#define IOMMU_DEVICE_INFO_CAPS		(1 << 3) /* info
> supports cap chain */
> >  	__u64	dev_cookie;
> >  	__u64   pgsize_bitmap;
> >  	__u32	addr_width;
> > +	__u32   cap_offset;
> 
> We can also add vendor-specific page table and PASID table properties as
> capabilities, otherwise we'll need giant unions in the iommu_device_info
> struct. That made me wonder whether pgsize and addr_width should also
> be
> separate capabilities for consistency, but this way might be good enough.
> There won't be many more generic capabilities. I have "output address
> width"

what do you mean by "output address width"? Is it the output address
of stage-1 translation?

>
and "PASID width", the rest is specific to Arm and SMMU table
> formats.

When coming to nested translation support, the stage-1 related info are
likely to be vendor-specific, and will be reported in cap chain.

Regards,
Yi Liu

> Thanks,
> Jean
> 
> >  };
> >
> >  #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE +
> 1)
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22 13:45   ` Jean-Philippe Brucker
@ 2021-09-29 10:47     ` Liu, Yi L
  0 siblings, 0 replies; 280+ messages in thread
From: Liu, Yi L @ 2021-09-29 10:47 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: alex.williamson, jgg, hch, jasowang, joro, Tian, Kevin, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Wednesday, September 22, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type
> (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enfore_snoop format. enforce_snoop must be true at this
> point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> 
> Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
> what it's used for or why it's mandatory. But for PPC it sounds like it
> should be an address range instead of an upper limit?

yes, as this open described, it may need to be a range. But not sure
if PPC requires multiple ranges or just one range. Perhaps, David may
guide there.

Regards,
Yi Liu
 
> Thanks,
> Jean
> 
> >   This requirement doesn't sound PPC specific, as addr_width for pci
> devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1
> discussion
> >   and then decide how to incorporate it in v2.
> >
> > - Currently ioasid term has already been used in the kernel
> (drivers/iommu/
> >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-
> Stream
> >   ID). We need find a way to resolve the naming conflict between the
> hardware
> >   ID and software handle. One option is to rename the existing ioasid to be
> >   pasid or ssid, given their full names still sound generic. Appreciate more
> >   thoughts on this open!

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-29 10:44     ` Liu, Yi L
@ 2021-09-29 12:07       ` Jean-Philippe Brucker
  2021-09-29 12:31         ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-29 12:07 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, Tian, Kevin, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, Raj, Ashok,
	yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Wed, Sep 29, 2021 at 10:44:01AM +0000, Liu, Yi L wrote:
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Sent: Wednesday, September 22, 2021 10:49 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> > > [HACK. will fix in v2]
> > >
> > > IOVA range is critical info for userspace to manage DMA for an I/O address
> > > space. This patch reports the valid iova range info of a given device.
> > >
> > > Due to aforementioned hack, this info comes from the hacked vfio type1
> > > driver. To follow the same format in vfio, we also introduce a cap chain
> > > format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
> > [...]
> > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > > index 49731be71213..f408ad3c8ade 100644
> > > --- a/include/uapi/linux/iommu.h
> > > +++ b/include/uapi/linux/iommu.h
> > > @@ -68,6 +68,7 @@
> > >   *		   +---------------+------------+
> > >   *		   ...
> > >   * @addr_width:    the address width of supported I/O address spaces.
> > > + * @cap_offset:	   Offset within info struct of first cap
> > >   *
> > >   * Availability: after device is bound to iommufd
> > >   */
> > > @@ -77,9 +78,11 @@ struct iommu_device_info {
> > >  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU
> > enforced snoop */
> > >  #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page
> > sizes */
> > >  #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /*
> > addr_wdith field valid */
> > > +#define IOMMU_DEVICE_INFO_CAPS		(1 << 3) /* info
> > supports cap chain */
> > >  	__u64	dev_cookie;
> > >  	__u64   pgsize_bitmap;
> > >  	__u32	addr_width;
> > > +	__u32   cap_offset;
> > 
> > We can also add vendor-specific page table and PASID table properties as
> > capabilities, otherwise we'll need giant unions in the iommu_device_info
> > struct. That made me wonder whether pgsize and addr_width should also
> > be
> > separate capabilities for consistency, but this way might be good enough.
> > There won't be many more generic capabilities. I have "output address
> > width"
> 
> what do you mean by "output address width"? Is it the output address
> of stage-1 translation?

Yes, so the guest knows the size of GPA it can write into the page table.
For Arm SMMU the GPA size is determined by both the SMMU implementation
and the host kernel configuration. But maybe that could also be
vendor-specific, if other architectures don't need to communicate it. 

> >
> and "PASID width", the rest is specific to Arm and SMMU table
> > formats.
> 
> When coming to nested translation support, the stage-1 related info are
> likely to be vendor-specific, and will be reported in cap chain.

Agreed

Thanks,
Jean

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-29  7:08       ` Cornelia Huck
@ 2021-09-29 12:15         ` Jason Gunthorpe
  0 siblings, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:15 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Tian, Kevin, David Gibson, Liu, Yi L, alex.williamson, hch,
	jasowang, joro, jean-philippe, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao,
	Jiang, Dave, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, baolu.lu, nicolinc

On Wed, Sep 29, 2021 at 09:08:25AM +0200, Cornelia Huck wrote:
> On Wed, Sep 29 2021, "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> >> From: David Gibson <david@gibson.dropbear.id.au>
> >> Sent: Wednesday, September 29, 2021 10:44 AM
> >> 
> >> > One alternative option is to arrange device nodes in sub-directories based
> >> > on the device type. But doing so also adds one trouble to userspace. The
> >> > current vfio uAPI is designed to have the user query device type via
> >> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
> >> > instead needs to figure out the device type before opening the device, to
> >> > identify the sub-directory.
> >> 
> >> Wouldn't this be up to the operator / configuration, rather than the
> >> actual software though?  I would assume that typically the VFIO
> >> program would be pointed at a specific vfio device node file to use,
> >> e.g.
> >> 	my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1
> >> 
> >> Or more generally, if you're expecting userspace to know a name in a
> >> uniqu pattern, they can equally well know a "type/name" pair.
> >> 
> >
> > You are correct. Currently:
> >
> > -device, vfio-pci,host=DDDD:BB:DD.F
> > -device, vfio-pci,sysfdev=/sys/bus/pci/devices/ DDDD:BB:DD.F
> > -device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00
> >
> > above is definitely type/name information to find the related node. 
> >
> > Actually even for Jason's proposal we still need such information to
> > identify the sysfs path.
> >
> > Then I feel type-based sub-directory does work. Adding another link
> > to sysfs sounds unnecessary now. But I'm not sure whether we still
> > want to create /dev/vfio/devices/vfio0 thing and related udev rule
> > thing that you pointed out in another mail.
> 
> Still reading through this whole thread, but type-based subdirectories
> also make the most sense to me. I don't really see userspace wanting to
> grab just any device and then figure out whether it is the device it was
> looking for, but rather immediately go to a specific device or at least
> a device of a specific type.

Even so the kernel should not be creating this, that is a job for
udev and some symlinks

> Sequentially-numbered devices tend to become really unwieldy in my
> experience if you are working on a system with loads of devices.

If the user experiance is always to refer to the sysfs node as Kevin
shows above then the user never sees the integer.

It is very much like how the group number works already, programs
always start at the sysfs, do the readlink thing on iommu_group and
then get the group number to go to /dev/vfio/X

So it is already the case that every piece of software can construct a
sysfs path to the device, we are just changing from
readlink(iommu_group) to readdir(vfio/vfio_device_XX)

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-29  2:46         ` david
@ 2021-09-29 12:22           ` Jason Gunthorpe
  2021-09-30  2:48             ` david
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:22 UTC (permalink / raw)
  To: david
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

On Wed, Sep 29, 2021 at 12:46:14PM +1000, david@gibson.dropbear.id.au wrote:
> On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, September 22, 2021 12:01 AM
> > > > 
> > > > >  One open about how to organize the device nodes under
> > > > /dev/vfio/devices/.
> > > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > > devname
> > > > > from all kinds of devices. The prerequisite of this model is that devnames
> > > > > from different bus types are unique formats:
> > > > 
> > > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > > > 
> > > > The userspace can learn the correct major/minor by inspecting the
> > > > sysfs.
> > > > 
> > > > This whole concept should disappear into the prior patch that adds the
> > > > struct device in the first place, and I think most of the code here
> > > > can be deleted once the struct device is used properly.
> > > > 
> > > 
> > > Can you help elaborate above flow? This is one area where we need
> > > more guidance.
> > > 
> > > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > > how does Qemu identify which vifo0/1/... is associated with the specified 
> > > DDDD:BB:DD.F? 
> > 
> > When done properly in the kernel the file:
> > 
> > /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> > 
> > Will contain the major:minor of the VFIO device.
> > 
> > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > that the major:minor matches.
> > 
> > in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> > to qemu.
> 
> I thought part of the appeal of the device centric model was less
> grovelling around in sysfs for information.  Using type/address
> directly in /dev seems simpler than having to dig around matching
> things here.

I would say more regular grovelling. Starting from a sysfs device
directory and querying the VFIO cdev associated with it is much more
normal than what happens today, which also includes passing sysfs
information into an ioctl :\

> Note that this doesn't have to be done in kernel: you could have the
> kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
> that create symlinks from say /dev/vfio/pci/DDDD:BB:SS.F - >
> ../devices/vfioXX based on the sysfs information.

This is the right approach if people want to do this, but I'm not sure
it is worth it given backwards compat requires the sysfs path as
input. We may as well stick with sysfs as the command line interface
for userspace tools.

And I certainly don't want to see userspace tools trying to reverse a
sysfs path into a /dev/ symlink name when they can directly and
reliably learn the correct cdev from the sysfspath.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-29  5:25   ` David Gibson
@ 2021-09-29 12:24     ` Jason Gunthorpe
  2021-09-30  3:10       ` David Gibson
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:24 UTC (permalink / raw)
  To: David Gibson
  Cc: Liu Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:

> > +struct iommufd_device {
> > +	unsigned int id;
> > +	struct iommufd_ctx *ictx;
> > +	struct device *dev; /* always be the physical device */
> > +	u64 dev_cookie;
> 
> Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
> unique, couldn't you just use the cookie directly as the index into
> the xarray?

ID is the kernel value in the xarray - xarray is much more efficient &
safe with small kernel controlled values.

dev_cookie is a user assigned value that may not be unique. It's
purpose is to allow userspace to receive and event and go back to its
structure. Most likely userspace will store a pointer here, but it is
also possible userspace could not use it.

It is a pretty normal pattern

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-29  6:41     ` Tian, Kevin
@ 2021-09-29 12:28       ` Jason Gunthorpe
  2021-09-29 22:34         ` Tian, Kevin
  2021-09-30  3:12       ` David Gibson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: David Gibson, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

On Wed, Sep 29, 2021 at 06:41:00AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Wednesday, September 29, 2021 2:01 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the
> > vfio
> > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is
> > provided
> > > because it's implicitly done when the device fd is closed.
> > >
> > > In concept a vfio device can be bound to multiple iommufds, each hosting
> > > a subset of I/O address spaces attached by this device.
> > 
> > I really feel like this many<->many mapping between devices is going
> > to be super-confusing, and therefore make it really hard to be
> > confident we have all the rules right for proper isolation.
> 
> Based on new discussion on group ownership part (patch06), I feel this
> many<->many relationship will disappear. The context fd (either container
> or iommufd) will uniquely mark the ownership on a physical device and
> its group. With this design it's impractical to have one device bound
> to multiple iommufds. 

That should be a requirement! We have no way to prove that two
iommufds are the same security domain, so devices/groups cannot be
shared.

That is why the API I suggested takes in a struct file to ID the user
security context. A group is accessible only from that single struct
file and no more.

If the first series goes the way I outlined then I think David's
concern about security is strongly solved as the IOMMU layer is
directly managing it with a very clear responsiblity and semantic.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-29 12:07       ` Jean-Philippe Brucker
@ 2021-09-29 12:31         ` Jason Gunthorpe
  0 siblings, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:31 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 29, 2021 at 01:07:56PM +0100, Jean-Philippe Brucker wrote:

> Yes, so the guest knows the size of GPA it can write into the page table.
> For Arm SMMU the GPA size is determined by both the SMMU implementation
> and the host kernel configuration. But maybe that could also be
> vendor-specific, if other architectures don't need to communicate it. 

I think there should be a dedicated query to return HW specific
parmaters for a user page table format. Somehow I think there will be
a lot of these.

So 'user page table format arm smmu v1' can be queried to return its
own unique struct that has everything needed to operate that format of
page table.

Userspace already needs to know how to form that specific HW PTEs,
so processing a HW specific query is not a problem.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-29  8:48                 ` Tian, Kevin
@ 2021-09-29 12:36                   ` Jason Gunthorpe
  2021-09-30  8:30                     ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: robin.murphy, Jean-Philippe Brucker, Alex Williamson, Liu, Yi L,
	hch, jasowang, joro, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao,
	Jiang, Dave, jacob.jun.pan, kwankhede, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:

> ARM:
>     - set to snoop format if IOMMU_CACHE
>     - set to nonsnoop format if !IOMMU_CACHE
> (in both cases TLP snoop bit is ignored?)

Where do you see this? I couldn't even find this functionality in the
ARM HW manual??
 
What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
the cache coherence to be disabled, which is not ignoring no snoop.

> I didn't identify the exact commit for above meaning change.
> 
> Robin, could you help share some thoughts here?

It is this:

static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
		     unsigned long attrs)
{
	int prot = coherent ? IOMMU_CACHE : 0;

Which sets IOMMU_CACHE based on:

static void *iommu_dma_alloc(struct device *dev, size_t size,
		dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
{
	bool coherent = dev_is_dma_coherent(dev);
	int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs); 

Driving IOMMU_CACHE from dev_is_dma_coherent() has *NOTHING* to do
with no-snoop TLPs and everything to do with the arch cache
maintenance API

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  6:35       ` David Gibson
  2021-09-29  7:31         ` Tian, Kevin
@ 2021-09-29 12:57         ` Jason Gunthorpe
  2021-09-30  3:09           ` David Gibson
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:57 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:

> Yes, exactly.  And with a group interface it's obvious it has to
> understand it.  With the non-group interface, you can get to this
> stage in ignorance of groups.  It will even work as long as you are
> lucky enough only to try with singleton-group devices.  Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

The security model here says this is fine.

This idea to put the iommu code in charge of security is quite clean,
as I said in the other mail drivers attached to 'struct devices *'
tell the iommu layer what they are are doing:

   iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
   iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
   iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

And it decides if it is allowed.

If device A is allowed to go to userspace then security wise it is
deemed fine that B is impacted. That is what we have defined already
today.

This proposal does not free userpace from having to understand this!
The iommu_group sysfs is still there and still must be understood.

The *admin* the one responsible to understand the groups, not the
applications. The admin has no idea what a group FD is - they should
be looking at the sysfs and seeing the iommu_group directories.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  0:38                     ` Tian, Kevin
@ 2021-09-29 12:59                       ` Jason Gunthorpe
  2021-10-15  1:29                         ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 12:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Lu Baolu, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

On Wed, Sep 29, 2021 at 12:38:35AM +0000, Tian, Kevin wrote:

> /* If set the driver must call iommu_XX as the first action in probe() or
>   * before it attempts to do DMA
>   */
>  bool suppress_dma_owner:1;

It is not "attempts to do DMA" but more "operates the physical device
in any away"

Not having ownership means another entity could be using user space
DMA to manipulate the device state and attack the integrity of the
kernel's programming of the device.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-29  2:08   ` David Gibson
@ 2021-09-29 19:05     ` Alex Williamson
  2021-09-30  2:43       ` David Gibson
  2021-10-20 12:39     ` Liu, Yi L
  1 sibling, 1 reply; 280+ messages in thread
From: Alex Williamson @ 2021-09-29 19:05 UTC (permalink / raw)
  To: David Gibson
  Cc: Liu Yi L, jgg, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

On Wed, 29 Sep 2021 12:08:59 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> > 
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> > 
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>  
> [snip]
> 
> > +static bool vfio_device_in_container(struct vfio_device *device)
> > +{
> > +	return !!(device->group && device->group->container);  
> 
> You don't need !! here.  && is already a logical operation, so returns
> a valid bool.
> 
> > +}
> > +
> >  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
> >  {
> >  	struct vfio_device *device = filep->private_data;
> > @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
> >  
> >  	module_put(device->dev->driver->owner);
> >  
> > -	vfio_group_try_dissolve_container(device->group);
> > +	if (vfio_device_in_container(device)) {
> > +		vfio_group_try_dissolve_container(device->group);
> > +	} else {
> > +		atomic_dec(&device->opened);
> > +		if (device->group) {
> > +			mutex_lock(&device->group->opened_lock);
> > +			device->group->opened--;
> > +			mutex_unlock(&device->group->opened_lock);
> > +		}
> > +	}
> >  
> >  	vfio_device_put(device);
> >  
> > @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
> >  
> >  static const struct file_operations vfio_device_fops = {
> >  	.owner		= THIS_MODULE,
> > +	.open		= vfio_device_fops_open,
> >  	.release	= vfio_device_fops_release,
> >  	.read		= vfio_device_fops_read,
> >  	.write		= vfio_device_fops_write,
> > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
> >  	.mode = S_IRUGO | S_IWUGO,
> >  };
> >  
> > +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> > +{
> > +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));  
> 
> Others have pointed out some problems with the use of dev_name()
> here.  I'll add that I think you'll make things much easier if instead
> of using one huge "devices" subdir, you use a separate subdir for each
> vfio sub-driver (so, one for PCI, one for each type of mdev, one for
> platform, etc.).  That should make avoiding name conflicts a lot simpler.

It seems like this is unnecessary if we use the vfioX naming approach.
Conflicts are trivial to ignore if we don't involve dev_name() and
looking for the correct major:minor chardev in the correct subdirectory
seems like a hassle for userspace.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-29 12:28       ` Jason Gunthorpe
@ 2021-09-29 22:34         ` Tian, Kevin
  0 siblings, 0 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-29 22:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Gibson, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 29, 2021 8:28 PM
> 
> On Wed, Sep 29, 2021 at 06:41:00AM +0000, Tian, Kevin wrote:
> > > From: David Gibson <david@gibson.dropbear.id.au>
> > > Sent: Wednesday, September 29, 2021 2:01 PM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind
> the
> > > vfio
> > > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface
> is
> > > provided
> > > > because it's implicitly done when the device fd is closed.
> > > >
> > > > In concept a vfio device can be bound to multiple iommufds, each
> hosting
> > > > a subset of I/O address spaces attached by this device.
> > >
> > > I really feel like this many<->many mapping between devices is going
> > > to be super-confusing, and therefore make it really hard to be
> > > confident we have all the rules right for proper isolation.
> >
> > Based on new discussion on group ownership part (patch06), I feel this
> > many<->many relationship will disappear. The context fd (either container
> > or iommufd) will uniquely mark the ownership on a physical device and
> > its group. With this design it's impractical to have one device bound
> > to multiple iommufds.
> 
> That should be a requirement! We have no way to prove that two
> iommufds are the same security domain, so devices/groups cannot be
> shared.
> 
> That is why the API I suggested takes in a struct file to ID the user
> security context. A group is accessible only from that single struct
> file and no more.
> 
> If the first series goes the way I outlined then I think David's
> concern about security is strongly solved as the IOMMU layer is
> directly managing it with a very clear responsiblity and semantic.
> 

Yes, this is also my understanding now.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-29 19:05     ` Alex Williamson
@ 2021-09-30  2:43       ` David Gibson
  0 siblings, 0 replies; 280+ messages in thread
From: David Gibson @ 2021-09-30  2:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu Yi L, jgg, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 3689 bytes --]

On Wed, Sep 29, 2021 at 01:05:21PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 12:08:59 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > > userspace to directly open a vfio device w/o relying on container/group
> > > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > > manner.
> > > 
> > > In case a device is exposed in both legacy and new interfaces (see next
> > > patch for how to decide it), this patch also ensures that when the device
> > > is already opened via one interface then the other one must be blocked.
> > > 
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>  
> > [snip]
> > 
> > > +static bool vfio_device_in_container(struct vfio_device *device)
> > > +{
> > > +	return !!(device->group && device->group->container);  
> > 
> > You don't need !! here.  && is already a logical operation, so returns
> > a valid bool.
> > 
> > > +}
> > > +
> > >  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
> > >  {
> > >  	struct vfio_device *device = filep->private_data;
> > > @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
> > >  
> > >  	module_put(device->dev->driver->owner);
> > >  
> > > -	vfio_group_try_dissolve_container(device->group);
> > > +	if (vfio_device_in_container(device)) {
> > > +		vfio_group_try_dissolve_container(device->group);
> > > +	} else {
> > > +		atomic_dec(&device->opened);
> > > +		if (device->group) {
> > > +			mutex_lock(&device->group->opened_lock);
> > > +			device->group->opened--;
> > > +			mutex_unlock(&device->group->opened_lock);
> > > +		}
> > > +	}
> > >  
> > >  	vfio_device_put(device);
> > >  
> > > @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
> > >  
> > >  static const struct file_operations vfio_device_fops = {
> > >  	.owner		= THIS_MODULE,
> > > +	.open		= vfio_device_fops_open,
> > >  	.release	= vfio_device_fops_release,
> > >  	.read		= vfio_device_fops_read,
> > >  	.write		= vfio_device_fops_write,
> > > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
> > >  	.mode = S_IRUGO | S_IWUGO,
> > >  };
> > >  
> > > +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> > > +{
> > > +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));  
> > 
> > Others have pointed out some problems with the use of dev_name()
> > here.  I'll add that I think you'll make things much easier if instead
> > of using one huge "devices" subdir, you use a separate subdir for each
> > vfio sub-driver (so, one for PCI, one for each type of mdev, one for
> > platform, etc.).  That should make avoiding name conflicts a lot simpler.
> 
> It seems like this is unnecessary if we use the vfioX naming approach.
> Conflicts are trivial to ignore if we don't involve dev_name() and
> looking for the correct major:minor chardev in the correct subdirectory
> seems like a hassle for userspace.  Thanks,

Right.. it does sound like a hassle, but AFAICT that's *more*
necessary with /dev/vfio/vfioXX than with /dev/vfio/pci/DDDD:BB:SS.F,
since you have to look up a meaningful name in sysfs to find the right
devnode.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-29 12:22           ` Jason Gunthorpe
@ 2021-09-30  2:48             ` david
  0 siblings, 0 replies; 280+ messages in thread
From: david @ 2021-09-30  2:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 3781 bytes --]

On Wed, Sep 29, 2021 at 09:22:30AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 12:46:14PM +1000, david@gibson.dropbear.id.au wrote:
> > On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Wednesday, September 22, 2021 12:01 AM
> > > > > 
> > > > > >  One open about how to organize the device nodes under
> > > > > /dev/vfio/devices/.
> > > > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > > > devname
> > > > > > from all kinds of devices. The prerequisite of this model is that devnames
> > > > > > from different bus types are unique formats:
> > > > > 
> > > > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > > > > 
> > > > > The userspace can learn the correct major/minor by inspecting the
> > > > > sysfs.
> > > > > 
> > > > > This whole concept should disappear into the prior patch that adds the
> > > > > struct device in the first place, and I think most of the code here
> > > > > can be deleted once the struct device is used properly.
> > > > > 
> > > > 
> > > > Can you help elaborate above flow? This is one area where we need
> > > > more guidance.
> > > > 
> > > > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > > > how does Qemu identify which vifo0/1/... is associated with the specified 
> > > > DDDD:BB:DD.F? 
> > > 
> > > When done properly in the kernel the file:
> > > 
> > > /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> > > 
> > > Will contain the major:minor of the VFIO device.
> > > 
> > > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > > that the major:minor matches.
> > > 
> > > in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> > > to qemu.
> > 
> > I thought part of the appeal of the device centric model was less
> > grovelling around in sysfs for information.  Using type/address
> > directly in /dev seems simpler than having to dig around matching
> > things here.
> 
> I would say more regular grovelling. Starting from a sysfs device
> directory and querying the VFIO cdev associated with it is much more
> normal than what happens today, which also includes passing sysfs
> information into an ioctl :\

Hm.. ok.  Clearly I'm unfamiliar with the things that do that.  Other
than current VFIO, the only model I've really seen is where you just
point your program at a device node.

> > Note that this doesn't have to be done in kernel: you could have the
> > kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
> > that create symlinks from say /dev/vfio/pci/DDDD:BB:SS.F - >
> > ../devices/vfioXX based on the sysfs information.
> 
> This is the right approach if people want to do this, but I'm not sure
> it is worth it given backwards compat requires the sysfs path as
> input.

You mean for userspace that needs to be able to go back to the old
VFIO interface as well?  It seems silly to force this sysfs mucking
about on new programs that depend on the new interface.

> We may as well stick with sysfs as the command line interface
> for userspace tools.

> And I certainly don't want to see userspace tools trying to reverse a
> sysfs path into a /dev/ symlink name when they can directly and
> reliably learn the correct cdev from the sysfspath.

Um.. sure.. but they can get the correct cdev from the sysfspath no
matter how we name the cdevs.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29  7:31         ` Tian, Kevin
@ 2021-09-30  3:05           ` David Gibson
  0 siblings, 0 replies; 280+ messages in thread
From: David Gibson @ 2021-09-30  3:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, jgg, parav, alex.williamson, lkml, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, iommu, pbonzini, robin.murphy

[-- Attachment #1: Type: text/plain, Size: 3728 bytes --]

On Wed, Sep 29, 2021 at 07:31:08AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Wednesday, September 29, 2021 2:35 PM
> > 
> > On Wed, Sep 29, 2021 at 05:38:56AM +0000, Tian, Kevin wrote:
> > > > From: David Gibson <david@gibson.dropbear.id.au>
> > > > Sent: Wednesday, September 29, 2021 12:56 PM
> > > >
> > > > >
> > > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > > as the contract to get security context established (and vice versa
> > > > > for unbinding). One additional requirement in iommufd is to manage
> > the
> > > > > switch between multiple security contexts due to decoupled
> > bind/attach:
> > > > >
> > > > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > > >
> > > > Probably worth clarifying that (1) must happen for *all* devices in
> > > > the group before (2) happens for any device in the group.
> > >
> > > No. User access is naturally blocked for other devices as long as they
> > > are not opened yet.
> > 
> > Uh... my point is that everything in the group has to be removed from
> > regular kernel drivers before we reach step (2).  Is the plan that you
> > must do that before you can even open them?  That's a reasonable
> > choice, but then I think you should show that step in this description
> > as well.
> 
> Agree. I think below proposal can meet above requirement and ensure
> it's not broken in the whole process when the group is operated by the
> userspace:
> 
> https://lore.kernel.org/kvm/20210928140712.GL964074@nvidia.com/
> 
> and definitely an updated description will be provided when sending out
> the new proposal.
> 
> > 
> > > > > 2)  Bind the device to an iommufd with an initial security context
> > > > >     (an empty iommu domain which blocks dma) established for its
> > > > >     group, with user access unblocked;
> > > > >
> > > > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > > > >     attached to this ioasid). Before attaching, the device should be first
> > > > >     detached from the initial context;
> > > >
> > > > So, this step can implicitly but observably change the behaviour for
> > > > other devices in the group as well.  I don't love that kind of
> > > > difficult to predict side effect, which is why I'm *still* not totally
> > > > convinced by the device-centric model.
> > >
> > > which side-effect is predicted here? The user anyway needs to be
> > > aware of such group restriction regardless whether it uses group
> > > or nongroup interface.
> > 
> > Yes, exactly.  And with a group interface it's obvious it has to
> > understand it.  With the non-group interface, you can get to this
> > stage in ignorance of groups.  It will even work as long as you are
> > lucky enough only to try with singleton-group devices.  Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
> 
> for non-group we can also document it obviously in uAPI that the user
> must understand group restriction and violating it will get failure
> when attaching to different IOAS's for devices in the same group.

Documenting limitations is always inferior to building them into the
actual API signatures.  Sometimes its the only option, but people
frequently don't read the docs, whereas they kind of have to look at
the API itself.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-29 12:57         ` Jason Gunthorpe
@ 2021-09-30  3:09           ` David Gibson
  2021-09-30 22:28             ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-09-30  3:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2149 bytes --]

On Wed, Sep 29, 2021 at 09:57:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:
> 
> > Yes, exactly.  And with a group interface it's obvious it has to
> > understand it.  With the non-group interface, you can get to this
> > stage in ignorance of groups.  It will even work as long as you are
> > lucky enough only to try with singleton-group devices.  Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
> 
> The security model here says this is fine.

I'm not making a statement about the security model, I'm making a
statement about surprisingness of the programming interface.  In your
program you have devices A & B, you perform an operation that
specifies only device A and device B changes behaviour.

> This idea to put the iommu code in charge of security is quite clean,
> as I said in the other mail drivers attached to 'struct devices *'
> tell the iommu layer what they are are doing:
> 
>    iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>    iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>    iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
> 
> And it decides if it is allowed.
> 
> If device A is allowed to go to userspace then security wise it is
> deemed fine that B is impacted. That is what we have defined already
> today.
> 
> This proposal does not free userpace from having to understand this!
> The iommu_group sysfs is still there and still must be understood.
> 
> The *admin* the one responsible to understand the groups, not the
> applications. The admin has no idea what a group FD is - they should
> be looking at the sysfs and seeing the iommu_group directories.

Not just the admin.  If an app is given two devices in the same group
to use *both* it must understand that and act accordingly.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-29 12:24     ` Jason Gunthorpe
@ 2021-09-30  3:10       ` David Gibson
  2021-10-01 12:43         ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-09-30  3:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 1163 bytes --]

On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
65;6402;1c> On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> 
> > > +struct iommufd_device {
> > > +	unsigned int id;
> > > +	struct iommufd_ctx *ictx;
> > > +	struct device *dev; /* always be the physical device */
> > > +	u64 dev_cookie;
> > 
> > Why do you need both an 'id' and a 'dev_cookie'?  Since they're both
> > unique, couldn't you just use the cookie directly as the index into
> > the xarray?
> 
> ID is the kernel value in the xarray - xarray is much more efficient &
> safe with small kernel controlled values.
> 
> dev_cookie is a user assigned value that may not be unique. It's
> purpose is to allow userspace to receive and event and go back to its
> structure. Most likely userspace will store a pointer here, but it is
> also possible userspace could not use it.
> 
> It is a pretty normal pattern

Hm, ok.  Could you point me at an example?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-29  6:41     ` Tian, Kevin
  2021-09-29 12:28       ` Jason Gunthorpe
@ 2021-09-30  3:12       ` David Gibson
  1 sibling, 0 replies; 280+ messages in thread
From: David Gibson @ 2021-09-30  3:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, jgg, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2585 bytes --]

On Wed, Sep 29, 2021 at 06:41:00AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Wednesday, September 29, 2021 2:01 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the
> > vfio
> > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is
> > provided
> > > because it's implicitly done when the device fd is closed.
> > >
> > > In concept a vfio device can be bound to multiple iommufds, each hosting
> > > a subset of I/O address spaces attached by this device.
> > 
> > I really feel like this many<->many mapping between devices is going
> > to be super-confusing, and therefore make it really hard to be
> > confident we have all the rules right for proper isolation.
> 
> Based on new discussion on group ownership part (patch06), I feel this
> many<->many relationship will disappear. The context fd (either container
> or iommufd) will uniquely mark the ownership on a physical device and
> its group. With this design it's impractical to have one device bound
> to multiple iommufds. Actually I don't think this is a compelling usage
> in reality. The previous rationale was that no need to impose such restriction
> if no special reason... and now we have a reason. 😊
> 
> Jason, are you OK with this simplification?
> 
> > 
> > That's why I was suggesting a concept like endpoints, to break this
> > into two many<->one relationships.  I'm ok if that isn't visible in
> > the user API, but I think this is going to be really hard to keep
> > track of if it isn't explicit somewhere in the internals.
> > 
> 
> I think this endpoint concept is represented by ioas_device_info in
> patch14:
> 
> +/*
> + * An ioas_device_info object is created per each successful attaching
> + * request. A list of objects are maintained per ioas when the address
> + * space is shared by multiple devices.
> + */
> +struct ioas_device_info {
> +	struct iommufd_device *idev;
> +	struct list_head next;
>  };
> 
> currently it's 1:1 mapping before this object and iommufd_device, 
> because no pasid support yet.

Ok, I haven't read that far in the series yet.

> We can rename it to struct ioas_endpoint if it makes you feel
> better.

Meh.  The concept is much more important than the name.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-29 12:36                   ` Jason Gunthorpe
@ 2021-09-30  8:30                     ` Tian, Kevin
  2021-09-30 10:33                       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-30  8:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, Jean-Philippe Brucker, Jiang,
	Dave, Raj, Ashok, corbet, parav, Alex Williamson, lkml, david,
	dwmw2, Tian, Jun J, linux-kernel, lushenming, pbonzini,
	robin.murphy

> From: Jason Gunthorpe
> Sent: Wednesday, September 29, 2021 8:37 PM
> 
> On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:
> 
> > ARM:
> >     - set to snoop format if IOMMU_CACHE
> >     - set to nonsnoop format if !IOMMU_CACHE
> > (in both cases TLP snoop bit is ignored?)
> 
> Where do you see this? I couldn't even find this functionality in the
> ARM HW manual??

Honestly speaking I'm getting confused by the complex attribute
transformation control (default, replace, combine, input, output, etc.)
in SMMU manual. Above was my impression after last check, but now
I cannot find necessary info to build the same picture (except below 
code). :/

> 
> What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
> the cache coherence to be disabled, which is not ignoring no snoop.

My impression was that snoop is one way of implementing cache
coherency and now since the PTE can explicitly specify cache coherency 
like below:

                else if (prot & IOMMU_CACHE)
                        pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
                else
                        pte |= ARM_LPAE_PTE_MEMATTR_NC;

This setting in concept overrides the snoop attribute from the device thus
make it sort of ignored?

But I did see the manual says that:
--
Note: To achieve this 'pull-down' behavior, the No_snoop flag might 
be carried through the SMMU and used to transform the SMMU output 
downstream.
--

So again, just got confused here...

> 
> > I didn't identify the exact commit for above meaning change.
> >
> > Robin, could you help share some thoughts here?
> 
> It is this:
> 
> static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
> 		     unsigned long attrs)
> {
> 	int prot = coherent ? IOMMU_CACHE : 0;
> 
> Which sets IOMMU_CACHE based on:
> 
> static void *iommu_dma_alloc(struct device *dev, size_t size,
> 		dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
> {
> 	bool coherent = dev_is_dma_coherent(dev);
> 	int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs);
> 
> Driving IOMMU_CACHE from dev_is_dma_coherent() has *NOTHING* to do
> with no-snoop TLPs and everything to do with the arch cache
> maintenance API

Maybe I'll get a clearer picture on this after understanding the difference 
between cache coherency and snoop on ARM. They are sort of inter-
changeable on Intel (or possibly on x86 since I just found that AMD 
completely ignores IOMMU_CACHE).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23 12:22               ` Jason Gunthorpe
  2021-09-29  8:48                 ` Tian, Kevin
@ 2021-09-30  8:49                 ` Tian, Kevin
  2021-09-30 13:43                   ` Lu Baolu
  2021-09-30 22:08                   ` Jason Gunthorpe
  1 sibling, 2 replies; 280+ messages in thread
From: Tian, Kevin @ 2021-09-30  8:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Lu, Baolu
  Cc: Jean-Philippe Brucker, Alex Williamson, Liu, Yi L, hch, jasowang,
	joro, parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 8:22 PM
> 
> > > These are different things and need different bits. Since the ARM path
> > > has a lot more code supporting it, I'd suggest Intel should change
> > > their code to use IOMMU_BLOCK_NO_SNOOP and abandon
> IOMMU_CACHE.
> >
> > I didn't fully get this point. The end result is same, i.e. making the DMA
> > cache-coherent when IOMMU_CACHE is set. Or if you help define the
> > behavior of IOMMU_CACHE, what will you define now?
> 
> It is clearly specifying how the kernel API works:
> 
>  !IOMMU_CACHE
>    must call arch cache flushers
>  IOMMU_CACHE -
>    do not call arch cache flushers
>  IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
>    dot not arch cache flushers, and ignore the no snoop bit.

Who will set IOMMU_BLOCK_NO_SNOOP? I feel this is arch specific
knowledge about how cache coherency is implemented, i.e. 
when IOMMU_CACHE is set intel-iommu driver just maps it to
blocking no-snoop. It's not necessarily to be an attribute in 
the same level as IOMMU_CACHE?

> 
> On Intel it should refuse to create a !IOMMU_CACHE since the HW can't
> do that. 

Agree. In reality I guess this is not hit because all devices are marked
coherent on Intel platforms...

Baolu, any insight here?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-23 11:42           ` Jason Gunthorpe
@ 2021-09-30  9:35             ` Tian, Kevin
  2021-09-30 22:23               ` Jason Gunthorpe
  0 siblings, 1 reply; 280+ messages in thread
From: Tian, Kevin @ 2021-09-30  9:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 7:42 PM
> 
> On Thu, Sep 23, 2021 at 03:38:10AM +0000, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Thursday, September 23, 2021 11:11 AM
> > >
> > > >
> > > > The required behavior for iommufd is to have the IOMMU ignore the
> > > > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > > > clearly documented for its exact purpose and if other arches also have
> > > > instructions that need to be disabled if snoop TLPs are allowed then
> > > > they can re-use this bit. It appears ARM does not have this issue and
> > > > does not need the bit.
> > >
> > > Disabling wbinvd is one purpose. imo the more important intention
> > > is that iommu vendor uses different PTE formats between snoop and
> > > !snoop. As long as we want allow userspace to opt in case of isoch
> > > performance requirement (unlike current vfio which always choose
> > > snoop format if available), such mechanism is required for all vendors.
> > >
> >
> > btw I'm not sure whether the wbinvd trick is Intel specific. All other
> > platforms (amd, arm, s390, etc.) currently always claim OMMU_CAP_
> > CACHE_COHERENCY (the source of IOMMU_CACHE).
> 
> This only means they don't need to use the arch cache flush
> helpers. It has nothing to do with no-snoop on those platforms.
> 
> > They didn't hit this problem because vfio always sets IOMMU_CACHE to
> > force every DMA to snoop. Will they need to handle similar
> > wbinvd-like trick (plus necessary memory type virtualization) when
> > non-snoop format is enabled?  Or are their architectures highly
> > optimized to afford isoch traffic even with snoop (then fine to not
> > support user opt-in)?
> 
> In other arches the question is:
>  - Do they allow non-coherent DMA to exist in a VM?

And is coherency a static attribute per device or could be opted
by driver on such arch? If the latter, then the same opt path from
userspace sounds also reasonable, since driver is in userspace now.

>  - Can the VM issue cache maintaince ops to fix the decoherence?

As you listed the questions are all about non-coherent DMA, not
how non-coherent DMAs are implemented underlyingly. From this
angle focusing on coherent part as Alex suggested is more forward
looking than tying the uAPI to a specific coherency implementation
using snoop?

> 
> The Intel functional issue is that Intel blocks the cache maintaince
> ops from the VM and the VM has no way to self-discover that the cache
> maintaince ops don't work.

the VM doesn't need to know whether the maintenance ops 
actually works. It just treats the device as if those ops are always
required. The hypervisor will figure out whether those ops should
be blocked based on whether coherency is guaranteed by iommu
based on iommufd/vfio.

> 
> Other arches don't seem to have this specific problem...

I think the key is whether other archs allow driver to decide DMA
coherency and indirectly the underlying I/O page table format. 
If yes, then I don't see a reason why such decision should not be 
given to userspace for passthrough case.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30  8:30                     ` Tian, Kevin
@ 2021-09-30 10:33                       ` Jean-Philippe Brucker
  2021-09-30 22:04                         ` Jason Gunthorpe
  2021-10-14  8:01                         ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-30 10:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, kvm, jasowang, kwankhede, hch, Jiang, Dave, Raj,
	Ashok, corbet, parav, Alex Williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

On Thu, Sep 30, 2021 at 08:30:42AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, September 29, 2021 8:37 PM
> > 
> > On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:
> > 
> > > ARM:
> > >     - set to snoop format if IOMMU_CACHE
> > >     - set to nonsnoop format if !IOMMU_CACHE
> > > (in both cases TLP snoop bit is ignored?)
> > 
> > Where do you see this? I couldn't even find this functionality in the
> > ARM HW manual??
> 
> Honestly speaking I'm getting confused by the complex attribute
> transformation control (default, replace, combine, input, output, etc.)
> in SMMU manual. Above was my impression after last check, but now
> I cannot find necessary info to build the same picture (except below 
> code). :/
> 
> > 
> > What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
> > the cache coherence to be disabled, which is not ignoring no snoop.
> 
> My impression was that snoop is one way of implementing cache
> coherency and now since the PTE can explicitly specify cache coherency 
> like below:
> 
>                 else if (prot & IOMMU_CACHE)
>                         pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>                 else
>                         pte |= ARM_LPAE_PTE_MEMATTR_NC;
> 
> This setting in concept overrides the snoop attribute from the device thus
> make it sort of ignored?

To make sure we're talking about the same thing: "the snoop attribute from
the device" is the "No snoop" attribute in the PCI TLP, right?

The PTE flags define whether the memory access is cache-coherent or not.
* WB is cacheable (short for write-back cacheable. Doesn't matter here
  what OI or RWA mean.)
* NC is non-cacheable.

         | Normal PCI access | No_snoop PCI access
  -------+-------------------+-------------------
  PTE WB | Cacheable         | Non-cacheable
  PTE NC | Non-cacheable     | Non-cacheable

Cacheable memory access participate in cache coherency. Non-cacheable
accesses go directly to memory, do not cause cache allocation.

On Arm cache coherency is configured through PTE attributes. I don't think
PCI No_snoop should be used because it's not necessarily supported
throughout the system and, as far as I understand, software can't discover
whether it is.

[...]
> Maybe I'll get a clearer picture on this after understanding the difference 
> between cache coherency and snoop on ARM.

The architecture uses terms "cacheable" and "coherent". The term "snoop"
is used when referring specifically to the PCI "No snoop" attribute. It is
also used within the interconnect coherency protocols, which are invisible
to software.

Thanks,
Jean


^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30  8:49                 ` Tian, Kevin
@ 2021-09-30 13:43                   ` Lu Baolu
  2021-10-01  3:24                     ` hch
  2021-09-30 22:08                   ` Jason Gunthorpe
  1 sibling, 1 reply; 280+ messages in thread
From: Lu Baolu @ 2021-09-30 13:43 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Lu, Baolu
  Cc: baolu.lu, Jean-Philippe Brucker, Alex Williamson, Liu, Yi L, hch,
	jasowang, joro, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

On 2021/9/30 16:49, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Thursday, September 23, 2021 8:22 PM
>>
>>>> These are different things and need different bits. Since the ARM path
>>>> has a lot more code supporting it, I'd suggest Intel should change
>>>> their code to use IOMMU_BLOCK_NO_SNOOP and abandon
>> IOMMU_CACHE.
>>>
>>> I didn't fully get this point. The end result is same, i.e. making the DMA
>>> cache-coherent when IOMMU_CACHE is set. Or if you help define the
>>> behavior of IOMMU_CACHE, what will you define now?
>>
>> It is clearly specifying how the kernel API works:
>>
>>   !IOMMU_CACHE
>>     must call arch cache flushers
>>   IOMMU_CACHE -
>>     do not call arch cache flushers
>>   IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
>>     dot not arch cache flushers, and ignore the no snoop bit.
> 
> Who will set IOMMU_BLOCK_NO_SNOOP? I feel this is arch specific
> knowledge about how cache coherency is implemented, i.e.
> when IOMMU_CACHE is set intel-iommu driver just maps it to
> blocking no-snoop. It's not necessarily to be an attribute in
> the same level as IOMMU_CACHE?
> 
>>
>> On Intel it should refuse to create a !IOMMU_CACHE since the HW can't
>> do that.
> 
> Agree. In reality I guess this is not hit because all devices are marked
> coherent on Intel platforms...
> 
> Baolu, any insight here?

I am trying to follow the discussion here. Please guide me if I didn't
get the right context.

Here, we are discussing arch_sync_dma_for_cpu() and
arch_sync_dma_for_device(). The x86 arch has clflush to sync dma buffer
for device, but I can't see any instruction to sync dma buffer for cpu
if the device is not cache coherent. Is that the reason why x86 can't
have an implementation for arch_sync_dma_for_cpu(), hence all devices
are marked coherent?

> Thanks
> Kevin
> 

Best regards,
baolu

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30 10:33                       ` Jean-Philippe Brucker
@ 2021-09-30 22:04                         ` Jason Gunthorpe
  2021-10-01  3:28                           ` hch
  2021-10-14  8:01                         ` Tian, Kevin
  1 sibling, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 22:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, kvm, jasowang, kwankhede, hch, Jiang, Dave, Raj,
	Ashok, corbet, parav, Alex Williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

On Thu, Sep 30, 2021 at 11:33:13AM +0100, Jean-Philippe Brucker wrote:
> On Thu, Sep 30, 2021 at 08:30:42AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, September 29, 2021 8:37 PM
> > > 
> > > On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:
> > > 
> > > > ARM:
> > > >     - set to snoop format if IOMMU_CACHE
> > > >     - set to nonsnoop format if !IOMMU_CACHE
> > > > (in both cases TLP snoop bit is ignored?)
> > > 
> > > Where do you see this? I couldn't even find this functionality in the
> > > ARM HW manual??
> > 
> > Honestly speaking I'm getting confused by the complex attribute
> > transformation control (default, replace, combine, input, output, etc.)
> > in SMMU manual. Above was my impression after last check, but now
> > I cannot find necessary info to build the same picture (except below 
> > code). :/
> > 
> > > 
> > > What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
> > > the cache coherence to be disabled, which is not ignoring no snoop.
> > 
> > My impression was that snoop is one way of implementing cache
> > coherency and now since the PTE can explicitly specify cache coherency 
> > like below:
> > 
> >                 else if (prot & IOMMU_CACHE)
> >                         pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> >                 else
> >                         pte |= ARM_LPAE_PTE_MEMATTR_NC;
> > 
> > This setting in concept overrides the snoop attribute from the device thus
> > make it sort of ignored?
> 
> To make sure we're talking about the same thing: "the snoop attribute from
> the device" is the "No snoop" attribute in the PCI TLP, right?
> 
> The PTE flags define whether the memory access is cache-coherent or not.
> * WB is cacheable (short for write-back cacheable. Doesn't matter here
>   what OI or RWA mean.)
> * NC is non-cacheable.
> 
>          | Normal PCI access | No_snoop PCI access
>   PTE WB | Cacheable         | Non-cacheable
>   PTE NC | Non-cacheable     | Non-cacheable
> 
> Cacheable memory access participate in cache coherency. Non-cacheable
> accesses go directly to memory, do not cause cache allocation.

This table is what I was thinking after reading through the ARM docs.

> On Arm cache coherency is configured through PTE attributes. I don't think
> PCI No_snoop should be used because it's not necessarily supported
> throughout the system and, as far as I understand, software can't discover
> whether it is.

The usage of no-snoop is a behavior of a device. A generic PCI driver
should be able to program the device to generate no-snoop TLPs and
ideally rely on an arch specific API in the OS to trigger the required
cache maintenance.

It doesn't make much sense for a portable driver to rely on a
non-portable IO PTE flag to control coherency, since that is not a
standards based approach.

That said, Linux doesn't have a generic DMA API to support
no-snoop. The few GPUs drivers that use this stuff just hardwired
wbsync on Intel..

What I don't really understand is why ARM, with an IOMMU that supports
PTE WB, has devices where dev_is_dma_coherent() == false ? 

Is it the case that DMA from those devices ignores the IO PTE's
cachable mode?

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30  8:49                 ` Tian, Kevin
  2021-09-30 13:43                   ` Lu Baolu
@ 2021-09-30 22:08                   ` Jason Gunthorpe
  1 sibling, 0 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 22:08 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Lu, Baolu, Jean-Philippe Brucker, Alex Williamson, Liu, Yi L,
	hch, jasowang, joro, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao,
	Jiang, Dave, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 30, 2021 at 08:49:03AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, September 23, 2021 8:22 PM
> > 
> > > > These are different things and need different bits. Since the ARM path
> > > > has a lot more code supporting it, I'd suggest Intel should change
> > > > their code to use IOMMU_BLOCK_NO_SNOOP and abandon
> > IOMMU_CACHE.
> > >
> > > I didn't fully get this point. The end result is same, i.e. making the DMA
> > > cache-coherent when IOMMU_CACHE is set. Or if you help define the
> > > behavior of IOMMU_CACHE, what will you define now?
> > 
> > It is clearly specifying how the kernel API works:
> > 
> >  !IOMMU_CACHE
> >    must call arch cache flushers
> >  IOMMU_CACHE -
> >    do not call arch cache flushers
> >  IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
> >    dot not arch cache flushers, and ignore the no snoop bit.
> 
> Who will set IOMMU_BLOCK_NO_SNOOP?

Basically only qemu due to specialized x86 hypervisor knowledge.

The only purpose of this attribute is to support a specific
virtualization use case where a whole bunch of stuff is broken
together:
 - the cache maintenance instructions are not available to a guest
 - the guest isn't aware that the instructions don't work and tells
   the device to issue no-snoop TLPs
 - The device ignores the 'disable no-snoop' flag in the PCIe config
   space

Thus things become broken.

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30  9:35             ` Tian, Kevin
@ 2021-09-30 22:23               ` Jason Gunthorpe
  2021-10-01  3:30                 ` hch
  2021-10-14  9:11                 ` Tian, Kevin
  0 siblings, 2 replies; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 22:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 30, 2021 at 09:35:45AM +0000, Tian, Kevin wrote:

> > The Intel functional issue is that Intel blocks the cache maintaince
> > ops from the VM and the VM has no way to self-discover that the cache
> > maintaince ops don't work.
> 
> the VM doesn't need to know whether the maintenance ops 
> actually works.

Which is the whole problem.

Intel has a design where the device driver tells the device to issue
non-cachable TLPs.

The driver is supposed to know if it can issue the cache maintaince
instructions - if it can then it should ask the device to issue
no-snoop TLPs.

For instance the same PCI driver on non-x86 should never ask the
device to issue no-snoop TLPs because it has no idea how to restore
cache coherence on eg ARM.

Do you see the issue? This configuration where the hypervisor silently
make wbsync a NOP breaks the x86 architecture because the guest has no
idea it can no longer use no-snoop features.

Using the IOMMU to forcibly prevent the device from issuing no-snoop
makes this whole issue of the broken wbsync moot.

It is important to be really clear on what this is about - this is not
some idealized nice iommu feature - it is working around alot of
backwards compatability baggage that is probably completely unique to
x86.

> > Other arches don't seem to have this specific problem...
> 
> I think the key is whether other archs allow driver to decide DMA
> coherency and indirectly the underlying I/O page table format. 
> If yes, then I don't see a reason why such decision should not be 
> given to userspace for passthrough case.

The choice all comes down to if the other arches have cache
maintenance instructions in the VM that *don't work*

Jason

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-30  3:09           ` David Gibson
@ 2021-09-30 22:28             ` Jason Gunthorpe
  2021-10-01  3:54               ` David Gibson
  0 siblings, 1 reply; 280+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 22:28 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

On Thu, Sep 30, 2021 at 01:09:22PM +1000, David Gibson wrote:

> > The *admin* the one responsible to understand the groups, not the
> > applications. The admin has no idea what a group FD is - they should
> > be looking at the sysfs and seeing the iommu_group directories.
> 
> Not just the admin.  If an app is given two devices in the same group
> to use *both* it must understand that and act accordingly.

Yes, but this is true regardless of what the uAPI is, and for common
app cases where we have a single IO Page table for all devices the app
still doesn't need to care about groups since it can just assign all
devices to the same IO page table and everything works out just fine.

For instance qemu without a vIOMMU does not need to care about
groups. It opens a single iommufd, creates a single IO page table that
maps the guest physical space and assigns every device to that IO page
table. No issue.

Only if qemu is creating a vIOMMU does it need to start to look at the
groups and ensure that the group becomes visible to the guest OS. Here
the group fd doesn't really help anything

Jason



^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30 13:43                   ` Lu Baolu
@ 2021-10-01  3:24                     ` hch
  0 siblings, 0 replies; 280+ messages in thread
From: hch @ 2021-10-01  3:24 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Tian, Kevin, Jason Gunthorpe, Lu, Baolu, Jean-Philippe Brucker,
	Alex Williamson, Liu, Yi L, hch, jasowang, joro, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, david, nicolinc

On Thu, Sep 30, 2021 at 09:43:58PM +0800, Lu Baolu wrote:
> Here, we are discussing arch_sync_dma_for_cpu() and
> arch_sync_dma_for_device(). The x86 arch has clflush to sync dma buffer
> for device, but I can't see any instruction to sync dma buffer for cpu
> if the device is not cache coherent. Is that the reason why x86 can't
> have an implementation for arch_sync_dma_for_cpu(), hence all devices
> are marked coherent?

arch_sync_dma_for_cpu and arch_sync_dma_for_device are only used if
the device is marked non-coherent, that is if Linux knows the device
can't be part of the cache coherency protocol.  There are no known
x86 systems with entirely not cache coherent devices so these helpers
won't be useful as-is.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30 22:04                         ` Jason Gunthorpe
@ 2021-10-01  3:28                           ` hch
  2021-10-14  8:13                             ` Tian, Kevin
  0 siblings, 1 reply; 280+ messages in thread
From: hch @ 2021-10-01  3:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin, kvm, jasowang, kwankhede,
	hch, Jiang, Dave, Raj, Ashok, corbet, parav, Alex Williamson,
	lkml, david, dwmw2, Tian, Jun J, linux-kernel, lushenming,
	pbonzini, robin.murphy

On Thu, Sep 30, 2021 at 07:04:46PM -0300, Jason Gunthorpe wrote:
> > On Arm cache coherency is configured through PTE attributes. I don't think
> > PCI No_snoop should be used because it's not necessarily supported
> > throughout the system and, as far as I understand, software can't discover
> > whether it is.
> 
> The usage of no-snoop is a behavior of a device. A generic PCI driver
> should be able to program the device to generate no-snoop TLPs and
> ideally rely on an arch specific API in the OS to trigger the required
> cache maintenance.

Well, it is a combination of the device, the root port and the driver
which all need to be in line to use this.

> It doesn't make much sense for a portable driver to rely on a
> non-portable IO PTE flag to control coherency, since that is not a
> standards based approach.
> 
> That said, Linux doesn't have a generic DMA API to support
> no-snoop. The few GPUs drivers that use this stuff just hardwired
> wbsync on Intel..

Yes, as usual the GPU folks come up with nasty hacks instead of
providing generic helper.  Basically all we'd need to support it
in a generic way is:

 - a DMA_ATTR_NO_SNOOP (or DMA_ATTR_FORCE_NONCOHERENT to fit the Linux
   terminology) which treats the current dma_map/unmap/sync calls as
   if dev_is_dma_coherent was false
 - a way for the driver to discover that a given architecture / running
   system actually supports this

> What I don't really understand is why ARM, with an IOMMU that supports
> PTE WB, has devices where dev_is_dma_coherent() == false ? 

Because no IOMMU in the world can help that fact that a periphal on the
SOC is not part of the cache coherency protocol.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-30 22:23               ` Jason Gunthorpe
@ 2021-10-01  3:30                 ` hch
  2021-10-14  9:11                 ` Tian, Kevin
  1 sibling, 0 replies; 280+ messages in thread
From: hch @ 2021-10-01  3:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Liu, Yi L, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Thu, Sep 30, 2021 at 07:23:55PM -0300, Jason Gunthorpe wrote:
> > > The Intel functional issue is that Intel blocks the cache maintaince
> > > ops from the VM and the VM has no way to self-discover that the cache
> > > maintaince ops don't work.
> > 
> > the VM doesn't need to know whether the maintenance ops 
> > actually works.
> 
> Which is the whole problem.
> 
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
> 
> The driver is supposed to know if it can issue the cache maintaince
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.

The driver should never issue them.  This whole idea that a driver
can just magically poke the cache directly is just one of these horrible
short cuts that seems to happen in GPU land all the time but nowhere
else.

> > coherency and indirectly the underlying I/O page table format. 
> > If yes, then I don't see a reason why such decision should not be 
> > given to userspace for passthrough case.
> 
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*

Or have them at all.

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-30 22:28             ` Jason Gunthorpe
@ 2021-10-01  3:54               ` David Gibson
  0 siblings, 0 replies; 280+ messages in thread
From: David Gibson @ 2021-10-01  3:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 1026 bytes --]

On Thu, Sep 30, 2021 at 07:28:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 01:09:22PM +1000, David Gibson wrote:
> 
> > > The *admin* the one responsible to understand the groups, not the
> > > applications. The admin has no idea what a group FD is - they should
> > > be looking at the sysfs and seeing the iommu_group directories.
> > 
> > Not just the admin.  If an app is given two devices in the same group
> > to use *both* it must understand that and act accordingly.
> 
> Yes, but this is true regardless of what the uAPI is,

Yes, but formerly it was explicit and now it is implicit.  Before we
said "attach this group to this container", which can reasonably be
expected to affect the whole group.  Now we say "attach this device to
this IOAS" and it silently also affects other devices.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38 ` [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE Liu Yi L
  2021-09-21 17:44   ` Jason Gunthorpe
  2021-09-22 13:45   ` Jean-Philippe Brucker
@ 2021-10-01  6:11   ` David Gibson
  2021-10-13  7:00     ` Tian, Kevin
  2 siblings, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-10-01  6:11 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 15209 bytes --]

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enfore_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

Ok, there are several things we need for ppc.  None of which are
inherently ppc specific and some of which will I think be useful for
most platforms.  So, starting from most general to most specific
here's basically what's needed:

1. We need to represent the fact that the IOMMU can only translate
   *some* IOVAs, not a full 64-bit range.  You have the addr_width
   already, but I'm entirely sure if the translatable range on ppc
   (or other platforms) is always a power-of-2 size.  It usually will
   be, of course, but I'm not sure that's a hard requirement.  So
   using a size/max rather than just a number of bits might be safer.

   I think basically every platform will need this.  Most platforms
   don't actually implement full 64-bit translation in any case, but
   rather some smaller number of bits that fits their page table
   format.

2. The translatable range of IOVAs may not begin at 0.  So we need to
   advertise to userspace what the base address is, as well as the
   size.  POWER's main IOVA range begins at 2^59 (at least on the
   models I know about).

   I think a number of platforms are likely to want this, though I
   couldn't name them apart from POWER.  Putting the translated IOVA
   window at some huge address is a pretty obvious approach to making
   an IOMMU which can translate a wide address range without colliding
   with any legacy PCI addresses down low (the IOMMU can check if this
   transaction is for it by just looking at some high bits in the
   address).

3. There might be multiple translatable ranges.  So, on POWER the
   IOMMU can typically translate IOVAs from 0..2GiB, and also from
   2^59..2^59+<RAM size>.  The two ranges have completely separate IO
   page tables, with (usually) different layouts.  (The low range will
   nearly always be a single-level page table with 4kiB or 64kiB
   entries, the high one will be multiple levels depending on the size
   of the range and pagesize).

   This may be less common, but I suspect POWER won't be the only
   platform to do something like this.  As above, using a high range
   is a pretty obvious approach, but clearly won't handle older
   devices which can't do 64-bit DMA.  So adding a smaller range for
   those devices is again a pretty obvious solution.  Any platform
   with an "IO hole" can be treated as having two ranges, one below
   the hole and one above it (although in that case they may well not
   have separate page tables 

4. The translatable ranges might not be fixed.  On ppc that 0..2GiB
   and 2^59..whatever ranges are kernel conventions, not specified by
   the hardware or firmware.  When running as a guest (which is the
   normal case on POWER), there are explicit hypercalls for
   configuring the allowed IOVA windows (along with pagesize, number
   of levels etc.).  At the moment it is fixed in hardware that there
   are only 2 windows, one starting at 0 and one at 2^59 but there's
   no inherent reason those couldn't also be configurable.

   This will probably be rarer, but I wouldn't be surprised if it
   appears on another platform.  If you were designing an IOMMU ASIC
   for use in a variety of platforms, making the base address and size
   of the translatable range(s) configurable in registers would make
   sense.


Now, for (3) and (4), representing lists of windows explicitly in
ioctl()s is likely to be pretty ugly.  We might be able to avoid that,
for at least some of the interfaces, by using the nested IOAS stuff.
One way or another, though, the IOASes which are actually attached to
devices need to represent both windows.

e.g.
Create a "top-level" IOAS <A> representing the device's view.  This
would be either TYPE_KERNEL or maybe a special type.  Into that you'd
make just two iomappings one for each of the translation windows,
pointing to IOASes <B> and <C>.  IOAS <B> and <C> would have a single
window, and would represent the IO page tables for each of the
translation windows.  These could be either TYPE_KERNEL or (say)
TYPE_POWER_TCE for a user managed table.  Well.. in theory, anyway.
The way paravirtualization on POWER is done might mean user managed
tables aren't really possible for other reasons, but that's not
relevant here.

The next problem here is that we don't want userspace to have to do
different things for POWER, at least not for the easy case of a
userspace driver that just wants a chunk of IOVA space and doesn't
really care where it is.

In general I think the right approach to handle that is to
de-emphasize "info" or "query" interfaces.  We'll probably still need
some for debugging and edge cases, but in the normal case userspace
should just specify what it *needs* and (ideally) no more with
optional hints, and the kernel will either supply that or fail.

e.g. A simple userspace driver would simply say "I need an IOAS with
at least 1GiB of IOVA space" and the kernel says "Ok, you can use
2^59..2^59+2GiB".  qemu, emulating the POWER vIOMMU might say "I need
an IOAS with translatable addresses from 0..2GiB with 4kiB page size
and from 2^59..2^59+1TiB with 64kiB page size" and the kernel would
either say "ok", or "I can't do that".

> - Currently ioasid term has already been used in the kernel (drivers/iommu/
>   ioasid.c) to represent the hardware I/O address space ID in the wire. It
>   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
>   ID). We need find a way to resolve the naming conflict between the hardware
>   ID and software handle. One option is to rename the existing ioasid to be
>   pasid or ssid, given their full names still sound generic. Appreciate more
>   thoughts on this open!
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h         |   3 +
>  include/uapi/linux/iommu.h      |  54 ++++++++++++++
>  3 files changed, 177 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 641f199f2d41..4839f128b24a 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -24,6 +24,7 @@
>  struct iommufd_ctx {
>  	refcount_t refs;
>  	struct mutex lock;
> +	struct xarray ioasid_xa; /* xarray of ioasids */
>  	struct xarray device_xa; /* xarray of bound devices */
>  };
>  
> @@ -42,6 +43,16 @@ struct iommufd_device {
>  	u64 dev_cookie;
>  };
>  
> +/* Represent an I/O address space */
> +struct iommufd_ioas {
> +	int ioasid;
> +	u32 type;
> +	u32 addr_width;
> +	bool enforce_snoop;
> +	struct iommufd_ctx *ictx;
> +	refcount_t refs;
> +};
> +
>  static int iommufd_fops_open(struct inode *inode, struct file *filep)
>  {
>  	struct iommufd_ctx *ictx;
> @@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
>  
>  	refcount_set(&ictx->refs, 1);
>  	mutex_init(&ictx->lock);
> +	xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
>  	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
>  	filep->private_data = ictx;
>  
> @@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
>  	if (!refcount_dec_and_test(&ictx->refs))
>  		return;
>  
> +	WARN_ON(!xa_empty(&ictx->ioasid_xa));
>  	WARN_ON(!xa_empty(&ictx->device_xa));
>  	kfree(ictx);
>  }
>  
> +/* Caller should hold ictx->lock */
> +static void ioas_put_locked(struct iommufd_ioas *ioas)
> +{
> +	struct iommufd_ctx *ictx = ioas->ictx;
> +	int ioasid = ioas->ioasid;
> +
> +	if (!refcount_dec_and_test(&ioas->refs))
> +		return;
> +
> +	xa_erase(&ictx->ioasid_xa, ioasid);
> +	iommufd_ctx_put(ictx);
> +	kfree(ioas);
> +}
> +
> +static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
> +{
> +	struct iommu_ioasid_alloc req;
> +	struct iommufd_ioas *ioas;
> +	unsigned long minsz;
> +	int ioasid, ret;
> +
> +	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
> +
> +	if (copy_from_user(&req, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (req.argsz < minsz || !req.addr_width ||
> +	    req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
> +	    req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
> +		return -EINVAL;
> +
> +	ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
> +	if (!ioas)
> +		return -ENOMEM;
> +
> +	mutex_lock(&ictx->lock);
> +	ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
> +		       XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
> +		       GFP_KERNEL);
> +	mutex_unlock(&ictx->lock);
> +	if (ret) {
> +		pr_err_ratelimited("Failed to alloc ioasid\n");
> +		kfree(ioas);
> +		return ret;
> +	}
> +
> +	ioas->ioasid = ioasid;
> +
> +	/* only supports kernel managed I/O page table so far */
> +	ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
> +
> +	ioas->addr_width = req.addr_width;
> +
> +	/* only supports enforce snoop today */
> +	ioas->enforce_snoop = true;
> +
> +	iommufd_ctx_get(ictx);
> +	ioas->ictx = ictx;
> +
> +	refcount_set(&ioas->refs, 1);
> +
> +	return ioasid;
> +}
> +
> +static int iommufd_ioasid_free(struct iommufd_ctx *ictx, unsigned long arg)
> +{
> +	struct iommufd_ioas *ioas = NULL;
> +	int ioasid, ret;
> +
> +	if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
> +		return -EFAULT;
> +
> +	if (ioasid < 0)
> +		return -EINVAL;
> +
> +	mutex_lock(&ictx->lock);
> +	ioas = xa_load(&ictx->ioasid_xa, ioasid);
> +	if (IS_ERR(ioas)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/* Disallow free if refcount is not 1 */
> +	if (refcount_read(&ioas->refs) > 1) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +
> +	ioas_put_locked(ioas);
> +out_unlock:
> +	mutex_unlock(&ictx->lock);
> +	return ret;
> +};
> +
>  static int iommufd_fops_release(struct inode *inode, struct file *filep)
>  {
>  	struct iommufd_ctx *ictx = filep->private_data;
> +	struct iommufd_ioas *ioas;
> +	unsigned long index;
>  
>  	filep->private_data = NULL;
>  
> +	mutex_lock(&ictx->lock);
> +	xa_for_each(&ictx->ioasid_xa, index, ioas)
> +		ioas_put_locked(ioas);
> +	mutex_unlock(&ictx->lock);
> +
>  	iommufd_ctx_put(ictx);
>  
>  	return 0;
> @@ -195,6 +309,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
>  	case IOMMU_DEVICE_GET_INFO:
>  		ret = iommufd_get_device_info(ictx, arg);
>  		break;
> +	case IOMMU_IOASID_ALLOC:
> +		ret = iommufd_ioasid_alloc(ictx, arg);
> +		break;
> +	case IOMMU_IOASID_FREE:
> +		ret = iommufd_ioasid_free(ictx, arg);
> +		break;
>  	default:
>  		pr_err_ratelimited("unsupported cmd %u\n", cmd);
>  		break;
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> index 1603a13937e9..1dd6515e7816 100644
> --- a/include/linux/iommufd.h
> +++ b/include/linux/iommufd.h
> @@ -14,6 +14,9 @@
>  #include <linux/err.h>
>  #include <linux/device.h>
>  
> +#define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
> +#define IOMMUFD_IOASID_MIN	0
> +
>  #define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
>  #define IOMMUFD_DEVID_MIN	0
>  
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 76b71f9d6b34..5cbd300eb0ee 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -57,6 +57,60 @@ struct iommu_device_info {
>  
>  #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
>  
> +/*
> + * IOMMU_IOASID_ALLOC	- _IOWR(IOMMU_TYPE, IOMMU_BASE + 2,
> + *				struct iommu_ioasid_alloc)
> + *
> + * Allocate an IOASID.
> + *
> + * IOASID is the FD-local software handle representing an I/O address
> + * space. Each IOASID is associated with a single I/O page table. User
> + * must call this ioctl to get an IOASID for every I/O address space
> + * that is intended to be tracked by the kernel.
> + *
> + * User needs to specify the attributes of the IOASID and associated
> + * I/O page table format information according to one or multiple devices
> + * which will be attached to this IOASID right after. The I/O page table
> + * is activated in the IOMMU when it's attached by a device. Incompatible
> + * format between device and IOASID will lead to attaching failure in
> + * device side.
> + *
> + * Currently only one flag (IOMMU_IOASID_ENFORCE_SNOOP) is supported and
> + * must be always set.
> + *
> + * Only one I/O page table type (kernel-managed) is supported, with vfio
> + * type1v2 mapping semantics.
> + *
> + * User should call IOMMU_CHECK_EXTENSION for future extensions.
> + *
> + * @argsz:	    user filled size of this data.
> + * @flags:	    additional information for IOASID allocation.
> + * @type:	    I/O address space page table type.
> + * @addr_width:    address width of the I/O address space.
> + *
> + * Return: allocated ioasid on success, -errno on failure.
> + */
> +struct iommu_ioasid_alloc {
> +	__u32	argsz;
> +	__u32	flags;
> +#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
> +	__u32	type;
> +#define IOMMU_IOASID_TYPE_KERNEL_TYPE1V2	1
> +	__u32	addr_width;
> +};
> +
> +#define IOMMU_IOASID_ALLOC		_IO(IOMMU_TYPE, IOMMU_BASE + 2)
> +
> +/**
> + * IOMMU_IOASID_FREE - _IOWR(IOMMU_TYPE, IOMMU_BASE + 3, int)
> + *
> + * Free an IOASID.
> + *
> + * returns: 0 on success, -errno on failure
> + */
> +
> +#define IOMMU_IOASID_FREE		_IO(IOMMU_TYPE, IOMMU_BASE + 3)
> +
>  #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
>  #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
>  #define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-21 17:44   ` Jason Gunthorpe
  2021-09-22  3:40     ` Tian, Kevin
  2021-09-22 12:51     ` Liu, Yi L
@ 2021-10-01  6:13     ` David Gibson
  2021-10-01 12:22       ` Jason Gunthorpe
  2 siblings, 1 reply; 280+ messages in thread
From: David Gibson @ 2021-10-01  6:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, nicolinc

[-- Attachment #1: Type: text/plain, Size: 2246 bytes --]

On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> > 
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> > 
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> > 
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> 
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the
> actual geometry including any holes via a query.

So part of the point of specifying start/end addresses is that
explicitly querying holes shouldn't be necessary: if the requested
range crosses a hole, it should fail.  If you didn't really need all
that range, you shouldn't have asked for it.

Which means these aren't really "hints" but optionally supplied
constraints.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 280+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22  3:40     ` Tian,