LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
@ 2021-08-06 10:34 David Stevens
  2021-08-06 10:34 ` [PATCH v2 1/9] Revert "iommu: Allow the dma-iommu api to use bounce buffers" David Stevens
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

This patch series adds support for per-domain dynamic pools of iommu
bounce buffers to the dma-iommu API. This allows iommu mappings to be
reused while still maintaining strict iommu protection.

This bounce buffer support is used to add a new config option that, when
enabled, causes all non-direct streaming mappings below a configurable
size to go through the bounce buffers. This serves as an optimization on
systems where manipulating iommu mappings is very expensive. For
example, virtio-iommu operations in a guest on a linux host require a
vmexit, involvement the VMM, and a VFIO syscall. For relatively small
DMA operations, memcpy can be significantly faster.

As a performance comparison, on a device with an i5-10210U, I ran fio
with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
--rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
by >99%, as bounce buffers don't require syncing here in the read case.
Running with multiple jobs doesn't serve as a useful performance
comparison because virtio-iommu and vfio_iommu_type1 both have big
locks that significantly limit mulithreaded DMA performance.

These pooled bounce buffers are also used for subgranule mappings with
untrusted devices, replacing the single use bounce buffers used
currently. The biggest difference here is that the new implementation
maps a whole sglist using a single bounce buffer. The new implementation
does not support using bounce buffers for only some segments of the
sglist, so it may require more copying. However, the current
implementation requires per-segment iommu map/unmap operations for all
untrusted sglist mappings (fully aligned sglists included). On a 
i5-10210U laptop with the internal NVMe drive made to appear untrusted,
fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
a statistically significant decrease in CPU load from 2.28% -> 2.17%
with the new iommu bounce buffer optimization enabled.

Each domain's buffer pool is split into multiple power-of-2 size
classes. Each class allocates a fixed number of buffer slot metadata. A
large iova range is allocated, and each slot is assigned an iova from
the range. This allows the iova to be easily mapped back to the slot,
and allows the critical section of most pool operations to be constant
time. The one exception is finding a cached buffer to reuse. These are
only separated according to R/W permissions - the use of other
permissions such as IOMMU_PRIV may require a linear search through the
cache. However, these other permissions are rare and likely exhibit high
locality, so the should not be a bottleneck in practice.

Since untrusted devices may require bounce buffers, each domain has a
fallback rbtree to manage single use buffers. This may be necessary if a
very large number of DMA operations are simultaneously in-flight, or for
very large individual DMA operations.

This patch set does not use swiotlb. There are two primary ways in which
swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
allocates buffers to be compatible with a single device, whereas
per-domain buffer pools don't handle that during buffer allocation as a
single buffer may end up being used by multiple devices. Second, swiotlb
allocation establishes the original to bounce buffer mapping, which
again doesn't work if buffers can be reused. Effectively the only code
that can be shared between the two use cases is allocating slots from
the swiotlb's memory. However, given that we're going to be allocating
memory for use with an iommu, allocating memory from a block of memory
explicitly set aside to deal with a lack of iommu seems kind of
contradictory. At best there might be a small performance improvement if 
wiotlb allocation is faster than regular page allocation, but buffer
allocation isn't on the hot path anyway.

Not using the swiotlb has the benefit that memory doesn't have to be
preallocated. Instead, bounce buffers consume memory only for in-flight
dma transactions (ignoring temporarily cached buffers), which is the
smallest amount possible. This makes it easier to use bounce buffers as
an optimization on systems with large numbers of devices or in
situations where devices are unknown, since it is not necessary to try
to tune how much memory needs to be set aside to achieve good
performance without costing too much memory.

Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
is meant to address devices which create long lived streaming mappings
but manage CPU cache coherency without using the dma_sync_* APIs.
Currently, these devices don't function properly with swiotlb=force. The
new flag is used to bypass bounce buffers so such devices will function
when the new bounce buffer optimization is enabled. The flag is added to
the i915 driver, which creates such mappings. It can also be added to
various dma-buf implementations as an optimization, although that is not
done here.

v1 -> v2:
 - Replace existing untrusted bounce buffers with new bounce
   buffer pools. This includes significant rework to account for
   untrusted bounce buffers being required instead of an
   optimization.
 - Add flag for persistent streaming mappings.

David Stevens (9):
  Revert "iommu: Allow the dma-iommu api to use bounce buffers"
  dma-iommu: expose a few helper functions to module
  dma-iommu: bounce buffers for untrusted devices
  dma-iommu: remove extra buffer search on unmap
  dma-iommu: clear only necessary bytes
  dma-iommu: add bounce buffer pools
  dma-iommu: support iommu bounce buffer optimization
  dma-mapping: add persistent streaming mapping flag
  drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag

 drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c |   4 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c        |   3 +-
 drivers/iommu/Kconfig                      |  11 +
 drivers/iommu/Makefile                     |   2 +-
 drivers/iommu/dma-iommu.c                  | 268 ++++-----
 drivers/iommu/io-bounce-buffers.c          | 533 +++++++++++++++++
 drivers/iommu/io-bounce-buffers.h          |  49 ++
 drivers/iommu/io-buffer-manager.c          | 633 +++++++++++++++++++++
 drivers/iommu/io-buffer-manager.h          |  94 +++
 include/linux/dma-iommu.h                  |  12 +
 include/linux/dma-mapping.h                |  11 +
 11 files changed, 1460 insertions(+), 160 deletions(-)
 create mode 100644 drivers/iommu/io-bounce-buffers.c
 create mode 100644 drivers/iommu/io-bounce-buffers.h
 create mode 100644 drivers/iommu/io-buffer-manager.c
 create mode 100644 drivers/iommu/io-buffer-manager.h

-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/9] Revert "iommu: Allow the dma-iommu api to use bounce buffers"
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 10:34 ` [PATCH v2 2/9] dma-iommu: expose a few helper functions to module David Stevens
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

A new pooled bounce buffer implementation will be added to reduce IOMMU
interactions on platforms with slow IOMMUs. The new implementation can
also support using bounce buffers with untrusted devices, so the current
basic bounce buffer support can be reverted.

This reverts commit 82612d66d51d3bacdd789e31d2e875d2494b7514.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/dma-iommu.c | 152 ++++----------------------------------
 1 file changed, 13 insertions(+), 139 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 98ba927aee1a..1491b5450246 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,11 +20,9 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
-#include <linux/swiotlb.h>
 #include <linux/scatterlist.h>
 #include <linux/vmalloc.h>
 #include <linux/crash_dump.h>
-#include <linux/dma-direct.h>
 
 struct iommu_dma_msi_page {
 	struct list_head	list;
@@ -493,23 +491,6 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
 	iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
-static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
-		size_t size, enum dma_data_direction dir,
-		unsigned long attrs)
-{
-	struct iommu_domain *domain = iommu_get_dma_domain(dev);
-	phys_addr_t phys;
-
-	phys = iommu_iova_to_phys(domain, dma_addr);
-	if (WARN_ON(!phys))
-		return;
-
-	__iommu_dma_unmap(dev, dma_addr, size);
-
-	if (unlikely(is_swiotlb_buffer(phys)))
-		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
-}
-
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 		size_t size, int prot, u64 dma_mask)
 {
@@ -536,52 +517,6 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 	return iova + iova_off;
 }
 
-static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
-		size_t org_size, dma_addr_t dma_mask, bool coherent,
-		enum dma_data_direction dir, unsigned long attrs)
-{
-	int prot = dma_info_to_prot(dir, coherent, attrs);
-	struct iommu_domain *domain = iommu_get_dma_domain(dev);
-	struct iommu_dma_cookie *cookie = domain->iova_cookie;
-	struct iova_domain *iovad = &cookie->iovad;
-	size_t aligned_size = org_size;
-	void *padding_start;
-	size_t padding_size;
-	dma_addr_t iova;
-
-	/*
-	 * If both the physical buffer start address and size are
-	 * page aligned, we don't need to use a bounce page.
-	 */
-	if (IS_ENABLED(CONFIG_SWIOTLB) && dev_is_untrusted(dev) &&
-	    iova_offset(iovad, phys | org_size)) {
-		aligned_size = iova_align(iovad, org_size);
-		phys = swiotlb_tbl_map_single(dev, phys, org_size,
-					      aligned_size, dir, attrs);
-
-		if (phys == DMA_MAPPING_ERROR)
-			return DMA_MAPPING_ERROR;
-
-		/* Cleanup the padding area. */
-		padding_start = phys_to_virt(phys);
-		padding_size = aligned_size;
-
-		if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-		    (dir == DMA_TO_DEVICE ||
-		     dir == DMA_BIDIRECTIONAL)) {
-			padding_start += org_size;
-			padding_size -= org_size;
-		}
-
-		memset(padding_start, 0, padding_size);
-	}
-
-	iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-	if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
-		swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
-	return iova;
-}
-
 static void __iommu_dma_free_pages(struct page **pages, int count)
 {
 	while (count--)
@@ -776,15 +711,11 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev,
 {
 	phys_addr_t phys;
 
-	if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
+	if (dev_is_dma_coherent(dev))
 		return;
 
 	phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-	if (!dev_is_dma_coherent(dev))
-		arch_sync_dma_for_cpu(phys, size, dir);
-
-	if (is_swiotlb_buffer(phys))
-		swiotlb_sync_single_for_cpu(dev, phys, size, dir);
+	arch_sync_dma_for_cpu(phys, size, dir);
 }
 
 static void iommu_dma_sync_single_for_device(struct device *dev,
@@ -792,15 +723,11 @@ static void iommu_dma_sync_single_for_device(struct device *dev,
 {
 	phys_addr_t phys;
 
-	if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
+	if (dev_is_dma_coherent(dev))
 		return;
 
 	phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-	if (is_swiotlb_buffer(phys))
-		swiotlb_sync_single_for_device(dev, phys, size, dir);
-
-	if (!dev_is_dma_coherent(dev))
-		arch_sync_dma_for_device(phys, size, dir);
+	arch_sync_dma_for_device(phys, size, dir);
 }
 
 static void iommu_dma_sync_sg_for_cpu(struct device *dev,
@@ -810,17 +737,11 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
 	struct scatterlist *sg;
 	int i;
 
-	if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
+	if (dev_is_dma_coherent(dev))
 		return;
 
-	for_each_sg(sgl, sg, nelems, i) {
-		if (!dev_is_dma_coherent(dev))
-			arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
-
-		if (is_swiotlb_buffer(sg_phys(sg)))
-			swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
-						    sg->length, dir);
-	}
+	for_each_sg(sgl, sg, nelems, i)
+		arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 }
 
 static void iommu_dma_sync_sg_for_device(struct device *dev,
@@ -830,17 +751,11 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
 	struct scatterlist *sg;
 	int i;
 
-	if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
+	if (dev_is_dma_coherent(dev))
 		return;
 
-	for_each_sg(sgl, sg, nelems, i) {
-		if (is_swiotlb_buffer(sg_phys(sg)))
-			swiotlb_sync_single_for_device(dev, sg_phys(sg),
-						       sg->length, dir);
-
-		if (!dev_is_dma_coherent(dev))
-			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
-	}
+	for_each_sg(sgl, sg, nelems, i)
+		arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
 }
 
 static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
@@ -849,10 +764,10 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 {
 	phys_addr_t phys = page_to_phys(page) + offset;
 	bool coherent = dev_is_dma_coherent(dev);
+	int prot = dma_info_to_prot(dir, coherent, attrs);
 	dma_addr_t dma_handle;
 
-	dma_handle = __iommu_dma_map_swiotlb(dev, phys, size, dma_get_mask(dev),
-			coherent, dir, attrs);
+	dma_handle = __iommu_dma_map(dev, phys, size, prot, dma_get_mask(dev));
 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
 	    dma_handle != DMA_MAPPING_ERROR)
 		arch_sync_dma_for_device(phys, size, dir);
@@ -864,7 +779,7 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 {
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
-	__iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs);
+	__iommu_dma_unmap(dev, dma_handle, size);
 }
 
 /*
@@ -942,39 +857,6 @@ static void __invalidate_sg(struct scatterlist *sg, int nents)
 	}
 }
 
-static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *sg,
-		int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-	struct scatterlist *s;
-	int i;
-
-	for_each_sg(sg, s, nents, i)
-		__iommu_dma_unmap_swiotlb(dev, sg_dma_address(s),
-				sg_dma_len(s), dir, attrs);
-}
-
-static int iommu_dma_map_sg_swiotlb(struct device *dev, struct scatterlist *sg,
-		int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-	struct scatterlist *s;
-	int i;
-
-	for_each_sg(sg, s, nents, i) {
-		sg_dma_address(s) = __iommu_dma_map_swiotlb(dev, sg_phys(s),
-				s->length, dma_get_mask(dev),
-				dev_is_dma_coherent(dev), dir, attrs);
-		if (sg_dma_address(s) == DMA_MAPPING_ERROR)
-			goto out_unmap;
-		sg_dma_len(s) = s->length;
-	}
-
-	return nents;
-
-out_unmap:
-	iommu_dma_unmap_sg_swiotlb(dev, sg, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-	return 0;
-}
-
 /*
  * The DMA API client is passing in a scatterlist which could describe
  * any old buffer layout, but the IOMMU API requires everything to be
@@ -1002,9 +884,6 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
 
-	if (dev_is_untrusted(dev))
-		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
-
 	/*
 	 * Work out how much IOVA space we need, and align the segments to
 	 * IOVA granules for the IOMMU driver to handle. With some clever
@@ -1074,11 +953,6 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_cpu(dev, sg, nents, dir);
 
-	if (dev_is_untrusted(dev)) {
-		iommu_dma_unmap_sg_swiotlb(dev, sg, nents, dir, attrs);
-		return;
-	}
-
 	/*
 	 * The scatterlist segments are mapped into a single
 	 * contiguous IOVA allocation, so this is incredibly easy.
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 2/9] dma-iommu: expose a few helper functions to module
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
  2021-08-06 10:34 ` [PATCH v2 1/9] Revert "iommu: Allow the dma-iommu api to use bounce buffers" David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 17:28   ` kernel test robot
  2021-08-06 10:34 ` [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices David Stevens
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Expose a few helper functions from dma-iommu to the rest of the module.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/dma-iommu.c | 23 ++++++++++++-----------
 include/linux/dma-iommu.h |  8 ++++++++
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 1491b5450246..055ccda5eba1 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -412,7 +412,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
 	}
 }
 
-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
 		size_t size, u64 dma_limit, struct device *dev)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -452,7 +452,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	return (dma_addr_t)iova << shift;
 }
 
-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 		dma_addr_t iova, size_t size, struct page *freelist)
 {
 	struct iova_domain *iovad = &cookie->iovad;
@@ -488,7 +488,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
 
 	if (!cookie->fq_domain)
 		iommu_iotlb_sync(domain, &iotlb_gather);
-	iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
+	__iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -506,12 +506,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 
 	size = iova_align(iovad, size + iova_off);
 
-	iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
 	if (!iova)
 		return DMA_MAPPING_ERROR;
 
 	if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
-		iommu_dma_free_iova(cookie, iova, size, NULL);
+		__iommu_dma_free_iova(cookie, iova, size, NULL);
 		return DMA_MAPPING_ERROR;
 	}
 	return iova + iova_off;
@@ -617,7 +617,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 		return NULL;
 
 	size = iova_align(iovad, size);
-	iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+	iova = __iommu_dma_alloc_iova(domain, size,
+				      dev->coherent_dma_mask, dev);
 	if (!iova)
 		goto out_free_pages;
 
@@ -643,7 +644,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 out_free_sg:
 	sg_free_table(sgt);
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, size, NULL);
+	__iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_pages:
 	__iommu_dma_free_pages(pages, count);
 	return NULL;
@@ -923,7 +924,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		prev = s;
 	}
 
-	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+	iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_restore_sg;
 
@@ -937,7 +938,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	return __finalise_sg(dev, sg, nents, iova);
 
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+	__iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
 	__invalidate_sg(sg, nents);
 	return 0;
@@ -1226,7 +1227,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	if (!msi_page)
 		return NULL;
 
-	iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_free_page;
 
@@ -1240,7 +1241,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	return msi_page;
 
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, size, NULL);
+	__iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_page:
 	kfree(msi_page);
 	return NULL;
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 758ca4694257..50f676678318 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -42,6 +42,14 @@ void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
 
 extern bool iommu_dma_forcedac;
 
+struct iommu_dma_cookie;
+
+dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
+				  size_t size, dma_addr_t dma_limit,
+				  struct device *dev);
+void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+		dma_addr_t iova, size_t size, struct page *freelist);
+
 #else /* CONFIG_IOMMU_DMA */
 
 struct iommu_domain;
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
  2021-08-06 10:34 ` [PATCH v2 1/9] Revert "iommu: Allow the dma-iommu api to use bounce buffers" David Stevens
  2021-08-06 10:34 ` [PATCH v2 2/9] dma-iommu: expose a few helper functions to module David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 15:53   ` kernel test robot
  2021-08-10  1:19   ` Mi, Dapeng1
  2021-08-06 10:34 ` [PATCH v2 4/9] dma-iommu: remove extra buffer search on unmap David Stevens
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Add support for dynamic bounce buffers to the dma-api for use with
subgranule IOMMU mappings with untrusted devices. Bounce buffer
management is split into two parts. First, there is a buffer manager
that is responsible for allocating and tracking buffers. Second, there
is a layer that uses the managed buffers as bounce buffers. It is
responsible for managing the IOMMU mapping and for syncing between the
original and bounce buffers.

For now, buffer management is very simple - every mapping allocates a
new bounce buffer.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/Makefile            |   2 +-
 drivers/iommu/dma-iommu.c         |  70 +++++-
 drivers/iommu/io-bounce-buffers.c | 358 ++++++++++++++++++++++++++++++
 drivers/iommu/io-bounce-buffers.h |  46 ++++
 drivers/iommu/io-buffer-manager.c | 212 ++++++++++++++++++
 drivers/iommu/io-buffer-manager.h |  43 ++++
 6 files changed, 728 insertions(+), 3 deletions(-)
 create mode 100644 drivers/iommu/io-bounce-buffers.c
 create mode 100644 drivers/iommu/io-bounce-buffers.h
 create mode 100644 drivers/iommu/io-buffer-manager.c
 create mode 100644 drivers/iommu/io-buffer-manager.h

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..4edaf7adc082 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
-obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
+obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o io-bounce-buffers.o io-buffer-manager.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 055ccda5eba1..908eb6fb7dc3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -24,6 +24,8 @@
 #include <linux/vmalloc.h>
 #include <linux/crash_dump.h>
 
+#include "io-bounce-buffers.h"
+
 struct iommu_dma_msi_page {
 	struct list_head	list;
 	dma_addr_t		iova;
@@ -44,6 +46,7 @@ struct iommu_dma_cookie {
 		dma_addr_t		msi_iova;
 	};
 	struct list_head		msi_page_list;
+	struct io_bounce_buffers	*bounce_buffers;
 
 	/* Domain for flush queue callback; NULL if flush queue not in use */
 	struct iommu_domain		*fq_domain;
@@ -81,6 +84,14 @@ static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
 	return PAGE_SIZE;
 }
 
+static struct io_bounce_buffers *dev_to_io_bounce_buffers(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+
+	return cookie->bounce_buffers;
+}
+
 static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
 {
 	struct iommu_dma_cookie *cookie;
@@ -160,6 +171,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 	if (!cookie)
 		return;
 
+	if (cookie->bounce_buffers)
+		io_bounce_buffers_destroy(cookie->bounce_buffers);
+
 	if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
 		put_iova_domain(&cookie->iovad);
 
@@ -333,6 +347,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	unsigned long order, base_pfn;
 	struct iova_domain *iovad;
+	int ret;
 
 	if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
 		return -EINVAL;
@@ -380,7 +395,16 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	if (!dev)
 		return 0;
 
-	return iova_reserve_iommu_regions(dev, domain);
+	ret = iova_reserve_iommu_regions(dev, domain);
+
+	if (ret == 0 && dev_is_untrusted(dev)) {
+		cookie->bounce_buffers =
+			io_bounce_buffers_init(dev, domain, iovad);
+		if (IS_ERR(cookie->bounce_buffers))
+			ret = PTR_ERR(cookie->bounce_buffers);
+	}
+
+	return ret;
 }
 
 /**
@@ -710,8 +734,13 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
 static void iommu_dma_sync_single_for_cpu(struct device *dev,
 		dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	phys_addr_t phys;
 
+	if (bounce && io_bounce_buffers_sync_single(bounce, dma_handle,
+						    size, dir, true))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -722,8 +751,13 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev,
 static void iommu_dma_sync_single_for_device(struct device *dev,
 		dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	phys_addr_t phys;
 
+	if (bounce && io_bounce_buffers_sync_single(bounce, dma_handle,
+						    size, dir, false))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -735,9 +769,13 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
 		struct scatterlist *sgl, int nelems,
 		enum dma_data_direction dir)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	struct scatterlist *sg;
 	int i;
 
+	if (bounce && io_bounce_buffers_sync_sg(bounce, sgl, nelems, dir, true))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -749,9 +787,14 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
 		struct scatterlist *sgl, int nelems,
 		enum dma_data_direction dir)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	struct scatterlist *sg;
 	int i;
 
+	if (bounce && io_bounce_buffers_sync_sg(bounce, sgl,
+						nelems, dir, false))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -763,11 +806,19 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 		unsigned long offset, size_t size, enum dma_data_direction dir,
 		unsigned long attrs)
 {
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	phys_addr_t phys = page_to_phys(page) + offset;
 	bool coherent = dev_is_dma_coherent(dev);
 	int prot = dma_info_to_prot(dir, coherent, attrs);
 	dma_addr_t dma_handle;
 
+	if (cookie->bounce_buffers &&
+	    io_bounce_buffers_map_page(cookie->bounce_buffers, dev, page,
+				       offset, size, prot, dir, attrs,
+				       &dma_handle))
+		return dma_handle;
+
 	dma_handle = __iommu_dma_map(dev, phys, size, prot, dma_get_mask(dev));
 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
 	    dma_handle != DMA_MAPPING_ERROR)
@@ -778,6 +829,12 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
+
+	if (bounce &&
+	    io_bounce_buffers_unmap_page(bounce, dma_handle, size, dir, attrs))
+		return;
+
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
 	__iommu_dma_unmap(dev, dma_handle, size);
@@ -876,12 +933,17 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
-	int i;
+	int i, ret;
 
 	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
 	    iommu_deferred_attach(dev, domain))
 		return 0;
 
+	if (cookie->bounce_buffers &&
+	    io_bounce_buffers_map_sg(cookie->bounce_buffers, dev, sg, nents,
+				     prot, dir, attrs, &ret))
+		return ret;
+
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
 
@@ -947,10 +1009,14 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	dma_addr_t start, end;
 	struct scatterlist *tmp;
 	int i;
 
+	if (bounce && io_bounce_buffers_unmap_sg(bounce, sg, nents, dir, attrs))
+		return;
+
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_cpu(dev, sg, nents, dir);
 
diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
new file mode 100644
index 000000000000..78b4440b58c8
--- /dev/null
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Dynamic IOMMU mapped bounce buffers.
+ *
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#include <linux/dma-iommu.h>
+#include <linux/dma-map-ops.h>
+#include <linux/highmem.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+
+#include "io-buffer-manager.h"
+#include "io-bounce-buffers.h"
+
+struct io_bounce_buffers {
+	struct iommu_domain *domain;
+	struct iova_domain *iovad;
+	unsigned int nid;
+	struct io_buffer_manager manager;
+};
+
+bool io_bounce_buffers_release_buffer_cb(struct io_buffer_manager *manager,
+					 dma_addr_t iova, size_t size)
+{
+	struct io_bounce_buffers *buffers =
+		container_of(manager, struct io_bounce_buffers, manager);
+	return iommu_unmap(buffers->domain, iova, size) >= size;
+}
+
+struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
+						 struct iommu_domain *domain,
+						 struct iova_domain *iovad)
+{
+	int ret;
+	struct io_bounce_buffers *buffers;
+
+	buffers = kzalloc(sizeof(*buffers), GFP_KERNEL);
+	if (!buffers)
+		return ERR_PTR(-ENOMEM);
+
+	ret = io_buffer_manager_init(&buffers->manager);
+	if (ret) {
+		kfree(buffers);
+		return ERR_PTR(ret);
+	}
+
+	buffers->domain = domain;
+	buffers->iovad = iovad;
+	buffers->nid = dev_to_node(dev);
+
+	return buffers;
+}
+
+void io_bounce_buffers_destroy(struct io_bounce_buffers *buffers)
+{
+	kfree(buffers);
+}
+
+static bool should_sync_buffer(enum dma_data_direction dir, bool sync_for_cpu)
+{
+	return dir == DMA_BIDIRECTIONAL ||
+	       (dir == DMA_FROM_DEVICE && sync_for_cpu) ||
+	       (dir == DMA_TO_DEVICE && !sync_for_cpu);
+}
+
+static void io_bounce_buffers_do_sync(struct io_bounce_buffers *buffers,
+				      struct page **bounce_buffer,
+				      size_t bounce_offset, struct page *orig,
+				      size_t orig_offset, size_t size,
+				      enum dma_data_direction dir, int prot,
+				      bool sync_for_cpu)
+{
+	bool needs_bounce_sync = should_sync_buffer(dir, sync_for_cpu);
+	char *orig_lowmem_ptr;
+	bool dma_is_coherent = prot & IOMMU_CACHE;
+
+	if (dma_is_coherent && !needs_bounce_sync)
+		return;
+
+	orig_lowmem_ptr = PageHighMem(orig) ? NULL : page_to_virt(orig);
+
+	while (size) {
+		size_t copy_len, bounce_page_offset;
+		struct page *bounce_page;
+
+		bounce_page = bounce_buffer[bounce_offset / PAGE_SIZE];
+		bounce_page_offset = bounce_offset % PAGE_SIZE;
+
+		copy_len = size;
+		if (copy_len + bounce_page_offset > PAGE_SIZE)
+			copy_len = PAGE_SIZE - bounce_page_offset;
+
+		if (!dma_is_coherent && sync_for_cpu) {
+			phys_addr_t paddr = page_to_phys(bounce_page);
+
+			arch_sync_dma_for_cpu(paddr + bounce_page_offset,
+					      copy_len, dir);
+		}
+
+		if (needs_bounce_sync) {
+			char *bounce_page_ptr = kmap_local_page(bounce_page);
+			char *bounce_ptr = bounce_page_ptr + bounce_page_offset;
+
+			if (!orig_lowmem_ptr) {
+				size_t remaining = copy_len;
+				size_t offset = orig_offset % PAGE_SIZE;
+				size_t orig_page_idx = orig_offset / PAGE_SIZE;
+
+				while (remaining) {
+					char *orig_ptr;
+					size_t sz = min(remaining,
+							PAGE_SIZE - offset);
+
+					orig_ptr = kmap_local_page(
+						nth_page(orig, orig_page_idx));
+					if (sync_for_cpu) {
+						memcpy(orig_ptr + offset,
+						       bounce_ptr, sz);
+					} else {
+						memcpy(bounce_ptr,
+						       orig_ptr + offset, sz);
+					}
+					kunmap_local(orig_ptr);
+
+					remaining -= sz;
+					orig_page_idx += 1;
+					bounce_ptr += sz;
+					offset = 0;
+				}
+			} else if (sync_for_cpu) {
+				memcpy(orig_lowmem_ptr + orig_offset,
+				       bounce_ptr, copy_len);
+			} else {
+				memcpy(bounce_ptr,
+				       orig_lowmem_ptr + orig_offset, copy_len);
+			}
+
+			kunmap_local(bounce_page_ptr);
+		}
+
+		if (!dma_is_coherent && !sync_for_cpu) {
+			phys_addr_t paddr = page_to_phys(bounce_page);
+
+			arch_sync_dma_for_device(paddr + bounce_page_offset,
+						 copy_len, dir);
+		}
+
+		bounce_offset += copy_len;
+		orig_offset += copy_len;
+		size -= copy_len;
+	}
+}
+
+bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
+				   dma_addr_t dma_handle, size_t size,
+				   enum dma_data_direction dir,
+				   bool sync_for_cpu)
+{
+	struct io_bounce_buffer_info info;
+	void *orig_buffer;
+	size_t offset;
+	int prot;
+
+	if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle, &info,
+					   &orig_buffer, &prot))
+		return false;
+
+	offset = dma_handle - info.iova;
+	io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
+				  orig_buffer, offset, size, dir, prot,
+				  sync_for_cpu);
+	return true;
+}
+
+static void __io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
+					struct scatterlist *sgl, int nents,
+					struct page **bounce_buffer,
+					enum dma_data_direction dir, int prot,
+					bool sync_for_cpu)
+{
+	size_t bounce_offset = 0;
+	struct scatterlist *iter;
+	int i;
+
+	for_each_sg(sgl, iter, nents, i) {
+		io_bounce_buffers_do_sync(buffers, bounce_buffer, bounce_offset,
+					  sg_page(iter), iter->offset,
+					  iter->length, dir, prot,
+					  sync_for_cpu);
+		bounce_offset += iter->length;
+	}
+}
+
+bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
+			       struct scatterlist *sgl, int nents,
+			       enum dma_data_direction dir, bool sync_for_cpu)
+{
+	struct io_bounce_buffer_info info;
+	void *orig_buffer;
+	int prot;
+
+	if (!io_buffer_manager_find_buffer(&buffers->manager,
+					   sg_dma_address(sgl), &info,
+					   &orig_buffer, &prot))
+		return false;
+
+	// In the non bounce buffer case, iommu_dma_map_sg syncs before setting
+	// up the new mapping's dma address. This check handles false positives
+	// in find_buffer caused by sgl being reused for a non bounce buffer
+	// case after being used with a bounce buffer.
+	if (orig_buffer != sgl)
+		return false;
+
+	__io_bounce_buffers_sync_sg(buffers, sgl, nents, info.bounce_buffer,
+				    dir, prot, sync_for_cpu);
+
+	return true;
+}
+
+bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
+				  dma_addr_t handle, size_t size,
+				  enum dma_data_direction dir,
+				  unsigned long attrs)
+{
+	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		io_bounce_buffers_sync_single(buffers, handle, size, dir, true);
+
+	return io_buffer_manager_release_buffer(&buffers->manager,
+						buffers->domain, handle, true);
+}
+
+bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
+				struct scatterlist *sgl, int nents,
+				enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		io_bounce_buffers_sync_sg(buffers, sgl, nents, dir, true);
+
+	return io_buffer_manager_release_buffer(
+		&buffers->manager, buffers->domain, sg_dma_address(sgl), true);
+}
+
+static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
+					 struct io_bounce_buffer_info *info,
+					 int prot)
+{
+	unsigned int count = info->size >> PAGE_SHIFT;
+	struct sg_table sgt;
+	size_t mapped;
+
+	if (sg_alloc_table_from_pages(&sgt, info->bounce_buffer, count, 0,
+				      info->size, GFP_ATOMIC))
+		return false;
+
+	mapped = iommu_map_sg_atomic(buffers->domain, info->iova, sgt.sgl,
+				     sgt.orig_nents, prot);
+
+	sg_free_table(&sgt);
+	return mapped >= info->size;
+}
+
+bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
+				struct device *dev, struct page *page,
+				unsigned long offset, size_t size, int prot,
+				enum dma_data_direction dir,
+				unsigned long attrs, dma_addr_t *handle)
+{
+	bool skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
+	struct io_bounce_buffer_info info;
+	bool force_bounce = iova_offset(buffers->iovad, offset | size);
+
+	if (!force_bounce)
+		return false;
+
+	*handle = DMA_MAPPING_ERROR;
+	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, page,
+					    offset + size, prot, buffers->nid,
+					    &info))
+		return true;
+
+	if (!skip_cpu_sync)
+		io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
+					  page, offset, size, dir, prot, false);
+
+	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
+		io_buffer_manager_release_buffer(
+			&buffers->manager, buffers->domain, info.iova, false);
+		return true;
+	}
+
+	*handle = info.iova + offset;
+	return true;
+}
+
+bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
+			      struct device *dev, struct scatterlist *sgl,
+			      int nents, int prot, enum dma_data_direction dir,
+			      unsigned long attrs, int *out_nents)
+{
+	struct io_bounce_buffer_info info;
+	struct scatterlist *iter;
+	size_t size = 0;
+	bool skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
+	dma_addr_t seg_iova;
+	int i;
+	bool force_bounce = false;
+
+	for_each_sg(sgl, iter, nents, i) {
+		size += iter->length;
+		force_bounce |= iova_offset(buffers->iovad,
+					    iter->offset | iter->length);
+	}
+
+	if (!force_bounce)
+		return false;
+
+	*out_nents = 0;
+	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, sgl, size,
+					    prot, buffers->nid, &info))
+		return true;
+
+	if (!skip_cpu_sync)
+		__io_bounce_buffers_sync_sg(buffers, sgl, nents,
+					    info.bounce_buffer, dir, prot,
+					    false);
+
+	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
+		io_buffer_manager_release_buffer(
+			&buffers->manager, buffers->domain, info.iova, false);
+		return true;
+	}
+
+	i = 0;
+	seg_iova = info.iova;
+	while (size > 0) {
+		size_t seg_size = min_t(size_t, size,
+					dma_get_max_seg_size(dev));
+
+		sg_dma_len(sgl) = seg_size;
+		sg_dma_address(sgl) = seg_iova;
+
+		sgl = sg_next(sgl);
+		size -= seg_size;
+		seg_iova += seg_size;
+		i++;
+	}
+
+	if (sgl) {
+		sg_dma_address(sgl) = DMA_MAPPING_ERROR;
+		sg_dma_len(sgl) = 0;
+	}
+
+	*out_nents = i;
+	return true;
+}
diff --git a/drivers/iommu/io-bounce-buffers.h b/drivers/iommu/io-bounce-buffers.h
new file mode 100644
index 000000000000..6d132a27646c
--- /dev/null
+++ b/drivers/iommu/io-bounce-buffers.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#ifndef _LINUX_IO_BOUNCE_BUFFERS_H
+#define _LINUX_IO_BOUNCE_BUFFERS_H
+
+#include <linux/dma-iommu.h>
+#include <linux/iova.h>
+
+struct io_bounce_buffers;
+
+struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
+						 struct iommu_domain *domain,
+						 struct iova_domain *iovad);
+void io_bounce_buffers_destroy(struct io_bounce_buffers *buffers);
+
+bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
+				   dma_addr_t dma_handle, size_t size,
+				   enum dma_data_direction dir,
+				   bool sync_for_cpu);
+bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
+			       struct scatterlist *sgl, int nents,
+			       enum dma_data_direction dir, bool sync_for_cpu);
+
+bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
+				struct device *dev, struct page *page,
+				unsigned long offset, size_t size, int prot,
+				enum dma_data_direction dir,
+				unsigned long attrs, dma_addr_t *handle);
+bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
+			      struct device *dev, struct scatterlist *sgl,
+			      int nents, int prot, enum dma_data_direction dir,
+			      unsigned long attrs, int *out_nents);
+
+bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
+				  dma_addr_t handle, size_t size,
+				  enum dma_data_direction dir,
+				  unsigned long attrs);
+bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
+				struct scatterlist *sgl, int nents,
+				enum dma_data_direction dir,
+				unsigned long attrs);
+
+#endif /* _LINUX_IO_BOUNCE_BUFFERS_H */
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
new file mode 100644
index 000000000000..24e95a2faa37
--- /dev/null
+++ b/drivers/iommu/io-buffer-manager.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Manager which allocates and tracks bounce buffers and their IOVAs. Does
+ * not actually manage the IOMMU mapping nor do the bounce copies.
+ *
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#include "io-buffer-manager.h"
+
+#include <linux/slab.h>
+
+struct io_buffer_node {
+	struct rb_node node;
+	struct io_bounce_buffer_info info;
+	void *orig_buffer;
+	int prot;
+};
+
+static void io_buffer_manager_free_pages(struct page **pages, int count)
+{
+	while (count--)
+		__free_page(pages[count]);
+	kfree(pages);
+}
+
+static struct page **io_buffer_manager_alloc_pages(int count, unsigned int nid)
+{
+	struct page **pages;
+	unsigned int i;
+
+	pages = kmalloc_array(count, sizeof(*pages), GFP_ATOMIC);
+	if (!pages)
+		return NULL;
+
+	// The IOMMU can map highmem pages, but try to allocate non-highmem
+	// pages first to make accessing the buffer cheaper.
+	for (i = 0; i < count; i++) {
+		pages[i] = alloc_pages_node(
+			nid,
+			GFP_ATOMIC | __GFP_ZERO | __GFP_NORETRY | __GFP_NOWARN,
+			0);
+		if (!pages[i]) {
+			pages[i] = alloc_pages_node(
+				nid, GFP_ATOMIC | __GFP_ZERO | __GFP_HIGHMEM,
+				0);
+			if (!pages[i]) {
+				io_buffer_manager_free_pages(pages, i);
+				return NULL;
+			}
+		}
+	}
+
+	return pages;
+}
+
+struct io_buffer_node *find_fallback_node(struct rb_root *root, dma_addr_t iova)
+{
+	struct rb_node *node = root->rb_node;
+
+	while (node) {
+		struct io_buffer_node *cur =
+			container_of(node, struct io_buffer_node, node);
+
+		if (iova < cur->info.iova)
+			node = node->rb_left;
+		else if (iova >= cur->info.iova + cur->info.size)
+			node = node->rb_right;
+		else
+			return cur;
+	}
+	return NULL;
+}
+
+bool insert_fallback_node(struct rb_root *root, struct io_buffer_node *node)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+	dma_addr_t node_end = node->info.iova + node->info.size;
+
+	while (*new) {
+		struct io_buffer_node *cur =
+			container_of(*new, struct io_buffer_node, node);
+		dma_addr_t cur_end = cur->info.iova + cur->info.size;
+
+		parent = *new;
+		if (node_end <= cur->info.iova)
+			new = &((*new)->rb_left);
+		else if (node->info.iova >= cur_end)
+			new = &((*new)->rb_right);
+		else {
+			pr_crit("IOVA collision new=[%llx,%llx) old=[%llx,%llx)\n",
+				node->info.iova, node_end, cur->info.iova,
+				cur_end);
+			return false;
+		}
+	}
+
+	rb_link_node(&node->node, parent, new);
+	rb_insert_color(&node->node, root);
+	return true;
+}
+
+bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
+				    struct device *dev, void *orig_buffer,
+				    size_t size, int prot, unsigned int nid,
+				    struct io_bounce_buffer_info *info)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct io_buffer_node *node;
+	unsigned long flags;
+
+	node = kzalloc(sizeof(*node), GFP_ATOMIC);
+	if (!node)
+		return false;
+
+	size = PAGE_ALIGN(size);
+	node->info.iova =
+		__iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+	if (!node->info.iova)
+		goto free_node;
+
+	node->info.bounce_buffer =
+		io_buffer_manager_alloc_pages(size >> PAGE_SHIFT, nid);
+	if (!node->info.bounce_buffer)
+		goto free_iova;
+
+	spin_lock_irqsave(&manager->fallback_lock, flags);
+	if (!insert_fallback_node(&manager->fallback_buffers, node))
+		goto fallback_lock_unlock;
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+
+	node->orig_buffer = orig_buffer;
+	node->prot = prot;
+	node->info.size = size;
+
+	*info = node->info;
+
+	return true;
+
+fallback_lock_unlock:
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+free_iova:
+	__iommu_dma_free_iova(domain->iova_cookie, node->info.iova, size, NULL);
+free_node:
+	kfree(node);
+	return false;
+}
+
+bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
+				   dma_addr_t handle,
+				   struct io_bounce_buffer_info *info,
+				   void **orig_buffer, int *prot)
+{
+	struct io_buffer_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&manager->fallback_lock, flags);
+	node = find_fallback_node(&manager->fallback_buffers, handle);
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+
+	if (!node)
+		return false;
+
+	*info = node->info;
+	*orig_buffer = node->orig_buffer;
+	*prot = node->prot;
+	return true;
+}
+
+bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
+				      struct iommu_domain *domain,
+				      dma_addr_t handle, bool inited)
+{
+	struct io_buffer_node *node;
+	unsigned long flags;
+	bool free_buffer;
+
+	spin_lock_irqsave(&manager->fallback_lock, flags);
+	node = find_fallback_node(&manager->fallback_buffers, handle);
+	if (node)
+		rb_erase(&node->node, &manager->fallback_buffers);
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+
+	if (!node)
+		return false;
+
+	if (inited)
+		free_buffer = io_bounce_buffers_release_buffer_cb(
+			manager, node->info.iova, node->info.size);
+	else
+		free_buffer = true;
+
+	if (free_buffer) {
+		io_buffer_manager_free_pages(node->info.bounce_buffer,
+					     node->info.size >> PAGE_SHIFT);
+		__iommu_dma_free_iova(domain->iova_cookie, node->info.iova,
+				      node->info.size, NULL);
+	} else {
+		pr_warn("Bounce buffer release failed; leaking buffer\n");
+	}
+
+	kfree(node);
+	return true;
+}
+
+int io_buffer_manager_init(struct io_buffer_manager *manager)
+{
+	manager->fallback_buffers = RB_ROOT;
+	spin_lock_init(&manager->fallback_lock);
+
+	return 0;
+}
diff --git a/drivers/iommu/io-buffer-manager.h b/drivers/iommu/io-buffer-manager.h
new file mode 100644
index 000000000000..aae560cc8512
--- /dev/null
+++ b/drivers/iommu/io-buffer-manager.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#ifndef _LINUX_IO_BUFFER_MANAGER_H
+#define _LINUX_IO_BUFFER_MANAGER_H
+
+#include <linux/dma-iommu.h>
+#include <linux/iova.h>
+#include <linux/spinlock.h>
+
+struct io_buffer_manager {
+	spinlock_t fallback_lock;
+	struct rb_root fallback_buffers;
+};
+
+struct io_bounce_buffer_info {
+	struct page **bounce_buffer;
+	dma_addr_t iova;
+	size_t size;
+};
+
+bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
+				    struct device *dev, void *orig_buffer,
+				    size_t size, int prot, unsigned int nid,
+				    struct io_bounce_buffer_info *info);
+
+bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
+				   dma_addr_t handle,
+				   struct io_bounce_buffer_info *info,
+				   void **orig_buffer, int *prot);
+
+bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
+				      struct iommu_domain *domain,
+				      dma_addr_t handle, bool inited);
+
+int io_buffer_manager_init(struct io_buffer_manager *manager);
+
+bool io_bounce_buffers_release_buffer_cb(struct io_buffer_manager *manager,
+					 dma_addr_t iova, size_t size);
+
+#endif /* _LINUX_IO_BUFFER_MANAGER_H */
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 4/9] dma-iommu: remove extra buffer search on unmap
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (2 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 10:34 ` [PATCH v2 5/9] dma-iommu: clear only necessary bytes David Stevens
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Add callback to buffer manager's removal function so that the buffer can
be sync'ed during unmap without an extra find operation.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/io-bounce-buffers.c | 87 +++++++++++++++++++++++++------
 drivers/iommu/io-buffer-manager.c |  6 ++-
 drivers/iommu/io-buffer-manager.h |  6 ++-
 3 files changed, 81 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
index 78b4440b58c8..c7c52a3f8bf7 100644
--- a/drivers/iommu/io-bounce-buffers.c
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -153,6 +153,20 @@ static void io_bounce_buffers_do_sync(struct io_bounce_buffers *buffers,
 	}
 }
 
+static void __io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
+					    dma_addr_t dma_handle, size_t size,
+					    struct io_bounce_buffer_info *info,
+					    struct page *orig_buffer, int prot,
+					    enum dma_data_direction dir,
+					    bool sync_for_cpu)
+{
+	size_t offset = dma_handle - info->iova;
+
+	io_bounce_buffers_do_sync(buffers, info->bounce_buffer, offset,
+				  orig_buffer, offset, size, dir, prot,
+				  sync_for_cpu);
+}
+
 bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
 				   dma_addr_t dma_handle, size_t size,
 				   enum dma_data_direction dir,
@@ -160,17 +174,14 @@ bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
 {
 	struct io_bounce_buffer_info info;
 	void *orig_buffer;
-	size_t offset;
 	int prot;
 
 	if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle, &info,
 					   &orig_buffer, &prot))
 		return false;
 
-	offset = dma_handle - info.iova;
-	io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
-				  orig_buffer, offset, size, dir, prot,
-				  sync_for_cpu);
+	__io_bounce_buffers_sync_single(buffers, dma_handle, size, &info,
+					orig_buffer, prot, dir, sync_for_cpu);
 	return true;
 }
 
@@ -219,16 +230,56 @@ bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
 	return true;
 }
 
+struct unmap_sync_args {
+	struct io_bounce_buffers *buffers;
+	unsigned long attrs;
+	enum dma_data_direction dir;
+	dma_addr_t handle;
+	size_t size;
+	int nents;
+};
+
+static void
+io_bounce_buffers_unmap_page_sync(struct io_bounce_buffer_info *info, int prot,
+				  void *orig_buffer, void *ctx)
+{
+	struct unmap_sync_args *args = ctx;
+
+	if (args->attrs & DMA_ATTR_SKIP_CPU_SYNC)
+		return;
+
+	__io_bounce_buffers_sync_single(args->buffers, args->handle, args->size,
+					info, orig_buffer, prot, args->dir,
+					true);
+}
+
 bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
 				  dma_addr_t handle, size_t size,
 				  enum dma_data_direction dir,
 				  unsigned long attrs)
 {
-	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		io_bounce_buffers_sync_single(buffers, handle, size, dir, true);
+	struct unmap_sync_args args = { .buffers = buffers,
+					.attrs = attrs,
+					.dir = dir,
+					.handle = handle,
+					.size = size };
+
+	return io_buffer_manager_release_buffer(
+		&buffers->manager, buffers->domain, handle, true,
+		io_bounce_buffers_unmap_page_sync, &args);
+}
+
+static void io_bounce_buffers_unmap_sg_sync(struct io_bounce_buffer_info *info,
+					    int prot, void *orig_buffer,
+					    void *ctx)
+{
+	struct unmap_sync_args *args = ctx;
+
+	if (args->attrs & DMA_ATTR_SKIP_CPU_SYNC)
+		return;
 
-	return io_buffer_manager_release_buffer(&buffers->manager,
-						buffers->domain, handle, true);
+	__io_bounce_buffers_sync_sg(args->buffers, orig_buffer, args->nents,
+				    info->bounce_buffer, args->dir, prot, true);
 }
 
 bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
@@ -236,11 +287,13 @@ bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
 				enum dma_data_direction dir,
 				unsigned long attrs)
 {
-	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		io_bounce_buffers_sync_sg(buffers, sgl, nents, dir, true);
+	struct unmap_sync_args args = {
+		.buffers = buffers, .attrs = attrs, .dir = dir, .nents = nents
+	};
 
 	return io_buffer_manager_release_buffer(
-		&buffers->manager, buffers->domain, sg_dma_address(sgl), true);
+		&buffers->manager, buffers->domain, sg_dma_address(sgl), true,
+		io_bounce_buffers_unmap_sg_sync, &args);
 }
 
 static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
@@ -286,8 +339,9 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 					  page, offset, size, dir, prot, false);
 
 	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
-		io_buffer_manager_release_buffer(
-			&buffers->manager, buffers->domain, info.iova, false);
+		io_buffer_manager_release_buffer(&buffers->manager,
+						 buffers->domain, info.iova,
+						 false, NULL, NULL);
 		return true;
 	}
 
@@ -328,8 +382,9 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 					    false);
 
 	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
-		io_buffer_manager_release_buffer(
-			&buffers->manager, buffers->domain, info.iova, false);
+		io_buffer_manager_release_buffer(&buffers->manager,
+						 buffers->domain, info.iova,
+						 false, NULL, NULL);
 		return true;
 	}
 
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
index 24e95a2faa37..79b9759da928 100644
--- a/drivers/iommu/io-buffer-manager.c
+++ b/drivers/iommu/io-buffer-manager.c
@@ -169,7 +169,8 @@ bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
 
 bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 				      struct iommu_domain *domain,
-				      dma_addr_t handle, bool inited)
+				      dma_addr_t handle, bool inited,
+				      prerelease_cb cb, void *ctx)
 {
 	struct io_buffer_node *node;
 	unsigned long flags;
@@ -184,6 +185,9 @@ bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 	if (!node)
 		return false;
 
+	if (cb)
+		cb(&node->info, node->prot, node->orig_buffer, ctx);
+
 	if (inited)
 		free_buffer = io_bounce_buffers_release_buffer_cb(
 			manager, node->info.iova, node->info.size);
diff --git a/drivers/iommu/io-buffer-manager.h b/drivers/iommu/io-buffer-manager.h
index aae560cc8512..0e75d89926ca 100644
--- a/drivers/iommu/io-buffer-manager.h
+++ b/drivers/iommu/io-buffer-manager.h
@@ -31,9 +31,13 @@ bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
 				   struct io_bounce_buffer_info *info,
 				   void **orig_buffer, int *prot);
 
+typedef void (*prerelease_cb)(struct io_bounce_buffer_info *info, int prot,
+			      void *orig_buffer, void *ctx);
+
 bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 				      struct iommu_domain *domain,
-				      dma_addr_t handle, bool inited);
+				      dma_addr_t handle, bool inited,
+				      prerelease_cb cb, void *ctx);
 
 int io_buffer_manager_init(struct io_buffer_manager *manager);
 
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 5/9] dma-iommu: clear only necessary bytes
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (3 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 4/9] dma-iommu: remove extra buffer search on unmap David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 10:34 ` [PATCH v2 6/9] dma-iommu: add bounce buffer pools David Stevens
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Only clear the padding bytes in bounce buffers, since syncing from the
original buffer already overwrites the non-padding bytes.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/io-bounce-buffers.c | 64 +++++++++++++++++++++++++++++--
 drivers/iommu/io-buffer-manager.c |  7 +---
 2 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
index c7c52a3f8bf7..ed05f593a195 100644
--- a/drivers/iommu/io-bounce-buffers.c
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -296,14 +296,70 @@ bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
 		io_bounce_buffers_unmap_sg_sync, &args);
 }
 
+static void io_bounce_buffers_clear_padding(struct io_bounce_buffer_info *info,
+					    size_t pad_hd_end,
+					    size_t pad_tl_start)
+{
+	size_t idx, pad_hd_idx, pad_tl_idx, count;
+
+	count = info->size / PAGE_SIZE;
+	pad_hd_idx = pad_hd_end / PAGE_SIZE;
+	pad_tl_idx = pad_tl_start / PAGE_SIZE;
+
+	if (!IS_ALIGNED(pad_hd_end, PAGE_SIZE)) {
+		struct page *page = info->bounce_buffer[pad_hd_idx];
+		size_t len = offset_in_page(pad_hd_end);
+
+		memset_page(page, 0, 0, len);
+		arch_sync_dma_for_device(page_to_phys(page), 0, len);
+	}
+
+	if (!IS_ALIGNED(pad_tl_start, PAGE_SIZE)) {
+		size_t off = offset_in_page(pad_tl_start);
+		size_t len = PAGE_SIZE - off;
+		struct page *page = info->bounce_buffer[pad_tl_idx];
+
+		memset_page(page, off, 0, len);
+		arch_sync_dma_for_device(page_to_phys(page) + off, 0, len);
+
+		pad_tl_idx++;
+	}
+
+	idx = pad_hd_idx ? 0 : pad_tl_idx;
+	while (idx < count) {
+		struct page *page = info->bounce_buffer[idx++];
+
+		clear_highpage(page);
+		arch_sync_dma_for_device(page_to_phys(page), 0, PAGE_SIZE);
+		if (idx == pad_hd_idx)
+			idx = pad_tl_idx;
+	}
+}
+
 static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
 					 struct io_bounce_buffer_info *info,
-					 int prot)
+					 int prot, bool skiped_sync,
+					 size_t offset, size_t orig_size)
 {
 	unsigned int count = info->size >> PAGE_SHIFT;
 	struct sg_table sgt;
 	size_t mapped;
 
+	if (offset || offset + orig_size < info->size || skiped_sync) {
+		// Ensure that nothing is leaked to untrusted devices when
+		// mapping the buffer by clearing any part of the bounce buffer
+		// that wasn't already cleared by syncing.
+		size_t pad_hd_end, pad_tl_start;
+
+		if (skiped_sync) {
+			pad_hd_end = pad_tl_start = 0;
+		} else {
+			pad_hd_end = offset;
+			pad_tl_start = offset + orig_size;
+		}
+		io_bounce_buffers_clear_padding(info, pad_hd_end, pad_tl_start);
+	}
+
 	if (sg_alloc_table_from_pages(&sgt, info->bounce_buffer, count, 0,
 				      info->size, GFP_ATOMIC))
 		return false;
@@ -338,7 +394,8 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 		io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
 					  page, offset, size, dir, prot, false);
 
-	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
+	if (!io_bounce_buffers_map_buffer(buffers, &info, prot, skip_cpu_sync,
+					  offset, size)) {
 		io_buffer_manager_release_buffer(&buffers->manager,
 						 buffers->domain, info.iova,
 						 false, NULL, NULL);
@@ -381,7 +438,8 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 					    info.bounce_buffer, dir, prot,
 					    false);
 
-	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
+	if (!io_bounce_buffers_map_buffer(buffers, &info, prot, skip_cpu_sync,
+					  0, size)) {
 		io_buffer_manager_release_buffer(&buffers->manager,
 						 buffers->domain, info.iova,
 						 false, NULL, NULL);
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
index 79b9759da928..587584fdf26b 100644
--- a/drivers/iommu/io-buffer-manager.c
+++ b/drivers/iommu/io-buffer-manager.c
@@ -37,13 +37,10 @@ static struct page **io_buffer_manager_alloc_pages(int count, unsigned int nid)
 	// pages first to make accessing the buffer cheaper.
 	for (i = 0; i < count; i++) {
 		pages[i] = alloc_pages_node(
-			nid,
-			GFP_ATOMIC | __GFP_ZERO | __GFP_NORETRY | __GFP_NOWARN,
-			0);
+			nid, GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN, 0);
 		if (!pages[i]) {
 			pages[i] = alloc_pages_node(
-				nid, GFP_ATOMIC | __GFP_ZERO | __GFP_HIGHMEM,
-				0);
+				nid, GFP_ATOMIC | __GFP_HIGHMEM, 0);
 			if (!pages[i]) {
 				io_buffer_manager_free_pages(pages, i);
 				return NULL;
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 6/9] dma-iommu: add bounce buffer pools
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (4 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 5/9] dma-iommu: clear only necessary bytes David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 10:34 ` [PATCH v2 7/9] dma-iommu: support iommu bounce buffer optimization David Stevens
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Add per-domain pools for IOMMU mapped bounce buffers. Each domain has 8
buffer pools, which hold buffers of size 2^n pages. Buffers are
allocated on demand, and unused buffers are periodically released from
the cache. Single use buffers are still used for mappings that are too
large to use any pool, or if there are too many simultaneously in-use
streaming mappings.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/dma-iommu.c         |  24 +-
 drivers/iommu/io-bounce-buffers.c |  42 ++-
 drivers/iommu/io-bounce-buffers.h |   3 +
 drivers/iommu/io-buffer-manager.c | 417 +++++++++++++++++++++++++++++-
 drivers/iommu/io-buffer-manager.h |  49 +++-
 include/linux/dma-iommu.h         |   2 +
 6 files changed, 520 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 908eb6fb7dc3..42f85b7a90f0 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -378,6 +378,11 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 			return -EFAULT;
 		}
 
+		if (cookie->bounce_buffers &&
+		    !io_bounce_buffer_reinit_check(cookie->bounce_buffers,
+						   dev, base, limit))
+			return -EFAULT;
+
 		return 0;
 	}
 
@@ -436,12 +441,24 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
 	}
 }
 
+u64 __iommu_dma_limit(struct iommu_domain *domain, struct device *dev, u64 mask)
+{
+	u64 dma_limit = mask;
+
+	dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
+	if (domain->geometry.force_aperture)
+		dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
+
+	return dma_limit;
+}
+
 dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
-		size_t size, u64 dma_limit, struct device *dev)
+		size_t size, u64 mask, struct device *dev)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
 	unsigned long shift, iova_len, iova = 0;
+	u64 dma_limit = __iommu_dma_limit(domain, dev, mask);
 
 	if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
 		cookie->msi_iova += size;
@@ -459,11 +476,6 @@ dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
 	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
 		iova_len = roundup_pow_of_two(iova_len);
 
-	dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
-
-	if (domain->geometry.force_aperture)
-		dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
-
 	/* Try to get PCI devices a SAC address */
 	if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
 		iova = alloc_iova_fast(iovad, iova_len,
diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
index ed05f593a195..8af8e1546d5f 100644
--- a/drivers/iommu/io-bounce-buffers.c
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -9,11 +9,17 @@
 #include <linux/dma-map-ops.h>
 #include <linux/highmem.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <linux/slab.h>
 
 #include "io-buffer-manager.h"
 #include "io-bounce-buffers.h"
 
+// Specifies the number of slots in each buffer pool. The total amount of
+// preallocated IOVA range per 1024 slots is slightly under 1GB.
+static unsigned int buffer_pool_size = 1024;
+module_param(buffer_pool_size, uint, 0);
+
 struct io_bounce_buffers {
 	struct iommu_domain *domain;
 	struct iova_domain *iovad;
@@ -40,7 +46,8 @@ struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
 	if (!buffers)
 		return ERR_PTR(-ENOMEM);
 
-	ret = io_buffer_manager_init(&buffers->manager);
+	ret = io_buffer_manager_init(&buffers->manager, dev, iovad,
+				     buffer_pool_size);
 	if (ret) {
 		kfree(buffers);
 		return ERR_PTR(ret);
@@ -53,8 +60,26 @@ struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
 	return buffers;
 }
 
+bool io_bounce_buffer_reinit_check(struct io_bounce_buffers *buffers,
+				   struct device *dev, dma_addr_t base,
+				   dma_addr_t limit)
+{
+	if (!io_buffer_manager_reinit_check(&buffers->manager, dev,
+					    buffers->iovad, base, limit)) {
+		pr_warn("io-buffer-buffers out of range of %s\n",
+			dev_name(dev));
+		return false;
+	}
+
+	if (buffers->nid != dev_to_node(dev))
+		pr_info("node mismatch: buffers=%d dev=%d\n", buffers->nid,
+			dev_to_node(dev));
+	return true;
+}
+
 void io_bounce_buffers_destroy(struct io_bounce_buffers *buffers)
 {
+	io_buffer_manager_destroy(&buffers->manager, buffers->domain);
 	kfree(buffers);
 }
 
@@ -377,7 +402,7 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 				enum dma_data_direction dir,
 				unsigned long attrs, dma_addr_t *handle)
 {
-	bool skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
+	bool new_buffer, skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
 	struct io_bounce_buffer_info info;
 	bool force_bounce = iova_offset(buffers->iovad, offset | size);
 
@@ -387,14 +412,15 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 	*handle = DMA_MAPPING_ERROR;
 	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, page,
 					    offset + size, prot, buffers->nid,
-					    &info))
+					    &info, &new_buffer))
 		return true;
 
 	if (!skip_cpu_sync)
 		io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
 					  page, offset, size, dir, prot, false);
 
-	if (!io_bounce_buffers_map_buffer(buffers, &info, prot, skip_cpu_sync,
+	if (new_buffer &&
+	    !io_bounce_buffers_map_buffer(buffers, &info, prot, skip_cpu_sync,
 					  offset, size)) {
 		io_buffer_manager_release_buffer(&buffers->manager,
 						 buffers->domain, info.iova,
@@ -414,7 +440,7 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 	struct io_bounce_buffer_info info;
 	struct scatterlist *iter;
 	size_t size = 0;
-	bool skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
+	bool new_buffer, skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
 	dma_addr_t seg_iova;
 	int i;
 	bool force_bounce = false;
@@ -430,7 +456,8 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 
 	*out_nents = 0;
 	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, sgl, size,
-					    prot, buffers->nid, &info))
+					    prot, buffers->nid, &info,
+					    &new_buffer))
 		return true;
 
 	if (!skip_cpu_sync)
@@ -438,7 +465,8 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 					    info.bounce_buffer, dir, prot,
 					    false);
 
-	if (!io_bounce_buffers_map_buffer(buffers, &info, prot, skip_cpu_sync,
+	if (new_buffer &&
+	    !io_bounce_buffers_map_buffer(buffers, &info, prot, skip_cpu_sync,
 					  0, size)) {
 		io_buffer_manager_release_buffer(&buffers->manager,
 						 buffers->domain, info.iova,
diff --git a/drivers/iommu/io-bounce-buffers.h b/drivers/iommu/io-bounce-buffers.h
index 6d132a27646c..cf329a9704fa 100644
--- a/drivers/iommu/io-bounce-buffers.h
+++ b/drivers/iommu/io-bounce-buffers.h
@@ -14,6 +14,9 @@ struct io_bounce_buffers;
 struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
 						 struct iommu_domain *domain,
 						 struct iova_domain *iovad);
+bool io_bounce_buffer_reinit_check(struct io_bounce_buffers *buffers,
+				   struct device *dev, dma_addr_t base,
+				   dma_addr_t limit);
 void io_bounce_buffers_destroy(struct io_bounce_buffers *buffers);
 
 bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
index 587584fdf26b..1c69df08603c 100644
--- a/drivers/iommu/io-buffer-manager.c
+++ b/drivers/iommu/io-buffer-manager.c
@@ -3,6 +3,14 @@
  * Manager which allocates and tracks bounce buffers and their IOVAs. Does
  * not actually manage the IOMMU mapping nor do the bounce copies.
  *
+ * The manager caches recently used bounce buffers. The cache is initialized
+ * with a fixed number of slots, which allows for cache operations to be
+ * performed efficiently. Slots are assigned pre-allocated IOVAs. The number
+ * of slots is configurable, but is limited to 1/2 of the total IOVA range.
+ *
+ * If the cache fills up, or for very large allocations, the manager falls
+ * back to single-use bounce buffers.
+ *
  * Copyright (C) 2021 Google, Inc.
  */
 
@@ -10,6 +18,8 @@
 
 #include <linux/slab.h>
 
+#define EVICT_PERIOD_MSEC 5000
+
 struct io_buffer_node {
 	struct rb_node node;
 	struct io_bounce_buffer_info info;
@@ -51,6 +61,115 @@ static struct page **io_buffer_manager_alloc_pages(int count, unsigned int nid)
 	return pages;
 }
 
+static dma_addr_t io_buffer_slot_to_iova(struct io_buffer_slot *slot,
+					 struct io_buffer_pool *pool)
+{
+	return pool->iova_base + pool->buffer_size * (slot - pool->slots);
+}
+
+static struct io_buffer_slot **
+io_buffer_pool_get_cache(struct io_buffer_pool *pool, int prot)
+{
+	prot &= (IOMMU_READ | IOMMU_WRITE);
+	if (prot == IOMMU_READ)
+		return &pool->cached_slots[IO_BUFFER_SLOT_TYPE_RO];
+	else if (prot == IOMMU_WRITE)
+		return &pool->cached_slots[IO_BUFFER_SLOT_TYPE_WO];
+	BUG_ON(prot == 0);
+	return &pool->cached_slots[IO_BUFFER_SLOT_TYPE_RW];
+}
+
+/**
+ * io_buffer_manager_relese_slots - release unused buffer slots
+ * @to_free: head of list of slots to free
+ * @head: outparam of head of list of slots that were freed
+ * @tail_link: outparam for next ptr of tail of list of freed slots
+ *
+ * Frees slots that are evicted from cache. May leak slots if an
+ * error occurs while freeing slot resources.
+ */
+static void io_buffer_manager_relese_slots(struct io_buffer_manager *manager,
+					   struct io_buffer_pool *pool,
+					   struct io_buffer_slot *to_free,
+					   struct io_buffer_slot **head,
+					   struct io_buffer_slot ***tail_link)
+{
+	struct io_buffer_slot *tmp, **prev_link;
+
+	*head = to_free;
+	prev_link = head;
+
+	while ((tmp = *prev_link)) {
+		dma_addr_t iova = io_buffer_slot_to_iova(tmp, pool);
+
+		if (io_bounce_buffers_release_buffer_cb(manager, iova,
+							pool->buffer_size)) {
+			io_buffer_manager_free_pages(tmp->bounce_buffer,
+						     pool->buffer_size >>
+							     PAGE_SHIFT);
+		} else {
+			// If freeing fails, the iova is in an unknown state.
+			// Remove it from the list of slots being freed.
+			pr_warn("Bounce buffer release failed; leaking slot\n");
+			*prev_link = tmp->next;
+		}
+		prev_link = &tmp->next;
+	}
+
+	*tail_link = prev_link;
+}
+
+static void __io_buffer_manager_evict(struct io_buffer_manager *manager,
+				      bool pool_teardown)
+{
+	struct io_buffer_pool *pool;
+	struct io_buffer_slot **prev_link, *to_free;
+	unsigned long flags;
+	int i, j;
+	bool requeue = false;
+
+	for (i = 0; i < NUM_POOLS; i++) {
+		pool = &manager->pools[i];
+
+		spin_lock_irqsave(&pool->lock, flags);
+		for (j = 0; j < IO_BUFFER_SLOT_TYPE_COUNT; j++) {
+			prev_link = &pool->cached_slots[j];
+
+			if (pool_teardown) {
+				to_free = *prev_link;
+			} else {
+				while ((to_free = *prev_link)) {
+					if (to_free->old_cache_entry) {
+						*prev_link = NULL;
+						break;
+					}
+					requeue = true;
+					to_free->old_cache_entry = true;
+					prev_link = &to_free->next;
+				}
+			}
+			if (!to_free)
+				continue;
+
+			spin_unlock_irqrestore(&pool->lock, flags);
+
+			io_buffer_manager_relese_slots(manager, pool, to_free,
+						       &to_free, &prev_link);
+
+			spin_lock_irqsave(&pool->lock, flags);
+			if (to_free) {
+				*prev_link = pool->empty_slots;
+				pool->empty_slots = to_free;
+			}
+		}
+		spin_unlock_irqrestore(&pool->lock, flags);
+	}
+
+	if (requeue)
+		queue_delayed_work(manager->evict_wq, &manager->evict_work,
+				   msecs_to_jiffies(EVICT_PERIOD_MSEC));
+}
+
 struct io_buffer_node *find_fallback_node(struct rb_root *root, dma_addr_t iova)
 {
 	struct rb_node *node = root->rb_node;
@@ -97,15 +216,126 @@ bool insert_fallback_node(struct rb_root *root, struct io_buffer_node *node)
 	return true;
 }
 
+static void io_buffer_manager_evict(struct work_struct *work)
+{
+	struct io_buffer_manager *manager = container_of(
+		to_delayed_work(work), struct io_buffer_manager, evict_work);
+	__io_buffer_manager_evict(manager, false);
+}
+
+static void fill_buffer_info(struct io_buffer_slot *slot,
+			     struct io_buffer_pool *pool,
+			     struct io_bounce_buffer_info *info)
+{
+	info->bounce_buffer = slot->bounce_buffer;
+	info->iova = io_buffer_slot_to_iova(slot, pool);
+	info->size = pool->buffer_size;
+}
+
+static bool io_buffer_pool_has_empty_slot(struct io_buffer_pool *pool,
+					  int num_slots)
+{
+	if (pool->empty_slots)
+		return true;
+
+	if (!pool->slots) {
+		pool->slots = kmalloc_array(num_slots, sizeof(*pool->slots),
+					    GFP_ATOMIC);
+		if (!pool->slots)
+			return false;
+	}
+
+	if (pool->untouched_slot_idx < num_slots) {
+		struct io_buffer_slot *slot =
+			&pool->slots[pool->untouched_slot_idx++];
+		memset(slot, 0, sizeof(*slot));
+		pool->empty_slots = slot;
+	}
+
+	return !!pool->empty_slots;
+}
+
+static bool io_buffer_manager_alloc_slot(struct io_buffer_manager *manager,
+					 void *orig_buffer, size_t size,
+					 int prot, unsigned int nid,
+					 struct io_bounce_buffer_info *info,
+					 bool *new_buffer)
+{
+	struct io_buffer_slot *slot = NULL, **prev_link, *cur;
+	struct io_buffer_pool *pool = NULL;
+	unsigned long flags;
+	dma_addr_t iova;
+	int pool_idx;
+
+	if (!manager->num_slots)
+		return false;
+
+	// Compute the power-of-2 size buffer needed, and then the pool idx.
+	pool_idx = roundup_pow_of_two(ALIGN(size, PAGE_SIZE));
+	pool_idx = fls(pool_idx >> PAGE_SHIFT) - 1;
+	if (pool_idx >= NUM_POOLS)
+		return false;
+	pool = manager->pools + pool_idx;
+
+	spin_lock_irqsave(&pool->lock, flags);
+
+	prev_link = io_buffer_pool_get_cache(pool, prot);
+	while ((cur = *prev_link)) {
+		if (cur->prot == prot) {
+			slot = cur;
+			*prev_link = cur->next;
+			break;
+		}
+		prev_link = &cur->next;
+	}
+
+	*new_buffer = slot == NULL;
+	if (*new_buffer) {
+		if (!io_buffer_pool_has_empty_slot(pool, manager->num_slots)) {
+			spin_unlock_irqrestore(&pool->lock, flags);
+			return false;
+		}
+
+		slot = pool->empty_slots;
+		pool->empty_slots = slot->next;
+		spin_unlock_irqrestore(&pool->lock, flags);
+
+		iova = io_buffer_slot_to_iova(slot, pool);
+
+		slot->bounce_buffer = io_buffer_manager_alloc_pages(
+			pool->buffer_size >> PAGE_SHIFT, nid);
+		if (!slot->bounce_buffer) {
+			spin_lock_irqsave(&pool->lock, flags);
+			slot->next = pool->empty_slots;
+			pool->empty_slots = slot;
+			spin_unlock_irqrestore(&pool->lock, flags);
+			return false;
+		}
+	} else {
+		spin_unlock_irqrestore(&pool->lock, flags);
+	}
+
+	slot->orig_buffer = orig_buffer;
+	slot->prot = prot;
+
+	fill_buffer_info(slot, pool, info);
+	return true;
+}
+
 bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 				    struct device *dev, void *orig_buffer,
 				    size_t size, int prot, unsigned int nid,
-				    struct io_bounce_buffer_info *info)
+				    struct io_bounce_buffer_info *info,
+				    bool *new_buffer)
 {
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	struct io_buffer_node *node;
 	unsigned long flags;
 
+	if (io_buffer_manager_alloc_slot(manager, orig_buffer, size, prot,
+					 nid, info, new_buffer))
+		return true;
+
 	node = kzalloc(sizeof(*node), GFP_ATOMIC);
 	if (!node)
 		return false;
@@ -131,6 +361,7 @@ bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 	node->info.size = size;
 
 	*info = node->info;
+	*new_buffer = true;
 
 	return true;
 
@@ -143,14 +374,49 @@ bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 	return false;
 }
 
+static bool __io_buffer_manager_find_slot(struct io_buffer_manager *manager,
+					  dma_addr_t handle,
+					  struct io_buffer_pool **pool,
+					  struct io_buffer_slot **slot)
+{
+	size_t i;
+	dma_addr_t iova_end = manager->iova + manager->iova_size;
+
+	if (!manager->num_slots || handle < manager->iova || handle >= iova_end)
+		return false;
+
+	// Pools are ordered from largest to smallest, and each pool is twice
+	// as large as the next pool. Find how far from the end of the overall
+	// allocation the handle is in terms of the size of the iova range
+	// assigned to the smallest pool (1-indexed), and then compute the idx.
+	i = ALIGN(iova_end - handle, PAGE_SIZE) >> PAGE_SHIFT;
+	i = DIV_ROUND_UP(i, manager->num_slots);
+	i = fls(i) - 1;
+
+	*pool = manager->pools + i;
+	*slot = (*pool)->slots +
+		(handle - (*pool)->iova_base) / (*pool)->buffer_size;
+
+	return true;
+}
+
 bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
 				   dma_addr_t handle,
 				   struct io_bounce_buffer_info *info,
 				   void **orig_buffer, int *prot)
 {
+	struct io_buffer_pool *pool;
+	struct io_buffer_slot *slot;
 	struct io_buffer_node *node;
 	unsigned long flags;
 
+	if (__io_buffer_manager_find_slot(manager, handle, &pool, &slot)) {
+		fill_buffer_info(slot, pool, info);
+		*orig_buffer = slot->orig_buffer;
+		*prot = slot->prot;
+		return true;
+	}
+
 	spin_lock_irqsave(&manager->fallback_lock, flags);
 	node = find_fallback_node(&manager->fallback_buffers, handle);
 	spin_unlock_irqrestore(&manager->fallback_lock, flags);
@@ -169,10 +435,45 @@ bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 				      dma_addr_t handle, bool inited,
 				      prerelease_cb cb, void *ctx)
 {
+	struct io_buffer_slot *slot, **cache;
+	struct io_buffer_pool *pool;
 	struct io_buffer_node *node;
 	unsigned long flags;
 	bool free_buffer;
 
+	if (__io_buffer_manager_find_slot(manager, handle, &pool, &slot)) {
+		if (cb) {
+			struct io_bounce_buffer_info info;
+
+			fill_buffer_info(slot, pool, &info);
+			cb(&info, slot->prot, slot->orig_buffer, ctx);
+		}
+
+		spin_lock_irqsave(&pool->lock, flags);
+
+		if (likely(inited)) {
+			cache = io_buffer_pool_get_cache(pool, slot->prot);
+			if (*cache == NULL)
+				queue_delayed_work(
+					manager->evict_wq, &manager->evict_work,
+					msecs_to_jiffies(EVICT_PERIOD_MSEC));
+
+			slot->orig_buffer = NULL;
+			slot->next = *cache;
+			*cache = slot;
+			slot->old_cache_entry = false;
+		} else {
+			io_buffer_manager_free_pages(slot->bounce_buffer,
+						     pool->buffer_size >>
+							     PAGE_SHIFT);
+			slot->next = pool->empty_slots;
+			pool->empty_slots = slot;
+		}
+
+		spin_unlock_irqrestore(&pool->lock, flags);
+		return true;
+	}
+
 	spin_lock_irqsave(&manager->fallback_lock, flags);
 	node = find_fallback_node(&manager->fallback_buffers, handle);
 	if (node)
@@ -204,10 +505,122 @@ bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 	return true;
 }
 
-int io_buffer_manager_init(struct io_buffer_manager *manager)
+void io_buffer_manager_destroy(struct io_buffer_manager *manager,
+			       struct iommu_domain *domain)
+{
+	int i;
+
+	if (!manager->num_slots)
+		return;
+
+	cancel_delayed_work_sync(&manager->evict_work);
+	destroy_workqueue(manager->evict_wq);
+	__io_buffer_manager_evict(manager, true);
+	__iommu_dma_free_iova(domain->iova_cookie, manager->iova,
+			      manager->iova_size, NULL);
+
+	for (i = 0; i < NUM_POOLS; i++)
+		kfree(manager->pools[i].slots);
+}
+
+bool io_buffer_manager_reinit_check(struct io_buffer_manager *manager,
+				    struct device *dev,
+				    struct iova_domain *iovad, dma_addr_t base,
+				    dma_addr_t limit)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	u64 dma_limit = __iommu_dma_limit(domain, dev, dma_get_mask(dev));
+	dma_addr_t start_iova = iovad->start_pfn << iovad->granule;
+
+	if (!manager->num_slots)
+		return true;
+
+	if (base > manager->iova ||
+	    limit < manager->iova + manager->iova_size) {
+		pr_warn("Bounce buffer pool out of range\n");
+		return false;
+	}
+
+	if (~dma_limit & (manager->iova + manager->iova_size - 1)) {
+		pr_warn("Bounce buffer pool larger than dma limit\n");
+		return false;
+	}
+
+	if (manager->iova_size > (dma_limit - start_iova) / 2)
+		pr_info("Bounce buffer pool using >1/2 of iova range\n");
+
+	return true;
+}
+
+int io_buffer_manager_init(struct io_buffer_manager *manager,
+			   struct device *dev, struct iova_domain *iovad,
+			   unsigned int num_slots)
 {
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	int i;
+	unsigned int old_num_slots = num_slots;
+	size_t reserved_iova_pages, pages_per_slot, max_reserved_iova_pages;
+	dma_addr_t iova_base;
+	u64 dma_limit, start_iova;
+
 	manager->fallback_buffers = RB_ROOT;
 	spin_lock_init(&manager->fallback_lock);
 
+	if (num_slots == 0)
+		return 0;
+
+	INIT_DELAYED_WORK(&manager->evict_work, io_buffer_manager_evict);
+	manager->evict_wq = create_singlethread_workqueue("io-buffer-buffers");
+	if (!manager->evict_wq)
+		return -ENOMEM;
+
+	// Make sure there are iovas left over for non-pooled buffers. The iova
+	// allocation can be quite large, so also handle allocation falures due
+	// to reserved iova regions.
+	dma_limit = __iommu_dma_limit(domain, dev, dma_get_mask(dev));
+	start_iova = iovad->start_pfn << iovad->granule;
+	max_reserved_iova_pages = ((dma_limit - start_iova) / 2) >> PAGE_SHIFT;
+	pages_per_slot = (1 << NUM_POOLS) - 1;
+	do {
+		reserved_iova_pages = pages_per_slot * num_slots;
+		if (reserved_iova_pages > max_reserved_iova_pages) {
+			num_slots = max_reserved_iova_pages / pages_per_slot;
+			reserved_iova_pages = pages_per_slot * num_slots;
+		}
+
+		manager->iova_size = reserved_iova_pages << PAGE_SHIFT;
+		manager->iova = __iommu_dma_alloc_iova(
+			domain, manager->iova_size, dma_get_mask(dev), dev);
+		max_reserved_iova_pages /= 2;
+	} while (!manager->iova && max_reserved_iova_pages >= pages_per_slot);
+
+	if (!manager->iova) {
+		destroy_workqueue(manager->evict_wq);
+		return -ENOSPC;
+	} else if (num_slots < old_num_slots) {
+		pr_info("Insufficient space for %u slots, limited to %u\n",
+			old_num_slots, num_slots);
+	}
+	manager->num_slots = num_slots;
+
+	// To ensure that no slot has a segment which crosses a segment
+	// boundary, align each slot's iova to the slot's size.
+	// __iommu_dma_alloc_iova aligns to roundup_power_of_two(size), which
+	// is larger than the largest buffer size. Assigning iova_base from
+	// largest to smallest ensures each iova_base is aligned to the
+	// previous pool's larger size.
+	iova_base = manager->iova;
+	for (i = NUM_POOLS - 1; i >= 0; i--) {
+		struct io_buffer_pool *pool = manager->pools + i;
+
+		spin_lock_init(&pool->lock);
+		pool->empty_slots = NULL;
+		pool->untouched_slot_idx = 0;
+		pool->buffer_size = PAGE_SIZE << i;
+		pool->iova_base = iova_base;
+
+		iova_base += num_slots * pool->buffer_size;
+	}
+
 	return 0;
 }
diff --git a/drivers/iommu/io-buffer-manager.h b/drivers/iommu/io-buffer-manager.h
index 0e75d89926ca..2aa3b9afcb3d 100644
--- a/drivers/iommu/io-buffer-manager.h
+++ b/drivers/iommu/io-buffer-manager.h
@@ -9,10 +9,44 @@
 #include <linux/dma-iommu.h>
 #include <linux/iova.h>
 #include <linux/spinlock.h>
+#include <linux/workqueue.h>
+
+struct io_buffer_slot {
+	void *orig_buffer;
+	struct page **bounce_buffer;
+	struct io_buffer_slot *next;
+	int prot;
+	bool old_cache_entry;
+};
+
+enum io_buffer_slot_type {
+	IO_BUFFER_SLOT_TYPE_RO = 0,
+	IO_BUFFER_SLOT_TYPE_WO,
+	IO_BUFFER_SLOT_TYPE_RW,
+	IO_BUFFER_SLOT_TYPE_COUNT,
+};
+
+struct io_buffer_pool {
+	struct io_buffer_slot *cached_slots[IO_BUFFER_SLOT_TYPE_COUNT];
+	struct io_buffer_slot *empty_slots;
+	unsigned int untouched_slot_idx;
+	spinlock_t lock;
+	dma_addr_t iova_base;
+	size_t buffer_size;
+	struct io_buffer_slot *slots;
+};
+
+#define NUM_POOLS 8
 
 struct io_buffer_manager {
+	struct workqueue_struct *evict_wq;
+	struct delayed_work evict_work;
+	unsigned int num_slots;
 	spinlock_t fallback_lock;
 	struct rb_root fallback_buffers;
+	struct io_buffer_pool pools[NUM_POOLS];
+	dma_addr_t iova;
+	size_t iova_size;
 };
 
 struct io_bounce_buffer_info {
@@ -24,7 +58,8 @@ struct io_bounce_buffer_info {
 bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 				    struct device *dev, void *orig_buffer,
 				    size_t size, int prot, unsigned int nid,
-				    struct io_bounce_buffer_info *info);
+				    struct io_bounce_buffer_info *info,
+				    bool *new_buffer);
 
 bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
 				   dma_addr_t handle,
@@ -39,7 +74,17 @@ bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 				      dma_addr_t handle, bool inited,
 				      prerelease_cb cb, void *ctx);
 
-int io_buffer_manager_init(struct io_buffer_manager *manager);
+int io_buffer_manager_init(struct io_buffer_manager *manager,
+			   struct device *dev, struct iova_domain *iovad,
+			   unsigned int num_slots);
+
+bool io_buffer_manager_reinit_check(struct io_buffer_manager *manager,
+				    struct device *dev,
+				    struct iova_domain *iovad, dma_addr_t base,
+				    dma_addr_t limit);
+
+void io_buffer_manager_destroy(struct io_buffer_manager *manager,
+			       struct iommu_domain *domain);
 
 bool io_bounce_buffers_release_buffer_cb(struct io_buffer_manager *manager,
 					 dma_addr_t iova, size_t size);
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 50f676678318..944fd491d94f 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -49,6 +49,8 @@ dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
 				  struct device *dev);
 void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 		dma_addr_t iova, size_t size, struct page *freelist);
+u64 __iommu_dma_limit(struct iommu_domain *domain,
+		      struct device *dev, u64 mask);
 
 #else /* CONFIG_IOMMU_DMA */
 
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 7/9] dma-iommu: support iommu bounce buffer optimization
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (5 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 6/9] dma-iommu: add bounce buffer pools David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 10:34 ` [PATCH v2 8/9] dma-mapping: add persistent streaming mapping flag David Stevens
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Add config that uses IOMMU bounce buffer pools to avoid IOMMU
interactions as much as possible for relatively small streaming DMA
operations. This can lead to significant performance improvements on
systems where IOMMU map/unmap operations are very slow, such as when
running virtualized.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/Kconfig             | 11 +++++
 drivers/iommu/dma-iommu.c         |  5 ++-
 drivers/iommu/io-bounce-buffers.c | 70 +++++++++++++++++++++----------
 drivers/iommu/io-buffer-manager.c | 17 +++++---
 drivers/iommu/io-buffer-manager.h |  8 ++--
 include/linux/dma-iommu.h         |  2 +
 6 files changed, 81 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 07b7c25cbed8..e573b5c276dc 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -422,4 +422,15 @@ config SPRD_IOMMU
 
 	  Say Y here if you want to use the multimedia devices listed above.
 
+config IOMMU_BOUNCE_BUFFERS
+	bool "Use IOMMU bounce buffers"
+	depends on IOMMU_DMA
+	default n
+	help
+	  Use bounce buffers for small, streaming DMA operations. This may
+	  have performance benefits on systems where establishing IOMMU mappings
+	  is particularly expensive, such as when running as a guest.
+
+	  If unsure, say N here.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 42f85b7a90f0..965bc0a2f140 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -324,7 +324,7 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
 	domain->ops->flush_iotlb_all(domain);
 }
 
-static bool dev_is_untrusted(struct device *dev)
+bool dev_is_untrusted(struct device *dev)
 {
 	return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
 }
@@ -402,7 +402,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 
 	ret = iova_reserve_iommu_regions(dev, domain);
 
-	if (ret == 0 && dev_is_untrusted(dev)) {
+	if (ret == 0 && (dev_is_untrusted(dev) ||
+			 IS_ENABLED(CONFIG_IOMMU_BOUNCE_BUFFERS))) {
 		cookie->bounce_buffers =
 			io_bounce_buffers_init(dev, domain, iovad);
 		if (IS_ERR(cookie->bounce_buffers))
diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
index 8af8e1546d5f..af8c2a51eeed 100644
--- a/drivers/iommu/io-bounce-buffers.c
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -20,10 +20,20 @@
 static unsigned int buffer_pool_size = 1024;
 module_param(buffer_pool_size, uint, 0);
 
+#ifdef CONFIG_IOMMU_BOUNCE_BUFFERS
+// All buffers at most this size will always use bounce buffers if there
+// are slots of the appropriate size available.
+static unsigned int always_bounce_limit = PAGE_SIZE;
+module_param(always_bounce_limit, uint, 0644);
+#else
+static const unsigned int always_bounce_limit;
+#endif
+
 struct io_bounce_buffers {
 	struct iommu_domain *domain;
 	struct iova_domain *iovad;
 	unsigned int nid;
+	bool untrusted;
 	struct io_buffer_manager manager;
 };
 
@@ -56,6 +66,7 @@ struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
 	buffers->domain = domain;
 	buffers->iovad = iovad;
 	buffers->nid = dev_to_node(dev);
+	buffers->untrusted = dev_is_untrusted(dev);
 
 	return buffers;
 }
@@ -201,7 +212,8 @@ bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
 	void *orig_buffer;
 	int prot;
 
-	if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle, &info,
+	if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle,
+					   buffers->untrusted, &info,
 					   &orig_buffer, &prot))
 		return false;
 
@@ -237,9 +249,9 @@ bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
 	void *orig_buffer;
 	int prot;
 
-	if (!io_buffer_manager_find_buffer(&buffers->manager,
-					   sg_dma_address(sgl), &info,
-					   &orig_buffer, &prot))
+	if (!io_buffer_manager_find_buffer(
+		    &buffers->manager, sg_dma_address(sgl), buffers->untrusted,
+		    &info, &orig_buffer, &prot))
 		return false;
 
 	// In the non bounce buffer case, iommu_dma_map_sg syncs before setting
@@ -291,7 +303,7 @@ bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
 
 	return io_buffer_manager_release_buffer(
 		&buffers->manager, buffers->domain, handle, true,
-		io_bounce_buffers_unmap_page_sync, &args);
+		buffers->untrusted, io_bounce_buffers_unmap_page_sync, &args);
 }
 
 static void io_bounce_buffers_unmap_sg_sync(struct io_bounce_buffer_info *info,
@@ -318,7 +330,7 @@ bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
 
 	return io_buffer_manager_release_buffer(
 		&buffers->manager, buffers->domain, sg_dma_address(sgl), true,
-		io_bounce_buffers_unmap_sg_sync, &args);
+		buffers->untrusted, io_bounce_buffers_unmap_sg_sync, &args);
 }
 
 static void io_bounce_buffers_clear_padding(struct io_bounce_buffer_info *info,
@@ -370,7 +382,8 @@ static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
 	struct sg_table sgt;
 	size_t mapped;
 
-	if (offset || offset + orig_size < info->size || skiped_sync) {
+	if (buffers->untrusted &&
+	    (offset || offset + orig_size < info->size || skiped_sync)) {
 		// Ensure that nothing is leaked to untrusted devices when
 		// mapping the buffer by clearing any part of the bounce buffer
 		// that wasn't already cleared by syncing.
@@ -396,6 +409,15 @@ static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
 	return mapped >= info->size;
 }
 
+static bool use_bounce_buffer(bool force_bounce, size_t size)
+{
+	if (IS_ENABLED(CONFIG_IOMMU_BOUNCE_BUFFERS) &&
+	    size <= always_bounce_limit)
+		return true;
+
+	return force_bounce;
+}
+
 bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 				struct device *dev, struct page *page,
 				unsigned long offset, size_t size, int prot,
@@ -404,16 +426,17 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 {
 	bool new_buffer, skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
 	struct io_bounce_buffer_info info;
-	bool force_bounce = iova_offset(buffers->iovad, offset | size);
+	bool force_bounce = buffers->untrusted &&
+			    iova_offset(buffers->iovad, offset | size);
 
-	if (!force_bounce)
+	if (!use_bounce_buffer(force_bounce, size))
 		return false;
 
 	*handle = DMA_MAPPING_ERROR;
 	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, page,
-					    offset + size, prot, buffers->nid,
-					    &info, &new_buffer))
-		return true;
+					    offset + size, prot, force_bounce,
+					    buffers->nid, &info, &new_buffer))
+		return force_bounce;
 
 	if (!skip_cpu_sync)
 		io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
@@ -424,8 +447,9 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 					  offset, size)) {
 		io_buffer_manager_release_buffer(&buffers->manager,
 						 buffers->domain, info.iova,
-						 false, NULL, NULL);
-		return true;
+						 false, force_bounce, NULL,
+						 NULL);
+		return force_bounce;
 	}
 
 	*handle = info.iova + offset;
@@ -447,18 +471,19 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 
 	for_each_sg(sgl, iter, nents, i) {
 		size += iter->length;
-		force_bounce |= iova_offset(buffers->iovad,
-					    iter->offset | iter->length);
+		if (buffers->untrusted)
+			force_bounce |= iova_offset(
+				buffers->iovad, iter->offset | iter->length);
 	}
 
-	if (!force_bounce)
+	if (!use_bounce_buffer(force_bounce, size))
 		return false;
 
 	*out_nents = 0;
 	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, sgl, size,
-					    prot, buffers->nid, &info,
-					    &new_buffer))
-		return true;
+					    prot, force_bounce, buffers->nid,
+					    &info, &new_buffer))
+		return force_bounce;
 
 	if (!skip_cpu_sync)
 		__io_bounce_buffers_sync_sg(buffers, sgl, nents,
@@ -470,8 +495,9 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 					  0, size)) {
 		io_buffer_manager_release_buffer(&buffers->manager,
 						 buffers->domain, info.iova,
-						 false, NULL, NULL);
-		return true;
+						 false, force_bounce, NULL,
+						 NULL);
+		return force_bounce;
 	}
 
 	i = 0;
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
index 1c69df08603c..0f7f003b53bb 100644
--- a/drivers/iommu/io-buffer-manager.c
+++ b/drivers/iommu/io-buffer-manager.c
@@ -324,7 +324,8 @@ static bool io_buffer_manager_alloc_slot(struct io_buffer_manager *manager,
 
 bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 				    struct device *dev, void *orig_buffer,
-				    size_t size, int prot, unsigned int nid,
+				    size_t size, int prot, bool require_bounce,
+				    unsigned int nid,
 				    struct io_bounce_buffer_info *info,
 				    bool *new_buffer)
 {
@@ -336,6 +337,9 @@ bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 					 nid, info, new_buffer))
 		return true;
 
+	if (!require_bounce)
+		return false;
+
 	node = kzalloc(sizeof(*node), GFP_ATOMIC);
 	if (!node)
 		return false;
@@ -401,7 +405,7 @@ static bool __io_buffer_manager_find_slot(struct io_buffer_manager *manager,
 }
 
 bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
-				   dma_addr_t handle,
+				   dma_addr_t handle, bool may_use_fallback,
 				   struct io_bounce_buffer_info *info,
 				   void **orig_buffer, int *prot)
 {
@@ -415,7 +419,8 @@ bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
 		*orig_buffer = slot->orig_buffer;
 		*prot = slot->prot;
 		return true;
-	}
+	} else if (!may_use_fallback)
+		return false;
 
 	spin_lock_irqsave(&manager->fallback_lock, flags);
 	node = find_fallback_node(&manager->fallback_buffers, handle);
@@ -433,7 +438,8 @@ bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
 bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 				      struct iommu_domain *domain,
 				      dma_addr_t handle, bool inited,
-				      prerelease_cb cb, void *ctx)
+				      bool may_use_fallback, prerelease_cb cb,
+				      void *ctx)
 {
 	struct io_buffer_slot *slot, **cache;
 	struct io_buffer_pool *pool;
@@ -472,7 +478,8 @@ bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 
 		spin_unlock_irqrestore(&pool->lock, flags);
 		return true;
-	}
+	} else if (!may_use_fallback)
+		return false;
 
 	spin_lock_irqsave(&manager->fallback_lock, flags);
 	node = find_fallback_node(&manager->fallback_buffers, handle);
diff --git a/drivers/iommu/io-buffer-manager.h b/drivers/iommu/io-buffer-manager.h
index 2aa3b9afcb3d..3d32f9366536 100644
--- a/drivers/iommu/io-buffer-manager.h
+++ b/drivers/iommu/io-buffer-manager.h
@@ -57,12 +57,13 @@ struct io_bounce_buffer_info {
 
 bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
 				    struct device *dev, void *orig_buffer,
-				    size_t size, int prot, unsigned int nid,
+				    size_t size, int prot, bool use_fallback,
+				    unsigned int nid,
 				    struct io_bounce_buffer_info *info,
 				    bool *new_buffer);
 
 bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
-				   dma_addr_t handle,
+				   dma_addr_t handle, bool may_use_fallback,
 				   struct io_bounce_buffer_info *info,
 				   void **orig_buffer, int *prot);
 
@@ -72,7 +73,8 @@ typedef void (*prerelease_cb)(struct io_bounce_buffer_info *info, int prot,
 bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
 				      struct iommu_domain *domain,
 				      dma_addr_t handle, bool inited,
-				      prerelease_cb cb, void *ctx);
+				      bool may_use_fallback, prerelease_cb cb,
+				      void *ctx);
 
 int io_buffer_manager_init(struct io_buffer_manager *manager,
 			   struct device *dev, struct iova_domain *iovad,
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 944fd491d94f..70bed650d5d1 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -52,6 +52,8 @@ void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 u64 __iommu_dma_limit(struct iommu_domain *domain,
 		      struct device *dev, u64 mask);
 
+bool dev_is_untrusted(struct device *dev);
+
 #else /* CONFIG_IOMMU_DMA */
 
 struct iommu_domain;
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 8/9] dma-mapping: add persistent streaming mapping flag
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (6 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 7/9] dma-iommu: support iommu bounce buffer optimization David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2021-08-06 10:34 ` [PATCH v2 9/9] drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag David Stevens
  2022-05-24 12:27 ` [PATCH v2 0/9] Add dynamic iommu backed bounce buffers Niklas Schnelle
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Add a DMA_ATTR_PERSISTENT_STREAMING flag which indicates that the
streaming mapping is long lived and that the caller will manage
coherency either through the dma_sync_* functions or via some other
use-case specific mechanism. This flag indicates to the platform that
it should optimize for more efficient syncing at the cost of more
expensive mapping and unmapping.

This flag is used to skip optional bounce buffers when
CONFIG_IOMMU_BOUNCE_BUFFERS is enabled. With respect to these bounce
buffers, in most cases the flag is an optimization. However, callers
which do not use the dma_sync_* calls to manage coherency must use this
flag to work properly when CONFIG_IOMMU_BOUNCE_BUFFERS is enabled.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/io-bounce-buffers.c | 14 +++++++++++---
 include/linux/dma-mapping.h       | 11 +++++++++++
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
index af8c2a51eeed..3a0071d5a9ea 100644
--- a/drivers/iommu/io-bounce-buffers.c
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -409,8 +409,16 @@ static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
 	return mapped >= info->size;
 }
 
-static bool use_bounce_buffer(bool force_bounce, size_t size)
+static bool use_bounce_buffer(struct device *dev, unsigned long attrs,
+			      bool force_bounce, size_t size)
 {
+	if (attrs & DMA_ATTR_PERSISTENT_STREAMING) {
+		WARN_ONCE(force_bounce,
+			  "Skipping bounce buffer for untrusted mapping %s\n",
+			  dev_name(dev));
+		return false;
+	}
+
 	if (IS_ENABLED(CONFIG_IOMMU_BOUNCE_BUFFERS) &&
 	    size <= always_bounce_limit)
 		return true;
@@ -429,7 +437,7 @@ bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
 	bool force_bounce = buffers->untrusted &&
 			    iova_offset(buffers->iovad, offset | size);
 
-	if (!use_bounce_buffer(force_bounce, size))
+	if (!use_bounce_buffer(dev, attrs, force_bounce, size))
 		return false;
 
 	*handle = DMA_MAPPING_ERROR;
@@ -476,7 +484,7 @@ bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
 				buffers->iovad, iter->offset | iter->length);
 	}
 
-	if (!use_bounce_buffer(force_bounce, size))
+	if (!use_bounce_buffer(dev, attrs, force_bounce, size))
 		return false;
 
 	*out_nents = 0;
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 183e7103a66d..5d318753bb79 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,17 @@
  */
 #define DMA_ATTR_PRIVILEGED		(1UL << 9)
 
+/*
+ * DMA_ATTR_PERSISTENT_STREAMING: Indicates that the streaming mapping is long
+ * lived, so syncing performance should be prioritized over mapping/unmapping
+ * performance. Platform code will establish a mapping which only requires CPU
+ * cache synchronization.
+ *
+ * Callers that create long lived mappings and directly handle CPU cache
+ * management without calling using dma_sync_* functions must set this flag.
+ */
+#define DMA_ATTR_PERSISTENT_STREAMING	(1UL << 10)
+
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
  * be given to a device to use as a DMA source or target.  It is specific to a
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 9/9] drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (7 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 8/9] dma-mapping: add persistent streaming mapping flag David Stevens
@ 2021-08-06 10:34 ` David Stevens
  2022-05-24 12:27 ` [PATCH v2 0/9] Add dynamic iommu backed bounce buffers Niklas Schnelle
  9 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-06 10:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Joerg Roedel, Will Deacon, Sergey Senozhatsky,
	Lu Baolu, iommu, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Use the new DMA_ATTR_PERSISTENT_STREAMING for long lived dma mappings
which directly handle CPU cache coherency instead of using dma_sync_*.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 4 +++-
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
index 616c3a2f1baf..df982cfb4f34 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
@@ -48,7 +48,9 @@ static struct sg_table *i915_gem_map_dma_buf(struct dma_buf_attachment *attachme
 		src = sg_next(src);
 	}
 
-	ret = dma_map_sgtable(attachment->dev, st, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	ret = dma_map_sgtable(attachment->dev, st, dir,
+			      DMA_ATTR_SKIP_CPU_SYNC |
+			      DMA_ATTR_PERSISTENT_STREAMING);
 	if (ret)
 		goto err_free_sg;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 36489be4896b..f27a849631f7 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -33,7 +33,8 @@ int i915_gem_gtt_prepare_pages(struct drm_i915_gem_object *obj,
 				     PCI_DMA_BIDIRECTIONAL,
 				     DMA_ATTR_SKIP_CPU_SYNC |
 				     DMA_ATTR_NO_KERNEL_MAPPING |
-				     DMA_ATTR_NO_WARN))
+				     DMA_ATTR_NO_WARN |
+				     DMA_ATTR_PERSISTENT_STREAMING))
 			return 0;
 
 		/*
-- 
2.32.0.605.g8dce9f2422-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices
  2021-08-06 10:34 ` [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices David Stevens
@ 2021-08-06 15:53   ` kernel test robot
  2021-08-10  1:19   ` Mi, Dapeng1
  1 sibling, 0 replies; 19+ messages in thread
From: kernel test robot @ 2021-08-06 15:53 UTC (permalink / raw)
  To: David Stevens, Robin Murphy
  Cc: kbuild-all, Christoph Hellwig, Joerg Roedel, Will Deacon,
	Sergey Senozhatsky, Lu Baolu, iommu, linux-kernel, David Stevens

[-- Attachment #1: Type: text/plain, Size: 3693 bytes --]

Hi David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on iommu/next]
[also build test WARNING on drm-intel/for-linux-next hch-configfs/for-next linus/master v5.14-rc4]
[cannot apply to next-20210805]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Stevens/Add-dynamic-iommu-backed-bounce-buffers/20210806-183631
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: ia64-randconfig-r025-20210804 (attached as .config)
compiler: ia64-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/c5f1f9fa88a7062c1ded50fa165f6b01ed73f161
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Stevens/Add-dynamic-iommu-backed-bounce-buffers/20210806-183631
        git checkout c5f1f9fa88a7062c1ded50fa165f6b01ed73f161
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross ARCH=ia64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/iommu/io-buffer-manager.c:57:24: warning: no previous prototype for 'find_fallback_node' [-Wmissing-prototypes]
      57 | struct io_buffer_node *find_fallback_node(struct rb_root *root, dma_addr_t iova)
         |                        ^~~~~~~~~~~~~~~~~~
>> drivers/iommu/io-buffer-manager.c:75:6: warning: no previous prototype for 'insert_fallback_node' [-Wmissing-prototypes]
      75 | bool insert_fallback_node(struct rb_root *root, struct io_buffer_node *node)
         |      ^~~~~~~~~~~~~~~~~~~~


vim +/find_fallback_node +57 drivers/iommu/io-buffer-manager.c

    56	
  > 57	struct io_buffer_node *find_fallback_node(struct rb_root *root, dma_addr_t iova)
    58	{
    59		struct rb_node *node = root->rb_node;
    60	
    61		while (node) {
    62			struct io_buffer_node *cur =
    63				container_of(node, struct io_buffer_node, node);
    64	
    65			if (iova < cur->info.iova)
    66				node = node->rb_left;
    67			else if (iova >= cur->info.iova + cur->info.size)
    68				node = node->rb_right;
    69			else
    70				return cur;
    71		}
    72		return NULL;
    73	}
    74	
  > 75	bool insert_fallback_node(struct rb_root *root, struct io_buffer_node *node)
    76	{
    77		struct rb_node **new = &(root->rb_node), *parent = NULL;
    78		dma_addr_t node_end = node->info.iova + node->info.size;
    79	
    80		while (*new) {
    81			struct io_buffer_node *cur =
    82				container_of(*new, struct io_buffer_node, node);
    83			dma_addr_t cur_end = cur->info.iova + cur->info.size;
    84	
    85			parent = *new;
    86			if (node_end <= cur->info.iova)
    87				new = &((*new)->rb_left);
    88			else if (node->info.iova >= cur_end)
    89				new = &((*new)->rb_right);
    90			else {
    91				pr_crit("IOVA collision new=[%llx,%llx) old=[%llx,%llx)\n",
    92					node->info.iova, node_end, cur->info.iova,
    93					cur_end);
    94				return false;
    95			}
    96		}
    97	
    98		rb_link_node(&node->node, parent, new);
    99		rb_insert_color(&node->node, root);
   100		return true;
   101	}
   102	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36856 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/9] dma-iommu: expose a few helper functions to module
  2021-08-06 10:34 ` [PATCH v2 2/9] dma-iommu: expose a few helper functions to module David Stevens
@ 2021-08-06 17:28   ` kernel test robot
  0 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2021-08-06 17:28 UTC (permalink / raw)
  To: David Stevens, Robin Murphy
  Cc: kbuild-all, Christoph Hellwig, Joerg Roedel, Will Deacon,
	Sergey Senozhatsky, Lu Baolu, iommu, linux-kernel, David Stevens

[-- Attachment #1: Type: text/plain, Size: 3631 bytes --]

Hi David,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on iommu/next]
[also build test ERROR on drm-intel/for-linux-next hch-configfs/for-next linus/master v5.14-rc4]
[cannot apply to next-20210805]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Stevens/Add-dynamic-iommu-backed-bounce-buffers/20210806-183631
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: i386-randconfig-a011-20210804 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/385dff3d789a97ef650912616e9d696fba96cb20
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Stevens/Add-dynamic-iommu-backed-bounce-buffers/20210806-183631
        git checkout 385dff3d789a97ef650912616e9d696fba96cb20
        # save the attached .config to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/iommu/dma-iommu.c:415:12: error: conflicting types for '__iommu_dma_alloc_iova'
     415 | dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
         |            ^~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/iommu/dma-iommu.c:14:
   include/linux/dma-iommu.h:47:12: note: previous declaration of '__iommu_dma_alloc_iova' was here
      47 | dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
         |            ^~~~~~~~~~~~~~~~~~~~~~


vim +/__iommu_dma_alloc_iova +415 drivers/iommu/dma-iommu.c

   414	
 > 415	dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
   416			size_t size, u64 dma_limit, struct device *dev)
   417	{
   418		struct iommu_dma_cookie *cookie = domain->iova_cookie;
   419		struct iova_domain *iovad = &cookie->iovad;
   420		unsigned long shift, iova_len, iova = 0;
   421	
   422		if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
   423			cookie->msi_iova += size;
   424			return cookie->msi_iova - size;
   425		}
   426	
   427		shift = iova_shift(iovad);
   428		iova_len = size >> shift;
   429		/*
   430		 * Freeing non-power-of-two-sized allocations back into the IOVA caches
   431		 * will come back to bite us badly, so we have to waste a bit of space
   432		 * rounding up anything cacheable to make sure that can't happen. The
   433		 * order of the unadjusted size will still match upon freeing.
   434		 */
   435		if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
   436			iova_len = roundup_pow_of_two(iova_len);
   437	
   438		dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
   439	
   440		if (domain->geometry.force_aperture)
   441			dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
   442	
   443		/* Try to get PCI devices a SAC address */
   444		if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
   445			iova = alloc_iova_fast(iovad, iova_len,
   446					       DMA_BIT_MASK(32) >> shift, false);
   447	
   448		if (!iova)
   449			iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
   450					       true);
   451	
   452		return (dma_addr_t)iova << shift;
   453	}
   454	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39716 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices
  2021-08-06 10:34 ` [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices David Stevens
  2021-08-06 15:53   ` kernel test robot
@ 2021-08-10  1:19   ` Mi, Dapeng1
  2021-08-10  1:41     ` David Stevens
  1 sibling, 1 reply; 19+ messages in thread
From: Mi, Dapeng1 @ 2021-08-10  1:19 UTC (permalink / raw)
  To: David Stevens, Robin Murphy
  Cc: linux-kernel, Sergey Senozhatsky, iommu, Will Deacon, Christoph Hellwig

Hi David,

I like this patch set and this is crucial for reducing the significant vIOMMU performance. It looks you totally rewrite the IOMMU mapping/unmapping part and use the dynamically allocated memory from buddy system as bounce buffer instead of using the legacy SWIOTLB bounce buffer. As I know, some legacy devices' DMA could not access the memory larger than 32-bit memory space and the dynamically allocated memory address could exceed the 32-bit memory space. Is it a problem?

Thx,
Dapeng Mi

-----Original Message-----
From: iommu <iommu-bounces@lists.linux-foundation.org> On Behalf Of David Stevens
Sent: Friday, August 6, 2021 6:34 PM
To: Robin Murphy <robin.murphy@arm.com>
Cc: linux-kernel@vger.kernel.org; Sergey Senozhatsky <senozhatsky@chromium.org>; iommu@lists.linux-foundation.org; David Stevens <stevensd@chromium.org>; Will Deacon <will@kernel.org>; Christoph Hellwig <hch@lst.de>
Subject: [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices

From: David Stevens <stevensd@chromium.org>

Add support for dynamic bounce buffers to the dma-api for use with subgranule IOMMU mappings with untrusted devices. Bounce buffer management is split into two parts. First, there is a buffer manager that is responsible for allocating and tracking buffers. Second, there is a layer that uses the managed buffers as bounce buffers. It is responsible for managing the IOMMU mapping and for syncing between the original and bounce buffers.

For now, buffer management is very simple - every mapping allocates a new bounce buffer.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/Makefile            |   2 +-
 drivers/iommu/dma-iommu.c         |  70 +++++-
 drivers/iommu/io-bounce-buffers.c | 358 ++++++++++++++++++++++++++++++  drivers/iommu/io-bounce-buffers.h |  46 ++++  drivers/iommu/io-buffer-manager.c | 212 ++++++++++++++++++  drivers/iommu/io-buffer-manager.h |  43 ++++
 6 files changed, 728 insertions(+), 3 deletions(-)  create mode 100644 drivers/iommu/io-bounce-buffers.c  create mode 100644 drivers/iommu/io-bounce-buffers.h  create mode 100644 drivers/iommu/io-buffer-manager.c  create mode 100644 drivers/iommu/io-buffer-manager.h

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index c0fb0ba88143..4edaf7adc082 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
-obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
+obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o io-bounce-buffers.o 
+io-buffer-manager.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 055ccda5eba1..908eb6fb7dc3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -24,6 +24,8 @@
 #include <linux/vmalloc.h>
 #include <linux/crash_dump.h>
 
+#include "io-bounce-buffers.h"
+
 struct iommu_dma_msi_page {
 	struct list_head	list;
 	dma_addr_t		iova;
@@ -44,6 +46,7 @@ struct iommu_dma_cookie {
 		dma_addr_t		msi_iova;
 	};
 	struct list_head		msi_page_list;
+	struct io_bounce_buffers	*bounce_buffers;
 
 	/* Domain for flush queue callback; NULL if flush queue not in use */
 	struct iommu_domain		*fq_domain;
@@ -81,6 +84,14 @@ static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
 	return PAGE_SIZE;
 }
 
+static struct io_bounce_buffers *dev_to_io_bounce_buffers(struct device 
+*dev) {
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+
+	return cookie->bounce_buffers;
+}
+
 static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)  {
 	struct iommu_dma_cookie *cookie;
@@ -160,6 +171,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 	if (!cookie)
 		return;
 
+	if (cookie->bounce_buffers)
+		io_bounce_buffers_destroy(cookie->bounce_buffers);
+
 	if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
 		put_iova_domain(&cookie->iovad);
 
@@ -333,6 +347,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	unsigned long order, base_pfn;
 	struct iova_domain *iovad;
+	int ret;
 
 	if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
 		return -EINVAL;
@@ -380,7 +395,16 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	if (!dev)
 		return 0;
 
-	return iova_reserve_iommu_regions(dev, domain);
+	ret = iova_reserve_iommu_regions(dev, domain);
+
+	if (ret == 0 && dev_is_untrusted(dev)) {
+		cookie->bounce_buffers =
+			io_bounce_buffers_init(dev, domain, iovad);
+		if (IS_ERR(cookie->bounce_buffers))
+			ret = PTR_ERR(cookie->bounce_buffers);
+	}
+
+	return ret;
 }
 
 /**
@@ -710,8 +734,13 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,  static void iommu_dma_sync_single_for_cpu(struct device *dev,
 		dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)  {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	phys_addr_t phys;
 
+	if (bounce && io_bounce_buffers_sync_single(bounce, dma_handle,
+						    size, dir, true))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -722,8 +751,13 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev,  static void iommu_dma_sync_single_for_device(struct device *dev,
 		dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)  {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	phys_addr_t phys;
 
+	if (bounce && io_bounce_buffers_sync_single(bounce, dma_handle,
+						    size, dir, false))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -735,9 +769,13 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
 		struct scatterlist *sgl, int nelems,
 		enum dma_data_direction dir)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	struct scatterlist *sg;
 	int i;
 
+	if (bounce && io_bounce_buffers_sync_sg(bounce, sgl, nelems, dir, true))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -749,9 +787,14 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
 		struct scatterlist *sgl, int nelems,
 		enum dma_data_direction dir)
 {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	struct scatterlist *sg;
 	int i;
 
+	if (bounce && io_bounce_buffers_sync_sg(bounce, sgl,
+						nelems, dir, false))
+		return;
+
 	if (dev_is_dma_coherent(dev))
 		return;
 
@@ -763,11 +806,19 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 		unsigned long offset, size_t size, enum dma_data_direction dir,
 		unsigned long attrs)
 {
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	phys_addr_t phys = page_to_phys(page) + offset;
 	bool coherent = dev_is_dma_coherent(dev);
 	int prot = dma_info_to_prot(dir, coherent, attrs);
 	dma_addr_t dma_handle;
 
+	if (cookie->bounce_buffers &&
+	    io_bounce_buffers_map_page(cookie->bounce_buffers, dev, page,
+				       offset, size, prot, dir, attrs,
+				       &dma_handle))
+		return dma_handle;
+
 	dma_handle = __iommu_dma_map(dev, phys, size, prot, dma_get_mask(dev));
 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
 	    dma_handle != DMA_MAPPING_ERROR)
@@ -778,6 +829,12 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,  static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)  {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
+
+	if (bounce &&
+	    io_bounce_buffers_unmap_page(bounce, dma_handle, size, dir, attrs))
+		return;
+
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
 	__iommu_dma_unmap(dev, dma_handle, size); @@ -876,12 +933,17 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
-	int i;
+	int i, ret;
 
 	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
 	    iommu_deferred_attach(dev, domain))
 		return 0;
 
+	if (cookie->bounce_buffers &&
+	    io_bounce_buffers_map_sg(cookie->bounce_buffers, dev, sg, nents,
+				     prot, dir, attrs, &ret))
+		return ret;
+
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
 
@@ -947,10 +1009,14 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,  static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)  {
+	struct io_bounce_buffers *bounce = dev_to_io_bounce_buffers(dev);
 	dma_addr_t start, end;
 	struct scatterlist *tmp;
 	int i;
 
+	if (bounce && io_bounce_buffers_unmap_sg(bounce, sg, nents, dir, attrs))
+		return;
+
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_sg_for_cpu(dev, sg, nents, dir);
 
diff --git a/drivers/iommu/io-bounce-buffers.c b/drivers/iommu/io-bounce-buffers.c
new file mode 100644
index 000000000000..78b4440b58c8
--- /dev/null
+++ b/drivers/iommu/io-bounce-buffers.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Dynamic IOMMU mapped bounce buffers.
+ *
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#include <linux/dma-iommu.h>
+#include <linux/dma-map-ops.h>
+#include <linux/highmem.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+
+#include "io-buffer-manager.h"
+#include "io-bounce-buffers.h"
+
+struct io_bounce_buffers {
+	struct iommu_domain *domain;
+	struct iova_domain *iovad;
+	unsigned int nid;
+	struct io_buffer_manager manager;
+};
+
+bool io_bounce_buffers_release_buffer_cb(struct io_buffer_manager *manager,
+					 dma_addr_t iova, size_t size)
+{
+	struct io_bounce_buffers *buffers =
+		container_of(manager, struct io_bounce_buffers, manager);
+	return iommu_unmap(buffers->domain, iova, size) >= size; }
+
+struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
+						 struct iommu_domain *domain,
+						 struct iova_domain *iovad)
+{
+	int ret;
+	struct io_bounce_buffers *buffers;
+
+	buffers = kzalloc(sizeof(*buffers), GFP_KERNEL);
+	if (!buffers)
+		return ERR_PTR(-ENOMEM);
+
+	ret = io_buffer_manager_init(&buffers->manager);
+	if (ret) {
+		kfree(buffers);
+		return ERR_PTR(ret);
+	}
+
+	buffers->domain = domain;
+	buffers->iovad = iovad;
+	buffers->nid = dev_to_node(dev);
+
+	return buffers;
+}
+
+void io_bounce_buffers_destroy(struct io_bounce_buffers *buffers) {
+	kfree(buffers);
+}
+
+static bool should_sync_buffer(enum dma_data_direction dir, bool 
+sync_for_cpu) {
+	return dir == DMA_BIDIRECTIONAL ||
+	       (dir == DMA_FROM_DEVICE && sync_for_cpu) ||
+	       (dir == DMA_TO_DEVICE && !sync_for_cpu); }
+
+static void io_bounce_buffers_do_sync(struct io_bounce_buffers *buffers,
+				      struct page **bounce_buffer,
+				      size_t bounce_offset, struct page *orig,
+				      size_t orig_offset, size_t size,
+				      enum dma_data_direction dir, int prot,
+				      bool sync_for_cpu)
+{
+	bool needs_bounce_sync = should_sync_buffer(dir, sync_for_cpu);
+	char *orig_lowmem_ptr;
+	bool dma_is_coherent = prot & IOMMU_CACHE;
+
+	if (dma_is_coherent && !needs_bounce_sync)
+		return;
+
+	orig_lowmem_ptr = PageHighMem(orig) ? NULL : page_to_virt(orig);
+
+	while (size) {
+		size_t copy_len, bounce_page_offset;
+		struct page *bounce_page;
+
+		bounce_page = bounce_buffer[bounce_offset / PAGE_SIZE];
+		bounce_page_offset = bounce_offset % PAGE_SIZE;
+
+		copy_len = size;
+		if (copy_len + bounce_page_offset > PAGE_SIZE)
+			copy_len = PAGE_SIZE - bounce_page_offset;
+
+		if (!dma_is_coherent && sync_for_cpu) {
+			phys_addr_t paddr = page_to_phys(bounce_page);
+
+			arch_sync_dma_for_cpu(paddr + bounce_page_offset,
+					      copy_len, dir);
+		}
+
+		if (needs_bounce_sync) {
+			char *bounce_page_ptr = kmap_local_page(bounce_page);
+			char *bounce_ptr = bounce_page_ptr + bounce_page_offset;
+
+			if (!orig_lowmem_ptr) {
+				size_t remaining = copy_len;
+				size_t offset = orig_offset % PAGE_SIZE;
+				size_t orig_page_idx = orig_offset / PAGE_SIZE;
+
+				while (remaining) {
+					char *orig_ptr;
+					size_t sz = min(remaining,
+							PAGE_SIZE - offset);
+
+					orig_ptr = kmap_local_page(
+						nth_page(orig, orig_page_idx));
+					if (sync_for_cpu) {
+						memcpy(orig_ptr + offset,
+						       bounce_ptr, sz);
+					} else {
+						memcpy(bounce_ptr,
+						       orig_ptr + offset, sz);
+					}
+					kunmap_local(orig_ptr);
+
+					remaining -= sz;
+					orig_page_idx += 1;
+					bounce_ptr += sz;
+					offset = 0;
+				}
+			} else if (sync_for_cpu) {
+				memcpy(orig_lowmem_ptr + orig_offset,
+				       bounce_ptr, copy_len);
+			} else {
+				memcpy(bounce_ptr,
+				       orig_lowmem_ptr + orig_offset, copy_len);
+			}
+
+			kunmap_local(bounce_page_ptr);
+		}
+
+		if (!dma_is_coherent && !sync_for_cpu) {
+			phys_addr_t paddr = page_to_phys(bounce_page);
+
+			arch_sync_dma_for_device(paddr + bounce_page_offset,
+						 copy_len, dir);
+		}
+
+		bounce_offset += copy_len;
+		orig_offset += copy_len;
+		size -= copy_len;
+	}
+}
+
+bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
+				   dma_addr_t dma_handle, size_t size,
+				   enum dma_data_direction dir,
+				   bool sync_for_cpu)
+{
+	struct io_bounce_buffer_info info;
+	void *orig_buffer;
+	size_t offset;
+	int prot;
+
+	if (!io_buffer_manager_find_buffer(&buffers->manager, dma_handle, &info,
+					   &orig_buffer, &prot))
+		return false;
+
+	offset = dma_handle - info.iova;
+	io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
+				  orig_buffer, offset, size, dir, prot,
+				  sync_for_cpu);
+	return true;
+}
+
+static void __io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
+					struct scatterlist *sgl, int nents,
+					struct page **bounce_buffer,
+					enum dma_data_direction dir, int prot,
+					bool sync_for_cpu)
+{
+	size_t bounce_offset = 0;
+	struct scatterlist *iter;
+	int i;
+
+	for_each_sg(sgl, iter, nents, i) {
+		io_bounce_buffers_do_sync(buffers, bounce_buffer, bounce_offset,
+					  sg_page(iter), iter->offset,
+					  iter->length, dir, prot,
+					  sync_for_cpu);
+		bounce_offset += iter->length;
+	}
+}
+
+bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
+			       struct scatterlist *sgl, int nents,
+			       enum dma_data_direction dir, bool sync_for_cpu) {
+	struct io_bounce_buffer_info info;
+	void *orig_buffer;
+	int prot;
+
+	if (!io_buffer_manager_find_buffer(&buffers->manager,
+					   sg_dma_address(sgl), &info,
+					   &orig_buffer, &prot))
+		return false;
+
+	// In the non bounce buffer case, iommu_dma_map_sg syncs before setting
+	// up the new mapping's dma address. This check handles false positives
+	// in find_buffer caused by sgl being reused for a non bounce buffer
+	// case after being used with a bounce buffer.
+	if (orig_buffer != sgl)
+		return false;
+
+	__io_bounce_buffers_sync_sg(buffers, sgl, nents, info.bounce_buffer,
+				    dir, prot, sync_for_cpu);
+
+	return true;
+}
+
+bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
+				  dma_addr_t handle, size_t size,
+				  enum dma_data_direction dir,
+				  unsigned long attrs)
+{
+	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		io_bounce_buffers_sync_single(buffers, handle, size, dir, true);
+
+	return io_buffer_manager_release_buffer(&buffers->manager,
+						buffers->domain, handle, true);
+}
+
+bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
+				struct scatterlist *sgl, int nents,
+				enum dma_data_direction dir,
+				unsigned long attrs)
+{
+	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		io_bounce_buffers_sync_sg(buffers, sgl, nents, dir, true);
+
+	return io_buffer_manager_release_buffer(
+		&buffers->manager, buffers->domain, sg_dma_address(sgl), true); }
+
+static bool io_bounce_buffers_map_buffer(struct io_bounce_buffers *buffers,
+					 struct io_bounce_buffer_info *info,
+					 int prot)
+{
+	unsigned int count = info->size >> PAGE_SHIFT;
+	struct sg_table sgt;
+	size_t mapped;
+
+	if (sg_alloc_table_from_pages(&sgt, info->bounce_buffer, count, 0,
+				      info->size, GFP_ATOMIC))
+		return false;
+
+	mapped = iommu_map_sg_atomic(buffers->domain, info->iova, sgt.sgl,
+				     sgt.orig_nents, prot);
+
+	sg_free_table(&sgt);
+	return mapped >= info->size;
+}
+
+bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
+				struct device *dev, struct page *page,
+				unsigned long offset, size_t size, int prot,
+				enum dma_data_direction dir,
+				unsigned long attrs, dma_addr_t *handle) {
+	bool skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
+	struct io_bounce_buffer_info info;
+	bool force_bounce = iova_offset(buffers->iovad, offset | size);
+
+	if (!force_bounce)
+		return false;
+
+	*handle = DMA_MAPPING_ERROR;
+	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, page,
+					    offset + size, prot, buffers->nid,
+					    &info))
+		return true;
+
+	if (!skip_cpu_sync)
+		io_bounce_buffers_do_sync(buffers, info.bounce_buffer, offset,
+					  page, offset, size, dir, prot, false);
+
+	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
+		io_buffer_manager_release_buffer(
+			&buffers->manager, buffers->domain, info.iova, false);
+		return true;
+	}
+
+	*handle = info.iova + offset;
+	return true;
+}
+
+bool io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
+			      struct device *dev, struct scatterlist *sgl,
+			      int nents, int prot, enum dma_data_direction dir,
+			      unsigned long attrs, int *out_nents) {
+	struct io_bounce_buffer_info info;
+	struct scatterlist *iter;
+	size_t size = 0;
+	bool skip_cpu_sync = attrs & DMA_ATTR_SKIP_CPU_SYNC;
+	dma_addr_t seg_iova;
+	int i;
+	bool force_bounce = false;
+
+	for_each_sg(sgl, iter, nents, i) {
+		size += iter->length;
+		force_bounce |= iova_offset(buffers->iovad,
+					    iter->offset | iter->length);
+	}
+
+	if (!force_bounce)
+		return false;
+
+	*out_nents = 0;
+	if (!io_buffer_manager_alloc_buffer(&buffers->manager, dev, sgl, size,
+					    prot, buffers->nid, &info))
+		return true;
+
+	if (!skip_cpu_sync)
+		__io_bounce_buffers_sync_sg(buffers, sgl, nents,
+					    info.bounce_buffer, dir, prot,
+					    false);
+
+	if (!io_bounce_buffers_map_buffer(buffers, &info, prot)) {
+		io_buffer_manager_release_buffer(
+			&buffers->manager, buffers->domain, info.iova, false);
+		return true;
+	}
+
+	i = 0;
+	seg_iova = info.iova;
+	while (size > 0) {
+		size_t seg_size = min_t(size_t, size,
+					dma_get_max_seg_size(dev));
+
+		sg_dma_len(sgl) = seg_size;
+		sg_dma_address(sgl) = seg_iova;
+
+		sgl = sg_next(sgl);
+		size -= seg_size;
+		seg_iova += seg_size;
+		i++;
+	}
+
+	if (sgl) {
+		sg_dma_address(sgl) = DMA_MAPPING_ERROR;
+		sg_dma_len(sgl) = 0;
+	}
+
+	*out_nents = i;
+	return true;
+}
diff --git a/drivers/iommu/io-bounce-buffers.h b/drivers/iommu/io-bounce-buffers.h
new file mode 100644
index 000000000000..6d132a27646c
--- /dev/null
+++ b/drivers/iommu/io-bounce-buffers.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#ifndef _LINUX_IO_BOUNCE_BUFFERS_H
+#define _LINUX_IO_BOUNCE_BUFFERS_H
+
+#include <linux/dma-iommu.h>
+#include <linux/iova.h>
+
+struct io_bounce_buffers;
+
+struct io_bounce_buffers *io_bounce_buffers_init(struct device *dev,
+						 struct iommu_domain *domain,
+						 struct iova_domain *iovad);
+void io_bounce_buffers_destroy(struct io_bounce_buffers *buffers);
+
+bool io_bounce_buffers_sync_single(struct io_bounce_buffers *buffers,
+				   dma_addr_t dma_handle, size_t size,
+				   enum dma_data_direction dir,
+				   bool sync_for_cpu);
+bool io_bounce_buffers_sync_sg(struct io_bounce_buffers *buffers,
+			       struct scatterlist *sgl, int nents,
+			       enum dma_data_direction dir, bool sync_for_cpu);
+
+bool io_bounce_buffers_map_page(struct io_bounce_buffers *buffers,
+				struct device *dev, struct page *page,
+				unsigned long offset, size_t size, int prot,
+				enum dma_data_direction dir,
+				unsigned long attrs, dma_addr_t *handle); bool 
+io_bounce_buffers_map_sg(struct io_bounce_buffers *buffers,
+			      struct device *dev, struct scatterlist *sgl,
+			      int nents, int prot, enum dma_data_direction dir,
+			      unsigned long attrs, int *out_nents);
+
+bool io_bounce_buffers_unmap_page(struct io_bounce_buffers *buffers,
+				  dma_addr_t handle, size_t size,
+				  enum dma_data_direction dir,
+				  unsigned long attrs);
+bool io_bounce_buffers_unmap_sg(struct io_bounce_buffers *buffers,
+				struct scatterlist *sgl, int nents,
+				enum dma_data_direction dir,
+				unsigned long attrs);
+
+#endif /* _LINUX_IO_BOUNCE_BUFFERS_H */
diff --git a/drivers/iommu/io-buffer-manager.c b/drivers/iommu/io-buffer-manager.c
new file mode 100644
index 000000000000..24e95a2faa37
--- /dev/null
+++ b/drivers/iommu/io-buffer-manager.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Manager which allocates and tracks bounce buffers and their IOVAs. 
+Does
+ * not actually manage the IOMMU mapping nor do the bounce copies.
+ *
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#include "io-buffer-manager.h"
+
+#include <linux/slab.h>
+
+struct io_buffer_node {
+	struct rb_node node;
+	struct io_bounce_buffer_info info;
+	void *orig_buffer;
+	int prot;
+};
+
+static void io_buffer_manager_free_pages(struct page **pages, int 
+count) {
+	while (count--)
+		__free_page(pages[count]);
+	kfree(pages);
+}
+
+static struct page **io_buffer_manager_alloc_pages(int count, unsigned 
+int nid) {
+	struct page **pages;
+	unsigned int i;
+
+	pages = kmalloc_array(count, sizeof(*pages), GFP_ATOMIC);
+	if (!pages)
+		return NULL;
+
+	// The IOMMU can map highmem pages, but try to allocate non-highmem
+	// pages first to make accessing the buffer cheaper.
+	for (i = 0; i < count; i++) {
+		pages[i] = alloc_pages_node(
+			nid,
+			GFP_ATOMIC | __GFP_ZERO | __GFP_NORETRY | __GFP_NOWARN,
+			0);
+		if (!pages[i]) {
+			pages[i] = alloc_pages_node(
+				nid, GFP_ATOMIC | __GFP_ZERO | __GFP_HIGHMEM,
+				0);
+			if (!pages[i]) {
+				io_buffer_manager_free_pages(pages, i);
+				return NULL;
+			}
+		}
+	}
+
+	return pages;
+}
+
+struct io_buffer_node *find_fallback_node(struct rb_root *root, 
+dma_addr_t iova) {
+	struct rb_node *node = root->rb_node;
+
+	while (node) {
+		struct io_buffer_node *cur =
+			container_of(node, struct io_buffer_node, node);
+
+		if (iova < cur->info.iova)
+			node = node->rb_left;
+		else if (iova >= cur->info.iova + cur->info.size)
+			node = node->rb_right;
+		else
+			return cur;
+	}
+	return NULL;
+}
+
+bool insert_fallback_node(struct rb_root *root, struct io_buffer_node 
+*node) {
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+	dma_addr_t node_end = node->info.iova + node->info.size;
+
+	while (*new) {
+		struct io_buffer_node *cur =
+			container_of(*new, struct io_buffer_node, node);
+		dma_addr_t cur_end = cur->info.iova + cur->info.size;
+
+		parent = *new;
+		if (node_end <= cur->info.iova)
+			new = &((*new)->rb_left);
+		else if (node->info.iova >= cur_end)
+			new = &((*new)->rb_right);
+		else {
+			pr_crit("IOVA collision new=[%llx,%llx) old=[%llx,%llx)\n",
+				node->info.iova, node_end, cur->info.iova,
+				cur_end);
+			return false;
+		}
+	}
+
+	rb_link_node(&node->node, parent, new);
+	rb_insert_color(&node->node, root);
+	return true;
+}
+
+bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
+				    struct device *dev, void *orig_buffer,
+				    size_t size, int prot, unsigned int nid,
+				    struct io_bounce_buffer_info *info) {
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct io_buffer_node *node;
+	unsigned long flags;
+
+	node = kzalloc(sizeof(*node), GFP_ATOMIC);
+	if (!node)
+		return false;
+
+	size = PAGE_ALIGN(size);
+	node->info.iova =
+		__iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+	if (!node->info.iova)
+		goto free_node;
+
+	node->info.bounce_buffer =
+		io_buffer_manager_alloc_pages(size >> PAGE_SHIFT, nid);
+	if (!node->info.bounce_buffer)
+		goto free_iova;
+
+	spin_lock_irqsave(&manager->fallback_lock, flags);
+	if (!insert_fallback_node(&manager->fallback_buffers, node))
+		goto fallback_lock_unlock;
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+
+	node->orig_buffer = orig_buffer;
+	node->prot = prot;
+	node->info.size = size;
+
+	*info = node->info;
+
+	return true;
+
+fallback_lock_unlock:
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+free_iova:
+	__iommu_dma_free_iova(domain->iova_cookie, node->info.iova, size, 
+NULL);
+free_node:
+	kfree(node);
+	return false;
+}
+
+bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
+				   dma_addr_t handle,
+				   struct io_bounce_buffer_info *info,
+				   void **orig_buffer, int *prot)
+{
+	struct io_buffer_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&manager->fallback_lock, flags);
+	node = find_fallback_node(&manager->fallback_buffers, handle);
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+
+	if (!node)
+		return false;
+
+	*info = node->info;
+	*orig_buffer = node->orig_buffer;
+	*prot = node->prot;
+	return true;
+}
+
+bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
+				      struct iommu_domain *domain,
+				      dma_addr_t handle, bool inited) {
+	struct io_buffer_node *node;
+	unsigned long flags;
+	bool free_buffer;
+
+	spin_lock_irqsave(&manager->fallback_lock, flags);
+	node = find_fallback_node(&manager->fallback_buffers, handle);
+	if (node)
+		rb_erase(&node->node, &manager->fallback_buffers);
+	spin_unlock_irqrestore(&manager->fallback_lock, flags);
+
+	if (!node)
+		return false;
+
+	if (inited)
+		free_buffer = io_bounce_buffers_release_buffer_cb(
+			manager, node->info.iova, node->info.size);
+	else
+		free_buffer = true;
+
+	if (free_buffer) {
+		io_buffer_manager_free_pages(node->info.bounce_buffer,
+					     node->info.size >> PAGE_SHIFT);
+		__iommu_dma_free_iova(domain->iova_cookie, node->info.iova,
+				      node->info.size, NULL);
+	} else {
+		pr_warn("Bounce buffer release failed; leaking buffer\n");
+	}
+
+	kfree(node);
+	return true;
+}
+
+int io_buffer_manager_init(struct io_buffer_manager *manager) {
+	manager->fallback_buffers = RB_ROOT;
+	spin_lock_init(&manager->fallback_lock);
+
+	return 0;
+}
diff --git a/drivers/iommu/io-buffer-manager.h b/drivers/iommu/io-buffer-manager.h
new file mode 100644
index 000000000000..aae560cc8512
--- /dev/null
+++ b/drivers/iommu/io-buffer-manager.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#ifndef _LINUX_IO_BUFFER_MANAGER_H
+#define _LINUX_IO_BUFFER_MANAGER_H
+
+#include <linux/dma-iommu.h>
+#include <linux/iova.h>
+#include <linux/spinlock.h>
+
+struct io_buffer_manager {
+	spinlock_t fallback_lock;
+	struct rb_root fallback_buffers;
+};
+
+struct io_bounce_buffer_info {
+	struct page **bounce_buffer;
+	dma_addr_t iova;
+	size_t size;
+};
+
+bool io_buffer_manager_alloc_buffer(struct io_buffer_manager *manager,
+				    struct device *dev, void *orig_buffer,
+				    size_t size, int prot, unsigned int nid,
+				    struct io_bounce_buffer_info *info);
+
+bool io_buffer_manager_find_buffer(struct io_buffer_manager *manager,
+				   dma_addr_t handle,
+				   struct io_bounce_buffer_info *info,
+				   void **orig_buffer, int *prot);
+
+bool io_buffer_manager_release_buffer(struct io_buffer_manager *manager,
+				      struct iommu_domain *domain,
+				      dma_addr_t handle, bool inited);
+
+int io_buffer_manager_init(struct io_buffer_manager *manager);
+
+bool io_bounce_buffers_release_buffer_cb(struct io_buffer_manager *manager,
+					 dma_addr_t iova, size_t size);
+
+#endif /* _LINUX_IO_BUFFER_MANAGER_H */
--
2.32.0.605.g8dce9f2422-goog

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices
  2021-08-10  1:19   ` Mi, Dapeng1
@ 2021-08-10  1:41     ` David Stevens
  0 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2021-08-10  1:41 UTC (permalink / raw)
  To: Mi, Dapeng1
  Cc: Robin Murphy, linux-kernel, Sergey Senozhatsky, iommu,
	Will Deacon, Christoph Hellwig

On Tue, Aug 10, 2021 at 10:19 AM Mi, Dapeng1 <dapeng1.mi@intel.com> wrote:
>
> Hi David,
>
> I like this patch set and this is crucial for reducing the significant vIOMMU performance. It looks you totally rewrite the IOMMU mapping/unmapping part and use the dynamically allocated memory from buddy system as bounce buffer instead of using the legacy SWIOTLB bounce buffer. As I know, some legacy devices' DMA could not access the memory larger than 32-bit memory space and the dynamically allocated memory address could exceed the 32-bit memory space. Is it a problem?

My understanding is that when devices with that sort of limitation sit
behind an IOMMU, the IOVA is what matters, not the physical address.
The bounce bounce buffers use the same limits for IOVA allocation as
the regular dma-iommu path, so compatible IOVAs will be allocated for
the bounce buffers.

-David

> Thx,
> Dapeng Mi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
  2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
                   ` (8 preceding siblings ...)
  2021-08-06 10:34 ` [PATCH v2 9/9] drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag David Stevens
@ 2022-05-24 12:27 ` Niklas Schnelle
  2022-05-27  1:25   ` David Stevens
  9 siblings, 1 reply; 19+ messages in thread
From: Niklas Schnelle @ 2022-05-24 12:27 UTC (permalink / raw)
  To: David Stevens, Robin Murphy
  Cc: linux-kernel, Sergey Senozhatsky, iommu, Will Deacon, Christoph Hellwig

On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> This patch series adds support for per-domain dynamic pools of iommu
> bounce buffers to the dma-iommu API. This allows iommu mappings to be
> reused while still maintaining strict iommu protection.
> 
> This bounce buffer support is used to add a new config option that, when
> enabled, causes all non-direct streaming mappings below a configurable
> size to go through the bounce buffers. This serves as an optimization on
> systems where manipulating iommu mappings is very expensive. For
> example, virtio-iommu operations in a guest on a linux host require a
> vmexit, involvement the VMM, and a VFIO syscall. For relatively small
> DMA operations, memcpy can be significantly faster.
> 
> As a performance comparison, on a device with an i5-10210U, I ran fio
> with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
> --rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
> and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> by >99%, as bounce buffers don't require syncing here in the read case.
> Running with multiple jobs doesn't serve as a useful performance
> comparison because virtio-iommu and vfio_iommu_type1 both have big
> locks that significantly limit mulithreaded DMA performance.
> 
> These pooled bounce buffers are also used for subgranule mappings with
> untrusted devices, replacing the single use bounce buffers used
> currently. The biggest difference here is that the new implementation
> maps a whole sglist using a single bounce buffer. The new implementation
> does not support using bounce buffers for only some segments of the
> sglist, so it may require more copying. However, the current
> implementation requires per-segment iommu map/unmap operations for all
> untrusted sglist mappings (fully aligned sglists included). On a 
> i5-10210U laptop with the internal NVMe drive made to appear untrusted,
> fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
> a statistically significant decrease in CPU load from 2.28% -> 2.17%
> with the new iommu bounce buffer optimization enabled.
> 
> Each domain's buffer pool is split into multiple power-of-2 size
> classes. Each class allocates a fixed number of buffer slot metadata. A
> large iova range is allocated, and each slot is assigned an iova from
> the range. This allows the iova to be easily mapped back to the slot,
> and allows the critical section of most pool operations to be constant
> time. The one exception is finding a cached buffer to reuse. These are
> only separated according to R/W permissions - the use of other
> permissions such as IOMMU_PRIV may require a linear search through the
> cache. However, these other permissions are rare and likely exhibit high
> locality, so the should not be a bottleneck in practice.
> 
> Since untrusted devices may require bounce buffers, each domain has a
> fallback rbtree to manage single use buffers. This may be necessary if a
> very large number of DMA operations are simultaneously in-flight, or for
> very large individual DMA operations.
> 
> This patch set does not use swiotlb. There are two primary ways in which
> swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
> allocates buffers to be compatible with a single device, whereas
> per-domain buffer pools don't handle that during buffer allocation as a
> single buffer may end up being used by multiple devices. Second, swiotlb
> allocation establishes the original to bounce buffer mapping, which
> again doesn't work if buffers can be reused. Effectively the only code
> that can be shared between the two use cases is allocating slots from
> the swiotlb's memory. However, given that we're going to be allocating
> memory for use with an iommu, allocating memory from a block of memory
> explicitly set aside to deal with a lack of iommu seems kind of
> contradictory. At best there might be a small performance improvement if 
> wiotlb allocation is faster than regular page allocation, but buffer
> allocation isn't on the hot path anyway.
> 
> Not using the swiotlb has the benefit that memory doesn't have to be
> preallocated. Instead, bounce buffers consume memory only for in-flight
> dma transactions (ignoring temporarily cached buffers), which is the
> smallest amount possible. This makes it easier to use bounce buffers as
> an optimization on systems with large numbers of devices or in
> situations where devices are unknown, since it is not necessary to try
> to tune how much memory needs to be set aside to achieve good
> performance without costing too much memory.
> 
> Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
> is meant to address devices which create long lived streaming mappings
> but manage CPU cache coherency without using the dma_sync_* APIs.
> Currently, these devices don't function properly with swiotlb=force. The
> new flag is used to bypass bounce buffers so such devices will function
> when the new bounce buffer optimization is enabled. The flag is added to
> the i915 driver, which creates such mappings. It can also be added to
> various dma-buf implementations as an optimization, although that is not
> done here.
> 
> v1 -> v2:
>  - Replace existing untrusted bounce buffers with new bounce
>    buffer pools. This includes significant rework to account for
>    untrusted bounce buffers being required instead of an
>    optimization.
>  - Add flag for persistent streaming mappings.
> 

Hi David,

I'm currently looking into converting s390 from our custom IOMMU based
DMA API implementation to using dma-iommu.c. We're always using an
IOMMU for PCI devices even when doing pass-through to guests (under
both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
we use to do the shadowing of the guest I/O translations, are
relatively expensive I'm thus very interested in your work. I've tried
rebasing it on v5.18 and got it to compile but didn't get DMA to work
though it seems to partially work as I don't get probe failures unlike
with a completely broken DMA API. Since I might have very well screwed
up the rebase and my DMA API conversion is experimental too I was
wondering if you're still working on this and might have a current
version I could experiment with?

Thanks,
Niklas


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
  2022-05-24 12:27 ` [PATCH v2 0/9] Add dynamic iommu backed bounce buffers Niklas Schnelle
@ 2022-05-27  1:25   ` David Stevens
  2022-06-03 14:53     ` Niklas Schnelle
  2022-07-01  9:23     ` Niklas Schnelle
  0 siblings, 2 replies; 19+ messages in thread
From: David Stevens @ 2022-05-27  1:25 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Robin Murphy, linux-kernel, Sergey Senozhatsky, iommu,
	Will Deacon, Christoph Hellwig

On Tue, May 24, 2022 at 9:27 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
>
> On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > This patch series adds support for per-domain dynamic pools of iommu
> > bounce buffers to the dma-iommu API. This allows iommu mappings to be
> > reused while still maintaining strict iommu protection.
> >
> > This bounce buffer support is used to add a new config option that, when
> > enabled, causes all non-direct streaming mappings below a configurable
> > size to go through the bounce buffers. This serves as an optimization on
> > systems where manipulating iommu mappings is very expensive. For
> > example, virtio-iommu operations in a guest on a linux host require a
> > vmexit, involvement the VMM, and a VFIO syscall. For relatively small
> > DMA operations, memcpy can be significantly faster.
> >
> > As a performance comparison, on a device with an i5-10210U, I ran fio
> > with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
> > --rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
> > and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> > spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> > 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> > by >99%, as bounce buffers don't require syncing here in the read case.
> > Running with multiple jobs doesn't serve as a useful performance
> > comparison because virtio-iommu and vfio_iommu_type1 both have big
> > locks that significantly limit mulithreaded DMA performance.
> >
> > These pooled bounce buffers are also used for subgranule mappings with
> > untrusted devices, replacing the single use bounce buffers used
> > currently. The biggest difference here is that the new implementation
> > maps a whole sglist using a single bounce buffer. The new implementation
> > does not support using bounce buffers for only some segments of the
> > sglist, so it may require more copying. However, the current
> > implementation requires per-segment iommu map/unmap operations for all
> > untrusted sglist mappings (fully aligned sglists included). On a
> > i5-10210U laptop with the internal NVMe drive made to appear untrusted,
> > fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
> > a statistically significant decrease in CPU load from 2.28% -> 2.17%
> > with the new iommu bounce buffer optimization enabled.
> >
> > Each domain's buffer pool is split into multiple power-of-2 size
> > classes. Each class allocates a fixed number of buffer slot metadata. A
> > large iova range is allocated, and each slot is assigned an iova from
> > the range. This allows the iova to be easily mapped back to the slot,
> > and allows the critical section of most pool operations to be constant
> > time. The one exception is finding a cached buffer to reuse. These are
> > only separated according to R/W permissions - the use of other
> > permissions such as IOMMU_PRIV may require a linear search through the
> > cache. However, these other permissions are rare and likely exhibit high
> > locality, so the should not be a bottleneck in practice.
> >
> > Since untrusted devices may require bounce buffers, each domain has a
> > fallback rbtree to manage single use buffers. This may be necessary if a
> > very large number of DMA operations are simultaneously in-flight, or for
> > very large individual DMA operations.
> >
> > This patch set does not use swiotlb. There are two primary ways in which
> > swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
> > allocates buffers to be compatible with a single device, whereas
> > per-domain buffer pools don't handle that during buffer allocation as a
> > single buffer may end up being used by multiple devices. Second, swiotlb
> > allocation establishes the original to bounce buffer mapping, which
> > again doesn't work if buffers can be reused. Effectively the only code
> > that can be shared between the two use cases is allocating slots from
> > the swiotlb's memory. However, given that we're going to be allocating
> > memory for use with an iommu, allocating memory from a block of memory
> > explicitly set aside to deal with a lack of iommu seems kind of
> > contradictory. At best there might be a small performance improvement if
> > wiotlb allocation is faster than regular page allocation, but buffer
> > allocation isn't on the hot path anyway.
> >
> > Not using the swiotlb has the benefit that memory doesn't have to be
> > preallocated. Instead, bounce buffers consume memory only for in-flight
> > dma transactions (ignoring temporarily cached buffers), which is the
> > smallest amount possible. This makes it easier to use bounce buffers as
> > an optimization on systems with large numbers of devices or in
> > situations where devices are unknown, since it is not necessary to try
> > to tune how much memory needs to be set aside to achieve good
> > performance without costing too much memory.
> >
> > Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
> > is meant to address devices which create long lived streaming mappings
> > but manage CPU cache coherency without using the dma_sync_* APIs.
> > Currently, these devices don't function properly with swiotlb=force. The
> > new flag is used to bypass bounce buffers so such devices will function
> > when the new bounce buffer optimization is enabled. The flag is added to
> > the i915 driver, which creates such mappings. It can also be added to
> > various dma-buf implementations as an optimization, although that is not
> > done here.
> >
> > v1 -> v2:
> >  - Replace existing untrusted bounce buffers with new bounce
> >    buffer pools. This includes significant rework to account for
> >    untrusted bounce buffers being required instead of an
> >    optimization.
> >  - Add flag for persistent streaming mappings.
> >
>
> Hi David,
>
> I'm currently looking into converting s390 from our custom IOMMU based
> DMA API implementation to using dma-iommu.c. We're always using an
> IOMMU for PCI devices even when doing pass-through to guests (under
> both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
> we use to do the shadowing of the guest I/O translations, are
> relatively expensive I'm thus very interested in your work. I've tried
> rebasing it on v5.18 and got it to compile but didn't get DMA to work
> though it seems to partially work as I don't get probe failures unlike
> with a completely broken DMA API. Since I might have very well screwed
> up the rebase and my DMA API conversion is experimental too I was
> wondering if you're still working on this and might have a current
> version I could experiment with?

Unfortunately I don't have anything more recent to share. I've come
across some performance issues caused by pathological usage patterns
in internal usage, but I haven't seen any correctness issues. I'm
hoping that I'll be able to address the performance issues and send a
rebased series within the next month or so.

It's definitely possible that this series has some bugs. I've tested
it on a range of chromebooks and their various hardware and drivers,
but that's still all relatively normal x86_64/arm64. If your hardware
is more particular about its DMA, this series might be missing
something.

-David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
  2022-05-27  1:25   ` David Stevens
@ 2022-06-03 14:53     ` Niklas Schnelle
  2022-06-06  1:24       ` David Stevens
  2022-07-01  9:23     ` Niklas Schnelle
  1 sibling, 1 reply; 19+ messages in thread
From: Niklas Schnelle @ 2022-06-03 14:53 UTC (permalink / raw)
  To: David Stevens
  Cc: Robin Murphy, linux-kernel, Sergey Senozhatsky, iommu,
	Will Deacon, Christoph Hellwig

On Fri, 2022-05-27 at 10:25 +0900, David Stevens wrote:
> On Tue, May 24, 2022 at 9:27 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > > 
> > > This patch series adds support for per-domain dynamic pools of iommu
> > > bounce buffers to the dma-iommu API. This allows iommu mappings to be
> > > reused while still maintaining strict iommu protection.
> > > 
> > > This bounce buffer support is used to add a new config option that, when
> > > enabled, causes all non-direct streaming mappings below a configurable
> > > size to go through the bounce buffers. This serves as an optimization on
> > > systems where manipulating iommu mappings is very expensive. For
> > > example, virtio-iommu operations in a guest on a linux host require a
> > > vmexit, involvement the VMM, and a VFIO syscall. For relatively small
> > > DMA operations, memcpy can be significantly faster.
> > > 
> > > As a performance comparison, on a device with an i5-10210U, I ran fio
> > > with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
> > > --rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
> > > and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> > > spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> > > 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> > > by >99%, as bounce buffers don't require syncing here in the read case.
> > > Running with multiple jobs doesn't serve as a useful performance
> > > comparison because virtio-iommu and vfio_iommu_type1 both have big
> > > locks that significantly limit mulithreaded DMA performance.
> > > 
> > > These pooled bounce buffers are also used for subgranule mappings with
> > > untrusted devices, replacing the single use bounce buffers used
> > > currently. The biggest difference here is that the new implementation
> > > maps a whole sglist using a single bounce buffer. The new implementation
> > > does not support using bounce buffers for only some segments of the
> > > sglist, so it may require more copying. However, the current
> > > implementation requires per-segment iommu map/unmap operations for all
> > > untrusted sglist mappings (fully aligned sglists included). On a
> > > i5-10210U laptop with the internal NVMe drive made to appear untrusted,
> > > fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
> > > a statistically significant decrease in CPU load from 2.28% -> 2.17%
> > > with the new iommu bounce buffer optimization enabled.
> > > 
> > > Each domain's buffer pool is split into multiple power-of-2 size
> > > classes. Each class allocates a fixed number of buffer slot metadata. A
> > > large iova range is allocated, and each slot is assigned an iova from
> > > the range. This allows the iova to be easily mapped back to the slot,
> > > and allows the critical section of most pool operations to be constant
> > > time. The one exception is finding a cached buffer to reuse. These are
> > > only separated according to R/W permissions - the use of other
> > > permissions such as IOMMU_PRIV may require a linear search through the
> > > cache. However, these other permissions are rare and likely exhibit high
> > > locality, so the should not be a bottleneck in practice.
> > > 
> > > Since untrusted devices may require bounce buffers, each domain has a
> > > fallback rbtree to manage single use buffers. This may be necessary if a
> > > very large number of DMA operations are simultaneously in-flight, or for
> > > very large individual DMA operations.
> > > 
> > > This patch set does not use swiotlb. There are two primary ways in which
> > > swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
> > > allocates buffers to be compatible with a single device, whereas
> > > per-domain buffer pools don't handle that during buffer allocation as a
> > > single buffer may end up being used by multiple devices. Second, swiotlb
> > > allocation establishes the original to bounce buffer mapping, which
> > > again doesn't work if buffers can be reused. Effectively the only code
> > > that can be shared between the two use cases is allocating slots from
> > > the swiotlb's memory. However, given that we're going to be allocating
> > > memory for use with an iommu, allocating memory from a block of memory
> > > explicitly set aside to deal with a lack of iommu seems kind of
> > > contradictory. At best there might be a small performance improvement if
> > > wiotlb allocation is faster than regular page allocation, but buffer
> > > allocation isn't on the hot path anyway.
> > > 
> > > Not using the swiotlb has the benefit that memory doesn't have to be
> > > preallocated. Instead, bounce buffers consume memory only for in-flight
> > > dma transactions (ignoring temporarily cached buffers), which is the
> > > smallest amount possible. This makes it easier to use bounce buffers as
> > > an optimization on systems with large numbers of devices or in
> > > situations where devices are unknown, since it is not necessary to try
> > > to tune how much memory needs to be set aside to achieve good
> > > performance without costing too much memory.
> > > 
> > > Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
> > > is meant to address devices which create long lived streaming mappings
> > > but manage CPU cache coherency without using the dma_sync_* APIs.
> > > Currently, these devices don't function properly with swiotlb=force. The
> > > new flag is used to bypass bounce buffers so such devices will function
> > > when the new bounce buffer optimization is enabled. The flag is added to
> > > the i915 driver, which creates such mappings. It can also be added to
> > > various dma-buf implementations as an optimization, although that is not
> > > done here.
> > > 
> > > v1 -> v2:
> > >  - Replace existing untrusted bounce buffers with new bounce
> > >    buffer pools. This includes significant rework to account for
> > >    untrusted bounce buffers being required instead of an
> > >    optimization.
> > >  - Add flag for persistent streaming mappings.
> > > 
> > 
> > Hi David,
> > 
> > I'm currently looking into converting s390 from our custom IOMMU based
> > DMA API implementation to using dma-iommu.c. We're always using an
> > IOMMU for PCI devices even when doing pass-through to guests (under
> > both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
> > we use to do the shadowing of the guest I/O translations, are
> > relatively expensive I'm thus very interested in your work. I've tried
> > rebasing it on v5.18 and got it to compile but didn't get DMA to work
> > though it seems to partially work as I don't get probe failures unlike
> > with a completely broken DMA API. Since I might have very well screwed
> > up the rebase and my DMA API conversion is experimental too I was
> > wondering if you're still working on this and might have a current
> > version I could experiment with?
> 
> Unfortunately I don't have anything more recent to share. I've come
> across some performance issues caused by pathological usage patterns
> in internal usage, but I haven't seen any correctness issues. I'm
> hoping that I'll be able to address the performance issues and send a
> rebased series within the next month or so.
> 
> It's definitely possible that this series has some bugs. I've tested
> it on a range of chromebooks and their various hardware and drivers,
> but that's still all relatively normal x86_64/arm64. If your hardware
> is more particular about its DMA, this series might be missing
> something.
> 
> -David


Hi David,

Thanks for the answer. The only unusual thing about our DMA is that we
only do 64 bit DMA and IOVAs are always >2^32. I don't think I
triggered a bug in your code though, rather I think I made some mistake
in the rebase onto 5.18 as some of the APIs changed a bit. I'm out next
week but may try it again and possibly just test on x86_64 if it
doesn't work on s390. If you have anything new I'd be interested to
hear of course. Also could you say anything more about the pathological
usage patterns?

Thanks,
Niklas


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
  2022-06-03 14:53     ` Niklas Schnelle
@ 2022-06-06  1:24       ` David Stevens
  0 siblings, 0 replies; 19+ messages in thread
From: David Stevens @ 2022-06-06  1:24 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Robin Murphy, linux-kernel, Sergey Senozhatsky, iommu,
	Will Deacon, Christoph Hellwig

On Fri, Jun 3, 2022 at 11:53 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
>
> On Fri, 2022-05-27 at 10:25 +0900, David Stevens wrote:
> > On Tue, May 24, 2022 at 9:27 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > > On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
> > > > From: David Stevens <stevensd@chromium.org>
> > > >
> > > > This patch series adds support for per-domain dynamic pools of iommu
> > > > bounce buffers to the dma-iommu API. This allows iommu mappings to be
> > > > reused while still maintaining strict iommu protection.
> > > >
> > > > This bounce buffer support is used to add a new config option that, when
> > > > enabled, causes all non-direct streaming mappings below a configurable
> > > > size to go through the bounce buffers. This serves as an optimization on
> > > > systems where manipulating iommu mappings is very expensive. For
> > > > example, virtio-iommu operations in a guest on a linux host require a
> > > > vmexit, involvement the VMM, and a VFIO syscall. For relatively small
> > > > DMA operations, memcpy can be significantly faster.
> > > >
> > > > As a performance comparison, on a device with an i5-10210U, I ran fio
> > > > with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
> > > > --rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
> > > > and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> > > > spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> > > > 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> > > > by >99%, as bounce buffers don't require syncing here in the read case.
> > > > Running with multiple jobs doesn't serve as a useful performance
> > > > comparison because virtio-iommu and vfio_iommu_type1 both have big
> > > > locks that significantly limit mulithreaded DMA performance.
> > > >
> > > > These pooled bounce buffers are also used for subgranule mappings with
> > > > untrusted devices, replacing the single use bounce buffers used
> > > > currently. The biggest difference here is that the new implementation
> > > > maps a whole sglist using a single bounce buffer. The new implementation
> > > > does not support using bounce buffers for only some segments of the
> > > > sglist, so it may require more copying. However, the current
> > > > implementation requires per-segment iommu map/unmap operations for all
> > > > untrusted sglist mappings (fully aligned sglists included). On a
> > > > i5-10210U laptop with the internal NVMe drive made to appear untrusted,
> > > > fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
> > > > a statistically significant decrease in CPU load from 2.28% -> 2.17%
> > > > with the new iommu bounce buffer optimization enabled.
> > > >
> > > > Each domain's buffer pool is split into multiple power-of-2 size
> > > > classes. Each class allocates a fixed number of buffer slot metadata. A
> > > > large iova range is allocated, and each slot is assigned an iova from
> > > > the range. This allows the iova to be easily mapped back to the slot,
> > > > and allows the critical section of most pool operations to be constant
> > > > time. The one exception is finding a cached buffer to reuse. These are
> > > > only separated according to R/W permissions - the use of other
> > > > permissions such as IOMMU_PRIV may require a linear search through the
> > > > cache. However, these other permissions are rare and likely exhibit high
> > > > locality, so the should not be a bottleneck in practice.
> > > >
> > > > Since untrusted devices may require bounce buffers, each domain has a
> > > > fallback rbtree to manage single use buffers. This may be necessary if a
> > > > very large number of DMA operations are simultaneously in-flight, or for
> > > > very large individual DMA operations.
> > > >
> > > > This patch set does not use swiotlb. There are two primary ways in which
> > > > swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
> > > > allocates buffers to be compatible with a single device, whereas
> > > > per-domain buffer pools don't handle that during buffer allocation as a
> > > > single buffer may end up being used by multiple devices. Second, swiotlb
> > > > allocation establishes the original to bounce buffer mapping, which
> > > > again doesn't work if buffers can be reused. Effectively the only code
> > > > that can be shared between the two use cases is allocating slots from
> > > > the swiotlb's memory. However, given that we're going to be allocating
> > > > memory for use with an iommu, allocating memory from a block of memory
> > > > explicitly set aside to deal with a lack of iommu seems kind of
> > > > contradictory. At best there might be a small performance improvement if
> > > > wiotlb allocation is faster than regular page allocation, but buffer
> > > > allocation isn't on the hot path anyway.
> > > >
> > > > Not using the swiotlb has the benefit that memory doesn't have to be
> > > > preallocated. Instead, bounce buffers consume memory only for in-flight
> > > > dma transactions (ignoring temporarily cached buffers), which is the
> > > > smallest amount possible. This makes it easier to use bounce buffers as
> > > > an optimization on systems with large numbers of devices or in
> > > > situations where devices are unknown, since it is not necessary to try
> > > > to tune how much memory needs to be set aside to achieve good
> > > > performance without costing too much memory.
> > > >
> > > > Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
> > > > is meant to address devices which create long lived streaming mappings
> > > > but manage CPU cache coherency without using the dma_sync_* APIs.
> > > > Currently, these devices don't function properly with swiotlb=force. The
> > > > new flag is used to bypass bounce buffers so such devices will function
> > > > when the new bounce buffer optimization is enabled. The flag is added to
> > > > the i915 driver, which creates such mappings. It can also be added to
> > > > various dma-buf implementations as an optimization, although that is not
> > > > done here.
> > > >
> > > > v1 -> v2:
> > > >  - Replace existing untrusted bounce buffers with new bounce
> > > >    buffer pools. This includes significant rework to account for
> > > >    untrusted bounce buffers being required instead of an
> > > >    optimization.
> > > >  - Add flag for persistent streaming mappings.
> > > >
> > >
> > > Hi David,
> > >
> > > I'm currently looking into converting s390 from our custom IOMMU based
> > > DMA API implementation to using dma-iommu.c. We're always using an
> > > IOMMU for PCI devices even when doing pass-through to guests (under
> > > both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
> > > we use to do the shadowing of the guest I/O translations, are
> > > relatively expensive I'm thus very interested in your work. I've tried
> > > rebasing it on v5.18 and got it to compile but didn't get DMA to work
> > > though it seems to partially work as I don't get probe failures unlike
> > > with a completely broken DMA API. Since I might have very well screwed
> > > up the rebase and my DMA API conversion is experimental too I was
> > > wondering if you're still working on this and might have a current
> > > version I could experiment with?
> >
> > Unfortunately I don't have anything more recent to share. I've come
> > across some performance issues caused by pathological usage patterns
> > in internal usage, but I haven't seen any correctness issues. I'm
> > hoping that I'll be able to address the performance issues and send a
> > rebased series within the next month or so.
> >
> > It's definitely possible that this series has some bugs. I've tested
> > it on a range of chromebooks and their various hardware and drivers,
> > but that's still all relatively normal x86_64/arm64. If your hardware
> > is more particular about its DMA, this series might be missing
> > something.
> >
> > -David
>
>
> Hi David,
>
> Thanks for the answer. The only unusual thing about our DMA is that we
> only do 64 bit DMA and IOVAs are always >2^32. I don't think I
> triggered a bug in your code though, rather I think I made some mistake
> in the rebase onto 5.18 as some of the APIs changed a bit. I'm out next
> week but may try it again and possibly just test on x86_64 if it
> doesn't work on s390. If you have anything new I'd be interested to
> hear of course. Also could you say anything more about the pathological
> usage patterns?

The problem with this implementation is that if you fall outside the
max number/size for the bounce buffer pools, then DMA performance can
fall off of a cliff. Although those max parameters are tunable, it's
always possible to construct a workload that falls outside of the
expected bounds. I think this can be addressed by adding a fallback
pooling structure. It won't necessarily be as performant as the
primary buffer pools, but I think it should work reasonably well for
usage patterns that fall outside the primary buffer pools.

-David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
  2022-05-27  1:25   ` David Stevens
  2022-06-03 14:53     ` Niklas Schnelle
@ 2022-07-01  9:23     ` Niklas Schnelle
  1 sibling, 0 replies; 19+ messages in thread
From: Niklas Schnelle @ 2022-07-01  9:23 UTC (permalink / raw)
  To: David Stevens
  Cc: Robin Murphy, JasonGunthorpe, linux-kernel, Alex Williamson,
	Sergey Senozhatsky, iommu, Will Deacon, Christoph Hellwig

On Fri, 2022-05-27 at 10:25 +0900, David Stevens wrote:
> On Tue, May 24, 2022 at 9:27 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > > 
> > > This patch series adds support for per-domain dynamic pools of iommu
> > > bounce buffers to the dma-iommu API. This allows iommu mappings to be
> > > reused while still maintaining strict iommu protection.
> > > 
> > > This bounce buffer support is used to add a new config option that, when
> > > enabled, causes all non-direct streaming mappings below a configurable
> > > size to go through the bounce buffers. This serves as an optimization on
> > > systems where manipulating iommu mappings is very expensive. For
> > > example, virtio-iommu operations in a guest on a linux host require a
> > > vmexit, involvement the VMM, and a VFIO syscall. For relatively small
> > > DMA operations, memcpy can be significantly faster.
> > > 
> > > As a performance comparison, on a device with an i5-10210U, I ran fio
> > > with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
> > > --rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
> > > and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> > > spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> > > 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> > > by >99%, as bounce buffers don't require syncing here in the read case.
> > > Running with multiple jobs doesn't serve as a useful performance
> > > comparison because virtio-iommu and vfio_iommu_type1 both have big
> > > locks that significantly limit mulithreaded DMA performance.
> > > 
> > > These pooled bounce buffers are also used for subgranule mappings with
> > > untrusted devices, replacing the single use bounce buffers used
> > > currently. The biggest difference here is that the new implementation
> > > maps a whole sglist using a single bounce buffer. The new implementation
> > > does not support using bounce buffers for only some segments of the
> > > sglist, so it may require more copying. However, the current
> > > implementation requires per-segment iommu map/unmap operations for all
> > > untrusted sglist mappings (fully aligned sglists included). On a
> > > i5-10210U laptop with the internal NVMe drive made to appear untrusted,
> > > fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
> > > a statistically significant decrease in CPU load from 2.28% -> 2.17%
> > > with the new iommu bounce buffer optimization enabled.
> > > 
> > > Each domain's buffer pool is split into multiple power-of-2 size
> > > classes. Each class allocates a fixed number of buffer slot metadata. A
> > > large iova range is allocated, and each slot is assigned an iova from
> > > the range. This allows the iova to be easily mapped back to the slot,
> > > and allows the critical section of most pool operations to be constant
> > > time. The one exception is finding a cached buffer to reuse. These are
> > > only separated according to R/W permissions - the use of other
> > > permissions such as IOMMU_PRIV may require a linear search through the
> > > cache. However, these other permissions are rare and likely exhibit high
> > > locality, so the should not be a bottleneck in practice.
> > > 
> > > Since untrusted devices may require bounce buffers, each domain has a
> > > fallback rbtree to manage single use buffers. This may be necessary if a
> > > very large number of DMA operations are simultaneously in-flight, or for
> > > very large individual DMA operations.
> > > 
> > > This patch set does not use swiotlb. There are two primary ways in which
> > > swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
> > > allocates buffers to be compatible with a single device, whereas
> > > per-domain buffer pools don't handle that during buffer allocation as a
> > > single buffer may end up being used by multiple devices. Second, swiotlb
> > > allocation establishes the original to bounce buffer mapping, which
> > > again doesn't work if buffers can be reused. Effectively the only code
> > > that can be shared between the two use cases is allocating slots from
> > > the swiotlb's memory. However, given that we're going to be allocating
> > > memory for use with an iommu, allocating memory from a block of memory
> > > explicitly set aside to deal with a lack of iommu seems kind of
> > > contradictory. At best there might be a small performance improvement if
> > > wiotlb allocation is faster than regular page allocation, but buffer
> > > allocation isn't on the hot path anyway.
> > > 
> > > Not using the swiotlb has the benefit that memory doesn't have to be
> > > preallocated. Instead, bounce buffers consume memory only for in-flight
> > > dma transactions (ignoring temporarily cached buffers), which is the
> > > smallest amount possible. This makes it easier to use bounce buffers as
> > > an optimization on systems with large numbers of devices or in
> > > situations where devices are unknown, since it is not necessary to try
> > > to tune how much memory needs to be set aside to achieve good
> > > performance without costing too much memory.
> > > 
> > > Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
> > > is meant to address devices which create long lived streaming mappings
> > > but manage CPU cache coherency without using the dma_sync_* APIs.
> > > Currently, these devices don't function properly with swiotlb=force. The
> > > new flag is used to bypass bounce buffers so such devices will function
> > > when the new bounce buffer optimization is enabled. The flag is added to
> > > the i915 driver, which creates such mappings. It can also be added to
> > > various dma-buf implementations as an optimization, although that is not
> > > done here.
> > > 
> > > v1 -> v2:
> > >  - Replace existing untrusted bounce buffers with new bounce
> > >    buffer pools. This includes significant rework to account for
> > >    untrusted bounce buffers being required instead of an
> > >    optimization.
> > >  - Add flag for persistent streaming mappings.
> > > 
> > 
> > Hi David,
> > 
> > I'm currently looking into converting s390 from our custom IOMMU based
> > DMA API implementation to using dma-iommu.c. We're always using an
> > IOMMU for PCI devices even when doing pass-through to guests (under
> > both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
> > we use to do the shadowing of the guest I/O translations, are
> > relatively expensive I'm thus very interested in your work. I've tried
> > rebasing it on v5.18 and got it to compile but didn't get DMA to work
> > though it seems to partially work as I don't get probe failures unlike
> > with a completely broken DMA API. Since I might have very well screwed
> > up the rebase and my DMA API conversion is experimental too I was
> > wondering if you're still working on this and might have a current
> > version I could experiment with?
> 
> Unfortunately I don't have anything more recent to share. I've come
> across some performance issues caused by pathological usage patterns
> in internal usage, but I haven't seen any correctness issues. I'm
> hoping that I'll be able to address the performance issues and send a
> rebased series within the next month or so.
> 
> It's definitely possible that this series has some bugs. I've tested
> it on a range of chromebooks and their various hardware and drivers,
> but that's still all relatively normal x86_64/arm64. If your hardware
> is more particular about its DMA, this series might be missing
> something.
> 
> -David

Hi David,

I finally came around to trying this again. This time I managed to get
it working and figure out what was going wrong. The problem was with
the call to iommu_dma_alloc_iova() in io_buffer_manager_init(). As this
call happens during the IOMMU initialization dma_get_mask(dev) is used
before the driver calls dma_set_mask(_and_coherent)() and is thus still
the default mask of DMA_BIT_MASK(32) instead of what the device really
supports. This breaks s390 because our IOMMU currently only supports
apertures starting at an IOVA >= 2^32. For testing I worked around this
by just passing DMA_BIT_MASK(64) instead but of course that's not a
proper fix. With that in place your patches work on top of my still
experimental conversion to use dma-iommu.c on s390.

I can also already confirm that this gives a similar CPU load
(especially steal time) reduction on our z/VM hypervisor which does I/O
translation table shadowing much like your virtio-iommu test. It also
does help performance of my DMA API rework which sadly still lacks
behind our current s390 DMA API implementation. I suspect that is
because the lazy unmapping used by dma-iommu.c tries to do the
unmapping via a timer in the background while our current approach does
them all at once when wrapping around the IOVA space. The latter I
suspect works better when I/O table shadowing in the hypervisor is
serialized. So to summarize for s390 something like your series would
be of significant interest.

Best regards,
Niklas


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-07-01  9:24 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-06 10:34 [PATCH v2 0/9] Add dynamic iommu backed bounce buffers David Stevens
2021-08-06 10:34 ` [PATCH v2 1/9] Revert "iommu: Allow the dma-iommu api to use bounce buffers" David Stevens
2021-08-06 10:34 ` [PATCH v2 2/9] dma-iommu: expose a few helper functions to module David Stevens
2021-08-06 17:28   ` kernel test robot
2021-08-06 10:34 ` [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices David Stevens
2021-08-06 15:53   ` kernel test robot
2021-08-10  1:19   ` Mi, Dapeng1
2021-08-10  1:41     ` David Stevens
2021-08-06 10:34 ` [PATCH v2 4/9] dma-iommu: remove extra buffer search on unmap David Stevens
2021-08-06 10:34 ` [PATCH v2 5/9] dma-iommu: clear only necessary bytes David Stevens
2021-08-06 10:34 ` [PATCH v2 6/9] dma-iommu: add bounce buffer pools David Stevens
2021-08-06 10:34 ` [PATCH v2 7/9] dma-iommu: support iommu bounce buffer optimization David Stevens
2021-08-06 10:34 ` [PATCH v2 8/9] dma-mapping: add persistent streaming mapping flag David Stevens
2021-08-06 10:34 ` [PATCH v2 9/9] drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag David Stevens
2022-05-24 12:27 ` [PATCH v2 0/9] Add dynamic iommu backed bounce buffers Niklas Schnelle
2022-05-27  1:25   ` David Stevens
2022-06-03 14:53     ` Niklas Schnelle
2022-06-06  1:24       ` David Stevens
2022-07-01  9:23     ` Niklas Schnelle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).