LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH 0/3] iommu/dma-iommu: Support IOMMU page size larger than the CPU page size
@ 2021-08-06 15:55 Sven Peter
  2021-08-06 15:55 ` [RFC PATCH 1/3] iommu: Move IOMMU pagesize check to attach_device Sven Peter
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sven Peter @ 2021-08-06 15:55 UTC (permalink / raw)
  To: iommu
  Cc: Sven Peter, Joerg Roedel, Will Deacon, Robin Murphy,
	Arnd Bergmann, Mohamed Mediouni, Alexander Graf, Hector Martin,
	linux-kernel

Hi,

On the Apple M1 there's this slightly annoying situation where the DART IOMMU
has a hard-wired page size of 16KB. Additionally, the DARTs for some hardware
(USB A ports, WiFi, Ethernet, Thunderbolt PCIe) cannot be switched to bypass
mode and it's also not easily possible to program a software bypass mode.

This is a problem for kernels configured with 4K pages. Unfortunately,
most distributions ship with those by default.

There's not much that can be done for IOMMU_DOMAIN_UNMANAGED domains since
most API clients likely expect to be able to map single CPU pages.

For IOMMU_DOMAIN_DMA domains however, dma-iommu.c is the only code that
uses the raw IOMMU API to manage these domains and can possibly be adapted
to still work correctly.
Essentially, I changed some relevant alignments to happen with respect to both
PAGE_SIZE and iovad->granule. The sglist code also can't use the optimization
for a single IOVA allocation anymore since most phys_addrs will not be aligned
to the IOMMU page size.

I'd like to get some early feedback on this approach to see if it's feasible
to continue working on this or if a different approach will work better or if
this setup just won't be supported.

I'm not very confident I've covered all necessary cases but I'll take
a closer look at every function in dma-iommu.c if there's a chance that
this will be accepted eventually. The current changes are enough to boot
from a USB device and use the Ethernet adapter on my M1 Mini with 4kb pages
though.


One issue I see is that this will end up wasting memory. There's e.g.
dma_pool_*, which will dma_alloc_coherent PAGE_SIZE bytes and stuff the individual
allocations into those buffers. These will get padded to SZ_16K but dma_pool will
be completely unaware that it got 4x as much memory as requested and will leave
it unused :-(

The other issue I'm aware of is v4l2 which expects that a page-aligned sglist
can be represented contiguously in IOVA space [1].


Best,


Sven


[1] https://lore.kernel.org/linux-iommu/0d20bd6b-d0a1-019c-6398-b12f83f4fdf7@arm.com/

Sven Peter (3):
  iommu: Move IOMMU pagesize check to attach_device
  iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  iommu: Introduce __IOMMU_DOMAIN_LARGE_PAGES

 drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
 drivers/iommu/iommu.c     | 36 ++++++++++++++--
 drivers/iommu/iova.c      |  7 ++--
 include/linux/iommu.h     | 14 ++++---
 4 files changed, 123 insertions(+), 21 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 1/3] iommu: Move IOMMU pagesize check to attach_device
  2021-08-06 15:55 [RFC PATCH 0/3] iommu/dma-iommu: Support IOMMU page size larger than the CPU page size Sven Peter
@ 2021-08-06 15:55 ` Sven Peter
  2021-08-06 15:55 ` [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE Sven Peter
  2021-08-06 15:55 ` [RFC PATCH 3/3] iommu: Introduce __IOMMU_DOMAIN_LARGE_PAGES Sven Peter
  2 siblings, 0 replies; 14+ messages in thread
From: Sven Peter @ 2021-08-06 15:55 UTC (permalink / raw)
  To: iommu
  Cc: Sven Peter, Joerg Roedel, Will Deacon, Robin Murphy,
	Arnd Bergmann, Mohamed Mediouni, Alexander Graf, Hector Martin,
	linux-kernel

The iova allocator is capable of handling any granularity which is a power
of two. Remove the much stronger condition that the granularity must be
smaller or equal to the CPU page size from a BUG_ON there.
Instead, check this condition during __iommu_attach_device and fail
gracefully.

Signed-off-by: Sven Peter <sven@svenpeter.dev>
---
 drivers/iommu/iommu.c | 34 +++++++++++++++++++++++++++++++---
 drivers/iommu/iova.c  |  7 ++++---
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1de503ddb343..5854a4ef5681 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -78,6 +78,8 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
 						 unsigned type);
 static int __iommu_attach_device(struct iommu_domain *domain,
 				 struct device *dev);
+static void __iommu_detach_device(struct iommu_domain *domain,
+				  struct device *dev);
 static int __iommu_attach_group(struct iommu_domain *domain,
 				struct iommu_group *group);
 static void __iommu_detach_group(struct iommu_domain *domain,
@@ -1964,6 +1966,18 @@ void iommu_domain_free(struct iommu_domain *domain)
 }
 EXPORT_SYMBOL_GPL(iommu_domain_free);
 
+static int iommu_check_page_size(struct iommu_domain *domain)
+{
+	if (!(domain->type & __IOMMU_DOMAIN_PAGING))
+		return 0;
+
+	if ((1 << __ffs(domain->pgsize_bitmap)) > PAGE_SIZE) {
+		pr_warn("IOMMU page size cannot represent CPU pages.\n");
+		return -EFAULT;
+	}
+
+	return 0;
+}
 static int __iommu_attach_device(struct iommu_domain *domain,
 				 struct device *dev)
 {
@@ -1973,9 +1987,23 @@ static int __iommu_attach_device(struct iommu_domain *domain,
 		return -ENODEV;
 
 	ret = domain->ops->attach_dev(domain, dev);
-	if (!ret)
-		trace_attach_device_to_domain(dev);
-	return ret;
+	if (ret)
+		return ret;
+
+	/*
+	 * Check that CPU pages can be represented by the IOVA granularity.
+	 * This has to be done after ops->attach_dev since many IOMMU drivers
+	 * only limit domain->pgsize_bitmap after having attached the first
+	 * device.
+	 */
+	ret = iommu_check_page_size(domain);
+	if (ret) {
+		__iommu_detach_device(domain, dev);
+		return ret;
+	}
+
+	trace_attach_device_to_domain(dev);
+	return 0;
 }
 
 int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b6cf5f16123b..e0f8adde0f1b 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -50,10 +50,11 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 {
 	/*
 	 * IOVA granularity will normally be equal to the smallest
-	 * supported IOMMU page size; both *must* be capable of
-	 * representing individual CPU pages exactly.
+	 * supported IOMMU page size; while both usually are capable of
+	 * representing individual CPU pages exactly the IOVA allocator
+	 * supports any granularities that are an exact power of two.
 	 */
-	BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
+	BUG_ON(!is_power_of_2(granule));
 
 	spin_lock_init(&iovad->iova_rbtree_lock);
 	iovad->rbroot = RB_ROOT;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-06 15:55 [RFC PATCH 0/3] iommu/dma-iommu: Support IOMMU page size larger than the CPU page size Sven Peter
  2021-08-06 15:55 ` [RFC PATCH 1/3] iommu: Move IOMMU pagesize check to attach_device Sven Peter
@ 2021-08-06 15:55 ` Sven Peter
  2021-08-06 18:04   ` Robin Murphy
  2021-08-06 15:55 ` [RFC PATCH 3/3] iommu: Introduce __IOMMU_DOMAIN_LARGE_PAGES Sven Peter
  2 siblings, 1 reply; 14+ messages in thread
From: Sven Peter @ 2021-08-06 15:55 UTC (permalink / raw)
  To: iommu
  Cc: Sven Peter, Joerg Roedel, Will Deacon, Robin Murphy,
	Arnd Bergmann, Mohamed Mediouni, Alexander Graf, Hector Martin,
	linux-kernel

DMA IOMMU domains can support hardware where the IOMMU page size is
larger than the CPU page size.
Alignments need to be done with respect to both PAGE_SIZE and
iovad->granule. Additionally, the sg list optimization to use a single
IOVA allocation cannot be used in those cases since the physical
addresses will very likely not be aligned to the larger IOMMU page size.

Signed-off-by: Sven Peter <sven@svenpeter.dev>
---
 drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 6f0df629353f..e072d9030d9f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -8,6 +8,7 @@
  * Copyright (C) 2000-2004 Russell King
  */
 
+#include <linux/align.h>
 #include <linux/acpi_iort.h>
 #include <linux/device.h>
 #include <linux/dma-map-ops.h>
@@ -51,6 +52,15 @@ struct iommu_dma_cookie {
 	struct iommu_domain		*fq_domain;
 };
 
+/* aligns size to CPU and IOMMU page size */
+static inline size_t iommu_page_align(struct device *dev, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+
+	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
+}
+
 static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
 bool iommu_dma_forcedac __read_mostly;
 
@@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 /*
  * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
  * but an IOMMU which supports smaller pages might not map the whole thing.
+ * If the IOMMU page size is larger than the CPU page size, then the size
+ * will be aligned to that granularity and some memory will be left unused.
  */
 static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 		size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
@@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
 
 out_unmap:
 	__iommu_dma_unmap(dev, *dma_handle, size);
-	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+	__iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
 	return NULL;
 }
 
@@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
 	struct dma_sgt_handle *sh = sgt_handle(sgt);
 
 	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
-	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+	__iommu_dma_free_pages(sh->pages,
+		iommu_page_align(dev, size) >> PAGE_SHIFT);
 	sg_free_table(&sh->sgt);
 	kfree(sh);
 }
@@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (dev_is_untrusted(dev))
 		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
 
+	/*
+	 * If the IOMMU pagesize is larger than the CPU pagesize we will
+	 * very likely run into sgs with a physical address that is not aligned
+	 * to an IOMMU page boundary. Fall back to just mapping every entry
+	 * independently with __iommu_dma_map then.
+	 */
+	if (iovad->granule > PAGE_SIZE) {
+		for_each_sg(sg, s, nents, i) {
+			sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
+				s->length, prot, dma_get_mask(dev));
+			if (sg_dma_address(s) == DMA_MAPPING_ERROR)
+				break;
+			sg_dma_len(s) = s->length;
+		}
+
+		if (unlikely(i != nents)) {
+			nents = i;
+			for_each_sg(sg, s, nents, i)
+				__iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
+			return 0;
+		}
+
+		return nents;
+	}
+
 	/*
 	 * Work out how much IOVA space we need, and align the segments to
 	 * IOVA granules for the IOMMU driver to handle. With some clever
@@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
 	dma_addr_t start, end;
 	struct scatterlist *tmp;
 	int i;
@@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		return;
 	}
 
+	/*
+	 * If the IOMMU pagesize is larger than the CPU pagesize we mapped
+	 * every entry indepedently with __iommu_dma_map then. Let's do the
+	 * opposite here.
+	 */
+	if (iovad->granule > PAGE_SIZE) {
+		for_each_sg(sg, tmp, nents, i)
+			__iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
+		return;
+	}
+
 	/*
 	 * The scatterlist segments are mapped into a single
 	 * contiguous IOVA allocation, so this is incredibly easy.
@@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
 
 static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
 {
-	size_t alloc_size = PAGE_ALIGN(size);
+	size_t alloc_size = iommu_page_align(dev, size);
 	int count = alloc_size >> PAGE_SHIFT;
 	struct page *page = NULL, **pages = NULL;
 
@@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
 		struct page **pagep, gfp_t gfp, unsigned long attrs)
 {
 	bool coherent = dev_is_dma_coherent(dev);
-	size_t alloc_size = PAGE_ALIGN(size);
+	size_t alloc_size = iommu_page_align(dev, size);
 	int node = dev_to_node(dev);
 	struct page *page = NULL;
 	void *cpu_addr;
@@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
 
 	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
 	    !gfpflags_allow_blocking(gfp) && !coherent)
-		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
-					       gfp, NULL);
+		page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
+					       &cpu_addr, gfp, NULL);
 	else
 		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
 	if (!cpu_addr)
@@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
 		void *cpu_addr, dma_addr_t dma_addr, size_t size,
 		unsigned long attrs)
 {
+	size_t aligned_size = iommu_page_align(dev, size);
 	struct page *page;
 	int ret;
 
@@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
 
 		if (pages) {
 			return sg_alloc_table_from_pages(sgt, pages,
-					PAGE_ALIGN(size) >> PAGE_SHIFT,
+					aligned_size >> PAGE_SHIFT,
 					0, size, GFP_KERNEL);
 		}
 
@@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
 
 	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
 	if (!ret)
-		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
+		sg_set_page(sgt->sgl, page, aligned_size, 0);
 	return ret;
 }
 
@@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
+		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
+{
+	size = iommu_page_align(dev, size);
+	return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
+}
+
+static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
+		dma_addr_t dma_handle, enum dma_data_direction dir)
+{
+	size = iommu_page_align(dev, size);
+	return dma_common_free_pages(dev, size, page, dma_handle, dir);
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
-	.alloc_pages		= dma_common_alloc_pages,
-	.free_pages		= dma_common_free_pages,
+	.alloc_pages		= iommu_dma_alloc_aligned_pages,
+	.free_pages		= iommu_dma_free_aligned_pages,
 #ifdef CONFIG_DMA_REMAP
 	.alloc_noncontiguous	= iommu_dma_alloc_noncontiguous,
 	.free_noncontiguous	= iommu_dma_free_noncontiguous,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 3/3] iommu: Introduce __IOMMU_DOMAIN_LARGE_PAGES
  2021-08-06 15:55 [RFC PATCH 0/3] iommu/dma-iommu: Support IOMMU page size larger than the CPU page size Sven Peter
  2021-08-06 15:55 ` [RFC PATCH 1/3] iommu: Move IOMMU pagesize check to attach_device Sven Peter
  2021-08-06 15:55 ` [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE Sven Peter
@ 2021-08-06 15:55 ` Sven Peter
  2 siblings, 0 replies; 14+ messages in thread
From: Sven Peter @ 2021-08-06 15:55 UTC (permalink / raw)
  To: iommu
  Cc: Sven Peter, Joerg Roedel, Will Deacon, Robin Murphy,
	Arnd Bergmann, Mohamed Mediouni, Alexander Graf, Hector Martin,
	linux-kernel

__IOMMU_DOMAIN_LARGE_PAGES indicates that a domain can handle
conditions where PAGE_SIZE might be smaller than the IOMMU page size.
Always allow attaching devices to such domains and set the flag for
IOMMU_DOMAIN_DMA, which can now handle these situations.

Signed-off-by: Sven Peter <sven@svenpeter.dev>
---
 drivers/iommu/iommu.c |  2 ++
 include/linux/iommu.h | 14 +++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5854a4ef5681..f0bfd76187b1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1970,6 +1970,8 @@ static int iommu_check_page_size(struct iommu_domain *domain)
 {
 	if (!(domain->type & __IOMMU_DOMAIN_PAGING))
 		return 0;
+	if (domain->type & __IOMMU_DOMAIN_LARGE_PAGES)
+		return 0;
 
 	if ((1 << __ffs(domain->pgsize_bitmap)) > PAGE_SIZE) {
 		pr_warn("IOMMU page size cannot represent CPU pages.\n");
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e552ecfefcf7..1f97eac8a4b0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -56,10 +56,13 @@ struct iommu_domain_geometry {
 };
 
 /* Domain feature flags */
-#define __IOMMU_DOMAIN_PAGING	(1U << 0)  /* Support for iommu_map/unmap */
-#define __IOMMU_DOMAIN_DMA_API	(1U << 1)  /* Domain for use in DMA-API
-					      implementation              */
-#define __IOMMU_DOMAIN_PT	(1U << 2)  /* Domain is identity mapped   */
+#define __IOMMU_DOMAIN_PAGING       (1U << 0)  /* Support for iommu_map/unmap */
+#define __IOMMU_DOMAIN_DMA_API      (1U << 1)  /* Domain for use in DMA-API
+						   implementation              */
+#define __IOMMU_DOMAIN_PT           (1U << 2)  /* Domain is identity mapped   */
+#define __IOMMU_DOMAIN_LARGE_PAGES  (1U << 3)  /* Domain can handle IOMMU page
+						  sizes larger than the CPU
+						  page size                   */
 
 /*
  * This are the possible domain-types
@@ -77,7 +80,8 @@ struct iommu_domain_geometry {
 #define IOMMU_DOMAIN_IDENTITY	(__IOMMU_DOMAIN_PT)
 #define IOMMU_DOMAIN_UNMANAGED	(__IOMMU_DOMAIN_PAGING)
 #define IOMMU_DOMAIN_DMA	(__IOMMU_DOMAIN_PAGING |	\
-				 __IOMMU_DOMAIN_DMA_API)
+				 __IOMMU_DOMAIN_DMA_API |       \
+				 __IOMMU_DOMAIN_LARGE_PAGES)
 
 struct iommu_domain {
 	unsigned type;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-06 15:55 ` [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE Sven Peter
@ 2021-08-06 18:04   ` Robin Murphy
  2021-08-07  8:41     ` Sven Peter
  2021-08-07 11:47     ` Sven Peter
  0 siblings, 2 replies; 14+ messages in thread
From: Robin Murphy @ 2021-08-06 18:04 UTC (permalink / raw)
  To: Sven Peter, iommu
  Cc: Arnd Bergmann, Will Deacon, Hector Martin, linux-kernel,
	Alexander Graf, Mohamed Mediouni

On 2021-08-06 16:55, Sven Peter via iommu wrote:
> DMA IOMMU domains can support hardware where the IOMMU page size is
> larger than the CPU page size.
> Alignments need to be done with respect to both PAGE_SIZE and
> iovad->granule. Additionally, the sg list optimization to use a single
> IOVA allocation cannot be used in those cases since the physical
> addresses will very likely not be aligned to the larger IOMMU page size.
> 
> Signed-off-by: Sven Peter <sven@svenpeter.dev>
> ---
>   drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
>   1 file changed, 77 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 6f0df629353f..e072d9030d9f 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -8,6 +8,7 @@
>    * Copyright (C) 2000-2004 Russell King
>    */
>   
> +#include <linux/align.h>
>   #include <linux/acpi_iort.h>
>   #include <linux/device.h>
>   #include <linux/dma-map-ops.h>
> @@ -51,6 +52,15 @@ struct iommu_dma_cookie {
>   	struct iommu_domain		*fq_domain;
>   };
>   
> +/* aligns size to CPU and IOMMU page size */
> +static inline size_t iommu_page_align(struct device *dev, size_t size)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +
> +	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
> +}
> +
>   static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
>   bool iommu_dma_forcedac __read_mostly;
>   
> @@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>   /*
>    * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
>    * but an IOMMU which supports smaller pages might not map the whole thing.
> + * If the IOMMU page size is larger than the CPU page size, then the size
> + * will be aligned to that granularity and some memory will be left unused.

Why do we need to increase the actual memory allocation? The point here 
is that we allocate the smallest thing we can allocate and map the 
smallest thing we can map - I think that still works the "wrong" way 
round too, we should just need to start taking an IOVA offset into 
account as in dma_map_page() if we can no longer assume it's 0 for a CPU 
page. Sure we may expose some unrelated adjacent pages, but we'll 
already be doing that to excess for streaming DMA so whoop de do.

>    */
>   static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
>   		size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
> @@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
>   
>   out_unmap:
>   	__iommu_dma_unmap(dev, *dma_handle, size);
> -	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> +	__iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
>   	return NULL;
>   }
>   
> @@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
>   	struct dma_sgt_handle *sh = sgt_handle(sgt);
>   
>   	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
> -	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> +	__iommu_dma_free_pages(sh->pages,
> +		iommu_page_align(dev, size) >> PAGE_SHIFT);
>   	sg_free_table(&sh->sgt);
>   	kfree(sh);
>   }
> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   	if (dev_is_untrusted(dev))
>   		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
>   
> +	/*
> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> +	 * very likely run into sgs with a physical address that is not aligned
> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> +	 * independently with __iommu_dma_map then.

Scatterlist segments often don't have nicely aligned ends, which is why 
we already align things to IOVA granules in main loop here. I think in 
principle we'd just need to move the non-IOVA-aligned part of the 
address from sg->page to sg->offset in the temporary transformation for 
the rest of the assumptions to hold. I don't blame you for being timid 
about touching that, though - it took me 3 tries to get right when I 
first wrote it...

> +	 */
> +	if (iovad->granule > PAGE_SIZE) {
> +		for_each_sg(sg, s, nents, i) {
> +			sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
> +				s->length, prot, dma_get_mask(dev));
> +			if (sg_dma_address(s) == DMA_MAPPING_ERROR)
> +				break;
> +			sg_dma_len(s) = s->length;
> +		}
> +
> +		if (unlikely(i != nents)) {
> +			nents = i;
> +			for_each_sg(sg, s, nents, i)
> +				__iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
> +			return 0;
> +		}
> +
> +		return nents;
> +	}

Either way, NAK to having a *third* implementation of SG mapping in this 
file which is fundamentally no different from the second one.

> +
>   	/*
>   	 * Work out how much IOVA space we need, and align the segments to
>   	 * IOVA granules for the IOMMU driver to handle. With some clever
> @@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>   		int nents, enum dma_data_direction dir, unsigned long attrs)
>   {
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
>   	dma_addr_t start, end;
>   	struct scatterlist *tmp;
>   	int i;
> @@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>   		return;
>   	}
>   
> +	/*
> +	 * If the IOMMU pagesize is larger than the CPU pagesize we mapped
> +	 * every entry indepedently with __iommu_dma_map then. Let's do the
> +	 * opposite here.
> +	 */
> +	if (iovad->granule > PAGE_SIZE) {
> +		for_each_sg(sg, tmp, nents, i)
> +			__iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
> +		return;
> +	}

As above, this is just __iommu_dma_unmap_sg_swiotlb() with fewer clothes on.

> +
>   	/*
>   	 * The scatterlist segments are mapped into a single
>   	 * contiguous IOVA allocation, so this is incredibly easy.
> @@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
>   
>   static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
>   {
> -	size_t alloc_size = PAGE_ALIGN(size);
> +	size_t alloc_size = iommu_page_align(dev, size);
>   	int count = alloc_size >> PAGE_SHIFT;
>   	struct page *page = NULL, **pages = NULL;
>   
> @@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
>   		struct page **pagep, gfp_t gfp, unsigned long attrs)
>   {
>   	bool coherent = dev_is_dma_coherent(dev);
> -	size_t alloc_size = PAGE_ALIGN(size);
> +	size_t alloc_size = iommu_page_align(dev, size);
>   	int node = dev_to_node(dev);
>   	struct page *page = NULL;
>   	void *cpu_addr;
> @@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
>   
>   	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
>   	    !gfpflags_allow_blocking(gfp) && !coherent)
> -		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> -					       gfp, NULL);
> +		page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
> +					       &cpu_addr, gfp, NULL);
>   	else
>   		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
>   	if (!cpu_addr)
> @@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
>   		void *cpu_addr, dma_addr_t dma_addr, size_t size,
>   		unsigned long attrs)

Can we just not bother trying to support this? TBH I don't know exactly 
how the interface is supposed to work - what you're doing here looks 
like it's probably either too much or not enough, depending on whether 
the address and size arguments are supposed to allow representing 
partial buffers - and frankly I can't imagine you'll be needing to 
support dma-buf exports from the USB/ethernet/wifi drivers any time soon...

>   {
> +	size_t aligned_size = iommu_page_align(dev, size);
>   	struct page *page;
>   	int ret;
>   
> @@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
>   
>   		if (pages) {
>   			return sg_alloc_table_from_pages(sgt, pages,
> -					PAGE_ALIGN(size) >> PAGE_SHIFT,
> +					aligned_size >> PAGE_SHIFT,
>   					0, size, GFP_KERNEL);
>   		}
>   
> @@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
>   
>   	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
>   	if (!ret)
> -		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
> +		sg_set_page(sgt->sgl, page, aligned_size, 0);
>   	return ret;
>   }
>   
> @@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
>   	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
>   }
>   
> +static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
> +		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
> +{
> +	size = iommu_page_align(dev, size);
> +	return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
> +}
> +
> +static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
> +		dma_addr_t dma_handle, enum dma_data_direction dir)
> +{
> +	size = iommu_page_align(dev, size);
> +	return dma_common_free_pages(dev, size, page, dma_handle, dir);
> +}

Again, what's the point of these? iommu_dma_map_page() still has to cope 
with whatever the caller provides, so there's no difference in the one 
case when that caller happens to be dma_common_map_pages().

Robin.

> +
>   static const struct dma_map_ops iommu_dma_ops = {
>   	.alloc			= iommu_dma_alloc,
>   	.free			= iommu_dma_free,
> -	.alloc_pages		= dma_common_alloc_pages,
> -	.free_pages		= dma_common_free_pages,
> +	.alloc_pages		= iommu_dma_alloc_aligned_pages,
> +	.free_pages		= iommu_dma_free_aligned_pages,
>   #ifdef CONFIG_DMA_REMAP
>   	.alloc_noncontiguous	= iommu_dma_alloc_noncontiguous,
>   	.free_noncontiguous	= iommu_dma_free_noncontiguous,
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-06 18:04   ` Robin Murphy
@ 2021-08-07  8:41     ` Sven Peter
  2021-08-09 18:37       ` Robin Murphy
  2021-08-07 11:47     ` Sven Peter
  1 sibling, 1 reply; 14+ messages in thread
From: Sven Peter @ 2021-08-07  8:41 UTC (permalink / raw)
  To: Robin Murphy, iommu
  Cc: Arnd Bergmann, Will Deacon, Hector Martin, linux-kernel,
	Alexander Graf, Mohamed Mediouni

Hi,

Thanks a lot for quick reply!

On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
> On 2021-08-06 16:55, Sven Peter via iommu wrote:
> > DMA IOMMU domains can support hardware where the IOMMU page size is
> > larger than the CPU page size.
> > Alignments need to be done with respect to both PAGE_SIZE and
> > iovad->granule. Additionally, the sg list optimization to use a single
> > IOVA allocation cannot be used in those cases since the physical
> > addresses will very likely not be aligned to the larger IOMMU page size.
> > 
> > Signed-off-by: Sven Peter <sven@svenpeter.dev>
> > ---
> >   drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
> >   1 file changed, 77 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 6f0df629353f..e072d9030d9f 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -8,6 +8,7 @@
> >    * Copyright (C) 2000-2004 Russell King
> >    */
> >   
> > +#include <linux/align.h>
> >   #include <linux/acpi_iort.h>
> >   #include <linux/device.h>
> >   #include <linux/dma-map-ops.h>
> > @@ -51,6 +52,15 @@ struct iommu_dma_cookie {
> >   	struct iommu_domain		*fq_domain;
> >   };
> >   
> > +/* aligns size to CPU and IOMMU page size */
> > +static inline size_t iommu_page_align(struct device *dev, size_t size)
> > +{
> > +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > +
> > +	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
> > +}
> > +
> >   static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
> >   bool iommu_dma_forcedac __read_mostly;
> >   
> > @@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
> >   /*
> >    * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
> >    * but an IOMMU which supports smaller pages might not map the whole thing.
> > + * If the IOMMU page size is larger than the CPU page size, then the size
> > + * will be aligned to that granularity and some memory will be left unused.
> 
> Why do we need to increase the actual memory allocation? The point here 
> is that we allocate the smallest thing we can allocate and map the 
> smallest thing we can map - I think that still works the "wrong" way 
> round too, we should just need to start taking an IOVA offset into 
> account as in dma_map_page() if we can no longer assume it's 0 for a CPU 
> page. Sure we may expose some unrelated adjacent pages, but we'll 
> already be doing that to excess for streaming DMA so whoop de do.

I agree for trusted devices, but untrusted ones (Thunderbolt, and depending on your
risk tolerance possibly even the broadcom wifi) might also end up calling this.
For streaming DMA swiotlb will make sure that these won't see memory
they're not supposed to access.
But, at least as far as I understand it, no swiotlb is in the way to catch devices
who end up calling this function. That wasn't required because we used to get
PAGE_SIZE aligned allocation here and every IOMMU so far would be able to easily
map them without any spill overs.
But now we'll end up exposing three more unrelated pages if the allocation
is not increased.


> 
> >    */
> >   static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
> >   		size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
> > @@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
> >   
> >   out_unmap:
> >   	__iommu_dma_unmap(dev, *dma_handle, size);
> > -	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> > +	__iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
> >   	return NULL;
> >   }
> >   
> > @@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
> >   	struct dma_sgt_handle *sh = sgt_handle(sgt);
> >   
> >   	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
> > -	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> > +	__iommu_dma_free_pages(sh->pages,
> > +		iommu_page_align(dev, size) >> PAGE_SHIFT);
> >   	sg_free_table(&sh->sgt);
> >   	kfree(sh);
> >   }
> > @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >   	if (dev_is_untrusted(dev))
> >   		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
> >   
> > +	/*
> > +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> > +	 * very likely run into sgs with a physical address that is not aligned
> > +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> > +	 * independently with __iommu_dma_map then.
> 
> Scatterlist segments often don't have nicely aligned ends, which is why 
> we already align things to IOVA granules in main loop here. I think in 
> principle we'd just need to move the non-IOVA-aligned part of the 
> address from sg->page to sg->offset in the temporary transformation for 
> the rest of the assumptions to hold. I don't blame you for being timid 
> about touching that, though - it took me 3 tries to get right when I 
> first wrote it...

I was a little bit afraid I'd get this exact reply :D
I'll try to modify the transformation again, but I'm sure it'll take me more than
3 tries to get it right :)

> 
> > +	 */
> > +	if (iovad->granule > PAGE_SIZE) {
> > +		for_each_sg(sg, s, nents, i) {
> > +			sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
> > +				s->length, prot, dma_get_mask(dev));
> > +			if (sg_dma_address(s) == DMA_MAPPING_ERROR)
> > +				break;
> > +			sg_dma_len(s) = s->length;
> > +		}
> > +
> > +		if (unlikely(i != nents)) {
> > +			nents = i;
> > +			for_each_sg(sg, s, nents, i)
> > +				__iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
> > +			return 0;
> > +		}
> > +
> > +		return nents;
> > +	}
> 
> Either way, NAK to having a *third* implementation of SG mapping in this 
> file which is fundamentally no different from the second one.

Good point. If for some reason I'm not able to modify the transformation correctly
I'll just fall back to iommu_dma_map_sg_swiotlb.

> 
> > +
> >   	/*
> >   	 * Work out how much IOVA space we need, and align the segments to
> >   	 * IOVA granules for the IOMMU driver to handle. With some clever
> > @@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >   static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> >   		int nents, enum dma_data_direction dir, unsigned long attrs)
> >   {
> > +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > +	struct iova_domain *iovad = &cookie->iovad;
> >   	dma_addr_t start, end;
> >   	struct scatterlist *tmp;
> >   	int i;
> > @@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> >   		return;
> >   	}
> >   
> > +	/*
> > +	 * If the IOMMU pagesize is larger than the CPU pagesize we mapped
> > +	 * every entry indepedently with __iommu_dma_map then. Let's do the
> > +	 * opposite here.
> > +	 */
> > +	if (iovad->granule > PAGE_SIZE) {
> > +		for_each_sg(sg, tmp, nents, i)
> > +			__iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
> > +		return;
> > +	}
> 
> As above, this is just __iommu_dma_unmap_sg_swiotlb() with fewer clothes on.
> 
> > +
> >   	/*
> >   	 * The scatterlist segments are mapped into a single
> >   	 * contiguous IOVA allocation, so this is incredibly easy.
> > @@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
> >   
> >   static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
> >   {
> > -	size_t alloc_size = PAGE_ALIGN(size);
> > +	size_t alloc_size = iommu_page_align(dev, size);
> >   	int count = alloc_size >> PAGE_SHIFT;
> >   	struct page *page = NULL, **pages = NULL;
> >   
> > @@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
> >   		struct page **pagep, gfp_t gfp, unsigned long attrs)
> >   {
> >   	bool coherent = dev_is_dma_coherent(dev);
> > -	size_t alloc_size = PAGE_ALIGN(size);
> > +	size_t alloc_size = iommu_page_align(dev, size);
> >   	int node = dev_to_node(dev);
> >   	struct page *page = NULL;
> >   	void *cpu_addr;
> > @@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
> >   
> >   	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> >   	    !gfpflags_allow_blocking(gfp) && !coherent)
> > -		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> > -					       gfp, NULL);
> > +		page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
> > +					       &cpu_addr, gfp, NULL);
> >   	else
> >   		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
> >   	if (!cpu_addr)
> > @@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >   		void *cpu_addr, dma_addr_t dma_addr, size_t size,
> >   		unsigned long attrs)
> 
> Can we just not bother trying to support this? TBH I don't know exactly 
> how the interface is supposed to work - what you're doing here looks 
> like it's probably either too much or not enough, depending on whether 
> the address and size arguments are supposed to allow representing 
> partial buffers - and frankly I can't imagine you'll be needing to 
> support dma-buf exports from the USB/ethernet/wifi drivers any time soon...

I'm not really sure how this is going to work even before my changes.
Just from reading the code it looks like even then it might be doing too much
or too little.
But this will also be used for Thunderbolt and who knows what kind of devices
will be connected there. I'm fine with just not supporting this interface unless
something actually breaks for some user though.

> 
> >   {
> > +	size_t aligned_size = iommu_page_align(dev, size);
> >   	struct page *page;
> >   	int ret;
> >   
> > @@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >   
> >   		if (pages) {
> >   			return sg_alloc_table_from_pages(sgt, pages,
> > -					PAGE_ALIGN(size) >> PAGE_SHIFT,
> > +					aligned_size >> PAGE_SHIFT,
> >   					0, size, GFP_KERNEL);
> >   		}
> >   
> > @@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >   
> >   	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
> >   	if (!ret)
> > -		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
> > +		sg_set_page(sgt->sgl, page, aligned_size, 0);
> >   	return ret;
> >   }
> >   
> > @@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
> >   	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
> >   }
> >   
> > +static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
> > +		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
> > +{
> > +	size = iommu_page_align(dev, size);
> > +	return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
> > +}
> > +
> > +static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
> > +		dma_addr_t dma_handle, enum dma_data_direction dir)
> > +{
> > +	size = iommu_page_align(dev, size);
> > +	return dma_common_free_pages(dev, size, page, dma_handle, dir);
> > +}
> 
> Again, what's the point of these? iommu_dma_map_page() still has to cope 
> with whatever the caller provides, so there's no difference in the one 
> case when that caller happens to be dma_common_map_pages().

Same as above, untrusted devices.


Thanks,


Sven



-- 
Sven Peter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-06 18:04   ` Robin Murphy
  2021-08-07  8:41     ` Sven Peter
@ 2021-08-07 11:47     ` Sven Peter
  2021-08-09 17:41       ` Robin Murphy
  1 sibling, 1 reply; 14+ messages in thread
From: Sven Peter @ 2021-08-07 11:47 UTC (permalink / raw)
  To: Robin Murphy, Sven Peter
  Cc: Arnd Bergmann, Will Deacon, Hector Martin, linux-kernel,
	Alexander Graf, Mohamed Mediouni



On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
> On 2021-08-06 16:55, Sven Peter via iommu wrote:
> > @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >   	if (dev_is_untrusted(dev))
> >   		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
> >   
> > +	/*
> > +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> > +	 * very likely run into sgs with a physical address that is not aligned
> > +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> > +	 * independently with __iommu_dma_map then.
> 
> Scatterlist segments often don't have nicely aligned ends, which is why 
> we already align things to IOVA granules in main loop here. I think in 
> principle we'd just need to move the non-IOVA-aligned part of the 
> address from sg->page to sg->offset in the temporary transformation for 
> the rest of the assumptions to hold. I don't blame you for being timid 
> about touching that, though - it took me 3 tries to get right when I 
> first wrote it...
> 


I've spent some time with that code now and I think we cannot use it
but have to fall back to iommu_dma_map_sg_swiotlb (even though that swiotlb
part is a lie then):

When we have sg_phys(s) = 0x802e65000 with s->offset = 0 the paddr
is aligned to PAGE_SIZE but has an offset of 0x1000 from something
the IOMMU can map.
Now this would result in s->offset = -0x1000 which is already weird
enough.
Offset is unsigned (and 32bit) so this will actually look like
s->offset = 0xfffff000 then, which isn't much better.
And then sg_phys(s) = 0x902e64000 (instead of 0x802e64000) and
we'll map some random memory in iommu_map_sg_atomic and a little bit later
everything explodes.

Now I could probably adjust the phys addr backwards and make sure offset is
always positive (and possibly larger than PAGE_SIZE) and later restore it
in __finalise_sg then but I feel like that's pushing this a little bit too far.



Sven

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-07 11:47     ` Sven Peter
@ 2021-08-09 17:41       ` Robin Murphy
  2021-08-09 20:45         ` Sven Peter
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Murphy @ 2021-08-09 17:41 UTC (permalink / raw)
  To: Sven Peter, Sven Peter
  Cc: Arnd Bergmann, Hector Martin, linux-kernel, Alexander Graf,
	Mohamed Mediouni, Will Deacon

On 2021-08-07 12:47, Sven Peter via iommu wrote:
> 
> 
> On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
>> On 2021-08-06 16:55, Sven Peter via iommu wrote:
>>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>    	if (dev_is_untrusted(dev))
>>>    		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
>>>    
>>> +	/*
>>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
>>> +	 * very likely run into sgs with a physical address that is not aligned
>>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
>>> +	 * independently with __iommu_dma_map then.
>>
>> Scatterlist segments often don't have nicely aligned ends, which is why
>> we already align things to IOVA granules in main loop here. I think in
>> principle we'd just need to move the non-IOVA-aligned part of the
>> address from sg->page to sg->offset in the temporary transformation for
>> the rest of the assumptions to hold. I don't blame you for being timid
>> about touching that, though - it took me 3 tries to get right when I
>> first wrote it...
>>
> 
> 
> I've spent some time with that code now and I think we cannot use it
> but have to fall back to iommu_dma_map_sg_swiotlb (even though that swiotlb
> part is a lie then):
> 
> When we have sg_phys(s) = 0x802e65000 with s->offset = 0 the paddr
> is aligned to PAGE_SIZE but has an offset of 0x1000 from something
> the IOMMU can map.
> Now this would result in s->offset = -0x1000 which is already weird
> enough.
> Offset is unsigned (and 32bit) so this will actually look like
> s->offset = 0xfffff000 then, which isn't much better.
> And then sg_phys(s) = 0x902e64000 (instead of 0x802e64000) and
> we'll map some random memory in iommu_map_sg_atomic and a little bit later
> everything explodes.
> 
> Now I could probably adjust the phys addr backwards and make sure offset is
> always positive (and possibly larger than PAGE_SIZE) and later restore it
> in __finalise_sg then but I feel like that's pushing this a little bit too far.

Yes, that's what I meant. At a quick guess, something like the
completely untested diff below. It really comes down to what we want to
achieve here - if it's just to make this thing work at all, then I'd
favour bolting on the absolute minimum changes, possibly even cheating
by tainting the kernel and saying all bets are off instead of trying to
handle the more involved corners really properly. However if you want to
work towards this being a properly-supported thing, then I think it's
worth generalising the existing assumptions of page alignment from the
beginning.

Robin.

----->8-----
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 58cc7297aab5..73aeaa8bfc73 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -895,8 +895,8 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
  		unsigned int s_length = sg_dma_len(s);
  		unsigned int s_iova_len = s->length;
  
-		s->offset += s_iova_off;
-		s->length = s_length;
+		sg_set_page(s, phys_to_page(sg_phys(s)), s_length,
+			    s_iova_off & ~PAGE_MASK);
  		sg_dma_address(s) = DMA_MAPPING_ERROR;
  		sg_dma_len(s) = 0;
  
@@ -940,10 +940,9 @@ static void __invalidate_sg(struct scatterlist *sg, int nents)
  	int i;
  
  	for_each_sg(sg, s, nents, i) {
-		if (sg_dma_address(s) != DMA_MAPPING_ERROR)
-			s->offset += sg_dma_address(s);
  		if (sg_dma_len(s))
-			s->length = sg_dma_len(s);
+			sg_set_page(s, phys_to_page(sg_phys(s)), sg_dma_len(s),
+				    sg_dma_address(s) & ~PAGE_MASK);
  		sg_dma_address(s) = DMA_MAPPING_ERROR;
  		sg_dma_len(s) = 0;
  	}
@@ -1019,15 +1018,16 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
  	 * stashing the unaligned parts in the as-yet-unused DMA fields.
  	 */
  	for_each_sg(sg, s, nents, i) {
-		size_t s_iova_off = iova_offset(iovad, s->offset);
+		phys_addr_t s_phys = sg_phys(s);
+		size_t s_iova_off = iova_offset(iovad, s_phys);
  		size_t s_length = s->length;
  		size_t pad_len = (mask - iova_len + 1) & mask;
  
  		sg_dma_address(s) = s_iova_off;
  		sg_dma_len(s) = s_length;
-		s->offset -= s_iova_off;
  		s_length = iova_align(iovad, s_length + s_iova_off);
-		s->length = s_length;
+		sg_set_page(s, phys_to_page(s_phys - s_iova_off),
+			    s_length, s->offset ^ s_iova_off);
  
  		/*
  		 * Due to the alignment of our single IOVA allocation, we can

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-07  8:41     ` Sven Peter
@ 2021-08-09 18:37       ` Robin Murphy
  2021-08-09 19:57         ` Sven Peter
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Murphy @ 2021-08-09 18:37 UTC (permalink / raw)
  To: Sven Peter, iommu
  Cc: Arnd Bergmann, Will Deacon, Hector Martin, linux-kernel,
	Alexander Graf, Mohamed Mediouni

On 2021-08-07 09:41, Sven Peter wrote:
> Hi,
> 
> Thanks a lot for quick reply!
> 
> On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
>> On 2021-08-06 16:55, Sven Peter via iommu wrote:
>>> DMA IOMMU domains can support hardware where the IOMMU page size is
>>> larger than the CPU page size.
>>> Alignments need to be done with respect to both PAGE_SIZE and
>>> iovad->granule. Additionally, the sg list optimization to use a single
>>> IOVA allocation cannot be used in those cases since the physical
>>> addresses will very likely not be aligned to the larger IOMMU page size.
>>>
>>> Signed-off-by: Sven Peter <sven@svenpeter.dev>
>>> ---
>>>    drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
>>>    1 file changed, 77 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index 6f0df629353f..e072d9030d9f 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -8,6 +8,7 @@
>>>     * Copyright (C) 2000-2004 Russell King
>>>     */
>>>    
>>> +#include <linux/align.h>
>>>    #include <linux/acpi_iort.h>
>>>    #include <linux/device.h>
>>>    #include <linux/dma-map-ops.h>
>>> @@ -51,6 +52,15 @@ struct iommu_dma_cookie {
>>>    	struct iommu_domain		*fq_domain;
>>>    };
>>>    
>>> +/* aligns size to CPU and IOMMU page size */
>>> +static inline size_t iommu_page_align(struct device *dev, size_t size)
>>> +{
>>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>> +
>>> +	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
>>> +}
>>> +
>>>    static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
>>>    bool iommu_dma_forcedac __read_mostly;
>>>    
>>> @@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>>>    /*
>>>     * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
>>>     * but an IOMMU which supports smaller pages might not map the whole thing.
>>> + * If the IOMMU page size is larger than the CPU page size, then the size
>>> + * will be aligned to that granularity and some memory will be left unused.
>>
>> Why do we need to increase the actual memory allocation? The point here
>> is that we allocate the smallest thing we can allocate and map the
>> smallest thing we can map - I think that still works the "wrong" way
>> round too, we should just need to start taking an IOVA offset into
>> account as in dma_map_page() if we can no longer assume it's 0 for a CPU
>> page. Sure we may expose some unrelated adjacent pages, but we'll
>> already be doing that to excess for streaming DMA so whoop de do.
> 
> I agree for trusted devices, but untrusted ones (Thunderbolt, and depending on your
> risk tolerance possibly even the broadcom wifi) might also end up calling this.

Oh, right, I hadn't considered actual untrusted device support at this 
stage.

> For streaming DMA swiotlb will make sure that these won't see memory
> they're not supposed to access.

I was slightly surprised to see that that does appear to work out OK, 
but I guess SWIOTLB slots are already smaller than typical IOMMU pages, 
so it falls out of that. Neat.

> But, at least as far as I understand it, no swiotlb is in the way to catch devices
> who end up calling this function. That wasn't required because we used to get
> PAGE_SIZE aligned allocation here and every IOMMU so far would be able to easily
> map them without any spill overs.
> But now we'll end up exposing three more unrelated pages if the allocation
> is not increased.

Fair enough, but then why still waste memory in the (usual) cases where 
it logically *isn't* necessary?

>>>     */
>>>    static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
>>>    		size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
>>> @@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
>>>    
>>>    out_unmap:
>>>    	__iommu_dma_unmap(dev, *dma_handle, size);
>>> -	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
>>> +	__iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
>>>    	return NULL;
>>>    }
>>>    
>>> @@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
>>>    	struct dma_sgt_handle *sh = sgt_handle(sgt);
>>>    
>>>    	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
>>> -	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
>>> +	__iommu_dma_free_pages(sh->pages,
>>> +		iommu_page_align(dev, size) >> PAGE_SHIFT);
>>>    	sg_free_table(&sh->sgt);
>>>    	kfree(sh);
>>>    }
>>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>    	if (dev_is_untrusted(dev))
>>>    		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
>>>    
>>> +	/*
>>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
>>> +	 * very likely run into sgs with a physical address that is not aligned
>>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
>>> +	 * independently with __iommu_dma_map then.
>>
>> Scatterlist segments often don't have nicely aligned ends, which is why
>> we already align things to IOVA granules in main loop here. I think in
>> principle we'd just need to move the non-IOVA-aligned part of the
>> address from sg->page to sg->offset in the temporary transformation for
>> the rest of the assumptions to hold. I don't blame you for being timid
>> about touching that, though - it took me 3 tries to get right when I
>> first wrote it...
> 
> I was a little bit afraid I'd get this exact reply :D
> I'll try to modify the transformation again, but I'm sure it'll take me more than
> 3 tries to get it right :)
> 
>>
>>> +	 */
>>> +	if (iovad->granule > PAGE_SIZE) {
>>> +		for_each_sg(sg, s, nents, i) {
>>> +			sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
>>> +				s->length, prot, dma_get_mask(dev));
>>> +			if (sg_dma_address(s) == DMA_MAPPING_ERROR)
>>> +				break;
>>> +			sg_dma_len(s) = s->length;
>>> +		}
>>> +
>>> +		if (unlikely(i != nents)) {
>>> +			nents = i;
>>> +			for_each_sg(sg, s, nents, i)
>>> +				__iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
>>> +			return 0;
>>> +		}
>>> +
>>> +		return nents;
>>> +	}
>>
>> Either way, NAK to having a *third* implementation of SG mapping in this
>> file which is fundamentally no different from the second one.
> 
> Good point. If for some reason I'm not able to modify the transformation correctly
> I'll just fall back to iommu_dma_map_sg_swiotlb.
> 
>>
>>> +
>>>    	/*
>>>    	 * Work out how much IOVA space we need, and align the segments to
>>>    	 * IOVA granules for the IOMMU driver to handle. With some clever
>>> @@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>    static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>>>    		int nents, enum dma_data_direction dir, unsigned long attrs)
>>>    {
>>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>> +	struct iova_domain *iovad = &cookie->iovad;
>>>    	dma_addr_t start, end;
>>>    	struct scatterlist *tmp;
>>>    	int i;
>>> @@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>>>    		return;
>>>    	}
>>>    
>>> +	/*
>>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we mapped
>>> +	 * every entry indepedently with __iommu_dma_map then. Let's do the
>>> +	 * opposite here.
>>> +	 */
>>> +	if (iovad->granule > PAGE_SIZE) {
>>> +		for_each_sg(sg, tmp, nents, i)
>>> +			__iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
>>> +		return;
>>> +	}
>>
>> As above, this is just __iommu_dma_unmap_sg_swiotlb() with fewer clothes on.
>>
>>> +
>>>    	/*
>>>    	 * The scatterlist segments are mapped into a single
>>>    	 * contiguous IOVA allocation, so this is incredibly easy.
>>> @@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
>>>    
>>>    static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
>>>    {
>>> -	size_t alloc_size = PAGE_ALIGN(size);
>>> +	size_t alloc_size = iommu_page_align(dev, size);
>>>    	int count = alloc_size >> PAGE_SHIFT;
>>>    	struct page *page = NULL, **pages = NULL;
>>>    
>>> @@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
>>>    		struct page **pagep, gfp_t gfp, unsigned long attrs)
>>>    {
>>>    	bool coherent = dev_is_dma_coherent(dev);
>>> -	size_t alloc_size = PAGE_ALIGN(size);
>>> +	size_t alloc_size = iommu_page_align(dev, size);
>>>    	int node = dev_to_node(dev);
>>>    	struct page *page = NULL;
>>>    	void *cpu_addr;
>>> @@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
>>>    
>>>    	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
>>>    	    !gfpflags_allow_blocking(gfp) && !coherent)
>>> -		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
>>> -					       gfp, NULL);
>>> +		page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
>>> +					       &cpu_addr, gfp, NULL);
>>>    	else
>>>    		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
>>>    	if (!cpu_addr)
>>> @@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
>>>    		void *cpu_addr, dma_addr_t dma_addr, size_t size,
>>>    		unsigned long attrs)
>>
>> Can we just not bother trying to support this? TBH I don't know exactly
>> how the interface is supposed to work - what you're doing here looks
>> like it's probably either too much or not enough, depending on whether
>> the address and size arguments are supposed to allow representing
>> partial buffers - and frankly I can't imagine you'll be needing to
>> support dma-buf exports from the USB/ethernet/wifi drivers any time soon...
> 
> I'm not really sure how this is going to work even before my changes.
> Just from reading the code it looks like even then it might be doing too much
> or too little.
> But this will also be used for Thunderbolt and who knows what kind of devices
> will be connected there. I'm fine with just not supporting this interface unless
> something actually breaks for some user though.

I would bet that the set of random Thunderbolt-attachable devices which 
participate in dma-buf exports and the set of media devices which expect 
vb2_dma_contig to work for arbitrary user buffers have a non-empty 
intersection. If you eat your cake you may subsequently discover that 
you no longer have your cake ;)

If we can't easily test it or reason about it, let's not pretend to 
implement it. I'd rather somebody discovered the lack of working support 
in a few years' time because it reliably and safely returned an error, 
rather than because it ate their page cache. Besides, it's not like 
people can't use a kernel built with the right PAGE_SIZE (which might 
perform better anyway) and not have the problem in the first place.

>>>    {
>>> +	size_t aligned_size = iommu_page_align(dev, size);
>>>    	struct page *page;
>>>    	int ret;
>>>    
>>> @@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
>>>    
>>>    		if (pages) {
>>>    			return sg_alloc_table_from_pages(sgt, pages,
>>> -					PAGE_ALIGN(size) >> PAGE_SHIFT,
>>> +					aligned_size >> PAGE_SHIFT,
>>>    					0, size, GFP_KERNEL);
>>>    		}
>>>    
>>> @@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
>>>    
>>>    	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
>>>    	if (!ret)
>>> -		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
>>> +		sg_set_page(sgt->sgl, page, aligned_size, 0);
>>>    	return ret;
>>>    }
>>>    
>>> @@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
>>>    	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
>>>    }
>>>    
>>> +static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
>>> +		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
>>> +{
>>> +	size = iommu_page_align(dev, size);
>>> +	return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
>>> +}
>>> +
>>> +static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
>>> +		dma_addr_t dma_handle, enum dma_data_direction dir)
>>> +{
>>> +	size = iommu_page_align(dev, size);
>>> +	return dma_common_free_pages(dev, size, page, dma_handle, dir);
>>> +}
>>
>> Again, what's the point of these? iommu_dma_map_page() still has to cope
>> with whatever the caller provides, so there's no difference in the one
>> case when that caller happens to be dma_common_map_pages().
> 
> Same as above, untrusted devices.

Again fair enough, but in that case do it for untrusted devices. Not for 
the whole world for most of whom it still *is* a needless waste.

Robin.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-09 18:37       ` Robin Murphy
@ 2021-08-09 19:57         ` Sven Peter
  0 siblings, 0 replies; 14+ messages in thread
From: Sven Peter @ 2021-08-09 19:57 UTC (permalink / raw)
  To: Robin Murphy, iommu
  Cc: Arnd Bergmann, Will Deacon, Hector Martin, linux-kernel,
	Alexander Graf, Mohamed Mediouni

Hi,

On Mon, Aug 9, 2021, at 20:37, Robin Murphy wrote:
> On 2021-08-07 09:41, Sven Peter wrote:
> > Hi,
> > 
> > Thanks a lot for quick reply!
> > 
> > On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
> >> On 2021-08-06 16:55, Sven Peter via iommu wrote:
> >>> DMA IOMMU domains can support hardware where the IOMMU page size is
> >>> larger than the CPU page size.
> >>> Alignments need to be done with respect to both PAGE_SIZE and
> >>> iovad->granule. Additionally, the sg list optimization to use a single
> >>> IOVA allocation cannot be used in those cases since the physical
> >>> addresses will very likely not be aligned to the larger IOMMU page size.
> >>>
> >>> Signed-off-by: Sven Peter <sven@svenpeter.dev>
> >>> ---
> >>>    drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
> >>>    1 file changed, 77 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> >>> index 6f0df629353f..e072d9030d9f 100644
> >>> --- a/drivers/iommu/dma-iommu.c
> >>> +++ b/drivers/iommu/dma-iommu.c
> >>> @@ -8,6 +8,7 @@
> >>>     * Copyright (C) 2000-2004 Russell King
> >>>     */
> >>>    
> >>> +#include <linux/align.h>
> >>>    #include <linux/acpi_iort.h>
> >>>    #include <linux/device.h>
> >>>    #include <linux/dma-map-ops.h>
> >>> @@ -51,6 +52,15 @@ struct iommu_dma_cookie {
> >>>    	struct iommu_domain		*fq_domain;
> >>>    };
> >>>    
> >>> +/* aligns size to CPU and IOMMU page size */
> >>> +static inline size_t iommu_page_align(struct device *dev, size_t size)
> >>> +{
> >>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> >>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> >>> +
> >>> +	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
> >>> +}
> >>> +
> >>>    static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
> >>>    bool iommu_dma_forcedac __read_mostly;
> >>>    
> >>> @@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
> >>>    /*
> >>>     * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
> >>>     * but an IOMMU which supports smaller pages might not map the whole thing.
> >>> + * If the IOMMU page size is larger than the CPU page size, then the size
> >>> + * will be aligned to that granularity and some memory will be left unused.
> >>
> >> Why do we need to increase the actual memory allocation? The point here
> >> is that we allocate the smallest thing we can allocate and map the
> >> smallest thing we can map - I think that still works the "wrong" way
> >> round too, we should just need to start taking an IOVA offset into
> >> account as in dma_map_page() if we can no longer assume it's 0 for a CPU
> >> page. Sure we may expose some unrelated adjacent pages, but we'll
> >> already be doing that to excess for streaming DMA so whoop de do.
> > 
> > I agree for trusted devices, but untrusted ones (Thunderbolt, and depending on your
> > risk tolerance possibly even the broadcom wifi) might also end up calling this.
> 
> Oh, right, I hadn't considered actual untrusted device support at this 
> stage.


Me neither :-)
I did the alignment at first without thinking too much about it,
then read your reply, and only *then* realized that there are untrusted devices
for which this just happens to do the right thing (at the cost of wasting
memory for everyone else, but I'll fix that).

> 
> > For streaming DMA swiotlb will make sure that these won't see memory
> > they're not supposed to access.
> 
> I was slightly surprised to see that that does appear to work out OK, 
> but I guess SWIOTLB slots are already smaller than typical IOMMU pages, 
> so it falls out of that. Neat.
> 
> > But, at least as far as I understand it, no swiotlb is in the way to catch devices
> > who end up calling this function. That wasn't required because we used to get
> > PAGE_SIZE aligned allocation here and every IOMMU so far would be able to easily
> > map them without any spill overs.
> > But now we'll end up exposing three more unrelated pages if the allocation
> > is not increased.
> 
> Fair enough, but then why still waste memory in the (usual) cases where 
> it logically *isn't* necessary?

See above, didn't even consider this either.

I should be able to fix this by first allocating <= 3 pages to reach a
iovad->granule boundary (or just satisfy the full allocation length already),
adjust the DMA offset forward for this first allocation and then just keep
allocating iovad->granule blocks for the rest of the requested size.

The current code already does the second part and I think I'll just need
to add the first alignment part.

> 
> >>>     */
> >>>    static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
> >>>    		size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
> >>> @@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
> >>>    
> >>>    out_unmap:
> >>>    	__iommu_dma_unmap(dev, *dma_handle, size);
> >>> -	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> >>> +	__iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
> >>>    	return NULL;
> >>>    }
> >>>    
> >>> @@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
> >>>    	struct dma_sgt_handle *sh = sgt_handle(sgt);
> >>>    
> >>>    	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
> >>> -	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> >>> +	__iommu_dma_free_pages(sh->pages,
> >>> +		iommu_page_align(dev, size) >> PAGE_SHIFT);
> >>>    	sg_free_table(&sh->sgt);
> >>>    	kfree(sh);
> >>>    }
> >>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>    	if (dev_is_untrusted(dev))
> >>>    		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
> >>>    
> >>> +	/*
> >>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> >>> +	 * very likely run into sgs with a physical address that is not aligned
> >>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> >>> +	 * independently with __iommu_dma_map then.
> >>
> >> Scatterlist segments often don't have nicely aligned ends, which is why
> >> we already align things to IOVA granules in main loop here. I think in
> >> principle we'd just need to move the non-IOVA-aligned part of the
> >> address from sg->page to sg->offset in the temporary transformation for
> >> the rest of the assumptions to hold. I don't blame you for being timid
> >> about touching that, though - it took me 3 tries to get right when I
> >> first wrote it...
> > 
> > I was a little bit afraid I'd get this exact reply :D
> > I'll try to modify the transformation again, but I'm sure it'll take me more than
> > 3 tries to get it right :)
> > 
> >>
> >>> +	 */
> >>> +	if (iovad->granule > PAGE_SIZE) {
> >>> +		for_each_sg(sg, s, nents, i) {
> >>> +			sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
> >>> +				s->length, prot, dma_get_mask(dev));
> >>> +			if (sg_dma_address(s) == DMA_MAPPING_ERROR)
> >>> +				break;
> >>> +			sg_dma_len(s) = s->length;
> >>> +		}
> >>> +
> >>> +		if (unlikely(i != nents)) {
> >>> +			nents = i;
> >>> +			for_each_sg(sg, s, nents, i)
> >>> +				__iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
> >>> +			return 0;
> >>> +		}
> >>> +
> >>> +		return nents;
> >>> +	}
> >>
> >> Either way, NAK to having a *third* implementation of SG mapping in this
> >> file which is fundamentally no different from the second one.
> > 
> > Good point. If for some reason I'm not able to modify the transformation correctly
> > I'll just fall back to iommu_dma_map_sg_swiotlb.
> > 
> >>
> >>> +
> >>>    	/*
> >>>    	 * Work out how much IOVA space we need, and align the segments to
> >>>    	 * IOVA granules for the IOMMU driver to handle. With some clever
> >>> @@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>    static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> >>>    		int nents, enum dma_data_direction dir, unsigned long attrs)
> >>>    {
> >>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> >>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> >>> +	struct iova_domain *iovad = &cookie->iovad;
> >>>    	dma_addr_t start, end;
> >>>    	struct scatterlist *tmp;
> >>>    	int i;
> >>> @@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> >>>    		return;
> >>>    	}
> >>>    
> >>> +	/*
> >>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we mapped
> >>> +	 * every entry indepedently with __iommu_dma_map then. Let's do the
> >>> +	 * opposite here.
> >>> +	 */
> >>> +	if (iovad->granule > PAGE_SIZE) {
> >>> +		for_each_sg(sg, tmp, nents, i)
> >>> +			__iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
> >>> +		return;
> >>> +	}
> >>
> >> As above, this is just __iommu_dma_unmap_sg_swiotlb() with fewer clothes on.
> >>
> >>> +
> >>>    	/*
> >>>    	 * The scatterlist segments are mapped into a single
> >>>    	 * contiguous IOVA allocation, so this is incredibly easy.
> >>> @@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
> >>>    
> >>>    static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
> >>>    {
> >>> -	size_t alloc_size = PAGE_ALIGN(size);
> >>> +	size_t alloc_size = iommu_page_align(dev, size);
> >>>    	int count = alloc_size >> PAGE_SHIFT;
> >>>    	struct page *page = NULL, **pages = NULL;
> >>>    
> >>> @@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
> >>>    		struct page **pagep, gfp_t gfp, unsigned long attrs)
> >>>    {
> >>>    	bool coherent = dev_is_dma_coherent(dev);
> >>> -	size_t alloc_size = PAGE_ALIGN(size);
> >>> +	size_t alloc_size = iommu_page_align(dev, size);
> >>>    	int node = dev_to_node(dev);
> >>>    	struct page *page = NULL;
> >>>    	void *cpu_addr;
> >>> @@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
> >>>    
> >>>    	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> >>>    	    !gfpflags_allow_blocking(gfp) && !coherent)
> >>> -		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> >>> -					       gfp, NULL);
> >>> +		page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
> >>> +					       &cpu_addr, gfp, NULL);
> >>>    	else
> >>>    		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
> >>>    	if (!cpu_addr)
> >>> @@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >>>    		void *cpu_addr, dma_addr_t dma_addr, size_t size,
> >>>    		unsigned long attrs)
> >>
> >> Can we just not bother trying to support this? TBH I don't know exactly
> >> how the interface is supposed to work - what you're doing here looks
> >> like it's probably either too much or not enough, depending on whether
> >> the address and size arguments are supposed to allow representing
> >> partial buffers - and frankly I can't imagine you'll be needing to
> >> support dma-buf exports from the USB/ethernet/wifi drivers any time soon...
> > 
> > I'm not really sure how this is going to work even before my changes.
> > Just from reading the code it looks like even then it might be doing too much
> > or too little.
> > But this will also be used for Thunderbolt and who knows what kind of devices
> > will be connected there. I'm fine with just not supporting this interface unless
> > something actually breaks for some user though.
> 
> I would bet that the set of random Thunderbolt-attachable devices which 
> participate in dma-buf exports and the set of media devices which expect 
> vb2_dma_contig to work for arbitrary user buffers have a non-empty 
> intersection. If you eat your cake you may subsequently discover that 
> you no longer have your cake ;)

:D

> 
> If we can't easily test it or reason about it, let's not pretend to 
> implement it. I'd rather somebody discovered the lack of working support 
> in a few years' time because it reliably and safely returned an error, 
> rather than because it ate their page cache. Besides, it's not like 
> people can't use a kernel built with the right PAGE_SIZE (which might 
> perform better anyway) and not have the problem in the first place.

Good points, I'll just add if (iovad->granule > PAGE_SIZE) return -ENXIO;
here which should be equivalent to having no get_sgtable in dma_map_ops.

> 
> >>>    {
> >>> +	size_t aligned_size = iommu_page_align(dev, size);
> >>>    	struct page *page;
> >>>    	int ret;
> >>>    
> >>> @@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >>>    
> >>>    		if (pages) {
> >>>    			return sg_alloc_table_from_pages(sgt, pages,
> >>> -					PAGE_ALIGN(size) >> PAGE_SHIFT,
> >>> +					aligned_size >> PAGE_SHIFT,
> >>>    					0, size, GFP_KERNEL);
> >>>    		}
> >>>    
> >>> @@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >>>    
> >>>    	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
> >>>    	if (!ret)
> >>> -		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
> >>> +		sg_set_page(sgt->sgl, page, aligned_size, 0);
> >>>    	return ret;
> >>>    }
> >>>    
> >>> @@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
> >>>    	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
> >>>    }
> >>>    
> >>> +static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
> >>> +		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
> >>> +{
> >>> +	size = iommu_page_align(dev, size);
> >>> +	return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
> >>> +}
> >>> +
> >>> +static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
> >>> +		dma_addr_t dma_handle, enum dma_data_direction dir)
> >>> +{
> >>> +	size = iommu_page_align(dev, size);
> >>> +	return dma_common_free_pages(dev, size, page, dma_handle, dir);
> >>> +}
> >>
> >> Again, what's the point of these? iommu_dma_map_page() still has to cope
> >> with whatever the caller provides, so there's no difference in the one
> >> case when that caller happens to be dma_common_map_pages().
> > 
> > Same as above, untrusted devices.
> 
> Again fair enough, but in that case do it for untrusted devices. Not for 
> the whole world for most of whom it still *is* a needless waste.

Yeah, agreed!

> 
> Robin.
> 


Best,

Sven

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-09 17:41       ` Robin Murphy
@ 2021-08-09 20:45         ` Sven Peter
  2021-08-10  9:51           ` Robin Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Sven Peter @ 2021-08-09 20:45 UTC (permalink / raw)
  To: Robin Murphy, Sven Peter
  Cc: Arnd Bergmann, Hector Martin, linux-kernel, Alexander Graf,
	Mohamed Mediouni, Will Deacon



On Mon, Aug 9, 2021, at 19:41, Robin Murphy wrote:
> On 2021-08-07 12:47, Sven Peter via iommu wrote:
> > 
> > 
> > On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
> >> On 2021-08-06 16:55, Sven Peter via iommu wrote:
> >>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>    	if (dev_is_untrusted(dev))
> >>>    		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
> >>>    
> >>> +	/*
> >>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> >>> +	 * very likely run into sgs with a physical address that is not aligned
> >>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> >>> +	 * independently with __iommu_dma_map then.
> >>
> >> Scatterlist segments often don't have nicely aligned ends, which is why
> >> we already align things to IOVA granules in main loop here. I think in
> >> principle we'd just need to move the non-IOVA-aligned part of the
> >> address from sg->page to sg->offset in the temporary transformation for
> >> the rest of the assumptions to hold. I don't blame you for being timid
> >> about touching that, though - it took me 3 tries to get right when I
> >> first wrote it...
> >>
> > 
> > 
> > I've spent some time with that code now and I think we cannot use it
> > but have to fall back to iommu_dma_map_sg_swiotlb (even though that swiotlb
> > part is a lie then):
> > 
> > When we have sg_phys(s) = 0x802e65000 with s->offset = 0 the paddr
> > is aligned to PAGE_SIZE but has an offset of 0x1000 from something
> > the IOMMU can map.
> > Now this would result in s->offset = -0x1000 which is already weird
> > enough.
> > Offset is unsigned (and 32bit) so this will actually look like
> > s->offset = 0xfffff000 then, which isn't much better.
> > And then sg_phys(s) = 0x902e64000 (instead of 0x802e64000) and
> > we'll map some random memory in iommu_map_sg_atomic and a little bit later
> > everything explodes.
> > 
> > Now I could probably adjust the phys addr backwards and make sure offset is
> > always positive (and possibly larger than PAGE_SIZE) and later restore it
> > in __finalise_sg then but I feel like that's pushing this a little bit too far.
> 
> Yes, that's what I meant. At a quick guess, something like the
> completely untested diff below.

That unfortunately results in unaligned mappings

[    9.630334] iommu: unaligned: iova 0xbff40000 pa 0x0000000801a3b000 size 0x4000 min_pagesz 0x4000

I'll take a closer look later this week and see if I can fix it.

> It really comes down to what we want to
> achieve here - if it's just to make this thing work at all, then I'd
> favour bolting on the absolute minimum changes, possibly even cheating
> by tainting the kernel and saying all bets are off instead of trying to
> handle the more involved corners really properly. However if you want to
> work towards this being a properly-supported thing, then I think it's
> worth generalising the existing assumptions of page alignment from the
> beginning.

I'd like to try and see if we can make this a properly-supported thing.

That will likely take a few iterations but realistically the rest of the drivers
required to make this platform actually useful (and especially the display controller
and GPU drivers) won't be ready for a few more months anyway. And even on 4KB PAGE_SIZE
kernels half the USB ports and NVMe will work fine, which should be enough to install
a distro and some third-party package that just ships the distro kernel with 16KB
pages.




Sven


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-09 20:45         ` Sven Peter
@ 2021-08-10  9:51           ` Robin Murphy
  2021-08-11 20:18             ` Sven Peter
  0 siblings, 1 reply; 14+ messages in thread
From: Robin Murphy @ 2021-08-10  9:51 UTC (permalink / raw)
  To: Sven Peter, Sven Peter
  Cc: Arnd Bergmann, Hector Martin, linux-kernel, Alexander Graf,
	Mohamed Mediouni, Will Deacon

On 2021-08-09 21:45, Sven Peter wrote:
> 
> 
> On Mon, Aug 9, 2021, at 19:41, Robin Murphy wrote:
>> On 2021-08-07 12:47, Sven Peter via iommu wrote:
>>>
>>>
>>> On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
>>>> On 2021-08-06 16:55, Sven Peter via iommu wrote:
>>>>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>>>     	if (dev_is_untrusted(dev))
>>>>>     		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
>>>>>     
>>>>> +	/*
>>>>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
>>>>> +	 * very likely run into sgs with a physical address that is not aligned
>>>>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
>>>>> +	 * independently with __iommu_dma_map then.
>>>>
>>>> Scatterlist segments often don't have nicely aligned ends, which is why
>>>> we already align things to IOVA granules in main loop here. I think in
>>>> principle we'd just need to move the non-IOVA-aligned part of the
>>>> address from sg->page to sg->offset in the temporary transformation for
>>>> the rest of the assumptions to hold. I don't blame you for being timid
>>>> about touching that, though - it took me 3 tries to get right when I
>>>> first wrote it...
>>>>
>>>
>>>
>>> I've spent some time with that code now and I think we cannot use it
>>> but have to fall back to iommu_dma_map_sg_swiotlb (even though that swiotlb
>>> part is a lie then):
>>>
>>> When we have sg_phys(s) = 0x802e65000 with s->offset = 0 the paddr
>>> is aligned to PAGE_SIZE but has an offset of 0x1000 from something
>>> the IOMMU can map.
>>> Now this would result in s->offset = -0x1000 which is already weird
>>> enough.
>>> Offset is unsigned (and 32bit) so this will actually look like
>>> s->offset = 0xfffff000 then, which isn't much better.
>>> And then sg_phys(s) = 0x902e64000 (instead of 0x802e64000) and
>>> we'll map some random memory in iommu_map_sg_atomic and a little bit later
>>> everything explodes.
>>>
>>> Now I could probably adjust the phys addr backwards and make sure offset is
>>> always positive (and possibly larger than PAGE_SIZE) and later restore it
>>> in __finalise_sg then but I feel like that's pushing this a little bit too far.
>>
>> Yes, that's what I meant. At a quick guess, something like the
>> completely untested diff below.
> 
> That unfortunately results in unaligned mappings

You mean it even compiles!? :D

> [    9.630334] iommu: unaligned: iova 0xbff40000 pa 0x0000000801a3b000 size 0x4000 min_pagesz 0x4000
> 
> I'll take a closer look later this week and see if I can fix it.

On reflection, "s->offset ^ s_iova_off" is definitely wrong, that more 
likely wants to be "s->offset & ~s_iova_off".

Robin.

>> It really comes down to what we want to
>> achieve here - if it's just to make this thing work at all, then I'd
>> favour bolting on the absolute minimum changes, possibly even cheating
>> by tainting the kernel and saying all bets are off instead of trying to
>> handle the more involved corners really properly. However if you want to
>> work towards this being a properly-supported thing, then I think it's
>> worth generalising the existing assumptions of page alignment from the
>> beginning.
> 
> I'd like to try and see if we can make this a properly-supported thing.
> 
> That will likely take a few iterations but realistically the rest of the drivers
> required to make this platform actually useful (and especially the display controller
> and GPU drivers) won't be ready for a few more months anyway. And even on 4KB PAGE_SIZE
> kernels half the USB ports and NVMe will work fine, which should be enough to install
> a distro and some third-party package that just ships the distro kernel with 16KB
> pages.
> 
> 
> 
> 
> Sven
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-10  9:51           ` Robin Murphy
@ 2021-08-11 20:18             ` Sven Peter
  2021-08-12 12:43               ` Robin Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Sven Peter @ 2021-08-11 20:18 UTC (permalink / raw)
  To: Robin Murphy, iommu
  Cc: Arnd Bergmann, Hector Martin, linux-kernel, Alexander Graf,
	Mohamed Mediouni, Will Deacon



On Tue, Aug 10, 2021, at 11:51, Robin Murphy wrote:
> On 2021-08-09 21:45, Sven Peter wrote:
> > 
> > 
> > On Mon, Aug 9, 2021, at 19:41, Robin Murphy wrote:
> >> On 2021-08-07 12:47, Sven Peter via iommu wrote:
> >>>
> >>>
> >>> On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
> >>>> On 2021-08-06 16:55, Sven Peter via iommu wrote:
> >>>>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>>>     	if (dev_is_untrusted(dev))
> >>>>>     		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
> >>>>>     
> >>>>> +	/*
> >>>>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> >>>>> +	 * very likely run into sgs with a physical address that is not aligned
> >>>>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> >>>>> +	 * independently with __iommu_dma_map then.
> >>>>
> >>>> Scatterlist segments often don't have nicely aligned ends, which is why
> >>>> we already align things to IOVA granules in main loop here. I think in
> >>>> principle we'd just need to move the non-IOVA-aligned part of the
> >>>> address from sg->page to sg->offset in the temporary transformation for
> >>>> the rest of the assumptions to hold. I don't blame you for being timid
> >>>> about touching that, though - it took me 3 tries to get right when I
> >>>> first wrote it...
> >>>>
> >>>
> >>>
> >>> I've spent some time with that code now and I think we cannot use it
> >>> but have to fall back to iommu_dma_map_sg_swiotlb (even though that swiotlb
> >>> part is a lie then):
> >>>
> >>> When we have sg_phys(s) = 0x802e65000 with s->offset = 0 the paddr
> >>> is aligned to PAGE_SIZE but has an offset of 0x1000 from something
> >>> the IOMMU can map.
> >>> Now this would result in s->offset = -0x1000 which is already weird
> >>> enough.
> >>> Offset is unsigned (and 32bit) so this will actually look like
> >>> s->offset = 0xfffff000 then, which isn't much better.
> >>> And then sg_phys(s) = 0x902e64000 (instead of 0x802e64000) and
> >>> we'll map some random memory in iommu_map_sg_atomic and a little bit later
> >>> everything explodes.
> >>>
> >>> Now I could probably adjust the phys addr backwards and make sure offset is
> >>> always positive (and possibly larger than PAGE_SIZE) and later restore it
> >>> in __finalise_sg then but I feel like that's pushing this a little bit too far.
> >>
> >> Yes, that's what I meant. At a quick guess, something like the
> >> completely untested diff below.
> > 
> > That unfortunately results in unaligned mappings
> 
> You mean it even compiles!? :D

I was more impressed that it already almost worked correctly :)

> 
> > [    9.630334] iommu: unaligned: iova 0xbff40000 pa 0x0000000801a3b000 size 0x4000 min_pagesz 0x4000
> > 
> > I'll take a closer look later this week and see if I can fix it.
> 
> On reflection, "s->offset ^ s_iova_off" is definitely wrong, that more 
> likely wants to be "s->offset & ~s_iova_off".
> 
> Robin.
> 


If I change

		sg_set_page(s, phys_to_page(sg_phys(s)), s_length,
			    s_iova_off & ~PAGE_MASK);

in __finalise_sg (and the same thing in __invalidate_sg) to

		sg_set_page(s, phys_to_page(sg_phys(s) + s_iova_off), s_length,
			    s_iova_off & ~PAGE_MASK);

then it also restores the original fields correctly.


What is the proper way to credit you for coming up with this?
Do you create the commit and I apply it to my local tree and
include it in my submission once I have fixed the other
issues? Or do I create the commit and put a Suggested-by
in the message?


Either way, here's the patch that I have right now:

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7ce74476699d..ba31dc59566d 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -907,8 +907,8 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 		unsigned int s_length = sg_dma_len(s);
 		unsigned int s_iova_len = s->length;

-		s->offset += s_iova_off;
-		s->length = s_length;
+		sg_set_page(s, phys_to_page(sg_phys(s) + s_iova_off), s_length,
+			    s_iova_off & ~PAGE_MASK);
 		sg_dma_address(s) = DMA_MAPPING_ERROR;
 		sg_dma_len(s) = 0;

@@ -952,10 +952,11 @@ static void __invalidate_sg(struct scatterlist *sg, int nents)
 	int i;

 	for_each_sg(sg, s, nents, i) {
-		if (sg_dma_address(s) != DMA_MAPPING_ERROR)
-			s->offset += sg_dma_address(s);
 		if (sg_dma_len(s))
-			s->length = sg_dma_len(s);
+			sg_set_page(s,
+				    phys_to_page(sg_phys(s) + sg_dma_address(s)),
+				    sg_dma_len(s),
+				    sg_dma_address(s) & ~PAGE_MASK);
 		sg_dma_address(s) = DMA_MAPPING_ERROR;
 		sg_dma_len(s) = 0;
 	}
@@ -1031,15 +1032,16 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	 * stashing the unaligned parts in the as-yet-unused DMA fields.
 	 */
 	for_each_sg(sg, s, nents, i) {
-		size_t s_iova_off = iova_offset(iovad, s->offset);
+		phys_addr_t s_phys = sg_phys(s);
+		size_t s_iova_off = iova_offset(iovad, s_phys);
 		size_t s_length = s->length;
 		size_t pad_len = (mask - iova_len + 1) & mask;

 		sg_dma_address(s) = s_iova_off;
 		sg_dma_len(s) = s_length;
-		s->offset -= s_iova_off;
 		s_length = iova_align(iovad, s_length + s_iova_off);
-		s->length = s_length;
+		sg_set_page(s, phys_to_page(s_phys - s_iova_off),
+			    s_length, s->offset & ~s_iova_off);

 		/*
 		 * Due to the alignment of our single IOVA allocation, we can





Sven

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE
  2021-08-11 20:18             ` Sven Peter
@ 2021-08-12 12:43               ` Robin Murphy
  0 siblings, 0 replies; 14+ messages in thread
From: Robin Murphy @ 2021-08-12 12:43 UTC (permalink / raw)
  To: Sven Peter, iommu
  Cc: Arnd Bergmann, Hector Martin, linux-kernel, Alexander Graf,
	Mohamed Mediouni, Will Deacon

On 2021-08-11 21:18, Sven Peter wrote:
> 
> 
> On Tue, Aug 10, 2021, at 11:51, Robin Murphy wrote:
>> On 2021-08-09 21:45, Sven Peter wrote:
>>>
>>>
>>> On Mon, Aug 9, 2021, at 19:41, Robin Murphy wrote:
>>>> On 2021-08-07 12:47, Sven Peter via iommu wrote:
>>>>>
>>>>>
>>>>> On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
>>>>>> On 2021-08-06 16:55, Sven Peter via iommu wrote:
>>>>>>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>>>>>      	if (dev_is_untrusted(dev))
>>>>>>>      		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
>>>>>>>      
>>>>>>> +	/*
>>>>>>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
>>>>>>> +	 * very likely run into sgs with a physical address that is not aligned
>>>>>>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
>>>>>>> +	 * independently with __iommu_dma_map then.
>>>>>>
>>>>>> Scatterlist segments often don't have nicely aligned ends, which is why
>>>>>> we already align things to IOVA granules in main loop here. I think in
>>>>>> principle we'd just need to move the non-IOVA-aligned part of the
>>>>>> address from sg->page to sg->offset in the temporary transformation for
>>>>>> the rest of the assumptions to hold. I don't blame you for being timid
>>>>>> about touching that, though - it took me 3 tries to get right when I
>>>>>> first wrote it...
>>>>>>
>>>>>
>>>>>
>>>>> I've spent some time with that code now and I think we cannot use it
>>>>> but have to fall back to iommu_dma_map_sg_swiotlb (even though that swiotlb
>>>>> part is a lie then):
>>>>>
>>>>> When we have sg_phys(s) = 0x802e65000 with s->offset = 0 the paddr
>>>>> is aligned to PAGE_SIZE but has an offset of 0x1000 from something
>>>>> the IOMMU can map.
>>>>> Now this would result in s->offset = -0x1000 which is already weird
>>>>> enough.
>>>>> Offset is unsigned (and 32bit) so this will actually look like
>>>>> s->offset = 0xfffff000 then, which isn't much better.
>>>>> And then sg_phys(s) = 0x902e64000 (instead of 0x802e64000) and
>>>>> we'll map some random memory in iommu_map_sg_atomic and a little bit later
>>>>> everything explodes.
>>>>>
>>>>> Now I could probably adjust the phys addr backwards and make sure offset is
>>>>> always positive (and possibly larger than PAGE_SIZE) and later restore it
>>>>> in __finalise_sg then but I feel like that's pushing this a little bit too far.
>>>>
>>>> Yes, that's what I meant. At a quick guess, something like the
>>>> completely untested diff below.
>>>
>>> That unfortunately results in unaligned mappings
>>
>> You mean it even compiles!? :D
> 
> I was more impressed that it already almost worked correctly :)
> 
>>
>>> [    9.630334] iommu: unaligned: iova 0xbff40000 pa 0x0000000801a3b000 size 0x4000 min_pagesz 0x4000
>>>
>>> I'll take a closer look later this week and see if I can fix it.
>>
>> On reflection, "s->offset ^ s_iova_off" is definitely wrong, that more
>> likely wants to be "s->offset & ~s_iova_off".
>>
>> Robin.
>>
> 
> 
> If I change
> 
> 		sg_set_page(s, phys_to_page(sg_phys(s)), s_length,
> 			    s_iova_off & ~PAGE_MASK);
> 
> in __finalise_sg (and the same thing in __invalidate_sg) to
> 
> 		sg_set_page(s, phys_to_page(sg_phys(s) + s_iova_off), s_length,
> 			    s_iova_off & ~PAGE_MASK);
> 
> then it also restores the original fields correctly.

Ah, good point, once again this proves to be right on the limit of how 
many moving parts I can hold in my head without working it through on 
paper (or in a debugger). FWIW my thought there was that sg_phys(s) 
would be enough to "re-normalise" things from a state where s->page_link 
was rounded down and s->offset held the PAGE_SIZE multiple (which is 
what the XOR did do), but of course that wasn't right to begin with. In 
fact, s->offset must always be negligible when restoring so you could 
arguably open-code page_to_phys(sg_page(s)), but that might be a bit too 
much of a mouthful for a theoretical micro-optimisation (note that 
trying to avoid the phys_addr_t round-trip by offsetting the page 
pointer itself might appear to work for SPARSEMEM_VMEMMAP, but I'm 
pretty sure it's not valid in general).

> What is the proper way to credit you for coming up with this?
> Do you create the commit and I apply it to my local tree and
> include it in my submission once I have fixed the other
> issues? Or do I create the commit and put a Suggested-by
> in the message?

Suggested-by is fine by me, but if you feel there's enough of my diff 
left that you haven't had to put right then you're also welcome to have 
these instead:

Co-developed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>

Cheers,
Robin.

> 
> 
> Either way, here's the patch that I have right now:
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 7ce74476699d..ba31dc59566d 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -907,8 +907,8 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
>   		unsigned int s_length = sg_dma_len(s);
>   		unsigned int s_iova_len = s->length;
> 
> -		s->offset += s_iova_off;
> -		s->length = s_length;
> +		sg_set_page(s, phys_to_page(sg_phys(s) + s_iova_off), s_length,
> +			    s_iova_off & ~PAGE_MASK);
>   		sg_dma_address(s) = DMA_MAPPING_ERROR;
>   		sg_dma_len(s) = 0;
> 
> @@ -952,10 +952,11 @@ static void __invalidate_sg(struct scatterlist *sg, int nents)
>   	int i;
> 
>   	for_each_sg(sg, s, nents, i) {
> -		if (sg_dma_address(s) != DMA_MAPPING_ERROR)
> -			s->offset += sg_dma_address(s);
>   		if (sg_dma_len(s))
> -			s->length = sg_dma_len(s);
> +			sg_set_page(s,
> +				    phys_to_page(sg_phys(s) + sg_dma_address(s)),
> +				    sg_dma_len(s),
> +				    sg_dma_address(s) & ~PAGE_MASK);
>   		sg_dma_address(s) = DMA_MAPPING_ERROR;
>   		sg_dma_len(s) = 0;
>   	}
> @@ -1031,15 +1032,16 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   	 * stashing the unaligned parts in the as-yet-unused DMA fields.
>   	 */
>   	for_each_sg(sg, s, nents, i) {
> -		size_t s_iova_off = iova_offset(iovad, s->offset);
> +		phys_addr_t s_phys = sg_phys(s);
> +		size_t s_iova_off = iova_offset(iovad, s_phys);
>   		size_t s_length = s->length;
>   		size_t pad_len = (mask - iova_len + 1) & mask;
> 
>   		sg_dma_address(s) = s_iova_off;
>   		sg_dma_len(s) = s_length;
> -		s->offset -= s_iova_off;
>   		s_length = iova_align(iovad, s_length + s_iova_off);
> -		s->length = s_length;
> +		sg_set_page(s, phys_to_page(s_phys - s_iova_off),
> +			    s_length, s->offset & ~s_iova_off);
> 
>   		/*
>   		 * Due to the alignment of our single IOVA allocation, we can
> 
> 
> 
> 
> 
> Sven
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-08-12 12:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-06 15:55 [RFC PATCH 0/3] iommu/dma-iommu: Support IOMMU page size larger than the CPU page size Sven Peter
2021-08-06 15:55 ` [RFC PATCH 1/3] iommu: Move IOMMU pagesize check to attach_device Sven Peter
2021-08-06 15:55 ` [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE Sven Peter
2021-08-06 18:04   ` Robin Murphy
2021-08-07  8:41     ` Sven Peter
2021-08-09 18:37       ` Robin Murphy
2021-08-09 19:57         ` Sven Peter
2021-08-07 11:47     ` Sven Peter
2021-08-09 17:41       ` Robin Murphy
2021-08-09 20:45         ` Sven Peter
2021-08-10  9:51           ` Robin Murphy
2021-08-11 20:18             ` Sven Peter
2021-08-12 12:43               ` Robin Murphy
2021-08-06 15:55 ` [RFC PATCH 3/3] iommu: Introduce __IOMMU_DOMAIN_LARGE_PAGES Sven Peter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).