LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v1 0/4] Tegra GART fixes and improvements
@ 2018-04-09 20:07 Dmitry Osipenko
  2018-04-09 20:07 ` [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility Dmitry Osipenko
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2018-04-09 20:07 UTC (permalink / raw)
  To: Thierry Reding, Jonathan Hunter, Joerg Roedel
  Cc: iommu, linux-tegra, linux-kernel

GART driver wasn't ever been utilized in upstream, but finally this should
change sometime soon with Tegra's DRM driver rework. In general GART driver
works fine, though there are couple things that could be improved.

Dmitry Osipenko (4):
  iommu/tegra: gart: Add debugging facility
  iommu/tegra: gart: Fix gart_iommu_unmap()
  iommu/tegra: gart: Constify number of GART pages
  iommu/tegra: gart: Optimize map/unmap

 drivers/iommu/tegra-gart.c | 90 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 69 insertions(+), 21 deletions(-)

-- 
2.16.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility
  2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko
@ 2018-04-09 20:07 ` Dmitry Osipenko
  2018-04-27  9:46   ` Thierry Reding
  2018-04-09 20:07 ` [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap() Dmitry Osipenko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Dmitry Osipenko @ 2018-04-09 20:07 UTC (permalink / raw)
  To: Thierry Reding, Jonathan Hunter, Joerg Roedel
  Cc: iommu, linux-tegra, linux-kernel

Page mapping could overwritten by an accident (a bug). We can catch this
case by checking 'VALID' bit of GART's page entry prior to mapping of a
page. Since that check introduces a small performance impact, it should be
enabled explicitly using new GART's kernel module 'debug' parameter.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/iommu/tegra-gart.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index b62f790ad1ba..4c0abdcd1ad2 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -72,6 +72,8 @@ struct gart_domain {
 
 static struct gart_device *gart_handle; /* unique for a system */
 
+static bool gart_debug;
+
 #define GART_PTE(_pfn)						\
 	(GART_ENTRY_PHYS_ADDR_VALID | ((_pfn) << PAGE_SHIFT))
 
@@ -271,6 +273,7 @@ static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
 	struct gart_device *gart = gart_domain->gart;
 	unsigned long flags;
 	unsigned long pfn;
+	unsigned long pte;
 
 	if (!gart_iova_range_valid(gart, iova, bytes))
 		return -EINVAL;
@@ -282,6 +285,14 @@ static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
 		spin_unlock_irqrestore(&gart->pte_lock, flags);
 		return -EINVAL;
 	}
+	if (gart_debug) {
+		pte = gart_read_pte(gart, iova);
+		if (pte & GART_ENTRY_PHYS_ADDR_VALID) {
+			spin_unlock_irqrestore(&gart->pte_lock, flags);
+			dev_err(gart->dev, "Page entry is in-use\n");
+			return -EBUSY;
+		}
+	}
 	gart_set_pte(gart, iova, GART_PTE(pfn));
 	FLUSH_GART_REGS(gart);
 	spin_unlock_irqrestore(&gart->pte_lock, flags);
@@ -515,7 +526,9 @@ static void __exit tegra_gart_exit(void)
 
 subsys_initcall(tegra_gart_init);
 module_exit(tegra_gart_exit);
+module_param(gart_debug, bool, 0644);
 
+MODULE_PARM_DESC(gart_debug, "Enable GART debugging");
 MODULE_DESCRIPTION("IOMMU API for GART in Tegra20");
 MODULE_AUTHOR("Hiroshi DOYU <hdoyu@nvidia.com>");
 MODULE_ALIAS("platform:tegra-gart");
-- 
2.16.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap()
  2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko
  2018-04-09 20:07 ` [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility Dmitry Osipenko
@ 2018-04-09 20:07 ` Dmitry Osipenko
  2018-04-27  9:43   ` Thierry Reding
  2018-04-09 20:07 ` [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages Dmitry Osipenko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Dmitry Osipenko @ 2018-04-09 20:07 UTC (permalink / raw)
  To: Thierry Reding, Jonathan Hunter, Joerg Roedel
  Cc: iommu, linux-tegra, linux-kernel

It must return the number of unmapped bytes on success, returning 0 means
that unmapping failed and in result only one page is unmapped.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/iommu/tegra-gart.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index 4c0abdcd1ad2..89ec24c6952c 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -313,7 +313,7 @@ static size_t gart_iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 	gart_set_pte(gart, iova, 0);
 	FLUSH_GART_REGS(gart);
 	spin_unlock_irqrestore(&gart->pte_lock, flags);
-	return 0;
+	return bytes;
 }
 
 static phys_addr_t gart_iommu_iova_to_phys(struct iommu_domain *domain,
-- 
2.16.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages
  2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko
  2018-04-09 20:07 ` [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility Dmitry Osipenko
  2018-04-09 20:07 ` [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap() Dmitry Osipenko
@ 2018-04-09 20:07 ` Dmitry Osipenko
  2018-04-27  9:49   ` Thierry Reding
  2018-04-09 20:07 ` [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Dmitry Osipenko
  2018-05-03 12:52 ` [PATCH v1 0/4] Tegra GART fixes and improvements Joerg Roedel
  4 siblings, 1 reply; 18+ messages in thread
From: Dmitry Osipenko @ 2018-04-09 20:07 UTC (permalink / raw)
  To: Thierry Reding, Jonathan Hunter, Joerg Roedel
  Cc: iommu, linux-tegra, linux-kernel

GART has a fixed aperture size, hence the number of pages is constant.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/iommu/tegra-gart.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index 89ec24c6952c..4a0607669d34 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -33,6 +33,8 @@
 
 #include <asm/cacheflush.h>
 
+#define GART_APERTURE_SIZE	SZ_32M
+
 /* bitmap of the page sizes currently supported */
 #define GART_IOMMU_PGSIZES	(SZ_4K)
 
@@ -47,6 +49,8 @@
 #define GART_PAGE_MASK						\
 	(~(GART_PAGE_SIZE - 1) & ~GART_ENTRY_PHYS_ADDR_VALID)
 
+#define GART_PAGECOUNT		(GART_APERTURE_SIZE >> GART_PAGE_SHIFT)
+
 struct gart_client {
 	struct device		*dev;
 	struct list_head	list;
@@ -55,7 +59,6 @@ struct gart_client {
 struct gart_device {
 	void __iomem		*regs;
 	u32			*savedata;
-	u32			page_count;	/* total remappable size */
 	dma_addr_t		iovmm_base;	/* offset to vmm_area */
 	spinlock_t		pte_lock;	/* for pagetable */
 	struct list_head	client;
@@ -91,7 +94,7 @@ static struct gart_domain *to_gart_domain(struct iommu_domain *dom)
 
 #define for_each_gart_pte(gart, iova)					\
 	for (iova = gart->iovmm_base;					\
-	     iova < gart->iovmm_base + GART_PAGE_SIZE * gart->page_count; \
+	     iova < gart->iovmm_base + GART_APERTURE_SIZE; 		\
 	     iova += GART_PAGE_SIZE)
 
 static inline void gart_set_pte(struct gart_device *gart,
@@ -158,7 +161,7 @@ static inline bool gart_iova_range_valid(struct gart_device *gart,
 	iova_start = iova;
 	iova_end = iova_start + bytes - 1;
 	gart_start = gart->iovmm_base;
-	gart_end = gart_start + gart->page_count * GART_PAGE_SIZE - 1;
+	gart_end = gart_start + GART_APERTURE_SIZE - 1;
 
 	if (iova_start < gart_start)
 		return false;
@@ -241,7 +244,7 @@ static struct iommu_domain *gart_iommu_domain_alloc(unsigned type)
 	gart_domain->gart = gart;
 	gart_domain->domain.geometry.aperture_start = gart->iovmm_base;
 	gart_domain->domain.geometry.aperture_end = gart->iovmm_base +
-					gart->page_count * GART_PAGE_SIZE - 1;
+							GART_APERTURE_SIZE - 1;
 	gart_domain->domain.geometry.force_aperture = true;
 
 	return &gart_domain->domain;
@@ -463,9 +466,8 @@ static int tegra_gart_probe(struct platform_device *pdev)
 	INIT_LIST_HEAD(&gart->client);
 	gart->regs = gart_regs;
 	gart->iovmm_base = (dma_addr_t)res_remap->start;
-	gart->page_count = (resource_size(res_remap) >> GART_PAGE_SHIFT);
 
-	gart->savedata = vmalloc(sizeof(u32) * gart->page_count);
+	gart->savedata = vmalloc(sizeof(u32) * GART_PAGECOUNT);
 	if (!gart->savedata) {
 		dev_err(dev, "failed to allocate context save area\n");
 		return -ENOMEM;
-- 
2.16.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko
                   ` (2 preceding siblings ...)
  2018-04-09 20:07 ` [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages Dmitry Osipenko
@ 2018-04-09 20:07 ` Dmitry Osipenko
  2018-04-27 10:02   ` Thierry Reding
  2018-05-03 12:52 ` [PATCH v1 0/4] Tegra GART fixes and improvements Joerg Roedel
  4 siblings, 1 reply; 18+ messages in thread
From: Dmitry Osipenko @ 2018-04-09 20:07 UTC (permalink / raw)
  To: Thierry Reding, Jonathan Hunter, Joerg Roedel
  Cc: iommu, linux-tegra, linux-kernel

Currently GART writes one page entry at a time. More optimal would be to
aggregate the writes and flush BUS buffer in the end, this gives map/unmap
10-40% (depending on size of mapping) performance boost compared to a
flushing after each entry update.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/iommu/tegra-gart.c | 63 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index 4a0607669d34..9f59f5f17661 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -36,7 +36,7 @@
 #define GART_APERTURE_SIZE	SZ_32M
 
 /* bitmap of the page sizes currently supported */
-#define GART_IOMMU_PGSIZES	(SZ_4K)
+#define GART_IOMMU_PGSIZES	GENMASK(24, 12)
 
 #define GART_REG_BASE		0x24
 #define GART_CONFIG		(0x24 - GART_REG_BASE)
@@ -269,25 +269,21 @@ static void gart_iommu_domain_free(struct iommu_domain *domain)
 	kfree(gart_domain);
 }
 
-static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
-			  phys_addr_t pa, size_t bytes, int prot)
+static int gart_iommu_map_page(struct gart_device *gart,
+			       unsigned long iova,
+			       phys_addr_t pa)
 {
-	struct gart_domain *gart_domain = to_gart_domain(domain);
-	struct gart_device *gart = gart_domain->gart;
 	unsigned long flags;
 	unsigned long pfn;
 	unsigned long pte;
 
-	if (!gart_iova_range_valid(gart, iova, bytes))
-		return -EINVAL;
-
-	spin_lock_irqsave(&gart->pte_lock, flags);
 	pfn = __phys_to_pfn(pa);
 	if (!pfn_valid(pfn)) {
 		dev_err(gart->dev, "Invalid page: %pa\n", &pa);
-		spin_unlock_irqrestore(&gart->pte_lock, flags);
 		return -EINVAL;
 	}
+
+	spin_lock_irqsave(&gart->pte_lock, flags);
 	if (gart_debug) {
 		pte = gart_read_pte(gart, iova);
 		if (pte & GART_ENTRY_PHYS_ADDR_VALID) {
@@ -297,8 +293,41 @@ static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
 		}
 	}
 	gart_set_pte(gart, iova, GART_PTE(pfn));
+	spin_unlock_irqrestore(&gart->pte_lock, flags);
+
+	return 0;
+}
+
+static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
+			  phys_addr_t pa, size_t bytes, int prot)
+{
+	struct gart_domain *gart_domain = to_gart_domain(domain);
+	struct gart_device *gart = gart_domain->gart;
+	size_t mapped;
+	int ret = -1;
+
+	if (!gart_iova_range_valid(gart, iova, bytes))
+		return -EINVAL;
+
+	for (mapped = 0; mapped < bytes; mapped += GART_PAGE_SIZE) {
+		ret = gart_iommu_map_page(gart, iova + mapped, pa + mapped);
+		if (ret)
+			break;
+	}
+
 	FLUSH_GART_REGS(gart);
+	return ret;
+}
+
+static int gart_iommu_unmap_page(struct gart_device *gart,
+				 unsigned long iova)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&gart->pte_lock, flags);
+	gart_set_pte(gart, iova, 0);
 	spin_unlock_irqrestore(&gart->pte_lock, flags);
+
 	return 0;
 }
 
@@ -307,16 +336,20 @@ static size_t gart_iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 {
 	struct gart_domain *gart_domain = to_gart_domain(domain);
 	struct gart_device *gart = gart_domain->gart;
-	unsigned long flags;
+	size_t unmapped;
+	int ret;
 
 	if (!gart_iova_range_valid(gart, iova, bytes))
 		return 0;
 
-	spin_lock_irqsave(&gart->pte_lock, flags);
-	gart_set_pte(gart, iova, 0);
+	for (unmapped = 0; unmapped < bytes; unmapped += GART_PAGE_SIZE) {
+		ret = gart_iommu_unmap_page(gart, iova + unmapped);
+		if (ret)
+			break;
+	}
+
 	FLUSH_GART_REGS(gart);
-	spin_unlock_irqrestore(&gart->pte_lock, flags);
-	return bytes;
+	return unmapped;
 }
 
 static phys_addr_t gart_iommu_iova_to_phys(struct iommu_domain *domain,
-- 
2.16.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap()
  2018-04-09 20:07 ` [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap() Dmitry Osipenko
@ 2018-04-27  9:43   ` Thierry Reding
  0 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2018-04-27  9:43 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Jonathan Hunter, Joerg Roedel, iommu, linux-tegra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

On Mon, Apr 09, 2018 at 11:07:20PM +0300, Dmitry Osipenko wrote:
> It must return the number of unmapped bytes on success, returning 0 means
> that unmapping failed and in result only one page is unmapped.
> 
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>  drivers/iommu/tegra-gart.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Thierry Reding <treding@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility
  2018-04-09 20:07 ` [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility Dmitry Osipenko
@ 2018-04-27  9:46   ` Thierry Reding
  0 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2018-04-27  9:46 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Jonathan Hunter, Joerg Roedel, iommu, linux-tegra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 764 bytes --]

On Mon, Apr 09, 2018 at 11:07:19PM +0300, Dmitry Osipenko wrote:
> Page mapping could overwritten by an accident (a bug). We can catch this
> case by checking 'VALID' bit of GART's page entry prior to mapping of a
> page. Since that check introduces a small performance impact, it should be
> enabled explicitly using new GART's kernel module 'debug' parameter.
> 
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>  drivers/iommu/tegra-gart.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)

I think this was discussed before, but this really shouldn't be needed
unless the IOVA allocator or GART driver are buggy.

That said, I think this has some usefulness to find such bugs, so:

Acked-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages
  2018-04-09 20:07 ` [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages Dmitry Osipenko
@ 2018-04-27  9:49   ` Thierry Reding
  0 siblings, 0 replies; 18+ messages in thread
From: Thierry Reding @ 2018-04-27  9:49 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Jonathan Hunter, Joerg Roedel, iommu, linux-tegra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 823 bytes --]

On Mon, Apr 09, 2018 at 11:07:21PM +0300, Dmitry Osipenko wrote:
> GART has a fixed aperture size, hence the number of pages is constant.
> 
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>  drivers/iommu/tegra-gart.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)

This doesn't really give a good explanation of why you want to do this.
While it is certainly true that the aperture is fixed in size, I can
imagine cases where somebody would want to restrict the aperture size
(perhaps there's a fixed mapping somewhere near the end of the mapping
that Linux shouldn't touch, or the GART aperture could be reduced to
simulate out of IOVA memory situations).

Without a good explanation why this is necessary I don't see why the
current code should be changed.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-04-09 20:07 ` [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Dmitry Osipenko
@ 2018-04-27 10:02   ` Thierry Reding
  2018-04-27 12:01     ` Dmitry Osipenko
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Thierry Reding @ 2018-04-27 10:02 UTC (permalink / raw)
  To: Dmitry Osipenko, Joerg Roedel
  Cc: Jonathan Hunter, iommu, linux-tegra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4981 bytes --]

On Mon, Apr 09, 2018 at 11:07:22PM +0300, Dmitry Osipenko wrote:
> Currently GART writes one page entry at a time. More optimal would be to
> aggregate the writes and flush BUS buffer in the end, this gives map/unmap
> 10-40% (depending on size of mapping) performance boost compared to a
> flushing after each entry update.
> 
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>  drivers/iommu/tegra-gart.c | 63 +++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
> index 4a0607669d34..9f59f5f17661 100644
> --- a/drivers/iommu/tegra-gart.c
> +++ b/drivers/iommu/tegra-gart.c
> @@ -36,7 +36,7 @@
>  #define GART_APERTURE_SIZE	SZ_32M
>  
>  /* bitmap of the page sizes currently supported */
> -#define GART_IOMMU_PGSIZES	(SZ_4K)
> +#define GART_IOMMU_PGSIZES	GENMASK(24, 12)

That doesn't look right. The GART really only supports 4 KiB pages. You
seem to be "emulating" more page sizes here in order to improve mapping
performance. That seems wrong to me. I'm wondering if this couldn't be
improved by a similar factor by simply moving the flushing into an
implementation of ->iotlb_sync().

That said, it seems like ->iotlb_sync() is only used for unmapping, but
I don't see a reason why iommu_map() wouldn't need to call it as well
after going through several calls to ->map(). It seems to me like a
driver that implements ->iotlb_sync() would want to use it to optimize
for both the mapping and unmapping cases.

Joerg, I've gone over the git log and header files and I see no mention
of why the TLB flush interface isn't used for mapping. Do you recall any
special reasons why the same shouldn't be applied for mapping? Would you
accept any patches doing this?

Thierry

>  
>  #define GART_REG_BASE		0x24
>  #define GART_CONFIG		(0x24 - GART_REG_BASE)
> @@ -269,25 +269,21 @@ static void gart_iommu_domain_free(struct iommu_domain *domain)
>  	kfree(gart_domain);
>  }
>  
> -static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
> -			  phys_addr_t pa, size_t bytes, int prot)
> +static int gart_iommu_map_page(struct gart_device *gart,
> +			       unsigned long iova,
> +			       phys_addr_t pa)
>  {
> -	struct gart_domain *gart_domain = to_gart_domain(domain);
> -	struct gart_device *gart = gart_domain->gart;
>  	unsigned long flags;
>  	unsigned long pfn;
>  	unsigned long pte;
>  
> -	if (!gart_iova_range_valid(gart, iova, bytes))
> -		return -EINVAL;
> -
> -	spin_lock_irqsave(&gart->pte_lock, flags);
>  	pfn = __phys_to_pfn(pa);
>  	if (!pfn_valid(pfn)) {
>  		dev_err(gart->dev, "Invalid page: %pa\n", &pa);
> -		spin_unlock_irqrestore(&gart->pte_lock, flags);
>  		return -EINVAL;
>  	}
> +
> +	spin_lock_irqsave(&gart->pte_lock, flags);
>  	if (gart_debug) {
>  		pte = gart_read_pte(gart, iova);
>  		if (pte & GART_ENTRY_PHYS_ADDR_VALID) {
> @@ -297,8 +293,41 @@ static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
>  		}
>  	}
>  	gart_set_pte(gart, iova, GART_PTE(pfn));
> +	spin_unlock_irqrestore(&gart->pte_lock, flags);
> +
> +	return 0;
> +}
> +
> +static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
> +			  phys_addr_t pa, size_t bytes, int prot)
> +{
> +	struct gart_domain *gart_domain = to_gart_domain(domain);
> +	struct gart_device *gart = gart_domain->gart;
> +	size_t mapped;
> +	int ret = -1;
> +
> +	if (!gart_iova_range_valid(gart, iova, bytes))
> +		return -EINVAL;
> +
> +	for (mapped = 0; mapped < bytes; mapped += GART_PAGE_SIZE) {
> +		ret = gart_iommu_map_page(gart, iova + mapped, pa + mapped);
> +		if (ret)
> +			break;
> +	}
> +
>  	FLUSH_GART_REGS(gart);
> +	return ret;
> +}
> +
> +static int gart_iommu_unmap_page(struct gart_device *gart,
> +				 unsigned long iova)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&gart->pte_lock, flags);
> +	gart_set_pte(gart, iova, 0);
>  	spin_unlock_irqrestore(&gart->pte_lock, flags);
> +
>  	return 0;
>  }
>  
> @@ -307,16 +336,20 @@ static size_t gart_iommu_unmap(struct iommu_domain *domain, unsigned long iova,
>  {
>  	struct gart_domain *gart_domain = to_gart_domain(domain);
>  	struct gart_device *gart = gart_domain->gart;
> -	unsigned long flags;
> +	size_t unmapped;
> +	int ret;
>  
>  	if (!gart_iova_range_valid(gart, iova, bytes))
>  		return 0;
>  
> -	spin_lock_irqsave(&gart->pte_lock, flags);
> -	gart_set_pte(gart, iova, 0);
> +	for (unmapped = 0; unmapped < bytes; unmapped += GART_PAGE_SIZE) {
> +		ret = gart_iommu_unmap_page(gart, iova + unmapped);
> +		if (ret)
> +			break;
> +	}
> +
>  	FLUSH_GART_REGS(gart);
> -	spin_unlock_irqrestore(&gart->pte_lock, flags);
> -	return bytes;
> +	return unmapped;
>  }
>  
>  static phys_addr_t gart_iommu_iova_to_phys(struct iommu_domain *domain,
> -- 
> 2.16.3
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-04-27 10:02   ` Thierry Reding
@ 2018-04-27 12:01     ` Dmitry Osipenko
  2018-04-27 12:36     ` Robin Murphy
  2018-05-07  7:59     ` Joerg Roedel
  2 siblings, 0 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2018-04-27 12:01 UTC (permalink / raw)
  To: Thierry Reding, Joerg Roedel
  Cc: Jonathan Hunter, iommu, linux-tegra, linux-kernel

On 27.04.2018 13:02, Thierry Reding wrote:
> On Mon, Apr 09, 2018 at 11:07:22PM +0300, Dmitry Osipenko wrote:
>> Currently GART writes one page entry at a time. More optimal would be to
>> aggregate the writes and flush BUS buffer in the end, this gives map/unmap
>> 10-40% (depending on size of mapping) performance boost compared to a
>> flushing after each entry update.
>>
>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>> ---
>>  drivers/iommu/tegra-gart.c | 63 +++++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 48 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
>> index 4a0607669d34..9f59f5f17661 100644
>> --- a/drivers/iommu/tegra-gart.c
>> +++ b/drivers/iommu/tegra-gart.c
>> @@ -36,7 +36,7 @@
>>  #define GART_APERTURE_SIZE	SZ_32M
>>  
>>  /* bitmap of the page sizes currently supported */
>> -#define GART_IOMMU_PGSIZES	(SZ_4K)
>> +#define GART_IOMMU_PGSIZES	GENMASK(24, 12)
> 
> That doesn't look right. The GART really only supports 4 KiB pages. You
> seem to be "emulating" more page sizes here in order to improve mapping
> performance. That seems wrong to me. I'm wondering if this couldn't be
> improved by a similar factor by simply moving the flushing into an
> implementation of ->iotlb_sync().
> 
> That said, it seems like ->iotlb_sync() is only used for unmapping, but
> I don't see a reason why iommu_map() wouldn't need to call it as well
> after going through several calls to ->map(). It seems to me like a
> driver that implements ->iotlb_sync() would want to use it to optimize
> for both the mapping and unmapping cases.

I vaguely remember looking at the IOMMU syncing, but decided to try at first the
way this patch is. I'll be happy to switch to map/unmap syncing, let's see what
Joerg would suggest.

> Joerg, I've gone over the git log and header files and I see no mention
> of why the TLB flush interface isn't used for mapping. Do you recall any
> special reasons why the same shouldn't be applied for mapping? Would you
> accept any patches doing this?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-04-27 10:02   ` Thierry Reding
  2018-04-27 12:01     ` Dmitry Osipenko
@ 2018-04-27 12:36     ` Robin Murphy
  2018-05-06 21:19       ` Dmitry Osipenko
  2018-05-07  7:59     ` Joerg Roedel
  2 siblings, 1 reply; 18+ messages in thread
From: Robin Murphy @ 2018-04-27 12:36 UTC (permalink / raw)
  To: Thierry Reding, Dmitry Osipenko, Joerg Roedel
  Cc: linux-tegra, iommu, linux-kernel, Jonathan Hunter

Hi Thierry,

On 27/04/18 11:02, Thierry Reding wrote:
> On Mon, Apr 09, 2018 at 11:07:22PM +0300, Dmitry Osipenko wrote:
>> Currently GART writes one page entry at a time. More optimal would be to
>> aggregate the writes and flush BUS buffer in the end, this gives map/unmap
>> 10-40% (depending on size of mapping) performance boost compared to a
>> flushing after each entry update.
>>
>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>> ---
>>   drivers/iommu/tegra-gart.c | 63 +++++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 48 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
>> index 4a0607669d34..9f59f5f17661 100644
>> --- a/drivers/iommu/tegra-gart.c
>> +++ b/drivers/iommu/tegra-gart.c
>> @@ -36,7 +36,7 @@
>>   #define GART_APERTURE_SIZE	SZ_32M
>>   
>>   /* bitmap of the page sizes currently supported */
>> -#define GART_IOMMU_PGSIZES	(SZ_4K)
>> +#define GART_IOMMU_PGSIZES	GENMASK(24, 12)
> 
> That doesn't look right. The GART really only supports 4 KiB pages. You
> seem to be "emulating" more page sizes here in order to improve mapping
> performance. That seems wrong to me. I'm wondering if this couldn't be
> improved by a similar factor by simply moving the flushing into an
> implementation of ->iotlb_sync().
> 
> That said, it seems like ->iotlb_sync() is only used for unmapping, but
> I don't see a reason why iommu_map() wouldn't need to call it as well
> after going through several calls to ->map(). It seems to me like a
> driver that implements ->iotlb_sync() would want to use it to optimize
> for both the mapping and unmapping cases.
> 
> Joerg, I've gone over the git log and header files and I see no mention
> of why the TLB flush interface isn't used for mapping. Do you recall any
> special reasons why the same shouldn't be applied for mapping? Would you
> accept any patches doing this?

In general, requiring TLB maintenance when transitioning from an invalid 
entry to a valid one tends to be the exception rather than the norm, and 
I think we ended up at the consensus that it wasn't worth the 
complication of trying to cater for this in the generic iotlb API.

To be fair, on simple hardware which doesn't implement multiple page 
sizes with associated walk depth/TLB pressure benefits for larger ones, 
there's no need for the IOMMU API (and/or the owner of the domain) to 
try harder to use them, so handling "compound" page sizes within the 
driver is a more reasonable thing to do. There is already some precedent 
for this in other drivers (e.g. mtk_iommu_v1).

Robin.

> 
> Thierry
> 
>>   
>>   #define GART_REG_BASE		0x24
>>   #define GART_CONFIG		(0x24 - GART_REG_BASE)
>> @@ -269,25 +269,21 @@ static void gart_iommu_domain_free(struct iommu_domain *domain)
>>   	kfree(gart_domain);
>>   }
>>   
>> -static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
>> -			  phys_addr_t pa, size_t bytes, int prot)
>> +static int gart_iommu_map_page(struct gart_device *gart,
>> +			       unsigned long iova,
>> +			       phys_addr_t pa)
>>   {
>> -	struct gart_domain *gart_domain = to_gart_domain(domain);
>> -	struct gart_device *gart = gart_domain->gart;
>>   	unsigned long flags;
>>   	unsigned long pfn;
>>   	unsigned long pte;
>>   
>> -	if (!gart_iova_range_valid(gart, iova, bytes))
>> -		return -EINVAL;
>> -
>> -	spin_lock_irqsave(&gart->pte_lock, flags);
>>   	pfn = __phys_to_pfn(pa);
>>   	if (!pfn_valid(pfn)) {
>>   		dev_err(gart->dev, "Invalid page: %pa\n", &pa);
>> -		spin_unlock_irqrestore(&gart->pte_lock, flags);
>>   		return -EINVAL;
>>   	}
>> +
>> +	spin_lock_irqsave(&gart->pte_lock, flags);
>>   	if (gart_debug) {
>>   		pte = gart_read_pte(gart, iova);
>>   		if (pte & GART_ENTRY_PHYS_ADDR_VALID) {
>> @@ -297,8 +293,41 @@ static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
>>   		}
>>   	}
>>   	gart_set_pte(gart, iova, GART_PTE(pfn));
>> +	spin_unlock_irqrestore(&gart->pte_lock, flags);
>> +
>> +	return 0;
>> +}
>> +
>> +static int gart_iommu_map(struct iommu_domain *domain, unsigned long iova,
>> +			  phys_addr_t pa, size_t bytes, int prot)
>> +{
>> +	struct gart_domain *gart_domain = to_gart_domain(domain);
>> +	struct gart_device *gart = gart_domain->gart;
>> +	size_t mapped;
>> +	int ret = -1;
>> +
>> +	if (!gart_iova_range_valid(gart, iova, bytes))
>> +		return -EINVAL;
>> +
>> +	for (mapped = 0; mapped < bytes; mapped += GART_PAGE_SIZE) {
>> +		ret = gart_iommu_map_page(gart, iova + mapped, pa + mapped);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>>   	FLUSH_GART_REGS(gart);
>> +	return ret;
>> +}
>> +
>> +static int gart_iommu_unmap_page(struct gart_device *gart,
>> +				 unsigned long iova)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&gart->pte_lock, flags);
>> +	gart_set_pte(gart, iova, 0);
>>   	spin_unlock_irqrestore(&gart->pte_lock, flags);
>> +
>>   	return 0;
>>   }
>>   
>> @@ -307,16 +336,20 @@ static size_t gart_iommu_unmap(struct iommu_domain *domain, unsigned long iova,
>>   {
>>   	struct gart_domain *gart_domain = to_gart_domain(domain);
>>   	struct gart_device *gart = gart_domain->gart;
>> -	unsigned long flags;
>> +	size_t unmapped;
>> +	int ret;
>>   
>>   	if (!gart_iova_range_valid(gart, iova, bytes))
>>   		return 0;
>>   
>> -	spin_lock_irqsave(&gart->pte_lock, flags);
>> -	gart_set_pte(gart, iova, 0);
>> +	for (unmapped = 0; unmapped < bytes; unmapped += GART_PAGE_SIZE) {
>> +		ret = gart_iommu_unmap_page(gart, iova + unmapped);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>>   	FLUSH_GART_REGS(gart);
>> -	spin_unlock_irqrestore(&gart->pte_lock, flags);
>> -	return bytes;
>> +	return unmapped;
>>   }
>>   
>>   static phys_addr_t gart_iommu_iova_to_phys(struct iommu_domain *domain,
>> -- 
>> 2.16.3
>>
>>
>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 0/4] Tegra GART fixes and improvements
  2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko
                   ` (3 preceding siblings ...)
  2018-04-09 20:07 ` [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Dmitry Osipenko
@ 2018-05-03 12:52 ` Joerg Roedel
  4 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2018-05-03 12:52 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Thierry Reding, Jonathan Hunter, iommu, linux-tegra, linux-kernel

On Mon, Apr 09, 2018 at 11:07:18PM +0300, Dmitry Osipenko wrote:
> GART driver wasn't ever been utilized in upstream, but finally this should
> change sometime soon with Tegra's DRM driver rework. In general GART driver
> works fine, though there are couple things that could be improved.
> 
> Dmitry Osipenko (4):
>   iommu/tegra: gart: Add debugging facility
>   iommu/tegra: gart: Fix gart_iommu_unmap()

Applied these two patches, thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-04-27 12:36     ` Robin Murphy
@ 2018-05-06 21:19       ` Dmitry Osipenko
  2018-05-07  8:04         ` Joerg Roedel
  0 siblings, 1 reply; 18+ messages in thread
From: Dmitry Osipenko @ 2018-05-06 21:19 UTC (permalink / raw)
  To: Robin Murphy, Thierry Reding, Joerg Roedel
  Cc: linux-tegra, iommu, linux-kernel, Jonathan Hunter

On 27.04.2018 15:36, Robin Murphy wrote:
> Hi Thierry,
> 
> On 27/04/18 11:02, Thierry Reding wrote:
>> On Mon, Apr 09, 2018 at 11:07:22PM +0300, Dmitry Osipenko wrote:
>>> Currently GART writes one page entry at a time. More optimal would be to
>>> aggregate the writes and flush BUS buffer in the end, this gives map/unmap
>>> 10-40% (depending on size of mapping) performance boost compared to a
>>> flushing after each entry update.
>>>
>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>> ---
>>>   drivers/iommu/tegra-gart.c | 63 +++++++++++++++++++++++++++++++++++-----------
>>>   1 file changed, 48 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
>>> index 4a0607669d34..9f59f5f17661 100644
>>> --- a/drivers/iommu/tegra-gart.c
>>> +++ b/drivers/iommu/tegra-gart.c
>>> @@ -36,7 +36,7 @@
>>>   #define GART_APERTURE_SIZE    SZ_32M
>>>     /* bitmap of the page sizes currently supported */
>>> -#define GART_IOMMU_PGSIZES    (SZ_4K)
>>> +#define GART_IOMMU_PGSIZES    GENMASK(24, 12)
>>
>> That doesn't look right. The GART really only supports 4 KiB pages. You
>> seem to be "emulating" more page sizes here in order to improve mapping
>> performance. That seems wrong to me. I'm wondering if this couldn't be
>> improved by a similar factor by simply moving the flushing into an
>> implementation of ->iotlb_sync().
>>
>> That said, it seems like ->iotlb_sync() is only used for unmapping, but
>> I don't see a reason why iommu_map() wouldn't need to call it as well
>> after going through several calls to ->map(). It seems to me like a
>> driver that implements ->iotlb_sync() would want to use it to optimize
>> for both the mapping and unmapping cases.
>>
>> Joerg, I've gone over the git log and header files and I see no mention
>> of why the TLB flush interface isn't used for mapping. Do you recall any
>> special reasons why the same shouldn't be applied for mapping? Would you
>> accept any patches doing this?
> 
> In general, requiring TLB maintenance when transitioning from an invalid entry
> to a valid one tends to be the exception rather than the norm, and I think we
> ended up at the consensus that it wasn't worth the complication of trying to
> cater for this in the generic iotlb API.
> 
> To be fair, on simple hardware which doesn't implement multiple page sizes with
> associated walk depth/TLB pressure benefits for larger ones, there's no need for
> the IOMMU API (and/or the owner of the domain) to try harder to use them, so
> handling "compound" page sizes within the driver is a more reasonable thing to
> do. There is already some precedent for this in other drivers (e.g. mtk_iommu_v1).
Probably the best variant would be to give an explicit control over syncing to a
user of the IOMMU API, like for example device driver may perform multiple
mappings / unmappings and then sync/flush in the end. I'm not sure that it's
really worth the hassle to shuffle the API right now, maybe we can implement it
later if needed. Joerg, do you have objections to a 'compound page' approach?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-04-27 10:02   ` Thierry Reding
  2018-04-27 12:01     ` Dmitry Osipenko
  2018-04-27 12:36     ` Robin Murphy
@ 2018-05-07  7:59     ` Joerg Roedel
  2018-05-07 15:46       ` Dmitry Osipenko
  2 siblings, 1 reply; 18+ messages in thread
From: Joerg Roedel @ 2018-05-07  7:59 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Dmitry Osipenko, Jonathan Hunter, iommu, linux-tegra, linux-kernel

On Fri, Apr 27, 2018 at 12:02:02PM +0200, Thierry Reding wrote:
> Joerg, I've gone over the git log and header files and I see no mention
> of why the TLB flush interface isn't used for mapping. Do you recall any
> special reasons why the same shouldn't be applied for mapping? Would you
> accept any patches doing this?

Yeah, the reason was that we assumed that only emulated IOMMUs ever need
flushing in the mapping path, and there is no reason to optimize for
that. Especially when a call to iotlb_sync() in the mapping path would
hurt most other users.

Does the tegra-gart actually need it too?



	Joerg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-05-06 21:19       ` Dmitry Osipenko
@ 2018-05-07  8:04         ` Joerg Roedel
  2018-05-07 15:51           ` Dmitry Osipenko
  0 siblings, 1 reply; 18+ messages in thread
From: Joerg Roedel @ 2018-05-07  8:04 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Robin Murphy, Thierry Reding, linux-tegra, iommu, linux-kernel,
	Jonathan Hunter

On Mon, May 07, 2018 at 12:19:01AM +0300, Dmitry Osipenko wrote:
> Probably the best variant would be to give an explicit control over syncing to a
> user of the IOMMU API, like for example device driver may perform multiple
> mappings / unmappings and then sync/flush in the end. I'm not sure that it's
> really worth the hassle to shuffle the API right now, maybe we can implement it
> later if needed. Joerg, do you have objections to a 'compound page' approach?

Have you measured the performance difference on both variants? The
compound-page approach only works for cases when the physical memory you
map contiguous and correctly aligned.

If it is really needed I would prefer a separate iotlb_sync_map()
call-back that is just NULL when not needed. This way all users that
don't need it only get a minimal penalty in the mapping path and you
don't have any requirements on the physical memory you map to get good
performance.


	Joerg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-05-07  7:59     ` Joerg Roedel
@ 2018-05-07 15:46       ` Dmitry Osipenko
  0 siblings, 0 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2018-05-07 15:46 UTC (permalink / raw)
  To: Joerg Roedel, Thierry Reding
  Cc: Jonathan Hunter, iommu, linux-tegra, linux-kernel

On 07.05.2018 10:59, Joerg Roedel wrote:
> On Fri, Apr 27, 2018 at 12:02:02PM +0200, Thierry Reding wrote:
>> Joerg, I've gone over the git log and header files and I see no mention
>> of why the TLB flush interface isn't used for mapping. Do you recall any
>> special reasons why the same shouldn't be applied for mapping? Would you
>> accept any patches doing this?
> 
> Yeah, the reason was that we assumed that only emulated IOMMUs ever need
> flushing in the mapping path, and there is no reason to optimize for
> that. Especially when a call to iotlb_sync() in the mapping path would
> hurt most other users.
> 
> Does the tegra-gart actually need it too?

Tegra-GART is exactly an emulated IOMMU, it doesn't have anything like TLB and
it's simply a remapping table. What is actually needed for the GART - is to make
sure that the remapping table modifications reach GART HW before anything tries
to touch the modified page entries. Presumable there is some kind of a HW buffer
that accumulates multiple registers writes and issues them in bursts to improve
performance, read-after-write is the way to flush that buffer.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-05-07  8:04         ` Joerg Roedel
@ 2018-05-07 15:51           ` Dmitry Osipenko
  2018-05-07 17:38             ` Dmitry Osipenko
  0 siblings, 1 reply; 18+ messages in thread
From: Dmitry Osipenko @ 2018-05-07 15:51 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Robin Murphy, Thierry Reding, linux-tegra, iommu, linux-kernel,
	Jonathan Hunter

On 07.05.2018 11:04, Joerg Roedel wrote:
> On Mon, May 07, 2018 at 12:19:01AM +0300, Dmitry Osipenko wrote:
>> Probably the best variant would be to give an explicit control over syncing to a
>> user of the IOMMU API, like for example device driver may perform multiple
>> mappings / unmappings and then sync/flush in the end. I'm not sure that it's
>> really worth the hassle to shuffle the API right now, maybe we can implement it
>> later if needed. Joerg, do you have objections to a 'compound page' approach?
> 
> Have you measured the performance difference on both variants? The
> compound-page approach only works for cases when the physical memory you
> map contiguous and correctly aligned.

Yes, previously I actually only tested mapping of the contiguous allocations
(used for memory isolation purposes). But now I've re-tested all variants and
got somewhat interesting results.

Firstly it is not that easy to test a really sparse mapping simply because
memory allocator produces sparse allocation only when memory is _really_
fragmented. Pretty much all of the time the sparse allocations are contiguous or
they consist of a very few chunks that do not impose any noticeable performance
impact.

Secondly, the interesting part is that mapping / unmapping of a contiguous
allocation (CMA using DMA API) is slower by ~50% then doing it for a sparse
allocation (get_pages using bare IOMMU API). /I think/ it's a shortcoming of the
arch/arm/mm/dma-mapping.c, which also suffers from other inflexibilities that
Thierry faced recently. Though I haven't really tried to figure out what is the
bottleneck yet and Thierry was going to re-write ARM's dma-mapping
implementation anyway, I'll take a closer look at this issue a bit later.

I've implemented the iotlb_sync_map() and tested things with it. The end result
is the same as for the compound page approach, simply because actual allocations
are pretty much always contiguous.

> If it is really needed I would prefer a separate iotlb_sync_map()
> call-back that is just NULL when not needed. This way all users that
> don't need it only get a minimal penalty in the mapping path and you
> don't have any requirements on the physical memory you map to get good
> performance.
Summarizing, the iotlb_sync_map() is indeed better way. As you rightly noticed,
that approach is also optimal for the non-contiguous cases as we won't have to
flush on mapping of each contiguous chunk of the sparse allocation, but after
the whole mapping is done.

Thierry, Robin and Joerg - thanks for your input, I'll prepare patches
implementing the iotlb_sync_map.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap
  2018-05-07 15:51           ` Dmitry Osipenko
@ 2018-05-07 17:38             ` Dmitry Osipenko
  0 siblings, 0 replies; 18+ messages in thread
From: Dmitry Osipenko @ 2018-05-07 17:38 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Robin Murphy, Thierry Reding, linux-tegra, iommu, linux-kernel,
	Jonathan Hunter

On 07.05.2018 18:51, Dmitry Osipenko wrote:

[snip]

> Secondly, the interesting part is that mapping / unmapping of a contiguous
> allocation (CMA using DMA API) is slower by ~50% then doing it for a sparse
> allocation (get_pages using bare IOMMU API). /I think/ it's a shortcoming of the
> arch/arm/mm/dma-mapping.c, which also suffers from other inflexibilities that
> Thierry faced recently. Though I haven't really tried to figure out what is the
> bottleneck yet and Thierry was going to re-write ARM's dma-mapping
> implementation anyway, I'll take a closer look at this issue a bit later.

Please scratch my accusation of ARM's dma-mapping, it's not the culprit at all.
I completely forgot that in a case of sparse allocation displays framebuffer
IOMMU mapping is "pinned" to the GART and hence it's not getting dynamically
mapped / unmapped during of my testing. I also forgot to set CPU freq governor
to "perfomance", that reduced 50% to 20% of the above perf difference. The rest
of the testing is unaffected, flushing after whole mapping is still much more
efficient than flushing after modification of each page entry. And yet again,
performance of sparse mapping is nearly the same as of contiguous mapping unless
sparse allocation is large and _very_ fragmented.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-05-07 17:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko
2018-04-09 20:07 ` [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility Dmitry Osipenko
2018-04-27  9:46   ` Thierry Reding
2018-04-09 20:07 ` [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap() Dmitry Osipenko
2018-04-27  9:43   ` Thierry Reding
2018-04-09 20:07 ` [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages Dmitry Osipenko
2018-04-27  9:49   ` Thierry Reding
2018-04-09 20:07 ` [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Dmitry Osipenko
2018-04-27 10:02   ` Thierry Reding
2018-04-27 12:01     ` Dmitry Osipenko
2018-04-27 12:36     ` Robin Murphy
2018-05-06 21:19       ` Dmitry Osipenko
2018-05-07  8:04         ` Joerg Roedel
2018-05-07 15:51           ` Dmitry Osipenko
2018-05-07 17:38             ` Dmitry Osipenko
2018-05-07  7:59     ` Joerg Roedel
2018-05-07 15:46       ` Dmitry Osipenko
2018-05-03 12:52 ` [PATCH v1 0/4] Tegra GART fixes and improvements Joerg Roedel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).