LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/3] resource: find_next_iomem_res() improvements
@ 2019-06-13  4:59 Nadav Amit
  2019-06-13  4:59 ` [PATCH 1/3] resource: Fix locking in find_next_iomem_res() Nadav Amit
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-13  4:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Bjorn Helgaas,
	Ingo Molnar

Running some microbenchmarks on dax keeps showing find_next_iomem_res()
as a place in which significant amount of time is spent. It appears that
in order to determine the cacheability that is required for the PTE,
lookup_memtype() is called, and this one traverses the resources list in
an inefficient manner. This patch-set tries to improve this situation.

The first patch fixes what appears to be unsafe locking in
find_next_iomem_res().

The second patch improves performance by searching the top level first,
to find a matching range, before going down to the children. The third
patch improves the performance by caching the top level resource of the
last found resource in find_next_iomem_res().

Both of these optimizations are based on the ranges in the top level not
overlapping each other.

Running sysbench on dax (Haswell, pmem emulation, with write_cache
disabled):

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Provides the following results:

		events (avg/stddev)
		-------------------
  5.2-rc3:	1247669.0000/16075.39
  +patches:	1293408.5000/7720.69 (+3.5%)

Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>

Nadav Amit (3):
  resource: Fix locking in find_next_iomem_res()
  resource: Avoid unnecessary lookups in find_next_iomem_res()
  resource: Introduce resource cache

 kernel/resource.c | 96 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 79 insertions(+), 17 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/3] resource: Fix locking in find_next_iomem_res()
  2019-06-13  4:59 [PATCH 0/3] resource: find_next_iomem_res() improvements Nadav Amit
@ 2019-06-13  4:59 ` Nadav Amit
       [not found]   ` <20190615221557.CD1492183F@mail.kernel.org>
  2019-06-18  4:26   ` Andrew Morton
  2019-06-13  4:59 ` [PATCH 2/3] resource: Avoid unnecessary lookups " Nadav Amit
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-13  4:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, stable, Borislav Petkov,
	Toshi Kani, Peter Zijlstra, Dave Hansen, Dan Williams,
	Bjorn Helgaas, Ingo Molnar

Since resources can be removed, locking should ensure that the resource
is not removed while accessing it. However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.

Keep holding the lock while the data is copied. While at it, change the
return value to a more informative value. It is disregarded by the
callers.

Fixes: ff3cc952d3f00 ("resource: Add remove_resource interface")
Cc: stable@vger.kernel.org
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 kernel/resource.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 158f04ec1d4f..c0f7ba0ece52 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -365,16 +365,16 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 			break;
 	}
 
+	if (p) {
+		/* copy data */
+		res->start = max(start, p->start);
+		res->end = min(end, p->end);
+		res->flags = p->flags;
+		res->desc = p->desc;
+	}
+
 	read_unlock(&resource_lock);
-	if (!p)
-		return -1;
-
-	/* copy data */
-	res->start = max(start, p->start);
-	res->end = min(end, p->end);
-	res->flags = p->flags;
-	res->desc = p->desc;
-	return 0;
+	return p ? 0 : -ENODEV;
 }
 
 static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/3] resource: Avoid unnecessary lookups in find_next_iomem_res()
  2019-06-13  4:59 [PATCH 0/3] resource: find_next_iomem_res() improvements Nadav Amit
  2019-06-13  4:59 ` [PATCH 1/3] resource: Fix locking in find_next_iomem_res() Nadav Amit
@ 2019-06-13  4:59 ` Nadav Amit
  2019-06-13  4:59 ` [PATCH 3/3] resource: Introduce resource cache Nadav Amit
  2019-06-18  6:44 ` [PATCH 0/3] resource: find_next_iomem_res() improvements Dan Williams
  3 siblings, 0 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-13  4:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Bjorn Helgaas,
	Ingo Molnar

find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.

Improve performance by not considering children of the tree if the top
level does not match. Since the range of the parents should include the
range of the children such check is redundant.

Running sysbench on dax (pmem emulation, with write_cache disabled):

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Provides the following results:

		events (avg/stddev)
		-------------------
  5.2-rc3:	1247669.0000/16075.39
  w/patch:	1286320.5000/16402.72	(+3%)

Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 kernel/resource.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index c0f7ba0ece52..51c3bf6d9b98 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -342,6 +342,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 			       unsigned long flags, unsigned long desc,
 			       bool first_lvl, struct resource *res)
 {
+	bool siblings_only = true;
 	struct resource *p;
 
 	if (!res)
@@ -352,17 +353,31 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 
 	read_lock(&resource_lock);
 
-	for (p = iomem_resource.child; p; p = next_resource(p, first_lvl)) {
-		if ((p->flags & flags) != flags)
-			continue;
-		if ((desc != IORES_DESC_NONE) && (desc != p->desc))
-			continue;
+	for (p = iomem_resource.child; p; p = next_resource(p, siblings_only)) {
+		/* If we passed the resource we are looking for, stop */
 		if (p->start > end) {
 			p = NULL;
 			break;
 		}
-		if ((p->end >= start) && (p->start <= end))
-			break;
+
+		/* Skip until we find a range that matches what we look for */
+		if (p->end < start)
+			continue;
+
+		/*
+		 * Now that we found a range that matches what we look for,
+		 * check the flags and the descriptor. If we were not asked to
+		 * use only the first level, start looking at children as well.
+		 */
+		siblings_only = first_lvl;
+
+		if ((p->flags & flags) != flags)
+			continue;
+		if ((desc != IORES_DESC_NONE) && (desc != p->desc))
+			continue;
+
+		/* Found a match, break */
+		break;
 	}
 
 	if (p) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/3] resource: Introduce resource cache
  2019-06-13  4:59 [PATCH 0/3] resource: find_next_iomem_res() improvements Nadav Amit
  2019-06-13  4:59 ` [PATCH 1/3] resource: Fix locking in find_next_iomem_res() Nadav Amit
  2019-06-13  4:59 ` [PATCH 2/3] resource: Avoid unnecessary lookups " Nadav Amit
@ 2019-06-13  4:59 ` Nadav Amit
       [not found]   ` <20190615221607.4B44521841@mail.kernel.org>
  2019-06-18  4:57   ` Andrew Morton
  2019-06-18  6:44 ` [PATCH 0/3] resource: find_next_iomem_res() improvements Dan Williams
  3 siblings, 2 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-13  4:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Bjorn Helgaas,
	Ingo Molnar

For efficient search of resources, as needed to determine the memory
type for dax page-faults, introduce a cache of the most recently used
top-level resource. Caching the top-level should be safe as ranges in
that level do not overlap (unlike those of lower levels).

Keep the cache per-cpu to avoid possible contention. Whenever a resource
is added, removed or changed, invalidate all the resources. The
invalidation takes place when the resource_lock is taken for write,
preventing possible races.

This patch provides relatively small performance improvements over the
previous patch (~0.5% on sysbench), but can benefit systems with many
resources.

Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 kernel/resource.c | 51 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 51c3bf6d9b98..ab659d0eb52a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -53,6 +53,12 @@ struct resource_constraint {
 
 static DEFINE_RWLOCK(resource_lock);
 
+/*
+ * Cache of the top-level resource that was most recently use by
+ * find_next_iomem_res().
+ */
+static DEFINE_PER_CPU(struct resource *, resource_cache);
+
 /*
  * For memory hotplug, there is no way to free resource entries allocated
  * by boot mem after the system is up. So for reusing the resource entry
@@ -262,9 +268,20 @@ static void __release_child_resources(struct resource *r)
 	}
 }
 
+static void invalidate_resource_cache(void)
+{
+	int cpu;
+
+	lockdep_assert_held_exclusive(&resource_lock);
+
+	for_each_possible_cpu(cpu)
+		per_cpu(resource_cache, cpu) = NULL;
+}
+
 void release_child_resources(struct resource *r)
 {
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	__release_child_resources(r);
 	write_unlock(&resource_lock);
 }
@@ -281,6 +298,7 @@ struct resource *request_resource_conflict(struct resource *root, struct resourc
 	struct resource *conflict;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	conflict = __request_resource(root, new);
 	write_unlock(&resource_lock);
 	return conflict;
@@ -312,6 +330,7 @@ int release_resource(struct resource *old)
 	int retval;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	retval = __release_resource(old, true);
 	write_unlock(&resource_lock);
 	return retval;
@@ -343,7 +362,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 			       bool first_lvl, struct resource *res)
 {
 	bool siblings_only = true;
-	struct resource *p;
+	struct resource *p, *res_first_lvl;
 
 	if (!res)
 		return -EINVAL;
@@ -353,7 +372,15 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 
 	read_lock(&resource_lock);
 
-	for (p = iomem_resource.child; p; p = next_resource(p, siblings_only)) {
+	/*
+	 * If our cached entry preceded or matches the range that we are looking
+	 * for, start with it. Otherwise, start from the beginning.
+	 */
+	p = res_first_lvl = this_cpu_read(resource_cache);
+	if (!p || start < p->start)
+		p = iomem_resource.child;
+
+	for (; p; p = next_resource(p, siblings_only)) {
 		/* If we passed the resource we are looking for, stop */
 		if (p->start > end) {
 			p = NULL;
@@ -364,6 +391,10 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 		if (p->end < start)
 			continue;
 
+		/* We only cache the first level for correctness */
+		if (p->parent == &iomem_resource)
+			res_first_lvl = p;
+
 		/*
 		 * Now that we found a range that matches what we look for,
 		 * check the flags and the descriptor. If we were not asked to
@@ -386,6 +417,9 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 		res->end = min(end, p->end);
 		res->flags = p->flags;
 		res->desc = p->desc;
+
+		/* Update the cache */
+		this_cpu_write(resource_cache, res_first_lvl);
 	}
 
 	read_unlock(&resource_lock);
@@ -674,6 +708,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
 	struct resource *conflict;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 
 	if ((err = __find_resource(root, old, &new, newsize, constraint)))
 		goto out;
@@ -744,6 +779,7 @@ int allocate_resource(struct resource *root, struct resource *new,
 	}
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	err = find_resource(root, new, size, &constraint);
 	if (err >= 0 && __request_resource(root, new))
 		err = -EBUSY;
@@ -848,6 +884,7 @@ struct resource *insert_resource_conflict(struct resource *parent, struct resour
 	struct resource *conflict;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	conflict = __insert_resource(parent, new);
 	write_unlock(&resource_lock);
 	return conflict;
@@ -886,6 +923,7 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
 		return;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	for (;;) {
 		struct resource *conflict;
 
@@ -926,6 +964,7 @@ int remove_resource(struct resource *old)
 	int retval;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	retval = __release_resource(old, false);
 	write_unlock(&resource_lock);
 	return retval;
@@ -985,6 +1024,7 @@ int adjust_resource(struct resource *res, resource_size_t start,
 	int result;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 	result = __adjust_resource(res, start, size);
 	write_unlock(&resource_lock);
 	return result;
@@ -1059,6 +1099,8 @@ reserve_region_with_split(struct resource *root, resource_size_t start,
 	int abort = 0;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
+
 	if (root->start > start || root->end < end) {
 		pr_err("requested range [0x%llx-0x%llx] not in root %pr\n",
 		       (unsigned long long)start, (unsigned long long)end,
@@ -1135,6 +1177,7 @@ struct resource * __request_region(struct resource *parent,
 	res->end = start + n - 1;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 
 	for (;;) {
 		struct resource *conflict;
@@ -1168,6 +1211,7 @@ struct resource * __request_region(struct resource *parent,
 			schedule();
 			remove_wait_queue(&muxed_resource_wait, &wait);
 			write_lock(&resource_lock);
+			invalidate_resource_cache();
 			continue;
 		}
 		/* Uhhuh, that didn't work out.. */
@@ -1198,6 +1242,7 @@ void __release_region(struct resource *parent, resource_size_t start,
 	end = start + n - 1;
 
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 
 	for (;;) {
 		struct resource *res = *p;
@@ -1211,6 +1256,7 @@ void __release_region(struct resource *parent, resource_size_t start,
 			}
 			if (res->start != start || res->end != end)
 				break;
+
 			*p = res->sibling;
 			write_unlock(&resource_lock);
 			if (res->flags & IORESOURCE_MUXED)
@@ -1268,6 +1314,7 @@ int release_mem_region_adjustable(struct resource *parent,
 
 	p = &parent->child;
 	write_lock(&resource_lock);
+	invalidate_resource_cache();
 
 	while ((res = *p)) {
 		if (res->start >= end)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
       [not found]   ` <20190615221607.4B44521841@mail.kernel.org>
@ 2019-06-17 17:20     ` Nadav Amit
  0 siblings, 0 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-17 17:20 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrew Morton, LKML, Linux-MM, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Bjorn Helgaas,
	Ingo Molnar, stable

> On Jun 15, 2019, at 3:16 PM, Sasha Levin <sashal@kernel.org> wrote:
> 
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: ff3cc952d3f0 resource: Add remove_resource interface.

This commit (Patch 3/3) does not have the “Fixes:” tag (and it is a
performance enhancement), so I don’t know why it was processed.

IOW: please do not backport it.

> The bot has tested the following trees: v5.1.9, v4.19.50, v4.14.125, v4.9.181.
> 
> v5.1.9: Build OK!
> v4.19.50: Failed to apply! Possible dependencies:
>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>    7a53bb309eb3 ("resource: Fix locking in find_next_iomem_res()")
>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
> 
> v4.14.125: Failed to apply! Possible dependencies:
>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>    0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>    1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>    4ac2aed837cb ("resource: Consolidate resource walking code")
>    7a53bb309eb3 ("resource: Fix locking in find_next_iomem_res()")
>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
> 
> v4.9.181: Failed to apply! Possible dependencies:
>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>    0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>    1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>    4ac2aed837cb ("resource: Consolidate resource walking code")
>    60fe3910bb02 ("kexec_file: Allow arch-specific memory walking for kexec_add_buffer")
>    7a53bb309eb3 ("resource: Fix locking in find_next_iomem_res()")
>    a0458284f062 ("powerpc: Add support code for kexec_file_load()")
>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>    da6658859b9c ("powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.")
>    ec2b9bfaac44 ("kexec_file: Change kexec_add_buffer to take kexec_buf as argument.")
> 
> 
> How should we proceed with this patch?
> 
> --
> Thanks,
> Sasha



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] resource: Fix locking in find_next_iomem_res()
       [not found]   ` <20190615221557.CD1492183F@mail.kernel.org>
@ 2019-06-17 19:14     ` Nadav Amit
  2019-06-18  0:55       ` Sasha Levin
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-06-17 19:14 UTC (permalink / raw)
  To: Sasha Levin, Bjorn Helgaas
  Cc: Andrew Morton, LKML, linux-mm, stable, Borislav Petkov,
	Toshi Kani, Peter Zijlstra, Dave Hansen, Dan Williams,
	Bjorn Helgaas, Ingo Molnar

> On Jun 15, 2019, at 3:15 PM, Sasha Levin <sashal@kernel.org> wrote:
> 
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: ff3cc952d3f0 resource: Add remove_resource interface.
> 
> The bot has tested the following trees: v5.1.9, v4.19.50, v4.14.125, v4.9.181.
> 
> v5.1.9: Build OK!
> v4.19.50: Failed to apply! Possible dependencies:
>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
> 
> v4.14.125: Failed to apply! Possible dependencies:
>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>    0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>    1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>    4ac2aed837cb ("resource: Consolidate resource walking code")
>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
> 
> v4.9.181: Failed to apply! Possible dependencies:
>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>    0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>    1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>    4ac2aed837cb ("resource: Consolidate resource walking code")
>    60fe3910bb02 ("kexec_file: Allow arch-specific memory walking for kexec_add_buffer")
>    a0458284f062 ("powerpc: Add support code for kexec_file_load()")
>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>    da6658859b9c ("powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.")
>    ec2b9bfaac44 ("kexec_file: Change kexec_add_buffer to take kexec_buf as argument.")

Is there a reason 010a93bf97c7 ("resource: Fix find_next_iomem_res()
iteration issue”) was not backported?

For 4.19 the following passes compilation.

-- >8 --

From: Nadav Amit <namit@vmware.com>
Subject: [PATCH] resource: Fix locking in find_next_iomem_res()

Since resources can be removed, locking should ensure that the resource
is not removed while accessing it. However, find_next_iomem_res() does
not hold the lock while copying the data of the resource. Keep holding
the lock while the data is copied.

Fixes: ff3cc952d3f00 ("resource: Add remove_resource interface")
Cc: stable@vger.kernel.org
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 kernel/resource.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..0201feade7d5 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -331,6 +331,7 @@ static int find_next_iomem_res(struct resource *res, unsigned long desc,
 	resource_size_t start, end;
 	struct resource *p;
 	bool sibling_only = false;
+	int r = 0;
 
 	BUG_ON(!res);
 
@@ -356,9 +357,11 @@ static int find_next_iomem_res(struct resource *res, unsigned long desc,
 			break;
 	}
 
-	read_unlock(&resource_lock);
-	if (!p)
-		return -1;
+	if (!p) {
+		r = -1;
+		goto out;
+	}
+
 	/* copy data */
 	if (res->start < p->start)
 		res->start = p->start;
@@ -366,7 +369,9 @@ static int find_next_iomem_res(struct resource *res, unsigned long desc,
 		res->end = p->end;
 	res->flags = p->flags;
 	res->desc = p->desc;
-	return 0;
+out:
+	read_unlock(&resource_lock);
+	return r;
 }
 
 static int __walk_iomem_res_desc(struct resource *res, unsigned long desc,
-- 
2.17.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] resource: Fix locking in find_next_iomem_res()
  2019-06-17 19:14     ` Nadav Amit
@ 2019-06-18  0:55       ` Sasha Levin
  2019-06-18  1:32         ` Nadav Amit
  0 siblings, 1 reply; 28+ messages in thread
From: Sasha Levin @ 2019-06-18  0:55 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Bjorn Helgaas, Andrew Morton, LKML, linux-mm, stable,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Dan Williams, Ingo Molnar

On Mon, Jun 17, 2019 at 07:14:53PM +0000, Nadav Amit wrote:
>> On Jun 15, 2019, at 3:15 PM, Sasha Levin <sashal@kernel.org> wrote:
>>
>> Hi,
>>
>> [This is an automated email]
>>
>> This commit has been processed because it contains a "Fixes:" tag,
>> fixing commit: ff3cc952d3f0 resource: Add remove_resource interface.
>>
>> The bot has tested the following trees: v5.1.9, v4.19.50, v4.14.125, v4.9.181.
>>
>> v5.1.9: Build OK!
>> v4.19.50: Failed to apply! Possible dependencies:
>>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>>
>> v4.14.125: Failed to apply! Possible dependencies:
>>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>>    0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>>    1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>>    4ac2aed837cb ("resource: Consolidate resource walking code")
>>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>>
>> v4.9.181: Failed to apply! Possible dependencies:
>>    010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>>    0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>>    1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>>    4ac2aed837cb ("resource: Consolidate resource walking code")
>>    60fe3910bb02 ("kexec_file: Allow arch-specific memory walking for kexec_add_buffer")
>>    a0458284f062 ("powerpc: Add support code for kexec_file_load()")
>>    a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>>    da6658859b9c ("powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.")
>>    ec2b9bfaac44 ("kexec_file: Change kexec_add_buffer to take kexec_buf as argument.")
>
>Is there a reason 010a93bf97c7 ("resource: Fix find_next_iomem_res()
>iteration issue”) was not backported?

Mostly because it's not tagged for stable :)

--
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] resource: Fix locking in find_next_iomem_res()
  2019-06-18  0:55       ` Sasha Levin
@ 2019-06-18  1:32         ` Nadav Amit
  0 siblings, 0 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-18  1:32 UTC (permalink / raw)
  To: Sasha Levin, Bjorn Helgaas
  Cc: Andrew Morton, LKML, linux-mm, stable, Borislav Petkov,
	Toshi Kani, Peter Zijlstra, Dave Hansen, Dan Williams,
	Ingo Molnar

> On Jun 17, 2019, at 5:55 PM, Sasha Levin <sashal@kernel.org> wrote:
> 
> On Mon, Jun 17, 2019 at 07:14:53PM +0000, Nadav Amit wrote:
>>> On Jun 15, 2019, at 3:15 PM, Sasha Levin <sashal@kernel.org> wrote:
>>> 
>>> Hi,
>>> 
>>> [This is an automated email]
>>> 
>>> This commit has been processed because it contains a "Fixes:" tag,
>>> fixing commit: ff3cc952d3f0 resource: Add remove_resource interface.
>>> 
>>> The bot has tested the following trees: v5.1.9, v4.19.50, v4.14.125, v4.9.181.
>>> 
>>> v5.1.9: Build OK!
>>> v4.19.50: Failed to apply! Possible dependencies:
>>>   010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>>>   a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>>> 
>>> v4.14.125: Failed to apply! Possible dependencies:
>>>   010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>>>   0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>>>   1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>>>   4ac2aed837cb ("resource: Consolidate resource walking code")
>>>   a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>>> 
>>> v4.9.181: Failed to apply! Possible dependencies:
>>>   010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
>>>   0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
>>>   1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
>>>   4ac2aed837cb ("resource: Consolidate resource walking code")
>>>   60fe3910bb02 ("kexec_file: Allow arch-specific memory walking for kexec_add_buffer")
>>>   a0458284f062 ("powerpc: Add support code for kexec_file_load()")
>>>   a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>>>   da6658859b9c ("powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.")
>>>   ec2b9bfaac44 ("kexec_file: Change kexec_add_buffer to take kexec_buf as argument.")
>> 
>> Is there a reason 010a93bf97c7 ("resource: Fix find_next_iomem_res()
>> iteration issue”) was not backported?
> 
> Mostly because it's not tagged for stable :)

Good point. Unfortunately, 010a93bf97c7 does not apply cleanly either.

Bjorn, do you think your patch should be backported?

If not, I can create a version of the patch that I sent for 4.14/4.9.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] resource: Fix locking in find_next_iomem_res()
  2019-06-13  4:59 ` [PATCH 1/3] resource: Fix locking in find_next_iomem_res() Nadav Amit
       [not found]   ` <20190615221557.CD1492183F@mail.kernel.org>
@ 2019-06-18  4:26   ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2019-06-18  4:26 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-kernel, linux-mm, stable, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Bjorn Helgaas,
	Ingo Molnar

On Wed, 12 Jun 2019 21:59:01 -0700 Nadav Amit <namit@vmware.com> wrote:

> Since resources can be removed, locking should ensure that the resource
> is not removed while accessing it. However, find_next_iomem_res() does
> not hold the lock while copying the data of the resource.

Looks right to me.

> Keep holding the lock while the data is copied. While at it, change the
> return value to a more informative value. It is disregarded by the
> callers.

The kerneldoc needs a resync:

--- a/kernel/resource.c~resource-fix-locking-in-find_next_iomem_res-fix
+++ a/kernel/resource.c
@@ -326,7 +326,7 @@ EXPORT_SYMBOL(release_resource);
  *
  * If a resource is found, returns 0 and @*res is overwritten with the part
  * of the resource that's within [@start..@end]; if none is found, returns
- * -1 or -EINVAL for other invalid parameters.
+ * -ENODEV.  Returns -EINVAL for invalid parameters.
  *
  * This function walks the whole tree and not just first level children
  * unless @first_lvl is true.
_


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-13  4:59 ` [PATCH 3/3] resource: Introduce resource cache Nadav Amit
       [not found]   ` <20190615221607.4B44521841@mail.kernel.org>
@ 2019-06-18  4:57   ` Andrew Morton
  2019-06-18  5:33     ` Nadav Amit
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2019-06-18  4:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-kernel, linux-mm, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Bjorn Helgaas,
	Ingo Molnar

On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@vmware.com> wrote:

> For efficient search of resources, as needed to determine the memory
> type for dax page-faults, introduce a cache of the most recently used
> top-level resource. Caching the top-level should be safe as ranges in
> that level do not overlap (unlike those of lower levels).
> 
> Keep the cache per-cpu to avoid possible contention. Whenever a resource
> is added, removed or changed, invalidate all the resources. The
> invalidation takes place when the resource_lock is taken for write,
> preventing possible races.
> 
> This patch provides relatively small performance improvements over the
> previous patch (~0.5% on sysbench), but can benefit systems with many
> resources.

> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -53,6 +53,12 @@ struct resource_constraint {
>  
>  static DEFINE_RWLOCK(resource_lock);
>  
> +/*
> + * Cache of the top-level resource that was most recently use by
> + * find_next_iomem_res().
> + */
> +static DEFINE_PER_CPU(struct resource *, resource_cache);

A per-cpu cache which is accessed under a kernel-wide read_lock looks a
bit odd - the latency getting at that rwlock will swamp the benefit of
isolating the CPUs from each other when accessing resource_cache.

On the other hand, if we have multiple CPUs running
find_next_iomem_res() concurrently then yes, I see the benefit.  Has
the benefit of using a per-cpu cache (rather than a kernel-wide one)
been quantified?


> @@ -262,9 +268,20 @@ static void __release_child_resources(struct resource *r)
>  	}
>  }
>  
> +static void invalidate_resource_cache(void)
> +{
> +	int cpu;
> +
> +	lockdep_assert_held_exclusive(&resource_lock);
> +
> +	for_each_possible_cpu(cpu)
> +		per_cpu(resource_cache, cpu) = NULL;
> +}

All the calls to invalidate_resource_cache() are rather a
maintainability issue - easy to miss one as the code evolves.

Can't we just make find_next_iomem_res() smarter?  For example, start
the lookup from the cached point and if that failed, do a full sweep?

> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();
> +			invalidate_resource_cache();
> +	invalidate_resource_cache();
> +	invalidate_resource_cache();

Ow.  I guess the maintainability situation can be improved by renaming
resource_lock to something else (to avoid mishaps) then adding wrapper
functions.  But still.  I can't say this is a super-exciting patch :(

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-18  4:57   ` Andrew Morton
@ 2019-06-18  5:33     ` Nadav Amit
  2019-06-18  5:40       ` Nadav Amit
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-06-18  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Linux-MM, Borislav Petkov, Toshi Kani, Peter Zijlstra,
	Dave Hansen, Dan Williams, Bjorn Helgaas, Ingo Molnar

> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@vmware.com> wrote:
> 
>> For efficient search of resources, as needed to determine the memory
>> type for dax page-faults, introduce a cache of the most recently used
>> top-level resource. Caching the top-level should be safe as ranges in
>> that level do not overlap (unlike those of lower levels).
>> 
>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>> is added, removed or changed, invalidate all the resources. The
>> invalidation takes place when the resource_lock is taken for write,
>> preventing possible races.
>> 
>> This patch provides relatively small performance improvements over the
>> previous patch (~0.5% on sysbench), but can benefit systems with many
>> resources.
> 
>> --- a/kernel/resource.c
>> +++ b/kernel/resource.c
>> @@ -53,6 +53,12 @@ struct resource_constraint {
>> 
>> static DEFINE_RWLOCK(resource_lock);
>> 
>> +/*
>> + * Cache of the top-level resource that was most recently use by
>> + * find_next_iomem_res().
>> + */
>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
> 
> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
> bit odd - the latency getting at that rwlock will swamp the benefit of
> isolating the CPUs from each other when accessing resource_cache.
> 
> On the other hand, if we have multiple CPUs running
> find_next_iomem_res() concurrently then yes, I see the benefit.  Has
> the benefit of using a per-cpu cache (rather than a kernel-wide one)
> been quantified?

No. I am not sure how easy it would be to measure it. On the other hander
the lock is not supposed to be contended (at most cases). At the time I saw
numbers that showed that stores to “exclusive" cache lines can be as
expensive as atomic operations [1]. I am not sure how up to date these
numbers are though. In the benchmark I ran, multiple CPUs ran
find_next_iomem_res() concurrently.

[1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf

> 
> 
>> @@ -262,9 +268,20 @@ static void __release_child_resources(struct resource *r)
>> 	}
>> }
>> 
>> +static void invalidate_resource_cache(void)
>> +{
>> +	int cpu;
>> +
>> +	lockdep_assert_held_exclusive(&resource_lock);
>> +
>> +	for_each_possible_cpu(cpu)
>> +		per_cpu(resource_cache, cpu) = NULL;
>> +}
> 
> All the calls to invalidate_resource_cache() are rather a
> maintainability issue - easy to miss one as the code evolves.
> 
> Can't we just make find_next_iomem_res() smarter?  For example, start
> the lookup from the cached point and if that failed, do a full sweep?

I may be able to do something like that to reduce the number of locations
that need to be updated, but you always need to invalidate if a resource is
removed. This might make the code more prone to bugs, since the logic of
when to invalidate becomes more complicated.

>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +			invalidate_resource_cache();
>> +	invalidate_resource_cache();
>> +	invalidate_resource_cache();
> 
> Ow.  I guess the maintainability situation can be improved by renaming
> resource_lock to something else (to avoid mishaps) then adding wrapper
> functions.  But still.  I can't say this is a super-exciting patch :(

I considered doing so, but I was not sure it is better. If you want I’ll
implement it using wrapper functions (but both lock and unlock would need 
to be wrapped for consistency).

The benefit of this patch over the previous one is not huge, and I don’t
know how to implement it any better (excluding wrapper function, if you
consider it better). If you want, you can just drop this patch.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-18  5:33     ` Nadav Amit
@ 2019-06-18  5:40       ` Nadav Amit
  2019-06-19 13:00         ` Bjorn Helgaas
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-06-18  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Linux-MM, Borislav Petkov, Toshi Kani, Peter Zijlstra,
	Dave Hansen, Dan Williams, Bjorn Helgaas, Ingo Molnar

> On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> 
>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@vmware.com> wrote:
>> 
>>> For efficient search of resources, as needed to determine the memory
>>> type for dax page-faults, introduce a cache of the most recently used
>>> top-level resource. Caching the top-level should be safe as ranges in
>>> that level do not overlap (unlike those of lower levels).
>>> 
>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>> is added, removed or changed, invalidate all the resources. The
>>> invalidation takes place when the resource_lock is taken for write,
>>> preventing possible races.
>>> 
>>> This patch provides relatively small performance improvements over the
>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>> resources.
>> 
>>> --- a/kernel/resource.c
>>> +++ b/kernel/resource.c
>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>> 
>>> static DEFINE_RWLOCK(resource_lock);
>>> 
>>> +/*
>>> + * Cache of the top-level resource that was most recently use by
>>> + * find_next_iomem_res().
>>> + */
>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>> 
>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>> bit odd - the latency getting at that rwlock will swamp the benefit of
>> isolating the CPUs from each other when accessing resource_cache.
>> 
>> On the other hand, if we have multiple CPUs running
>> find_next_iomem_res() concurrently then yes, I see the benefit.  Has
>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>> been quantified?
> 
> No. I am not sure how easy it would be to measure it. On the other hander
> the lock is not supposed to be contended (at most cases). At the time I saw
> numbers that showed that stores to “exclusive" cache lines can be as
> expensive as atomic operations [1]. I am not sure how up to date these
> numbers are though. In the benchmark I ran, multiple CPUs ran
> find_next_iomem_res() concurrently.
> 
> [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf

Just to clarify - the main motivation behind the per-cpu variable is not
about contention, but about the fact the different processes/threads that
run concurrently might use different resources.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-06-13  4:59 [PATCH 0/3] resource: find_next_iomem_res() improvements Nadav Amit
                   ` (2 preceding siblings ...)
  2019-06-13  4:59 ` [PATCH 3/3] resource: Introduce resource cache Nadav Amit
@ 2019-06-18  6:44 ` Dan Williams
  2019-06-18 17:42   ` Nadav Amit
  3 siblings, 1 reply; 28+ messages in thread
From: Dan Williams @ 2019-06-18  6:44 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <namit@vmware.com> wrote:
>
> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
> as a place in which significant amount of time is spent. It appears that
> in order to determine the cacheability that is required for the PTE,
> lookup_memtype() is called, and this one traverses the resources list in
> an inefficient manner. This patch-set tries to improve this situation.

Let's just do this lookup once per device, cache that, and replay it
to modified vmf_insert_* routines that trust the caller to already
know the pgprot_values.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-06-18  6:44 ` [PATCH 0/3] resource: find_next_iomem_res() improvements Dan Williams
@ 2019-06-18 17:42   ` Nadav Amit
  2019-06-18 18:30     ` Dan Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-06-18 17:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

> On Jun 17, 2019, at 11:44 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <namit@vmware.com> wrote:
>> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
>> as a place in which significant amount of time is spent. It appears that
>> in order to determine the cacheability that is required for the PTE,
>> lookup_memtype() is called, and this one traverses the resources list in
>> an inefficient manner. This patch-set tries to improve this situation.
> 
> Let's just do this lookup once per device, cache that, and replay it
> to modified vmf_insert_* routines that trust the caller to already
> know the pgprot_values.

IIUC, one device can have multiple regions with different characteristics,
which require difference cachability. Apparently, that is the reason there
is a tree of resources. Please be more specific about where you want to
cache it, please.

Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
see being done in quite a few cases), but I don’t know the code well enough
to be certain that every vma should have a single protection and that it
should not change afterwards.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-06-18 17:42   ` Nadav Amit
@ 2019-06-18 18:30     ` Dan Williams
  2019-06-18 21:56       ` Nadav Amit
  0 siblings, 1 reply; 28+ messages in thread
From: Dan Williams @ 2019-06-18 18:30 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

On Tue, Jun 18, 2019 at 10:42 AM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jun 17, 2019, at 11:44 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <namit@vmware.com> wrote:
> >> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
> >> as a place in which significant amount of time is spent. It appears that
> >> in order to determine the cacheability that is required for the PTE,
> >> lookup_memtype() is called, and this one traverses the resources list in
> >> an inefficient manner. This patch-set tries to improve this situation.
> >
> > Let's just do this lookup once per device, cache that, and replay it
> > to modified vmf_insert_* routines that trust the caller to already
> > know the pgprot_values.
>
> IIUC, one device can have multiple regions with different characteristics,
> which require difference cachability.

Not for pmem. It will always be one common cacheability setting for
the entirety of persistent memory.

> Apparently, that is the reason there
> is a tree of resources. Please be more specific about where you want to
> cache it, please.

The reason for lookup_memtype() was to try to prevent mixed
cacheability settings of pages across different processes . The
mapping type for pmem/dax is established by one of:

drivers/nvdimm/pmem.c:413:              addr =
devm_memremap_pages(dev, &pmem->pgmap);
drivers/nvdimm/pmem.c:425:              addr =
devm_memremap_pages(dev, &pmem->pgmap);
drivers/nvdimm/pmem.c:432:              addr = devm_memremap(dev,
pmem->phys_addr,
drivers/nvdimm/pmem.c-433-                              pmem->size,
ARCH_MEMREMAP_PMEM);

...and is constant for the life of the device and all subsequent mappings.

> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> see being done in quite a few cases), but I don’t know the code well enough
> to be certain that every vma should have a single protection and that it
> should not change afterwards.

No, I'm thinking this would naturally fit as a property hanging off a
'struct dax_device', and then create a version of vmf_insert_mixed()
and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
saved value.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-06-18 18:30     ` Dan Williams
@ 2019-06-18 21:56       ` Nadav Amit
  2019-07-16 22:00         ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-06-18 21:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

> On Jun 18, 2019, at 11:30 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Tue, Jun 18, 2019 at 10:42 AM Nadav Amit <namit@vmware.com> wrote:
>>> On Jun 17, 2019, at 11:44 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>> On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <namit@vmware.com> wrote:
>>>> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
>>>> as a place in which significant amount of time is spent. It appears that
>>>> in order to determine the cacheability that is required for the PTE,
>>>> lookup_memtype() is called, and this one traverses the resources list in
>>>> an inefficient manner. This patch-set tries to improve this situation.
>>> 
>>> Let's just do this lookup once per device, cache that, and replay it
>>> to modified vmf_insert_* routines that trust the caller to already
>>> know the pgprot_values.
>> 
>> IIUC, one device can have multiple regions with different characteristics,
>> which require difference cachability.
> 
> Not for pmem. It will always be one common cacheability setting for
> the entirety of persistent memory.
> 
>> Apparently, that is the reason there
>> is a tree of resources. Please be more specific about where you want to
>> cache it, please.
> 
> The reason for lookup_memtype() was to try to prevent mixed
> cacheability settings of pages across different processes . The
> mapping type for pmem/dax is established by one of:
> 
> drivers/nvdimm/pmem.c:413:              addr =
> devm_memremap_pages(dev, &pmem->pgmap);
> drivers/nvdimm/pmem.c:425:              addr =
> devm_memremap_pages(dev, &pmem->pgmap);
> drivers/nvdimm/pmem.c:432:              addr = devm_memremap(dev,
> pmem->phys_addr,
> drivers/nvdimm/pmem.c-433-                              pmem->size,
> ARCH_MEMREMAP_PMEM);
> 
> ...and is constant for the life of the device and all subsequent mappings.
> 
>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>> see being done in quite a few cases), but I don’t know the code well enough
>> to be certain that every vma should have a single protection and that it
>> should not change afterwards.
> 
> No, I'm thinking this would naturally fit as a property hanging off a
> 'struct dax_device', and then create a version of vmf_insert_mixed()
> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> saved value.

Thanks for the detailed explanation. I’ll give it a try (the moment I find
some free time). I still think that patch 2/3 is beneficial, but based on
your feedback, patch 3/3 should be dropped.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-18  5:40       ` Nadav Amit
@ 2019-06-19 13:00         ` Bjorn Helgaas
  2019-06-19 20:35           ` Nadav Amit
  2019-06-19 21:53           ` Dan Williams
  0 siblings, 2 replies; 28+ messages in thread
From: Bjorn Helgaas @ 2019-06-19 13:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, LKML, Linux-MM, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Ingo Molnar

On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@vmware.com> wrote:
> >
> >> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@vmware.com> wrote:
> >>
> >>> For efficient search of resources, as needed to determine the memory
> >>> type for dax page-faults, introduce a cache of the most recently used
> >>> top-level resource. Caching the top-level should be safe as ranges in
> >>> that level do not overlap (unlike those of lower levels).
> >>>
> >>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
> >>> is added, removed or changed, invalidate all the resources. The
> >>> invalidation takes place when the resource_lock is taken for write,
> >>> preventing possible races.
> >>>
> >>> This patch provides relatively small performance improvements over the
> >>> previous patch (~0.5% on sysbench), but can benefit systems with many
> >>> resources.
> >>
> >>> --- a/kernel/resource.c
> >>> +++ b/kernel/resource.c
> >>> @@ -53,6 +53,12 @@ struct resource_constraint {
> >>>
> >>> static DEFINE_RWLOCK(resource_lock);
> >>>
> >>> +/*
> >>> + * Cache of the top-level resource that was most recently use by
> >>> + * find_next_iomem_res().
> >>> + */
> >>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
> >>
> >> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
> >> bit odd - the latency getting at that rwlock will swamp the benefit of
> >> isolating the CPUs from each other when accessing resource_cache.
> >>
> >> On the other hand, if we have multiple CPUs running
> >> find_next_iomem_res() concurrently then yes, I see the benefit.  Has
> >> the benefit of using a per-cpu cache (rather than a kernel-wide one)
> >> been quantified?
> >
> > No. I am not sure how easy it would be to measure it. On the other hander
> > the lock is not supposed to be contended (at most cases). At the time I saw
> > numbers that showed that stores to “exclusive" cache lines can be as
> > expensive as atomic operations [1]. I am not sure how up to date these
> > numbers are though. In the benchmark I ran, multiple CPUs ran
> > find_next_iomem_res() concurrently.
> >
> > [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf
>
> Just to clarify - the main motivation behind the per-cpu variable is not
> about contention, but about the fact the different processes/threads that
> run concurrently might use different resources.

IIUC, the underlying problem is that dax relies heavily on ioremap(),
and ioremap() on x86 takes too long because it relies on
find_next_iomem_res() via the __ioremap_caller() ->
__ioremap_check_mem() -> walk_mem_res() path.

The fact that x86 is the only arch that does this much work in
ioremap() makes me wonder.  Is there something unique about x86
mapping attributes that requires this extra work, or is there some way
this could be reworked to avoid searching the resource map in the
first place?

Bjorn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-19 13:00         ` Bjorn Helgaas
@ 2019-06-19 20:35           ` Nadav Amit
  2019-06-19 21:53           ` Dan Williams
  1 sibling, 0 replies; 28+ messages in thread
From: Nadav Amit @ 2019-06-19 20:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Andrew Morton, LKML, Linux-MM, Borislav Petkov, Toshi Kani,
	Peter Zijlstra, Dave Hansen, Dan Williams, Ingo Molnar

> On Jun 19, 2019, at 6:00 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> 
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <namit@vmware.com> wrote:
>>> On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>>> 
>>>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>>> For efficient search of resources, as needed to determine the memory
>>>>> type for dax page-faults, introduce a cache of the most recently used
>>>>> top-level resource. Caching the top-level should be safe as ranges in
>>>>> that level do not overlap (unlike those of lower levels).
>>>>> 
>>>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>>>> is added, removed or changed, invalidate all the resources. The
>>>>> invalidation takes place when the resource_lock is taken for write,
>>>>> preventing possible races.
>>>>> 
>>>>> This patch provides relatively small performance improvements over the
>>>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>>>> resources.
>>>> 
>>>>> --- a/kernel/resource.c
>>>>> +++ b/kernel/resource.c
>>>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>>>> 
>>>>> static DEFINE_RWLOCK(resource_lock);
>>>>> 
>>>>> +/*
>>>>> + * Cache of the top-level resource that was most recently use by
>>>>> + * find_next_iomem_res().
>>>>> + */
>>>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>>>> 
>>>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>>>> bit odd - the latency getting at that rwlock will swamp the benefit of
>>>> isolating the CPUs from each other when accessing resource_cache.
>>>> 
>>>> On the other hand, if we have multiple CPUs running
>>>> find_next_iomem_res() concurrently then yes, I see the benefit.  Has
>>>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>>>> been quantified?
>>> 
>>> No. I am not sure how easy it would be to measure it. On the other hander
>>> the lock is not supposed to be contended (at most cases). At the time I saw
>>> numbers that showed that stores to “exclusive" cache lines can be as
>>> expensive as atomic operations [1]. I am not sure how up to date these
>>> numbers are though. In the benchmark I ran, multiple CPUs ran
>>> find_next_iomem_res() concurrently.
>>> 
>>> [1] https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsigops.org%2Fs%2Fconferences%2Fsosp%2F2013%2Fpapers%2Fp33-david.pdf&amp;data=02%7C01%7Cnamit%40vmware.com%7Ca2706c5ab2c544283f3b08d6f4b6152b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C636965460234022371&amp;sdata=cD7Nhs4jcJGMD7Lav6D%2BC6E5Sei0DiWhKXL7vz2cVHA%3D&amp;reserved=0
>> 
>> Just to clarify - the main motivation behind the per-cpu variable is not
>> about contention, but about the fact the different processes/threads that
>> run concurrently might use different resources.
> 
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.

I don’t know much about this path and whether it is painful. The path I was
regarding is during page-fault handling:

   - handle_mm_fault
      - __handle_mm_fault
         - do_wp_page
            - ext4_dax_fault
               - ext4_dax_huge_fault
                  - dax_iomap_fault
                     - dax_iomap_pte_fault
                        - vmf_insert_mixed_mkwrite
                           - __vm_insert_mixed
                              - track_pfn_insert
                                 - lookup_memtype
                                    - pat_pagerange_is_ram

But indeed track_pfn_insert() in x86 specific. I guess the differences are
due to the page-table controlling the cachability in x86 (PAT), but I don’t
know much about other architectures and whether they have similar
cachability controls in the page-tables.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-19 13:00         ` Bjorn Helgaas
  2019-06-19 20:35           ` Nadav Amit
@ 2019-06-19 21:53           ` Dan Williams
  2019-06-20 21:31             ` Andi Kleen
  1 sibling, 1 reply; 28+ messages in thread
From: Dan Williams @ 2019-06-19 21:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Nadav Amit, Andrew Morton, LKML, Linux-MM, Borislav Petkov,
	Toshi Kani, Peter Zijlstra, Dave Hansen, Ingo Molnar, Kleen,
	Andi

[ add Andi ]

On Wed, Jun 19, 2019 at 6:00 AM Bjorn Helgaas <bhelgaas@google.com> wrote:
>
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <namit@vmware.com> wrote:
> >
> > > On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@vmware.com> wrote:
> > >
> > >> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > >>
> > >> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@vmware.com> wrote:
> > >>
> > >>> For efficient search of resources, as needed to determine the memory
> > >>> type for dax page-faults, introduce a cache of the most recently used
> > >>> top-level resource. Caching the top-level should be safe as ranges in
> > >>> that level do not overlap (unlike those of lower levels).
> > >>>
> > >>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
> > >>> is added, removed or changed, invalidate all the resources. The
> > >>> invalidation takes place when the resource_lock is taken for write,
> > >>> preventing possible races.
> > >>>
> > >>> This patch provides relatively small performance improvements over the
> > >>> previous patch (~0.5% on sysbench), but can benefit systems with many
> > >>> resources.
> > >>
> > >>> --- a/kernel/resource.c
> > >>> +++ b/kernel/resource.c
> > >>> @@ -53,6 +53,12 @@ struct resource_constraint {
> > >>>
> > >>> static DEFINE_RWLOCK(resource_lock);
> > >>>
> > >>> +/*
> > >>> + * Cache of the top-level resource that was most recently use by
> > >>> + * find_next_iomem_res().
> > >>> + */
> > >>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
> > >>
> > >> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
> > >> bit odd - the latency getting at that rwlock will swamp the benefit of
> > >> isolating the CPUs from each other when accessing resource_cache.
> > >>
> > >> On the other hand, if we have multiple CPUs running
> > >> find_next_iomem_res() concurrently then yes, I see the benefit.  Has
> > >> the benefit of using a per-cpu cache (rather than a kernel-wide one)
> > >> been quantified?
> > >
> > > No. I am not sure how easy it would be to measure it. On the other hander
> > > the lock is not supposed to be contended (at most cases). At the time I saw
> > > numbers that showed that stores to “exclusive" cache lines can be as
> > > expensive as atomic operations [1]. I am not sure how up to date these
> > > numbers are though. In the benchmark I ran, multiple CPUs ran
> > > find_next_iomem_res() concurrently.
> > >
> > > [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf
> >
> > Just to clarify - the main motivation behind the per-cpu variable is not
> > about contention, but about the fact the different processes/threads that
> > run concurrently might use different resources.
>
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.
>
> The fact that x86 is the only arch that does this much work in
> ioremap() makes me wonder.  Is there something unique about x86
> mapping attributes that requires this extra work, or is there some way
> this could be reworked to avoid searching the resource map in the
> first place?

The underlying issue is that the x86-PAT implementation wants to
ensure that conflicting mappings are not set up for the same physical
address. This is mentioned in the developer manuals as problematic on
some cpus. Andi, is lookup_memtype() and track_pfn_insert() still
relevant?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-19 21:53           ` Dan Williams
@ 2019-06-20 21:31             ` Andi Kleen
  2019-06-20 23:13               ` Dan Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2019-06-20 21:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bjorn Helgaas, Nadav Amit, Andrew Morton, LKML, Linux-MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Ingo Molnar, Kleen, Andi

Dan Williams <dan.j.williams@intel.com> writes:
>
> The underlying issue is that the x86-PAT implementation wants to
> ensure that conflicting mappings are not set up for the same physical
> address. This is mentioned in the developer manuals as problematic on
> some cpus. Andi, is lookup_memtype() and track_pfn_insert() still
> relevant?

There have been discussions about it in the past, and the right answer
will likely differ for different CPUs: But so far the official answer
for Intel CPUs is that these caching conflicts should be avoided.

So I guess the cache in the original email makes sense for now.

-Andi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] resource: Introduce resource cache
  2019-06-20 21:31             ` Andi Kleen
@ 2019-06-20 23:13               ` Dan Williams
  0 siblings, 0 replies; 28+ messages in thread
From: Dan Williams @ 2019-06-20 23:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bjorn Helgaas, Nadav Amit, Andrew Morton, LKML, Linux-MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Ingo Molnar, Kleen, Andi

On Thu, Jun 20, 2019 at 2:31 PM Andi Kleen <andi@firstfloor.org> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
> >
> > The underlying issue is that the x86-PAT implementation wants to
> > ensure that conflicting mappings are not set up for the same physical
> > address. This is mentioned in the developer manuals as problematic on
> > some cpus. Andi, is lookup_memtype() and track_pfn_insert() still
> > relevant?
>
> There have been discussions about it in the past, and the right answer
> will likely differ for different CPUs: But so far the official answer
> for Intel CPUs is that these caching conflicts should be avoided.
>

Ok.

> So I guess the cache in the original email makes sense for now.

I wouldn't go that far, but it does mean that if we go ahead with
caching the value as a dax_device property there should at least be a
debug option to assert that the device value conforms to all the other
mappings.

Another  failing of the track_pfn_insert() and lookup_memtype()
implementation is that it makes it awkward to handle marking mappings
UC to prevent speculative consumption of poison. That is something
that is better handled, in my opinion, by asking the device for the
pgprot and coordinating shooting down any WB mappings of the same
physical page.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-06-18 21:56       ` Nadav Amit
@ 2019-07-16 22:00         ` Andrew Morton
  2019-07-16 22:06           ` Nadav Amit
  2019-07-16 22:07           ` Dan Williams
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2019-07-16 22:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:

> > ...and is constant for the life of the device and all subsequent mappings.
> > 
> >> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> >> see being done in quite a few cases), but I don’t know the code well enough
> >> to be certain that every vma should have a single protection and that it
> >> should not change afterwards.
> > 
> > No, I'm thinking this would naturally fit as a property hanging off a
> > 'struct dax_device', and then create a version of vmf_insert_mixed()
> > and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> > saved value.
> 
> Thanks for the detailed explanation. I’ll give it a try (the moment I find
> some free time). I still think that patch 2/3 is beneficial, but based on
> your feedback, patch 3/3 should be dropped.

It has been a while.  What should we do with

resource-fix-locking-in-find_next_iomem_res.patch
resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch

?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-07-16 22:00         ` Andrew Morton
@ 2019-07-16 22:06           ` Nadav Amit
  2019-07-16 22:07           ` Dan Williams
  1 sibling, 0 replies; 28+ messages in thread
From: Nadav Amit @ 2019-07-16 22:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

> On Jul 16, 2019, at 3:00 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:
> 
>>> ...and is constant for the life of the device and all subsequent mappings.
>>> 
>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>> to be certain that every vma should have a single protection and that it
>>>> should not change afterwards.
>>> 
>>> No, I'm thinking this would naturally fit as a property hanging off a
>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>> saved value.
>> 
>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>> some free time). I still think that patch 2/3 is beneficial, but based on
>> your feedback, patch 3/3 should be dropped.
> 
> It has been a while.  What should we do with
> 
> resource-fix-locking-in-find_next_iomem_res.patch
> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> 
> ?

I didn’t get to follow Dan Williams advice. But, both of two patches are
fine on my opinion and should go upstream. The first one fixes a bug and the
second one improves performance considerably (and removes most of the
overhead). Future improvements can go on top of these patches and are not
expected to conflict.

So I think they should go upstream - the first one immediately, the second
one when possible.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-07-16 22:00         ` Andrew Morton
  2019-07-16 22:06           ` Nadav Amit
@ 2019-07-16 22:07           ` Dan Williams
  2019-07-16 22:13             ` Nadav Amit
  1 sibling, 1 reply; 28+ messages in thread
From: Dan Williams @ 2019-07-16 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nadav Amit, Linux Kernel Mailing List, Linux MM, Borislav Petkov,
	Toshi Kani, Peter Zijlstra, Dave Hansen, Bjorn Helgaas,
	Ingo Molnar

On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:
>
> > > ...and is constant for the life of the device and all subsequent mappings.
> > >
> > >> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> > >> see being done in quite a few cases), but I don’t know the code well enough
> > >> to be certain that every vma should have a single protection and that it
> > >> should not change afterwards.
> > >
> > > No, I'm thinking this would naturally fit as a property hanging off a
> > > 'struct dax_device', and then create a version of vmf_insert_mixed()
> > > and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> > > saved value.
> >
> > Thanks for the detailed explanation. I’ll give it a try (the moment I find
> > some free time). I still think that patch 2/3 is beneficial, but based on
> > your feedback, patch 3/3 should be dropped.
>
> It has been a while.  What should we do with
>
> resource-fix-locking-in-find_next_iomem_res.patch

This one looks obviously correct to me, you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch

This one is a good bug report that we need to go fix pgprot lookups
for dax, but I don't think we need to increase the trickiness of the
core resource lookup code in the meantime.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-07-16 22:07           ` Dan Williams
@ 2019-07-16 22:13             ` Nadav Amit
  2019-07-16 22:20               ` Dan Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-07-16 22:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

> On Jul 16, 2019, at 3:07 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:
>> 
>>>> ...and is constant for the life of the device and all subsequent mappings.
>>>> 
>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>>> to be certain that every vma should have a single protection and that it
>>>>> should not change afterwards.
>>>> 
>>>> No, I'm thinking this would naturally fit as a property hanging off a
>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>>> saved value.
>>> 
>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>>> some free time). I still think that patch 2/3 is beneficial, but based on
>>> your feedback, patch 3/3 should be dropped.
>> 
>> It has been a while.  What should we do with
>> 
>> resource-fix-locking-in-find_next_iomem_res.patch
> 
> This one looks obviously correct to me, you can add:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> 
> This one is a good bug report that we need to go fix pgprot lookups
> for dax, but I don't think we need to increase the trickiness of the
> core resource lookup code in the meantime.

I think that traversing big parts of the tree that are known to be
irrelevant is wasteful no matter what, and this code is used in other cases.

I don’t think the new code is so tricky - can you point to the part of the
code that you find tricky?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-07-16 22:13             ` Nadav Amit
@ 2019-07-16 22:20               ` Dan Williams
  2019-07-16 22:28                 ` Nadav Amit
  0 siblings, 1 reply; 28+ messages in thread
From: Dan Williams @ 2019-07-16 22:20 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jul 16, 2019, at 3:07 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:
> >>
> >>>> ...and is constant for the life of the device and all subsequent mappings.
> >>>>
> >>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> >>>>> see being done in quite a few cases), but I don’t know the code well enough
> >>>>> to be certain that every vma should have a single protection and that it
> >>>>> should not change afterwards.
> >>>>
> >>>> No, I'm thinking this would naturally fit as a property hanging off a
> >>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
> >>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> >>>> saved value.
> >>>
> >>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
> >>> some free time). I still think that patch 2/3 is beneficial, but based on
> >>> your feedback, patch 3/3 should be dropped.
> >>
> >> It has been a while.  What should we do with
> >>
> >> resource-fix-locking-in-find_next_iomem_res.patch
> >
> > This one looks obviously correct to me, you can add:
> >
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> >
> >> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> >
> > This one is a good bug report that we need to go fix pgprot lookups
> > for dax, but I don't think we need to increase the trickiness of the
> > core resource lookup code in the meantime.
>
> I think that traversing big parts of the tree that are known to be
> irrelevant is wasteful no matter what, and this code is used in other cases.
>
> I don’t think the new code is so tricky - can you point to the part of the
> code that you find tricky?

Given dax can be updated to avoid this abuse of find_next_iomem_res(),
it was a general observation that the patch adds more lines than it
removes and is not strictly necessary. I'm ambivalent as to whether it
is worth pushing upstream. If anything the changelog is going to be
invalidated by a change to dax to avoid find_next_iomem_res(). Can you
update the changelog to be relevant outside of the dax case?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-07-16 22:20               ` Dan Williams
@ 2019-07-16 22:28                 ` Nadav Amit
  2019-07-16 22:45                   ` Dan Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Nadav Amit @ 2019-07-16 22:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

> On Jul 16, 2019, at 3:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit <namit@vmware.com> wrote:
>>> On Jul 16, 2019, at 3:07 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>>>> ...and is constant for the life of the device and all subsequent mappings.
>>>>>> 
>>>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>>>>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>>>>> to be certain that every vma should have a single protection and that it
>>>>>>> should not change afterwards.
>>>>>> 
>>>>>> No, I'm thinking this would naturally fit as a property hanging off a
>>>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>>>>> saved value.
>>>>> 
>>>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>>>>> some free time). I still think that patch 2/3 is beneficial, but based on
>>>>> your feedback, patch 3/3 should be dropped.
>>>> 
>>>> It has been a while.  What should we do with
>>>> 
>>>> resource-fix-locking-in-find_next_iomem_res.patch
>>> 
>>> This one looks obviously correct to me, you can add:
>>> 
>>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>>> 
>>>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
>>> 
>>> This one is a good bug report that we need to go fix pgprot lookups
>>> for dax, but I don't think we need to increase the trickiness of the
>>> core resource lookup code in the meantime.
>> 
>> I think that traversing big parts of the tree that are known to be
>> irrelevant is wasteful no matter what, and this code is used in other cases.
>> 
>> I don’t think the new code is so tricky - can you point to the part of the
>> code that you find tricky?
> 
> Given dax can be updated to avoid this abuse of find_next_iomem_res(),
> it was a general observation that the patch adds more lines than it
> removes and is not strictly necessary. I'm ambivalent as to whether it
> is worth pushing upstream. If anything the changelog is going to be
> invalidated by a change to dax to avoid find_next_iomem_res(). Can you
> update the changelog to be relevant outside of the dax case?

Well, 8 lines are comments, 4 are empty lines, so it adds 3 lines of code
according to my calculations. :)

Having said that, if you think I might have made a mistake, or you are
concerned with some bug I might have caused, please let me know. I
understand that this logic might have been lying around for some time.

I can update the commit log, emphasizing the redundant search operations as
motivation and then mentioning dax as an instance that induces overheads due
to this overhead, and say it should be handled regardless to this patch-set.
Once I find time, I am going to deal with DAX, unless you beat me to it.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
  2019-07-16 22:28                 ` Nadav Amit
@ 2019-07-16 22:45                   ` Dan Williams
  0 siblings, 0 replies; 28+ messages in thread
From: Dan Williams @ 2019-07-16 22:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Linux Kernel Mailing List, Linux MM,
	Borislav Petkov, Toshi Kani, Peter Zijlstra, Dave Hansen,
	Bjorn Helgaas, Ingo Molnar

On Tue, Jul 16, 2019 at 3:29 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jul 16, 2019, at 3:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit <namit@vmware.com> wrote:
> >>> On Jul 16, 2019, at 3:07 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >>>
> >>> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>>> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <namit@vmware.com> wrote:
> >>>>
> >>>>>> ...and is constant for the life of the device and all subsequent mappings.
> >>>>>>
> >>>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> >>>>>>> see being done in quite a few cases), but I don’t know the code well enough
> >>>>>>> to be certain that every vma should have a single protection and that it
> >>>>>>> should not change afterwards.
> >>>>>>
> >>>>>> No, I'm thinking this would naturally fit as a property hanging off a
> >>>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
> >>>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> >>>>>> saved value.
> >>>>>
> >>>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
> >>>>> some free time). I still think that patch 2/3 is beneficial, but based on
> >>>>> your feedback, patch 3/3 should be dropped.
> >>>>
> >>>> It has been a while.  What should we do with
> >>>>
> >>>> resource-fix-locking-in-find_next_iomem_res.patch
> >>>
> >>> This one looks obviously correct to me, you can add:
> >>>
> >>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> >>>
> >>>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> >>>
> >>> This one is a good bug report that we need to go fix pgprot lookups
> >>> for dax, but I don't think we need to increase the trickiness of the
> >>> core resource lookup code in the meantime.
> >>
> >> I think that traversing big parts of the tree that are known to be
> >> irrelevant is wasteful no matter what, and this code is used in other cases.
> >>
> >> I don’t think the new code is so tricky - can you point to the part of the
> >> code that you find tricky?
> >
> > Given dax can be updated to avoid this abuse of find_next_iomem_res(),
> > it was a general observation that the patch adds more lines than it
> > removes and is not strictly necessary. I'm ambivalent as to whether it
> > is worth pushing upstream. If anything the changelog is going to be
> > invalidated by a change to dax to avoid find_next_iomem_res(). Can you
> > update the changelog to be relevant outside of the dax case?
>
> Well, 8 lines are comments, 4 are empty lines, so it adds 3 lines of code
> according to my calculations. :)
>
> Having said that, if you think I might have made a mistake, or you are
> concerned with some bug I might have caused, please let me know. I
> understand that this logic might have been lying around for some time.

Like I said, I'm ambivalent and not NAK'ing it. It looks ok, but at
the same time something is wrong if a hot path is constantly
re-looking up a resource. The fact that it shows up in profiles when
that happens could be considered useful.

> I can update the commit log, emphasizing the redundant search operations as
> motivation and then mentioning dax as an instance that induces overheads due
> to this overhead, and say it should be handled regardless to this patch-set.
> Once I find time, I am going to deal with DAX, unless you beat me to it.

It turns out that the ability to ask the driver for pgprot bits is
useful above and beyond this performance optimization. For example I'm
looking to use it to support disabling speculation into pages with
known  media errors by letting the driver consult its 'badblock' list.
There are also usages for passing the key-id for persistent memory
encrypted by MKTME.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-07-16 22:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-13  4:59 [PATCH 0/3] resource: find_next_iomem_res() improvements Nadav Amit
2019-06-13  4:59 ` [PATCH 1/3] resource: Fix locking in find_next_iomem_res() Nadav Amit
     [not found]   ` <20190615221557.CD1492183F@mail.kernel.org>
2019-06-17 19:14     ` Nadav Amit
2019-06-18  0:55       ` Sasha Levin
2019-06-18  1:32         ` Nadav Amit
2019-06-18  4:26   ` Andrew Morton
2019-06-13  4:59 ` [PATCH 2/3] resource: Avoid unnecessary lookups " Nadav Amit
2019-06-13  4:59 ` [PATCH 3/3] resource: Introduce resource cache Nadav Amit
     [not found]   ` <20190615221607.4B44521841@mail.kernel.org>
2019-06-17 17:20     ` Nadav Amit
2019-06-18  4:57   ` Andrew Morton
2019-06-18  5:33     ` Nadav Amit
2019-06-18  5:40       ` Nadav Amit
2019-06-19 13:00         ` Bjorn Helgaas
2019-06-19 20:35           ` Nadav Amit
2019-06-19 21:53           ` Dan Williams
2019-06-20 21:31             ` Andi Kleen
2019-06-20 23:13               ` Dan Williams
2019-06-18  6:44 ` [PATCH 0/3] resource: find_next_iomem_res() improvements Dan Williams
2019-06-18 17:42   ` Nadav Amit
2019-06-18 18:30     ` Dan Williams
2019-06-18 21:56       ` Nadav Amit
2019-07-16 22:00         ` Andrew Morton
2019-07-16 22:06           ` Nadav Amit
2019-07-16 22:07           ` Dan Williams
2019-07-16 22:13             ` Nadav Amit
2019-07-16 22:20               ` Dan Williams
2019-07-16 22:28                 ` Nadav Amit
2019-07-16 22:45                   ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).