LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
@ 2019-05-09 23:26 Yang Shi
  2019-05-13 16:38 ` Will Deacon
  0 siblings, 1 reply; 13+ messages in thread
From: Yang Shi @ 2019-05-09 23:26 UTC (permalink / raw)
  To: jstancek, peterz, will.deacon, namit, minchan, mgorman
  Cc: yang.shi, stable, linux-mm, linux-kernel

A few new fields were added to mmu_gather to make TLB flush smarter for
huge page by telling what level of page table is changed.

__tlb_reset_range() is used to reset all these page table state to
unchanged, which is called by TLB flush for parallel mapping changes for
the same range under non-exclusive lock (i.e. read mmap_sem).  Before
commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update
PTEs in parallel don't remove page tables.  But, the forementioned
commit may do munmap() under read mmap_sem and free page tables.  This
may result in program hang on aarch64 reported by Jan Stancek.  The
problem could be reproduced by his test program with slightly modified
below.

---8<---

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may bring in parallel execution like below:

        t1                                        t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()

__tlb_reset_range() would reset freed_tables and cleared_* bits, but
this may cause inconsistency for munmap() which do free page tables.
Then it may result in some architectures, e.g. aarch64, may not flush
TLB completely as expected to have stale TLB entries remained.

The original proposed fix came from Jan Stancek who mainly debugged this
issue, I just wrapped up everything together.

Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
---
v2: Reworked the commit log per Peter and Will
    Adopted the suggestion from Peter

 mm/mmu_gather.c | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..469492d 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.  So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
+		/*
+		 * Since we can't tell what we actually should have
+		 * flushed, flush everything in the given range.
+		 */
+		tlb->freed_tables = 1;
+		tlb->cleared_ptes = 1;
+		tlb->cleared_pmds = 1;
+		tlb->cleared_puds = 1;
+		tlb->cleared_p4ds = 1;
+
+		/*
+		 * Some architectures, e.g. ARM, that have range invalidation
+		 * and care about VM_EXEC for I-Cache invalidation, need force
+		 * vma_exec set.
+		 */
+		tlb->vma_exec = 1;
+
+		/* Force vma_huge clear to guarantee safer flush */
+		tlb->vma_huge = 0;
+
+		tlb->start = start;
+		tlb->end = end;
 	}
 
 	tlb_flush_mmu(tlb);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-09 23:26 [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush Yang Shi
@ 2019-05-13 16:38 ` Will Deacon
  2019-05-13 23:01   ` Yang Shi
  2019-05-14 11:52   ` Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Will Deacon @ 2019-05-13 16:38 UTC (permalink / raw)
  To: Yang Shi
  Cc: jstancek, peterz, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel

On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1..469492d 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  {
>  	/*
>  	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> +	 * and result in having stale TLB entries.  So flush TLB forcefully
> +	 * if we detect parallel PTE batching threads.
> +	 *
> +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> +	 * needs force flush everything in the given range. Otherwise this
> +	 * may result in having stale TLB entries for some architectures,
> +	 * e.g. aarch64, that could specify flush what level TLB.
>  	 */
> -	if (mm_tlb_flush_nested(tlb->mm)) {
> -		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> +		/*
> +		 * Since we can't tell what we actually should have
> +		 * flushed, flush everything in the given range.
> +		 */
> +		tlb->freed_tables = 1;
> +		tlb->cleared_ptes = 1;
> +		tlb->cleared_pmds = 1;
> +		tlb->cleared_puds = 1;
> +		tlb->cleared_p4ds = 1;
> +
> +		/*
> +		 * Some architectures, e.g. ARM, that have range invalidation
> +		 * and care about VM_EXEC for I-Cache invalidation, need force
> +		 * vma_exec set.
> +		 */
> +		tlb->vma_exec = 1;
> +
> +		/* Force vma_huge clear to guarantee safer flush */
> +		tlb->vma_huge = 0;
> +
> +		tlb->start = start;
> +		tlb->end = end;
>  	}

Whilst I think this is correct, it would be interesting to see whether
or not it's actually faster than just nuking the whole mm, as I mentioned
before.

At least in terms of getting a short-term fix, I'd prefer the diff below
if it's not measurably worse.

Will

--->8

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..cc251422d307 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * forcefully if we detect parallel PTE batching threads.
 	 */
 	if (mm_tlb_flush_nested(tlb->mm)) {
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-13 16:38 ` Will Deacon
@ 2019-05-13 23:01   ` Yang Shi
  2019-05-14 14:54     ` Will Deacon
  2019-05-14 11:52   ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Yang Shi @ 2019-05-13 23:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: jstancek, peterz, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel



On 5/13/19 9:38 AM, Will Deacon wrote:
> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 99740e1..469492d 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>   {
>>   	/*
>>   	 * If there are parallel threads are doing PTE changes on same range
>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> -	 * forcefully if we detect parallel PTE batching threads.
>> +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>> +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>> +	 * and result in having stale TLB entries.  So flush TLB forcefully
>> +	 * if we detect parallel PTE batching threads.
>> +	 *
>> +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>> +	 * needs force flush everything in the given range. Otherwise this
>> +	 * may result in having stale TLB entries for some architectures,
>> +	 * e.g. aarch64, that could specify flush what level TLB.
>>   	 */
>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>> -		__tlb_reset_range(tlb);
>> -		__tlb_adjust_range(tlb, start, end - start);
>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
>> +		/*
>> +		 * Since we can't tell what we actually should have
>> +		 * flushed, flush everything in the given range.
>> +		 */
>> +		tlb->freed_tables = 1;
>> +		tlb->cleared_ptes = 1;
>> +		tlb->cleared_pmds = 1;
>> +		tlb->cleared_puds = 1;
>> +		tlb->cleared_p4ds = 1;
>> +
>> +		/*
>> +		 * Some architectures, e.g. ARM, that have range invalidation
>> +		 * and care about VM_EXEC for I-Cache invalidation, need force
>> +		 * vma_exec set.
>> +		 */
>> +		tlb->vma_exec = 1;
>> +
>> +		/* Force vma_huge clear to guarantee safer flush */
>> +		tlb->vma_huge = 0;
>> +
>> +		tlb->start = start;
>> +		tlb->end = end;
>>   	}
> Whilst I think this is correct, it would be interesting to see whether
> or not it's actually faster than just nuking the whole mm, as I mentioned
> before.
>
> At least in terms of getting a short-term fix, I'd prefer the diff below
> if it's not measurably worse.

I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 
VM, it shows slightly slowdown on records/s but much more sys time spent 
with fullmm flush, the below is the data.

                                     nofullmm                 fullmm
ops (records/s)              225606                  225119
sys (s)                            0.69                        1.14

It looks the slight reduction of records/s is caused by the increase of 
sys time.

>
> Will
>
> --->8
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..cc251422d307 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>   	 * forcefully if we detect parallel PTE batching threads.
>   	 */
>   	if (mm_tlb_flush_nested(tlb->mm)) {
> +		tlb->fullmm = 1;
>   		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +		tlb->freed_tables = 1;
>   	}
>   
>   	tlb_flush_mmu(tlb);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-13 16:38 ` Will Deacon
  2019-05-13 23:01   ` Yang Shi
@ 2019-05-14 11:52   ` Peter Zijlstra
  2019-05-14 12:02     ` Will Deacon
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2019-05-14 11:52 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yang Shi, jstancek, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel

On Mon, May 13, 2019 at 05:38:04PM +0100, Will Deacon wrote:
> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1..469492d 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >  {
> >  	/*
> >  	 * If there are parallel threads are doing PTE changes on same range
> > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> > +	 * if we detect parallel PTE batching threads.
> > +	 *
> > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> > +	 * needs force flush everything in the given range. Otherwise this
> > +	 * may result in having stale TLB entries for some architectures,
> > +	 * e.g. aarch64, that could specify flush what level TLB.
> >  	 */
> > +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> > +		/*
> > +		 * Since we can't tell what we actually should have
> > +		 * flushed, flush everything in the given range.
> > +		 */
> > +		tlb->freed_tables = 1;
> > +		tlb->cleared_ptes = 1;
> > +		tlb->cleared_pmds = 1;
> > +		tlb->cleared_puds = 1;
> > +		tlb->cleared_p4ds = 1;
> > +
> > +		/*
> > +		 * Some architectures, e.g. ARM, that have range invalidation
> > +		 * and care about VM_EXEC for I-Cache invalidation, need force
> > +		 * vma_exec set.
> > +		 */
> > +		tlb->vma_exec = 1;
> > +
> > +		/* Force vma_huge clear to guarantee safer flush */
> > +		tlb->vma_huge = 0;
> > +
> > +		tlb->start = start;
> > +		tlb->end = end;
> >  	}
> 
> Whilst I think this is correct, it would be interesting to see whether
> or not it's actually faster than just nuking the whole mm, as I mentioned
> before.
> 
> At least in terms of getting a short-term fix, I'd prefer the diff below
> if it's not measurably worse.

So what point? General paranoia? Either change should allow PPC to get
rid of its magic mushrooms, the below would be a little bit easier for
them because they already do full invalidate correct.

> --->8
> 
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..cc251422d307 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  	 * forcefully if we detect parallel PTE batching threads.
>  	 */
>  	if (mm_tlb_flush_nested(tlb->mm)) {
> +		tlb->fullmm = 1;
>  		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +		tlb->freed_tables = 1;
>  	}
>  
>  	tlb_flush_mmu(tlb);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-14 11:52   ` Peter Zijlstra
@ 2019-05-14 12:02     ` Will Deacon
  0 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2019-05-14 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yang Shi, jstancek, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel

On Tue, May 14, 2019 at 01:52:23PM +0200, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 05:38:04PM +0100, Will Deacon wrote:
> > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > > index 99740e1..469492d 100644
> > > --- a/mm/mmu_gather.c
> > > +++ b/mm/mmu_gather.c
> > > @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> > >  {
> > >  	/*
> > >  	 * If there are parallel threads are doing PTE changes on same range
> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> > > +	 * if we detect parallel PTE batching threads.
> > > +	 *
> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> > > +	 * needs force flush everything in the given range. Otherwise this
> > > +	 * may result in having stale TLB entries for some architectures,
> > > +	 * e.g. aarch64, that could specify flush what level TLB.
> > >  	 */
> > > +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> > > +		/*
> > > +		 * Since we can't tell what we actually should have
> > > +		 * flushed, flush everything in the given range.
> > > +		 */
> > > +		tlb->freed_tables = 1;
> > > +		tlb->cleared_ptes = 1;
> > > +		tlb->cleared_pmds = 1;
> > > +		tlb->cleared_puds = 1;
> > > +		tlb->cleared_p4ds = 1;
> > > +
> > > +		/*
> > > +		 * Some architectures, e.g. ARM, that have range invalidation
> > > +		 * and care about VM_EXEC for I-Cache invalidation, need force
> > > +		 * vma_exec set.
> > > +		 */
> > > +		tlb->vma_exec = 1;
> > > +
> > > +		/* Force vma_huge clear to guarantee safer flush */
> > > +		tlb->vma_huge = 0;
> > > +
> > > +		tlb->start = start;
> > > +		tlb->end = end;
> > >  	}
> > 
> > Whilst I think this is correct, it would be interesting to see whether
> > or not it's actually faster than just nuking the whole mm, as I mentioned
> > before.
> > 
> > At least in terms of getting a short-term fix, I'd prefer the diff below
> > if it's not measurably worse.
> 
> So what point? General paranoia? Either change should allow PPC to get
> rid of its magic mushrooms, the below would be a little bit easier for
> them because they already do full invalidate correct.

Right; a combination of paranoia (need to remember to update this code
to "flush everything" if we add new fields to the gather structure) but
I also expected the performance to be better on arm64, where having two
CPUs spamming TLBI messages at the same time is likely to suck.

I'm super confused about the system time being reported as higher with
this change.  That's really not what I expected.

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-13 23:01   ` Yang Shi
@ 2019-05-14 14:54     ` Will Deacon
  2019-05-14 17:25       ` Yang Shi
  2019-05-16 15:29       ` Jan Stancek
  0 siblings, 2 replies; 13+ messages in thread
From: Will Deacon @ 2019-05-14 14:54 UTC (permalink / raw)
  To: Yang Shi
  Cc: jstancek, peterz, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel

On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
> 
> 
> On 5/13/19 9:38 AM, Will Deacon wrote:
> > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > > index 99740e1..469492d 100644
> > > --- a/mm/mmu_gather.c
> > > +++ b/mm/mmu_gather.c
> > > @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> > >   {
> > >   	/*
> > >   	 * If there are parallel threads are doing PTE changes on same range
> > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
> > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > > -	 * forcefully if we detect parallel PTE batching threads.
> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> > > +	 * if we detect parallel PTE batching threads.
> > > +	 *
> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> > > +	 * needs force flush everything in the given range. Otherwise this
> > > +	 * may result in having stale TLB entries for some architectures,
> > > +	 * e.g. aarch64, that could specify flush what level TLB.
> > >   	 */
> > > -	if (mm_tlb_flush_nested(tlb->mm)) {
> > > -		__tlb_reset_range(tlb);
> > > -		__tlb_adjust_range(tlb, start, end - start);
> > > +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> > > +		/*
> > > +		 * Since we can't tell what we actually should have
> > > +		 * flushed, flush everything in the given range.
> > > +		 */
> > > +		tlb->freed_tables = 1;
> > > +		tlb->cleared_ptes = 1;
> > > +		tlb->cleared_pmds = 1;
> > > +		tlb->cleared_puds = 1;
> > > +		tlb->cleared_p4ds = 1;
> > > +
> > > +		/*
> > > +		 * Some architectures, e.g. ARM, that have range invalidation
> > > +		 * and care about VM_EXEC for I-Cache invalidation, need force
> > > +		 * vma_exec set.
> > > +		 */
> > > +		tlb->vma_exec = 1;
> > > +
> > > +		/* Force vma_huge clear to guarantee safer flush */
> > > +		tlb->vma_huge = 0;
> > > +
> > > +		tlb->start = start;
> > > +		tlb->end = end;
> > >   	}
> > Whilst I think this is correct, it would be interesting to see whether
> > or not it's actually faster than just nuking the whole mm, as I mentioned
> > before.
> > 
> > At least in terms of getting a short-term fix, I'd prefer the diff below
> > if it's not measurably worse.
> 
> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM,
> it shows slightly slowdown on records/s but much more sys time spent with
> fullmm flush, the below is the data.
> 
>                                     nofullmm                 fullmm
> ops (records/s)              225606                  225119
> sys (s)                            0.69                        1.14
> 
> It looks the slight reduction of records/s is caused by the increase of sys
> time.

That's not what I expected, and I'm unable to explain why moving to fullmm
would /increase/ the system time. I would've thought the time spent doing
the invalidation would decrease, with the downside that the TLB is cold
when returning back to userspace.

FWIW, I ran 10 iterations of ebizzy on my arm64 box using a vanilla 5.1
kernel and the numbers are all over the place (see below). I think
deducing anything meaningful from this benchmark will be a challenge.

Will

--->8

306090 records/s
real 10.00 s
user 1227.55 s
sys   0.54 s
323547 records/s
real 10.00 s
user 1262.95 s
sys   0.82 s
409148 records/s
real 10.00 s
user 1266.54 s
sys   0.94 s
341507 records/s
real 10.00 s
user 1263.49 s
sys   0.66 s
375910 records/s
real 10.00 s
user 1259.87 s
sys   0.82 s
376152 records/s
real 10.00 s
user 1265.76 s
sys   0.96 s
358862 records/s
real 10.00 s
user 1251.13 s
sys   0.72 s
358164 records/s
real 10.00 s
user 1243.48 s
sys   0.85 s
332148 records/s
real 10.00 s
user 1260.93 s
sys   0.70 s
367021 records/s
real 10.00 s
user 1264.06 s
sys   1.43 s

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-14 14:54     ` Will Deacon
@ 2019-05-14 17:25       ` Yang Shi
  2019-05-16 15:29       ` Jan Stancek
  1 sibling, 0 replies; 13+ messages in thread
From: Yang Shi @ 2019-05-14 17:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: jstancek, peterz, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel



On 5/14/19 7:54 AM, Will Deacon wrote:
> On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
>>
>> On 5/13/19 9:38 AM, Will Deacon wrote:
>>> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>> index 99740e1..469492d 100644
>>>> --- a/mm/mmu_gather.c
>>>> +++ b/mm/mmu_gather.c
>>>> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>>>    {
>>>>    	/*
>>>>    	 * If there are parallel threads are doing PTE changes on same range
>>>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>>>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>>>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>>> -	 * forcefully if we detect parallel PTE batching threads.
>>>> +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>>>> +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>>>> +	 * and result in having stale TLB entries.  So flush TLB forcefully
>>>> +	 * if we detect parallel PTE batching threads.
>>>> +	 *
>>>> +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>>>> +	 * needs force flush everything in the given range. Otherwise this
>>>> +	 * may result in having stale TLB entries for some architectures,
>>>> +	 * e.g. aarch64, that could specify flush what level TLB.
>>>>    	 */
>>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>>>> -		__tlb_reset_range(tlb);
>>>> -		__tlb_adjust_range(tlb, start, end - start);
>>>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
>>>> +		/*
>>>> +		 * Since we can't tell what we actually should have
>>>> +		 * flushed, flush everything in the given range.
>>>> +		 */
>>>> +		tlb->freed_tables = 1;
>>>> +		tlb->cleared_ptes = 1;
>>>> +		tlb->cleared_pmds = 1;
>>>> +		tlb->cleared_puds = 1;
>>>> +		tlb->cleared_p4ds = 1;
>>>> +
>>>> +		/*
>>>> +		 * Some architectures, e.g. ARM, that have range invalidation
>>>> +		 * and care about VM_EXEC for I-Cache invalidation, need force
>>>> +		 * vma_exec set.
>>>> +		 */
>>>> +		tlb->vma_exec = 1;
>>>> +
>>>> +		/* Force vma_huge clear to guarantee safer flush */
>>>> +		tlb->vma_huge = 0;
>>>> +
>>>> +		tlb->start = start;
>>>> +		tlb->end = end;
>>>>    	}
>>> Whilst I think this is correct, it would be interesting to see whether
>>> or not it's actually faster than just nuking the whole mm, as I mentioned
>>> before.
>>>
>>> At least in terms of getting a short-term fix, I'd prefer the diff below
>>> if it's not measurably worse.
>> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM,
>> it shows slightly slowdown on records/s but much more sys time spent with
>> fullmm flush, the below is the data.
>>
>>                                      nofullmm                 fullmm
>> ops (records/s)              225606                  225119
>> sys (s)                            0.69                        1.14
>>
>> It looks the slight reduction of records/s is caused by the increase of sys
>> time.
> That's not what I expected, and I'm unable to explain why moving to fullmm
> would /increase/ the system time. I would've thought the time spent doing
> the invalidation would decrease, with the downside that the TLB is cold
> when returning back to userspace.
>
> FWIW, I ran 10 iterations of ebizzy on my arm64 box using a vanilla 5.1
> kernel and the numbers are all over the place (see below). I think
> deducing anything meaningful from this benchmark will be a challenge.

Yes, it looks so. What else benchmark do you suggest?

>
> Will
>
> --->8
>
> 306090 records/s
> real 10.00 s
> user 1227.55 s
> sys   0.54 s
> 323547 records/s
> real 10.00 s
> user 1262.95 s
> sys   0.82 s
> 409148 records/s
> real 10.00 s
> user 1266.54 s
> sys   0.94 s
> 341507 records/s
> real 10.00 s
> user 1263.49 s
> sys   0.66 s
> 375910 records/s
> real 10.00 s
> user 1259.87 s
> sys   0.82 s
> 376152 records/s
> real 10.00 s
> user 1265.76 s
> sys   0.96 s
> 358862 records/s
> real 10.00 s
> user 1251.13 s
> sys   0.72 s
> 358164 records/s
> real 10.00 s
> user 1243.48 s
> sys   0.85 s
> 332148 records/s
> real 10.00 s
> user 1260.93 s
> sys   0.70 s
> 367021 records/s
> real 10.00 s
> user 1264.06 s
> sys   1.43 s


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-14 14:54     ` Will Deacon
  2019-05-14 17:25       ` Yang Shi
@ 2019-05-16 15:29       ` Jan Stancek
  2019-05-20  2:59         ` Yang Shi
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Stancek @ 2019-05-16 15:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yang Shi, peterz, namit, minchan, mgorman, stable, linux-mm,
	linux-kernel, Jan Stancek



----- Original Message -----
> On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
> > 
> > 
> > On 5/13/19 9:38 AM, Will Deacon wrote:
> > > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > > > index 99740e1..469492d 100644
> > > > --- a/mm/mmu_gather.c
> > > > +++ b/mm/mmu_gather.c
> > > > @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> > > >   {
> > > >   	/*
> > > >   	 * If there are parallel threads are doing PTE changes on same range
> > > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
> > > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > > > -	 * forcefully if we detect parallel PTE batching threads.
> > > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> > > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> > > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> > > > +	 * if we detect parallel PTE batching threads.
> > > > +	 *
> > > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> > > > +	 * needs force flush everything in the given range. Otherwise this
> > > > +	 * may result in having stale TLB entries for some architectures,
> > > > +	 * e.g. aarch64, that could specify flush what level TLB.
> > > >   	 */
> > > > -	if (mm_tlb_flush_nested(tlb->mm)) {
> > > > -		__tlb_reset_range(tlb);
> > > > -		__tlb_adjust_range(tlb, start, end - start);
> > > > +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> > > > +		/*
> > > > +		 * Since we can't tell what we actually should have
> > > > +		 * flushed, flush everything in the given range.
> > > > +		 */
> > > > +		tlb->freed_tables = 1;
> > > > +		tlb->cleared_ptes = 1;
> > > > +		tlb->cleared_pmds = 1;
> > > > +		tlb->cleared_puds = 1;
> > > > +		tlb->cleared_p4ds = 1;
> > > > +
> > > > +		/*
> > > > +		 * Some architectures, e.g. ARM, that have range invalidation
> > > > +		 * and care about VM_EXEC for I-Cache invalidation, need force
> > > > +		 * vma_exec set.
> > > > +		 */
> > > > +		tlb->vma_exec = 1;
> > > > +
> > > > +		/* Force vma_huge clear to guarantee safer flush */
> > > > +		tlb->vma_huge = 0;
> > > > +
> > > > +		tlb->start = start;
> > > > +		tlb->end = end;
> > > >   	}
> > > Whilst I think this is correct, it would be interesting to see whether
> > > or not it's actually faster than just nuking the whole mm, as I mentioned
> > > before.
> > > 
> > > At least in terms of getting a short-term fix, I'd prefer the diff below
> > > if it's not measurably worse.
> > 
> > I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM,
> > it shows slightly slowdown on records/s but much more sys time spent with
> > fullmm flush, the below is the data.
> > 
> >                                     nofullmm                 fullmm
> > ops (records/s)              225606                  225119
> > sys (s)                            0.69                        1.14
> > 
> > It looks the slight reduction of records/s is caused by the increase of sys
> > time.
> 
> That's not what I expected, and I'm unable to explain why moving to fullmm
> would /increase/ the system time. I would've thought the time spent doing
> the invalidation would decrease, with the downside that the TLB is cold
> when returning back to userspace.
> 

I tried ebizzy with various parameters (malloc vs mmap, ran it for hour),
but performance was very similar for both patches.

So, I was looking for workload that would demonstrate the largest difference.
Inspired by python xml-rpc, which can handle each request in new thread,
I tried following [1]:

16 threads, each looping 100k times over:
  mmap(16M)
  touch 1 page
  madvise(DONTNEED)
  munmap(16M)

This yields quite significant difference for 2 patches when running on
my 46 CPU arm host. I checked it twice - applied patch, recompiled, rebooted,
but numbers stayed +- couple seconds the same.

Does it somewhat match your expectation?

v2 patch
---------
real    2m33.460s
user    0m3.359s
sys     15m32.307s

real    2m33.895s
user    0m2.749s
sys     16m34.500s

real    2m35.666s
user    0m3.528s
sys     15m23.377s

real    2m32.898s
user    0m2.789s
sys     16m18.801s

real    2m33.087s
user    0m3.565s
sys     16m23.815s


fullmm version
---------------
real    0m46.811s
user    0m1.596s
sys     1m47.500s

real    0m47.322s
user    0m1.803s
sys     1m48.449s

real    0m46.668s
user    0m1.508s
sys     1m47.352s

real    0m46.742s
user    0m2.007s
sys     1m47.217s

real    0m46.948s
user    0m1.785s
sys     1m47.906s

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap8.c

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-16 15:29       ` Jan Stancek
@ 2019-05-20  2:59         ` Yang Shi
  0 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2019-05-20  2:59 UTC (permalink / raw)
  To: Jan Stancek, Will Deacon
  Cc: peterz, namit, minchan, mgorman, stable, linux-mm, linux-kernel



On 5/16/19 11:29 PM, Jan Stancek wrote:
>
> ----- Original Message -----
>> On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
>>>
>>> On 5/13/19 9:38 AM, Will Deacon wrote:
>>>> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
>>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>>> index 99740e1..469492d 100644
>>>>> --- a/mm/mmu_gather.c
>>>>> +++ b/mm/mmu_gather.c
>>>>> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>>>>    {
>>>>>    	/*
>>>>>    	 * If there are parallel threads are doing PTE changes on same range
>>>>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>>>>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>>>>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>>>> -	 * forcefully if we detect parallel PTE batching threads.
>>>>> +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>>>>> +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>>>>> +	 * and result in having stale TLB entries.  So flush TLB forcefully
>>>>> +	 * if we detect parallel PTE batching threads.
>>>>> +	 *
>>>>> +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>>>>> +	 * needs force flush everything in the given range. Otherwise this
>>>>> +	 * may result in having stale TLB entries for some architectures,
>>>>> +	 * e.g. aarch64, that could specify flush what level TLB.
>>>>>    	 */
>>>>> -	if (mm_tlb_flush_nested(tlb->mm)) {
>>>>> -		__tlb_reset_range(tlb);
>>>>> -		__tlb_adjust_range(tlb, start, end - start);
>>>>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
>>>>> +		/*
>>>>> +		 * Since we can't tell what we actually should have
>>>>> +		 * flushed, flush everything in the given range.
>>>>> +		 */
>>>>> +		tlb->freed_tables = 1;
>>>>> +		tlb->cleared_ptes = 1;
>>>>> +		tlb->cleared_pmds = 1;
>>>>> +		tlb->cleared_puds = 1;
>>>>> +		tlb->cleared_p4ds = 1;
>>>>> +
>>>>> +		/*
>>>>> +		 * Some architectures, e.g. ARM, that have range invalidation
>>>>> +		 * and care about VM_EXEC for I-Cache invalidation, need force
>>>>> +		 * vma_exec set.
>>>>> +		 */
>>>>> +		tlb->vma_exec = 1;
>>>>> +
>>>>> +		/* Force vma_huge clear to guarantee safer flush */
>>>>> +		tlb->vma_huge = 0;
>>>>> +
>>>>> +		tlb->start = start;
>>>>> +		tlb->end = end;
>>>>>    	}
>>>> Whilst I think this is correct, it would be interesting to see whether
>>>> or not it's actually faster than just nuking the whole mm, as I mentioned
>>>> before.
>>>>
>>>> At least in terms of getting a short-term fix, I'd prefer the diff below
>>>> if it's not measurably worse.
>>> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM,
>>> it shows slightly slowdown on records/s but much more sys time spent with
>>> fullmm flush, the below is the data.
>>>
>>>                                      nofullmm                 fullmm
>>> ops (records/s)              225606                  225119
>>> sys (s)                            0.69                        1.14
>>>
>>> It looks the slight reduction of records/s is caused by the increase of sys
>>> time.
>> That's not what I expected, and I'm unable to explain why moving to fullmm
>> would /increase/ the system time. I would've thought the time spent doing
>> the invalidation would decrease, with the downside that the TLB is cold
>> when returning back to userspace.
>>
> I tried ebizzy with various parameters (malloc vs mmap, ran it for hour),
> but performance was very similar for both patches.
>
> So, I was looking for workload that would demonstrate the largest difference.
> Inspired by python xml-rpc, which can handle each request in new thread,
> I tried following [1]:
>
> 16 threads, each looping 100k times over:
>    mmap(16M)
>    touch 1 page
>    madvise(DONTNEED)
>    munmap(16M)
>
> This yields quite significant difference for 2 patches when running on
> my 46 CPU arm host. I checked it twice - applied patch, recompiled, rebooted,
> but numbers stayed +- couple seconds the same.

Thanks for the testing. I'm a little bit surprised by the significant 
difference.

I did the same test on my x86 VM (24 cores), they yield almost same number.

Given the significant improvement on arm64 with fullmm version, I'm 
going to respin the patch.

>
> Does it somewhat match your expectation?
>
> v2 patch
> ---------
> real    2m33.460s
> user    0m3.359s
> sys     15m32.307s
>
> real    2m33.895s
> user    0m2.749s
> sys     16m34.500s
>
> real    2m35.666s
> user    0m3.528s
> sys     15m23.377s
>
> real    2m32.898s
> user    0m2.789s
> sys     16m18.801s
>
> real    2m33.087s
> user    0m3.565s
> sys     16m23.815s
>
>
> fullmm version
> ---------------
> real    0m46.811s
> user    0m1.596s
> sys     1m47.500s
>
> real    0m47.322s
> user    0m1.803s
> sys     1m48.449s
>
> real    0m46.668s
> user    0m1.508s
> sys     1m47.352s
>
> real    0m46.742s
> user    0m2.007s
> sys     1m47.217s
>
> real    0m46.948s
> user    0m1.785s
> sys     1m47.906s
>
> [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap8.c


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-14  7:21   ` Nadav Amit
@ 2019-05-14 11:49     ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-05-14 11:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Jan Stancek, Yang Shi, Will Deacon, minchan, mgorman, stable,
	linux-mm, linux-kernel

On Tue, May 14, 2019 at 07:21:33AM +0000, Nadav Amit wrote:
> > On May 14, 2019, at 12:15 AM, Jan Stancek <jstancek@redhat.com> wrote:

> > Replacing fullmm with need_flush_all, brings the problem back / reproducer hangs.
> 
> Maybe setting need_flush_all does not have the right effect, but setting
> fullmm and then calling __tlb_reset_range() when the PTEs were already
> zapped seems strange.
> 
> fullmm is described as:
> 
>         /*
>          * we are in the middle of an operation to clear
>          * a full mm and can make some optimizations
>          */
> 
> And this not the case.

Correct; starting with fullmm would be wrong. For instance
tlb_start_vma() would do the wrong thing because it assumes the whole mm
is going away. But we're at tlb_finish_mmu() time and there the
difference doesn't matter anymore.

But yes, that's a wee abuse.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
       [not found] <45c6096e-c3e0-4058-8669-75fbba415e07@email.android.com>
  2019-05-14  7:15 ` Jan Stancek
@ 2019-05-14 11:43 ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-05-14 11:43 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Yang Shi, Will Deacon, jstancek, minchan, mgorman, stable,
	linux-mm, linux-kernel

On Tue, May 14, 2019 at 02:01:34AM +0000, Nadav Amit wrote:
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1dd273..cc251422d307 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >         * forcefully if we detect parallel PTE batching threads.
> >         */
> >        if (mm_tlb_flush_nested(tlb->mm)) {
> > +             tlb->fullmm = 1;
> >                __tlb_reset_range(tlb);
> > -             __tlb_adjust_range(tlb, start, end - start);
> > +             tlb->freed_tables = 1;
> >        }
> >
> >        tlb_flush_mmu(tlb);
> 
> 
> I think that this should have set need_flush_all and not fullmm.

Difficult, mmu_gather::need_flush_all is arch specific and not everybody
implements it.

And while mmu_gather::fullmm isn't strictly correct either; we can
(ab)use it here, because at tlb_finish_mmu() time the differences don't
matter anymore.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-14  7:15 ` Jan Stancek
@ 2019-05-14  7:21   ` Nadav Amit
  2019-05-14 11:49     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Nadav Amit @ 2019-05-14  7:21 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Yang Shi, Will Deacon, peterz, minchan, mgorman, stable,
	linux-mm, linux-kernel

> On May 14, 2019, at 12:15 AM, Jan Stancek <jstancek@redhat.com> wrote:
> 
> 
> ----- Original Message -----
>> On May 13, 2019 4:01 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote:
>> 
>> 
>> On 5/13/19 9:38 AM, Will Deacon wrote:
>>> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>>> index 99740e1..469492d 100644
>>>> --- a/mm/mmu_gather.c
>>>> +++ b/mm/mmu_gather.c
>>>> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>>>  {
>>>>      /*
>>>>       * If there are parallel threads are doing PTE changes on same range
>>>> -     * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>>>> -     * flush by batching, a thread has stable TLB entry can fail to flush
>>>> -     * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>>>> -     * forcefully if we detect parallel PTE batching threads.
>>>> +     * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>>>> +     * flush by batching, one thread may end up seeing inconsistent PTEs
>>>> +     * and result in having stale TLB entries.  So flush TLB forcefully
>>>> +     * if we detect parallel PTE batching threads.
>>>> +     *
>>>> +     * However, some syscalls, e.g. munmap(), may free page tables, this
>>>> +     * needs force flush everything in the given range. Otherwise this
>>>> +     * may result in having stale TLB entries for some architectures,
>>>> +     * e.g. aarch64, that could specify flush what level TLB.
>>>>       */
>>>> -    if (mm_tlb_flush_nested(tlb->mm)) {
>>>> -            __tlb_reset_range(tlb);
>>>> -            __tlb_adjust_range(tlb, start, end - start);
>>>> +    if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
>>>> +            /*
>>>> +             * Since we can't tell what we actually should have
>>>> +             * flushed, flush everything in the given range.
>>>> +             */
>>>> +            tlb->freed_tables = 1;
>>>> +            tlb->cleared_ptes = 1;
>>>> +            tlb->cleared_pmds = 1;
>>>> +            tlb->cleared_puds = 1;
>>>> +            tlb->cleared_p4ds = 1;
>>>> +
>>>> +            /*
>>>> +             * Some architectures, e.g. ARM, that have range invalidation
>>>> +             * and care about VM_EXEC for I-Cache invalidation, need
>>>> force
>>>> +             * vma_exec set.
>>>> +             */
>>>> +            tlb->vma_exec = 1;
>>>> +
>>>> +            /* Force vma_huge clear to guarantee safer flush */
>>>> +            tlb->vma_huge = 0;
>>>> +
>>>> +            tlb->start = start;
>>>> +            tlb->end = end;
>>>>      }
>>> Whilst I think this is correct, it would be interesting to see whether
>>> or not it's actually faster than just nuking the whole mm, as I mentioned
>>> before.
>>> 
>>> At least in terms of getting a short-term fix, I'd prefer the diff below
>>> if it's not measurably worse.
>> 
>> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86
>> VM, it shows slightly slowdown on records/s but much more sys time spent
>> with fullmm flush, the below is the data.
>> 
>>                                     nofullmm                 fullmm
>> ops (records/s)              225606                  225119
>> sys (s)                            0.69                        1.14
>> 
>> It looks the slight reduction of records/s is caused by the increase of
>> sys time.
>> 
>>> Will
>>> 
>>> --->8
>>> 
>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>>> index 99740e1dd273..cc251422d307 100644
>>> --- a/mm/mmu_gather.c
>>> +++ b/mm/mmu_gather.c
>>> @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>>>        * forcefully if we detect parallel PTE batching threads.
>>>        */
>>>       if (mm_tlb_flush_nested(tlb->mm)) {
>>> +             tlb->fullmm = 1;
>>>               __tlb_reset_range(tlb);
>>> -             __tlb_adjust_range(tlb, start, end - start);
>>> +             tlb->freed_tables = 1;
>>>       }
>>> 
>>>       tlb_flush_mmu(tlb);
>> 
>> 
>> I think that this should have set need_flush_all and not fullmm.
> 
> Wouldn't that skip the flush?
> 
> If fulmm == 0, then __tlb_reset_range() sets tlb->end = 0.
>  tlb_flush_mmu
>    tlb_flush_mmu_tlbonly
>      if (!tlb->end)
>         return
> 
> Replacing fullmm with need_flush_all, brings the problem back / reproducer hangs.

Maybe setting need_flush_all does not have the right effect, but setting
fullmm and then calling __tlb_reset_range() when the PTEs were already
zapped seems strange.

fullmm is described as:

        /*
         * we are in the middle of an operation to clear
         * a full mm and can make some optimizations
         */

And this not the case.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
       [not found] <45c6096e-c3e0-4058-8669-75fbba415e07@email.android.com>
@ 2019-05-14  7:15 ` Jan Stancek
  2019-05-14  7:21   ` Nadav Amit
  2019-05-14 11:43 ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Stancek @ 2019-05-14  7:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Yang Shi, Will Deacon, peterz, minchan, mgorman, stable,
	linux-mm, linux-kernel, Jan Stancek


----- Original Message -----
> 
> 
> On May 13, 2019 4:01 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote:
> 
> 
> On 5/13/19 9:38 AM, Will Deacon wrote:
> > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> >> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >> index 99740e1..469492d 100644
> >> --- a/mm/mmu_gather.c
> >> +++ b/mm/mmu_gather.c
> >> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >>   {
> >>       /*
> >>        * If there are parallel threads are doing PTE changes on same range
> >> -     * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >> -     * flush by batching, a thread has stable TLB entry can fail to flush
> >> -     * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >> -     * forcefully if we detect parallel PTE batching threads.
> >> +     * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> >> +     * flush by batching, one thread may end up seeing inconsistent PTEs
> >> +     * and result in having stale TLB entries.  So flush TLB forcefully
> >> +     * if we detect parallel PTE batching threads.
> >> +     *
> >> +     * However, some syscalls, e.g. munmap(), may free page tables, this
> >> +     * needs force flush everything in the given range. Otherwise this
> >> +     * may result in having stale TLB entries for some architectures,
> >> +     * e.g. aarch64, that could specify flush what level TLB.
> >>        */
> >> -    if (mm_tlb_flush_nested(tlb->mm)) {
> >> -            __tlb_reset_range(tlb);
> >> -            __tlb_adjust_range(tlb, start, end - start);
> >> +    if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> >> +            /*
> >> +             * Since we can't tell what we actually should have
> >> +             * flushed, flush everything in the given range.
> >> +             */
> >> +            tlb->freed_tables = 1;
> >> +            tlb->cleared_ptes = 1;
> >> +            tlb->cleared_pmds = 1;
> >> +            tlb->cleared_puds = 1;
> >> +            tlb->cleared_p4ds = 1;
> >> +
> >> +            /*
> >> +             * Some architectures, e.g. ARM, that have range invalidation
> >> +             * and care about VM_EXEC for I-Cache invalidation, need
> >> force
> >> +             * vma_exec set.
> >> +             */
> >> +            tlb->vma_exec = 1;
> >> +
> >> +            /* Force vma_huge clear to guarantee safer flush */
> >> +            tlb->vma_huge = 0;
> >> +
> >> +            tlb->start = start;
> >> +            tlb->end = end;
> >>       }
> > Whilst I think this is correct, it would be interesting to see whether
> > or not it's actually faster than just nuking the whole mm, as I mentioned
> > before.
> >
> > At least in terms of getting a short-term fix, I'd prefer the diff below
> > if it's not measurably worse.
> 
> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86
> VM, it shows slightly slowdown on records/s but much more sys time spent
> with fullmm flush, the below is the data.
> 
>                                      nofullmm                 fullmm
> ops (records/s)              225606                  225119
> sys (s)                            0.69                        1.14
> 
> It looks the slight reduction of records/s is caused by the increase of
> sys time.
> 
> >
> > Will
> >
> > --->8
> >
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1dd273..cc251422d307 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >         * forcefully if we detect parallel PTE batching threads.
> >         */
> >        if (mm_tlb_flush_nested(tlb->mm)) {
> > +             tlb->fullmm = 1;
> >                __tlb_reset_range(tlb);
> > -             __tlb_adjust_range(tlb, start, end - start);
> > +             tlb->freed_tables = 1;
> >        }
> >
> >        tlb_flush_mmu(tlb);
> 
> 
> I think that this should have set need_flush_all and not fullmm.
> 

Wouldn't that skip the flush?

If fulmm == 0, then __tlb_reset_range() sets tlb->end = 0.
  tlb_flush_mmu
    tlb_flush_mmu_tlbonly
      if (!tlb->end)
         return

Replacing fullmm with need_flush_all, brings the problem back / reproducer hangs.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-05-20  2:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-09 23:26 [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush Yang Shi
2019-05-13 16:38 ` Will Deacon
2019-05-13 23:01   ` Yang Shi
2019-05-14 14:54     ` Will Deacon
2019-05-14 17:25       ` Yang Shi
2019-05-16 15:29       ` Jan Stancek
2019-05-20  2:59         ` Yang Shi
2019-05-14 11:52   ` Peter Zijlstra
2019-05-14 12:02     ` Will Deacon
     [not found] <45c6096e-c3e0-4058-8669-75fbba415e07@email.android.com>
2019-05-14  7:15 ` Jan Stancek
2019-05-14  7:21   ` Nadav Amit
2019-05-14 11:49     ` Peter Zijlstra
2019-05-14 11:43 ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).