LKML Archive on lore.kernel.org
* [RFC] change non-atomic bitops method
@ 2015-02-02  3:55 Wang, Yalin
  2015-02-02 18:53 ` Laura Abbott
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Wang, Yalin @ 2015-02-02  3:55 UTC (permalink / raw)
  To: 'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

This patch changes the non-atomic bitops to test the bit with an
if() condition before setting or clearing it, so that we don't dirty
the cache line when the bit is already set or already clear. On SMP
systems, dirtying a cache line requires invalidating the copies held
by other processors, which has some performance impact.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 include/asm-generic/bitops/non-atomic.h | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
index 697cc2b..e4ef18a 100644
--- a/include/asm-generic/bitops/non-atomic.h
+++ b/include/asm-generic/bitops/non-atomic.h
@@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p  |= mask;
+	if ((*p & mask) == 0)
+		*p  |= mask;
+
 }
 
 static inline void __clear_bit(int nr, volatile unsigned long *addr)
@@ -25,7 +27,8 @@ static inline void __clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p &= ~mask;
+	if ((*p & mask) != 0)
+		*p &= ~mask;
 }
 
 /**
@@ -60,7 +63,8 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old | mask;
+	if ((old & mask) == 0)
+		*p = old | mask;
 	return (old & mask) != 0;
 }
 
@@ -79,7 +83,8 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old & ~mask;
+	if ((old & mask) != 0)
+		*p = old & ~mask;
 	return (old & mask) != 0;
 }
 
-- 
2.2.2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-02  3:55 [RFC] change non-atomic bitops method Wang, Yalin
@ 2015-02-02 18:53 ` Laura Abbott
  2015-02-02 19:31 ` Uwe Kleine-König
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Laura Abbott @ 2015-02-02 18:53 UTC (permalink / raw)
  To: Wang, Yalin, 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On 2/1/2015 7:55 PM, Wang, Yalin wrote:
> This patch change non-atomic bitops,
> add a if() condition to test it, before set/clear the bit.
> so that we don't need dirty the cache line, if this bit
> have been set or clear. On SMP system, dirty cache line will
> need invalidate other processors cache line, this will have
> some impact on SMP systems.
>

Any actual numbers to give an idea of the impact?

Thanks,
Laura


-- 
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-02  3:55 [RFC] change non-atomic bitops method Wang, Yalin
  2015-02-02 18:53 ` Laura Abbott
@ 2015-02-02 19:31 ` Uwe Kleine-König
  2015-02-02 23:29 ` Andrew Morton
  2015-02-03 15:14 ` David Howells
  3 siblings, 0 replies; 23+ messages in thread
From: Uwe Kleine-König @ 2015-02-02 19:31 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Mon, Feb 02, 2015 at 11:55:03AM +0800, Wang, Yalin wrote:
> This patch change non-atomic bitops,
> add a if() condition to test it, before set/clear the bit.
> so that we don't need dirty the cache line, if this bit
> have been set or clear. On SMP system, dirty cache line will
> need invalidate other processors cache line, this will have
> some impact on SMP systems.
> 
> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> ---
>  include/asm-generic/bitops/non-atomic.h | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
> index 697cc2b..e4ef18a 100644
> --- a/include/asm-generic/bitops/non-atomic.h
> +++ b/include/asm-generic/bitops/non-atomic.h
> @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
>  	unsigned long mask = BIT_MASK(nr);
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>  
> -	*p  |= mask;
> +	if ((*p & mask) == 0)
> +		*p  |= mask;
Care to fix the double space here while touching the code?

I think the more natural check here is:

	if ((~*p & mask) != 0)
		*p |= mask;

Might be a matter of taste, but this check is equivalent to

	*p != (*p | mask)

which is what you really want to test for. (Your check only has this
property for values of mask that have a single bit set, which is ok here
of course.)
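
The equivalence can be checked with a small userspace sketch. The
helper names below (would_change, check_inverted, check_plain) are
made up for illustration and are not kernel functions.

```c
#include <stdbool.h>

/* Reference condition: would "word |= mask" actually change word? */
static bool would_change(unsigned long word, unsigned long mask)
{
	return word != (word | mask);
}

/* Check suggested in this reply: some bit of mask is still clear in word. */
static bool check_inverted(unsigned long word, unsigned long mask)
{
	return (~word & mask) != 0;
}

/* Check used in the patch: no bit of mask is set in word. */
static bool check_plain(unsigned long word, unsigned long mask)
{
	return (word & mask) == 0;
}
```

For a single-bit mask all three agree. For a multi-bit mask such as
word = 0x1, mask = 0x3, would_change() and check_inverted() are true
while check_plain() is false.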

> +
>  }
>  
>  static inline void __clear_bit(int nr, volatile unsigned long *addr)
> @@ -25,7 +27,8 @@ static inline void __clear_bit(int nr, volatile unsigned long *addr)
>  	unsigned long mask = BIT_MASK(nr);
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>  
> -	*p &= ~mask;
> +	if ((*p & mask) != 0)
> +		*p &= ~mask;
This is already fine.

>  }
>  
>  /**
> @@ -60,7 +63,8 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>  	unsigned long old = *p;
>  
> -	*p = old | mask;
> +	if ((old & mask) == 0)
> +		*p = old | mask;
Here it would be:

	if ((~old & mask) != 0)
	
>  	return (old & mask) != 0;
>  }

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-02  3:55 [RFC] change non-atomic bitops method Wang, Yalin
  2015-02-02 18:53 ` Laura Abbott
  2015-02-02 19:31 ` Uwe Kleine-König
@ 2015-02-02 23:29 ` Andrew Morton
  2015-02-02 23:31   ` Russell King - ARM Linux
  2015-02-03  1:17   ` Kirill A. Shutemov
  2015-02-03 15:14 ` David Howells
  3 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2015-02-02 23:29 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> This patch change non-atomic bitops,
> add a if() condition to test it, before set/clear the bit.
> so that we don't need dirty the cache line, if this bit
> have been set or clear. On SMP system, dirty cache line will
> need invalidate other processors cache line, this will have
> some impact on SMP systems.
> 
> --- a/include/asm-generic/bitops/non-atomic.h
> +++ b/include/asm-generic/bitops/non-atomic.h
> @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
>  	unsigned long mask = BIT_MASK(nr);
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>  
> -	*p  |= mask;
> +	if ((*p & mask) == 0)
> +		*p  |= mask;
> +
>  }

hm, maybe.

It will speed up set_bit on an already-set bit.  But it will slow down
set_bit on a not-set bit.  And the latter case is presumably much, much
more common.

How do we know the patch is a net performance gain?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-02 23:29 ` Andrew Morton
@ 2015-02-02 23:31   ` Russell King - ARM Linux
  2015-02-03  1:17   ` Kirill A. Shutemov
  1 sibling, 0 replies; 23+ messages in thread
From: Russell King - ARM Linux @ 2015-02-02 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang, Yalin, 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux-arm-kernel@lists.infradead.org'

On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:
> 
> > This patch change non-atomic bitops,
> > add a if() condition to test it, before set/clear the bit.
> > so that we don't need dirty the cache line, if this bit
> > have been set or clear. On SMP system, dirty cache line will
> > need invalidate other processors cache line, this will have
> > some impact on SMP systems.
> > 
> > --- a/include/asm-generic/bitops/non-atomic.h
> > +++ b/include/asm-generic/bitops/non-atomic.h
> > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
> >  	unsigned long mask = BIT_MASK(nr);
> >  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> >  
> > -	*p  |= mask;
> > +	if ((*p & mask) == 0)
> > +		*p  |= mask;
> > +
> >  }
> 
> hm, maybe.
> 
> It will speed up set_bit on an already-set bit.  But it will slow down
> set_bit on a not-set bit.  And the latter case is presumably much, much
> more common.
> 
> How do we know the patch is a net performance gain?

Yes, we do need to know the performance impact of changes like this -
as Laura said in her reply already...

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-02 23:29 ` Andrew Morton
  2015-02-02 23:31   ` Russell King - ARM Linux
@ 2015-02-03  1:17   ` Kirill A. Shutemov
  2015-02-03  2:13     ` Wang, Yalin
  2015-02-03 10:39     ` Kirill A. Shutemov
  1 sibling, 2 replies; 23+ messages in thread
From: Kirill A. Shutemov @ 2015-02-03  1:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang, Yalin, 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:
> 
> > This patch change non-atomic bitops,
> > add a if() condition to test it, before set/clear the bit.
> > so that we don't need dirty the cache line, if this bit
> > have been set or clear. On SMP system, dirty cache line will
> > need invalidate other processors cache line, this will have
> > some impact on SMP systems.
> > 
> > --- a/include/asm-generic/bitops/non-atomic.h
> > +++ b/include/asm-generic/bitops/non-atomic.h
> > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
> >  	unsigned long mask = BIT_MASK(nr);
> >  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> >  
> > -	*p  |= mask;
> > +	if ((*p & mask) == 0)
> > +		*p  |= mask;
> > +
> >  }
> 
> hm, maybe.
> 
> It will speed up set_bit on an already-set bit.  But it will slow down
> set_bit on a not-set bit.  And the latter case is presumably much, much
> more common.
> 
> How do we know the patch is a net performance gain?

Let's try to measure. The micro benchmark:

	#include <stdio.h>
	#include <time.h>
	#include <sys/mman.h>

	#ifdef CACHE_HOT
	#define SIZE (2UL << 20)
	#define TIMES 10000000
	#else
	#define SIZE (1UL << 30)
	#define TIMES 10000
	#endif

	int main(int argc, char **argv)
	{
		struct timespec a, b, diff;
		unsigned long i, *p, times = TIMES;

		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
		
		clock_gettime(CLOCK_MONOTONIC, &a);
		while (times--) {
			for (i = 0; i < SIZE/64/sizeof(*p); i++) {
	#ifdef CHECK_BEFORE_SET
				if (p[i] != times)
	#endif
					p[i] = times;
			}
		}
		clock_gettime(CLOCK_MONOTONIC, &b);

		diff.tv_sec = b.tv_sec - a.tv_sec;
		if (a.tv_nsec > b.tv_nsec) {
			diff.tv_sec--;
			diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
		} else
			diff.tv_nsec = b.tv_nsec - a.tv_nsec;

		printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
		return 0;
	}

Results for 10 runs on my laptop -- i5-3427U (Ivy Bridge, 1.8 GHz,
2.8 GHz Turbo, 3 MB LLC):

				Avg (s)		Stddev (s)
baseline			21.5351		0.5315
-DCHECK_BEFORE_SET		21.9834		0.0789
-DCACHE_HOT			14.9987		0.0365
-DCACHE_HOT -DCHECK_BEFORE_SET	29.9010		0.0204

The difference between -DCACHE_HOT and -DCACHE_HOT -DCHECK_BEFORE_SET
appears huge, but if you convert it to CPU cycles per inner-loop
iteration @ 2.8 GHz, it's 1.02530 and 2.04401 CPU cycles respectively.

Basically, the check is free on a decent CPU.
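
The cycle figures can be reproduced from the table with a few lines of
arithmetic. This is only a sketch: the function names are hypothetical,
it assumes the core ran at a constant 2.8 GHz for the whole run, and it
assumes sizeof(unsigned long) is 8 (x86-64).

```c
/*
 * Convert a wall-clock time into CPU cycles per inner-loop iteration.
 * Assumes the core ran at a constant clock_hz for the whole run.
 */
static double cycles_per_iteration(double seconds, double clock_hz,
				   double iterations)
{
	return seconds * clock_hz / iterations;
}

/* Total inner-loop iterations for the CACHE_HOT build of the benchmark. */
static double cache_hot_iterations(void)
{
	const double size = (double)(2UL << 20);	/* SIZE */
	const double times = 10000000.0;		/* TIMES */

	/* Loop bound is SIZE/64/sizeof(*p), with sizeof(*p) assumed to be 8. */
	return size / 64.0 / 8.0 * times;
}
```

With 14.9987 s and 29.9010 s as inputs, cycles_per_iteration() gives
roughly 1.0253 and 2.0440 cycles, matching the numbers quoted above.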

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] change non-atomic bitops method
  2015-02-03  1:17   ` Kirill A. Shutemov
@ 2015-02-03  2:13     ` Wang, Yalin
  2015-02-03  5:42       ` Wang, Yalin
  2015-02-03 10:39     ` Kirill A. Shutemov
  1 sibling, 1 reply; 23+ messages in thread
From: Wang, Yalin @ 2015-02-03  2:13 UTC (permalink / raw)
  To: 'Kirill A. Shutemov', Andrew Morton
  Cc: 'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Kirill A. Shutemov [mailto:kirill@shutemov.name]
> Sent: Tuesday, February 03, 2015 9:18 AM
> To: Andrew Morton
> Cc: Wang, Yalin; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org'; 'linux-
> kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
> 
> On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> > On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin"
> <Yalin.Wang@sonymobile.com> wrote:
> >
> > > This patch change non-atomic bitops,
> > > add a if() condition to test it, before set/clear the bit.
> > > so that we don't need dirty the cache line, if this bit
> > > have been set or clear. On SMP system, dirty cache line will
> > > need invalidate other processors cache line, this will have
> > > some impact on SMP systems.
> > >
> > > --- a/include/asm-generic/bitops/non-atomic.h
> > > +++ b/include/asm-generic/bitops/non-atomic.h
> > > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile
> unsigned long *addr)
> > >  	unsigned long mask = BIT_MASK(nr);
> > >  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> > >
> > > -	*p  |= mask;
> > > +	if ((*p & mask) == 0)
> > > +		*p  |= mask;
> > > +
> > >  }
> >
> > hm, maybe.
> >
> > It will speed up set_bit on an already-set bit.  But it will slow down
> > set_bit on a not-set bit.  And the latter case is presumably much, much
> > more common.
> >
> > How do we know the patch is a net performance gain?
> 
> Let's try to measure. The micro benchmark:
> 
> 	#include <stdio.h>
> 	#include <time.h>
> 	#include <sys/mman.h>
> 
> 	#ifdef CACHE_HOT
> 	#define SIZE (2UL << 20)
> 	#define TIMES 10000000
> 	#else
> 	#define SIZE (1UL << 30)
> 	#define TIMES 10000
> 	#endif
> 
> 	int main(int argc, char **argv)
> 	{
> 		struct timespec a, b, diff;
> 		unsigned long i, *p, times = TIMES;
> 
> 		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> 				MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1,
> 0);
> 
> 		clock_gettime(CLOCK_MONOTONIC, &a);
> 		while (times--) {
> 			for (i = 0; i < SIZE/64/sizeof(*p); i++) {
> 	#ifdef CHECK_BEFORE_SET
> 				if (p[i] != times)
> 	#endif
> 					p[i] = times;
> 			}
> 		}
> 		clock_gettime(CLOCK_MONOTONIC, &b);
> 
> 		diff.tv_sec = b.tv_sec - a.tv_sec;
> 		if (a.tv_nsec > b.tv_nsec) {
> 			diff.tv_sec--;
> 			diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
> 		} else
> 			diff.tv_nsec = b.tv_nsec - a.tv_nsec;
> 
> 		printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
> 		return 0;
> 	}
> 
> Results for 10 runs on my laptop -- i5-3427U (IvyBridge 1.8 Ghz, 2.8Ghz
> Turbo
> with 3MB LLC):
> 
> 				Avg		Stddev
> baseline			21.5351		0.5315
> -DCHECK_BEFORE_SET		21.9834		0.0789
> -DCACHE_HOT			14.9987		0.0365
> -DCACHE_HOT -DCHECK_BEFORE_SET	29.9010		0.0204
> 
> Difference between -DCACHE_HOT and -DCACHE_HOT -DCHECK_BEFORE_SET appears
> huge, but if you recalculate it to CPU cycles per inner loop @ 2.8 Ghz,
> it's 1.02530 and 2.04401 CPU cycles respectively.
> 
> Basically, the check is free on decent CPU.
> 
Awesome test, but it only measures the CPU that runs this code. It does
not consider the other CPUs, whose cache lines will be invalidated when
the line is dirtied by the writer CPU. So another test should run two
threads bound to two different CPUs, one writing and one reading, to see
the impact on the reader CPU.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] change non-atomic bitops method
  2015-02-03  2:13     ` Wang, Yalin
@ 2015-02-03  5:42       ` Wang, Yalin
  2015-02-03  6:38         ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Wang, Yalin @ 2015-02-03  5:42 UTC (permalink / raw)
  To: 'Kirill A. Shutemov', 'Andrew Morton'
  Cc: 'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Wang, Yalin
> Sent: Tuesday, February 03, 2015 10:13 AM
> To: 'Kirill A. Shutemov'; Andrew Morton
> Cc: 'arnd@arndb.de'; 'linux-arch@vger.kernel.org'; 'linux-
> kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: RE: [RFC] change non-atomic bitops method
> 
> > -----Original Message-----
> > From: Kirill A. Shutemov [mailto:kirill@shutemov.name]
> > Sent: Tuesday, February 03, 2015 9:18 AM
> > To: Andrew Morton
> > Cc: Wang, Yalin; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org'; 'linux-
> > kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> > kernel@lists.infradead.org'
> > Subject: Re: [RFC] change non-atomic bitops method
> >
> > On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> > > On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin"
> > <Yalin.Wang@sonymobile.com> wrote:
> > >
> > > > This patch change non-atomic bitops,
> > > > add a if() condition to test it, before set/clear the bit.
> > > > so that we don't need dirty the cache line, if this bit
> > > > have been set or clear. On SMP system, dirty cache line will
> > > > need invalidate other processors cache line, this will have
> > > > some impact on SMP systems.
> > > >
> > > > --- a/include/asm-generic/bitops/non-atomic.h
> > > > +++ b/include/asm-generic/bitops/non-atomic.h
> > > > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile
> > unsigned long *addr)
> > > >  	unsigned long mask = BIT_MASK(nr);
> > > >  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> > > >
> > > > -	*p  |= mask;
> > > > +	if ((*p & mask) == 0)
> > > > +		*p  |= mask;
> > > > +
> > > >  }
> > >
> > > hm, maybe.
> > >
> > > It will speed up set_bit on an already-set bit.  But it will slow down
> > > set_bit on a not-set bit.  And the latter case is presumably much, much
> > > more common.
> > >
> > > How do we know the patch is a net performance gain?
> >
> > Let's try to measure. The micro benchmark:
> >
> > 	#include <stdio.h>
> > 	#include <time.h>
> > 	#include <sys/mman.h>
> >
> > 	#ifdef CACHE_HOT
> > 	#define SIZE (2UL << 20)
> > 	#define TIMES 10000000
> > 	#else
> > 	#define SIZE (1UL << 30)
> > 	#define TIMES 10000
> > 	#endif
> >
> > 	int main(int argc, char **argv)
> > 	{
> > 		struct timespec a, b, diff;
> > 		unsigned long i, *p, times = TIMES;
> >
> > 		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> > 				MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1,
> > 0);
> >
> > 		clock_gettime(CLOCK_MONOTONIC, &a);
> > 		while (times--) {
> > 			for (i = 0; i < SIZE/64/sizeof(*p); i++) {
> > 	#ifdef CHECK_BEFORE_SET
> > 				if (p[i] != times)
> > 	#endif
> > 					p[i] = times;
> > 			}
> > 		}
> > 		clock_gettime(CLOCK_MONOTONIC, &b);
> >
> > 		diff.tv_sec = b.tv_sec - a.tv_sec;
> > 		if (a.tv_nsec > b.tv_nsec) {
> > 			diff.tv_sec--;
> > 			diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
> > 		} else
> > 			diff.tv_nsec = b.tv_nsec - a.tv_nsec;
> >
> > 		printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
> > 		return 0;
> > 	}
> >
> > Results for 10 runs on my laptop -- i5-3427U (IvyBridge 1.8 Ghz, 2.8Ghz
> > Turbo
> > with 3MB LLC):
> >
> > 				Avg		Stddev
> > baseline			21.5351		0.5315
> > -DCHECK_BEFORE_SET		21.9834		0.0789
> > -DCACHE_HOT			14.9987		0.0365
> > -DCACHE_HOT -DCHECK_BEFORE_SET	29.9010		0.0204
> >
> > Difference between -DCACHE_HOT and -DCACHE_HOT -DCHECK_BEFORE_SET appears
> > huge, but if you recalculate it to CPU cycles per inner loop @ 2.8 Ghz,
> > it's 1.02530 and 2.04401 CPU cycles respectively.
> >
> > Basically, the check is free on decent CPU.
> >
> Awesome test, but you only test the one cpu which running this code,
> Have not consider the other CPUs, whose cache line will be invalidate if
> The cache is dirtied by writer CPU,
> So another test should be running 2 thread on two different CPUs(bind to
> CPU),
> One write , one read, to see the impact on the reader CPU.
I made a small change to your test program, adding a new thread to
measure the SMP cache impact.
---
#define _GNU_SOURCE	/* must come before any #include to expose CPU_SET() etc. */
#include <stdio.h>
#include <string.h>	/* strerror() */
#include <time.h>
#include <sys/mman.h>
#include <errno.h>
#include <sched.h>
#include <pthread.h>

#ifdef CACHE_HOT
#define SIZE (2UL << 20)
#define TIMES 100000
#else
#define SIZE (1UL << 20)
#define TIMES 10000
#endif
static void *reader_thread(void *arg)
{

	struct timespec a, b, diff;
	unsigned long *p = arg;
	volatile unsigned long temp;
	unsigned long i, times = TIMES;
	int ret;	/* signed, so the error check below can fire */
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	/* pid 0 means the calling thread */
	ret = sched_setaffinity(0, sizeof(cpu_set_t), &set);
	if (ret < 0) {
		printf("sched_setaffinity error:%s\n", strerror(errno));
	}
	clock_gettime(CLOCK_MONOTONIC, &a);
	while (times--) {
		for (i = 0; i < SIZE/sizeof(*p); i++) {
				temp = p[i];
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	diff.tv_sec = b.tv_sec - a.tv_sec;
	if (a.tv_nsec > b.tv_nsec) {
		diff.tv_sec--;
		diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;

	} else
		diff.tv_nsec = b.tv_nsec - a.tv_nsec;

	printf("reader:%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
	return NULL;
}

int main(int argc, char **argv)
{
	struct timespec a, b, diff;
	unsigned long i, *p, times = TIMES;
	int ret;	/* signed, so the error check below can fire */
	pthread_t thread;
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	/* pid 0 means the calling thread */
	ret = sched_setaffinity(0, sizeof(cpu_set_t), &set);
	if (ret < 0) {
		printf("sched_setaffinity error:%s\n", strerror(errno));
	}
	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			MAP_LOCKED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	pthread_create(&thread, NULL, reader_thread, p);
	clock_gettime(CLOCK_MONOTONIC, &a);
	while (times--) {
		for (i = 0; i < SIZE/sizeof(*p); i++) {
#ifdef CHECK_BEFORE_SET
			if (p[i] != times)
#endif
				p[i] = times;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	diff.tv_sec = b.tv_sec - a.tv_sec;
	if (a.tv_nsec > b.tv_nsec) {
		diff.tv_sec--;
		diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;

	} else
		diff.tv_nsec = b.tv_nsec - a.tv_nsec;

	printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
	/* wait for the reader so its result is printed before we exit */
	pthread_join(thread, NULL);
	return 0;
}
----
The writer runs on CPU0 and the reader thread runs on CPU1.
Test result:
sudo ./cache_test
reader:8.426228173
8.672198335

With -DCHECK_BEFORE_SET
sudo ./cache_test_check
reader:7.537036819
10.799746531

You can see that the reader saves some time if the cache line is not
dirtied. We can also see that the writer takes longer, because it needs
to read the data before changing it.

I think that if the system has lots of cores, the reader-side
improvement is more useful.

My CPU info:

28851195@cnbjlx20570:~/test$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 37
model name      : Intel(R) Core(TM) i5 CPU         660  @ 3.33GHz
stepping        : 5
microcode       : 0x2
cpu MHz         : 1199.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4

Thanks very much for your test program!










^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-03  5:42       ` Wang, Yalin
@ 2015-02-03  6:38         ` Andrew Morton
  2015-02-03  7:03           ` Wang, Yalin
                             ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andrew Morton @ 2015-02-03  6:38 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Tue, 3 Feb 2015 13:42:45 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:
>
> ...
>
> #ifdef CHECK_BEFORE_SET
> 			if (p[i] != times)
> #endif
>
> ...
>
> ----
> One run on CPU0, reader thread run on CPU1,
> Test result:
> sudo ./cache_test
> reader:8.426228173
> 8.672198335
> 
> With -DCHECK_BEFORE_SET
> sudo ./cache_test_check
> reader:7.537036819
> 10.799746531
> 

You aren't measuring the right thing.  You should compare

	if (p[i] != x)
		p[i] = x;

versus

	p[i] = x;

and you should do this for two cases:

a) p[i] == x

b) p[i] != x


The first code sequence will be slower when (p[i] != x) and faster when
(p[i] == x).


Next, we should instrument the kernel to work out the frequency of
set_bit on an already-set bit.

It is only with both these ratios that we can work out whether the
patch is a net gain.  My suspicion is that set_bit on an already-set
bit is so rare that the patch will be a loss.
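
One way to combine the two ratios is a simple expected-cost model. The
sketch below is only an illustration: the function names are
hypothetical, and the per-call costs have to come from measurements
such as the ones requested above.

```c
/*
 * Expected per-call cost of the check-before-set variant.
 *   hit_ratio: fraction of calls where the bit already has the target value
 *   cost_hit:  cost of a call that only loads and tests (no store)
 *   cost_miss: cost of a call that loads, tests and then stores
 */
static double expected_checked_cost(double hit_ratio, double cost_hit,
				    double cost_miss)
{
	return hit_ratio * cost_hit + (1.0 - hit_ratio) * cost_miss;
}

/*
 * Hit ratio above which the checked variant beats the unconditional
 * store, whose per-call cost is cost_plain. Assumes cost_miss > cost_hit.
 */
static double break_even_hit_ratio(double cost_plain, double cost_hit,
				   double cost_miss)
{
	return (cost_miss - cost_plain) / (cost_miss - cost_hit);
}
```

With made-up costs of 1.0 (plain store), 0.5 (hit) and 2.0 (miss), the
break-even hit ratio is about 0.67: the check only pays off if roughly
two thirds of the calls find the bit already set. These numbers are
placeholders, not measurements.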

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] change non-atomic bitops method
  2015-02-03  6:38         ` Andrew Morton
@ 2015-02-03  7:03           ` Wang, Yalin
  2015-02-03  8:42             ` Wang, Yalin
  2015-02-03  8:40           ` David Miller
  2015-02-03  9:34           ` Rasmus Villemoes
  2 siblings, 1 reply; 23+ messages in thread
From: Wang, Yalin @ 2015-02-03  7:03 UTC (permalink / raw)
  To: 'Andrew Morton'
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Tuesday, February 03, 2015 2:39 PM
> To: Wang, Yalin
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
> 
> On Tue, 3 Feb 2015 13:42:45 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com>
> wrote:
> >
> > ...
> >
> > #ifdef CHECK_BEFORE_SET
> > 			if (p[i] != times)
> > #endif
> >
> > ...
> >
> > ----
> > One run on CPU0, reader thread run on CPU1,
> > Test result:
> > sudo ./cache_test
> > reader:8.426228173
> > 8.672198335
> >
> > With -DCHECK_BEFORE_SET
> > sudo ./cache_test_check
> > reader:7.537036819
> > 10.799746531
> >
> 
> You aren't measuring the right thing.  You should compare
> 
> 	if (p[i] != x)
> 		p[i] = x;
> 
> versus
> 
> 	p[i] = x;
> 
> and you should do this for two cases:
> 
> a) p[i] == x
> 
> b) p[i] != x
> 
> 
> The first code sequence will be slower when (p[i] != x) and faster when
> (p[i] == x).
> 
> 
> Next, we should instrument the kernel to work out the frequency of
> set_bit on an already-set bit.
> 
> It is only with both these ratios that we can work out whether the
> patch is a net gain.  My suspicion is that set_bit on an already-set
> bit is so rare that the patch will be a loss.
I see, let's change the test a little:
1)
	memset(p, 0, SIZE);
	if (p[i] != 0)
		p[i] = 0;  // never called

	#sudo ./cache_test_check
	6.698153838
	reader:7.529402625


2)
	memset(p, 0, SIZE);
	if (p[i] == 0)
		p[i] = 0; // always called
	#sudo ./cache_test_check
	reader:7.895421311
	9.000889973
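
The two cases can be written as a pair of helpers that also count how
many stores were issued. This is a sketch, not the code that produced
the timings above; the function names are hypothetical, and the pointer
is volatile so the compiler cannot drop the redundant store in case 2.

```c
#include <stddef.h>

/* Case 1: store only when the word differs from x. */
static size_t store_if_different(volatile unsigned long *p, size_t n,
				 unsigned long x)
{
	size_t stores = 0;

	for (size_t i = 0; i < n; i++) {
		if (p[i] != x) {
			p[i] = x;
			stores++;
		}
	}
	return stores;
}

/*
 * Case 2: inverted check, so a buffer already filled with x pays for
 * both the load/compare and the store on every word.
 */
static size_t store_if_equal(volatile unsigned long *p, size_t n,
			     unsigned long x)
{
	size_t stores = 0;

	for (size_t i = 0; i < n; i++) {
		if (p[i] == x) {
			p[i] = x;
			stores++;
		}
	}
	return stores;
}
```

On a zero-filled buffer with x = 0, store_if_different() issues no
stores and store_if_equal() issues one store per word.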

Thanks





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] change non-atomic bitops method
  2015-02-03  6:38         ` Andrew Morton
  2015-02-03  7:03           ` Wang, Yalin
@ 2015-02-03  8:40           ` David Miller
  2015-02-03  8:48             ` Andrew Morton
  2015-02-03  9:34           ` Rasmus Villemoes
  2 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2015-02-03  8:40 UTC (permalink / raw)
  To: akpm
  Cc: Yalin.Wang, kirill, arnd, linux-arch, linux-kernel, linux,
	linux-arm-kernel

From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 2 Feb 2015 22:38:51 -0800

> It is only with both these ratios that we can work out whether the
> patch is a net gain.  My suspicion is that set_bit on an already-set
> bit is so rare that the patch will be a loss.

A common pattern is implementing a "referenced" bit, and in that case
the bit is often already set, and in such a scenario the proposed
change is a huge win.
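
A minimal userspace sketch of that pattern, assuming a plain bitmap of
unsigned long words. mark_referenced() and BITS_PER_WORD are names
invented for this example, not kernel interfaces.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Set the "referenced" bit nr in bitmap only if it is still clear.
 * Returns true if a store was issued, false if the bit was already set
 * and the word (and its cache line) was left untouched.
 */
static bool mark_referenced(unsigned long *bitmap, unsigned int nr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_WORD);
	unsigned long *p = bitmap + nr / BITS_PER_WORD;

	if (*p & mask)
		return false;	/* already referenced: no write */
	*p |= mask;
	return true;
}
```

Only the first call for a given bit writes to memory; every later call
on the same bit is a load and a compare.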


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] change non-atomic bitops method
  2015-02-03  7:03           ` Wang, Yalin
@ 2015-02-03  8:42             ` Wang, Yalin
  2015-02-03 10:59               ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Wang, Yalin @ 2015-02-03  8:42 UTC (permalink / raw)
  To: 'Andrew Morton'
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Wang, Yalin
> Sent: Tuesday, February 03, 2015 3:04 PM
> To: 'Andrew Morton'
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: RE: [RFC] change non-atomic bitops method
> 
> > -----Original Message-----
> > From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > Sent: Tuesday, February 03, 2015 2:39 PM
> > To: Wang, Yalin
> > Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> > 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> > kernel@lists.infradead.org'
> > Subject: Re: [RFC] change non-atomic bitops method
> >
> > On Tue, 3 Feb 2015 13:42:45 +0800 "Wang, Yalin"
> <Yalin.Wang@sonymobile.com>
> > wrote:
> > >
> > > ...
> > >
> > > #ifdef CHECK_BEFORE_SET
> > > 			if (p[i] != times)
> > > #endif
> > >
> > > ...
> > >
> > > ----
> > > One run on CPU0, reader thread run on CPU1,
> > > Test result:
> > > sudo ./cache_test
> > > reader:8.426228173
> > > 8.672198335
> > >
> > > With -DCHECK_BEFORE_SET
> > > sudo ./cache_test_check
> > > reader:7.537036819
> > > 10.799746531
> > >
> >
> > You aren't measuring the right thing.  You should compare
> >
> > 	if (p[i] != x)
> > 		p[i] = x;
> >
> > versus
> >
> > 	p[i] = x;
> >
> > and you should do this for two cases:
> >
> > a) p[i] == x
> >
> > b) p[i] != x
> >
> >
> > The first code sequence will be slower when (p[i] != x) and faster when
> > (p[i] == x).
> >
> >
> > Next, we should instrument the kernel to work out the frequency of
> > set_bit on an already-set bit.
> >
> > It is only with both these ratios that we can work out whether the
> > patch is a net gain.  My suspicion is that set_bit on an already-set
> > bit is so rare that the patch will be a loss.
> I see, let's change the test a little:
> 1)
> 	memset(p, 0, SIZE);
> 	if (p[i] != 0)
> 		p[i] = 0;  // never called
> 
> 	#sudo ./cache_test_check
> 	6.698153838
> 	reader:7.529402625
> 
> 
> 2)
> 	memset(p, 0, SIZE);
> 	if (p[i] == 0)
> 		p[i] = 0; // always called
> 	#sudo ./cache_test_check
> 	reader:7.895421311
> 	9.000889973
> 
> Thanks
> 
> 
I made a change in the kernel to measure the hit/miss ratio:
---
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 80e4645..a82937b 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -2,6 +2,7 @@
 #include <linux/hugetlb.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/mmzone.h>
@@ -15,6 +16,41 @@
 #include <asm/pgtable.h>
 #include "internal.h"
 
+atomic_t __set_bit_success_count = ATOMIC_INIT(0);
+atomic_t __set_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__set_bit_success_count);
+EXPORT_SYMBOL_GPL(__set_bit_miss_count);
+
+atomic_t __clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t __clear_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__clear_bit_success_count);
+EXPORT_SYMBOL_GPL(__clear_bit_miss_count);
+
+atomic_t __test_and_set_bit_success_count = ATOMIC_INIT(0);
+atomic_t __test_and_set_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__test_and_set_bit_success_count);
+EXPORT_SYMBOL_GPL(__test_and_set_bit_miss_count);
+
+atomic_t __test_and_clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t __test_and_clear_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__test_and_clear_bit_success_count);
+EXPORT_SYMBOL_GPL(__test_and_clear_bit_miss_count);
+
+/*
+ * atomic bitops
+ */
+atomic_t set_bit_success_count = ATOMIC_INIT(0);
+atomic_t set_bit_miss_count = ATOMIC_INIT(0);
+
+atomic_t clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t clear_bit_miss_count = ATOMIC_INIT(0);
+
+atomic_t test_and_set_bit_success_count = ATOMIC_INIT(0);
+atomic_t test_and_set_bit_miss_count = ATOMIC_INIT(0);
+
+atomic_t test_and_clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t test_and_clear_bit_miss_count = ATOMIC_INIT(0);
+
 void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
 {
 }
@@ -165,6 +201,18 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		   HPAGE_PMD_NR)
 #endif
 		);
+	seq_printf(m,   "__set_bit_miss_count:%d __set_bit_success_count:%d\n"
+			"__clear_bit_miss_count:%d __clear_bit_success_count:%d\n"
+			"__test_and_set_bit_miss_count:%d __test_and_set_bit_success_count:%d\n"
+			"__test_and_clear_bit_miss_count:%d __test_and_clear_bit_success_count:%d\n",
+			atomic_read(&__set_bit_miss_count), atomic_read(&__set_bit_success_count),
+			atomic_read(&__clear_bit_miss_count), atomic_read(&__clear_bit_success_count),
+
+			atomic_read(&__test_and_set_bit_miss_count),
+			atomic_read(&__test_and_set_bit_success_count),
+
+			atomic_read(&__test_and_clear_bit_miss_count),
+			atomic_read(&__test_and_clear_bit_success_count));
 
 	hugetlb_report_meminfo(m);
 
diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
index 697cc2b..1895133 100644
--- a/include/asm-generic/bitops/non-atomic.h
+++ b/include/asm-generic/bitops/non-atomic.h
@@ -2,7 +2,18 @@
 #define _ASM_GENERIC_BITOPS_NON_ATOMIC_H_
 
 #include <asm/types.h>
+#include <asm/atomic.h>
+extern atomic_t __set_bit_success_count;
+extern atomic_t __set_bit_miss_count;
 
+extern atomic_t __clear_bit_success_count;
+extern atomic_t __clear_bit_miss_count;
+
+extern atomic_t __test_and_set_bit_success_count;
+extern atomic_t __test_and_set_bit_miss_count;
+
+extern atomic_t __test_and_clear_bit_success_count;
+extern atomic_t __test_and_clear_bit_miss_count;
 /**
  * __set_bit - Set a bit in memory
  * @nr: the bit to set
@@ -17,7 +28,13 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p  |= mask;
+	if ((*p & mask) == 0) {
+		atomic_inc(&__set_bit_success_count);
+		*p  |= mask;
+	} else {
+		atomic_inc(&__set_bit_miss_count);
+	}
+
 }
 
 static inline void __clear_bit(int nr, volatile unsigned long *addr)
@@ -25,7 +42,12 @@ static inline void __clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p &= ~mask;
+	if ((*p & mask) != 0) {
+		atomic_inc(&__clear_bit_success_count);
+		*p &= ~mask;
+	} else {
+		atomic_inc(&__clear_bit_miss_count);
+	}
 }
 
 /**
@@ -60,7 +82,12 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old | mask;
+	if ((old & mask) == 0) {
+		atomic_inc(&__test_and_set_bit_success_count);
+		*p = old | mask;
+	} else {
+		atomic_inc(&__test_and_set_bit_miss_count);
+	}
 	return (old & mask) != 0;
 }
 
@@ -79,7 +106,12 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old & ~mask;
+	if ((old & mask) != 0) {
+		atomic_inc(&__test_and_clear_bit_success_count);
+		*p = old & ~mask;
+	} else {
+		atomic_inc(&__test_and_clear_bit_miss_count);
+	}
 	return (old & mask) != 0;
 }
---
After using the phone for some time:
root@D5303:/ # cat /proc/meminfo
VmallocUsed:       10348 kB
VmallocChunk:      75632 kB
__set_bit_miss_count:10002 __set_bit_success_count:1096661
__clear_bit_miss_count:359484 __clear_bit_success_count:3674617
__test_and_set_bit_miss_count:7 __test_and_set_bit_success_count:221
__test_and_clear_bit_miss_count:924611 __test_and_clear_bit_success_count:193

__test_and_clear_bit has a very high miss rate.
In fact, I think the atomic versions of set/clear/test_and_set(clear)_bit
could also be investigated to see their miss ratios.
I have not tested the atomic versions,
because they reside in architecture-specific code.
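
For reference, the miss ratio of each operation is miss / (miss + success).
A minimal userspace sketch of that arithmetic follows; the helper names
(miss_ratio_percent, report, report_all) are made up for illustration, and
the figures are copied from the /proc/meminfo output above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: percentage of calls that found the bit already in
 * the requested state (counted as a "miss" by the instrumentation above).
 */
double miss_ratio_percent(unsigned long miss, unsigned long success)
{
	unsigned long total = miss + success;

	return total ? 100.0 * (double)miss / (double)total : 0.0;
}

/* Print one row of the summary. */
void report(const char *name, unsigned long miss, unsigned long success)
{
	printf("%-22s %6.2f%%\n", name, miss_ratio_percent(miss, success));
}

/* Counter values copied from the /proc/meminfo output above. */
void report_all(void)
{
	report("__set_bit", 10002, 1096661);
	report("__clear_bit", 359484, 3674617);
	report("__test_and_set_bit", 7, 221);
	report("__test_and_clear_bit", 924611, 193);
}
```

Calling report_all() from main() prints the four ratios: __clear_bit comes
out at roughly 9% and __test_and_clear_bit above 99.9%.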

Thanks














* Re: [RFC] change non-atomic bitops method
  2015-02-03  8:40           ` David Miller
@ 2015-02-03  8:48             ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2015-02-03  8:48 UTC (permalink / raw)
  To: David Miller
  Cc: Yalin.Wang, kirill, arnd, linux-arch, linux-kernel, linux,
	linux-arm-kernel

On Tue, 03 Feb 2015 00:40:31 -0800 (PST) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Mon, 2 Feb 2015 22:38:51 -0800
> 
> > It is only with both these ratios that we can work out whether the
> > patch is a net gain.  My suspicion is that set_bit on an already-set
> > bit is so rare that the patch will be a loss.
> 
> A common pattern is implementing a "referenced" bit, and in that case
> the bit is often already set, and in such a scenario the proposed
> change is a huge win.

pagecache, dcache and icache already perform this optimisation (and
only pagecache uses bitops for it anyway).  I'm not sure what's left.

But there's really no point in speculating about this - it's trivial to
instrument the kernel and get real numbers.


* Re: [RFC] change non-atomic bitops method
  2015-02-03  6:38         ` Andrew Morton
  2015-02-03  7:03           ` Wang, Yalin
  2015-02-03  8:40           ` David Miller
@ 2015-02-03  9:34           ` Rasmus Villemoes
  2015-02-03  9:41             ` Wang, Yalin
  2 siblings, 1 reply; 23+ messages in thread
From: Rasmus Villemoes @ 2015-02-03  9:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang, Yalin, 'Kirill A. Shutemov',
	'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Tue, Feb 03 2015, Andrew Morton <akpm@linux-foundation.org> wrote:

>
> You aren't measuring the right thing.  You should compare
>
> 	if (p[i] != x)
> 		p[i] = x;
>
> versus
>
> 	p[i] = x;
>
> and you should do this for two cases:
>
> a) p[i] == x
>
> b) p[i] != x
>
>
> The first code sequence will be slower when (p[i] != x) and faster when
> (p[i] == x).
>
>
> Next, we should instrument the kernel to work out the frequency of
> set_bit on an already-set bit.
>
> It is only with both these ratios that we can work out whether the
> patch is a net gain.  My suspicion is that set_bit on an already-set
> bit is so rare that the patch will be a loss.

There's also the code-bloat issue to consider (instruction cache and all
that); the conditional versions will usually require three extra
instructions and an extra register. Also, the cache line might already
be dirty because of something in the surrounding code. Instruction cache
misses and larger stack footprint (from larger register pressure) won't
show up in a microbenchmark, so I think this needs a real-world example
to justify.
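
One way to look at the code-size cost directly is to compile both variants
as standalone functions and compare the generated assembly (for example
with gcc -O2 -S). A minimal userspace sketch, under these assumptions:
set_bit_plain and set_bit_checked are names made up here, and BITS_PER_LONG,
BIT_MASK and BIT_WORD are local stand-ins for the kernel macros.

```c
#include <limits.h>

/* Userspace stand-ins for the kernel macros of the same name. */
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BIT_MASK(nr)	(1UL << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr)	((nr) / BITS_PER_LONG)

/* Unconditional read-modify-write, like the current __set_bit(). */
void set_bit_plain(unsigned int nr, unsigned long *addr)
{
	addr[BIT_WORD(nr)] |= BIT_MASK(nr);
}

/* Conditional variant from the RFC: store only if the bit is clear. */
void set_bit_checked(unsigned int nr, unsigned long *addr)
{
	unsigned long mask = BIT_MASK(nr);
	unsigned long *p = addr + BIT_WORD(nr);

	if ((*p & mask) == 0)
		*p |= mask;
}
```

The difference in instruction count depends on the compiler and the
architecture, so this only shows how to check, not what the answer is.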

But even if one finds some hot spot that would benefit from the
conditional, that should simply be added explicitly there, instead of
pessimizing every other user. (A good example of that is 358eec18243a
("vfs: decrapify dput(), fix cache behavior under normal load")).

Rasmus


* RE: [RFC] change non-atomic bitops method
  2015-02-03  9:34           ` Rasmus Villemoes
@ 2015-02-03  9:41             ` Wang, Yalin
  0 siblings, 0 replies; 23+ messages in thread
From: Wang, Yalin @ 2015-02-03  9:41 UTC (permalink / raw)
  To: 'Rasmus Villemoes', Andrew Morton
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Rasmus Villemoes [mailto:linux@rasmusvillemoes.dk]
> Sent: Tuesday, February 03, 2015 5:34 PM
> To: Andrew Morton
> Cc: Wang, Yalin; 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-
> arch@vger.kernel.org'; 'linux-kernel@vger.kernel.org';
> 'linux@arm.linux.org.uk'; 'linux-arm-kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
> 
> On Tue, Feb 03 2015, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> >
> > You aren't measuring the right thing.  You should compare
> >
> > 	if (p[i] != x)
> > 		p[i] = x;
> >
> > versus
> >
> > 	p[i] = x;
> >
> > and you should do this for two cases:
> >
> > a) p[i] == x
> >
> > b) p[i] != x
> >
> >
> > The first code sequence will be slower when (p[i] != x) and faster when
> > (p[i] == x).
> >
> >
> > Next, we should instrument the kernel to work out the frequency of
> > set_bit on an already-set bit.
> >
> > It is only with both these ratios that we can work out whether the
> > patch is a net gain.  My suspicion is that set_bit on an already-set
> > bit is so rare that the patch will be a loss.
> 
> There's also the code-bloat issue to consider (instruction cache and all
> that); the conditional versions will usually require three extra
> instructions and an extra register. Also, the cache line might already
> be dirty because of something in the surrounding code. Instruction cache
> misses and larger stack footprint (from larger register pressure) won't
> show up in a microbenchmark, so I think this needs a real-world example
> to justify.
> 
> But even if one finds some hot spot that would benefit from the
> conditional, that should simply be added explicitly there, instead of
> pessimizing every other user. (A good example of that is 358eec18243a
> ("vfs: decrapify dput(), fix cache behavior under normal load")).

Oh, thank you, it is really a very nice example.


* Re: [RFC] change non-atomic bitops method
  2015-02-03  1:17   ` Kirill A. Shutemov
  2015-02-03  2:13     ` Wang, Yalin
@ 2015-02-03 10:39     ` Kirill A. Shutemov
  1 sibling, 0 replies; 23+ messages in thread
From: Kirill A. Shutemov @ 2015-02-03 10:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang, Yalin, 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

On Tue, Feb 03, 2015 at 03:17:30AM +0200, Kirill A. Shutemov wrote:
> Results for 10 runs on my laptop -- i5-3427U (IvyBridge 1.8 GHz, 2.8 GHz Turbo
> with 3MB LLC):

I screwed up the inner loop condition and step. As a result, the benchmark
touched the same cache line 8 times and scanned only SIZE/8 of the memory.
The fixed test is attached.
				Avg		Stddev
baseline			14.0663		0.0182
-DCHECK_BEFORE_SET		13.8594		0.0458
-DCACHE_HOT			12.3896		0.0867
-DCACHE_HOT -DCHECK_BEFORE_SET	11.7480		0.2497

And now it's faster *with* the check. Sometimes the CPU is just too clever. ;)
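
To put the averages above in relative terms, here is a small sketch of the
arithmetic. The function names are made up for illustration; the numbers
are the Avg column of the table above.

```c
#include <stdio.h>

/* Relative improvement of the checked variant over its baseline, in percent. */
double relative_gain_percent(double baseline, double with_check)
{
	return 100.0 * (baseline - with_check) / baseline;
}

/* Averages copied from the table above. */
void print_gains(void)
{
	printf("cache cold: %.2f%%\n", relative_gain_percent(14.0663, 13.8594));
	printf("cache hot:  %.2f%%\n", relative_gain_percent(12.3896, 11.7480));
}
```

That works out to about 1.5% for the cache-cold case and about 5.2% for the
cache-hot one. Note that the stddev of the last row (0.2497) is a sizeable
fraction of the 0.64 difference it is measuring.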

-- 
 Kirill A. Shutemov

[-- Attachment #2: test.c --]
[-- Type: text/plain, Size: 901 bytes --]

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#ifdef CACHE_HOT
#define SIZE (2UL << 20)
#define TIMES 100000
#else
#define SIZE (1UL << 30)
#define TIMES 100
#endif

#define CACHE_LINE 64

int main(int argc, char **argv)
{
	struct timespec a, b, diff;
	unsigned long i, *p, times = TIMES;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
	
	clock_gettime(CLOCK_MONOTONIC, &a);
	while (times--) {
		for (i = 0; i < SIZE / sizeof(*p);
				i += CACHE_LINE / sizeof(*p)) {
#ifdef CHECK_BEFORE_SET
			if (p[i] != times)
#endif
				p[i] = times;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	diff.tv_sec = b.tv_sec - a.tv_sec;
	if (a.tv_nsec > b.tv_nsec) {
		diff.tv_sec--;
		diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
	} else
		diff.tv_nsec = b.tv_nsec - a.tv_nsec;

	printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
	return 0;
}


* Re: [RFC] change non-atomic bitops method
  2015-02-03  8:42             ` Wang, Yalin
@ 2015-02-03 10:59               ` Andrew Morton
  2015-02-09  8:18                 ` Wang, Yalin
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2015-02-03 10:59 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Tue, 3 Feb 2015 16:42:14 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> I make a change in kernel to test hit/miss ratio:

Neat, thanks.

>
> ...
>
> After use the phone some time:
> root@D5303:/ # cat /proc/meminfo
> VmallocUsed:       10348 kB
> VmallocChunk:      75632 kB
> __set_bit_miss_count:10002 __set_bit_success_count:1096661
> __clear_bit_miss_count:359484 __clear_bit_success_count:3674617
> __test_and_set_bit_miss_count:7 __test_and_set_bit_success_count:221
> __test_and_clear_bit_miss_count:924611 __test_and_clear_bit_success_count:193
> 
> __test_and_clear_bit_miss_count has a very high miss rate.
> In fact, I think set/clear/test_and_set(clear)_bit atomic version can also
> Be investigated to see its miss ratio,
> I have not tested the atomic version,
> Because it reside in different architectures.

Hopefully misses in test_and_X_bit are not a problem.  The CPU
implementation would be pretty stupid to go and dirty the cacheline
when it knows it didn't change anything.  But maybe I'm wrong about
that.  

That we're running clear_bit against a cleared bit 10% of the time is a
bit alarming.  I wonder where that's coming from.

The enormous miss count in test_and_clear_bit() might indicate an
inefficiency somewhere.


* Re: [RFC] change non-atomic bitops method
  2015-02-02  3:55 [RFC] change non-atomic bitops method Wang, Yalin
                   ` (2 preceding siblings ...)
  2015-02-02 23:29 ` Andrew Morton
@ 2015-02-03 15:14 ` David Howells
  2015-02-03 19:10   ` Uwe Kleine-König
  3 siblings, 1 reply; 23+ messages in thread
From: David Howells @ 2015-02-03 15:14 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: dhowells, Wang, Yalin, 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

Uwe Kleine-König  wrote:

> Might be a matter of taste, but this check is equivalent to
> 
> 	*p != (*p | mask)
> 
> which is what you really want to test for.

I would argue that this is less clear as to what's going on.

David


* Re: [RFC] change non-atomic bitops method
  2015-02-03 15:14 ` David Howells
@ 2015-02-03 19:10   ` Uwe Kleine-König
  0 siblings, 0 replies; 23+ messages in thread
From: Uwe Kleine-König @ 2015-02-03 19:10 UTC (permalink / raw)
  To: David Howells
  Cc: 'linux-arch@vger.kernel.org',
	'linux@arm.linux.org.uk',
	Wang, Yalin, 'arnd@arndb.de',
	'linux-kernel@vger.kernel.org',
	'linux-arm-kernel@lists.infradead.org'

Hello,

[added some more context again]

On Tue, Feb 03, 2015 at 03:14:43PM +0000, David Howells wrote:
> > > -     *p  |= mask;
> > > +     if ((*p & mask) == 0)
> > > +             *p  |= mask;
> > Care to fix the double space here while touching the code?
> > 
> > I think the more natural check here is:
> > 
> >         if ((~*p & mask) != 0)
> >                 *p |= mask;
> >
> > Might be a matter of taste, but this check is equivalent to
> > 
> > 	*p != (*p | mask)
> > 
> > which is what you really want to test for.
> I would argue that this is less clear as to what's going on.
OK, I admit that this equivalence is not obvious. Then maybe let the
compiler find the equivalence and do:

-	*p  |= mask;
+	if (*p != (*p | mask))
+		*p |= mask;

?
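
The equivalence of the three conditions for a single-bit mask can be
checked by brute force. A small userspace sketch (all function names here
are made up for illustration); it also shows that the first form stops
agreeing with the other two once the mask has more than one bit set, which
cannot happen with BIT_MASK():

```c
#include <stdbool.h>

/* Condition from the original patch. */
bool cond_and_zero(unsigned long v, unsigned long mask)
{
	return (v & mask) == 0;
}

/* First suggested alternative. */
bool cond_not_and(unsigned long v, unsigned long mask)
{
	return (~v & mask) != 0;
}

/* "Would the store change anything?" form. */
bool cond_or_differs(unsigned long v, unsigned long mask)
{
	return v != (v | mask);
}

/* True if all three conditions agree for every 8-bit value of v. */
bool conditions_agree(unsigned long mask)
{
	unsigned long v;

	for (v = 0; v < 256; v++) {
		bool a = cond_and_zero(v, mask);
		bool b = cond_not_and(v, mask);
		bool c = cond_or_differs(v, mask);

		if (a != b || b != c)
			return false;
	}
	return true;
}
```

conditions_agree() returns true for any single-bit mask and false for a
mask such as 0x3, where v = 1 has one of the two bits set.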

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |


* RE: [RFC] change non-atomic bitops method
  2015-02-03 10:59               ` Andrew Morton
@ 2015-02-09  8:18                 ` Wang, Yalin
  2015-02-09 20:34                   ` Andrew Morton
  2015-02-09 21:42                   ` Rasmus Villemoes
  0 siblings, 2 replies; 23+ messages in thread
From: Wang, Yalin @ 2015-02-09  8:18 UTC (permalink / raw)
  To: 'Andrew Morton'
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Tuesday, February 03, 2015 6:59 PM
> To: Wang, Yalin
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
> 
> On Tue, 3 Feb 2015 16:42:14 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com>
> wrote:
> 
> > I make a change in kernel to test hit/miss ratio:
> 
> Neat, thanks.
> 
> >
> > ...
> >
> > After use the phone some time:
> > root@D5303:/ # cat /proc/meminfo
> > VmallocUsed:       10348 kB
> > VmallocChunk:      75632 kB
> > __set_bit_miss_count:10002 __set_bit_success_count:1096661
> > __clear_bit_miss_count:359484 __clear_bit_success_count:3674617
> > __test_and_set_bit_miss_count:7 __test_and_set_bit_success_count:221
> > __test_and_clear_bit_miss_count:924611 __test_and_clear_bit_success_count:193
> >
> > __test_and_clear_bit_miss_count has a very high miss rate.
> > In fact, I think set/clear/test_and_set(clear)_bit atomic version can also
> > Be investigated to see its miss ratio,
> > I have not tested the atomic version,
> > Because it reside in different architectures.
> 
> Hopefully misses in test_and_X_bit are not a problem.  The CPU
> implementation would be pretty stupid to go and dirty the cacheline
> when it knows it didn't change anything.  But maybe I'm wrong about
> that.
> 
> That we're running clear_bit against a cleared bit 10% of the time is a
> bit alarming.  I wonder where that's coming from.
> 
> The enormous miss count in test_and_clear_bit() might indicate an
> inefficiency somewhere.
I re-tested the patch on a 3.10 kernel.
The results look like this:

VmallocChunk:   251498164 kB
__set_bit_miss_count:11730 __set_bit_success_count:1036316
__clear_bit_miss_count:209640 __clear_bit_success_count:4806556
__test_and_set_bit_miss_count:0 __test_and_set_bit_success_count:121
__test_and_clear_bit_miss_count:0 __test_and_clear_bit_success_count:445

The __clear_bit miss rate is a little high.
I checked the log, and most of the misses come from this code:

<6>[  442.701798] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
<6>[  442.701805] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
<6>[  442.701813] [<ffffffc0003126ac>] __alloc_fd+0xc8/0x124
<6>[  442.701821] [<ffffffc000312768>] get_unused_fd_flags+0x28/0x34
<6>[  442.701828] [<ffffffc0002f9370>] do_sys_open+0x10c/0x1c0
<6>[  442.701835] [<ffffffc0002f9458>] SyS_openat+0xc/0x18
This is in __clear_close_on_exec(fd, fdt);



<6>[  442.695354] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
<6>[  442.695359] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
<6>[  442.695367] [<ffffffc000312340>] dup_fd+0x1d4/0x280
<6>[  442.695375] [<ffffffc00021b07c>] copy_process.part.56+0x42c/0xe38
<6>[  442.695382] [<ffffffc00021bb9c>] do_fork+0xe0/0x360
<6>[  442.695389] [<ffffffc00021beb4>] SyS_clone+0x10/0x1c
This is in __clear_open_fd(open_files - i, new_fdt);

Do we need test_bit() before clear_bit() at these two places?

Thanks






* Re: [RFC] change non-atomic bitops method
  2015-02-09  8:18                 ` Wang, Yalin
@ 2015-02-09 20:34                   ` Andrew Morton
  2015-02-10  7:05                     ` Wang, Yalin
  2015-02-09 21:42                   ` Rasmus Villemoes
  1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2015-02-09 20:34 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Mon, 9 Feb 2015 16:18:10 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> > That we're running clear_bit against a cleared bit 10% of the time is a
> > bit alarming.  I wonder where that's coming from.
> > 
> > The enormous miss count in test_and_clear_bit() might indicate an
> > inefficiency somewhere.
> I te-test the patch on 3.10 kernel.
> The result like this:
> 
> VmallocChunk:   251498164 kB
> __set_bit_miss_count:11730 __set_bit_success_count:1036316
> __clear_bit_miss_count:209640 __clear_bit_success_count:4806556
> __test_and_set_bit_miss_count:0 __test_and_set_bit_success_count:121
> __test_and_clear_bit_miss_count:0 __test_and_clear_bit_success_count:445
> 
> __clear_bit miss rate is a little high,
> I check the log, and most miss coming from this code:
> 
> <6>[  442.701798] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> <6>[  442.701805] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> <6>[  442.701813] [<ffffffc0003126ac>] __alloc_fd+0xc8/0x124
> <6>[  442.701821] [<ffffffc000312768>] get_unused_fd_flags+0x28/0x34
> <6>[  442.701828] [<ffffffc0002f9370>] do_sys_open+0x10c/0x1c0
> <6>[  442.701835] [<ffffffc0002f9458>] SyS_openat+0xc/0x18
> In __clear_close_on_exec(fd, fdt);
> 
> 
> 
> <6>[  442.695354] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> <6>[  442.695359] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> <6>[  442.695367] [<ffffffc000312340>] dup_fd+0x1d4/0x280
> <6>[  442.695375] [<ffffffc00021b07c>] copy_process.part.56+0x42c/0xe38
> <6>[  442.695382] [<ffffffc00021bb9c>] do_fork+0xe0/0x360
> <6>[  442.695389] [<ffffffc00021beb4>] SyS_clone+0x10/0x1c
> In __clear_open_fd(open_files - i, new_fdt);
> 
> Do we need test_bit() before clear_bit()at these 2 place?

I don't know.  I was happily typing in this:

diff -puN include/linux/bitops.h~a include/linux/bitops.h
--- a/include/linux/bitops.h~a
+++ a/include/linux/bitops.h
@@ -226,5 +226,37 @@ extern unsigned long find_last_bit(const
 				   unsigned long size);
 #endif
 
+/**
+ * __set_clear_bit - non-atomically set a bit if it is presently clear
+ * @nr: The bit number
+ * @addr: The base address of the operation
+ *
+ * __set_clear_bit() and similar functions avoid unnecessarily dirtying a
+ * cacheline when the operation will have no effect.
+ */
+static inline void __set_clear_bit(unsigned nr, volatile unsigned long *addr)
+{
+	if (!test_bit(nr, addr))
+		__set_bit(nr, addr);
+}
+
+static inline void __clear_set_bit(unsigned nr, volatile unsigned long *addr)
+{
+	if (test_bit(nr, addr))
+		__clear_bit(nr, addr);
+}
+
+static inline void set_clear_bit(unsigned nr, volatile unsigned long *addr)
+{
+	if (!test_bit(nr, addr))
+		set_bit(nr, addr);
+}
+
+static inline void clear_set_bit(unsigned nr, volatile unsigned long *addr)
+{
+	if (test_bit(nr, addr))
+		clear_bit(nr, addr);
+}
+
 #endif /* __KERNEL__ */
 #endif

(maybe __set_bit_if_clear would be a better name)

But I don't know if it will do anything useful.  The CPU *should* be
able to avoid dirtying the cacheline on its own: it has all the info it
needs to know that no writeback will be needed.  But I don't know which
(if any) CPUs perform this optimisation.


* Re: [RFC] change non-atomic bitops method
  2015-02-09  8:18                 ` Wang, Yalin
  2015-02-09 20:34                   ` Andrew Morton
@ 2015-02-09 21:42                   ` Rasmus Villemoes
  1 sibling, 0 replies; 23+ messages in thread
From: Rasmus Villemoes @ 2015-02-09 21:42 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Andrew Morton', 'Kirill A. Shutemov',
	'arnd@arndb.de', 'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

On Mon, Feb 09 2015, "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> I te-test the patch on 3.10 kernel.
> The result like this:
>
> VmallocChunk:   251498164 kB
> __set_bit_miss_count:11730 __set_bit_success_count:1036316
> __clear_bit_miss_count:209640 __clear_bit_success_count:4806556
> __test_and_set_bit_miss_count:0 __test_and_set_bit_success_count:121
> __test_and_clear_bit_miss_count:0 __test_and_clear_bit_success_count:445
>
> __clear_bit miss rate is a little high,
> I check the log, and most miss coming from this code:
>
> <6>[  442.701798] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> <6>[  442.701805] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> <6>[  442.701813] [<ffffffc0003126ac>] __alloc_fd+0xc8/0x124
> <6>[  442.701821] [<ffffffc000312768>] get_unused_fd_flags+0x28/0x34
> <6>[  442.701828] [<ffffffc0002f9370>] do_sys_open+0x10c/0x1c0
> <6>[  442.701835] [<ffffffc0002f9458>] SyS_openat+0xc/0x18
> In __clear_close_on_exec(fd, fdt);
>
>
>
> <6>[  442.695354] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> <6>[  442.695359] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> <6>[  442.695367] [<ffffffc000312340>] dup_fd+0x1d4/0x280
> <6>[  442.695375] [<ffffffc00021b07c>] copy_process.part.56+0x42c/0xe38
> <6>[  442.695382] [<ffffffc00021bb9c>] do_fork+0xe0/0x360
> <6>[  442.695389] [<ffffffc00021beb4>] SyS_clone+0x10/0x1c
> In __clear_open_fd(open_files - i, new_fdt);
>
> Do we need test_bit() before clear_bit()at these 2 place?
>

In the second case, new_fdt->open_fds has just been filled by a
memcpy, and no-one can possibly have written to that cache line in the
meantime. 

In the first case, testing is also likely wasteful if fdt->max_fds is
less than half the number of bits in a cacheline (fdt->close_on_exec and
fdt->open_fds are always contiguous, and the latter is unconditionally
written to).
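
As a rough illustration of the threshold in the first case, here is a small
sketch of the arithmetic. The function name and the 64-byte cache line used
in the usage note are assumptions made for the example, not something
measured in this thread.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true if max_fds is less than half the number of
 * bits in a cache line of line_bytes bytes, i.e. the condition under which
 * the contiguous open_fds and close_on_exec bitmaps are likely to sit in
 * the same cache line (exact placement also depends on alignment).
 */
bool likely_same_cacheline(unsigned int max_fds, unsigned int line_bytes)
{
	unsigned int line_bits = line_bytes * CHAR_BIT;

	return max_fds < line_bits / 2;
}
```

Assuming 64-byte lines, a line holds 512 bits, so the condition holds for
any max_fds below 256: likely_same_cacheline(64, 64) is true and
likely_same_cacheline(1024, 64) is false.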

Rasmus


* RE: [RFC] change non-atomic bitops method
  2015-02-09 20:34                   ` Andrew Morton
@ 2015-02-10  7:05                     ` Wang, Yalin
  0 siblings, 0 replies; 23+ messages in thread
From: Wang, Yalin @ 2015-02-10  7:05 UTC (permalink / raw)
  To: 'Andrew Morton'
  Cc: 'Kirill A. Shutemov', 'arnd@arndb.de',
	'linux-arch@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	'linux@arm.linux.org.uk',
	'linux-arm-kernel@lists.infradead.org'

> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Tuesday, February 10, 2015 4:34 AM
> To: Wang, Yalin
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
> 
> On Mon, 9 Feb 2015 16:18:10 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com>
> wrote:
> 
> > > That we're running clear_bit against a cleared bit 10% of the time is a
> > > bit alarming.  I wonder where that's coming from.
> > >
> > > The enormous miss count in test_and_clear_bit() might indicate an
> > > inefficiency somewhere.
> > I te-test the patch on 3.10 kernel.
> > The result like this:
> >
> > VmallocChunk:   251498164 kB
> > __set_bit_miss_count:11730 __set_bit_success_count:1036316
> > __clear_bit_miss_count:209640 __clear_bit_success_count:4806556
> > __test_and_set_bit_miss_count:0 __test_and_set_bit_success_count:121
> > __test_and_clear_bit_miss_count:0 __test_and_clear_bit_success_count:445
> >
> > __clear_bit miss rate is a little high,
> > I check the log, and most miss coming from this code:
> >
> > <6>[  442.701798] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> > <6>[  442.701805] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> > <6>[  442.701813] [<ffffffc0003126ac>] __alloc_fd+0xc8/0x124
> > <6>[  442.701821] [<ffffffc000312768>] get_unused_fd_flags+0x28/0x34
> > <6>[  442.701828] [<ffffffc0002f9370>] do_sys_open+0x10c/0x1c0
> > <6>[  442.701835] [<ffffffc0002f9458>] SyS_openat+0xc/0x18
> > In __clear_close_on_exec(fd, fdt);
> >
> >
> >
> > <6>[  442.695354] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> > <6>[  442.695359] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> > <6>[  442.695367] [<ffffffc000312340>] dup_fd+0x1d4/0x280
> > <6>[  442.695375] [<ffffffc00021b07c>] copy_process.part.56+0x42c/0xe38
> > <6>[  442.695382] [<ffffffc00021bb9c>] do_fork+0xe0/0x360
> > <6>[  442.695389] [<ffffffc00021beb4>] SyS_clone+0x10/0x1c
> > In __clear_open_fd(open_files - i, new_fdt);
> >
> > Do we need test_bit() before clear_bit()at these 2 place?
> 
> I don't know.  I was happily typing in this:
> 
> diff -puN include/linux/bitops.h~a include/linux/bitops.h
> --- a/include/linux/bitops.h~a
> +++ a/include/linux/bitops.h
> @@ -226,5 +226,37 @@ extern unsigned long find_last_bit(const
>  				   unsigned long size);
>  #endif
> 
> +/**
> + * __set_clear_bit - non-atomically set a bit if it is presently clear
> + * @nr: The bit number
> + * @addr: The base address of the operation
> + *
> + * __set_clear_bit() and similar functions avoid unnecessarily dirtying a
> + * cacheline when the operation will have no effect.
> + */
> +static inline void __set_clear_bit(unsigned nr, volatile unsigned long *addr)
> +{
> +	if (!test_bit(nr, addr))
> +		__set_bit(nr, addr);
> +}
> +
> +static inline void __clear_set_bit(unsigned nr, volatile unsigned long *addr)
> +{
> +	if (test_bit(nr, addr))
> +		__clear_bit(nr, addr);
> +}
> +
> +static inline void set_clear_bit(unsigned nr, volatile unsigned long *addr)
> +{
> +	if (!test_bit(nr, addr))
> +		set_bit(nr, addr);
> +}
> +
> +static inline void clear_set_bit(unsigned nr, volatile unsigned long *addr)
> +{
> +	if (test_bit(nr, addr))
> +		clear_bit(nr, addr);
> +}
> +
>  #endif /* __KERNEL__ */
>  #endif
> 
> (maybe __set_bit_if_clear would be a better name)
> 
> But I don't know if it will do anything useful.  The CPU *should* be
> able to avoid dirtying the cacheline on its own: it has all the info it
> needs to know that no writeback will be needed.  But I don't know which
> (if any) CPUs perform this optimisation.
I will send a new patch for your review.

Thanks



Thread overview: 23+ messages
2015-02-02  3:55 [RFC] change non-atomic bitops method Wang, Yalin
2015-02-02 18:53 ` Laura Abbott
2015-02-02 19:31 ` Uwe Kleine-König
2015-02-02 23:29 ` Andrew Morton
2015-02-02 23:31   ` Russell King - ARM Linux
2015-02-03  1:17   ` Kirill A. Shutemov
2015-02-03  2:13     ` Wang, Yalin
2015-02-03  5:42       ` Wang, Yalin
2015-02-03  6:38         ` Andrew Morton
2015-02-03  7:03           ` Wang, Yalin
2015-02-03  8:42             ` Wang, Yalin
2015-02-03 10:59               ` Andrew Morton
2015-02-09  8:18                 ` Wang, Yalin
2015-02-09 20:34                   ` Andrew Morton
2015-02-10  7:05                     ` Wang, Yalin
2015-02-09 21:42                   ` Rasmus Villemoes
2015-02-03  8:40           ` David Miller
2015-02-03  8:48             ` Andrew Morton
2015-02-03  9:34           ` Rasmus Villemoes
2015-02-03  9:41             ` Wang, Yalin
2015-02-03 10:39     ` Kirill A. Shutemov
2015-02-03 15:14 ` David Howells
2015-02-03 19:10   ` Uwe Kleine-König
