Linux-Fsdevel Archive on lore.kernel.org
* [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode
@ 2020-06-24  8:32 Shaokun Zhang
  2020-08-21 16:02 ` Will Deacon
  2020-09-08 11:37 ` Jan Kara
  0 siblings, 2 replies; 6+ messages in thread
From: Shaokun Zhang @ 2020-06-24  8:32 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: Shaokun Zhang, Will Deacon, Mark Rutland, Peter Zijlstra,
	Alexander Viro, Boqun Feng, Yuqi Jin

get_file_rcu_many(), which is called by __fget_files(), now uses
atomic_try_cmpxchg(), which makes fewer accesses to the shared counter
than atomic_cmpxchg() and so improves the performance of the atomic
operation.

__fget_files() checks @f_mode against a mask and then performs atomic
operations on @f_count, but both members sit on the same cacheline.
When many CPU cores access files concurrently, this causes heavy
contention on @f_count. Moving the two members onto different
cachelines relaxes that contention.

We have tested this on ARM64 and X86, the result is as follows:
Syscall of unixbench has been run on Huawei Kunpeng920 with this patch:
24 x System Call Overhead  1

System Call Overhead                    3160841.4 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    3160841.4   2107.2
                                                                   ========
System Benchmarks Index Score (Partial Only)                         2107.2

Without this patch:
24 x System Call Overhead  1

System Call Overhead                    2222456.0 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    2222456.0   1481.6
                                                                   ========
System Benchmarks Index Score (Partial Only)                         1481.6

And on Intel 6248 platform with this patch:
40 CPUs in system; running 24 parallel copies of tests

System Call Overhead                        4288509.1 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    4288509.1   2859.0
                                                                   ========
System Benchmarks Index Score (Partial Only)                         2859.0

Without this patch:
40 CPUs in system; running 24 parallel copies of tests

System Call Overhead                        3666313.0 lps   (10.0 s, 1 samples)

System Benchmarks Partial Index              BASELINE       RESULT    INDEX
System Call Overhead                          15000.0    3666313.0   2444.2
                                                                   ========
System Benchmarks Index Score (Partial Only)                         2444.2

Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f881a892ea7..0faeab5622fb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -955,7 +955,6 @@ struct file {
 	 */
 	spinlock_t		f_lock;
 	enum rw_hint		f_write_hint;
-	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
 	struct mutex		f_pos_lock;
@@ -979,6 +978,7 @@ struct file {
 	struct address_space	*f_mapping;
 	errseq_t		f_wb_err;
 	errseq_t		f_sb_err; /* for syncfs */
+	atomic_long_t		f_count;
 } __randomize_layout
   __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode
  2020-06-24  8:32 [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode Shaokun Zhang
@ 2020-08-21 16:02 ` Will Deacon
  2020-08-26  7:24   ` Shaokun Zhang
  2020-09-08 11:37 ` Jan Kara
  1 sibling, 1 reply; 6+ messages in thread
From: Will Deacon @ 2020-08-21 16:02 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Mark Rutland, Peter Zijlstra,
	Alexander Viro, Boqun Feng, Yuqi Jin

On Wed, Jun 24, 2020 at 04:32:28PM +0800, Shaokun Zhang wrote:
> get_file_rcu_many(), which is called by __fget_files(), now uses
> atomic_try_cmpxchg(), which makes fewer accesses to the shared counter
> than atomic_cmpxchg() and so improves the performance of the atomic
> operation.
> 
> __fget_files() checks @f_mode against a mask and then performs atomic
> operations on @f_count, but both members sit on the same cacheline.
> When many CPU cores access files concurrently, this causes heavy
> contention on @f_count. Moving the two members onto different
> cachelines relaxes that contention.
> 
> We have tested this on ARM64 and X86, the result is as follows:
> Syscall of unixbench has been run on Huawei Kunpeng920 with this patch:
> 24 x System Call Overhead  1
> 
> System Call Overhead                    3160841.4 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    3160841.4   2107.2
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2107.2
> 
> Without this patch:
> 24 x System Call Overhead  1
> 
> System Call Overhead                    2222456.0 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    2222456.0   1481.6
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         1481.6
> 
> And on Intel 6248 platform with this patch:
> 40 CPUs in system; running 24 parallel copies of tests
> 
> System Call Overhead                        4288509.1 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    4288509.1   2859.0
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2859.0
> 
> Without this patch:
> 40 CPUs in system; running 24 parallel copies of tests
> 
> System Call Overhead                        3666313.0 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    3666313.0   2444.2
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2444.2
> 
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f881a892ea7..0faeab5622fb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -955,7 +955,6 @@ struct file {
>  	 */
>  	spinlock_t		f_lock;
>  	enum rw_hint		f_write_hint;
> -	atomic_long_t		f_count;
>  	unsigned int 		f_flags;
>  	fmode_t			f_mode;
>  	struct mutex		f_pos_lock;
> @@ -979,6 +978,7 @@ struct file {
>  	struct address_space	*f_mapping;
>  	errseq_t		f_wb_err;
>  	errseq_t		f_sb_err; /* for syncfs */
> +	atomic_long_t		f_count;
>  } __randomize_layout
>    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

Hmm. So the microbenchmark numbers look lovely, but:

  - What impact does it actually have for real workloads?
  - How do we avoid regressing performance by innocently changing the struct
    again later on?
  - This thing is tagged with __randomize_layout, so it doesn't help anybody
    using that crazy plugin
  - What about all the other atomics and locks that share cachelines?

Will


* Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode
  2020-08-21 16:02 ` Will Deacon
@ 2020-08-26  7:24   ` Shaokun Zhang
  2020-08-26  8:24     ` Aleksa Sarai
  0 siblings, 1 reply; 6+ messages in thread
From: Shaokun Zhang @ 2020-08-26  7:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-fsdevel, linux-kernel, Mark Rutland, Peter Zijlstra,
	Alexander Viro, Boqun Feng, Yuqi Jin

Hi Will,

On 2020/8/22 0:02, Will Deacon wrote:
> On Wed, Jun 24, 2020 at 04:32:28PM +0800, Shaokun Zhang wrote:
>> get_file_rcu_many(), which is called by __fget_files(), now uses
>> atomic_try_cmpxchg(), which makes fewer accesses to the shared counter
>> than atomic_cmpxchg() and so improves the performance of the atomic
>> operation.
>>
>> __fget_files() checks @f_mode against a mask and then performs atomic
>> operations on @f_count, but both members sit on the same cacheline.
>> When many CPU cores access files concurrently, this causes heavy
>> contention on @f_count. Moving the two members onto different
>> cachelines relaxes that contention.
>>
>> We have tested this on ARM64 and X86, the result is as follows:
>> Syscall of unixbench has been run on Huawei Kunpeng920 with this patch:
>> 24 x System Call Overhead  1
>>
>> System Call Overhead                    3160841.4 lps   (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> System Call Overhead                          15000.0    3160841.4   2107.2
>>                                                                    ========
>> System Benchmarks Index Score (Partial Only)                         2107.2
>>
>> Without this patch:
>> 24 x System Call Overhead  1
>>
>> System Call Overhead                    2222456.0 lps   (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> System Call Overhead                          15000.0    2222456.0   1481.6
>>                                                                    ========
>> System Benchmarks Index Score (Partial Only)                         1481.6
>>
>> And on Intel 6248 platform with this patch:
>> 40 CPUs in system; running 24 parallel copies of tests
>>
>> System Call Overhead                        4288509.1 lps   (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> System Call Overhead                          15000.0    4288509.1   2859.0
>>                                                                    ========
>> System Benchmarks Index Score (Partial Only)                         2859.0
>>
>> Without this patch:
>> 40 CPUs in system; running 24 parallel copies of tests
>>
>> System Call Overhead                        3666313.0 lps   (10.0 s, 1 samples)
>>
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> System Call Overhead                          15000.0    3666313.0   2444.2
>>                                                                    ========
>> System Benchmarks Index Score (Partial Only)                         2444.2
>>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Boqun Feng <boqun.feng@gmail.com>
>> Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>>  include/linux/fs.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 3f881a892ea7..0faeab5622fb 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -955,7 +955,6 @@ struct file {
>>  	 */
>>  	spinlock_t		f_lock;
>>  	enum rw_hint		f_write_hint;
>> -	atomic_long_t		f_count;
>>  	unsigned int 		f_flags;
>>  	fmode_t			f_mode;
>>  	struct mutex		f_pos_lock;
>> @@ -979,6 +978,7 @@ struct file {
>>  	struct address_space	*f_mapping;
>>  	errseq_t		f_wb_err;
>>  	errseq_t		f_sb_err; /* for syncfs */
>> +	atomic_long_t		f_count;
>>  } __randomize_layout
>>    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
> 
> Hmm. So the microbenchmark numbers look lovely, but:

Thanks,

> 
>   - What impact does it actually have for real workloads?

We found this while running the unixbench test. For real workloads, any
application in which many threads open and access the same file should benefit
in the same way unixbench does. For workloads that don't share files like
this, the patch should not cause a regression, because it only changes the
position of @f_count relative to @f_mode.

>   - How do we avoid regressing performance by innocently changing the struct
>     again later on?

We could add a comment on @f_count explaining this placement, though I'm not
sure a comment alone is enough.

>   - This thing is tagged with __randomize_layout, so it doesn't help anybody
>     using that crazy plugin

This patch separates @f_count from @f_mode regardless of the base address of
the structure, so I thought layout randomization wouldn't matter here, but I
may have missed your point.

>   - What about all the other atomics and locks that share cachelines?

An interesting question. To be honest, we found this particular conflict while
profiling with unixbench, and then set out to relax it. For other cases, the
same method may help if profiling reveals the same kind of contention, but
such conflicts can't be detected automatically.
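That said, perf's cache-to-cache analysis can at least point at which lines
are contended, even if choosing the field to move stays manual. A sketch of
such a session (the workload binary name is a placeholder; requires a perf
build with c2c support and appropriate privileges):

```shell
# Record cache-line contention system-wide while the benchmark runs,
# then report the hottest lines and the offsets being hit within them.
perf c2c record -a -- ./syscall_benchmark   # placeholder workload
perf c2c report --stats                     # summary of shared-line traffic
```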

Thanks,
Shaokun

> 
> Will
> 
> .
> 



* Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode
  2020-08-26  7:24   ` Shaokun Zhang
@ 2020-08-26  8:24     ` Aleksa Sarai
  2020-08-27  8:27       ` Shaokun Zhang
  0 siblings, 1 reply; 6+ messages in thread
From: Aleksa Sarai @ 2020-08-26  8:24 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: Will Deacon, linux-fsdevel, linux-kernel, Mark Rutland,
	Peter Zijlstra, Alexander Viro, Boqun Feng, Yuqi Jin


On 2020-08-26, Shaokun Zhang <zhangshaokun@hisilicon.com> wrote:
> On 2020/8/22 0:02, Will Deacon wrote:
> >   - This thing is tagged with __randomize_layout, so it doesn't help anybody
> >     using that crazy plugin
> 
> This patch separates @f_count from @f_mode regardless of the base address of
> the structure, so I thought layout randomization wouldn't matter here, but I
> may have missed your point.

__randomize_layout randomises the order of fields in a structure on each
kernel rebuild (to make attacks against sensitive kernel structures
theoretically harder because the offset of a field is per-build). It is
separate to ASLR or other base-related randomisation. However it depends
on having CONFIG_GCC_PLUGIN_RANDSTRUCT=y and I believe (at least for
distribution kernels) this isn't a widely-used configuration.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



* Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode
  2020-08-26  8:24     ` Aleksa Sarai
@ 2020-08-27  8:27       ` Shaokun Zhang
  0 siblings, 0 replies; 6+ messages in thread
From: Shaokun Zhang @ 2020-08-27  8:27 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Will Deacon, linux-fsdevel, linux-kernel, Mark Rutland,
	Peter Zijlstra, Alexander Viro, Boqun Feng, Yuqi Jin

Hi Aleksa,

On 2020/8/26 16:24, Aleksa Sarai wrote:
> On 2020-08-26, Shaokun Zhang <zhangshaokun@hisilicon.com> wrote:
>> On 2020/8/22 0:02, Will Deacon wrote:
>>>   - This thing is tagged with __randomize_layout, so it doesn't help anybody
>>>     using that crazy plugin
>>
>> This patch separates @f_count from @f_mode regardless of the base address of
>> the structure, so I thought layout randomization wouldn't matter here, but I
>> may have missed your point.
> 
> __randomize_layout randomises the order of fields in a structure on each
> kernel rebuild (to make attacks against sensitive kernel structures
> theoretically harder because the offset of a field is per-build). It is

My bad, I misread Will's comment because of my poor understanding of the plugin.

> separate to ASLR or other base-related randomisation. However it depends
> on having CONFIG_GCC_PLUGIN_RANDSTRUCT=y and I believe (at least for
> distribution kernels) this isn't a widely-used configuration.

Thanks for the further explanation. In our tests this config was also
disabled. With CONFIG_GCC_PLUGIN_RANDSTRUCT=y, this patch would indeed lose
its value. Since that config isn't widely used, hopefully the patch can still
help in the common case.

Thanks,
Shaokun

> 



* Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode
  2020-06-24  8:32 [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode Shaokun Zhang
  2020-08-21 16:02 ` Will Deacon
@ 2020-09-08 11:37 ` Jan Kara
  1 sibling, 0 replies; 6+ messages in thread
From: Jan Kara @ 2020-09-08 11:37 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: linux-fsdevel, linux-kernel, Will Deacon, Mark Rutland,
	Peter Zijlstra, Alexander Viro, Boqun Feng, Yuqi Jin

On Wed 24-06-20 16:32:28, Shaokun Zhang wrote:
> get_file_rcu_many(), which is called by __fget_files(), now uses
> atomic_try_cmpxchg(), which makes fewer accesses to the shared counter
> than atomic_cmpxchg() and so improves the performance of the atomic
> operation.
> 
> __fget_files() checks @f_mode against a mask and then performs atomic
> operations on @f_count, but both members sit on the same cacheline.
> When many CPU cores access files concurrently, this causes heavy
> contention on @f_count. Moving the two members onto different
> cachelines relaxes that contention.

<snip nice unixbench results>

Thanks for the patch! The wins for your microbenchmark heavily sharing
struct file are nice but I'm not sure your change is a universal win. When
struct file is not shared (which is far more common), hot code paths like
__fget() or __fget_light() will now need to fetch two cache lines from
struct file instead of one. So I don't think that for most users the
tradeoff is really worth it...

								Honza

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f881a892ea7..0faeab5622fb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -955,7 +955,6 @@ struct file {
>  	 */
>  	spinlock_t		f_lock;
>  	enum rw_hint		f_write_hint;
> -	atomic_long_t		f_count;
>  	unsigned int 		f_flags;
>  	fmode_t			f_mode;
>  	struct mutex		f_pos_lock;
> @@ -979,6 +978,7 @@ struct file {
>  	struct address_space	*f_mapping;
>  	errseq_t		f_wb_err;
>  	errseq_t		f_sb_err; /* for syncfs */
> +	atomic_long_t		f_count;
>  } __randomize_layout
>    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
>  
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

