LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* x86: should clear_user() have alternatives?
@ 2022-02-08  5:45 Hugh Dickins
  2022-02-08 11:45 ` David Laight
  2022-02-08 12:52 ` Borislav Petkov
  0 siblings, 2 replies; 6+ messages in thread
From: Hugh Dickins @ 2022-02-08  5:45 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Peter Zijlstra, linux-kernel, x86

Hi Boris,

I've noticed that clear_user() is slower than it need be:

dd if=/dev/zero of=/dev/null bs=1M count=1M
1099511627776 bytes (1.1 TB) copied, 45.9641 s, 23.9 GB/s
whereas with the hacked patch below
1099511627776 bytes (1.1 TB) copied, 33.4 s, 32.9 GB/s

That was on some Intel machine: IIRC an AMD went faster.

It's because clear_user() lacks alternatives, and uses a
nowadays suboptimal implementation; whereas clear_page()
and copy_user() do support alternatives.

I came across this because reading a sparse file on tmpfs uses
copy_page_to_iter() from the ZERO_PAGE(), and I wanted to change that
to iov_iter_zero(), which sounds obviously faster - but in fact slower.

I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
tmpfs, are not prime candidates for optimization; and I've no idea how
much clear_user() normally gets used for long clears.

If I were capable of compiler asm, with alternatives, and knew at what
length ERMS becomes advantageous when clearing, I would be sending you
a proper patch.  As it is, I'm hoping to tempt someone else to do the
work!  Or reject it as too niche to bother with.

Thanks,
Hugh

Hacked patch against 5.16 (5.17-rc is a little different there):

 arch/x86/lib/copy_user_64.S |   26 ++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c  |   36 +-----------------------------------
 2 files changed, 27 insertions(+), 35 deletions(-)

--- 5.16/arch/x86/lib/copy_user_64.S	2022-01-09 14:55:34.000000000 -0800
+++ erms/arch/x86/lib/copy_user_64.S	2022-01-22 16:57:21.156968363 -0800
@@ -392,3 +392,29 @@ SYM_FUNC_START(__copy_user_nocache)
 	_ASM_EXTABLE_CPY(41b, .L_fixup_1b_copy)
 SYM_FUNC_END(__copy_user_nocache)
 EXPORT_SYMBOL(__copy_user_nocache)
+
+/*
+ * Recent CPUs have added enhanced REP MOVSB/STOSB instructions.
+ * It's recommended to use enhanced REP MOVSB/STOSB if it's enabled.
+ * Assume that's best for __clear_user(), until alternatives are provided
+ * (though would be better to avoid REP STOSB for short clears, if no FSRM).
+ *
+ * Input:
+ * rdi destination
+ * rsi count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+SYM_FUNC_START(__clear_user)
+	ASM_STAC
+	movl %esi,%ecx
+	xorq %rax,%rax
+1:	rep stosb
+2:	movl %ecx,%eax
+	ASM_CLAC
+	ret
+
+	_ASM_EXTABLE_UA(1b, 2b)
+SYM_FUNC_END(__clear_user)
+EXPORT_SYMBOL(__clear_user)
--- 5.16/arch/x86/lib/usercopy_64.c	2020-12-13 14:41:30.000000000 -0800
+++ erms/arch/x86/lib/usercopy_64.c	2022-01-20 18:40:04.125752474 -0800
@@ -13,43 +13,9 @@
 /*
  * Zero Userspace
  */
-
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-		".section .fixup,\"ax\"\n"
-		"3:	lea 0(%[size1],%[size8],8),%[size8]\n"
-		"	jmp 2b\n"
-		".previous\n"
-		_ASM_EXTABLE_UA(0b, 3b)
-		_ASM_EXTABLE_UA(1b, 2b)
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
 unsigned long clear_user(void __user *to, unsigned long n)
 {
+	might_fault();
 	if (access_ok(to, n))
 		return __clear_user(to, n);
 	return n;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: x86: should clear_user() have alternatives?
  2022-02-08  5:45 x86: should clear_user() have alternatives? Hugh Dickins
@ 2022-02-08 11:45 ` David Laight
  2022-02-08 12:52 ` Borislav Petkov
  1 sibling, 0 replies; 6+ messages in thread
From: David Laight @ 2022-02-08 11:45 UTC (permalink / raw)
  To: 'Hugh Dickins', Borislav Petkov; +Cc: Peter Zijlstra, linux-kernel, x86

From: Hugh Dickins
> Sent: 08 February 2022 05:46
> 
> I've noticed that clear_user() is slower than it need be:
> 
> dd if=/dev/zero of=/dev/null bs=1M count=1M
> 1099511627776 bytes (1.1 TB) copied, 45.9641 s, 23.9 GB/s
> whereas with the hacked patch below
> 1099511627776 bytes (1.1 TB) copied, 33.4 s, 32.9 GB/s
> 
> That was on some Intel machine: IIRC an AMD went faster.
> 
> It's because clear_user() lacks alternatives, and uses a
> nowadays suboptimal implementation; whereas clear_page()
> and copy_user() do support alternatives.
> 
...
> +SYM_FUNC_START(__clear_user)
> +	ASM_STAC
> +	movl %esi,%ecx
> +	xorq %rax,%rax
> +1:	rep stosb
> +2:	movl %ecx,%eax
> +	ASM_CLAC
> +	ret

You only want to even consider than version for long copies
(and possibly only for aligned ones).

The existing code (I've not quoted) does look sub-optimal though.
It should be easy to obtain a write every clock.
But I suspect the loop is too long.
The code gcc generates might even be better!

Note that for copies longer than 8 bytes 'odd' lengths can
be handled by a single misaligned write to the end of the buffer.
No need for a byte copy loop.

I've not experimented with misaligned writes - they might take two clocks.
So it might be worth aligning them - but they may not happen often
enough for it to be an overall gain.
Misaligned reads usually don't make any difference.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86: should clear_user() have alternatives?
  2022-02-08  5:45 x86: should clear_user() have alternatives? Hugh Dickins
  2022-02-08 11:45 ` David Laight
@ 2022-02-08 12:52 ` Borislav Petkov
  2022-02-08 23:36   ` David Laight
  2022-02-09  6:18   ` Hugh Dickins
  1 sibling, 2 replies; 6+ messages in thread
From: Borislav Petkov @ 2022-02-08 12:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Peter Zijlstra, linux-kernel, x86

Hi Hugh,

On Mon, Feb 07, 2022 at 09:45:36PM -0800, Hugh Dickins wrote:
> I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
> tmpfs, are not prime candidates for optimization; and I've no idea how
> much clear_user() normally gets used for long clears.

Right, we usually don't take such "optimizations" because the folks who
send them always come up with either microbenchmarks or only test on a
single machine.

> If I were capable of compiler asm, with alternatives, and knew at what
> length ERMS becomes advantageous when clearing, I would be sending you
> a proper patch.  As it is, I'm hoping to tempt someone else to do the
> work!  Or reject it as too niche to bother with.

Yap, looking at arch/x86/lib/clear_page_64.S - that's straight-forward
asm without special-cases noodles like __copy_user_nocache, for example,
so I wouldn't be opposed to someone

 - remodelling it so that you can have clear_user* variants there too,
 with the length supplied so that you can call a common function with
 arbitrary length and clear_page* can call it too. And then call them in
 a clear_user() version just like the clear_page() one which selects the
 proper target based on CPU feature flags.

 - testing this on bunch of modern machines with, say, a kernel build or
 some sensible benchmark so that we at least have some coverage

If the numbers are worth it - and judging by your quick testing above
they should be - then I don't mind taking that at all.

If only someone would have the time and diligence to do it properly...

:-)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: x86: should clear_user() have alternatives?
  2022-02-08 12:52 ` Borislav Petkov
@ 2022-02-08 23:36   ` David Laight
  2022-02-09  6:18   ` Hugh Dickins
  1 sibling, 0 replies; 6+ messages in thread
From: David Laight @ 2022-02-08 23:36 UTC (permalink / raw)
  To: 'Borislav Petkov', Hugh Dickins; +Cc: Peter Zijlstra, linux-kernel, x86

From: Borislav Petkov
> Sent: 08 February 2022 12:53
> 
> On Mon, Feb 07, 2022 at 09:45:36PM -0800, Hugh Dickins wrote:
> > I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
> > tmpfs, are not prime candidates for optimization; and I've no idea how
> > much clear_user() normally gets used for long clears.
> 
> Right, we usually don't take such "optimizations" because the folks who
> send them always come up with either microbenchmarks or only test on a
> single machine.

I think it is reasonable to use microbenchmarks to see how fast
a particular loop runs on a specific cpu - since you can work out
the number of clocks per iteration and directly compare different
versions.

The problem with microbenchmarks is when they get used on unrolled loops.
Run something 10000 times with hot-cache data and unrolled loops
almost always seem best.
Cold cache or the effects of displacing other code from the I-cache
can make loop unrolling a big loss.
Especially on current cpu that execute multiple instructions/clock
and out-of-order execution.

Loops only need unrolling just enough to hit some architectural limit.
For memset() (using normal instructions) that is 1 write per clock.

A classic example has to be the sparc divide remainder function
in the original CY7C600 book (which I happen to still have).
I'm not sure how many instructions it is - but it is printed
on 4 pages. It just has to blow anything else out of the
small I-cache those cpu had.
It might be faster - but only if you are doing a lot of divides.
 
	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86: should clear_user() have alternatives?
  2022-02-08 12:52 ` Borislav Petkov
  2022-02-08 23:36   ` David Laight
@ 2022-02-09  6:18   ` Hugh Dickins
  2022-04-07 23:21     ` Jason A. Donenfeld
  1 sibling, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2022-02-09  6:18 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Hugh Dickins, Peter Zijlstra, linux-kernel, x86

On Tue, 8 Feb 2022, Borislav Petkov wrote:
> Hi Hugh,
> 
> On Mon, Feb 07, 2022 at 09:45:36PM -0800, Hugh Dickins wrote:
> > I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
> > tmpfs, are not prime candidates for optimization; and I've no idea how
> > much clear_user() normally gets used for long clears.
> 
> Right, we usually don't take such "optimizations" because the folks who
> send them always come up with either microbenchmarks or only test on a
> single machine.
> 
> > If I were capable of compiler asm, with alternatives, and knew at what
> > length ERMS becomes advantageous when clearing, I would be sending you
> > a proper patch.  As it is, I'm hoping to tempt someone else to do the
> > work!  Or reject it as too niche to bother with.
> 
> Yap, looking at arch/x86/lib/clear_page_64.S - that's straight-forward
> asm without special-cases noodles like __copy_user_nocache, for example,
> so I wouldn't be opposed to someone
> 
>  - remodelling it so that you can have clear_user* variants there too,
>  with the length supplied so that you can call a common function with
>  arbitrary length and clear_page* can call it too. And then call them in
>  a clear_user() version just like the clear_page() one which selects the
>  proper target based on CPU feature flags.
> 
>  - testing this on bunch of modern machines with, say, a kernel build or
>  some sensible benchmark so that we at least have some coverage
> 
> If the numbers are worth it - and judging by your quick testing above
> they should be - then I don't mind taking that at all.
> 
> If only someone would have the time and diligence to do it properly...

Thanks a lot for setting out what's needed.  Remodelling so there can be
commonality, that's attractive and intriguing.  But I'm disqualified in
several ways (if we tactfully leave out questions of my competence, I'm
still lacking time, and wide machine pool, and expertise with latter):
so I'm responding mainly to make it clear that I shall not be pursuing
this, but hope somebody can eventually.

Thanks,
Hugh

> 
> :-)
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86: should clear_user() have alternatives?
  2022-02-09  6:18   ` Hugh Dickins
@ 2022-04-07 23:21     ` Jason A. Donenfeld
  0 siblings, 0 replies; 6+ messages in thread
From: Jason A. Donenfeld @ 2022-04-07 23:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Borislav Petkov, Peter Zijlstra, linux-kernel, x86, ak, Matthew Wilcox

Hey folks,

Just a reminder that Samuel already did a lot of work and benchmarking
and made this a lot faster:

https://lore.kernel.org/lkml/20210523180423.108087-1-sneves@dei.uc.pt/

As far as I can tell, there was a lot of nitpicking, some of which might
have been pointless, and the author understandably didn't followup to
finish it off.

But the code and analysis is there, and maybe it'd be worthwhile for
somebody here to pick it up again?

Jason

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-04-07 23:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08  5:45 x86: should clear_user() have alternatives? Hugh Dickins
2022-02-08 11:45 ` David Laight
2022-02-08 12:52 ` Borislav Petkov
2022-02-08 23:36   ` David Laight
2022-02-09  6:18   ` Hugh Dickins
2022-04-07 23:21     ` Jason A. Donenfeld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).