From: "Rafał Miłecki" <zajec5@gmail.com>
To: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Network Development <netdev@vger.kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-block@vger.kernel.org, John Crispin <john@phrozen.org>,
	Jonas Gorski <jonas.gorski@gmail.com>,
	Jo-Philipp Wich <jo@mein.io>
Subject: Re: ARM router NAT performance affected by random/unrelated commits
Date: Wed, 22 May 2019 23:12:12 +0200
Message-ID: <d0d67f85-01e9-037a-3a18-6282a8bfce5c@gmail.com>
In-Reply-To: <20190522121730.fhswxkw4gbflkhei@shell.armlinux.org.uk>

On 22.05.2019 14:17, Russell King - ARM Linux admin wrote:
> On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
>> On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
>>> On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
>>>> I work on home routers based on Broadcom's Northstar SoCs. Those devices
>>>> have ARM Cortex-A9 and most of them are dual-core.
>>>>
>>>> As with any home router, my main concern is network performance. That
>>>> CPU isn't powerful enough to handle gigabit traffic, so all kinds of
>>>> optimizations matter. I noticed some unexpected changes in NAT
>>>> performance when switching between kernels.
>>>>
>>>> My hardware is BCM47094 SoC (dual core ARM) with integrated network
>>>> controller and external BCM53012 switch.
>>>
>>> Guessing, I'd say it's to do with the placement of code wrt cachelines.
>>> You could try aligning some of the cache flushing code to a cache line
>>> and see what effect that has.
>>
>> Is System.map a good place to check the alignment of function code?
>>
>> With Linux 4.19 + OpenWrt mtd patches I have:
>> (...)
>> c010ea94 t v7_dma_inv_range
>> c010eae0 t v7_dma_clean_range
>> (...)
>> c02ca3d0 T blk_mq_update_nr_hw_queues
>> c02ca69c T blk_mq_alloc_tag_set
>> c02ca94c T blk_mq_release
>> c02ca9b4 T blk_mq_free_queue
>> c02caa88 T blk_mq_update_nr_requests
>> c02cab50 T blk_mq_unique_tag
>> (...)
>>
>> After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
>> up an SQ queue and tag set"):
>> (...)
>> c010ea94 t v7_dma_inv_range
>> c010eae0 t v7_dma_clean_range
>> (...)
>> c02ca3d0 T blk_mq_update_nr_hw_queues
>> c02ca69c T blk_mq_alloc_tag_set
>> c02ca94c T blk_mq_init_sq_queue <-- NEW
>> c02ca9c0 T blk_mq_release <-- Different address of this & all below
>> c02caa28 T blk_mq_free_queue
>> c02caafc T blk_mq_update_nr_requests
>> c02cabc4 T blk_mq_unique_tag
>> (...)
>>
>> As you can see, blk_mq_init_sq_queue has appeared in the System.map and
>> it shifted the addresses of ~30000 symbols. I can believe some frequently
>> used symbols happened to end up well aligned and that improved overall
>> performance.
>>
>> Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
>> relocated.
>>
>> *****
>>
>> I followed Russell's suggestion and added .align 5 to cache-v7.S (see
>> two attached diffs).
>>
>> 1) v4.19 + OpenWrt mtd patches
>>> egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
>> c010ea58 T v7_flush_kern_dcache_area
>> c010ea94 t v7_dma_inv_range
>> c010eae0 t v7_dma_clean_range
>> c010eb18 T b15_dma_flush_range
>>
>> 2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
>> c010ea6c T v7_flush_kern_dcache_area
>> c010eac0 t v7_dma_inv_range
>> c010eb20 t v7_dma_clean_range
>> c010eb58 T b15_dma_flush_range
>> (actually 15 symbols above v7_dma_inv_range were relocated)
>>
>> This method seems to work (at least it affects the addresses in
>> System.map).
>>
>> *****
>>
>> I ran 2 tests for each combination of changes. Each test consisted of
>> 10 sequences of: 30 seconds iperf session + reboot.
>>
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>> Test #1: 738 Mb/s
>> Test #2: 737 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> patch -p1 < v7_dma_clean_range-align.diff
>> Test #1: 746 Mb/s
>> Test #2: 747 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 745 Mb/s
>> Test #2: 746 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> patch -p1 < v7_dma_clean_range-align.diff
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 762 Mb/s
>> Test #2: 761 Mb/s
>>
>> As you can see, I got quite a nice performance improvement after aligning
>> both v7_dma_clean_range() and v7_dma_inv_range().
> 
> This is an improvement of about 3.3%.
> 
>> It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
>> close.
>>
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>> Test #1: 770 Mb/s
>> Test #2: 766 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>>> patch -p1 < v7_dma_clean_range-align.diff
>> Test #1: 756 Mb/s
>> Test #2: 759 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 758 Mb/s
>> Test #2: 759 Mb/s
>>
>>> git reset --hard v4.19
>>> git am OpenWrt-mtd-chages.patch
>>> git cherry-pick -x 9316a9ed6895
>>> patch -p1 < v7_dma_clean_range-align.diff
>>> patch -p1 < v7_dma_inv_range-align.diff
>> Test #1: 767 Mb/s
>> Test #2: 763 Mb/s
>>
>> Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
>> and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
>> that extra alignment can actually *hurt* NAT performance.
> 
> You have a maximum variance of 4Mb/s in your tests which is around
> 0.5%, and this shows a reduction of 3Mb/s, or 0.4%.
> 
> If we look at it a different way:
> - Without the alignment patches, there is a difference of 4% in
>    performance depending on whether 9316a9ed6895 is applied.
> - With the alignment patches, there is a difference of 0.4% in
>    performance depending on whether 9316a9ed6895 is applied.
> 
> How can this not be beneficial?
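
For reference, those percentages can be re-derived from the per-setup
results quoted above. This is only a minimal sketch: it averages the two
runs of each setup, and the dictionary keys and the relative_diff() helper
are names I made up for illustration.

```python
from statistics import mean

# Mb/s results of the two runs per setup, copied from the quoted tests.
# The key names are illustrative labels, not names used by any tool.
RESULTS = {
    "base": [738, 737],
    "base+align": [762, 761],
    "cherry-pick": [770, 766],
    "cherry-pick+align": [767, 763],
}


def relative_diff(a, b):
    """Return how much mean(b) differs from mean(a), in percent of mean(a)."""
    return (mean(b) - mean(a)) / mean(a) * 100.0


if __name__ == "__main__":
    # Effect of cherry-picking 9316a9ed6895 without the alignment diffs.
    no_align = relative_diff(RESULTS["base"], RESULTS["cherry-pick"])
    # Effect of the same cherry-pick when both alignment diffs are applied.
    with_align = relative_diff(RESULTS["base+align"],
                               RESULTS["cherry-pick+align"])
    print(f"9316a9ed6895 without alignment: {no_align:+.2f}%")
    print(f"9316a9ed6895 with alignment:    {with_align:+.2f}%")
```

With only two runs per setup this says nothing about significance; it just
reproduces the rough 4% vs 0.4% comparison.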

Aligning v7_dma_clean_range() and v7_dma_inv_range() is definitely
beneficial! I'm sorry I wasn't clear enough.

I redid the testing of the 2 most important setups with a few more
iterations.

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > git cherry-pick -x 9316a9ed6895
[  3]  0.0-30.0 sec  2.71 GBytes   776 Mbits/sec
[  3]  0.0-30.0 sec  2.71 GBytes   775 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   774 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   774 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   773 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   771 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   770 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   768 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   768 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   764 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
Average: 769 Mb/s (+4.10%)
Previous results: 773 Mb/s, 770 Mb/s, 766 Mb/s

 > git reset --hard v4.19
 > git am OpenWrt-mtd-chages.patch
 > patch -p1 < v7_dma_clean_range-align.diff
 > patch -p1 < v7_dma_inv_range-align.diff
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.69 GBytes   769 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   767 Mbits/sec
[  3]  0.0-30.0 sec  2.68 GBytes   766 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   766 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   765 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   764 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.67 GBytes   763 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   762 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   761 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   761 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.66 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   760 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   759 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   759 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   758 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   758 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.65 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.64 GBytes   757 Mbits/sec
[  3]  0.0-30.0 sec  2.64 GBytes   756 Mbits/sec
Average: 762 Mb/s (+3.16%)
Previous results: 767 Mb/s, 763 Mb/s
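
In case someone wants to repeat this, the averages can be computed from
raw iperf output along these lines. Treat it as a rough sketch: the regular
expression only assumes the "NNN Mbits/sec" column seen in the lines above,
the function names are mine, and the 738 Mb/s baseline is an assumption
taken from the earlier unpatched v4.19 runs.

```python
import re
from statistics import mean, stdev

# Matches the bandwidth column of an iperf summary line such as
# "[  3]  0.0-30.0 sec  2.71 GBytes   776 Mbits/sec".
RATE_RE = re.compile(r"(\d+(?:\.\d+)?)\s+Mbits/sec")


def parse_rates(text):
    """Extract every Mbits/sec value found in iperf output."""
    return [float(m.group(1)) for m in RATE_RE.finditer(text)]


def summarize(rates, baseline=None):
    """Return (mean, sample stdev, gain over baseline in % or None)."""
    avg = mean(rates)
    # stdev() needs at least two samples.
    spread = stdev(rates) if len(rates) > 1 else 0.0
    gain = (avg - baseline) / baseline * 100.0 if baseline else None
    return avg, spread, gain


if __name__ == "__main__":
    # Three of the lines from the first setup, used only as sample input.
    sample = """\
[  3]  0.0-30.0 sec  2.71 GBytes   776 Mbits/sec
[  3]  0.0-30.0 sec  2.71 GBytes   775 Mbits/sec
[  3]  0.0-30.0 sec  2.70 GBytes   774 Mbits/sec
"""
    avg, spread, gain = summarize(parse_rates(sample), baseline=738)
    print(f"mean {avg:.1f} Mb/s, stdev {spread:.1f} Mb/s, gain {gain:+.2f}%")
```

Printing the standard deviation next to the mean makes it easier to judge
whether a difference of a few Mb/s between two setups is above the noise.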

So let me explain why I keep researching this. There are two reasons:

1) The realignment caused by cherry-picking 9316a9ed6895 provided
*marginally* better performance than aligning v7_dma_clean_range() and
v7_dma_inv_range(). It's a *very* minimal difference but I can't stop
thinking I can still do better.

2) Cherry-picking 9316a9ed6895 doesn't change v7_dma_clean_range or
v7_dma_inv_range addresses at all. Yet it still improves NAT
performance. That makes me believe there are more functions that (if
properly aligned) can bump NAT performance.
I hope that aligning all:
* v7_dma_clean_range
* v7_dma_inv_range
* [some unrevealed functions]
could result in even better NAT performance.
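
One way to start looking for such functions is to compare two System.map
files and list the symbols whose position within a cache line changed. The
sketch below assumes a 32-byte line (which is what .align 5 gives) and the
usual "address type name" layout; the function names are made up for this
example, and symbols that share a name simply overwrite each other.

```python
CACHE_LINE = 32  # bytes; 2**5, i.e. what ".align 5" aligns to


def parse_system_map(text):
    """Map symbol name -> address from 'address type name' lines."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip "(...)" and any other non-symbol lines
        addr, _sym_type, name = parts
        try:
            symbols[name] = int(addr, 16)
        except ValueError:
            continue  # first column was not a hex address
    return symbols


def line_offset(addr, line=CACHE_LINE):
    """Offset of an address from the start of its cache line."""
    return addr % line


def moved_symbols(before, after):
    """Yield (name, old offset, new offset) for symbols that changed address."""
    for name, old in before.items():
        new = after.get(name)
        if new is not None and new != old:
            yield name, line_offset(old), line_offset(new)


if __name__ == "__main__":
    # Excerpts from the two System.map listings quoted earlier.
    before = parse_system_map("""\
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
""")
    after = parse_system_map("""\
c02ca9c0 T blk_mq_release
c02caa28 T blk_mq_free_queue
""")
    for name, old_off, new_off in moved_symbols(before, after):
        print(f"{name}: offset {old_off} -> {new_off}")
```

This only shows where each symbol starts relative to a 32-byte boundary. It
can't tell which of the relocated symbols actually matter for NAT
performance, so the output is just a list of candidates to test.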
