From: Robin Murphy <>
To: Ming Lei <>
Cc: John Garry <>,,,, Will Deacon <>,
Subject: Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node
Date: Tue, 27 Jul 2021 18:08:04 +0100
In-Reply-To: <YPqYDY9/VAhfHNfU@T590>

On 2021-07-23 11:21, Ming Lei wrote:
> On Thu, Jul 22, 2021 at 06:40:18PM +0100, Robin Murphy wrote:
>> On 2021-07-22 16:54, Ming Lei wrote:
>> [...]
>>>> If you are still keen to investigate more, then can try either of these:
>>>> - add iommu.strict=0 to the cmdline
>>>> - use perf record+annotate to find the hotspot
>>>>     - For this you need to enable psuedo-NMI with 2x steps:
>>>>       CONFIG_ARM64_PSEUDO_NMI=y in defconfig
>>>>       Add irqchip.gicv3_pseudo_nmi=1
>>>>       See
>>>>       Your kernel log should show:
>>>>       [    0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1
>>>> synchronisation
>>> OK, will try the above tomorrow.
>> Thanks, I was also going to suggest the latter, since it's what
>> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked that should be most
>> indicative of where the slowness most likely stems from.
> The improvement from 'iommu.strict=0' is very small:
> [root@ampere-mtjade-04 ~]# cat /proc/cmdline
> BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.0-rc2_linus root=UUID=cff79b49-6661-4347-b366-eb48273fe0c1 ro nvme.poll_queues=2 iommu.strict=0
> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=1530MiB/s][r=392k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=2999: Fri Jul 23 06:05:15 2021
>    read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)
> [root@ampere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=3063: Fri Jul 23 06:05:49 2021
>    read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)

OK, that appears to confirm that the invalidation overhead is more of a 
symptom than the major contributing factor, which also seems to line up 
fairly well with the other information.
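
For scale, the two fio results quoted above can be reduced to a ratio and
an average time per completed I/O. This is only a rough sketch: the
function name iops_gap is made up for illustration, and 1/IOPS is the mean
interval between completions at iodepth=64, not a per-request latency.

```shell
#!/bin/sh
# Hypothetical helper: compare two IOPS figures.
# $1 = local-node IOPS, $2 = remote-node IOPS (both taken from the fio output).
iops_gap() {
    awk -v l="$1" -v r="$2" 'BEGIN {
        # ratio of throughput, then mean microseconds per completed I/O
        printf "ratio=%.1fx local=%.2fus remote=%.2fus\n", l / r, 1000000 / l, 1000000 / r
    }'
}

# 392k IOPS on CPU 0 vs 38.4k IOPS on CPU 80, both with iommu.strict=0
iops_gap 392000 38400
```

With iommu.strict=0 the remote case is still roughly ten times slower,
which is why the invalidations alone do not look like the main cost.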

>> FWIW I would expect iommu.strict=0 to give a proportional reduction in SMMU
>> overhead for both cases since it should effectively mean only 1/256 as many
>> invalidations are issued.
>> Could you also check whether the SMMU platform devices have "numa_node"
>> properties exposed in sysfs (and if so whether the values look right), and
>> share all the SMMU output from the boot log?
> No found numa_node attribute for smmu platform device, and the whole dmesg log is
> attached.

Thanks, so it seems like the SMMUs have MSI capability and are correctly 
described as coherent, which means completion polling should be 
happening in memory and so hopefully not contributing much more than a 
couple of cross-socket cacheline migrations and/or snoops. Combined with 
the difference in the perf traces looking a lot smaller than the 
order-of-magnitude difference in the overall IOPS throughput, I suspect 
this is overall SMMU overhead exacerbated by the missing NUMA info. If 
every new 4K block touched by the NVMe means a TLB miss where the SMMU 
has to walk pagetables from the wrong side of the system, I'm sure 
that's going to add up.
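
A quick way to repeat the sysfs check discussed above is sketched below.
It assumes the SMMU platform devices have "smmu" somewhere in their name
(for example arm-smmu-v3.N.auto on ACPI systems); the function name and
the optional directory argument are hypothetical and only there to make
the loop easy to try out. A device without the attribute typically has no
NUMA node assigned.

```shell
#!/bin/sh
# Hypothetical helper: print the numa_node attribute of every platform
# device whose name contains "smmu", or "<missing>" if there is none.
# $1 = directory to scan (defaults to the platform bus in sysfs).
smmu_numa_nodes() {
    root=${1:-/sys/bus/platform/devices}
    for dev in "$root"/*smmu*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        if [ -r "$dev/numa_node" ]; then
            printf '%s numa_node=%s\n' "${dev##*/}" "$(cat "$dev/numa_node")"
        else
            printf '%s numa_node=<missing>\n' "${dev##*/}"
        fi
    done
}

smmu_numa_nodes
```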

I'd recommend following John's suggestion and getting some baseline 
figures for just the cross-socket overhead between the CPU and NVMe with 
the SMMU right out of the picture, then have a hack at the firmware (or 
pester the system vendor) to see how much of the difference you can make 
back up by having the SMMU proximity domains described correctly such 
that there's minimal likelihood of the SMMUs having to make non-local 
accesses to their in-memory data. FWIW I don't think it should be *too* 
hard to disassemble the IORT, fill in the proximity domain numbers and 
valid flags on the SMMU nodes, then assemble it again to load as an 
override (it's anything involving offsets in that table that's a real pain).
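
The disassemble/edit/assemble/override flow could look roughly like the
sketch below. It assumes iasl (from acpica-tools) and cpio are installed,
that the kernel is built with CONFIG_ACPI_TABLE_UPGRADE=y, and that the
initrd path shown in the comments matches your system; the function names
are made up for illustration. The edit itself is manual, and the OEM
revision in the table header may also need bumping for the kernel to
prefer the new table.

```shell
#!/bin/sh
# Step 1: dump the live IORT and disassemble it (needs root).
# Writes iort.dat and iort.dsl into the current directory.
iort_disassemble() {
    cat /sys/firmware/acpi/tables/IORT > iort.dat &&
        iasl -d iort.dat
}

# Step 2 (manual): edit iort.dsl, filling in the proximity domain and the
# proximity-domain-valid flag on each SMMU node, then run: iasl iort.dsl

# Step 3: wrap the reassembled table in an uncompressed cpio archive with
# the kernel/firmware/acpi/ layout the initrd table override expects.
# $1 = path to iort.aml, $2 = output cpio file.
iort_pack() {
    aml=$1
    out=$2
    work=$(mktemp -d) || return 1
    mkdir -p "$work/kernel/firmware/acpi" &&
        cp "$aml" "$work/kernel/firmware/acpi/iort.aml" &&
        (cd "$work" && find kernel | cpio -o -H newc 2>/dev/null) > "$out"
    rc=$?
    rm -rf "$work"
    return "$rc"
}

# Step 4 (example paths, adjust as needed): prepend the archive to the
# existing initrd and boot with the result.
#   iort_pack iort.aml acpi_override.cpio
#   cat acpi_override.cpio /boot/initrd > /boot/initrd-iort
```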

Note that you might also need to make sure you have CMA set up and sized 
appropriately with CONFIG_DMA_PERNUMA_CMA enabled to get the full benefit.
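
To check the current state, something along these lines might help. It is
a sketch under assumptions: the config file location varies by distro
(/proc/config.gz is another common place), the helper names are
hypothetical, and cma_pernuma= is the per-node CMA size parameter as I
understand it for kernels of this vintage.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Kconfig symbol is set to =y.
# $1 = symbol, $2 = config file (defaults to /boot/config-<running kernel>).
kconfig_enabled() {
    sym=$1
    cfg=${2:-/boot/config-$(uname -r)}
    grep -q "^${sym}=y\$" "$cfg"
}

# Hypothetical helper: succeed if a "name=value" parameter is on the
# kernel command line. $1 = parameter name, $2 = cmdline file.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | grep -q "^${param}="
}

# Typical use on the running system:
#   kconfig_enabled CONFIG_DMA_PERNUMA_CMA && echo "per-NUMA CMA built in"
#   cmdline_has cma_pernuma || echo "no cma_pernuma= size on the cmdline"
```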

