LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: David Stevens <stevensd@chromium.org>
Cc: Robin Murphy <robin.murphy@arm.com>,
	JasonGunthorpe <jgg@nvidia.com>,
	linux-kernel@vger.kernel.org,
	Alex Williamson <alex.williamson@redhat.com>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	iommu@lists.linux-foundation.org, Will Deacon <will@kernel.org>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers
Date: Fri, 01 Jul 2022 11:23:25 +0200	[thread overview]
Message-ID: <880cceedfb8753467802356980fb09c83d250a09.camel@linux.ibm.com> (raw)
In-Reply-To: <CAD=HUj62m_P53nr-Zz4bO4Bsn_tgauWX3a28MgRgsB2xShB4Hg@mail.gmail.com>

On Fri, 2022-05-27 at 10:25 +0900, David Stevens wrote:
> On Tue, May 24, 2022 at 9:27 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > On Fri, 2021-08-06 at 19:34 +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > > 
> > > This patch series adds support for per-domain dynamic pools of iommu
> > > bounce buffers to the dma-iommu API. This allows iommu mappings to be
> > > reused while still maintaining strict iommu protection.
> > > 
> > > This bounce buffer support is used to add a new config option that, when
> > > enabled, causes all non-direct streaming mappings below a configurable
> > > size to go through the bounce buffers. This serves as an optimization on
> > > systems where manipulating iommu mappings is very expensive. For
> > > example, virtio-iommu operations in a guest on a linux host require a
> > > vmexit, involvement the VMM, and a VFIO syscall. For relatively small
> > > DMA operations, memcpy can be significantly faster.
> > > 
> > > As a performance comparison, on a device with an i5-10210U, I ran fio
> > > with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
> > > --rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
> > > and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> > > spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> > > 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> > > by >99%, as bounce buffers don't require syncing here in the read case.
> > > Running with multiple jobs doesn't serve as a useful performance
> > > comparison because virtio-iommu and vfio_iommu_type1 both have big
> > > locks that significantly limit mulithreaded DMA performance.
> > > 
> > > These pooled bounce buffers are also used for subgranule mappings with
> > > untrusted devices, replacing the single use bounce buffers used
> > > currently. The biggest difference here is that the new implementation
> > > maps a whole sglist using a single bounce buffer. The new implementation
> > > does not support using bounce buffers for only some segments of the
> > > sglist, so it may require more copying. However, the current
> > > implementation requires per-segment iommu map/unmap operations for all
> > > untrusted sglist mappings (fully aligned sglists included). On a
> > > i5-10210U laptop with the internal NVMe drive made to appear untrusted,
> > > fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
> > > a statistically significant decrease in CPU load from 2.28% -> 2.17%
> > > with the new iommu bounce buffer optimization enabled.
> > > 
> > > Each domain's buffer pool is split into multiple power-of-2 size
> > > classes. Each class allocates a fixed number of buffer slot metadata. A
> > > large iova range is allocated, and each slot is assigned an iova from
> > > the range. This allows the iova to be easily mapped back to the slot,
> > > and allows the critical section of most pool operations to be constant
> > > time. The one exception is finding a cached buffer to reuse. These are
> > > only separated according to R/W permissions - the use of other
> > > permissions such as IOMMU_PRIV may require a linear search through the
> > > cache. However, these other permissions are rare and likely exhibit high
> > > locality, so the should not be a bottleneck in practice.
> > > 
> > > Since untrusted devices may require bounce buffers, each domain has a
> > > fallback rbtree to manage single use buffers. This may be necessary if a
> > > very large number of DMA operations are simultaneously in-flight, or for
> > > very large individual DMA operations.
> > > 
> > > This patch set does not use swiotlb. There are two primary ways in which
> > > swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
> > > allocates buffers to be compatible with a single device, whereas
> > > per-domain buffer pools don't handle that during buffer allocation as a
> > > single buffer may end up being used by multiple devices. Second, swiotlb
> > > allocation establishes the original to bounce buffer mapping, which
> > > again doesn't work if buffers can be reused. Effectively the only code
> > > that can be shared between the two use cases is allocating slots from
> > > the swiotlb's memory. However, given that we're going to be allocating
> > > memory for use with an iommu, allocating memory from a block of memory
> > > explicitly set aside to deal with a lack of iommu seems kind of
> > > contradictory. At best there might be a small performance improvement if
> > > wiotlb allocation is faster than regular page allocation, but buffer
> > > allocation isn't on the hot path anyway.
> > > 
> > > Not using the swiotlb has the benefit that memory doesn't have to be
> > > preallocated. Instead, bounce buffers consume memory only for in-flight
> > > dma transactions (ignoring temporarily cached buffers), which is the
> > > smallest amount possible. This makes it easier to use bounce buffers as
> > > an optimization on systems with large numbers of devices or in
> > > situations where devices are unknown, since it is not necessary to try
> > > to tune how much memory needs to be set aside to achieve good
> > > performance without costing too much memory.
> > > 
> > > Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
> > > is meant to address devices which create long lived streaming mappings
> > > but manage CPU cache coherency without using the dma_sync_* APIs.
> > > Currently, these devices don't function properly with swiotlb=force. The
> > > new flag is used to bypass bounce buffers so such devices will function
> > > when the new bounce buffer optimization is enabled. The flag is added to
> > > the i915 driver, which creates such mappings. It can also be added to
> > > various dma-buf implementations as an optimization, although that is not
> > > done here.
> > > 
> > > v1 -> v2:
> > >  - Replace existing untrusted bounce buffers with new bounce
> > >    buffer pools. This includes significant rework to account for
> > >    untrusted bounce buffers being required instead of an
> > >    optimization.
> > >  - Add flag for persistent streaming mappings.
> > > 
> > 
> > Hi David,
> > 
> > I'm currently looking into converting s390 from our custom IOMMU based
> > DMA API implementation to using dma-iommu.c. We're always using an
> > IOMMU for PCI devices even when doing pass-through to guests (under
> > both the KVM and z/VM hypervisors). In this case I/O TLB flushes, which
> > we use to do the shadowing of the guest I/O translations, are
> > relatively expensive I'm thus very interested in your work. I've tried
> > rebasing it on v5.18 and got it to compile but didn't get DMA to work
> > though it seems to partially work as I don't get probe failures unlike
> > with a completely broken DMA API. Since I might have very well screwed
> > up the rebase and my DMA API conversion is experimental too I was
> > wondering if you're still working on this and might have a current
> > version I could experiment with?
> 
> Unfortunately I don't have anything more recent to share. I've come
> across some performance issues caused by pathological usage patterns
> in internal usage, but I haven't seen any correctness issues. I'm
> hoping that I'll be able to address the performance issues and send a
> rebased series within the next month or so.
> 
> It's definitely possible that this series has some bugs. I've tested
> it on a range of chromebooks and their various hardware and drivers,
> but that's still all relatively normal x86_64/arm64. If your hardware
> is more particular about its DMA, this series might be missing
> something.
> 
> -David

Hi David,

I finally came around to trying this again. This time I managed to get
it working and figure out what was going wrong. The problem was with
the call to iommu_dma_alloc_iova() in io_buffer_manager_init(). As this
call happens during the IOMMU initialization dma_get_mask(dev) is used
before the driver calls dma_set_mask(_and_coherent)() and is thus still
the default mask of DMA_BIT_MASK(32) instead of what the device really
supports. This breaks s390 because our IOMMU currently only supports
apertures starting at an IOVA >= 2^32. For testing I worked around this
by just passing DMA_BIT_MASK(64) instead but of course that's not a
proper fix. With that in place your patches work on top of my still
experimental conversion to use dma-iommu.c on s390.

I can also already confirm that this gives a similar CPU load
(especially steal time) reduction on our z/VM hypervisor which does I/O
translation table shadowing much like your virtio-iommu test. It also
does help performance of my DMA API rework which sadly still lacks
behind our current s390 DMA API implementation. I suspect that is
because the lazy unmapping used by dma-iommu.c tries to do the
unmapping via a timer in the background while our current approach does
them all at once when wrapping around the IOVA space. The latter I
suspect works better when I/O table shadowing in the hypervisor is
serialized. So to summarize for s390 something like your series would
be of significant interest.

Best regards,
Niklas


      parent reply	other threads:[~2022-07-01  9:24 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-06 10:34 David Stevens
2021-08-06 10:34 ` [PATCH v2 1/9] Revert "iommu: Allow the dma-iommu api to use bounce buffers" David Stevens
2021-08-06 10:34 ` [PATCH v2 2/9] dma-iommu: expose a few helper functions to module David Stevens
2021-08-06 17:28   ` kernel test robot
2021-08-06 10:34 ` [PATCH v2 3/9] dma-iommu: bounce buffers for untrusted devices David Stevens
2021-08-06 15:53   ` kernel test robot
2021-08-10  1:19   ` Mi, Dapeng1
2021-08-10  1:41     ` David Stevens
2021-08-06 10:34 ` [PATCH v2 4/9] dma-iommu: remove extra buffer search on unmap David Stevens
2021-08-06 10:34 ` [PATCH v2 5/9] dma-iommu: clear only necessary bytes David Stevens
2021-08-06 10:34 ` [PATCH v2 6/9] dma-iommu: add bounce buffer pools David Stevens
2021-08-06 10:34 ` [PATCH v2 7/9] dma-iommu: support iommu bounce buffer optimization David Stevens
2021-08-06 10:34 ` [PATCH v2 8/9] dma-mapping: add persistent streaming mapping flag David Stevens
2021-08-06 10:34 ` [PATCH v2 9/9] drm/i915: use DMA_ATTR_PERSISTENT_STREAMING flag David Stevens
2022-05-24 12:27 ` [PATCH v2 0/9] Add dynamic iommu backed bounce buffers Niklas Schnelle
2022-05-27  1:25   ` David Stevens
2022-06-03 14:53     ` Niklas Schnelle
2022-06-06  1:24       ` David Stevens
2022-07-01  9:23     ` Niklas Schnelle [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=880cceedfb8753467802356980fb09c83d250a09.camel@linux.ibm.com \
    --to=schnelle@linux.ibm.com \
    --cc=alex.williamson@redhat.com \
    --cc=hch@lst.de \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jgg@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robin.murphy@arm.com \
    --cc=senozhatsky@chromium.org \
    --cc=stevensd@chromium.org \
    --cc=will@kernel.org \
    --subject='Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).