From: Grant Grundler <>
To: mark gross <>
Cc: Grant Grundler <>,
	Andrew Morton <>,, lkml <>,
Subject: Re: [PATCH] Use an array instead of a list for deffered intel-iommu iotlb flushing  Re: [PATCH]iommu-iotlb-flushing
Date: Sat, 8 Mar 2008 10:20:56 -0700
Message-ID: <>
In-Reply-To: <>

On Wed, Mar 05, 2008 at 03:01:57PM -0800, mark gross wrote:
> > *nod* - I know. That's why I use pktgen to measure the dma map/unmap overhead.
> > Note that with solid state storage (should be more visible in the next year
> > or so), the transaction rate is going to look a lot more like a NIC than
> > the traditional HBA storage controller. So map/unmap performance will
> > matter for those configurations too.
> Sweet! now I have an excuse to get one of those spiffy SD-Disks!  

*chuckle* Glad I could be of assistance :P

> > Ok - I wasn't sure which step was the "synchronize step".
> > 
> > BTW, I hope you (and others from Intel - go willy! ;) can give feedback
> > to the Intel/AMD chipset designers to ditch this design ASAP.
> > It clearly sucks.
> > 
> The HW implementation IS evolving.  Especially as the MCH and more of
> the chipsets are moved into the CPU package.  It will get better over
> time, but the protection will never be "free".

Agreed. But IO TLB shoot-down/invalidate could be a lot cheaper. Intel knows
how to do it for the CPU MMU (I hope they do, at least, given their IA64
experience). The IO MMU should be no different (well, not too much
different): it is like the CPU MMU except that it is shared by many IO
devices instead of being "one per CPU".

> > If you can reduce the overhead to < 1% for pktgen, TPC-C won't
> > notice it and I doubt specweb will either.
> Sadly only a fraction of the overhead is due to the IOTLB flushes.
> I can't wave away the IOVA management overhead with batched
> flushes of the IOTLB.

Right. IOVA management is CPU intensive.  But stalling on the IO TLB flush
synchronize step is a major part of the overhead, an easy target, and
should be reduced. Taking advantage of a "warm cache" and other "normal"
coding methods will help minimize the remaining CPU overhead.

> I've been using oprofile, and some TSC counters I have in an out-of-tree
> patch for instrumenting the code and dumping the cycles per
> code-path-of-interest.  It's been pretty effective, but it affects the
> throughput of the run.

I removed stats from the parisc IOMMU code exactly for that reason.
It was useful for evaluating details of specific algorithms (comparative)
but not for runtime benchmarking. I suggest removing that code.

> > > FWIW : throughput isn't affected so much as the CPU overhead.
> > > iommu=strict: 16K UDP UNIDIRECTIONAL SEND TEST 826Mbps at 25% cpu 
> > > with this patch: 16K UDP UNIDIRECTIONAL SEND TEST 826Mbps at 18% cpu 
> > 
> > Understood. That's why netperf measures "service demand".
> > Taking CPU away from user space generally results in lower benchmark/app perf.
> >
> The following patch is an update to use an array instead of a list of
> IOVAs in the implementation of deferred iotlb flushes.  It takes
> inspiration from sba_iommu.c.
> I like this implementation better as it encapsulates the batch process
> within intel-iommu.c, and no longer touches iova.h (which is shared).
> Performance data:  Netperf 32byte UDP streaming
> 2.6.25-rc3-mm1:
> IOMMU-strict : 58Mps @ 62% cpu
> NO-IOMMU : 71Mbs @ 41% cpu
> List-based IOMMU-default-batched-IOTLB flush: 66Mbps @ 57% cpu
> with this patch:
> IOMMU-strict : 73Mps @ 75% cpu
> NO-IOMMU : 74Mbs @ 42% cpu
> Array-based IOMMU-default-batched-IOTLB flush: 72Mbps @ 62% cpu

Nice! :)
66/57 == 1.15
72/62 == 1.16
~10% higher throughput with essentially no change in service demand.
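
For anyone who wants to redo that arithmetic on their own runs, here is a
minimal sketch in C. The function names are made up for illustration; the
ratio is throughput per percentage point of CPU, which moves inversely to
what netperf calls service demand.

```c
#include <assert.h>

/* Throughput delivered per percentage point of CPU; higher is better.
 * The unit of "throughput" does not matter as long as both runs use
 * the same one. */
static double throughput_per_cpu(double throughput, double cpu_pct)
{
    assert(cpu_pct > 0.0);
    return throughput / cpu_pct;
}

/* Relative change from "before" to "after", in percent. */
static double percent_gain(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Calling throughput_per_cpu(66, 57) and throughput_per_cpu(72, 62) gives the
two ratios above, and percent_gain(66, 72) gives the throughput change
between the list-based and array-based runs.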

But I'm wondering why IOMMU-strict also gets better throughput with this
patch (73 vs. 58). Something else is going on here. I suspect better CPU
cache utilization; lowering the high water mark to 32 would be one way to
test that.
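
To make the "high water mark" idea concrete, here is a rough sketch of an
array-based deferred flush in plain C. It is not the code from the patch:
the struct, the function names and HIGH_WATER_MARK are hypothetical, the
hardware flush is replaced by a counter, and a real driver would also need
locking and some way to flush a partially filled batch.

```c
#include <assert.h>
#include <stddef.h>

#define HIGH_WATER_MARK 32  /* hypothetical batch size, the knob to tune */

struct deferred_flush {
    unsigned long iova[HIGH_WATER_MARK]; /* unmapped addresses awaiting a flush */
    size_t next;                         /* number of slots currently in use */
    unsigned long flushes;               /* batch flushes performed so far */
};

/* Stand-in for "one IOTLB flush, then free every queued IOVA". */
static void flush_batch(struct deferred_flush *df)
{
    if (df->next == 0)
        return;
    /* A real driver would invalidate the IOTLB once here and then
     * release each entry in df->iova[0 .. next-1]. */
    df->next = 0;
    df->flushes++;
}

/* Queue one unmapped IOVA; flush the whole batch when the array fills. */
static void defer_unmap(struct deferred_flush *df, unsigned long iova)
{
    assert(df->next < HIGH_WATER_MARK);
    df->iova[df->next++] = iova;
    if (df->next == HIGH_WATER_MARK)
        flush_batch(df);
}
```

The point of the array is that queued entries sit in contiguous memory, so
walking them at flush time should be friendlier to the CPU cache than
chasing list pointers. A smaller HIGH_WATER_MARK keeps the working set
smaller at the cost of more frequent flushes.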

BTW, can you clarify what the units are?
I see "Mps", "Mbs", and "Mbps". Ideally we'd be using
a single unit of measure to compare. "Mpps" would be my preferred one
(million packets per second) for small, fixed-size packets.
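
If the figures do turn out to be megabits per second, converting to Mpps
for fixed-size packets is a one-liner. The sketch below assumes the rate
counts payload bits only and that every packet carries the same number of
bytes; both are assumptions, and the function name is made up.

```c
#include <assert.h>

/* Convert a bit rate to a packet rate for fixed-size packets.
 * mbits_per_sec: rate in millions of bits per second (assumed unit)
 * bytes_per_packet: bytes counted per packet (assumed payload only) */
static double mbps_to_mpps(double mbits_per_sec, unsigned int bytes_per_packet)
{
    assert(bytes_per_packet > 0);
    return mbits_per_sec / (8.0 * bytes_per_packet);
}
```

Under those assumptions, mbps_to_mpps(72.0, 32) would give the packet rate
for the array-based run; if the tool counts headers as well, the packet
size would need adjusting.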

Ditch the "debug" code (stats pr0n) and I'll bet this will go up
a few more percentage points and reduce the service demand.

