LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	LKML <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@infradead.org>,
	Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>,
	Gregory Haskins <ghaskins@novell.com>,
	Arnaldo Carvalho de Melo <acme@ghostprotocols.net>,
	Thomas Gleixner <tglx@linutronix.de>,
	Tim Bird <tim.bird@am.sony.com>, Sam Ravnborg <sam@ravnborg.org>,
	"Frank Ch. Eigler" <fche@redhat.com>,
	Jan Kiszka <jan.kiszka@siemens.com>,
	John Stultz <johnstul@us.ibm.com>,
	Arjan van de Ven <arjan@infradead.org>,
	Steven Rostedt <srostedt@redhat.com>
Subject: Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
Date: Fri, 1 Feb 2008 14:34:13 -0800	[thread overview]
Message-ID: <20080201223413.GB9247@linux.vnet.ibm.com> (raw)
In-Reply-To: <1201703101.28547.224.camel@lappy>

On Wed, Jan 30, 2008 at 03:25:00PM +0100, Peter Zijlstra wrote:
> 
> On Wed, 2008-01-30 at 09:09 -0500, Steven Rostedt wrote:
> > Paul,
> > 
> > Peter and I are having a discussion on craziness of archs and memory
> > barriers. You seem to understand crazy archs pretty well, and we would
> > like some advice. :-)

OK, let's see what we have here...

> > See below:
> > 
> > On Wed, 30 Jan 2008, Steven Rostedt wrote:
> > 
> > >
> > >
> > > On Wed, 30 Jan 2008, Peter Zijlstra wrote:
> > >
> > > >
> > > > On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
> > > >
> > > > > +int register_mcount_function(struct mcount_ops *ops)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	spin_lock_irqsave(&mcount_func_lock, flags);
> > > > > +	ops->next = mcount_list;
> > > > > +	/* must have next seen before we update the list pointer */
> > > > > +	smp_wmb();
> > > >
> > > > That comment does not explain which race it closes; this is esp
> > > > important as there is no paired barrier to give hints.
> > >
> > > OK, fair enough. I'll explain it a bit more.
> > >
> > > How's this:
> > >
> > >  /*
> > >   * We are entering ops into the mcount_list but another
> > >   * CPU might be walking that list. We need to make sure
> > >   * the ops->next pointer is valid before another CPU sees
> > >   * the ops pointer included into the mcount_list.
> > >   */
> > >
> > 
> > The above is my new comment. But Peter says that it's still not good
> > enough and that all write memory barriers need read barriers.
> 
> To clarify, either: full mb, rmb or read depend.

This is true.  A write barrier ensures that the writes remain ordered,
but unless the reads are also ordered, the reader can still get confused.
For example (assuming all variables are initially zero):

writer:

	a = 1;
	smp_wmb();  /* or smp_mb() */
	b = 1;

reader:

	tb = b;
	ta = a;

The writer will (roughly speaking) execute the assignments in order,
but the reader might not.  If the reader executes the assignment from
"a" first, it might see tb==1&&ta==0.  To prevent this, we do:

reader:

	tb = b;
	smp_rmb();  /* or smp_mb() */
	ta = a;

There are a lot of variations on this theme.

> > Let me explain the situation here.
> > 
> > We have a single link list called mcount_list that is walked when more
> > than one function is registered by mcount. Mcount is called at the start
> > of all C functions that are not annotated with "notrace". When more than
> > one function is registered, mcount calls a loop function that does the
> > following:
> > 
> > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > {
> >         struct mcount_ops *op = mcount_list;
> 
> When thinking RCU, this would be rcu_dereference and imply a read
> barrier.
> 
> >         while (op != &mcount_list_end) {
> >                 op->func(ip, parent_ip);
> >                 op = op->next;
> 
> Same here; the rcu_dereference() would do the read depend barrier.

Specifically:

notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
        struct mcount_ops *op = rcu_dereference(mcount_list);

        while (op != &mcount_list_end) {
                op->func(ip, parent_ip);
                op = rcu_dereference(op->next);

This assumes that you are using call_rcu(), synchronize_rcu(), or
whatever to defer freeing/reuse of the ops structure.

> >         };
> > }
> > 
> > A registered function must already have a "func" filled, and the mcount
> > register code takes care of "next".  It is documented that the calling
> > function should "never" change next and always expect that the func can be
> > called after it is unregistered. That's not the issue here.
> > 
> > The issue is how to insert the ops into the list. I've done the following,
> > as you can see in the code this text is inserted between.
> > 
> >    ops->next = mcount_list;
> >    smp_wmb();
> >    mcount_list = ops;
> > 
> > The read side pair is the reading of ops to ops->next, which should imply
> > a smp_rmb() just by the logic. But Peter tells me things like alpha is
> > crazy enough to do better than that! Thus, I'm asking you.

Peter is correct when he says that Alpha does not necessarily respect data
dependencies.  See the following URL for the official story:

	http://www.openvms.compaq.com/wizard/wiz_2637.html

And I give an example hardware cache design that can result in this
situation here:

	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

See the discussion starting with the "Why Reorder Memory Accesses?"
heading in the second column of the first page.

Strange, but true.  It took an Alpha architect quite some time to
convince me of this back in the late 90s.  ;-)

> > Can some arch have a reader where it receives ops->next before it received
> > ops? This seems to me to be a phsyic arch, to know where ops->next is
> > before it knows ops!

The trick is that the machine might have a split cache, with (say)
odd-numbered cache lines being processed by one half and even-numbered
lines processed by the other half.  If reading CPU has one half of the
cache extremely busy (e.g., processing invalidation requests from other
CPUs) and the other half idle, memory misordering can happen in the
receiving CPU -- if the pointer is processed by the idle half, and
the pointed-to struct by the busy half, you might see the unitialized
contents of the pointed-to structure.  The reading CPU must execute
a memory barrier to force ordering in this case.

> > Remember, that the ops that is being registered, is not viewable by any
> > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > in this case. But I could very well be wrong.

And I was right there with you before my extended discussions with the
aforementioned Alpha architect!

						Thanx, Paul

> > Help!
> > 
> > -- Steve
> > 
> > 
> > >
> > > >
> > > > > +	mcount_list = ops;
> > > > > +	/*
> > > > > +	 * For one func, simply call it directly.
> > > > > +	 * For more than one func, call the chain.
> > > > > +	 */
> > > > > +	if (ops->next == &mcount_list_end)
> > > > > +		mcount_trace_function = ops->func;
> > > > > +	else
> > > > > +		mcount_trace_function = mcount_list_func;
> > > > > +	spin_unlock_irqrestore(&mcount_func_lock, flags);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > >
> > > >
> > > >
> > >
> 

  reply	other threads:[~2008-02-01 22:34 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
2008-01-30  3:15 ` [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled Steven Rostedt
2008-01-30  3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
2008-01-30  8:46   ` Peter Zijlstra
2008-01-30 13:08     ` Steven Rostedt
2008-01-30 14:09       ` Steven Rostedt
2008-01-30 14:25         ` Peter Zijlstra
2008-02-01 22:34           ` Paul E. McKenney [this message]
2008-02-02  1:56             ` Steven Rostedt
2008-02-02 21:41               ` Paul E. McKenney
2008-02-04 17:09                 ` Steven Rostedt
2008-02-04 21:40                   ` Paul E. McKenney
2008-02-04 22:03                     ` Steven Rostedt
2008-02-04 22:41                       ` Mathieu Desnoyers
2008-02-05  6:11                         ` Paul E. McKenney
2008-02-05  5:13                       ` Paul E. McKenney
2008-01-30 13:21   ` Jan Kiszka
2008-01-30 13:53     ` Steven Rostedt
2008-01-30 14:28       ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 03/22 -v7] Annotate core code that should not be traced Steven Rostedt
2008-01-30  3:15 ` [PATCH 04/22 -v7] x86_64: notrace annotations Steven Rostedt
2008-01-30  3:15 ` [PATCH 05/22 -v7] add notrace annotations to vsyscall Steven Rostedt
2008-01-30  8:49   ` Peter Zijlstra
2008-01-30 13:15     ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 06/22 -v7] handle accurate time keeping over long delays Steven Rostedt
2008-01-30  3:15 ` [PATCH 07/22 -v7] initialize the clock source to jiffies clock Steven Rostedt
2008-01-30  3:15 ` [PATCH 08/22 -v7] add get_monotonic_cycles Steven Rostedt
2008-01-30  3:15 ` [PATCH 09/22 -v7] add notrace annotations to timing events Steven Rostedt
2008-01-30  3:15 ` [PATCH 10/22 -v7] mcount based trace in the form of a header file library Steven Rostedt
2008-01-30  3:15 ` [PATCH 11/22 -v7] Add context switch marker to sched.c Steven Rostedt
2008-01-30  3:15 ` [PATCH 12/22 -v7] Make the task State char-string visible to all Steven Rostedt
2008-01-30  3:15 ` [PATCH 13/22 -v7] Add tracing of context switches Steven Rostedt
2008-01-30  3:15 ` [PATCH 14/22 -v7] Generic command line storage Steven Rostedt
2008-01-30  3:15 ` [PATCH 15/22 -v7] trace generic call to schedule switch Steven Rostedt
2008-01-30  3:15 ` [PATCH 16/22 -v7] Add marker in try_to_wake_up Steven Rostedt
2008-01-30  3:15 ` [PATCH 17/22 -v7] mcount tracer for wakeup latency timings Steven Rostedt
2008-01-30  9:31   ` Peter Zijlstra
2008-01-30 13:18     ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 18/22 -v7] Trace irq disabled critical timings Steven Rostedt
2008-01-30  3:15 ` [PATCH 19/22 -v7] trace preempt off " Steven Rostedt
2008-01-30  9:40   ` Peter Zijlstra
2008-01-30 13:40     ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 20/22 -v7] Add markers to various events Steven Rostedt
2008-01-30  3:15 ` [PATCH 21/22 -v7] Add event tracer Steven Rostedt
2008-01-30  3:15 ` [PATCH 22/22 -v7] Critical latency timings histogram Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080201223413.GB9247@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=acme@ghostprotocols.net \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=fche@redhat.com \
    --cc=ghaskins@novell.com \
    --cc=hch@infradead.org \
    --cc=jan.kiszka@siemens.com \
    --cc=johnstul@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@polymtl.ca \
    --cc=mingo@elte.hu \
    --cc=rostedt@goodmis.org \
    --cc=sam@ravnborg.org \
    --cc=srostedt@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tim.bird@am.sony.com \
    --cc=torvalds@linux-foundation.org \
    --subject='Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).