LKML Archive on lore.kernel.org
From: Guenter Roeck <linux@roeck-us.net>
To: Waiman Long <longman@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	linux-kernel@vger.kernel.org,
	Jeremy Linton <jeremy.linton@arm.com>,
	pbunyan@redhat.com
Subject: Re: [RFC PATCH v2] tick: Make tick_periodic() check for missing ticks
Date: Sun, 15 Mar 2020 19:57:23 -0700
Message-ID: <26e82da0-1395-2f92-0318-09ab336222ba@roeck-us.net>
In-Reply-To: <087e8692-4bfd-6407-3aac-7689f80142de@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 3934 bytes --]

On 3/15/20 7:43 PM, Waiman Long wrote:
> On 3/15/20 10:20 PM, Guenter Roeck wrote:
>> Hi,
>>
>> On Fri, Feb 07, 2020 at 02:39:29PM -0500, Waiman Long wrote:
>>> The tick_periodic() function is used at the beginning part of the
>>> bootup process for time keeping while the other clock sources are
>>> being initialized.
>>>
>>> The current code assumes that all the timer interrupts are handled in
>>> a timely manner with no missing ticks. That is not actually true. Some
>>> ticks are missed and there are some discrepancies between the tick time
>>> (jiffies) and the timestamp reported in the kernel log.  Some systems,
>>> however, are more prone to missing ticks than the others.  In the extreme
>>> case, the discrepancy can actually cause a soft lockup message to be
>>> printed by the watchdog kthread. For example, on a Cavium ThunderX2
>>> Sabre arm64 system:
>>>
>>>  [   25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s!
>>>
>>> On that system, the missing ticks are especially prevalent during the
>>> smp_init() phase of the boot process. With an instrumented kernel,
>>> it was found that it took about 24s as reported by the timestamp for
>>> the tick to accumulate 4s of time.
>>>
>>> Investigation and bisection done by others seemed to point to the
>>> commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or
>>> lack thereof") as the culprit. It could also be a firmware issue as
>>> new firmware was promised that would fix the issue.
>>>
>>> To properly address this problem, we cannot assume that there will
>>> be no missing tick in tick_periodic(). This function is now modified
>>> to follow the example of tick_do_update_jiffies64() by using another
>>> reference clock to check for missing ticks. Since the watchdog timer
>>> uses running_clock(), it is used here as the reference. With this patch
>>> applied, the soft lockup problem in the arm64 system is gone and tick
>>> time tracks much more closely to the timestamp time.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Since this patch is in linux-next, roughly 10% of my x86 and x86_64
>> qemu emulation boots are stalling. Typical log:
>>
>> [    0.002016] smpboot: Total of 1 processors activated (7576.40 BogoMIPS)
>> [    0.002016] devtmpfs: initialized
>> [    0.002016] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
>> [    0.002016] futex hash table entries: 256 (order: 3, 32768 bytes, linear)
>> [    0.002016] xor: measuring software checksum speed
>>
>> another:
>>
>> [    0.002653] Freeing SMP alternatives memory: 44K
>> [    0.002653] smpboot: CPU0: Intel Westmere E56xx/L56xx/X56xx (IBRS update) (family: 0x6, model: 0x2c, stepping: 0x1)
>> [    0.002653] Performance Events: unsupported p6 CPU model 44 no PMU driver, software events only.
>> [    0.002653] rcu: Hierarchical SRCU implementation.
>> [    0.002653] smp: Bringing up secondary CPUs ...
>> [    0.002653] x86: Booting SMP configuration:
>> [    0.002653] .... node  #0, CPUs:      #1
>> [    0.000000] smpboot: CPU 1 Converting physical 0 to logical die 1
>>
>> ... and then there is silence until the test aborts.
>>
>> This is only (or at least predominantly) seen if the system running
>> the emulation is under load.
>>
>> Reverting this patch fixes the problem.
> 
> I was aware that there are some problems with this patch, but they are
> hard to reproduce. Do you have a more consistent way to reproduce it?
> When you say under load, you mean that the host system is also busy so
> that there is a lot of vcpu preemption, right? Could you give me the

Correct. I am able to reproduce the problem quite reliably (i.e. 2-3 boots
out of ~25 fail) if I run a kernel compilation in parallel, but not (or
only rarely) if the system is otherwise idle.

> x86-64 .config file that you use?
> 

Attached. It is pretty much defconfig with various debug and test options
enabled.

Guenter

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 30037 bytes --]
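
For readers following the quoted patch description, the idea of checking
for missed ticks against a reference clock can be sketched roughly as
below. This is a minimal illustration, not the code from the patch: the
names sketch_tick, sketch_tick_update and SKETCH_TICK_NS are hypothetical,
the 4 ms tick length (HZ=250) is an assumption, and the reference-clock
reading is passed in as a plain nanosecond value rather than obtained from
running_clock().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tick length: 4 ms, i.e. HZ=250 (not stated in the thread). */
#define SKETCH_TICK_NS 4000000ULL

/* Hypothetical bookkeeping for this sketch only. */
struct sketch_tick {
	uint64_t jiffies;     /* ticks accounted for so far */
	uint64_t last_ref_ns; /* reference-clock time covered by those ticks */
};

/*
 * Called once per timer interrupt with the current reference-clock
 * reading. Returns the number of ticks credited: 1 in the normal case,
 * more if at least two full tick periods have elapsed on the reference
 * clock since the last accounted tick.
 */
static uint64_t sketch_tick_update(struct sketch_tick *t, uint64_t now_ns)
{
	uint64_t ticks = 1;
	uint64_t delta = 0;

	assert(t);

	/* Treat an interrupt that arrives "early" as zero elapsed time. */
	if (now_ns > t->last_ref_ns)
		delta = now_ns - t->last_ref_ns;

	/* Two or more periods elapsed: some interrupts were missed. */
	if (delta >= 2 * SKETCH_TICK_NS)
		ticks = delta / SKETCH_TICK_NS;

	t->jiffies += ticks;
	t->last_ref_ns += ticks * SKETCH_TICK_NS;
	return ticks;
}
```

With this kind of accounting, a single interrupt that arrives 24 ms after
the last accounted tick credits six ticks instead of one, so the tick
count stays close to the reference clock. How such a scheme behaves when
the reference clock itself jumps (for example under heavy vcpu preemption)
is exactly the sort of question the stalls reported above raise; the
sketch does not attempt to answer it.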
