From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752922AbbBXPjf (ORCPT ); Tue, 24 Feb 2015 10:39:35 -0500
Received: from mx1.redhat.com ([209.132.183.28]:54784 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752475AbbBXPjd (ORCPT ); Tue, 24 Feb 2015 10:39:33 -0500
Date: Tue, 24 Feb 2015 10:39:25 -0500
From: Don Zickus
To: Andrew Morton
Cc: LKML, Ulrich Obergfell, Ingo Molnar
Subject: Re: [PATCH 6/9] watchdog: implement error handling for failure to
	set up hardware perf events
Message-ID: <20150224153925.GI126481@redhat.com>
References: <1423168825-156238-1-git-send-email-dzickus@redhat.com>
	<1423168825-156238-7-git-send-email-dzickus@redhat.com>
	<20150223131734.61ee63b5f4064e656f0da762@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150223131734.61ee63b5f4064e656f0da762@linux-foundation.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 23, 2015 at 01:17:34PM -0800, Andrew Morton wrote:
> On Thu, 5 Feb 2015 15:40:22 -0500 Don Zickus wrote:
>
> > From: Ulrich Obergfell
> >
> > If watchdog_nmi_enable() fails to set up the hardware perf event
> > of one CPU, the entire hard lockup detector is deemed unreliable.
> > Hence, disable the hard lockup detector and shut down the hardware
> > perf events on all CPUs.
> >
> > Signed-off-by: Ulrich Obergfell
> > Signed-off-by: Don Zickus
> > ---
> >  kernel/watchdog.c | 18 ++++++++++++++++++
> >  1 files changed, 18 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index 26002ed..7ad8949 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -502,6 +502,15 @@ static void watchdog(unsigned int cpu)
> >  	__this_cpu_write(soft_lockup_hrtimer_cnt,
> >  			 __this_cpu_read(hrtimer_interrupts));
> >  	__touch_watchdog();
> > +
> > +	/*
> > +	 * watchdog_nmi_enable() clears the NMI_WATCHDOG_ENABLED bit in the
> > +	 * failure path. Check for failures that can occur asynchronously -
> > +	 * for example, when CPUs are on-lined - and shut down the hardware
> > +	 * perf event on each CPU accordingly.
> > +	 */
> > +	if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED))
> > +		watchdog_nmi_disable(cpu);
>
> Silently disabling the hardware watchdog.  Wouldn't it be better to
> emit a printk to alert the operator about this event?

We can add something here.  The original hypothetical problem was cpu0 and 1
start the watchdog correctly, but for whatever reason cpu2 fails.  You would
then see output like:

...
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
..
NMI watchdog: disabled (cpu2): hardware events not enabled
..

Asynchronously a few seconds later the above code would disable _all_ the
nmi watchdog across the cpus.  My thought process was the first failure was
enough.  But perhaps that isn't obvious enough that the other cpus would be
disabled.  So we can add something.

>
> >
> >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > @@ -552,6 +561,15 @@ handle_err:
> >  		goto out_save;
> >  	}
> >
> > +	/*
> > +	 * Disable the hard lockup detector if _any_ CPU fails to set up
> > +	 * set up the hardware perf event. The watchdog() function checks
> > +	 * the NMI_WATCHDOG_ENABLED bit periodically.
> > +	 */
> > +	smp_mb__before_atomic();
> > +	clear_bit(NMI_WATCHDOG_ENABLED_BIT, &watchdog_enabled);
> > +	smp_mb__after_atomic();
>
> Please send along a patch which adds comments explaining what these
> barriers are for.
>
> What are these barriers for?  ;)

Sadly, I am not strong with memory barriers, so the bulk of the motivation
for barriers comes from Documentation/atomic_ops.txt when using clear_bit().

I will add some comment explaining the need to sync up watchdog_enabled
before we clear it.

Cheers,
Don