From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752475AbbBWVRh (ORCPT ); Mon, 23 Feb 2015 16:17:37 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:44891 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751249AbbBWVRg (ORCPT ); Mon, 23 Feb 2015 16:17:36 -0500 Date: Mon, 23 Feb 2015 13:17:34 -0800 From: Andrew Morton To: Don Zickus Cc: LKML , Ulrich Obergfell , Ingo Molnar Subject: Re: [PATCH 6/9] watchdog: implement error handling for failure to set up hardware perf events Message-Id: <20150223131734.61ee63b5f4064e656f0da762@linux-foundation.org> In-Reply-To: <1423168825-156238-7-git-send-email-dzickus@redhat.com> References: <1423168825-156238-1-git-send-email-dzickus@redhat.com> <1423168825-156238-7-git-send-email-dzickus@redhat.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 5 Feb 2015 15:40:22 -0500 Don Zickus wrote: > From: Ulrich Obergfell > > If watchdog_nmi_enable() fails to set up the hardware perf event > of one CPU, the entire hard lockup detector is deemed unreliable. > Hence, disable the hard lockup detector and shut down the hardware > perf events on all CPUs. > > Signed-off-by: Ulrich Obergfell > Signed-off-by: Don Zickus > --- > kernel/watchdog.c | 18 ++++++++++++++++++ > 1 files changed, 18 insertions(+), 0 deletions(-) > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 26002ed..7ad8949 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -502,6 +502,15 @@ static void watchdog(unsigned int cpu) > __this_cpu_write(soft_lockup_hrtimer_cnt, > __this_cpu_read(hrtimer_interrupts)); > __touch_watchdog(); > + > + /* > + * watchdog_nmi_enable() clears the NMI_WATCHDOG_ENABLED bit in the > + * failure path. Check for failures that can occur asynchronously - > + * for example, when CPUs are on-lined - and shut down the hardware > + * perf event on each CPU accordingly. > + */ > + if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED)) > + watchdog_nmi_disable(cpu); Silently disabling the hardware watchdog. Wouldn't it be better to emit a printk to alert the operator about this event? > > #ifdef CONFIG_HARDLOCKUP_DETECTOR > @@ -552,6 +561,15 @@ handle_err: > goto out_save; > } > > + /* > + * Disable the hard lockup detector if _any_ CPU fails to set up > + * set up the hardware perf event. The watchdog() function checks > + * the NMI_WATCHDOG_ENABLED bit periodically. > + */ > + smp_mb__before_atomic(); > + clear_bit(NMI_WATCHDOG_ENABLED_BIT, &watchdog_enabled); > + smp_mb__after_atomic(); Please send along a patch which adds comments explaining what these barriers are for. What are these barriers for? ;) > /* skip displaying the same error again */ > if (cpu > 0 && (PTR_ERR(event) == cpu0_err)) > return PTR_ERR(event);