From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D23E1C00140
	for ; Tue, 2 Aug 2022 20:09:07 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S233792AbiHBUJG (ORCPT );
	Tue, 2 Aug 2022 16:09:06 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51604 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S232116AbiHBUJE (ORCPT );
	Tue, 2 Aug 2022 16:09:04 -0400
Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC1D852455
	for ; Tue, 2 Aug 2022 13:09:03 -0700 (PDT)
Received: from [2603:3005:d05:2b00:6e0b:84ff:fee2:98bb] (helo=imladris.surriel.com)
	by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
	(Exim 4.95)
	(envelope-from )
	id 1oIyCL-0001Cv-Gd;
	Tue, 02 Aug 2022 16:09:01 -0400
Date: Tue, 2 Aug 2022 16:09:00 -0400
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: x86@vger.kernel.org, kernel-team@fb.com, Dave Hansen,
	Thomas Gleixner, Dave Jones, Andy Lutomirski
Subject: [PATCH] x86,mm: print likely CPU at segfault time
Message-ID: <20220802160900.7a68909b@imladris.surriel.com>
X-Mailer: Claws Mail 4.0.0 (GTK+ 3.24.31; x86_64-redhat-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: riel@shelob.surriel.com
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

In a large enough fleet of computers, it is common to have a few bad
CPUs. Those can often be identified by seeing that some commonly run
kernel code, which runs fine everywhere else, keeps crashing on the
same CPU core on one particular bad system.

However, the failure modes in CPUs that have gone bad over the years
are often oddly specific, and the only bad behavior seen might be
segfaults in programs like bash, python, or various system daemons
that run fine everywhere else.

Add a printk to show_signal_msg() to print the CPU, core, and socket
at segfault time. This is not perfect, since the task might get
rescheduled on another CPU between when the fault hit, and when the
message is printed, but in practice this has been good enough to help
us identify several bad CPU cores.

segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)

Signed-off-by: Rik van Riel
CC: Dave Jones
---
 arch/x86/mm/fault.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fad8faa29d04..47faf7c0041e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -782,6 +782,12 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 
 	print_vma_addr(KERN_CONT " in ", regs->ip);
 
+	printk(KERN_CONT " on CPU %d (core %d, socket %d)",
+	       raw_smp_processor_id(),
+	       topology_core_id(raw_smp_processor_id()),
+	       topology_physical_package_id(raw_smp_processor_id()));
+
+
 	printk(KERN_CONT "\n");
 
 	show_opcodes(regs, loglvl);
-- 
2.37.1
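
Not part of the patch: a minimal userspace sketch of how the new suffix could be picked out of a log line when scanning a fleet's logs for cores that keep showing up. It assumes the exact message format from the example line above; the helper name parse_segfault_cpu() and its parameters are hypothetical and exist only for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: extract the CPU, core and
 * socket numbers from a segfault log line that ends with the suffix
 * " on CPU %d (core %d, socket %d)" shown in the example output.
 *
 * Returns 0 on success, -1 if the suffix is missing or malformed.
 */
static int parse_segfault_cpu(const char *line, int *cpu, int *core,
			      int *socket_id)
{
	/* Locate the start of the suffix added by the printk. */
	const char *p = strstr(line, " on CPU ");

	if (!p)
		return -1;

	/* All three numbers must convert for the line to count. */
	if (sscanf(p, " on CPU %d (core %d, socket %d)",
		   cpu, core, socket_id) != 3)
		return -1;

	return 0;
}
```

A caller would feed each kernel log line to the helper and count hits per (socket, core) pair; since the task may have migrated before the message was printed, any single hit is only a hint, and it is the repeated hits on one core that matter.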