From: Rik van Riel <>
Cc: Dave Hansen <>,
	Andy Lutomirski <>,, Peter Zijlstra <>,
	Ingo Molnar <>, Borislav Petkov <>,
Subject: [PATCH] x86,mm: print likely CPU at segfault time
Date: Mon, 19 Jul 2021 15:00:41 -0400
Message-ID: <>

From 14d31a44a5186c94399dc9518ba80adf64c99772 Mon Sep 17 00:00:00 2001
From: Rik van Riel <>
Date: Mon, 19 Jul 2021 14:49:17 -0400
Subject: [PATCH] x86,mm: print likely CPU at segfault time

In a large enough fleet of computers, it is common to have a few bad
CPUs. Those can often be identified by seeing that some commonly run
kernel code (that runs fine everywhere else) keeps crashing on the
same CPU core on a particular bad system.

One of the failure modes observed is that either the instruction pointer
or some register used to specify the address of data that needs to be
fetched gets corrupted, resulting in something like a kernel page fault,
null pointer dereference, NX violation, or similar.

Those kernel failures are often preceded by similar-looking userspace
failures. It would be useful to know whether those are also happening on
the same CPU cores, to get a little more confirmation that it is indeed
a hardware issue.

Adding a printk to show_signal_msg() achieves that purpose. It isn't
perfect since the task might get rescheduled on another CPU between
when the fault hit and when the message is printed, but it should be
good enough to show correlation between userspace and kernel errors
when dealing with a bad CPU.

$ ./segfault
Segmentation fault (core dumped)
$ dmesg | grep segfault
segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0
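
The source of the ./segfault test program is not part of this message.
A minimal sketch of what such a reproducer could look like follows. It
assumes the fault is a plain read from address 0, which would be
consistent with "segfault at 0" and "error 4" (user-mode read of a
not-present page). The function name run_null_read_child() and the
fork() wrapper are illustrative additions, so that a caller can observe
the signal instead of dying itself.

```c
#include <stddef.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical reproducer, not the program used in the patch.
 *
 * Fork a child that reads from address 0 and report how it died.
 * Returns the terminating signal number, 0 if the child exited
 * normally, or -1 if fork() or waitpid() failed.
 */
int run_null_read_child(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/*
		 * Both the pointer and the pointee are volatile so the
		 * compiler can neither assume the pointer value nor
		 * drop the load.
		 */
		volatile int *volatile p = NULL;

		(void)*p;	/* user-mode read of an unmapped page */
		_exit(0);	/* not reached if the read faults */
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a patched kernel, each run of the child should leave one "segfault
at 0 ... on CPU N" line in dmesg, provided unhandled-signal logging is
enabled and the message is not rate limited.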

Signed-off-by: Rik van Riel <>
---
 arch/x86/mm/fault.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..dd6c89c23a3a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -777,6 +777,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 
 	print_vma_addr(KERN_CONT " in ", regs->ip);
 
+	printk(KERN_CONT " on CPU %d", raw_smp_processor_id());
+
 	printk(KERN_CONT "\n");
 
 	show_opcodes(regs, loglvl);
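
To correlate userspace segfaults with a suspect core, a fleet tool would
need to pull the CPU number back out of the log line. The following is a
rough userspace sketch of that step, assuming the " on CPU <n>" suffix
format shown in the commit message. The function name
segfault_line_cpu() is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Return the CPU number from a segfault log line that ends in
 * " on CPU <n>", or -1 if the line does not carry that suffix.
 */
int segfault_line_cpu(const char *line)
{
	static const char tag[] = " on CPU ";
	const char *p, *last = NULL;
	char *end;
	long cpu;

	assert(line != NULL);

	/* Take the last match: the binary's name could contain the tag. */
	for (p = strstr(line, tag); p != NULL; p = strstr(p + 1, tag))
		last = p;
	if (last == NULL)
		return -1;

	last += sizeof(tag) - 1;
	if (*last < '0' || *last > '9')
		return -1;	/* a digit must follow the tag directly */

	errno = 0;
	cpu = strtol(last, &end, 10);
	if (errno != 0 || cpu > INT_MAX)
		return -1;

	/* Only trailing whitespace may follow the number. */
	while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
		end++;

	return *end == '\0' ? (int)cpu : -1;
}
```

Feeding each "dmesg | grep segfault" line through such a helper and
counting results per CPU would show whether one core stands out. The
reported CPU is only the likely one, since the task may have migrated
before the message was printed.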
