LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Jue Wang <juew@google.com>, Ding Hui <dinghui@sangfor.com.cn>,
	naoya.horiguchi@nec.com, osalvador@suse.de,
	Youquan Song <youquan.song@intel.com>,
	huangcun@sangfor.com.cn, x86@kernel.org,
	linux-edac@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery
Date: Fri, 20 Aug 2021 21:27:44 +0200	[thread overview]
Message-ID: <YSACMCEoU6FxjDNh@zn.tnic> (raw)
In-Reply-To: <20210820185945.GA1623421@agluck-desk2.amr.corp.intel.com>

On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> It's the "when we return" part that is the problem here. Logical
> trace looks like:
> 
> user-syscall:
> 
> 	kernel does get_user() or copyin(), hits user poison address
> 
> 		machine check
> 		sees that this was kernel get_user()/copyin() and
> 		uses extable to "return" to exception path
> 
> 	still in kernel, see that get_user() or copyin() failed
> 
> 	Kernel does another get_user() or copyin() (maybe the first

I forgot all the details we were talking at the time but there's no way
to tell the kernel to back off here, is it?

As in: there was an MCE while trying to access this user memory, you
should not do get_user anymore. You did add that

         * Return zero to pretend that this copy succeeded. This
         * is counter-intuitive, but needed to prevent the code
         * in lib/iov_iter.c from retrying and running back into

which you're removing with the last patch so I'm confused.

IOW, the problem is that with repeated MCEs while the kernel is
accessing that memory, it should be the kernel which should back off.
And then we should kill that process too but apparently we don't even
come to that.

> Maybe the message could be clearer?
> 
> 	mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);

That's not my point - it is rather: this is a recoverable error because
it is in user memory even if it is the kernel which tries to access it.
And maybe we should not panic the whole box but try to cordon off the
faulty memory only and poison it after having killed the process using
it...

> Not quite the same answer ... but similar.  We could in theory handle
> multiple different machine check addresses by turning the "mce_addr"
> field in the task structure into an array and saving each address so
> that when the kernel eventually gives up poking at poison and tries
> to return to user kill_me_maybe() could loop through them and deal
> with each poison page.

Yes, I like the aspect of making the kernel give up poking at poison and
when we return we should kill the process and poison all pages collected
so that the error source is hopefully contained.

But again, I think the important thing is how to make the kernel to back
off quicker so that we can poison the pages at all...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

  reply	other threads:[~2021-08-20 19:27 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-06 19:06 [PATCH 0/3] More machine check recovery fixes Tony Luck
2021-07-06 19:06 ` [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck
2021-07-06 19:06 ` [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
2021-07-06 19:06 ` [PATCH 3/3] x86/mce: Drop copyin special case for #MC Tony Luck
2021-08-18  0:29 ` [PATCH v2 0/3] More machine check recovery fixes Tony Luck
2021-08-18  0:29   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
2021-08-20 17:31     ` Borislav Petkov
2021-08-20 18:59       ` Luck, Tony
2021-08-20 19:27         ` Borislav Petkov [this message]
2021-08-20 20:23           ` Luck, Tony
2021-08-21  4:51             ` Tony Luck
2021-08-21 21:51               ` Al Viro
2021-08-22 14:36             ` Borislav Petkov
2021-08-20 20:33           ` Luck, Tony
2021-08-22 14:46             ` Borislav Petkov
2021-08-23 15:24               ` Luck, Tony
2021-09-13  9:24     ` Borislav Petkov
2021-09-13 21:52       ` [PATCH v3] " Luck, Tony
2021-09-14  8:28         ` Borislav Petkov
2021-08-18  0:29   ` [PATCH v2 2/3] x86/mce: Change to not send SIGBUS error during copy from user Tony Luck
2021-09-21  7:52     ` [tip: ras/core] " tip-bot2 for Tony Luck
2021-08-18  0:29   ` [PATCH v2 3/3] x86/mce: Drop copyin special case for #MC Tony Luck
2021-09-20  9:13     ` Borislav Petkov
2021-09-20 16:18       ` Luck, Tony
2021-09-20 16:37         ` Borislav Petkov
2021-09-20 16:43           ` Luck, Tony
2021-09-21  7:52     ` [tip: ras/core] " tip-bot2 for Tony Luck
2021-08-18 16:14   ` [PATCH v2 0/3] More machine check recovery fixes Luck, Tony
  -- strict thread matches above, loose matches on Subject: below --
2021-01-08 22:22 [PATCH 0/2] Fix infinite machine check loop in futex_wait_setup() Tony Luck
2021-01-11 21:44 ` [PATCH v2 0/3] " Tony Luck
2021-01-11 21:44   ` [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery Tony Luck
2021-01-11 22:11     ` Andy Lutomirski
2021-01-11 22:20       ` Luck, Tony
2021-01-12 17:00         ` Andy Lutomirski
2021-01-12 17:16           ` Luck, Tony
2021-01-12 17:21             ` Andy Lutomirski
2021-01-12 18:23               ` Luck, Tony
2021-01-12 18:57                 ` Andy Lutomirski
2021-01-12 20:52                   ` Luck, Tony
2021-01-12 22:04                     ` Andy Lutomirski
2021-01-13  1:50                       ` Luck, Tony
2021-01-13  4:15                         ` Andy Lutomirski
2021-01-13 10:00                           ` Borislav Petkov
2021-01-13 16:06                             ` Luck, Tony
2021-01-13 16:19                               ` Borislav Petkov
2021-01-13 16:32                                 ` Luck, Tony
2021-01-13 17:35                                   ` Borislav Petkov
2021-01-14 20:22     ` Borislav Petkov
2021-01-14 21:05       ` Luck, Tony

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YSACMCEoU6FxjDNh@zn.tnic \
    --to=bp@alien8.de \
    --cc=dinghui@sangfor.com.cn \
    --cc=huangcun@sangfor.com.cn \
    --cc=juew@google.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=osalvador@suse.de \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    --cc=youquan.song@intel.com \
    --subject='Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).