LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
       [not found] <20210512013058.6827-1-mukul.joshi@amd.com>
@ 2021-05-12  9:36 ` Borislav Petkov
  2021-05-12 19:00   ` Joshi, Mukul
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2021-05-12  9:36 UTC (permalink / raw)
  To: Mukul Joshi; +Cc: amd-gfx, harish.kasiviswanathan, x86-ml, lkml

Hi,

so this is a drive-by review using the lore.kernel.org mail because I
wasn't CCed on this.

On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> +				    unsigned long val, void *data)
> +{
> +	struct mce *m = (struct mce *)data;
> +	struct amdgpu_device *adev = NULL;
> +	uint32_t gpu_id = 0;
> +	uint32_t umc_inst = 0;
> +	uint32_t chan_index = 0;
> +	struct ras_err_data err_data = {0, 0, 0, NULL};
> +	struct eeprom_table_record err_rec;
> +	uint64_t retired_page;
> +
> +	/*
> +	 * If the error was generated in UMC_V2, which belongs to GPU UMCs,

Why does it matter if the error is a v2 UMC generated one?

IOW, can you detect the type of errors in GPU memory - I assume this
is what this is supposed to handle - from the error signature itself
instead of doing is_smca_umc_v2?

> +	 * and error occurred in DramECC (Extended error code = 0) then only
> +	 * process the error, else bail out.
> +	 */
> +	if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> +		return NOTIFY_DONE;
> +
> +	gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> +
> +	/*
> +	 * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> +	 */
> +	gpu_id -= GPU_ID_OFFSET;
> +
> +	adev = find_adev(gpu_id);
> +	if (!adev) {
> +		dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> +				     __func__, gpu_id);
> +		return NOTIFY_DONE;
> +	}
> +
> +	/*
> +	 * If it is correctable error, then print a message and return.
> +	 */
> +	if (mce_is_correctable(m)) {
> +		dev_info(adev->dev, "%s: UMC Correctable error detected.",
> +				    __func__);

So put yourself in the shoes of a support engineer who starts getting
all those calls about people's hardware getting correctable errors and
should they replace it and should AMD RMA the GPUs and so on and so
on..? Do you really wanna be on the receiving side of that call?

IOW, whom does printing the fact that the GPU had a corrected error
which got corrected by the hardware, help and do you really want to
upset people needlessly?

> +		return NOTIFY_OK;
> +	}
> +
> +	/*
> +	 * If it is uncorrectable error, then find out UMC instance and
> +	 * channel index.
> +	 */
> +	find_umc_inst_chan_index(m, &umc_inst, &chan_index);

That's a void function but it could return a pointer to the instance and
channel bundled in a struct maybe...

> +
> +	dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> +			    " chan_idx: %d", umc_inst, chan_index);
> +
> +	memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> +
> +	/*
> +	 * Translate UMC channel address to Physical address
> +	 */
> +	retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> +			ADDR_OF_256B_BLOCK(chan_index) |
> +			OFFSET_IN_256B_BLOCK(m->addr);
> +
> +	err_rec.address = m->addr;
> +	err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> +	err_rec.ts = (uint64_t)ktime_get_real_seconds();
> +	err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> +	err_rec.cu = 0;
> +	err_rec.mem_channel = chan_index;
> +	err_rec.mcumc_id = umc_inst;
> +
> +	err_data.err_addr = &err_rec;
> +	err_data.err_addr_cnt = 1;
> +
> +	if (amdgpu_bad_page_threshold != 0) {
> +		amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> +						err_data.err_addr_cnt);
> +		amdgpu_ras_save_bad_pages(adev);
> +	}
> +
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block amdgpu_bad_page_nb = {
> +	.notifier_call  = amdgpu_bad_page_notifier,
> +	.priority       = MCE_PRIO_ACCEL,

Nothing ever explains why this needs to be a separate priority. So how
about it?

> +};
> +
> +static void amdgpu_register_bad_pages_mca_notifier(void)
> +{
> +	/*
> +	 * Register the x86 notifier with MCE subsystem.
> +	 * Please note a notifier can be registered only once
> +	 * with the MCE subsystem.
> +	 */

Why do you need this? Other users don't need that. Do you need to call
mce_unregister_decode_chain() when the driver gets removed or so?

> +	if (notifier_registered == false) {

This is just silly and should be fixed properly once the issue is
understood.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-12  9:36 ` [PATCH] drm/amdgpu: Register bad page handler for Aldebaran Borislav Petkov
@ 2021-05-12 19:00   ` Joshi, Mukul
  2021-05-12 21:05     ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-05-12 19:00 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

[AMD Official Use Only - Internal Distribution Only]


> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Wednesday, May 12, 2021 5:37 AM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>; lkml <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> Hi,
> 
> so this is a drive-by review using the lore.kernel.org mail because I wasn't CCed
> on this.
> 
> On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +                                 unsigned long val, void *data) {
> > +     struct mce *m = (struct mce *)data;
> > +     struct amdgpu_device *adev = NULL;
> > +     uint32_t gpu_id = 0;
> > +     uint32_t umc_inst = 0;
> > +     uint32_t chan_index = 0;
> > +     struct ras_err_data err_data = {0, 0, 0, NULL};
> > +     struct eeprom_table_record err_rec;
> > +     uint64_t retired_page;
> > +
> > +     /*
> > +      * If the error was generated in UMC_V2, which belongs to GPU
> > + UMCs,
> 
> Why does it matter if the error is a v2 UMC generated one?
> 
> IOW, can you detect the type of errors in GPU memory - I assume this is what
> this is supposed to handle - from the error signature itself instead of doing
> is_smca_umc_v2?

SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is only interested in errors on GPU UMC.
We cannot know this without is_smca_umc_v2.

> 
> > +      * and error occurred in DramECC (Extended error code = 0) then only
> > +      * process the error, else bail out.
> > +      */
> > +     if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> > +             return NOTIFY_DONE;
> > +
> > +     gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> > +
> > +     /*
> > +      * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +      */
> > +     gpu_id -= GPU_ID_OFFSET;
> > +
> > +     adev = find_adev(gpu_id);
> > +     if (!adev) {
> > +             dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > +                                  __func__, gpu_id);
> > +             return NOTIFY_DONE;
> > +     }
> > +
> > +     /*
> > +      * If it is correctable error, then print a message and return.
> > +      */
> > +     if (mce_is_correctable(m)) {
> > +             dev_info(adev->dev, "%s: UMC Correctable error detected.",
> > +                                 __func__);
> 
> So put yourself in the shoes of a support engineer who starts getting all those
> calls about people's hardware getting correctable errors and should they replace
> it and should AMD RMA the GPUs and so on and so on..? Do you really wanna be
> on the receiving side of that call?
> 
> IOW, whom does printing the fact that the GPU had a corrected error which got
> corrected by the hardware, help and do you really want to upset people
> needlessly?

Agree. I will remove this debug print.

> 
> > +             return NOTIFY_OK;
> > +     }
> > +
> > +     /*
> > +      * If it is uncorrectable error, then find out UMC instance and
> > +      * channel index.
> > +      */
> > +     find_umc_inst_chan_index(m, &umc_inst, &chan_index);
> 
> That's a void function but it could return a pointer to the instance and channel
> bundled in a struct maybe...

Maybe. I hope its not too much of a concern if it stays the way it is.

> 
> > +
> > +     dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> > +                         " chan_idx: %d", umc_inst, chan_index);
> > +
> > +     memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> > +
> > +     /*
> > +      * Translate UMC channel address to Physical address
> > +      */
> > +     retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> > +                     ADDR_OF_256B_BLOCK(chan_index) |
> > +                     OFFSET_IN_256B_BLOCK(m->addr);
> > +
> > +     err_rec.address = m->addr;
> > +     err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> > +     err_rec.ts = (uint64_t)ktime_get_real_seconds();
> > +     err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> > +     err_rec.cu = 0;
> > +     err_rec.mem_channel = chan_index;
> > +     err_rec.mcumc_id = umc_inst;
> > +
> > +     err_data.err_addr = &err_rec;
> > +     err_data.err_addr_cnt = 1;
> > +
> > +     if (amdgpu_bad_page_threshold != 0) {
> > +             amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> > +                                             err_data.err_addr_cnt);
> > +             amdgpu_ras_save_bad_pages(adev);
> > +     }
> > +
> > +     return NOTIFY_OK;
> > +}
> > +
> > +static struct notifier_block amdgpu_bad_page_nb = {
> > +     .notifier_call  = amdgpu_bad_page_notifier,
> > +     .priority       = MCE_PRIO_ACCEL,
> 
> Nothing ever explains why this needs to be a separate priority. So how about it?

I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
are two different class of devices.
That is the reason I create a new one. 
I am OK to use EDAC priority if that is acceptable.

> 
> > +};
> > +
> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +     /*
> > +      * Register the x86 notifier with MCE subsystem.
> > +      * Please note a notifier can be registered only once
> > +      * with the MCE subsystem.
> > +      */
> 
> Why do you need this? Other users don't need that. Do you need to call
> mce_unregister_decode_chain() when the driver gets removed or so?
> 
> > +     if (notifier_registered == false) {
> 
> This is just silly and should be fixed properly once the issue is understood.

A system can have multiple GPUs and we only want a single notifier registered.
I will change the comment to explicitly state this.

Thanks,
Mukul

> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7Cmukul.joshi%40amd.com%7C1c231207786
> 446018e7208d915297942%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637564090158211378%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =7LSOnoJWrTf5z96YFACxuRKZP1z4E4zkvtrNzjbTaPs%3D&amp;reserved=0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-12 19:00   ` Joshi, Mukul
@ 2021-05-12 21:05     ` Borislav Petkov
  2021-05-13  3:20       ` Joshi, Mukul
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2021-05-12 21:05 UTC (permalink / raw)
  To: Joshi, Mukul; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one. 
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do. It seems you are interested only in
uncorrectable errors.

How are those errors reported? #MC exception, deferred interrupt, simply
logged in the bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do? What can the user do when she sees the

"Uncorrectable error detected in UMC..."

message?

It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-12 21:05     ` Borislav Petkov
@ 2021-05-13  3:20       ` Joshi, Mukul
  2021-05-13  9:53         ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-05-13  3:20 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Wednesday, May 12, 2021 5:06 PM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>; lkml <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> > only interested in errors on GPU UMC.
> 
> So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.
> 
> > We cannot know this without is_smca_umc_v2.
> 
> You don't need it - just export smca_get_bank_type() and test the bank type at
> the call site.

Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
driver when CONFIG_X86_MCE_AMD is not defined.
I can avoid all that by using is_smca_umc_v2().
I think it would be cleaner with using is_smca_umc_v2().

> 
> > Maybe. I hope its not too much of a concern if it stays the way it is.
> 
> That was just a suggestion anyway - it is not code I maintain so not my call.
> 
> > I wasn't really sure if I should use the EDAC priority here or create a new one
> for Accelerator devices.
> > I thought using EDAC priority might not be accepted by the maintainers
> > as EDAC and GPU (Accelerator) devices are two different class of devices.
> > That is the reason I create a new one.
> > I am OK to use EDAC priority if that is acceptable.
> 
> I don't know what's acceptable because I still am unclear as to what that thing is
> supposed to do. It seems you are interested only in uncorrectable errors.
> 

You can think of GPU device as a EDAC device here. It is mainly interested in
handling uncorrectable errors.

> How are those errors reported? #MC exception, deferred interrupt, simply
> logged in the bank and we find them by polling?
> 

It is a deferred interrupt that generates an MCE.

> Then, the commit message is talking about some "bad page retirement".
> What does that do? What can the user do when she sees the
> 
> "Uncorrectable error detected in UMC..."
> 
> message?
> 
> It depends on what "retiring" of GPU pages means...
> 
> In any case, dmesg should issue a human-understandable message about the
> recovery action being done and what that means for the user: should she
> replace the GPU, should she ignore, etc, etc.
>

When an uncorrectable error is detected on the GPU UMC, all we are doing is 
determining the physical address where the error occurred and then "retiring"
the page that address belongs to.
By retiring, we mean we reserve the page so that it is not available for allocations
to any applications.
We are providing information to the user by storing all the information about the
retired pages in EEPROM. This can be accessed through sysfs.

Hope it clears what "bad page retirement" is achieving.

Thanks,
Mukul

> > A system can have multiple GPUs and we only want a single notifier
> > registered. I will change the comment to explicitly state this.
> 
> Actually, the notifier registration should be able to return a different retval to
> state that a callback has already been registered but that warns only currently so
> I'm guessing we're stuck with such ugly "workarounds" for their shortcomings.
> 
> I'm gonna take a look whether they can be fixed though so that you don't have
> to do this notifier_registered thing.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C5e305845504
> 14bc53c3008d91589b50b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637564503481206042%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =fF2iaSQxAREAV2OwLi%2F8cN9uRuqNNKim2FpWJUBQS98%3D&amp;reserved=
> 0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13  3:20       ` Joshi, Mukul
@ 2021-05-13  9:53         ` Borislav Petkov
  2021-05-13 14:17           ` Alex Deucher
  2021-05-13 23:10           ` Joshi, Mukul
  0 siblings, 2 replies; 20+ messages in thread
From: Borislav Petkov @ 2021-05-13  9:53 UTC (permalink / raw)
  To: Joshi, Mukul; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> driver when CONFIG_X86_MCE_AMD is not defined.
> I can avoid all that by using is_smca_umc_v2().
> I think it would be cleaner with using is_smca_umc_v2().

See how smca_get_long_name() is exported and export that function the
same way.

To save you some energy: is_smca_umc_v2() is not going to happen.

> You can think of GPU device as a EDAC device here. It is mainly
> interested in handling uncorrectable errors.

An EDAC "device", as you call it, is not interested in handling UEs. If
anything, it counts them.

> It is a deferred interrupt that generates an MCE.

Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?

> When an uncorrectable error is detected on the GPU UMC, all we are
> doing is determining the physical address where the error occurred and
> then "retiring" the page that address belongs to.

What page is that? Normal DRAM page or a page in some special GPU memory?

> By retiring, we mean we reserve the page so that it is not available
> for allocations to any applications.

We do that for normal DRAM memory pages by poisoning them. I hope you
don't mean that.

Looking at

amdgpu_ras_add_bad_pages
|-> amdgpu_vram_mgr_reserve_range

that's some VRAM thing so I'm guessing special memory on the GPU.

If so, what happens with all those "retired" pages when you reboot?
They're getting used again and potentially trigger the same UEs and the
same retiring happens?

> We are providing information to the user by storing all the
> information about the retired pages in EEPROM. This can be accessed
> through sysfs.

Ok, I'm a user and I can access that information through sysfs. What can
I do with it?

> Hope it clears what "bad page retirement" is achieving.

It is getting there.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13  9:53         ` Borislav Petkov
@ 2021-05-13 14:17           ` Alex Deucher
  2021-05-13 14:30             ` Borislav Petkov
  2021-05-13 23:10           ` Joshi, Mukul
  1 sibling, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2021-05-13 14:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joshi, Mukul, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

On Thu, May 13, 2021 at 9:26 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> > driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
>
> See how smca_get_long_name() is exported and export that function the
> same way.
>
> To save you some energy: is_smca_umc_v2() is not going to happen.
>
> > You can think of GPU device as a EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
>
> An EDAC "device", as you call it, is not interested in handling UEs. If
> anything, it counts them.
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
>
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
>
> What page is that? Normal DRAM page or a page in some special GPU memory?
>

GPU memory.

> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
>
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
>
> Looking at
>
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
>
> that's some VRAM thing so I'm guessing special memory on the GPU.
>

Yes.

> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the
> same retiring happens?

The bad pages are stored in an EEPROM on the board and the next time
the driver loads it reads the EEPROM so that it can reserve the bad
pages at init time so they don't get used again.

Alex


>
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
>
> Ok, I'm a user and I can access that information through sysfs. What can
> I do with it?
>
> > Hope it clears what "bad page retirement" is achieving.
>
> It is getting there.
>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 14:17           ` Alex Deucher
@ 2021-05-13 14:30             ` Borislav Petkov
  2021-05-13 14:32               ` Alex Deucher
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2021-05-13 14:30 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Joshi, Mukul, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> The bad pages are stored in an EEPROM on the board and the next time
> the driver loads it reads the EEPROM so that it can reserve the bad
> pages at init time so they don't get used again.

And that works automagically on the next boot? Because that sounds like
the right thing to do.

So practically, what happens to a GPU in such a case where the VRAM
starts going bad? It might get exhausted eventually and the driver will
say something along the lines of:

  "VRAM bad pages: 80%, consider replacing the GPU. It is operating
  currently with degrated performance."

or so?

Yap, from a RAS perspective, that makes good sense as you're prolonging
the life of the component while still remains operational as good as it
can and the only user interaction you need is she/he replacing it.

Sounds good.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 14:30             ` Borislav Petkov
@ 2021-05-13 14:32               ` Alex Deucher
  2021-05-13 14:57                 ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2021-05-13 14:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joshi, Mukul, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

On Thu, May 13, 2021 at 10:30 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> > The bad pages are stored in an EEPROM on the board and the next time
> > the driver loads it reads the EEPROM so that it can reserve the bad
> > pages at init time so they don't get used again.
>
> And that works automagically on the next boot? Because that sounds like
> the right thing to do.

Yes, or driver reload, suspend/resume, etc.

>
> So practically, what happens to a GPU in such a case where the VRAM
> starts going bad? It might get exhausted eventually and the driver will
> say something along the lines of:
>
>   "VRAM bad pages: 80%, consider replacing the GPU. It is operating
>   currently with degrated performance."
>
> or so?

Right.  The sys admin can query the bad page count and decide when to
retire the card.

>
> Yap, from a RAS perspective, that makes good sense as you're prolonging
> the life of the component while still remains operational as good as it
> can and the only user interaction you need is she/he replacing it.
>
> Sounds good.

Yes.  That's the idea.

Alex


>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 14:32               ` Alex Deucher
@ 2021-05-13 14:57                 ` Borislav Petkov
  2021-05-13 15:02                   ` Alex Deucher
  2021-05-13 23:14                   ` Joshi, Mukul
  0 siblings, 2 replies; 20+ messages in thread
From: Borislav Petkov @ 2021-05-13 14:57 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Joshi, Mukul, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> Right.  The sys admin can query the bad page count and decide when to
> retire the card.

Yap, although the driver should actively "tell" the sysadmin when some
critical counts of retired VRAM pages are reached because I doubt all
admins would go look at those counts on their own.

Btw, you say "admin" - am I to understand that those are some high end
GPU cards with ECC memory? If consumer grade stuff has this too, then
the driver should very much warn on such levels on its own because
normal users won't know what and where to look.

Other than that, the big picture sounds good to me.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 14:57                 ` Borislav Petkov
@ 2021-05-13 15:02                   ` Alex Deucher
  2021-05-13 23:14                   ` Joshi, Mukul
  1 sibling, 0 replies; 20+ messages in thread
From: Alex Deucher @ 2021-05-13 15:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joshi, Mukul, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

On Thu, May 13, 2021 at 10:57 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> > Right.  The sys admin can query the bad page count and decide when to
> > retire the card.
>
> Yap, although the driver should actively "tell" the sysadmin when some
> critical counts of retired VRAM pages are reached because I doubt all
> admins would go look at those counts on their own.

I think we print something in the log as well when we hit the
threshold.  I need to double check the code.

>
> Btw, you say "admin" - am I to understand that those are some high end
> GPU cards with ECC memory? If consumer grade stuff has this too, then
> the driver should very much warn on such levels on its own because
> normal users won't know what and where to look.
>

Currently it's only available on workstation and datacenter boards.

> Other than that, the big picture sounds good to me.

Thanks!

Alex

>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13  9:53         ` Borislav Petkov
  2021-05-13 14:17           ` Alex Deucher
@ 2021-05-13 23:10           ` Joshi, Mukul
  2021-05-14  7:05             ` Borislav Petkov
  1 sibling, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-05-13 23:10 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Thursday, May 13, 2021 5:53 AM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>; lkml <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile
> > the amdgpu driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
> 
> See how smca_get_long_name() is exported and export that function the same
> way.
> 

That's probably not the best example to look at.
smca_get_long_name() is used in drivers/edac/mce_amd.c and this file doesn't
get compiled when CONFIG_X86_MCE_AMD is not defined.

And amdgpu driver has no dependency on CONFIG_X86_MCE_AMD.

So here is one option that we can try:
1. Export smca_get_bank_type().
2. I wrap my entire code in GPU driver with #ifdef CONFIG_X86_MCE_AMD

Will that work for you?

Thanks,
Mukul

> To save you some energy: is_smca_umc_v2() is not going to happen.


> 
> > You can think of GPU device as a EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
> 
> An EDAC "device", as you call it, is not interested in handling UEs. If anything, it
> counts them.
> 
> > It is a deferred interrupt that generates an MCE.
> 
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
> 
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
> 
> What page is that? Normal DRAM page or a page in some special GPU memory?
> 
> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
> 
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
> 
> Looking at
> 
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
> 
> that's some VRAM thing so I'm guessing special memory on the GPU.
> 
> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the same
> retiring happens?
> 
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
> 
> Ok, I'm a user and I can access that information through sysfs. What can I do
> with it?
> 
> > Hope it clears what "bad page retirement" is achieving.
> 
> It is getting there.
> 
> Thx.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7Cd8c660fce3a2
> 4ce3c6d408d915f4efa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%
> 7C637564964013263414%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwM
> DAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=
> %2BnJ%2B99N%2FRljoHGALimZHZG%2Bmf9jL5zP2eA44I6pbzFY%3D&amp;reser
> ved=0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 14:57                 ` Borislav Petkov
  2021-05-13 15:02                   ` Alex Deucher
@ 2021-05-13 23:14                   ` Joshi, Mukul
  2021-05-14  7:03                     ` Borislav Petkov
  1 sibling, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-05-13 23:14 UTC (permalink / raw)
  To: Borislav Petkov, Alex Deucher
  Cc: x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Thursday, May 13, 2021 10:58 AM
> To: Alex Deucher <alexdeucher@gmail.com>
> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>; x86-ml <x86@kernel.org>;
> Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>; lkml <linux-
> kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> > Right.  The sys admin can query the bad page count and decide when to
> > retire the card.
> 
> Yap, although the driver should actively "tell" the sysadmin when some critical
> counts of retired VRAM pages are reached because I doubt all admins would go
> look at those counts on their own.
> 
> Btw, you say "admin" - am I to understand that those are some high end GPU
> cards with ECC memory? If consumer grade stuff has this too, then the driver
> should very much warn on such levels on its own because normal users won't
> know what and where to look.
> 
> Other than that, the big picture sounds good to me.
> 

Since now you are OK with how page retirement works, lets revisit the original 
question.

Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us to use
something else?

Thanks,
Mukul

> Thx.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C50588f11ed5
> 3456b03e008d9161f765c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637565146658376385%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =Es0FMDNzNEKgxvFiqe1kOo9aEPK6%2BOXrhI5aWs3QH9Q%3D&amp;reserved=
> 0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 23:14                   ` Joshi, Mukul
@ 2021-05-14  7:03                     ` Borislav Petkov
  2021-05-27 19:54                       ` Joshi, Mukul
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2021-05-14  7:03 UTC (permalink / raw)
  To: Joshi, Mukul; +Cc: Alex Deucher, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx

On Thu, May 13, 2021 at 11:14:30PM +0000, Joshi, Mukul wrote:
> Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us to use
> something else?

I still don't know why a separate priority is needed. Maybe this still
needs answering:

> It is a deferred interrupt that generates an MCE.

Is that the same deferred interrupt which calls
amd_deferred_error_interrupt() ?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-13 23:10           ` Joshi, Mukul
@ 2021-05-14  7:05             ` Borislav Petkov
  2021-05-14 13:06               ` Joshi, Mukul
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2021-05-14  7:05 UTC (permalink / raw)
  To: Joshi, Mukul; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

On Thu, May 13, 2021 at 11:10:34PM +0000, Joshi, Mukul wrote:
> That's probably not the best example to look at.

Oh, it is the *perfect* example but...

> smca_get_long_name() is used in drivers/edac/mce_amd.c and this file
> doesn't get compiled when CONFIG_X86_MCE_AMD is not defined.
>
> And amdgpu driver has no dependency on CONFIG_X86_MCE_AMD.

... maybe this will make you see it the right way: how much of the
amdgpu RAS functionality you're adding, is going to even function
without CONFIG_X86_MCE_AMD?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-14  7:05             ` Borislav Petkov
@ 2021-05-14 13:06               ` Joshi, Mukul
  2021-05-14 14:38                 ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-05-14 13:06 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Friday, May 14, 2021 3:06 AM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>; lkml <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Thu, May 13, 2021 at 11:10:34PM +0000, Joshi, Mukul wrote:
> > That's probably not the best example to look at.
> 
> Oh, it is the *perfect* example but...
> 

I would prefer not to get into an argument of if it is the perfect example or not.
I would prefer to get the job done and move forward.

> > smca_get_long_name() is used in drivers/edac/mce_amd.c and this file
> > doesn't get compiled when CONFIG_X86_MCE_AMD is not defined.
> >
> > And amdgpu driver has no dependency on CONFIG_X86_MCE_AMD.
> 
> ... maybe this will make you see it the right way: how much of the amdgpu RAS
> functionality you're adding, is going to even function without
> CONFIG_X86_MCE_AMD?
>

We have RAS functionality in other ASICs that is not dependent on CONFIG_X86_MCE_AMD.
So, I don't think we would want to do that just for one ASIC.
Maybe Alex D. has an opinion on it.

As I had mentioned in my previous comment, I am fine with wrapping around my code with
#ifdef CONFIG_X86_MCE_AMD and exporting smca_get_bank_type().
I hope we can move forward with this approach.

Thanks,
Mukul
 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C7fded9165e6
> 64dde943808d916a6b33f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637565727509040995%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =5hNyAEla4H%2BN0kMdCDs6iL0Dr8mVQJNxKA0UXJU7%2F8c%3D&amp;reserv
> ed=0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-14 13:06               ` Joshi, Mukul
@ 2021-05-14 14:38                 ` Borislav Petkov
  0 siblings, 0 replies; 20+ messages in thread
From: Borislav Petkov @ 2021-05-14 14:38 UTC (permalink / raw)
  To: Joshi, Mukul; +Cc: amd-gfx, Kasiviswanathan, Harish, x86-ml, lkml

On Fri, May 14, 2021 at 01:06:33PM +0000, Joshi, Mukul wrote:
> We have RAS functionality in other ASICs that is not dependent on
> CONFIG_X86_MCE_AMD. So, I don't think we would want to do that just
> for one ASIC.

Lemme try again: you said that those errors do get reported through a
deferred interrupt. Which is likely amd_deferred_error_interrupt().

If it is that interrupt and you don't have CONFIG_X86_MCE_AMD enabled,
then you won't get any errors reported and your RAS functionality will
simply sit there inactive.

So if that above is true - something to which I'm still not getting
an answer but maybe one fine day... - so if that above is true, your
RAS functionality *needs* CONFIG_X86_MCE_AMD to be enabled in order to
*actually* function.

So you *must* make your RAS functionality depend on CONFIG_X86_MCE_AMD
- otherwise no deferred interrupts and no errors reported. It is that
simple.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-14  7:03                     ` Borislav Petkov
@ 2021-05-27 19:54                       ` Joshi, Mukul
  2021-06-03 21:13                         ` Yazen Ghannam
  0 siblings, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-05-27 19:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Deucher, x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx,
	Ghannam, Yazen

[AMD Official Use Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Friday, May 14, 2021 3:04 AM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: Alex Deucher <alexdeucher@gmail.com>; x86-ml <x86@kernel.org>;
> Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>; lkml <linux-
> kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Thu, May 13, 2021 at 11:14:30PM +0000, Joshi, Mukul wrote:
> > Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us
> > to use something else?
> 
> I still don't know why a separate priority is needed. Maybe this still needs
> answering:
> 
> > It is a deferred interrupt that generates an MCE.
> 
> Is that the same deferred interrupt which calls
> amd_deferred_error_interrupt() ?

Sorry picking this up after sometime. I thought I had replied to this email.
Yes it is the same deferred interrupt which calls amd_deferred_error_interrupt().

Thanks,
Mukul

> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C7e04d0ad3c6
> 147a8dfd008d916a66bbc%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637565726326540654%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =AWHJvqIbtkGl6OPC86alAtnxsleq6UVe9mcM1MjxpUQ%3D&amp;reserved=0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-05-27 19:54                       ` Joshi, Mukul
@ 2021-06-03 21:13                         ` Yazen Ghannam
  2021-07-29 23:59                           ` Joshi, Mukul
  0 siblings, 1 reply; 20+ messages in thread
From: Yazen Ghannam @ 2021-06-03 21:13 UTC (permalink / raw)
  To: Joshi, Mukul
  Cc: Borislav Petkov, Alex Deucher, x86-ml, Kasiviswanathan, Harish,
	lkml, amd-gfx

On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
...
> > Is that the same deferred interrupt which calls
> > amd_deferred_error_interrupt() ?
> 
> Sorry picking this up after sometime. I thought I had replied to this email.
> Yes it is the same deferred interrupt which calls amd_deferred_error_interrupt().
>

Mukul,

Do you expect that the driver will need to mark pages with high
correctable error counts as bad? I think the hardware folks may want the
GPU memory errors to be handled more aggressively than CPU memory
errors. The specific threshold may change from product to product, so it
may make sense to hardcode this in the driver.

We have similar functionality in the Correctable Errors Collector. But
enterprise users may prefer a direct approach done in the driver (based
on the hardware experts' guidance) instead of configuring the kernel at
runtime.

So I think having a separate priority may make sense if some special
functionality, or combination of behaviors, is needed which don't fall
under any exisiting things. In this case, "special functionality" could
be that the GPU memory needs to be handled differently than CPU memory.

Another thing is that this behavior is similar to the NFIT behavior,
i.e. there's a memory error on an external device that needs to be
handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT to
be generic (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with
the same priority is okay, right?

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-06-03 21:13                         ` Yazen Ghannam
@ 2021-07-29 23:59                           ` Joshi, Mukul
  2021-09-13  1:31                             ` Joshi, Mukul
  0 siblings, 1 reply; 20+ messages in thread
From: Joshi, Mukul @ 2021-07-29 23:59 UTC (permalink / raw)
  To: Ghannam, Yazen
  Cc: Borislav Petkov, Alex Deucher, x86-ml, Kasiviswanathan, Harish,
	lkml, amd-gfx

[AMD Official Use Only]



> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Sent: Thursday, June 3, 2021 5:13 PM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: Borislav Petkov <bp@alien8.de>; Alex Deucher <alexdeucher@gmail.com>;
> x86-ml <x86@kernel.org>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; lkml <linux-kernel@vger.kernel.org>;
> amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> ...
> > > Is that the same deferred interrupt which calls
> > > amd_deferred_error_interrupt() ?
> >
> > Sorry picking this up after sometime. I thought I had replied to this email.
> > Yes it is the same deferred interrupt which calls
> amd_deferred_error_interrupt().
> >
> 
> Mukul,
> 
> Do you expect that the driver will need to mark pages with high correctable
> error counts as bad? I think the hardware folks may want the GPU memory
> errors to be handled more aggressively than CPU memory errors. The specific
> threshold may change from product to product, so it may make sense to
> hardcode this in the driver.
> 

Sorry I missed this email completely. Just saw it so responding now.

At the moment, we don't have a requirement to mark a page "bad" if there is a high correctable error counts. 
Our previous GPU ASICs which support RAS, also do not have such a feature.
But you make a good point. It might be worthwhile to go and ask the hardware folks about it.

> We have similar functionality in the Correctable Errors Collector. But enterprise
> users may prefer a direct approach done in the driver (based on the hardware
> experts' guidance) instead of configuring the kernel at runtime.
> 
> So I think having a separate priority may make sense if some special
> functionality, or combination of behaviors, is needed which don't fall under any
> exisiting things. In this case, "special functionality" could be that the GPU
> memory needs to be handled differently than CPU memory.
> 
> Another thing is that this behavior is similar to the NFIT behavior, i.e. there's a
> memory error on an external device that needs to be handled by the device's
> driver. So maybe we can rename MCE_PRIO_NFIT to be generic
> (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same priority
> is okay, right?
> 
With respect to MCE priority, I was thinking of using the MCE_PRIO_EDAC instead of creating a new priority as the code in the GPU driver is doing error detection and handling the uncorrectable errors.
Not sure if that aligns with the definition of EDAC device in the kernel.

What do you think?

Regards,
Mukul

> Thanks,
> Yazen

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
  2021-07-29 23:59                           ` Joshi, Mukul
@ 2021-09-13  1:31                             ` Joshi, Mukul
  0 siblings, 0 replies; 20+ messages in thread
From: Joshi, Mukul @ 2021-09-13  1:31 UTC (permalink / raw)
  To: Joshi, Mukul, Ghannam, Yazen
  Cc: x86-ml, Kasiviswanathan, Harish, lkml, amd-gfx, Borislav Petkov,
	Alex Deucher

[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Joshi,
> Mukul
> Sent: Thursday, July 29, 2021 8:00 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: x86-ml <x86@kernel.org>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; lkml <linux-kernel@vger.kernel.org>;
> amd-gfx@lists.freedesktop.org; Borislav Petkov <bp@alien8.de>; Alex Deucher
> <alexdeucher@gmail.com>
> Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> [AMD Official Use Only]
> 
> 
> 
> > -----Original Message-----
> > From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> > Sent: Thursday, June 3, 2021 5:13 PM
> > To: Joshi, Mukul <Mukul.Joshi@amd.com>
> > Cc: Borislav Petkov <bp@alien8.de>; Alex Deucher
> > <alexdeucher@gmail.com>; x86-ml <x86@kernel.org>; Kasiviswanathan,
> > Harish <Harish.Kasiviswanathan@amd.com>; lkml
> > <linux-kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> > ...
> > > > Is that the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt() ?
> > >
> > > Sorry picking this up after sometime. I thought I had replied to this email.
> > > Yes it is the same deferred interrupt which calls
> > amd_deferred_error_interrupt().
> > >
> >
> > Mukul,
> >
> > Do you expect that the driver will need to mark pages with high
> > correctable error counts as bad? I think the hardware folks may want
> > the GPU memory errors to be handled more aggressively than CPU memory
> > errors. The specific threshold may change from product to product, so
> > it may make sense to hardcode this in the driver.
> >
> 
> Sorry I missed this email completely. Just saw it so responding now.
> 
> At the moment, we don't have a requirement to mark a page "bad" if there is a
> high correctable error counts.
> Our previous GPU ASICs which support RAS, also do not have such a feature.
> But you make a good point. It might be worthwhile to go and ask the hardware
> folks about it.
> 
> > We have similar functionality in the Correctable Errors Collector. But
> > enterprise users may prefer a direct approach done in the driver
> > (based on the hardware experts' guidance) instead of configuring the kernel at
> runtime.
> >
> > So I think having a separate priority may make sense if some special
> > functionality, or combination of behaviors, is needed which don't fall
> > under any exisiting things. In this case, "special functionality"
> > could be that the GPU memory needs to be handled differently than CPU
> memory.
> >
> > Another thing is that this behavior is similar to the NFIT behavior,
> > i.e. there's a memory error on an external device that needs to be
> > handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT
> > to be generic
> > (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same
> > priority is okay, right?
> >
> With respect to MCE priority, I was thinking of using the MCE_PRIO_EDAC
> instead of creating a new priority as the code in the GPU driver is doing error
> detection and handling the uncorrectable errors.
> Not sure if that aligns with the definition of EDAC device in the kernel.
> 
> What do you think?
> 
> Regards,
> Mukul
> 

After talking to Yazen, MCE_PRIO_UC might be a better choice for the MCE priority as we are dealing
only with uncorrectable errors.
I will be sending out a v2 patch with changes to use the MCE_PRIO_UC and drop the MCE_PRIO_ACCEL and see what the feedback is.

Thanks,
Mukul

> > Thanks,
> > Yazen
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre
> edesktop.org%2Fmailman%2Flistinfo%2Famd-
> gfx&amp;data=04%7C01%7Cmukul.joshi%40amd.com%7C7d32897fddef448ab0
> aa08d952ecf41f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376
> 31999953383488%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJ
> QIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=YWZz9
> OYTMOhBl4183kV5ZYj01yw0xwNj%2BjTdXejFKH8%3D&amp;reserved=0

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-09-13  1:31 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210512013058.6827-1-mukul.joshi@amd.com>
2021-05-12  9:36 ` [PATCH] drm/amdgpu: Register bad page handler for Aldebaran Borislav Petkov
2021-05-12 19:00   ` Joshi, Mukul
2021-05-12 21:05     ` Borislav Petkov
2021-05-13  3:20       ` Joshi, Mukul
2021-05-13  9:53         ` Borislav Petkov
2021-05-13 14:17           ` Alex Deucher
2021-05-13 14:30             ` Borislav Petkov
2021-05-13 14:32               ` Alex Deucher
2021-05-13 14:57                 ` Borislav Petkov
2021-05-13 15:02                   ` Alex Deucher
2021-05-13 23:14                   ` Joshi, Mukul
2021-05-14  7:03                     ` Borislav Petkov
2021-05-27 19:54                       ` Joshi, Mukul
2021-06-03 21:13                         ` Yazen Ghannam
2021-07-29 23:59                           ` Joshi, Mukul
2021-09-13  1:31                             ` Joshi, Mukul
2021-05-13 23:10           ` Joshi, Mukul
2021-05-14  7:05             ` Borislav Petkov
2021-05-14 13:06               ` Joshi, Mukul
2021-05-14 14:38                 ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).