linux-mm.kvack.org archive mirror
* [PATCH] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
@ 2025-09-04 15:57 Tony Luck
  2025-09-04 17:25 ` Mike Rapoport
  2025-09-04 18:16 ` Liam R. Howlett
  0 siblings, 2 replies; 20+ messages in thread
From: Tony Luck @ 2025-09-04 15:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: surenb, Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, liam.howlett, jiaqiyan, jane.chu,
	david, bp, Meyer, Kyle, akpm, linux-mm, vbabka, linux-acpi,
	Tony Luck, Shawn Fan

BIOS can supply a GHES error record that reports that the corrected
error threshold has been exceeded. Linux will attempt to soft offline
the page in response.

But "exceeded threshold" has many interpretations. Some BIOS versions
accumulate error counts per-rank, and then report threshold exceeded
when the number of errors crosses a threshold for the rank. Taking
a page offline in this case is unlikely to solve any problems. But
losing a 4KB page will have little impact on the overall system.

On the other hand, taking a huge page offline will have significant
impact (and still not solve any problems).

Check if the GHES record refers to a huge page. Skip the offline
process if the page is huge.

Reported-by: Shawn Fan <shawn.fan@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/acpi/apei/ghes.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..bacfebdd4969 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -540,8 +540,16 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 
 	/* iff following two events can be handled properly by now */
 	if (sec_sev == GHES_SEV_CORRECTED &&
-	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
+	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
+		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
+		struct page *page = pfn_to_page(pfn);
+		struct folio *folio = page_folio(page);
+
+		if (folio_test_hugetlb(folio))
+			return false;
+
 		flags = MF_SOFT_OFFLINE;
+	}
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
 		flags = sync ? MF_ACTION_REQUIRED : 0;
 
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-04 15:57 [PATCH] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked Tony Luck
@ 2025-09-04 17:25 ` Mike Rapoport
  2025-09-04 18:16 ` Liam R. Howlett
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-09-04 17:25 UTC (permalink / raw)
  To: Tony Luck
  Cc: Rafael J. Wysocki, surenb, Anderson, Russ, osalvador,
	nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe, liam.howlett,
	jiaqiyan, jane.chu, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Shawn Fan

On Thu, Sep 04, 2025 at 08:57:20AM -0700, Tony Luck wrote:
> BIOS can supply a GHES error record that reports that the corrected
> error threshold has been exceeded. Linux will attempt to soft offline
> the page in response.
> 
> But "exceeded threshold" has many interpretations. Some BIOS versions
> accumulate error counts per-rank, and then report threshold exceeded
> when the number of errors crosses a threshold for the rank. Taking
> a page offline in this case is unlikely to solve any problems. But
> losing a 4KB page will have little impact on the overall system.
> 
> On the other hand, taking a huge page offline will have significant
> impact (and still not solve any problems).
> 
> Check if the GHES record refers to a huge page. Skip the offline
> process if the page is huge.
> 
> Reported-by: Shawn Fan <shawn.fan@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  drivers/acpi/apei/ghes.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3..bacfebdd4969 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -540,8 +540,16 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  
>  	/* iff following two events can be handled properly by now */
>  	if (sec_sev == GHES_SEV_CORRECTED &&
> -	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> +	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> +		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> +		struct page *page = pfn_to_page(pfn);
> +		struct folio *folio = page_folio(page);

There's pfn_folio(), saves a line :)

> +
> +		if (folio_test_hugetlb(folio))
> +			return false;
> +
>  		flags = MF_SOFT_OFFLINE;
> +	}
>  	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>  		flags = sync ? MF_ACTION_REQUIRED : 0;
>  
> -- 
> 2.51.0
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-04 15:57 [PATCH] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked Tony Luck
  2025-09-04 17:25 ` Mike Rapoport
@ 2025-09-04 18:16 ` Liam R. Howlett
  2025-09-05 15:53   ` [PATCH v2] " Luck, Tony
  1 sibling, 1 reply; 20+ messages in thread
From: Liam R. Howlett @ 2025-09-04 18:16 UTC (permalink / raw)
  To: Tony Luck
  Cc: Rafael J. Wysocki, surenb, Anderson, Russ, rppt, osalvador,
	nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe, jiaqiyan,
	jane.chu, david, bp, Meyer, Kyle, akpm, linux-mm, vbabka,
	linux-acpi, Shawn Fan

* Tony Luck <tony.luck@intel.com> [250904 11:57]:
> BIOS can supply a GHES error record that reports that the corrected
> error threshold has been exceeded. Linux will attempt to soft offline
> the page in response.
> 
> But "exceeded threshold" has many interpretations. Some BIOS versions
> accumulate error counts per-rank, and then report threshold exceeded
> when the number of errors crosses a threshold for the rank. Taking
> a page offline in this case is unlikely to solve any problems. But
> losing a 4KB page will have little impact on the overall system.
> 
> On the other hand, taking a huge page offline will have significant
> impact (and still not solve any problems).
> 
> Check if the GHES record refers to a huge page. Skip the offline
> process if the page is huge.
> 
> Reported-by: Shawn Fan <shawn.fan@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  drivers/acpi/apei/ghes.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3..bacfebdd4969 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -540,8 +540,16 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  
>  	/* iff following two events can be handled properly by now */
>  	if (sec_sev == GHES_SEV_CORRECTED &&
> -	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> +	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> +		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> +		struct page *page = pfn_to_page(pfn);
> +		struct folio *folio = page_folio(page);
> +
> +		if (folio_test_hugetlb(folio))
> +			return false;
> +
>  		flags = MF_SOFT_OFFLINE;
> +	}
>  	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)

It is a bit odd that you return false instead of not setting the flags
and continuing the checks.

The control flow may cause issues if other checks are added, and only
when a huge page is hit, which will be difficult to debug.

Looking at the code, it seems fine now, but when we don't return false
(as you have added) it looks impossible to reach the line below anyway,
so should it be an else if?  I guess this is not strictly an issue with
your patch.

And if it doesn't solve anything anyway, why aren't we just skipping it
altogether and maybe logging it?

>  		flags = sync ? MF_ACTION_REQUIRED : 0;
>  
> -- 
> 2.51.0
> 



* [PATCH v2] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-04 18:16 ` Liam R. Howlett
@ 2025-09-05 15:53   ` Luck, Tony
  2025-09-05 16:25     ` Liam R. Howlett
  0 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2025-09-05 15:53 UTC (permalink / raw)
  To: Liam R. Howlett, Rafael J. Wysocki, surenb, Anderson, Russ, rppt,
	osalvador, nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe,
	jiaqiyan, jane.chu, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Shawn Fan

BIOS can supply a GHES error record that reports that the corrected
error threshold has been exceeded. Linux will attempt to soft offline
the page in response.

But "exceeded threshold" has many interpretations. Some BIOS versions
accumulate error counts per-rank, and then report threshold exceeded
when the number of errors crosses a threshold for the rank. Taking
a page offline in this case is unlikely to solve any problems. But
losing a 4KB page will have little impact on the overall system.

On the other hand, taking a huge page offline will have significant
impact (and still not solve any problems).

Check if the GHES record refers to a huge page. Skip the offline
process if the page is huge.

Reported-by: Shawn Fan <shawn.fan@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Changes since v1:

* Mike Rapoport: Save a line by using pfn_folio()

* Liam R. Howlett: Don't return false. Continue to rest of function.
  [Yes, code could use "else if" since "sec_sev" can't be both GHES_SEV_CORRECTED
   and GHES_SEV_RECOVERABLE. But I left that alone for now as unrelated to
   this change]

 drivers/acpi/apei/ghes.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..92a767fa7ca4 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -540,8 +540,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 
 	/* iff following two events can be handled properly by now */
 	if (sec_sev == GHES_SEV_CORRECTED &&
-	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
-		flags = MF_SOFT_OFFLINE;
+	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
+		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
+		struct folio *folio = pfn_folio(pfn);
+
+		/* Only try to offline non-huge pages */
+		if (!folio_test_hugetlb(folio))
+			flags = MF_SOFT_OFFLINE;
+	}
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
 		flags = sync ? MF_ACTION_REQUIRED : 0;
 
-- 
2.51.0




* Re: [PATCH v2] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 15:53   ` [PATCH v2] " Luck, Tony
@ 2025-09-05 16:25     ` Liam R. Howlett
  2025-09-05 18:17       ` PATCH v3 " Luck, Tony
  0 siblings, 1 reply; 20+ messages in thread
From: Liam R. Howlett @ 2025-09-05 16:25 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Rafael J. Wysocki, surenb, Anderson, Russ, rppt, osalvador,
	nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe, jiaqiyan,
	jane.chu, david, bp, Meyer, Kyle, akpm, linux-mm, vbabka,
	linux-acpi, Shawn Fan

* Luck, Tony <tony.luck@intel.com> [250905 11:53]:
> BIOS can supply a GHES error record that reports that the corrected
> error threshold has been exceeded. Linux will attempt to soft offline
> the page in response.
> 
> But "exceeded threshold" has many interpretations. Some BIOS versions
> accumulate error counts per-rank, and then report threshold exceeded
> when the number of errors crosses a threshold for the rank. Taking
> a page offline in this case is unlikely to solve any problems. But
> losing a 4KB page will have little impact on the overall system.
> 
> On the other hand, taking a huge page offline will have significant
> impact (and still not solve any problems).
> 
> Check if the GHES record refers to a huge page. Skip the offline
> process if the page is huge.
> 
> Reported-by: Shawn Fan <shawn.fan@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
> 
> Changes since v1:
> 
> * Mike Rapoport: Save a line by using pfn_folio()
> 
> * Liam R. Howlett: Don't return false. Continue to rest of function.
>   [Yes, code could use "else if" since "sec_sev" can't be both GHES_SEV_CORRECTED
>    and GHES_SEV_RECOVERABLE. But I left that alone for now as unrelated to
>    this change]
> 
>  drivers/acpi/apei/ghes.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3..92a767fa7ca4 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -540,8 +540,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  
>  	/* iff following two events can be handled properly by now */
>  	if (sec_sev == GHES_SEV_CORRECTED &&
> -	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> -		flags = MF_SOFT_OFFLINE;
> +	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> +		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> +		struct folio *folio = pfn_folio(pfn);
> +
> +		/* Only try to offline non-huge pages */
> +		if (!folio_test_hugetlb(folio))
> +			flags = MF_SOFT_OFFLINE;
> +	}
>  	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>  		flags = sync ? MF_ACTION_REQUIRED : 0;
>  
> -- 
> 2.51.0
> 



* PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 16:25     ` Liam R. Howlett
@ 2025-09-05 18:17       ` Luck, Tony
  2025-09-05 19:39         ` jane.chu
  0 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2025-09-05 18:17 UTC (permalink / raw)
  To: Liam R. Howlett, Rafael J. Wysocki, surenb, Anderson, Russ, rppt,
	osalvador, nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe,
	jiaqiyan, jane.chu, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Shawn Fan

BIOS can supply a GHES error record that reports that the corrected
error threshold has been exceeded. Linux will attempt to soft offline
the page in response.

But "exceeded threshold" has many interpretations. Some BIOS versions
accumulate error counts per-rank, and then report threshold exceeded
when the number of errors crosses a threshold for the rank. Taking
a page offline in this case is unlikely to solve any problems. But
losing a 4KB page will have little impact on the overall system.

On the other hand, taking a huge page offline will have significant
impact (and still not solve any problems).

Check if the GHES record refers to a huge page. Skip the offline
process if the page is huge.

Reported-by: Shawn Fan <shawn.fan@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Change since v2:

Me: Add sanity check on the address (pfn) that BIOS provided. It might
be in some reserved area that doesn't have a "struct page" which would
likely result in an oops if fed to pfn_folio().

The original code relied on sanity check of the pfn received from the
BIOS when this eventually feeds into memory_failure(). That used to
result in:
	pr_err("%#lx: memory outside kernel control\n", pfn);
which won't happen with this change, since memory_failure is not
called. Was that a useful message? A Google search mostly shows
references to the code. There are few instances of people reporting
they saw this message.


 drivers/acpi/apei/ghes.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..c2fc1196438c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 
 	/* iff following two events can be handled properly by now */
 	if (sec_sev == GHES_SEV_CORRECTED &&
-	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
-		flags = MF_SOFT_OFFLINE;
+	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
+		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
+
+		if (pfn_valid(pfn)) {
+			struct folio *folio = pfn_folio(pfn);
+
+			/* Only try to offline non-huge pages */
+			if (!folio_test_hugetlb(folio))
+				flags = MF_SOFT_OFFLINE;
+		}
+	}
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
 		flags = sync ? MF_ACTION_REQUIRED : 0;
 
-- 
2.51.0
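[Editor's note: the decision logic in the v3 hunk reduces to a small pure
function. The sketch below uses mocked enums and flag values, not the
kernel's headers, so the fall-through behavior can be exercised in
userspace; the names and values are invented stand-ins.]

```c
#include <stdbool.h>

/* Mocked stand-ins for the kernel's severity values and flags
 * (illustrative only; the real ones live in the GHES/CPER headers). */
enum sev { SEV_CORRECTED, SEV_RECOVERABLE };
#define THRESHOLD_EXCEEDED  0x1   /* CPER_SEC_ERROR_THRESHOLD_EXCEEDED */
#define MF_SOFT_OFFLINE     0x2
#define MF_ACTION_REQUIRED  0x4

/*
 * Mirror the v3 control flow: a corrected, threshold-exceeded error is
 * soft-offlined only when the pfn is valid and is not part of a hugetlb
 * folio; otherwise execution falls through to the recoverable check.
 * A return of -1 stands for "no memory_failure() call queued".
 */
static int decide_flags(enum sev sev, enum sev sec_sev,
                        unsigned int gdata_flags, bool sync,
                        bool pfn_is_valid, bool is_hugetlb)
{
    int flags = -1;

    if (sec_sev == SEV_CORRECTED && (gdata_flags & THRESHOLD_EXCEEDED)) {
        /* Huge or invalid pfn: leave flags untouched, keep going. */
        if (pfn_is_valid && !is_hugetlb)
            flags = MF_SOFT_OFFLINE;
    }
    if (sev == SEV_RECOVERABLE && sec_sev == SEV_RECOVERABLE)
        flags = sync ? MF_ACTION_REQUIRED : 0;

    return flags;
}
```

Note that a huge or invalid pfn leaves flags untouched rather than
returning early, which is what distinguishes v2/v3 from the v1 patch.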




* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 18:17       ` PATCH v3 " Luck, Tony
@ 2025-09-05 19:39         ` jane.chu
  2025-09-05 19:58           ` Luck, Tony
  2025-09-05 19:59           ` Jiaqi Yan
  0 siblings, 2 replies; 20+ messages in thread
From: jane.chu @ 2025-09-05 19:39 UTC (permalink / raw)
  To: Luck, Tony, Liam R. Howlett, Rafael J. Wysocki, surenb, Anderson,
	Russ, rppt, osalvador, nao.horiguchi, mhocko, lorenzo.stoakes,
	linmiaohe, jiaqiyan, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Shawn Fan


On 9/5/2025 11:17 AM, Luck, Tony wrote:
> BIOS can supply a GHES error record that reports that the corrected
> error threshold has been exceeded. Linux will attempt to soft offline
> the page in response.
> 
> But "exceeded threshold" has many interpretations. Some BIOS versions
> accumulate error counts per-rank, and then report threshold exceeded
> when the number of errors crosses a threshold for the rank. Taking
> a page offline in this case is unlikely to solve any problems. But
> losing a 4KB page will have little impact on the overall system.
> 
> On the other hand, taking a huge page offline will have significant
> impact (and still not solve any problems).
> 
> Check if the GHES record refers to a huge page. Skip the offline
> process if the page is huge.
> 
> Reported-by: Shawn Fan <shawn.fan@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> Change since v2:
> 
> Me: Add sanity check on the address (pfn) that BIOS provided. It might
> be in some reserved area that doesn't have a "struct page" which would
> likely result in an oops if fed to pfn_folio().
> 
> The original code relied on sanity check of the pfn received from the
> BIOS when this eventually feeds into memory_failure(). That used to
> result in:
> 	pr_err("%#lx: memory outside kernel control\n", pfn);
> which won't happen with this change, since memory_failure is not
> called. Was that a useful message? A Google search mostly shows
> references to the code. There are few instances of people reporting
> they saw this message.
> 
> 
>   drivers/acpi/apei/ghes.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3..c2fc1196438c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>   
>   	/* iff following two events can be handled properly by now */
>   	if (sec_sev == GHES_SEV_CORRECTED &&
> -	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> -		flags = MF_SOFT_OFFLINE;
> +	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> +		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> +
> +		if (pfn_valid(pfn)) {
> +			struct folio *folio = pfn_folio(pfn);
> +
> +			/* Only try to offline non-huge pages */
> +			if (!folio_test_hugetlb(folio))
> +				flags = MF_SOFT_OFFLINE;
> +		}
> +	}
>   	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>   		flags = sync ? MF_ACTION_REQUIRED : 0;
>   

So the issue is the result of an inaccurate MCA record about a per-rank
CE threshold being crossed. If the OS offlines the indicted page, it
might be signaled to offline another 4K page in the same rank upon access.

Both the MCA and the offline-op are performance hitters, and as argued by
this patch, offlining doesn't help beyond losing an already corrected page.

Here we choose to bypass hugetlb page simply because it's huge.  Is it 
possible to argue that because the page is huge, it's less likely to get 
another MCA on another page from the same rank?

A while back this patch
56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
has provided userspace control over whether to soft offline, could it be 
a more preferable option?

I don't know, the patch itself is fine, it's the issue that it has 
exposed that is more concerning.

thanks,
-jane
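[Editor's note: the knob added by 56374430c5dfc mentioned above is the
vm.enable_soft_offline sysctl. A sketch of how an administrator would use
it; the commands assume a kernel carrying that commit and require root —
verify the path on the target system.]

```shell
# Check whether the kernel currently honors soft-offline requests
# (1 = enabled, the default on kernels with 56374430c5dfc).
cat /proc/sys/vm/enable_soft_offline

# Disable soft-offline globally, so corrected-error threshold reports
# no longer cause pages to be taken offline.
sysctl -w vm.enable_soft_offline=0
```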







* RE: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 19:39         ` jane.chu
@ 2025-09-05 19:58           ` Luck, Tony
  2025-09-05 20:14             ` jane.chu
  2025-09-05 19:59           ` Jiaqi Yan
  1 sibling, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2025-09-05 19:58 UTC (permalink / raw)
  To: jane.chu, Liam R. Howlett, Wysocki, Rafael J, surenb, Anderson,
	Russ, rppt, osalvador, nao.horiguchi, mhocko, lorenzo.stoakes,
	linmiaohe, jiaqiyan, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Fan, Shawn

> So the issue is the result of inaccurate MCA record about per rank CE
> threshold being crossed. If OS offline the indicted page, it might be
> signaled to offline another 4K page in the same rank upon access.

It appears that the BIOS that resulted in this report sensibly treats crossing
the rank error threshold as needing a one-time report via GHES.

> Both MCA and offline-op are performance hitter, and as argued by this
> patch, offline doesn't help except loosing a already corrected page.
>
> Here we choose to bypass hugetlb page simply because it's huge.  Is it
> possible to argue that because the page is huge, it's less likely to get
> another MCA on another page from the same rank?

If there really is a problem with the rank, it likely affects most pages (or
at least most pages on the same NUMA node) because memory access
is (usually) interleaved between channels, and accesses within a 4K page
may hash to different ranks within a channel.
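[Editor's note: the interleaving point above can be illustrated with a toy
address decoder. The granularities and hash bits below are invented for
illustration; real decoders are platform-specific and undocumented.]

```c
#include <stdint.h>

/* Toy address decode: 2-way channel interleave at 256B granularity, and a
 * 2-rank XOR hash mixing sub-page address bits. The bit choices are made
 * up; they only show why consecutive lines of one 4K page can land on
 * different channels, and on different ranks within a channel. */
static unsigned int channel_of(uint64_t pa)
{
    return (pa >> 8) & 1;      /* bit 8 selects the channel */
}

static unsigned int rank_of(uint64_t pa)
{
    return ((pa >> 10) ^ (pa >> 6)) & 1;  /* XOR of two sub-page bits */
}
```

With this decode, offsets 0x000 and 0x100 of the same 4K page sit on
different channels, and 0x000 and 0x400 hash to different ranks.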

> A while back this patch
> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
> has provided userspace control over whether to soft offline, could it be
> a more preferable option?

Thanks for pointing that one out. I'll feed that back to the original reporter
and see if it is an acceptable solution.

> I don't know, the patch itself is fine, it's the issue that it has
> exposed that is more concerning.

Agreed. The root problem is the BIOS using this threshold reporting
mechanism, without there being a way for the OS to determine the
scope of memory affected by the threshold.

When this was originally implemented, the expectation was that the
scope would be a 4K page.

-Tony


* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 19:39         ` jane.chu
  2025-09-05 19:58           ` Luck, Tony
@ 2025-09-05 19:59           ` Jiaqi Yan
  2025-09-08 19:14             ` Kyle Meyer
  1 sibling, 1 reply; 20+ messages in thread
From: Jiaqi Yan @ 2025-09-05 19:59 UTC (permalink / raw)
  To: jane.chu
  Cc: Luck, Tony, Liam R. Howlett, Rafael J. Wysocki, surenb, Anderson,
	Russ, rppt, osalvador, nao.horiguchi, mhocko, lorenzo.stoakes,
	linmiaohe, david, bp, Meyer, Kyle, akpm, linux-mm, vbabka,
	linux-acpi, Shawn Fan

On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@oracle.com> wrote:
>
>
> On 9/5/2025 11:17 AM, Luck, Tony wrote:
> > BIOS can supply a GHES error record that reports that the corrected
> > error threshold has been exceeded. Linux will attempt to soft offline
> > the page in response.
> >
> > But "exceeded threshold" has many interpretations. Some BIOS versions
> > accumulate error counts per-rank, and then report threshold exceeded
> > when the number of errors crosses a threshold for the rank. Taking
> > a page offline in this case is unlikely to solve any problems. But
> > losing a 4KB page will have little impact on the overall system.

Hi Tony,

This is exactly the problem I encountered [1], and I agree with Jane
that disabling soft offline via /proc/sys/vm/enable_soft_offline
should work for your case.

[1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86

> >
> > On the other hand, taking a huge page offline will have significant
> > impact (and still not solve any problems).
> >
> > Check if the GHES record refers to a huge page. Skip the offline
> > process if the page is huge.
> >
> > Reported-by: Shawn Fan <shawn.fan@intel.com>
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >
> > Change since v2:
> >
> > Me: Add sanity check on the address (pfn) that BIOS provided. It might
> > be in some reserved area that doesn't have a "struct page" which would
> > likely result in an oops if fed to pfn_folio().
> >
> > The original code relied on sanity check of the pfn received from the
> > BIOS when this eventually feeds into memory_failure(). That used to
> > result in:
> >       pr_err("%#lx: memory outside kernel control\n", pfn);
> > which won't happen with this change, since memory_failure is not
> > called. Was that a useful message? A Google search mostly shows
> > references to the code. There are few instances of people reporting
> > they saw this message.
> >
> >
> >   drivers/acpi/apei/ghes.c | 13 +++++++++++--
> >   1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index a0d54993edb3..c2fc1196438c 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> >
> >       /* iff following two events can be handled properly by now */
> >       if (sec_sev == GHES_SEV_CORRECTED &&
> > -         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> > -             flags = MF_SOFT_OFFLINE;
> > +         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> > +             unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> > +
> > +             if (pfn_valid(pfn)) {
> > +                     struct folio *folio = pfn_folio(pfn);
> > +
> > +                     /* Only try to offline non-huge pages */
> > +                     if (!folio_test_hugetlb(folio))
> > +                             flags = MF_SOFT_OFFLINE;
> > +             }
> > +     }
> >       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> >               flags = sync ? MF_ACTION_REQUIRED : 0;
> >
>
> So the issue is the result of an inaccurate MCA record about a per-rank
> CE threshold being crossed. If the OS offlines the indicted page, it
> might be signaled to offline another 4K page in the same rank upon access.
> 
> Both the MCA and the offline-op are performance hitters, and as argued by
> this patch, offlining doesn't help beyond losing an already corrected page.
>
> Here we choose to bypass hugetlb page simply because it's huge.  Is it
> possible to argue that because the page is huge, it's less likely to get
> another MCA on another page from the same rank?
>
> A while back this patch
> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
> has provided userspace control over whether to soft offline, could it be
> a more preferable option?
>
> I don't know, the patch itself is fine, it's the issue that it has
> exposed that is more concerning.
>
> thanks,
> -jane
>
>
>
>



* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 19:58           ` Luck, Tony
@ 2025-09-05 20:14             ` jane.chu
  2025-09-05 20:36               ` Luck, Tony
  0 siblings, 1 reply; 20+ messages in thread
From: jane.chu @ 2025-09-05 20:14 UTC (permalink / raw)
  To: Luck, Tony, Liam R. Howlett, Wysocki, Rafael J, surenb, Anderson,
	Russ, rppt, osalvador, nao.horiguchi, mhocko, lorenzo.stoakes,
	linmiaohe, jiaqiyan, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Fan, Shawn


On 9/5/2025 12:58 PM, Luck, Tony wrote:
>> So the issue is the result of an inaccurate MCA record about a per-rank
>> CE threshold being crossed. If the OS offlines the indicted page, it
>> might be signaled to offline another 4K page in the same rank upon access.
> 
> It appears that the BIOS that resulted in this report sensibly treats crossing
> the rank error threshold as needing a one-time report via GHES.
> 
>> Both the MCA and the offline-op are performance hitters, and as argued by
>> this patch, offlining doesn't help beyond losing an already corrected page.
>>
>> Here we choose to bypass hugetlb page simply because it's huge.  Is it
>> possible to argue that because the page is huge, it's less likely to get
>> another MCA on another page from the same rank?
> 
> If there really is a problem with the rank, it likely affects most pages (or
> at least most pages on the same NUMA node) because memory access
> is (usually) interleaved between channels, and accesses within a 4K page
> may hash to different ranks within a channel.
> 
>> A while back this patch
>> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
>> has provided userspace control over whether to soft offline, could it be
>> a more preferable option?
> 
> Thanks for pointing that one out. I'll feed that back to the original reporter
> and see if it is an acceptable solution.
> 
>> I don't know, the patch itself is fine, it's the issue that it has
>> exposed that is more concerning.
> 
> Agreed. The root problem is the BIOS using this threshold reporting
> mechanism, without there being a way for the OS to determine the
> scope of memory affected by the threshold.
> 
> When this was originally implemented, the expectation was that the
> scope would be a 4K page.

Thanks!

BTW, forgot to ask another question.
	ghes_do_proc
	  bool sync = is_hest_sync_notify(ghes);
	  [..]
		queued = ghes_handle_memory_failure(gdata, sev, sync);
	  [..]
	  if (sync && !queued) {
	      force_sig(SIGBUS);
The question is, in the CE MCE case, 'sync' is never 'true' by design, 
correct?

thanks,
-jane




> 
> -Tony




* RE: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 20:14             ` jane.chu
@ 2025-09-05 20:36               ` Luck, Tony
  0 siblings, 0 replies; 20+ messages in thread
From: Luck, Tony @ 2025-09-05 20:36 UTC (permalink / raw)
  To: jane.chu, Liam R. Howlett, Wysocki, Rafael J, surenb, Anderson,
	Russ, rppt, osalvador, nao.horiguchi, mhocko, lorenzo.stoakes,
	linmiaohe, jiaqiyan, david, bp, Meyer, Kyle, akpm, linux-mm,
	vbabka, linux-acpi, Fan, Shawn

> BTW, forgot to ask another question.
>       ghes_do_proc
>         bool sync = is_hest_sync_notify(ghes);
>         [..]
>               queued = ghes_handle_memory_failure(gdata, sev, sync);
>         [..]
>         if (sync && !queued) {
>             force_sig(SIGBUS);
> The question is, in the CE MCE case, 'sync' is never 'true' by design,
> correct?

I'd call it a BIOS bug if sync (ghes->generic->notify.type == ACPI_HEST_NOTIFY_SEA)
were set for a corrected error notification.

But this seems to be an ARM thing. All these notifications on x86 are async.

-Tony
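[Editor's note: the ghes_do_proc() fragment quoted above boils down to one
predicate. A mocked sketch, not the kernel function, of when a SIGBUS
would be forced.]

```c
#include <stdbool.h>

/* Sketch of the tail of ghes_do_proc(): SIGBUS is forced only when the
 * notification was synchronous (e.g. ARM SEA) and the error was not
 * queued for memory_failure() handling. On x86, where these GHES
 * notifications are asynchronous, sync is false and no signal is sent. */
static bool must_force_sigbus(bool sync, bool queued)
{
    return sync && !queued;
}
```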


* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-05 19:59           ` Jiaqi Yan
@ 2025-09-08 19:14             ` Kyle Meyer
  2025-09-08 20:01               ` Luck, Tony
  2025-09-18  3:39               ` Shuai Xue
  0 siblings, 2 replies; 20+ messages in thread
From: Kyle Meyer @ 2025-09-08 19:14 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: jane.chu, Luck, Tony, Liam R. Howlett, Rafael J. Wysocki, surenb,
	Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, david, bp, akpm, linux-mm, vbabka,
	linux-acpi, Shawn Fan

On Fri, Sep 05, 2025 at 12:59:00PM -0700, Jiaqi Yan wrote:
> On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@oracle.com> wrote:
> >
> >
> > On 9/5/2025 11:17 AM, Luck, Tony wrote:
> > > BIOS can supply a GHES error record that reports that the corrected
> > > error threshold has been exceeded. Linux will attempt to soft offline
> > > the page in response.
> > >
> > > But "exceeded threshold" has many interpretations. Some BIOS versions
> > > accumulate error counts per-rank, and then report threshold exceeded
> > > when the number of errors crosses a threshold for the rank. Taking
> > > a page offline in this case is unlikely to solve any problems. But
> > > losing a 4KB page will have little impact on the overall system.
> 
> Hi Tony,
> 
> This is exactly the problem I encountered [1], and I agree with Jane
> that disabling soft offline via /proc/sys/vm/enable_soft_offline
> should work for your case.
> 
> [1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86 

If that doesn't work for your case, I just want to mention that hugepages might
still be soft offlined with that check in ghes_handle_memory_failure().

> > >
> > > On the other hand, taking a huge page offline will have significant
> > > impact (and still not solve any problems).
> > >
> > > Check if the GHES record refers to a huge page. Skip the offline
> > > process if the page is huge.

AFAICT, we're still notifying the MCE decoder chain and CEC will soft offline
the hugepage once the "action threshold" is reached.

This could be moved to soft_offline_page(). That would prevent other sources
(/sys/devices/system/memory/soft_offline_page, CEC, etc.) from being able to
soft offline hugepages, not just GHES.

> > > Reported-by: Shawn Fan <shawn.fan@intel.com>
> > > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > > ---
> > >
> > > Change since v2:
> > >
> > > Me: Add sanity check on the address (pfn) that BIOS provided. It might
> > > be in some reserved area that doesn't have a "struct page" which would
> > > likely result in an OOPs if fed to pfn_folio().
> > >
> > > The original code relied on sanity check of the pfn received from the
> > > BIOS when this eventually feeds into memory_failure(). That used to
> > > result in:
> > >       pr_err("%#lx: memory outside kernel control\n", pfn);
> > > which won't happen with this change, since memory_failure is not
> > > called. Was that a useful message? A Google search mostly shows
> > > references to the code. There are few instances of people reporting
> > > they saw this message.
> > >
> > >
> > >   drivers/acpi/apei/ghes.c | 13 +++++++++++--
> > >   1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > index a0d54993edb3..c2fc1196438c 100644
> > > --- a/drivers/acpi/apei/ghes.c
> > > +++ b/drivers/acpi/apei/ghes.c
> > > @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> > >
> > >       /* iff following two events can be handled properly by now */
> > >       if (sec_sev == GHES_SEV_CORRECTED &&
> > > -         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> > > -             flags = MF_SOFT_OFFLINE;
> > > +         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> > > +             unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> > > +
> > > +             if (pfn_valid(pfn)) {
> > > +                     struct folio *folio = pfn_folio(pfn);
> > > +
> > > +                     /* Only try to offline non-huge pages */
> > > +                     if (!folio_test_hugetlb(folio))
> > > +                             flags = MF_SOFT_OFFLINE;
> > > +             }
> > > +     }
> > >       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> > >               flags = sync ? MF_ACTION_REQUIRED : 0;
> > >
> >
> > So the issue is the result of an inaccurate MCA record about a per-rank CE
> > threshold being crossed. If the OS offlines the indicted page, it might be
> > signaled to offline another 4K page in the same rank upon access.
> >
> > Both MCA and offline-op are performance hitters, and as argued by this
> > patch, offline doesn't help except losing an already corrected page.
> >
> > Here we choose to bypass hugetlb page simply because it's huge.  Is it
> > possible to argue that because the page is huge, it's less likely to get
> > another MCA on another page from the same rank?
> >
> > A while back this patch
> > 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
> > has provided userspace control over whether to soft offline, could it be
> > a more preferable option?

Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:

0: Soft offline is disabled.
1: Soft offline is enabled for normal pages (skip hugepages).
2: Soft offline is enabled for normal pages and hugepages.
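
To illustrate, a quick shell sketch of the proposed decision table
(hypothetical: the current upstream knob only accepts 0 and 1, and this
only models the semantics, it does not touch the real sysctl):

```shell
# Hedged sketch of the proposed tri-state /proc/sys/vm/enable_soft_offline
# values. Maps a value to the soft-offline policy it would select.
decide() {
	case "$1" in
		0) echo "soft offline disabled" ;;
		1) echo "offline normal pages, skip hugepages" ;;
		2) echo "offline normal pages and hugepages" ;;
		*) echo "invalid" ;;
	esac
}

decide 1
```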

Maybe something like...

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc30ca4804bf..efa535d405a8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -64,11 +64,17 @@
 #include "internal.h"
 #include "ras/ras_event.h"
 
+enum soft_offline {
+	SOFT_OFFLINE_DISABLED = 0,
+	SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES,
+	SOFT_OFFLINE_ENABLED
+};
+
 static int sysctl_memory_failure_early_kill __read_mostly;
 
 static int sysctl_memory_failure_recovery __read_mostly = 1;
 
-static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_enable_soft_offline __read_mostly = SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
@@ -150,7 +156,7 @@ static const struct ctl_table memory_failure_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= SYSCTL_TWO,
 	}
 };
 
@@ -2799,12 +2805,20 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 	}
 
-	if (!sysctl_enable_soft_offline) {
-		pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
+	if (sysctl_enable_soft_offline == SOFT_OFFLINE_DISABLED) {
+		pr_info("disabled by /proc/sys/vm/enable_soft_offline\n");
 		put_ref_page(pfn, flags);
 		return -EOPNOTSUPP;
 	}
 
+	if (sysctl_enable_soft_offline == SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES) {
+		if (folio_test_hugetlb(pfn_folio(pfn))) {
+			pr_info("disabled by /proc/sys/vm/enable_soft_offline\n");
+			put_ref_page(pfn, flags);
+			return -EOPNOTSUPP;
+		}
+	}
+
 	mutex_lock(&mf_mutex);
 
 	if (PageHWPoison(page)) {

> > I don't know, the patch itself is fine, it's the issue that it has
> > exposed that is more concerning.

Thanks,
Kyle Meyer


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-08 19:14             ` Kyle Meyer
@ 2025-09-08 20:01               ` Luck, Tony
  2025-09-10 12:01                 ` Rafael J. Wysocki
  2025-09-18  3:39               ` Shuai Xue
  1 sibling, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2025-09-08 20:01 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: Jiaqi Yan, jane.chu, Liam R. Howlett, Rafael J. Wysocki, surenb,
	Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, david, bp, akpm, linux-mm, vbabka,
	linux-acpi, Shawn Fan

On Mon, Sep 08, 2025 at 02:14:42PM -0500, Kyle Meyer wrote:
> Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
> 
> 0: Soft offline is disabled.
> 1: Soft offline is enabled for normal pages (skip hugepages).
> 2: Soft offline is enabled for normal pages and hugepages.

Seems like a solid plan.

Needs an update to Documentation/admin-guide/sysctl/vm.rst
to match the new functionality and describe the reason that
users might want to avoid taking huge pages offline.

Also this line in existing text:

- On ARM, the request to soft offline pages from GHES driver.

should read:

- On ARM and X86, the request to soft offline pages from GHES driver.

> 
> Maybe something like...
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc30ca4804bf..efa535d405a8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -64,11 +64,17 @@
>  #include "internal.h"
>  #include "ras/ras_event.h"
>  
> +enum soft_offline {
> +	SOFT_OFFLINE_DISABLED = 0,
> +	SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES,
> +	SOFT_OFFLINE_ENABLED
> +};
> +
>  static int sysctl_memory_failure_early_kill __read_mostly;
>  
>  static int sysctl_memory_failure_recovery __read_mostly = 1;
> -static int sysctl_enable_soft_offline __read_mostly = 1;
> +static int sysctl_enable_soft_offline __read_mostly = SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES;

This is changing the default behavior (which allowed offline
of huge pages). I'm in favor of the change. But you should call it
out explicitly in the commit message when you write up a patch.

>  
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
> @@ -150,7 +156,7 @@ static const struct ctl_table memory_failure_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= SYSCTL_ZERO,
> -		.extra2		= SYSCTL_ONE,
> +		.extra2		= SYSCTL_TWO,
>  	}
>  };
>  
> @@ -2799,12 +2805,20 @@ int soft_offline_page(unsigned long pfn, int flags)
>  		return -EIO;
>  	}
>  
> -	if (!sysctl_enable_soft_offline) {
> -		pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
> +	if (sysctl_enable_soft_offline == SOFT_OFFLINE_DISABLED) {
> +		pr_info("disabled by /proc/sys/vm/enable_soft_offline\n");
>  		put_ref_page(pfn, flags);
>  		return -EOPNOTSUPP;
>  	}
>  
> +	if (sysctl_enable_soft_offline == SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES) {
> +		if (folio_test_hugetlb(pfn_folio(pfn))) {
> +			pr_info("disabled by /proc/sys/vm/enable_soft_offline\n");
> +			put_ref_page(pfn, flags);
> +			return -EOPNOTSUPP;
> +		}
> +	}
> +
>  	mutex_lock(&mf_mutex);
>  
>  	if (PageHWPoison(page)) {
> 
> > > I don't know, the patch itself is fine, it's the issue that it has
> > > exposed that is more concerning.
> 
> Thanks,
> Kyle Meyer

-Tony


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-08 20:01               ` Luck, Tony
@ 2025-09-10 12:01                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2025-09-10 12:01 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Kyle Meyer, Jiaqi Yan, jane.chu, Liam R. Howlett,
	Rafael J. Wysocki, surenb, Anderson, Russ, rppt, osalvador,
	nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe, david, bp,
	akpm, linux-mm, vbabka, linux-acpi, Shawn Fan

Hi Tony,

On Mon, Sep 8, 2025 at 10:01 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Mon, Sep 08, 2025 at 02:14:42PM -0500, Kyle Meyer wrote:
> > Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
> >
> > 0: Soft offline is disabled.
> > 1: Soft offline is enabled for normal pages (skip hugepages).
> > 2: Soft offline is enabled for normal pages and hugepages.
>
> Seems like a solid plan.
>
> Needs an update to Documentation/admin-guide/sysctl/vm.rst
> to match the new functionality and describe the reason that
> users might want to avoid taking huge pages offline.
>
> Also this line in existing text:
>
> - On ARM, the request to soft offline pages from GHES driver.
>
> should read:
>
> - On ARM and X86, the request to soft offline pages from GHES driver.
>
> >
> > Maybe something like...
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index fc30ca4804bf..efa535d405a8 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -64,11 +64,17 @@
> >  #include "internal.h"
> >  #include "ras/ras_event.h"
> >
> > +enum soft_offline {
> > +     SOFT_OFFLINE_DISABLED = 0,
> > +     SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES,
> > +     SOFT_OFFLINE_ENABLED
> > +};
> > +
> >  static int sysctl_memory_failure_early_kill __read_mostly;
> >
> >  static int sysctl_memory_failure_recovery __read_mostly = 1;
> > -static int sysctl_enable_soft_offline __read_mostly = 1;
> > +static int sysctl_enable_soft_offline __read_mostly = SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES;
>
> This is changing the default behavior (which allowed offline
> of huge pages). I'm in favor of the change. But you should call it
> out explicitly in the commit message when you write up a patch.
>
> >
> >  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
> >
> > @@ -150,7 +156,7 @@ static const struct ctl_table memory_failure_table[] = {
> >               .mode           = 0644,
> >               .proc_handler   = proc_dointvec_minmax,
> >               .extra1         = SYSCTL_ZERO,
> > -             .extra2         = SYSCTL_ONE,
> > +             .extra2         = SYSCTL_TWO,
> >       }
> >  };
> >
> > @@ -2799,12 +2805,20 @@ int soft_offline_page(unsigned long pfn, int flags)
> >               return -EIO;
> >       }
> >
> > -     if (!sysctl_enable_soft_offline) {
> > -             pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
> > +     if (sysctl_enable_soft_offline == SOFT_OFFLINE_DISABLED) {
> > +             pr_info("disabled by /proc/sys/vm/enable_soft_offline\n");
> >               put_ref_page(pfn, flags);
> >               return -EOPNOTSUPP;
> >       }
> >
> > +     if (sysctl_enable_soft_offline == SOFT_OFFLINE_ENABLED_SKIP_HUGEPAGES) {
> > +             if (folio_test_hugetlb(pfn_folio(pfn))) {
> > +                     pr_info("disabled by /proc/sys/vm/enable_soft_offline\n");
> > +                     put_ref_page(pfn, flags);
> > +                     return -EOPNOTSUPP;
> > +             }
> > +     }
> > +
> >       mutex_lock(&mf_mutex);
> >
> >       if (PageHWPoison(page)) {
> >
> > > > I don't know, the patch itself is fine, it's the issue that it has
> > > > exposed that is more concerning.


Is the $subject patch still relevant?

I gather from the conversation following it that it isn't.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-08 19:14             ` Kyle Meyer
  2025-09-08 20:01               ` Luck, Tony
@ 2025-09-18  3:39               ` Shuai Xue
  2025-09-18 15:43                 ` Jiaqi Yan
  1 sibling, 1 reply; 20+ messages in thread
From: Shuai Xue @ 2025-09-18  3:39 UTC (permalink / raw)
  To: Kyle Meyer, Jiaqi Yan
  Cc: jane.chu, Luck, Tony, Liam R. Howlett, Rafael J. Wysocki, surenb,
	Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, david, bp, akpm, linux-mm, vbabka,
	linux-acpi, Shawn Fan



On 2025/9/9 03:14, Kyle Meyer wrote:
 > On Fri, Sep 05, 2025 at 12:59:00PM -0700, Jiaqi Yan wrote:
 >> On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@oracle.com> wrote:
 >>>
 >>>
 >>> On 9/5/2025 11:17 AM, Luck, Tony wrote:
 >>>> BIOS can supply a GHES error record that reports that the corrected
 >>>> error threshold has been exceeded. Linux will attempt to soft offline
 >>>> the page in response.
 >>>>
 >>>> But "exceeded threshold" has many interpretations. Some BIOS versions
 >>>> accumulate error counts per-rank, and then report threshold exceeded
 >>>> when the number of errors crosses a threshold for the rank. Taking
 >>>> a page offline in this case is unlikely to solve any problems. But
 >>>> losing a 4KB page will have little impact on the overall system.

Hi, Tony,

Thank you for your detailed explanation. I believe this is exactly the problem
we're encountering in our production environment.

As you mentioned, memory access is typically interleaved between channels. When
the per-rank threshold is exceeded, soft-offlining the last accessed address
seems unreasonable - regardless of whether it's a 4KB page or a huge page. The
error accumulation happens at the rank level, but the action is taken on a
specific page that happened to trigger the threshold, which doesn't address the
underlying issue.

I'm curious about the intended use case for the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
flag. What scenario was Intel BIOS expecting the OS to handle when this flag is set?
Is there a specific interpretation of "threshold exceeded" that would make
page-level offline action meaningful? If not, how about disabling soft offline from
GHES and leave that to userspace tools like rasdaemon (mcelog) ?

 >>
 >> Hi Tony,
 >>
 >> This is exactly the problem I encountered [1], and I agree with Jane
 >> that disabling soft offline via /proc/sys/vm/enable_soft_offline
 >> should work for your case.
 >>
 >> [1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86
 >
 > If that doesn't work for your case, I just want to mention that hugepages might
 > still be soft offlined with that check in ghes_handle_memory_failure().
 >
 >>>>
 >>>> On the other hand, taking a huge page offline will have significant
 >>>> impact (and still not solve any problems).
 >>>>
 >>>> Check if the GHES record refers to a huge page. Skip the offline
 >>>> process if the page is huge.
 >
 > AFAICT, we're still notifying the MCE decoder chain and CEC will soft offline
 > the hugepage once the "action threshold" is reached.
 >
 > This could be moved to soft_offline_page(). That would prevent other sources
 > (/sys/devices/system/memory/soft_offline_page, CEC, etc.) from being able to
 > soft offline hugepages, not just GHES.
 >
 >>>> Reported-by: Shawn Fan <shawn.fan@intel.com>
 >>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
 >>>> ---
 >>>>
 >>>> Change since v2:
 >>>>
 >>>> Me: Add sanity check on the address (pfn) that BIOS provided. It might
 >>>> be in some reserved area that doesn't have a "struct page" which would
 >>>> likely result in an OOPs if fed to pfn_folio().
 >>>>
 >>>> The original code relied on sanity check of the pfn received from the
 >>>> BIOS when this eventually feeds into memory_failure(). That used to
 >>>> result in:
 >>>>        pr_err("%#lx: memory outside kernel control\n", pfn);
 >>>> which won't happen with this change, since memory_failure is not
 >>>> called. Was that a useful message? A Google search mostly shows
 >>>> references to the code. There are few instances of people reporting
 >>>> they saw this message.
 >>>>
 >>>>
 >>>>    drivers/acpi/apei/ghes.c | 13 +++++++++++--
 >>>>    1 file changed, 11 insertions(+), 2 deletions(-)
 >>>>
 >>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
 >>>> index a0d54993edb3..c2fc1196438c 100644
 >>>> --- a/drivers/acpi/apei/ghes.c
 >>>> +++ b/drivers/acpi/apei/ghes.c
 >>>> @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 >>>>
 >>>>        /* iff following two events can be handled properly by now */
 >>>>        if (sec_sev == GHES_SEV_CORRECTED &&
 >>>> -         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 >>>> -             flags = MF_SOFT_OFFLINE;
 >>>> +         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
 >>>> +             unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
 >>>> +
 >>>> +             if (pfn_valid(pfn)) {
 >>>> +                     struct folio *folio = pfn_folio(pfn);
 >>>> +
 >>>> +                     /* Only try to offline non-huge pages */
 >>>> +                     if (!folio_test_hugetlb(folio))
 >>>> +                             flags = MF_SOFT_OFFLINE;
 >>>> +             }
 >>>> +     }
 >>>>        if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
 >>>>                flags = sync ? MF_ACTION_REQUIRED : 0;
 >>>>
 >>>
 >>> So the issue is the result of inaccurate MCA record about per rank CE
 >>> threshold being crossed. If OS offline the indicted page, it might be
 >>> signaled to offline another 4K page in the same rank upon access.
 >>>
 >>> Both MCA and offline-op are performance hitters, and as argued by this
 >>> patch, offline doesn't help except losing an already corrected page.
 >>>
 >>> Here we choose to bypass hugetlb page simply because it's huge.  Is it
 >>> possible to argue that because the page is huge, it's less likely to get
 >>> another MCA on another page from the same rank?
 >>>
 >>> A while back this patch
 >>> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
 >>> has provided userspace control over whether to soft offline, could it be
 >>> a more preferable option?
 >
 > Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
 >
 > 0: Soft offline is disabled.
 > 1: Soft offline is enabled for normal pages (skip hugepages).
 > 2: Soft offline is enabled for normal pages and hugepages.
 >

I prefer having soft-offline fully controlled by userspace, especially
for DPDK-style applications. These applications use hugepage mappings and maintain
their own VA-to-PA mappings. When the kernel migrates a hugepage to a new physical
page during soft-offline, DPDK continues accessing the old physical address,
leading to data corruption or access errors.

For such use cases, the application needs to be aware of and handle memory errors
itself. The kernel performing automatic page migration breaks the assumptions these
applications make about stable physical addresses.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-18  3:39               ` Shuai Xue
@ 2025-09-18 15:43                 ` Jiaqi Yan
  2025-09-18 18:45                   ` Luck, Tony
                                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jiaqi Yan @ 2025-09-18 15:43 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Kyle Meyer, jane.chu, Luck, Tony, Liam R. Howlett,
	Rafael J. Wysocki, surenb, Anderson, Russ, rppt, osalvador,
	nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe, david, bp,
	akpm, linux-mm, vbabka, linux-acpi, Shawn Fan

On Wed, Sep 17, 2025 at 8:39 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
>
>
> On 2025/9/9 03:14, Kyle Meyer wrote:
>  > On Fri, Sep 05, 2025 at 12:59:00PM -0700, Jiaqi Yan wrote:
>  >> On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@oracle.com> wrote:
>  >>>
>  >>>
>  >>> On 9/5/2025 11:17 AM, Luck, Tony wrote:
>  >>>> BIOS can supply a GHES error record that reports that the corrected
>  >>>> error threshold has been exceeded. Linux will attempt to soft offline
>  >>>> the page in response.
>  >>>>
>  >>>> But "exceeded threshold" has many interpretations. Some BIOS versions
>  >>>> accumulate error counts per-rank, and then report threshold exceeded
>  >>>> when the number of errors crosses a threshold for the rank. Taking
>  >>>> a page offline in this case is unlikely to solve any problems. But
>  >>>> losing a 4KB page will have little impact on the overall system.
>
> Hi, Tony,
>
> Thank you for your detailed explanation. I believe this is exactly the problem
> we're encountering in our production environment.
>
> As you mentioned, memory access is typically interleaved between channels. When
> the per-rank threshold is exceeded, soft-offlining the last accessed address
> seems unreasonable - regardless of whether it's a 4KB page or a huge page. The
> error accumulation happens at the rank level, but the action is taken on a
> specific page that happened to trigger the threshold, which doesn't address the
> underlying issue.
>
> I'm curious about the intended use case for the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
> flag. What scenario was Intel BIOS expecting the OS to handle when this flag is set?
> Is there a specific interpretation of "threshold exceeded" that would make
> page-level offline action meaningful? If not, how about disabling soft offline from
> GHES and leave that to userspace tools like rasdaemon (mcelog) ?

The existing /proc/sys/vm/enable_soft_offline can already entirely
disable soft offline. GHES may still ask for soft offline to
memory-failure.c, but soft_offline_page will discard the ask as long
as userspace sets 0 to /proc/sys/vm/enable_soft_offline.

>
>  >>
>  >> Hi Tony,
>  >>
>  >> This is exactly the problem I encountered [1], and I agree with Jane
>  >> that disabling soft offline via /proc/sys/vm/enable_soft_offline
>  >> should work for your case.
>  >>
>  >> [1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86
>  >
>  > If that doesn't work for your case, I just want to mention that hugepages might
>  > still be soft offlined with that check in ghes_handle_memory_failure().
>  >
>  >>>>
>  >>>> On the other hand, taking a huge page offline will have significant
>  >>>> impact (and still not solve any problems).
>  >>>>
>  >>>> Check if the GHES record refers to a huge page. Skip the offline
>  >>>> process if the page is huge.
>  >
>  > AFAICT, we're still notifying the MCE decoder chain and CEC will soft offline
>  > the hugepage once the "action threshold" is reached.
>  >
>  > This could be moved to soft_offline_page(). That would prevent other sources
>  > (/sys/devices/system/memory/soft_offline_page, CEC, etc.) from being able to
>  > soft offline hugepages, not just GHES.
>  >
>  >>>> Reported-by: Shawn Fan <shawn.fan@intel.com>
>  >>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>  >>>> ---
>  >>>>
>  >>>> Change since v2:
>  >>>>
>  >>>> Me: Add sanity check on the address (pfn) that BIOS provided. It might
>  >>>> be in some reserved area that doesn't have a "struct page" which would
>  >>>> likely result in an OOPs if fed to pfn_folio().
>  >>>>
>  >>>> The original code relied on sanity check of the pfn received from the
>  >>>> BIOS when this eventually feeds into memory_failure(). That used to
>  >>>> result in:
>  >>>>        pr_err("%#lx: memory outside kernel control\n", pfn);
>  >>>> which won't happen with this change, since memory_failure is not
>  >>>> called. Was that a useful message? A Google search mostly shows
>  >>>> references to the code. There are few instances of people reporting
>  >>>> they saw this message.
>  >>>>
>  >>>>
>  >>>>    drivers/acpi/apei/ghes.c | 13 +++++++++++--
>  >>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>  >>>>
>  >>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>  >>>> index a0d54993edb3..c2fc1196438c 100644
>  >>>> --- a/drivers/acpi/apei/ghes.c
>  >>>> +++ b/drivers/acpi/apei/ghes.c
>  >>>> @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  >>>>
>  >>>>        /* iff following two events can be handled properly by now */
>  >>>>        if (sec_sev == GHES_SEV_CORRECTED &&
>  >>>> -         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>  >>>> -             flags = MF_SOFT_OFFLINE;
>  >>>> +         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
>  >>>> +             unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
>  >>>> +
>  >>>> +             if (pfn_valid(pfn)) {
>  >>>> +                     struct folio *folio = pfn_folio(pfn);
>  >>>> +
>  >>>> +                     /* Only try to offline non-huge pages */
>  >>>> +                     if (!folio_test_hugetlb(folio))
>  >>>> +                             flags = MF_SOFT_OFFLINE;
>  >>>> +             }
>  >>>> +     }
>  >>>>        if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>  >>>>                flags = sync ? MF_ACTION_REQUIRED : 0;
>  >>>>
>  >>>
>  >>> So the issue is the result of inaccurate MCA record about per rank CE
>  >>> threshold being crossed. If OS offline the indicted page, it might be
>  >>> signaled to offline another 4K page in the same rank upon access.
>  >>>
>  >>> Both MCA and offline-op are performance hitters, and as argued by this
>  >>> patch, offline doesn't help except losing an already corrected page.
>  >>>
>  >>> Here we choose to bypass hugetlb page simply because it's huge.  Is it
>  >>> possible to argue that because the page is huge, it's less likely to get
>  >>> another MCA on another page from the same rank?
>  >>>
>  >>> A while back this patch
>  >>> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
>  >>> has provided userspace control over whether to soft offline, could it be
>  >>> a more preferable option?
>  >
>  > Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
>  >
>  > 0: Soft offline is disabled.
>  > 1: Soft offline is enabled for normal pages (skip hugepages).
>  > 2: Soft offline is enabled for normal pages and hugepages.
>  >
>
> I prefer having soft-offline fully controlled by userspace, especially
> for DPDK-style applications. These applications use hugepage mappings and maintain
> their own VA-to-PA mappings. When the kernel migrates a hugepage to a new physical
> page during soft-offline, DPDK continues accessing the old physical address,
> leading to data corruption or access errors.

Just curious, do the DPDK applications pin (pin_user_pages) the
VA-to-PA mappings? If so I would expect both soft offline and hard
offline will fail and become no-op.

>
> For such use cases, the application needs to be aware of and handle memory errors
> itself. The kernel performing automatic page migration breaks the assumptions these
> applications make about stable physical addresses.
> Thanks.
> Shuai


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-18 15:43                 ` Jiaqi Yan
@ 2025-09-18 18:45                   ` Luck, Tony
  2025-09-19  1:53                     ` Shuai Xue
  2025-09-18 19:46                   ` Luck, Tony
  2025-09-19  1:49                   ` Shuai Xue
  2 siblings, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2025-09-18 18:45 UTC (permalink / raw)
  To: Jiaqi Yan, Shuai Xue
  Cc: Meyer, Kyle, jane.chu, Liam R. Howlett, Wysocki, Rafael J,
	surenb, Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, david, bp, akpm, linux-mm, vbabka,
	linux-acpi, Fan, Shawn

> Thank you for your detailed explanation. I believe this is exactly the problem
> we're encountering in our production environment.
>
> As you mentioned, memory access is typically interleaved between channels. When
> the per-rank threshold is exceeded, soft-offlining the last accessed address
> seems unreasonable - regardless of whether it's a 4KB page or a huge page. The
> error accumulation happens at the rank level, but the action is taken on a
> specific page that happened to trigger the threshold, which doesn't address the
> underlying issue.
>
> I'm curious about the intended use case for the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
> flag. What scenario was Intel BIOS expecting the OS to handle when this flag is set?
> Is there a specific interpretation of "threshold exceeded" that would make
> page-level offline action meaningful? If not, how about disabling soft offline from
> GHES and leave that to userspace tools like rasdaemon (mcelog) ?

The original use case was defined by IBM [1] (that division is now part of Lenovo).
IBM BIOS enabled a firmware first mode to handle errors, cutting the OS out of
the picture. But the challenge with this was how to handle a case where the BIOS
identified a recurring problem on a specific memory address. The solution proposed
was to use GHES notification using the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
flag to let the OS know that this corrected error needs some action.

-Tony

[1] cf870c70a194 ("mce: acpi/apei: Soft-offline a page on firmware GHES notification")



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-18 15:43                 ` Jiaqi Yan
  2025-09-18 18:45                   ` Luck, Tony
@ 2025-09-18 19:46                   ` Luck, Tony
  2025-09-19  1:49                   ` Shuai Xue
  2 siblings, 0 replies; 20+ messages in thread
From: Luck, Tony @ 2025-09-18 19:46 UTC (permalink / raw)
  To: Jiaqi Yan, Shuai Xue
  Cc: Meyer, Kyle, jane.chu, Liam R. Howlett, Wysocki, Rafael J,
	surenb, Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, david, bp, akpm, linux-mm, vbabka,
	linux-acpi, Fan, Shawn

We have multiple ways that soft offline of memory can happen.

1) BIOS (via GHES request)
	a) Old "IBM style" BIOS counting errors per-page => Probably want to offline a 4K page,
	break up a transparent huge page, likely don't want to offline a libhugetlb page.
	b) New BIOS counting errors on larger structures (rank, channel) => Probably want to ignore these requests.
2) RAS_CEC
	This is counting per page (with very different thresholds for Intel vs. AMD). => Same as 1(a) above?
3) /sys/devices/system/memory/soft_offline_page
	a) mcelog(8) Also counting per-page. => Same as 1(a)
	b) rasdaemon(8) Same as mcelog(8) => Same as 1(a)
	c) Some other user agent. Up to the user who knows what this agent does. => May always want to offline?

Perhaps the sysctl option to control offline should take into account the source of the request.
Especially for users who know whether their BIOS is 1(a) or 1(b).

-Tony


* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-18 15:43                 ` Jiaqi Yan
  2025-09-18 18:45                   ` Luck, Tony
  2025-09-18 19:46                   ` Luck, Tony
@ 2025-09-19  1:49                   ` Shuai Xue
  2 siblings, 0 replies; 20+ messages in thread
From: Shuai Xue @ 2025-09-19  1:49 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Kyle Meyer, jane.chu, Luck, Tony, Liam R. Howlett,
	Rafael J. Wysocki, surenb, Anderson, Russ, rppt, osalvador,
	nao.horiguchi, mhocko, lorenzo.stoakes, linmiaohe, david, bp,
	akpm, linux-mm, vbabka, linux-acpi, Shawn Fan



On 2025/9/18 23:43, Jiaqi Yan wrote:
> On Wed, Sep 17, 2025 at 8:39 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/9/9 03:14, Kyle Meyer wrote:
>> > On Fri, Sep 05, 2025 at 12:59:00PM -0700, Jiaqi Yan wrote:
>>   >> On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@oracle.com> wrote:
>>   >>>
>>   >>>
>>   >>> On 9/5/2025 11:17 AM, Luck, Tony wrote:
>>   >>>> BIOS can supply a GHES error record that reports that the corrected
>>   >>>> error threshold has been exceeded. Linux will attempt to soft offline
>>   >>>> the page in response.
>>   >>>>
>>   >>>> But "exceeded threshold" has many interpretations. Some BIOS versions
>>   >>>> accumulate error counts per-rank, and then report threshold exceeded
>>   >>>> when the number of errors crosses a threshold for the rank. Taking
>>   >>>> a page offline in this case is unlikely to solve any problems. But
>>   >>>> losing a 4KB page will have little impact on the overall system.
>>
>> Hi, Tony,
>>
>> Thank you for your detailed explanation. I believe this is exactly the problem
>> we're encountering in our production environment.
>>
>> As you mentioned, memory access is typically interleaved between channels. When
>> the per-rank threshold is exceeded, soft-offlining the last accessed address
>> seems unreasonable - regardless of whether it's a 4KB page or a huge page. The
>> error accumulation happens at the rank level, but the action is taken on a
>> specific page that happened to trigger the threshold, which doesn't address the
>> underlying issue.
>>
>> I'm curious about the intended use case for the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
>> flag. What scenario was Intel BIOS expecting the OS to handle when this flag is set?
>> Is there a specific interpretation of "threshold exceeded" that would make
>> page-level offline action meaningful? If not, how about disabling soft offline from
>> GHES and leaving that to userspace tools like rasdaemon or mcelog?
> 
> The existing /proc/sys/vm/enable_soft_offline can already entirely
> disable soft offline. GHES may still ask for soft offline to
> memory-failure.c, but soft_offline_page will discard the ask as long
> as userspace sets 0 to /proc/sys/vm/enable_soft_offline.
> 

I see. Thanks.

>>
>>   >>
>>   >> Hi Tony,
>>   >>
>>   >> This is exactly the problem I encountered [1], and I agree with Jane
>>   >> that disabling soft offline via /proc/sys/vm/enable_soft_offline
>>   >> should work for your case.
>>   >>
>>   >> [1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86
>>   >
>>   > If that doesn't work for your case, I just want to mention that hugepages might
>>   > still be soft offlined with that check in ghes_handle_memory_failure().
>>   >
>>   >>>>
>>   >>>> On the other hand, taking a huge page offline will have significant
>>   >>>> impact (and still not solve any problems).
>>   >>>>
>>   >>>> Check if the GHES record refers to a huge page. Skip the offline
>>   >>>> process if the page is huge.
>>   >
>>   > AFAICT, we're still notifying the MCE decoder chain and CEC will soft offline
>>   > the hugepage once the "action threshold" is reached.
>>   >
>>   > This could be moved to soft_offline_page(). That would prevent other sources
>>   > (/sys/devices/system/memory/soft_offline_page, CEC, etc.) from being able to
>>   > soft offline hugepages, not just GHES.
>>   >
>>   >>>> Reported-by: Shawn Fan <shawn.fan@intel.com>
>>   >>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>>   >>>> ---
>>   >>>>
>>   >>>> Change since v2:
>>   >>>>
>>   >>>> Me: Add sanity check on the address (pfn) that BIOS provided. It might
>>   >>>> be in some reserved area that doesn't have a "struct page" which would
>>   >>>> likely result in an OOPs if fed to pfn_folio().
>>   >>>>
>>   >>>> The original code relied on sanity check of the pfn received from the
>>   >>>> BIOS when this eventually feeds into memory_failure(). That used to
>>   >>>> result in:
>>   >>>>        pr_err("%#lx: memory outside kernel control\n", pfn);
>>   >>>> which won't happen with this change, since memory_failure is not
>>   >>>> called. Was that a useful message? A Google search mostly shows
>>   >>>> references to the code. There are few instances of people reporting
>>   >>>> they saw this message.
>>   >>>>
>>   >>>>
>>   >>>>    drivers/acpi/apei/ghes.c | 13 +++++++++++--
>>   >>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>>   >>>>
>>   >>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>   >>>> index a0d54993edb3..c2fc1196438c 100644
>>   >>>> --- a/drivers/acpi/apei/ghes.c
>>   >>>> +++ b/drivers/acpi/apei/ghes.c
>>   >>>> @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>   >>>>
>>   >>>>        /* iff following two events can be handled properly by now */
>>   >>>>        if (sec_sev == GHES_SEV_CORRECTED &&
>>   >>>> -         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>   >>>> -             flags = MF_SOFT_OFFLINE;
>>   >>>> +         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
>>   >>>> +             unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
>>   >>>> +
>>   >>>> +             if (pfn_valid(pfn)) {
>>   >>>> +                     struct folio *folio = pfn_folio(pfn);
>>   >>>> +
>>   >>>> +                     /* Only try to offline non-huge pages */
>>   >>>> +                     if (!folio_test_hugetlb(folio))
>>   >>>> +                             flags = MF_SOFT_OFFLINE;
>>   >>>> +             }
>>   >>>> +     }
>>   >>>>        if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>>   >>>>                flags = sync ? MF_ACTION_REQUIRED : 0;
>>   >>>>
>>   >>>
>>   >>> So the issue is the result of an inaccurate MCA record about the per-rank CE
>>   >>> threshold being crossed. If the OS offlines the indicted page, it might be
>>   >>> signaled to offline another 4K page in the same rank upon access.
>>   >>>
>>   >>> Both the MCA and the offline operation are performance hits, and as argued by this
>>   >>> patch, offline doesn't help beyond losing an already corrected page.
>>   >>>
>>   >>> Here we choose to bypass hugetlb page simply because it's huge.  Is it
>>   >>> possible to argue that because the page is huge, it's less likely to get
>>   >>> another MCA on another page from the same rank?
>>   >>>
>>   >>> A while back, this patch
>>   >>> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
>>   >>> provided userspace control over whether to soft offline; could it be
>>   >>> a preferable option?
>>   >
>>   > Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
>>   >
>>   > 0: Soft offline is disabled.
>>   > 1: Soft offline is enabled for normal pages (skip hugepages).
>>   > 2: Soft offline is enabled for normal pages and hugepages.
>>   >
>>
>> I prefer having soft-offline fully controlled by userspace, especially
>> for DPDK-style applications. These applications use hugepage mappings and maintain
>> their own VA-to-PA mappings. When the kernel migrates a hugepage to a new physical
>> page during soft-offline, DPDK continues accessing the old physical address,
>> leading to data corruption or access errors.
> 
> Just curious, does the DPDK applications pin (pin_user_pages) the
> VA-to-PA mappings? If so I would expect both soft offline and hard
> offline will fail and become no-op.
> 

I think they do. We encountered this problem in older kernel
versions. However, since it's application-specific behavior, I agree
that using enable_soft_offline for userspace control is a good solution.

Thanks.
Shuai



* Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
  2025-09-18 18:45                   ` Luck, Tony
@ 2025-09-19  1:53                     ` Shuai Xue
  0 siblings, 0 replies; 20+ messages in thread
From: Shuai Xue @ 2025-09-19  1:53 UTC (permalink / raw)
  To: Luck, Tony, Jiaqi Yan
  Cc: Meyer, Kyle, jane.chu, Liam R. Howlett, Wysocki, Rafael J,
	surenb, Anderson, Russ, rppt, osalvador, nao.horiguchi, mhocko,
	lorenzo.stoakes, linmiaohe, david, bp, akpm, linux-mm, vbabka,
	linux-acpi, Fan, Shawn



On 2025/9/19 02:45, Luck, Tony wrote:
>> Thank you for your detailed explanation. I believe this is exactly the problem
>> we're encountering in our production environment.
>>
>> As you mentioned, memory access is typically interleaved between channels. When
>> the per-rank threshold is exceeded, soft-offlining the last accessed address
>> seems unreasonable - regardless of whether it's a 4KB page or a huge page. The
>> error accumulation happens at the rank level, but the action is taken on a
>> specific page that happened to trigger the threshold, which doesn't address the
>> underlying issue.
>>
>> I'm curious about the intended use case for the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
>> flag. What scenario was Intel BIOS expecting the OS to handle when this flag is set?
>> Is there a specific interpretation of "threshold exceeded" that would make
>> page-level offline action meaningful? If not, how about disabling soft offline from
>> GHES and leaving that to userspace tools like rasdaemon or mcelog?
> 
> The original use case was defined by IBM [1] (that division is now part of Lenovo).
> IBM BIOS enabled a firmware first mode to handle errors, cutting the OS out of
> the picture. But the challenge with this was how to handle a case where the BIOS
> identified a recurring problem on a specific memory address. The solution proposed
> was to use GHES notification using the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
> flag to let the OS know that this corrected error needs some action.
> 
> -Tony
> 
> [1] cf870c70a194 ("mce: acpi/apei: Soft-offline a page on firmware GHES notification")
> 
> 

Hi, Tony,

Thanks for the historical context. Understanding the IBM use case
and the original design intent is very helpful.

Best regards,
Shuai



end of thread, other threads:[~2025-09-19  1:53 UTC | newest]

Thread overview: 20+ messages
2025-09-04 15:57 [PATCH] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked Tony Luck
2025-09-04 17:25 ` Mike Rapoport
2025-09-04 18:16 ` Liam R. Howlett
2025-09-05 15:53   ` [PATCH v2] " Luck, Tony
2025-09-05 16:25     ` Liam R. Howlett
2025-09-05 18:17       ` PATCH v3 " Luck, Tony
2025-09-05 19:39         ` jane.chu
2025-09-05 19:58           ` Luck, Tony
2025-09-05 20:14             ` jane.chu
2025-09-05 20:36               ` Luck, Tony
2025-09-05 19:59           ` Jiaqi Yan
2025-09-08 19:14             ` Kyle Meyer
2025-09-08 20:01               ` Luck, Tony
2025-09-10 12:01                 ` Rafael J. Wysocki
2025-09-18  3:39               ` Shuai Xue
2025-09-18 15:43                 ` Jiaqi Yan
2025-09-18 18:45                   ` Luck, Tony
2025-09-19  1:53                     ` Shuai Xue
2025-09-18 19:46                   ` Luck, Tony
2025-09-19  1:49                   ` Shuai Xue
