From: Jiaqi Yan <jiaqiyan@google.com>
To: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Kyle Meyer <kyle.meyer@hpe.com>,
	jane.chu@oracle.com,  "Luck, Tony" <tony.luck@intel.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	surenb@google.com,  "Anderson, Russ" <russ.anderson@hpe.com>,
	rppt@kernel.org, osalvador@suse.de,  nao.horiguchi@gmail.com,
	mhocko@suse.com, lorenzo.stoakes@oracle.com,
	 linmiaohe@huawei.com, david@redhat.com, bp@alien8.de,
	 akpm@linux-foundation.org, linux-mm@kvack.org, vbabka@suse.cz,
	 linux-acpi@vger.kernel.org, Shawn Fan <shawn.fan@intel.com>
Subject: Re: PATCH v3 ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
Date: Thu, 18 Sep 2025 08:43:51 -0700
Message-ID: <CACw3F50hU3BCP=A++Dx_V=U8PKvsTvTa1=krULxfQdeK2kVBrw@mail.gmail.com>
In-Reply-To: <7d3cc42c-f1ef-4f28-985e-3a5e4011c585@linux.alibaba.com>

On Wed, Sep 17, 2025 at 8:39 PM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
>
>
> On 2025/9/9 03:14, Kyle Meyer wrote:
>  > On Fri, Sep 05, 2025 at 12:59:00PM -0700, Jiaqi Yan wrote:
>  >> On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@oracle.com> wrote:
>  >>>
>  >>>
>  >>> On 9/5/2025 11:17 AM, Luck, Tony wrote:
>  >>>> BIOS can supply a GHES error record that reports that the corrected
>  >>>> error threshold has been exceeded. Linux will attempt to soft offline
>  >>>> the page in response.
>  >>>>
>  >>>> But "exceeded threshold" has many interpretations. Some BIOS versions
>  >>>> accumulate error counts per-rank, and then report threshold exceeded
>  >>>> when the number of errors crosses a threshold for the rank. Taking
>  >>>> a page offline in this case is unlikely to solve any problems. But
>  >>>> losing a 4KB page will have little impact on the overall system.
>
> Hi, Tony,
>
> Thank you for your detailed explanation. I believe this is exactly the problem
> we're encountering in our production environment.
>
> As you mentioned, memory access is typically interleaved between channels. When
> the per-rank threshold is exceeded, soft-offlining the last accessed address
> seems unreasonable - regardless of whether it's a 4KB page or a huge page. The
> error accumulation happens at the rank level, but the action is taken on a
> specific page that happened to trigger the threshold, which doesn't address the
> underlying issue.
>
> I'm curious about the intended use case for the CPER_SEC_ERROR_THRESHOLD_EXCEEDED
> flag. What scenario was Intel BIOS expecting the OS to handle when this flag is set?
> Is there a specific interpretation of "threshold exceeded" that would make
> page-level offline action meaningful? If not, how about disabling soft offline from
> GHES and leaving that to userspace tools like rasdaemon (mcelog)?

The existing /proc/sys/vm/enable_soft_offline can already disable soft
offline entirely. GHES may still send a soft offline request down to
memory-failure.c, but soft_offline_page() will discard the request as
long as userspace has written 0 to /proc/sys/vm/enable_soft_offline.
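
For completeness: flipping that knob is just writing a byte to the
sysctl file. Below is a minimal C sketch of doing it from a management
daemon; the only assumption is a kernel that has the enable_soft_offline
sysctl from 56374430c5dfc, the rest is purely illustrative:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /*
   * Write "0" to /proc/sys/vm/enable_soft_offline so that any
   * MF_SOFT_OFFLINE request, from GHES or any other source, is
   * discarded by soft_offline_page(). Needs root. Writing "1"
   * back re-enables soft offline.
   */
  int main(void)
  {
          int fd = open("/proc/sys/vm/enable_soft_offline", O_WRONLY);

          if (fd < 0) {
                  perror("open /proc/sys/vm/enable_soft_offline");
                  return 1;
          }
          if (write(fd, "0", 1) != 1) {
                  perror("write");
                  close(fd);
                  return 1;
          }
          close(fd);
          return 0;
  }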

>
>  >>
>  >> Hi Tony,
>  >>
>  >> This is exactly the problem I encountered [1], and I agree with Jane
>  >> that disabling soft offline via /proc/sys/vm/enable_soft_offline
>  >> should work for your case.
>  >>
>  >> [1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86
>  >
>  > If that doesn't work for your case, I just want to mention that hugepages might
>  > still be soft offlined even with that check in ghes_handle_memory_failure().
>  >
>  >>>>
>  >>>> On the other hand, taking a huge page offline will have significant
>  >>>> impact (and still not solve any problems).
>  >>>>
>  >>>> Check if the GHES record refers to a huge page. Skip the offline
>  >>>> process if the page is huge.
>  >
>  > AFAICT, we're still notifying the MCE decoder chain and CEC will soft offline
>  > the hugepage once the "action threshold" is reached.
>  >
>  > This could be moved to soft_offline_page(). That would prevent other sources
>  > (/sys/devices/system/memory/soft_offline_page, CEC, etc.) from being able to
>  > soft offline hugepages, not just GHES.
>  >
>  >>>> Reported-by: Shawn Fan <shawn.fan@intel.com>
>  >>>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>  >>>> ---
>  >>>>
>  >>>> Change since v2:
>  >>>>
>  >>>> Me: Add sanity check on the address (pfn) that BIOS provided. It might
>  >>>> be in some reserved area that doesn't have a "struct page" which would
>  >>>> likely result in an OOPs if fed to pfn_folio().
>  >>>>
>  >>>> The original code relied on sanity check of the pfn received from the
>  >>>> BIOS when this eventually feeds into memory_failure(). That used to
>  >>>> result in:
>  >>>>        pr_err("%#lx: memory outside kernel control\n", pfn);
>  >>>> which won't happen with this change, since memory_failure is not
>  >>>> called. Was that a useful message? A Google search mostly shows
>  >>>> references to the code. There are few instances of people reporting
>  >>>> they saw this message.
>  >>>>
>  >>>>
>  >>>>    drivers/acpi/apei/ghes.c | 13 +++++++++++--
>  >>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>  >>>>
>  >>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>  >>>> index a0d54993edb3..c2fc1196438c 100644
>  >>>> --- a/drivers/acpi/apei/ghes.c
>  >>>> +++ b/drivers/acpi/apei/ghes.c
>  >>>> @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  >>>>
>  >>>>        /* iff following two events can be handled properly by now */
>  >>>>        if (sec_sev == GHES_SEV_CORRECTED &&
>  >>>> -         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>  >>>> -             flags = MF_SOFT_OFFLINE;
>  >>>> +         (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
>  >>>> +             unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
>  >>>> +
>  >>>> +             if (pfn_valid(pfn)) {
>  >>>> +                     struct folio *folio = pfn_folio(pfn);
>  >>>> +
>  >>>> +                     /* Only try to offline non-huge pages */
>  >>>> +                     if (!folio_test_hugetlb(folio))
>  >>>> +                             flags = MF_SOFT_OFFLINE;
>  >>>> +             }
>  >>>> +     }
>  >>>>        if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>  >>>>                flags = sync ? MF_ACTION_REQUIRED : 0;
>  >>>>
>  >>>
>  >>> So the issue is the result of an inaccurate MCA record about the per-rank CE
>  >>> threshold being crossed. If the OS offlines the indicted page, it might be
>  >>> signaled to offline another 4K page in the same rank upon access.
>  >>>
>  >>> Both the MCA and the offline op are performance hitters, and as argued by this
>  >>> patch, offlining doesn't help beyond losing an already corrected page.
>  >>>
>  >>> Here we choose to bypass hugetlb page simply because it's huge.  Is it
>  >>> possible to argue that because the page is huge, it's less likely to get
>  >>> another MCA on another page from the same rank?
>  >>>
>  >>> A while back this patch
>  >>> 56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
>  >>> provided userspace control over whether to soft offline; could it be
>  >>> a preferable option?
>  >
>  > Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
>  >
>  > 0: Soft offline is disabled.
>  > 1: Soft offline is enabled for normal pages (skip hugepages).
>  > 2: Soft offline is enabled for normal pages and hugepages.
>  >
>
> I prefer having soft-offline fully controlled by userspace, especially
> for DPDK-style applications. These applications use hugepage mappings and maintain
> their own VA-to-PA mappings. When the kernel migrates a hugepage to a new physical
> page during soft-offline, DPDK continues accessing the old physical address,
> leading to data corruption or access errors.

Just curious, do the DPDK applications pin (pin_user_pages) the
VA-to-PA mappings? If so, I would expect both soft offline and hard
offline to fail and become no-ops.
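
Side note: if it helps to confirm what happened in your production
case, the VA->PFN mapping a process sees can be read from
/proc/self/pagemap (semantics per Documentation/admin-guide/mm/pagemap.rst;
on recent kernels you need CAP_SYS_ADMIN to see non-zero PFNs). A rough
sketch, purely illustrative:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Return the PFN currently backing va, or 0 if absent/not permitted. */
  static uint64_t va_to_pfn(void *va)
  {
          uint64_t entry = 0;
          long pagesize = sysconf(_SC_PAGESIZE);
          off_t off = (uintptr_t)va / pagesize * sizeof(entry);
          int fd = open("/proc/self/pagemap", O_RDONLY);

          if (fd < 0)
                  return 0;
          if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
                  entry = 0;
          close(fd);

          /* Bit 63 = page present, bits 0-54 = PFN. */
          return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
  }

  int main(void)
  {
          /* One anonymous page as a demo; a hugetlb mapping works the same. */
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          p[0] = 1;       /* fault the page in so a PFN exists */
          printf("va %p -> pfn 0x%llx\n", p, (unsigned long long)va_to_pfn(p));
          return 0;
  }

Sampling the PFN before and after a suspected soft offline would show
whether the page was really migrated underneath the application.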

>
> For such use cases, the application needs to be aware of and handle memory errors
> itself. The kernel performing automatic page migration breaks the assumptions these
> applications make about stable physical addresses.
> Thanks.
> Shuai
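
For reference, the usual userspace building block for "handle memory
errors itself" is a SIGBUS handler that inspects the BUS_MCEERR_* codes
the kernel delivers for poisoned memory. A minimal sketch, illustrative
only (assumes a reasonably recent glibc for si_addr_lsb; the recovery
policy is entirely application-specific):

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <unistd.h>

  static void sigbus_handler(int sig, siginfo_t *si, void *uctx)
  {
          /*
           * BUS_MCEERR_AR: poisoned data was consumed (action required).
           * BUS_MCEERR_AO: poison was found asynchronously (action optional).
           * si_addr is the affected virtual address; si_addr_lsb is the
           * log2 of the blast radius (12 for a 4K page, larger for hugetlb).
           */
          if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
                  fprintf(stderr, "memory error at %p, ~%lu bytes\n",
                          si->si_addr, 1UL << si->si_addr_lsb);
          _exit(1);
  }

  int main(void)
  {
          struct sigaction sa = { 0 };

          sa.sa_sigaction = sigbus_handler;
          sa.sa_flags = SA_SIGINFO;
          sigemptyset(&sa.sa_mask);
          sigaction(SIGBUS, &sa, NULL);

          /* Opt in to early SIGBUS delivery for action-optional errors. */
          prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

          /* ... application work on its hugetlb mappings ... */
          pause();
          return 0;
  }

(fprintf and the immediate _exit are only for the sketch; real recovery
code should stick to async-signal-safe calls and remap or fail over the
affected range instead of exiting.)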


Thread overview: 20+ messages
2025-09-04 15:57 [PATCH] " Tony Luck
2025-09-04 17:25 ` Mike Rapoport
2025-09-04 18:16 ` Liam R. Howlett
2025-09-05 15:53   ` [PATCH v2] " Luck, Tony
2025-09-05 16:25     ` Liam R. Howlett
2025-09-05 18:17       ` PATCH v3 " Luck, Tony
2025-09-05 19:39         ` jane.chu
2025-09-05 19:58           ` Luck, Tony
2025-09-05 20:14             ` jane.chu
2025-09-05 20:36               ` Luck, Tony
2025-09-05 19:59           ` Jiaqi Yan
2025-09-08 19:14             ` Kyle Meyer
2025-09-08 20:01               ` Luck, Tony
2025-09-10 12:01                 ` Rafael J. Wysocki
2025-09-18  3:39               ` Shuai Xue
2025-09-18 15:43                 ` Jiaqi Yan [this message]
2025-09-18 18:45                   ` Luck, Tony
2025-09-19  1:53                     ` Shuai Xue
2025-09-18 19:46                   ` Luck, Tony
2025-09-19  1:49                   ` Shuai Xue
