From: Jiaqi Yan <jiaqiyan@google.com>
Date: Thu, 18 Sep 2025 08:43:51 -0700
Subject: Re: [PATCH v3] ACPI: APEI: GHES: Don't offline huge pages just because BIOS asked
To: Shuai Xue
Cc: Kyle Meyer, jane.chu@oracle.com, "Luck, Tony", "Liam R. Howlett",
 "Rafael J. Wysocki", surenb@google.com, "Anderson, Russ", rppt@kernel.org,
 osalvador@suse.de, nao.horiguchi@gmail.com, mhocko@suse.com,
 lorenzo.stoakes@oracle.com, linmiaohe@huawei.com, david@redhat.com,
 bp@alien8.de, akpm@linux-foundation.org, linux-mm@kvack.org,
 vbabka@suse.cz, linux-acpi@vger.kernel.org, Shawn Fan
In-Reply-To: <7d3cc42c-f1ef-4f28-985e-3a5e4011c585@linux.alibaba.com>
References: <20250904155720.22149-1-tony.luck@intel.com>
 <7d3cc42c-f1ef-4f28-985e-3a5e4011c585@linux.alibaba.com>
Wysocki" , surenb@google.com, "Anderson, Russ" , rppt@kernel.org, osalvador@suse.de, nao.horiguchi@gmail.com, mhocko@suse.com, lorenzo.stoakes@oracle.com, linmiaohe@huawei.com, david@redhat.com, bp@alien8.de, akpm@linux-foundation.org, linux-mm@kvack.org, vbabka@suse.cz, linux-acpi@vger.kernel.org, Shawn Fan Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 56464140013 X-Stat-Signature: 4wskycxqsbdfz9oqrw5apifjikypyotp X-Rspam-User: X-Rspamd-Server: rspam01 X-HE-Tag: 1758210244-601740 X-HE-Meta: U2FsdGVkX19EPIzUAvqCgwenO7W0S2lk9BzPZ1Hda/Sgzt3H1+kWQ56/oVWyxJiCtAcruDTUoZ6SQnpjfw0eG4pWxL5H8urbg40X/4lXvHFSq47iHKzOjadjSXuJusxvBV9O8dLaJiTE4lFtoGsel+Nsp3CHzTtlIyBkSBAdkmbNk0AljI9zfHA8EUM4UKoypvVVs0DB8Y7M2usz1fymE9oENaqNH8FSNDdi4EMzVNM6c+a943XT2GmLyn3cx8PMAraQi6f32Cnc3u2pDFB58pzbpsMrJ32g0ks7jl5jlz6fYXVi7mk6gsK20pxu6oskmCMXOF68pRAbL2vjwIAbrSpCKG9bEEGuS8kG7hMls0ufft9s2+aBHamoBX8GsBra4W6N47SoLJBX78olUhu+ABYKD46GpvT1XPMezMoRGxNJR3USfTN4IbpJzvwxwuRcRZ8lMsGX9NmBDBeP5dLSW71S1q+u8qUsWYX02KsOvidfRjlexDcXGjXrCMigwBTeXeChO8kM3EBqXsjpFwbc1R+1uhLer8bH6wEUnuI+iOGiID1j3Q+s54R3d4rSEJqnbXM9TW785ebLShs12+ofv5hdDeQ/3Fe70pmkBQ5s+aPPkJB0yIrWtUDBpE4IqIMnfZhBGZC4R5/fRXh0K74d7YsBIxPtHCi6hMavkXCoKPblkYup317csjYU4l3fa9xhPsxDULV0owOAmyV+nXG1LZg6ayCG9j1YzmVLnFd9067AZxGY/ImGbSUQGL4oNu4zu49bkQFC2D/TF3zhzukU/njaOKVUC/VyjSpgpeXACB1WsxpWRBFq8HKv3EwoYdKfAeTFDIh4huvi3KjGOXCzwr0SoACan4S5IoyztaGfKzk3TWqAlrVz1ODLa8nCThHFqMqhGy5oqRMMT2L0Uqkt0DI5E4KeS4h6hAOwcvIVpXwc4i+OXvBvCuAlfD7tAnfYtpBX5XqWHDof6h7dolk oL9GUm5I urkFBArmcwrDmdg3ULNi8WMJ46bIaiKQdoYQlyck/X+gUtSkO7VaLgOIUR4x2uD7ukolxvJ3/tCq/JJ2vP5p7ChY3LGCi+M/J/JV/71v8yUYmuebHtrJNMptSHKffoctir1i9uWElrRUXOvyO7Y1CmFBzgnkS01GNoiA+NAAkjE/Fu1mb4KmPZdhMeHxftdCe/+iwz7icaF+M1bIetCSvyzdgisen+eHMfNxkx2Kn7HQa/jssxvZX/3dMztjOveDBS+OmrX4YoKn9udlkDjuxxE/QHFsGuSQ7xFj/mhBG2iZP0RfU02aB5OEXPpdfuuAcKUEnna70QazQU1VKuVu2PV2V1slI0auKkrfVQ6sspSfQgt1QYmvowoR6+9J+S+ZNMG4b X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Sep 17, 2025 at 8:39=E2=80=AFPM Shuai Xue wrote: > > > > =E5=9C=A8 2025/9/9 03:14, Kyle Meyer =E5=86=99=E9=81=93:> On Fri, Sep 05,= 2025 at 12:59:00PM -0700, Jiaqi Yan wrote: > >> On Fri, Sep 5, 2025 at 12:39=E2=80=AFPM wrote: > >>> > >>> > >>> On 9/5/2025 11:17 AM, Luck, Tony wrote: > >>>> BIOS can supply a GHES error record that reports that the corrected > >>>> error threshold has been exceeded. Linux will attempt to soft offli= ne > >>>> the page in response. > >>>> > >>>> But "exceeded threshold" has many interpretations. Some BIOS versio= ns > >>>> accumulate error counts per-rank, and then report threshold exceede= d > >>>> when the number of errors crosses a threshold for the rank. Taking > >>>> a page offline in this case is unlikely to solve any problems. But > >>>> losing a 4KB page will have little impact on the overall system. > > Hi, Tony, > > Thank you for your detailed explanation. I believe this is exactly the pr= oblem > we're encountering in our production environment. > > As you mentioned, memory access is typically interleaved between channels= . When > the per-rank threshold is exceeded, soft-offlining the last accessed addr= ess > seems unreasonable - regardless of whether it's a 4KB page or a huge page= . 

>
> >>
> >> Hi Tony,
> >>
> >> This is exactly the problem I encountered [1], and I agree with Jane
> >> that disabling soft offline via /proc/sys/vm/enable_soft_offline
> >> should work for your case.
> >>
> >> [1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@google.com/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86
> >
> > If that doesn't work for your case, I just want to mention that
> > hugepages might still be soft offlined with that check in
> > ghes_handle_memory_failure().
> >
> >>>>
> >>>> On the other hand, taking a huge page offline will have significant
> >>>> impact (and still not solve any problems).
> >>>>
> >>>> Check if the GHES record refers to a huge page. Skip the offline
> >>>> process if the page is huge.
> >
> > AFAICT, we're still notifying the MCE decoder chain and CEC will soft
> > offline the hugepage once the "action threshold" is reached.
> >
> > This could be moved to soft_offline_page(). That would prevent other
> > sources (/sys/devices/system/memory/soft_offline_page, CEC, etc.) from
> > being able to soft offline hugepages, not just GHES.
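
A rough sketch of that suggestion (illustrative only, not actual kernel
code; it ignores locking and the existing enable_soft_offline check, and
reuses the pfn_valid()/pfn_folio() sanity checks from the patch below):

/* Sketch of soft_offline_page() with the hugetlb check hoisted in. */
static int soft_offline_page(unsigned long pfn, int flags)
{
	struct folio *folio;

	/* Reject addresses with no struct page, e.g. BIOS-reserved areas. */
	if (!pfn_valid(pfn))
		return -ENXIO;

	folio = pfn_folio(pfn);

	/*
	 * Skip huge pages for every caller (GHES, CEC, sysfs), not just
	 * GHES: offlining a huge page costs a lot and fixes nothing when
	 * the error threshold was accumulated per-rank.
	 */
	if (folio_test_hugetlb(folio))
		return -EOPNOTSUPP;

	/* ... existing soft-offline path continues here ... */
	return 0;
}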
> >
> >>>> Reported-by: Shawn Fan
> >>>> Signed-off-by: Tony Luck
> >>>> ---
> >>>>
> >>>> Change since v2:
> >>>>
> >>>> Me: Add sanity check on the address (pfn) that BIOS provided. It might
> >>>> be in some reserved area that doesn't have a "struct page", which would
> >>>> likely result in an oops if fed to pfn_folio().
> >>>>
> >>>> The original code relied on the sanity check of the pfn received from
> >>>> the BIOS when it eventually feeds into memory_failure(). That used to
> >>>> result in:
> >>>>         pr_err("%#lx: memory outside kernel control\n", pfn);
> >>>> which won't happen with this change, since memory_failure() is not
> >>>> called. Was that a useful message? A Google search mostly shows
> >>>> references to the code. There are few instances of people reporting
> >>>> they saw this message.
> >>>>
> >>>>  drivers/acpi/apei/ghes.c | 13 +++++++++++--
> >>>>  1 file changed, 11 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >>>> index a0d54993edb3..c2fc1196438c 100644
> >>>> --- a/drivers/acpi/apei/ghes.c
> >>>> +++ b/drivers/acpi/apei/ghes.c
> >>>> @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> >>>>
> >>>>          /* iff following two events can be handled properly by now */
> >>>>          if (sec_sev == GHES_SEV_CORRECTED &&
> >>>> -            (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> >>>> -                flags = MF_SOFT_OFFLINE;
> >>>> +            (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> >>>> +                unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> >>>> +
> >>>> +                if (pfn_valid(pfn)) {
> >>>> +                        struct folio *folio = pfn_folio(pfn);
> >>>> +
> >>>> +                        /* Only try to offline non-huge pages */
> >>>> +                        if (!folio_test_hugetlb(folio))
> >>>> +                                flags = MF_SOFT_OFFLINE;
> >>>> +                }
> >>>> +        }
> >>>>          if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> >>>>                  flags = sync ? MF_ACTION_REQUIRED : 0;
> >>>>
> >>>
> >>> So the issue is the result of an inaccurate MCA record about a per-rank
> >>> CE threshold being crossed. If the OS offlines the indicted page, it
> >>> might be signaled to offline another 4K page in the same rank upon
> >>> access.
> >>>
> >>> Both MCA and the offline op are performance hitters, and as argued by
> >>> this patch, offline doesn't help except losing an already corrected
> >>> page.
> >>>
> >>> Here we choose to bypass the hugetlb page simply because it's huge. Is
> >>> it possible to argue that because the page is huge, it's less likely
> >>> to get another MCA on another page from the same rank?
> >>>
> >>> A while back this patch
> >>>         56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
> >>> has provided userspace control over whether to soft offline; could it
> >>> be a more preferable option?
> >
> > Optionally, a 3rd setting could be added to /proc/sys/vm/enable_soft_offline:
> >
> > 0: Soft offline is disabled.
> > 1: Soft offline is enabled for normal pages (skip hugepages).
> > 2: Soft offline is enabled for normal pages and hugepages.
>
> I prefer having soft offline fully controlled by userspace, especially
> for DPDK-style applications. These applications use hugepage mappings
> and maintain their own VA-to-PA mappings. When the kernel migrates a
> hugepage to a new physical page during soft offline, DPDK continues
> accessing the old physical address, leading to data corruption or
> access errors.

Just curious, do the DPDK applications pin (pin_user_pages) the
VA-to-PA mappings? If so, I would expect both soft offline and hard
offline to fail and become no-ops.

> For such use cases, the application needs to be aware of and handle
> memory errors itself. The kernel performing automatic page migration
> breaks the assumptions these applications make about stable physical
> addresses.
>
> Thanks.
> Shuai
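
For concreteness, a minimal sketch of how Kyle's proposed three-way
setting above might gate soft offline (the enum names and the helper
are hypothetical, not merged kernel code):

enum {
	SOFT_OFFLINE_DISABLED    = 0,	/* soft offline disabled */
	SOFT_OFFLINE_NORMAL_ONLY = 1,	/* normal pages only, skip hugepages */
	SOFT_OFFLINE_ALL         = 2,	/* normal pages and hugepages */
};

static bool soft_offline_allowed(struct folio *folio, int setting)
{
	switch (setting) {
	case SOFT_OFFLINE_NORMAL_ONLY:
		return !folio_test_hugetlb(folio);
	case SOFT_OFFLINE_ALL:
		return true;
	case SOFT_OFFLINE_DISABLED:
	default:
		return false;
	}
}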