From: Jane Chu <jane.chu@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>, Miaohe Lin <linmiaohe@huawei.com>
Cc: naoya.horiguchi@nec.com, akpm@linux-foundation.org,
shuah@kernel.org, corbet@lwn.net, osalvador@suse.de,
rientjes@google.com, duenwen@google.com, fvdl@google.com,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
linux-doc@vger.kernel.org, muchun.song@linux.dev
Subject: Re: [PATCH v1 0/3] Userspace controls soft-offline HugeTLB pages
Date: Mon, 10 Jun 2024 12:41:20 -0700
Message-ID: <2738aa0e-99d8-44d7-ac81-e38fd64591b7@oracle.com>
In-Reply-To: <CACw3F52Ws2R-7kBbo29==tU=FOV=8aiWFZH2aL2DS_5nuTGO=w@mail.gmail.com>
On 6/7/2024 3:22 PM, Jiaqi Yan wrote:
> On Tue, Jun 4, 2024 at 12:19 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>> On 2024/6/1 5:34, Jiaqi Yan wrote:
>>> Correctable memory errors are very common on servers with a large
>>> amount of memory, and are corrected by ECC, but with two
>>> pain points to users:
>>> 1. Correction usually happens on the fly and adds latency overhead
>>> 2. A not-fully-proven theory states that excessive correctable memory
>>> errors can develop into uncorrectable memory errors.
>> Thanks for your patch.
> Thanks Miaohe, sorry I missed your message (Gmail mistakenly put it in
> my spam folder).
>
>>> Soft offline is kernel's additional solution for memory pages
>>> having (excessive) corrected memory errors. Impacted page is migrated
>>> to healthy page if it is in use, then the original page is discarded
>>> for any future use.
>>>
>>> The actual policy on whether (and when) to soft offline should be
>>> maintained by userspace, especially in the case of HugeTLB hugepages.
>>> Soft-offline dissolves a hugepage, either in-use or free, into
>>> chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
>>> If userspace has not acknowledged such behavior, it may be surprised
>>> when a later mmap of hugepages returns MAP_FAILED due to lack of hugepages.
>> For in use hugetlb folio case, migrate_pages() is called. The hugetlb pool
>> capacity won't be modified in that case. So I assume you're referring to the
> I don't think so.
>
> For in-use hugetlb folio case, after migrate_pages, kernel will
> dissolve_free_hugetlb_folio the src hugetlb folio. At this point
> refcount of src hugetlb folio should be zero already, and
> remove_hugetlb_folio will reduce the hugetlb pool capacity (both
> nr_hugepages and free_hugepages) accordingly.
>
> For the free hugetlb folio case, dissolving also happens. But CE on
> free pages should be very rare (since no one is accessing except
> patrol scrubber).
>
> One of my test cases in patch 2/3 validates my point: the test case
> applies MADV_SOFT_OFFLINE to a mapped page, and at the point soft offline
> succeeds, both nr_hugepages and free_hugepages are reduced by 1.
>
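The accounting described above can be observed from userspace. What follows is only a minimal sketch, assuming the standard per-size sysfs files `nr_hugepages` and `free_hugepages`; the helper names `read_pool_counters` and `pool_delta` are hypothetical and are not part of the patch series or its selftest.

```python
from pathlib import Path

# Standard sysfs location of the per-size HugeTLB pool counters.
SYSFS_HUGEPAGES = Path("/sys/kernel/mm/hugepages")


def read_pool_counters(size_kb, root=SYSFS_HUGEPAGES):
    """Return (nr_hugepages, free_hugepages) for one hugepage size.

    `root` is a parameter only so the helper can be pointed at a
    different directory; it is an illustrative assumption.
    """
    pool = Path(root) / f"hugepages-{size_kb}kB"
    nr = int((pool / "nr_hugepages").read_text())
    free = int((pool / "free_hugepages").read_text())
    return nr, free


def pool_delta(before, after):
    """Return how much each counter dropped between two snapshots."""
    return before[0] - after[0], before[1] - after[1]
```

Taking one snapshot before and one after a successful soft offline of a mapped hugepage, the claim in the thread corresponds to `pool_delta(before, after)` being `(1, 1)`.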
>> free hugetlb folio case? The Hugetlb pool capacity is reduced in that case.
>> But if we don't do that, we might encounter uncorrectable memory error later
> If your concern is that more correctable errors will develop into more
> severe uncorrectable ones, your concern is absolutely valid. There is a
> tradeoff between reliability and performance (availability of hugetlb
> pages), but IMO it should be decided by userspace.
>
>> which will be more severe? Will it be better to add a way to compensate the
>> capacity?
> Corner cases: What if finding physically contiguous memory takes too
> long? What if we can't find any physically contiguous memory to
> compensate? (then hugetlb pool will still need to be reduced).
>
> If we treat "compensate" as an improvement to the overall soft offline
> process, it is something we can do in future and it is something
> orthogonal to this control API, right? I think if userspace explicitly
> tells kernel to soft offline, then they are also well-prepared for the
> corner cases above.
>
>>> In addition, discarding an entire 1G memory page only because of
>>> corrected memory errors sounds very costly, and the kernel had better
>>> not do it under the hood. But today there are at least 2 such cases:
>>> 1. GHES driver sees both GHES_SEV_CORRECTED and
>>> CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
>>> 2. RAS Correctable Errors Collector counts correctable errors per
>>> PFN and when the counter for a PFN reaches threshold
>>> In both cases, userspace has no control of the soft offline performed
>>> by kernel's memory failure recovery.
>> Userspace can figure out the hugetlb folio pfn range by using `page-types -b huge
>> -rlN` and then decide whether to soft offline the page according to it. But for
>> GHES driver, I think it has to be done in the kernel. So add a control in /sys/
>> seems like a good idea.
> Thanks.
>
>>> This patch series gives userspace control of soft-offlining
>>> HugeTLB pages: the kernel only soft offlines a hugepage if userspace has
>>> opted in for that specific hugepage size. The control is exposed to
>>> userspace by a new sysfs entry called softoffline_corrected_errors under
>>> the /sys/kernel/mm/hugepages/hugepages-${size}kB directory:
>>> * When softoffline_corrected_errors=0, skip soft offlining for all
>>> hugepages of size ${size}kB.
>>> * When softoffline_corrected_errors=1, soft offline as before this
>> Would it be better to call it "soft_offline_corrected_errors" or simply "soft_offline_enabled"?
> "soft_offline_enabled" is less optimal as it can't be extended to
> support something like "soft offline this PFN if something repeatedly
> requested soft offline this exact PFN x times". (although I don't
> think we need it).
The "x times" part is a threshold, and if your typical application
needs a say in favoring performance (and keeping physically
contiguous memory) over RAS, shouldn't that be baked into the driver
rather than hugetlbfs?

Also, I am not comfortable with this being hugetlbfs-specific. What is
the objection to creating a "soft_offline_enabled" switch that
applies to any user page size?
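
To make the two designs concrete, here is a rough sketch of the two gating policies being compared. It is illustrative only: the function names are hypothetical, the dictionary stands in for the per-size sysfs values, and the default of 1 for a size with no entry is an assumption rather than something the cover letter states.

```python
def per_size_policy(hugepage_size_kb, switches):
    """Per-hugepage-size gate, as in the proposed softoffline_corrected_errors.

    hugepage_size_kb: size of the HugeTLB page in kB, or None when the
        page is not a HugeTLB page.
    switches: mapping of hugepage size in kB to 0 (skip) or 1 (soft offline).
    """
    if hugepage_size_kb is None:
        # Not a HugeTLB page: this per-size control does not apply,
        # so soft offline proceeds as it does today.
        return True
    # Assumption: a size with no entry behaves like the value 1.
    return switches.get(hugepage_size_kb, 1) == 1


def global_policy(enabled):
    """Single "soft_offline_enabled"-style gate for any user page size."""
    return bool(enabled)
```

With `switches = {1048576: 0, 2048: 1}`, the per-size gate skips 1G hugepages but still soft offlines 2M hugepages and regular pages, whereas the single switch gives one answer for every page regardless of size.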
thanks,
-jane
>
> softoffline_corrected_errors is one char less, but if you insist,
> soft_offline_corrected_errors also works for me.
>
>> Thanks.
>> .
>>
Thread overview: 11+ messages
2024-05-31 21:34 Jiaqi Yan
2024-05-31 21:34 ` [PATCH v1 1/3] mm/memory-failure: userspace controls soft-offlining hugetlb pages Jiaqi Yan
2024-06-07 22:26 ` Jiaqi Yan
2024-05-31 21:34 ` [PATCH v1 2/3] selftest/mm: test softoffline_corrected_errors behaviors Jiaqi Yan
2024-05-31 21:34 ` [PATCH v1 3/3] docs: hugetlbpage.rst: add softoffline_corrected_errors Jiaqi Yan
2024-06-04 7:19 ` [PATCH v1 0/3] Userspace controls soft-offline HugeTLB pages Miaohe Lin
2024-06-07 22:22 ` Jiaqi Yan
2024-06-10 19:41 ` Jane Chu [this message]
2024-06-10 22:55 ` Jiaqi Yan
2024-06-11 17:55 ` Jane Chu
2024-06-11 18:12 ` Jiaqi Yan