Re: [PATCH v1 0/3] Userspace controls soft-offline HugeTLB pages

linux-mm.kvack.org archive mirror

From: Jane Chu <jane.chu@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	naoya.horiguchi@nec.com, akpm@linux-foundation.org,
	shuah@kernel.org, corbet@lwn.net, osalvador@suse.de,
	rientjes@google.com, duenwen@google.com, fvdl@google.com,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	linux-doc@vger.kernel.org, muchun.song@linux.dev,
	Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: Re: [PATCH v1 0/3] Userspace controls soft-offline HugeTLB pages
Date: Tue, 11 Jun 2024 10:55:08 -0700
Message-ID: <f446406d-7739-4367-ac68-0a3f30c04612@oracle.com>
In-Reply-To: <CACw3F50rh08o0hAG1rSfUnuJ3wezjCa8_ZE4rUGRUntUfx+-OQ@mail.gmail.com>

On 6/10/2024 3:55 PM, Jiaqi Yan wrote:

> Thanks for your feedback, Jane!
>
> On Mon, Jun 10, 2024 at 12:41 PM Jane Chu <jane.chu@oracle.com> wrote:
>> On 6/7/2024 3:22 PM, Jiaqi Yan wrote:
>>
>>> On Tue, Jun 4, 2024 at 12:19 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>> On 2024/6/1 5:34, Jiaqi Yan wrote:
>>>>> Correctable memory errors are very common on servers with large
>>>>> amount of memory, and are corrected by ECC, but with two
>>>>> pain points to users:
>>>>> 1. Correction usually happens on the fly and adds latency overhead
>>>>> 2. Not-fully-proved theory states excessive correctable memory
>>>>>      errors can develop into uncorrectable memory error.
>>>> Thanks for your patch.
>>> Thanks Miaohe, sorry I missed your message (Gmail mistakenly put it in
>>> my spam folder).
>>>
>>>>> Soft offline is kernel's additional solution for memory pages
>>>>> having (excessive) corrected memory errors. Impacted page is migrated
>>>>> to healthy page if it is in use, then the original page is discarded
>>>>> for any future use.
>>>>>
>>>>> The actual policy on whether (and when) to soft offline should be
>>>>> maintained by userspace, especially in case of HugeTLB hugepages.
>>>>> Soft-offline dissolves a hugepage, either in-use or free, into
>>>>> chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
>>>>> If userspace has not acknowledged such behavior, it may be surprised
>>>>> when later mmap hugepages MAP_FAILED due to lack of hugepages.
>>>> For in use hugetlb folio case, migrate_pages() is called. The hugetlb pool
>>>> capacity won't be modified in that case. So I assume you're referring to the
>>> I don't think so.
>>>
>>> For in-use hugetlb folio case, after migrate_pages, kernel will
>>> dissolve_free_hugetlb_folio the src hugetlb folio. At this point
>>> refcount of src hugetlb folio should be zero already, and
>>> remove_hugetlb_folio will reduce the hugetlb pool capacity (both
>>> nr_hugepages and free_hugepages) accordingly.
>>>
>>> For the free hugetlb folio case, dissolving also happens. But CE on
>>> free pages should be very rare (since no one is accessing except
>>> patrol scrubber).
>>>
>>> One of my test cases in patch 2/3 validates my point: the test case
>>> MADV_SOFT_OFFLINE a mapped page and at the point soft offline
>>> succeeds, both nr_hugepages and nr_freepages are reduced by 1.
>>>
>>>> free hugetlb folio case? The Hugetlb pool capacity is reduced in that case.
>>>> But if we don't do that, we might encounter uncorrectable memory error later
>>> If your concern is more correctable error will develop into more
>>> severe uncorrectable, your concern is absolutely valid. There is a
>>> tradeoff between reliability vs performance (availability of hugetlb
>>> pages), but IMO should be decided by userspace.
>>>
>>>> which will be more severe? Will it be better to add a way to compensate the
>>>> capacity?
>>> Corner cases: What if finding physically contiguous memory takes too
>>> long? What if we can't find any physically contiguous memory to
>>> compensate? (then hugetlb pool will still need to be reduced).
>>>
>>> If we treat "compensate" as an improvement to the overall soft offline
>>> process, it is something we can do in future and it is something
>>> orthogonal to this control API, right? I think if userspace explicitly
>>> tells kernel to soft offline, then they are also well-prepared for the
>>> corner cases above.
>>>
>>>>> In addition, discarding the entire 1G memory page only because of
>>>>> corrected memory errors sounds very costly and kernel better not
>>>>> doing under the hood. But today there are at least 2 such cases:
>>>>> 1. GHES driver sees both GHES_SEV_CORRECTED and
>>>>>      CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
>>>>> 2. RAS Correctable Errors Collector counts correctable errors per
>>>>>      PFN and when the counter for a PFN reaches threshold
>>>>> In both cases, userspace has no control of the soft offline performed
>>>>> by kernel's memory failure recovery.
>>>> Userspace can figure out the hugetlb folio pfn range by using `page-types -b huge
>>>> -rlN` and then decide whether to soft offline the page according to it. But for
>>>> GHES driver, I think it has to be done in the kernel. So add a control in /sys/
>>>> seems like a good idea.
>>> Thanks.
>>>
>>>>> This patch series gives userspace the control of soft-offlining
>>>>> HugeTLB pages: kernel only soft offlines hugepage if userspace has
>>>>> opted in for that specific hugepage size, and exposed to userspace
>>>>> by a new sysfs entry called softoffline_corrected_errors under
>>>>> /sys/kernel/mm/hugepages/hugepages-${size}kB directory:
>>>>> * When softoffline_corrected_errors=0, skip soft offlining for all
>>>>>     hugepages of size ${size}kB.
>>>>> * When softoffline_corrected_errors=1, soft offline as before this
>>>> Will it be better to be called as "soft_offline_corrected_errors" or simply "soft_offline_enabled"?
>>> "soft_offline_enabled" is less optimal as it can't be extended to
>>> support something like "soft offline this PFN if something repeatedly
>>> requested soft offline this exact PFN x times". (although I don't
>>> think we need it).
>> The "x time" thing is a threshold thing, and if your typical application
>> needs to have a say about performance (and maintaining physically
>> contiguous memory) over RAS, shouldn't that be baked into the driver
>> rather than hugetlbfs ?
> I mostly agree, only that I want to point out the threshold has
> already been maintained by some firmware. For example, CPER has
> something like the following defined in UEFI Spec Table N.5: Section
> Descriptor:
>
>    Bit 3 - Error threshold exceeded: If set, OS may choose to discontinue
>    use of this resource.
>
> In this case, I think "enable_soft_offline" is a better name for "OS
> choose to discontinue use of this page" (enable_soft_offline=1) or not
> (enable_soft_offline=0). WDYT?

Yes, as long as enable_soft_offline=1 is the default. Off the top of my
head, I suppose the CE count and threshold can be retrieved by the GHES
driver? I haven't checked. If so, maybe another way is to implement a
per-task CE threshold: add a new field, ce_threshold, to the task
struct; add a prctl(2) option for a user thread to specify a CE
threshold, plus an option to retrieve the firmware-defined default CE
threshold; and let soft_offline_page() check against
task->ce_threshold to decide whether to offline the page. If you want
to apply the CE threshold to patrol-scrub-triggered soft offline, then
you could define a global/system-wide CE threshold. That said, this
might be overkill for what you need; I'm just putting it out there for
the sake of brainstorming.
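
The per-task threshold idea above can be modeled in a few lines of
Python. This is only a userspace sketch of the decision logic, not
kernel code: the Task class stands in for the task struct,
ce_threshold is the proposed (not existing) field, the default value
of 10 is made up, and treating "count >= threshold" as the trigger is
an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical system-wide default, e.g. a value firmware might report.
FIRMWARE_DEFAULT_CE_THRESHOLD = 10


@dataclass
class Task:
    """Stand-in for the task struct; ce_threshold is the proposed field."""
    name: str
    ce_threshold: Optional[int] = None  # None: fall back to the global value


def effective_threshold(task: Optional[Task], global_threshold: int) -> int:
    """Return the per-task threshold if one was set, else the global one.

    task is None when no task owns the page (e.g. patrol scrub).
    """
    if task is not None and task.ce_threshold is not None:
        return task.ce_threshold
    return global_threshold


def should_soft_offline(ce_count: int, task: Optional[Task],
                        global_threshold: int = FIRMWARE_DEFAULT_CE_THRESHOLD) -> bool:
    """Model of the check soft_offline_page() would make in this proposal."""
    return ce_count >= effective_threshold(task, global_threshold)
```

A task that sets a high threshold keeps its pages longer, while
patrol scrub (no owning task) falls back to the global value.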

>
>> Also, I am not comfortable with this being hugetlbfs specific. What is
>> the objection to creating a "soft_offline_enabled" switch that is
>> applicable to any user page size?
> I have no objection to making the "soft_offline_enabled" switch to
> apply to anything (hugetlb, transparent hugepage, raw page, etc). The
> only reason my current patch is hugetlb specific is because
> softoffline behavior is very disruptive in the hugetlb 1G page case,
> and I want to start with a limited scope in my first attempt.
>
> If Miaohe, you, and other people are fine with making it applicable to
> any user pages, maybe a better interface for this could be at
> something like /sys/devices/system/memory/enable_soft_offline
> (location-wise close to /sys/devices/system/memory/soft_offline_page)?

Or, you could use /proc/sys/vm/enable_soft_offline, side by side with 
the existing 'memory_failure_early_kill' and 'memory_failure_recovery' 
switches.
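
As a rough illustration of how a userspace agent could consume such a
switch, the Python sketch below reads integer knobs from /proc/sys/vm.
memory_failure_early_kill and memory_failure_recovery exist today;
enable_soft_offline is only the name proposed in this thread and may
be absent, in which case the sketch assumes soft offline stays enabled
(the "default is 1" behavior discussed above). The helper names are
made up for this example.

```python
from pathlib import Path
from typing import Optional


def read_vm_switch(name: str, root: str = "/proc/sys/vm") -> Optional[int]:
    """Return the integer value of a vm sysctl, or None if unavailable."""
    path = Path(root) / name
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def soft_offline_allowed(root: str = "/proc/sys/vm") -> bool:
    """Treat a missing enable_soft_offline file as 'enabled' (assumption)."""
    value = read_vm_switch("enable_soft_offline", root)
    return True if value is None else value != 0


if __name__ == "__main__":
    for knob in ("memory_failure_early_kill",
                 "memory_failure_recovery",
                 "enable_soft_offline"):
        print(knob, "=", read_vm_switch(knob))
```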

You could also make 'enable_soft_offline' a per-process option, similar
to 'PR_MCE_KILL_EARLY' in prctl(2).
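
For reference, this is roughly how the existing per-process knob is
driven from userspace today, sketched in Python via ctypes. The
constants are the ones in <linux/prctl.h>; the wrapper function names
are made up for this example. A per-process soft-offline option would
presumably look similar, but no such prctl option exists and none is
shown here.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_MCE_KILL = 33
PR_MCE_KILL_GET = 34
PR_MCE_KILL_CLEAR = 0
PR_MCE_KILL_SET = 1
PR_MCE_KILL_LATE = 0
PR_MCE_KILL_EARLY = 1
PR_MCE_KILL_DEFAULT = 2


def _prctl(option: int, arg2: int = 0, arg3: int = 0,
           arg4: int = 0, arg5: int = 0) -> int:
    """Thin wrapper around prctl(2); raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.prctl(ctypes.c_int(option),
                     ctypes.c_ulong(arg2), ctypes.c_ulong(arg3),
                     ctypes.c_ulong(arg4), ctypes.c_ulong(arg5))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def set_mce_kill_policy(policy: int) -> None:
    """Set the calling thread's machine-check kill policy."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_SET, policy)


def get_mce_kill_policy() -> int:
    """Return the current policy (one of the PR_MCE_KILL_* policy values)."""
    return _prctl(PR_MCE_KILL_GET)
```

For example, set_mce_kill_policy(PR_MCE_KILL_EARLY) opts the calling
thread in to early kill.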

thanks,

-jane

>
>> thanks,
>>
>> -jane
>>
>>> softoffline_corrected_errors is one char less, but if you insist,
>>> soft_offline_corrected_errors also works for me.
>>>
>>>> Thanks.
>>>> .
>>>>

Thread overview: 11+ messages
2024-05-31 21:34 Jiaqi Yan
2024-05-31 21:34 ` [PATCH v1 1/3] mm/memory-failure: userspace controls soft-offlining hugetlb pages Jiaqi Yan
2024-06-07 22:26   ` Jiaqi Yan
2024-05-31 21:34 ` [PATCH v1 2/3] selftest/mm: test softoffline_corrected_errors behaviors Jiaqi Yan
2024-05-31 21:34 ` [PATCH v1 3/3] docs: hugetlbpage.rst: add softoffline_corrected_errors Jiaqi Yan
2024-06-04  7:19 ` [PATCH v1 0/3] Userspace controls soft-offline HugeTLB pages Miaohe Lin
2024-06-07 22:22   ` Jiaqi Yan
2024-06-10 19:41     ` Jane Chu
2024-06-10 22:55       ` Jiaqi Yan
2024-06-11 17:55         ` Jane Chu [this message]
2024-06-11 18:12           ` Jiaqi Yan
