From: David Hildenbrand <david@redhat.com>
To: Zach O'Keefe <zokeefe@google.com>, Lance Yang <ioworker0@gmail.com>
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com,
linux-kernel@vger.kernel.org, Yang Shi <shy828301@gmail.com>,
Peter Xu <peterx@redhat.com>,
Michael Knyszek <mknyszek@google.com>,
Minchan Kim <minchan@kernel.org>, Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
Date: Wed, 17 Jan 2024 19:41:04 +0100
Message-ID: <22b24ce9-d143-4b5f-87da-bf68e4fa46d3@redhat.com>
In-Reply-To: <CAAa6QmR0rcdk_rJOzc88ZA4jm9K5LwxT4dSHiBX+nPyd6E3Ddw@mail.gmail.com>
On 17.01.24 18:10, Zach O'Keefe wrote:
> [+linux-mm & others]
>
> On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@gmail.com> wrote:
>>
>> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
>>
>> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
>> make a least-effort attempt at a synchronous collapse of memory at
>> their own expense.
>>
>> The only difference from MADV_COLLAPSE is that the new hugepage allocation
>> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
>>
>> The benefits of this approach are:
>>
>> * CPU is charged to the process that wants to spend the cycles for the THP
>> * Avoid unpredictable timing of khugepaged collapse
>> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
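
For readers less familiar with the interface, a minimal userspace sketch of
how this advice would be called (MADV_TRY_COLLAPSE is the constant proposed
by this patch and is not in any released uapi header, hence the guard):

#include <stdio.h>
#include <sys/mman.h>

/* Ask the kernel for a fail-fast collapse of [addr, addr + len). */
static int try_collapse(void *addr, size_t len)
{
#ifdef MADV_TRY_COLLAPSE
	if (madvise(addr, len, MADV_TRY_COLLAPSE) == 0)
		return 0;	/* region is now (or already was) PMD-mapped */
	perror("madvise(MADV_TRY_COLLAPSE)");
	return -1;
#else
	(void)addr; (void)len;
	return -1;		/* headers/kernel without the proposed advice */
#endif
}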
>>
>> Semantics
>>
>> This call is independent of the system-wide THP sysfs settings, but will
>> fail for memory marked VM_NOHUGEPAGE. If the provided ranges span multiple
>> VMAs, the semantics of the collapse over each VMA are independent of the
>> others. This implies a hugepage cannot cross a VMA boundary. If collapse of
>> a given hugepage-aligned/sized region fails, the operation may continue
>> attempting to collapse the remainder of the specified memory.
>>
>> The memory ranges provided must be page-aligned, but are not required to
>> be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
>> start/end of the range will be clamped to the first/last hugepage-aligned
>> address covered by said range. The memory ranges must span at least one
>> hugepage-sized region.
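
A sketch of that clamping, assuming a 2 MiB PMD hugepage size for the sake
of the example (the real size is architecture-dependent):

#include <stdint.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumed PMD hugepage size (x86-64) */

/*
 * Round the user-supplied range inward to hugepage boundaries; the request
 * is only meaningful if at least one full hugepage remains after clamping.
 */
static int clamp_range(uintptr_t start, uintptr_t end,
		       uintptr_t *hstart, uintptr_t *hend)
{
	*hstart = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);	/* round up */
	*hend = end & ~(HPAGE_SIZE - 1);				/* round down */
	return *hend > *hstart ? 0 : -1;
}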
>>
>> All non-resident pages covered by the range will first be
>> swapped/faulted-in, before being internally copied onto a freshly
>> allocated hugepage. Unmapped pages will have their data directly
>> initialized to 0 in the new hugepage. However, for every eligible
>> hugepage aligned/sized region to-be collapsed, at least one page must
>> currently be backed by memory (a PMD covering the address range must
>> already exist).
>>
>> Allocation for the new hugepage will not enter direct reclaim and/or
>> compaction; it fails quickly if a free hugepage is not immediately
>> available. When the system has multiple NUMA nodes, the hugepage will be
>> allocated from the node providing the most native pages. This operation
>> acts on the current state of the specified process and makes no persistent
>> changes or guarantees about how pages will be mapped, constructed, or
>> faulted in the future.
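
On the kernel side, one plausible way to get that fail-fast behaviour (an
assumption for illustration, not a quote of the patch) is to pick a gfp mask
without __GFP_DIRECT_RECLAIM for the "try" variant:

#include <linux/gfp.h>
#include <linux/types.h>

/*
 * Sketch only: GFP_TRANSHUGE_LIGHT omits __GFP_DIRECT_RECLAIM, so the THP
 * allocation fails fast instead of entering reclaim/compaction, whereas
 * GFP_TRANSHUGE is what a regular synchronous collapse would use.
 */
static inline gfp_t collapse_gfp(bool fail_fast)
{
	return fail_fast ? GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE;
}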
>>
>> Return Value
>>
>> If all hugepage-sized/aligned regions covered by the provided range were
>> either successfully collapsed, or were already PMD-mapped THPs, this
>> operation will be deemed successful. On success, madvise(2) returns 0.
>> Else, -1 is returned and errno is set to indicate the error for the
>> most-recently attempted hugepage collapse. Note that many failures might
>> have occurred, since the operation may continue collapsing the remaining
>> regions even if a single hugepage-sized/aligned region fails.
>>
>> ENOMEM  Memory allocation failed or VMA not found.
>> EBUSY   Memcg charging failed.
>> EAGAIN  A required resource was temporarily unavailable; trying again
>>         may succeed.
>> EINVAL  Other error: no PMD found, a subpage doesn't have the Present
>>         bit set, a "special" page not backed by a struct page, VMA
>>         incorrectly sized, address not page-aligned, ...
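
Since errno only describes the most recent failure, a caller mostly gets a
hint rather than a full report; a hedged sketch of how an allocator might
classify the result (error names follow the table above):

#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Attempt a collapse and report whether a later retry looks worthwhile. */
static bool collapse_worth_retrying(void *addr, size_t len, int advice)
{
	if (madvise(addr, len, advice) == 0)
		return false;		/* success, nothing to retry */

	switch (errno) {
	case EAGAIN:			/* resource temporarily unavailable */
	case ENOMEM:			/* allocation failed; may clear up later */
		return true;
	default:			/* EBUSY, EINVAL, ...: give up */
		return false;
	}
}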
>>
>> Use Cases
>>
>> An immediate user of this new functionality is the Go runtime heap allocator
>> that manages memory in hugepage-sized chunks. In the past, whether it was a
>> newly allocated chunk through mmap() or a reused chunk released by
>> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
>> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
>> respectively. However, both approaches resulted in performance issues: in
>> both scenarios the kernel could enter direct reclaim and/or compaction,
>> leading to unpredictable stalls[4]. With this change, the allocator can use
>> madvise(MADV_TRY_COLLAPSE) to attempt to back memory with huge pages
>> without risking such stalls.
>>
>> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
>> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
>> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
>> [4] https://github.com/golang/go/issues/63334
>
> Thanks for the patch, Lance, and thanks for providing the links above,
> referring to issues Go has seen.
>
> I've reached out to the Go team to try and understand their use case,
> and how we could help. It's not immediately clear whether a
> lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> be.
>
> That said, with respect to the implementation, should a need for a
> lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start
> leveraging the forward-facing flags argument for these different
> advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> ("mm/madvise: remove racy mm ownership check") so that
> process_madvise(2) can always operate on self. IIRC, this was ~ the
> plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> sane default, and implement options in flags down the line).
+1, using process_madvise() would likely be the right approach.
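
Roughly what that could look like from userspace; note the flag below is
purely hypothetical (process_madvise(2) currently requires flags == 0), and
this assumes the revert Zach mentions so that the call may target the
calling process itself:

#define _GNU_SOURCE
#include <linux/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical flag selecting a fail-fast collapse; not in any kernel. */
#define MADVF_COLLAPSE_LIGHT	(1u << 0)

static long collapse_self(void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	long ret = -1;

	if (pidfd >= 0) {
		ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
			      MADV_COLLAPSE, MADVF_COLLAPSE_LIGHT);
		close(pidfd);
	}
	return ret;
}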
--
Cheers,
David / dhildenb
Thread overview: 4+ messages

[not found] <20240117050217.43610-1-ioworker0@gmail.com>
2024-01-17 17:10 ` Zach O'Keefe
2024-01-17 18:41   ` David Hildenbrand [this message]
2024-01-18  1:51     ` Lance Yang
2024-01-18  1:46   ` Lance Yang