From: David Hildenbrand <david@redhat.com>
To: Zach O'Keefe <zokeefe@google.com>, Lance Yang <ioworker0@gmail.com>
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com,
	linux-kernel@vger.kernel.org, Yang Shi <shy828301@gmail.com>,
	Peter Xu <peterx@redhat.com>,
	Michael Knyszek <mknyszek@google.com>,
	Minchan Kim <minchan@kernel.org>, Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
Date: Wed, 17 Jan 2024 19:41:04 +0100
Message-ID: <22b24ce9-d143-4b5f-87da-bf68e4fa46d3@redhat.com>
In-Reply-To: <CAAa6QmR0rcdk_rJOzc88ZA4jm9K5LwxT4dSHiBX+nPyd6E3Ddw@mail.gmail.com>

On 17.01.24 18:10, Zach O'Keefe wrote:
> [+linux-mm & others]
> 
> On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@gmail.com> wrote:
>>
>> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
>>
>> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
>> make a least-effort attempt at a synchronous collapse of memory at
>> their own expense.
>>
>> The only difference from MADV_COLLAPSE is that the new hugepage allocation
>> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
>>
>> The benefits of this approach are:
>>
>> * CPU is charged to the process that wants to spend the cycles for the THP
>> * Avoid unpredictable timing of khugepaged collapse
>> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
>>
>> Semantics
>>
>> This call is independent of the system-wide THP sysfs settings, but will
>> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
>> multiple VMAs, the semantics of the collapse over each VMA is independent
>> from the others.  This implies a hugepage cannot cross a VMA boundary.  If
>> collapse of a given hugepage-aligned/sized region fails, the operation may
>> continue to attempt collapsing the remainder of memory specified.
>>
>> The memory ranges provided must be page-aligned, but are not required to
>> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
>> start/end of the range will be clamped to the first/last hugepage-aligned
>> address covered by said range.  The memory ranges must span at least one
>> hugepage-sized region.
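The clamping rule above can be sketched in C. This is a minimal illustration, not code from the patch: the helper name clamp_to_hugepages and its parameters are hypothetical, and the hugepage size is supplied by the caller (2 MiB is typical for PMD-sized THPs on x86-64, but that is an assumption about the platform). The start is rounded up and the end is rounded down to a hugepage boundary; if nothing is left, the range does not span a full hugepage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: clamp [start, start + len) to the hugepage-aligned
 * sub-range it fully covers. Returns false when the range does not contain
 * at least one complete hugepage. Overflow of start + len is not handled.
 */
bool clamp_to_hugepages(uintptr_t start, size_t len, size_t hpage_size,
                        uintptr_t *hstart, uintptr_t *hend)
{
    /* The mask arithmetic below requires a power-of-two hugepage size. */
    assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);

    uintptr_t mask = (uintptr_t)hpage_size - 1;
    uintptr_t first = (start + mask) & ~mask; /* round start up */
    uintptr_t last = (start + len) & ~mask;   /* round end down */

    if (first >= last)
        return false; /* no complete hugepage inside the range */

    *hstart = first;
    *hend = last;
    return true;
}
```

With a 2 MiB hugepage size, the range [0x201000, 0x601000) clamps to [0x400000, 0x600000), while [0x201000, 0x401000) contains no complete hugepage and is rejected.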
>>
>> All non-resident pages covered by the range will first be
>> swapped/faulted-in, before being internally copied onto a freshly
>> allocated hugepage.  Unmapped pages will have their data directly
>> initialized to 0 in the new hugepage.  However, for every eligible
>> hugepage aligned/sized region to-be collapsed, at least one page must
>> currently be backed by memory (a PMD covering the address range must
>> already exist).
>>
>> Allocation for the new hugepage will not enter direct reclaim and/or
>> compaction, quickly failing if allocation fails. When the system has
>> multiple NUMA nodes, the hugepage will be allocated from the node providing
>> the most native pages. This operation acts on the current state of the
>> specified process and makes no persistent changes or guarantees on how pages
>> will be mapped, constructed, or faulted in the future.
>>
>> Return Value
>>
>> If all hugepage-sized/aligned regions covered by the provided range were
>> either successfully collapsed, or were already PMD-mapped THPs, this
>> operation will be deemed successful.  On success, madvise(2) returns 0.
>> Else, -1 is returned and errno is set to indicate the error for the
>> most-recently attempted hugepage collapse.  Note that many failures might
>> have occurred, since the operation may continue to collapse in the event a
>> single hugepage-sized/aligned region fails.
>>
>>          ENOMEM  Memory allocation failed or VMA not found
>>          EBUSY   Memcg charging failed
>>          EAGAIN  Required resource temporarily unavailable.  Trying again
>>                  might succeed.
>>          EINVAL  Other error: No PMD found, subpage doesn't have Present
>>                  bit set, "Special" page not backed by struct page, VMA
>>                  incorrectly sized, address not page-aligned, ...
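A caller might act on the errno values listed above roughly as follows. This is only a sketch under stated assumptions: it issues the existing MADV_COLLAPSE advice (available since Linux 6.1), because MADV_TRY_COLLAPSE is only proposed in this patch and has no value in released headers. The names collapse_outcome, classify_collapse_errno and try_collapse are hypothetical, and the retry policy is one possible choice rather than anything the patch prescribes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers, v6.1+ */
#endif

enum collapse_outcome { COLLAPSE_OK, COLLAPSE_RETRY, COLLAPSE_GIVE_UP };

/* Hypothetical policy: map the errno of the last attempted collapse to an action. */
enum collapse_outcome classify_collapse_errno(int err)
{
    switch (err) {
    case 0:
        return COLLAPSE_OK;
    case EAGAIN: /* resource temporarily unavailable, a retry might succeed */
        return COLLAPSE_RETRY;
    case ENOMEM: /* memory allocation failed or VMA not found */
    case EBUSY:  /* memcg charging failed */
    case EINVAL: /* no PMD, bad alignment or size, special page, ... */
    default:
        return COLLAPSE_GIVE_UP;
    }
}

/* Attempt one synchronous collapse and report what the caller should do next. */
enum collapse_outcome try_collapse(void *addr, size_t len)
{
    assert(addr != NULL);
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return COLLAPSE_OK;
    return classify_collapse_errno(errno);
}
```

Because errno only reflects the most recently attempted hugepage, a failure return does not tell the caller which regions, if any, were collapsed along the way.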
>>
>> Use Cases
>>
>> An immediate user of this new functionality is the Go runtime heap allocator
>> that manages memory in hugepage-sized chunks. In the past, whether it was a
>> newly allocated chunk through mmap() or a reused chunk released by
>> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
>> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
>> respectively. However, both approaches resulted in performance issues; for
>> both scenarios, there could be entries into direct reclaim and/or compaction,
>> leading to unpredictable stalls[4]. Now, the allocator can confidently use
>> madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
>>
>> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
>> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
>> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
>> [4] https://github.com/golang/go/issues/63334
> 
> Thanks for the patch, Lance, and thanks for providing the links above,
> referring to issues Go has seen.
> 
> I've reached out to the Go team to try and understand their use case,
> and how we could help. It's not immediately clear whether a
> lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> be.
> 
> That said, with respect to the implementation, should a need for a
> lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start
> leveraging the forward-facing flags argument for these different
> advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> ("mm/madvise: remove racy mm ownership check") so that
> process_madvise(2) can always operate on self. IIRC, this was ~ the
> plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> sane default, and implement options in flags down the line).

+1, using process_madvise() would likely be the right approach.
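
For illustration, calling process_madvise(2) on the current process could look roughly like the sketch below. The wrapper name self_process_madvise is hypothetical, and the code assumes kernel headers that define SYS_pidfd_open and SYS_process_madvise (Linux 5.10 or later). Today the flags argument must be 0, so any non-zero value is rejected; whether the call succeeds on self also depends on the kernel version and the caller's privileges, which is the limitation Zach mentions above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers, v6.1+ */
#endif

/*
 * Hypothetical wrapper: apply `advice` to [addr, addr + len) of the calling
 * process through process_madvise(2). Returns the number of bytes advised,
 * or -1 with errno set.
 */
long self_process_madvise(void *addr, size_t len, int advice,
                          unsigned int flags)
{
    /* process_madvise() takes a pidfd, so open one for ourselves. */
    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        return -1;

    struct iovec iov = { .iov_base = addr, .iov_len = len };
    long ret = syscall(SYS_process_madvise, pidfd, &iov, (size_t)1,
                       advice, flags);

    /* Preserve the syscall's errno across close(). */
    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;

    /* Sanity check: the kernel never advises more than was requested. */
    assert(ret < 0 || (size_t)ret <= len);
    return ret;
}
```

A caller would pass MADV_COLLAPSE and flags set to 0 today; a lighter-weight variant, if one is ever added, would presumably be selected through a flag bit that does not exist yet.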

-- 
Cheers,

David / dhildenb


