linux-mm.kvack.org archive mirror
From: Lance Yang <ioworker0@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: "Zach O'Keefe" <zokeefe@google.com>,
	akpm@linux-foundation.org, songmuchun@bytedance.com,
	 linux-kernel@vger.kernel.org, Yang Shi <shy828301@gmail.com>,
	 Peter Xu <peterx@redhat.com>,
	Michael Knyszek <mknyszek@google.com>,
	 Minchan Kim <minchan@kernel.org>, Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
Date: Thu, 18 Jan 2024 09:51:32 +0800
Message-ID: <CAK1f24=MbVMrxWO2xa+9bJiqEKJ=DG68WQ5bE_LgW9=oTk6GwQ@mail.gmail.com>
In-Reply-To: <22b24ce9-d143-4b5f-87da-bf68e4fa46d3@redhat.com>

Hey David,

Thanks for taking the time to review!

On Thu, Jan 18, 2024 at 02:41, David Hildenbrand <david@redhat.com> wrote:
>
> On 17.01.24 18:10, Zach O'Keefe wrote:
> > [+linux-mm & others]
> >
> > On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@gmail.com> wrote:
> >>
> >> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >>
> >> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> >> make a least-effort attempt at a synchronous collapse of memory at
> >> their own expense.
> >>
> >> The only difference from MADV_COLLAPSE is that the new hugepage allocation
> >> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >>
> >> The benefits of this approach are:
> >>
> >> * CPU is charged to the process that wants to spend the cycles for the THP
> >> * Avoid unpredictable timing of khugepaged collapse
> >> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
> >>
> >> Semantics
> >>
> >> This call is independent of the system-wide THP sysfs settings, but will
> >> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> > >> multiple VMAs, the semantics of the collapse over each VMA are independent
> > >> of the others.  This implies a hugepage cannot cross a VMA boundary.  If
> >> collapse of a given hugepage-aligned/sized region fails, the operation may
> >> continue to attempt collapsing the remainder of memory specified.
> >>
> >> The memory ranges provided must be page-aligned, but are not required to
> >> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> >> start/end of the range will be clamped to the first/last hugepage-aligned
> >> address covered by said range.  The memory ranges must span at least one
> >> hugepage-sized region.
> >>
> >> All non-resident pages covered by the range will first be
> >> swapped/faulted-in, before being internally copied onto a freshly
> >> allocated hugepage.  Unmapped pages will have their data directly
> >> initialized to 0 in the new hugepage.  However, for every eligible
> >> hugepage aligned/sized region to-be collapsed, at least one page must
> >> currently be backed by memory (a PMD covering the address range must
> >> already exist).
> >>
> >> Allocation for the new hugepage will not enter direct reclaim and/or
> >> compaction, quickly failing if allocation fails. When the system has
> >> multiple NUMA nodes, the hugepage will be allocated from the node providing
> >> the most native pages. This operation operates on the current state of the
> >> specified process and makes no persistent changes or guarantees on how pages
> >> will be mapped, constructed, or faulted in the future.
> >>
> >> Return Value
> >>
> >> If all hugepage-sized/aligned regions covered by the provided range were
> >> either successfully collapsed, or were already PMD-mapped THPs, this
> >> operation will be deemed successful.  On success, madvise(2) returns 0.
> >> Else, -1 is returned and errno is set to indicate the error for the
> >> most-recently attempted hugepage collapse.  Note that many failures might
> >> have occurred, since the operation may continue to collapse in the event a
> >> single hugepage-sized/aligned region fails.
> >>
> >>          ENOMEM  Memory allocation failed or VMA not found
> >>          EBUSY   Memcg charging failed
> > >>          EAGAIN  Required resource temporarily unavailable.  Trying again
> > >>                  might succeed.
> > >>          EINVAL  Other error: No PMD found, subpage doesn't have Present
> > >>                  bit set, "Special" page not backed by struct page, VMA
> > >>                  incorrectly sized, address not page-aligned, ...
> >>
> >> Use Cases
> >>
> >> An immediate user of this new functionality is the Go runtime heap allocator
> >> that manages memory in hugepage-sized chunks. In the past, whether it was a
> >> newly allocated chunk through mmap() or a reused chunk released by
> >> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> >> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> >> respectively. However, both approaches resulted in performance issues; for
> >> both scenarios, there could be entries into direct reclaim and/or compaction,
> >> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> >> madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
> >>
> >> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> >> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> >> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> >> [4] https://github.com/golang/go/issues/63334
> >
> > Thanks for the patch, Lance, and thanks for providing the links above,
> > referring to issues Go has seen.
> >
> > I've reached out to the Go team to try and understand their use case,
> > and how we could help. It's not immediately clear whether a
> > lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> > be.
> >
> > That said, with respect to the implementation, should a need for a
> > lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> > process_madvise(2) be the "v2" of madvise(2), where we can start
> > leveraging the forward-facing flags argument for these different
> > advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> > ("mm/madvise: remove racy mm ownership check") so that
> > process_madvise(2) can always operate on self. IIRC, this was ~ the
> > plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> > sane default, and implement options in flags down the line).
>
> +1, using process_madvise() would likely be the right approach.

Thanks for your suggestion! I completely agree :)
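
For reference, a minimal sketch of what a collapse on self through
process_madvise(2) looks like from userspace today. The flags argument must
currently be 0, and no flag for a lighter-weight collapse exists yet, so
this only shows the call shape. The helper name is hypothetical, and the
fallback syscall numbers (434 and 440) are the values used by x86-64, arm64
and the generic syscall table, which is an assumption for other
architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values, see the note above). */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/*
 * Hypothetical helper: ask for a collapse of [addr, addr + len) in the
 * calling process via process_madvise(2).
 * Returns the number of bytes advised, or a negative errno value.
 */
static long collapse_self_via_process_madvise(void *addr, size_t len)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    long ret;
    int saved_errno;
    int pidfd;

    /* A pidfd that refers to the calling process itself. */
    pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0U);
    if (pidfd < 0)
        return -errno;

    /* The last argument is flags; it has to be 0 for now. */
    ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL, MADV_COLLAPSE, 0U);
    saved_errno = errno;
    close(pidfd);

    return ret < 0 ? -(long)saved_errno : ret;
}
```

As Zach points out above, operating on self is not guaranteed to be
permitted, so -EPERM is a possible result here, as are -ENOSYS and -EINVAL
on older kernels.
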
Lance

>
> --
> Cheers,
>
> David / dhildenb
>