From: David Rientjes <rientjes@google.com>
To: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
Alex Shi <alex.shi@linux.alibaba.com>,
Hugh Dickins <hughd@google.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Song Liu <songliubraving@fb.com>,
Matthew Wilcox <willy@infradead.org>,
Minchan Kim <minchan@kernel.org>,
Chris Kennelly <ckennelly@google.com>,
linux-mm@kvack.org, linux-api@vger.kernel.org
Subject: Re: [RFC] Hugepage collapse in process context
Date: Thu, 18 Feb 2021 14:34:56 -0800 (PST)
Message-ID: <5127b9c-a147-8ef5-c942-ae8c755413d0@google.com>
In-Reply-To: <600ee57f-d839-d402-fb0f-e9f350114dce@redhat.com>

On Thu, 18 Feb 2021, David Hildenbrand wrote:
> > > > Hi everybody,
> > > >
> > > > Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> > > > That's normally fine as a system-wide setting, but some applications
> > > > would benefit from a more aggressive approach (as long as they are
> > > > willing to pay for it).
> > > >
> > > > Instead of adding priorities for eligible ranges of memory to
> > > > khugepaged, temporarily speeding khugepaged up for the whole system,
> > > > or sharding its work for memory belonging to a certain process, one
> > > > approach would be to allow userspace to induce hugepage collapse.
> > > >
> > > > The benefit to this approach would be that this is done in process
> > > > context so its cpu is charged to the process that is inducing the
> > > > collapse. Khugepaged is not involved.
> > >
> > > Yes, this makes a lot of sense to me.
> > >
> > > > Idea was to allow userspace to induce hugepage collapse through the new
> > > > process_madvise() call. This allows us to collapse hugepages on behalf
> > > > of current or another process for a vectored set of ranges.
> > >
> > > Yes, madvise sounds like a good fit for the purpose.
> >
> > Agreed on both points.
> >
> > > > This could be done through a new process_madvise() mode *or* it could be
> > > > a flag to MADV_HUGEPAGE since process_madvise() allows for a flag
> > > > parameter to be passed. For example, MADV_F_SYNC.
> > >
> > > Would this MADV_F_SYNC be applicable to other madvise modes? Most
> > > existing madvise modes do not seem to make much sense. We can argue that
> > > MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> > > sure we want to provide such a strong semantic because it can limit
> > > future reclaim optimizations.
> > >
> > > To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> >
> > I guess in the old madvise(2) we could create a new combo of MADV_HUGEPAGE |
> > MADV_WILLNEED with this semantic? But you are probably more interested in
> > process_madvise() anyway. There the new flag would make more sense. But
> > there's also David H.'s proposal for MADV_POPULATE and there might be
> > benefit in considering both at the same time? Should e.g. MADV_POPULATE
> > with MADV_HUGEPAGE have the collapse semantics? But would MADV_POPULATE be
> > added to process_madvise() as well? Just thinking out loud so we don't end
> > up with more flags than necessary, it's already confusing enough as it is.
> >
>
> Note that madvise() eats only a single value, not flags. Combinations as you
> describe are not possible.
>
> Something like MADV_HUGEPAGE_COLLAPSE makes sense to me: it does not need
> the mmap lock in write mode and does not modify the actual VMA, only a
> mapping.
>
Agreed, and happy to see that there's a general consensus on the
direction. The benefit of a new madvise mode is that it can be used with
madvise() as well if you are only interested in a single range of your own
memory, and then it doesn't need to be reconciled with any of the already
overloaded semantics of MADV_HUGEPAGE.

Otherwise, process_madvise() can be used for other processes and/or
vectored ranges.
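
To make the single-range case concrete, here is a rough sketch of what
userspace might do for a range of its own memory. It rests on two
assumptions that are not settled in this thread: no collapse mode existed
when this was written, so the sketch falls back to the value 25, which
kernels from 6.1 onwards define as MADV_COLLAPSE (older kernels return
EINVAL for it), and it assumes 2MB PMD-sized hugepages as on x86-64.
hpage_aligned_span() and collapse_own_range() are hypothetical helper
names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: stand-in for the mode proposed in this thread. Kernels
 * from 6.1 define MADV_COLLAPSE as 25; older kernels reject it. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2MB PMD-sized hugepages (x86-64). */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Shrink [addr, addr + len) to its largest hugepage-aligned subrange.
 * Stores the aligned start in *start and returns the subrange length,
 * or 0 if no full hugepage fits. */
static size_t hpage_aligned_span(uintptr_t addr, size_t len, uintptr_t *start)
{
	uintptr_t mask = ~(uintptr_t)(HPAGE_SIZE - 1);
	uintptr_t lo = (addr + HPAGE_SIZE - 1) & mask;	/* round start up */
	uintptr_t hi = (addr + len) & mask;		/* round end down */

	*start = lo;
	return hi > lo ? (size_t)(hi - lo) : 0;
}

/* Request a synchronous collapse of the aligned part of a range of our
 * own memory. Returns 0 on success or a negative errno. */
static int collapse_own_range(void *addr, size_t len)
{
	uintptr_t start;
	size_t span = hpage_aligned_span((uintptr_t)addr, len, &start);

	if (!span)
		return -EINVAL;
	assert(start % HPAGE_SIZE == 0);

	if (madvise((void *)start, span, MADV_COLLAPSE))
		return -errno;
	return 0;
}
```

The cpu time spent in that madvise() call is charged to the caller, which
is the point of doing this in process context.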

Song's use case of using this to prioritize THP usage is very important
for us as well. I hadn't thought of the madvise(MADV_HUGEPAGE) +
madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating that the
latter would allocate the hugepage with khugepaged's gfp mask so it would
always compact. But it seems like it would actually be better to use the
gfp mask that would be used at fault time for the vma and leave it to
userspace to determine whether that's MADV_HUGEPAGE or not. Makes sense.

(Userspace could even do madvise(MADV_NOHUGEPAGE) +
madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but
otherwise exclude the range from khugepaged's consideration if it were so
inclined.)
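
The two-step pattern above could look roughly like this from userspace.
This is only a sketch of the semantics being floated here:
set_policy_and_collapse() is a hypothetical helper, the collapse mode is
again stood in for by the value 25 (MADV_COLLAPSE in kernels from 6.1
onwards), and whether a collapse request is honoured on a
MADV_NOHUGEPAGE range depends on the semantics eventually chosen, so
expect EINVAL where it is not.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: stand-in for the mode proposed in this thread. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* First set the THP policy on the vma, which under the proposal would
 * select the gfp mask used for the allocation, then request the
 * synchronous collapse. policy must be MADV_HUGEPAGE or MADV_NOHUGEPAGE.
 * Returns 0 on success or a negative errno. */
static int set_policy_and_collapse(void *addr, size_t len, int policy)
{
	if (policy != MADV_HUGEPAGE && policy != MADV_NOHUGEPAGE)
		return -EINVAL;

	if (madvise(addr, len, policy))
		return -errno;

	if (madvise(addr, len, MADV_COLLAPSE))
		return -errno;

	return 0;
}
```

Passing MADV_HUGEPAGE asks for the more aggressive allocation behaviour;
passing MADV_NOHUGEPAGE is the variant that would keep khugepaged away
from the range afterwards.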

Two other minor points:

 - Currently, process_madvise() doesn't use the flags parameter at all, so
   there's the question of whether we need generalized flags that apply to
   most madvise modes or whether the flags can be specific to the mode
   being used. For example, a natural extension of this new mode would be
   a flag to select the hugepage size if we were ever to support
   synchronous collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

 - We haven't discussed the future of khugepaged with this new mode: it
   seems like we could simply implement khugepaged fully in userspace and
   remove it from the kernel? :)
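
For the vectored case, a userspace agent (whether acting for another
process or as a would-be userspace khugepaged) might issue something like
the following. It is a sketch under assumptions: collapse_ranges() is a
hypothetical helper, the collapse mode is stood in for by the value 25
(MADV_COLLAPSE in kernels from 6.1 onwards), and the raw syscall numbers
are only fallbacks for older headers. flags is passed as 0 because no
flag bits are defined today.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: stand-in for the mode proposed in this thread. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Fallback for headers that predate process_madvise() (Linux 5.10). */
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/* Ask for a synchronous collapse of n ranges in the process referred to
 * by pidfd (obtained via pidfd_open(2)). Returns the number of bytes
 * advised on success or a negative errno. */
static long collapse_ranges(int pidfd, const struct iovec *vec, size_t n)
{
	/* The last argument is flags: it must be 0 for now. */
	long ret = syscall(SYS_process_madvise, (long)pidfd, vec, n,
			   (long)MADV_COLLAPSE, 0UL);

	return ret < 0 ? -errno : ret;
}
```

A mode-specific flag such as the MADV_F_1GB idea would be passed in place
of that trailing 0 if flags ever gain meaning.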