From: "Zach O'Keefe" <zokeefe@google.com>
To: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>,
David Hildenbrand <david@redhat.com>,
David Rientjes <rientjes@google.com>,
Michal Hocko <mhocko@suse.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
Linux MM <linux-mm@kvack.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Arnd Bergmann <arnd@arndb.de>,
Axel Rasmussen <axelrasmussen@google.com>,
Chris Kennelly <ckennelly@google.com>,
Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
Hugh Dickins <hughd@google.com>,
Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
Jens Axboe <axboe@kernel.dk>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Matthew Wilcox <willy@infradead.org>,
Matt Turner <mattst88@gmail.com>,
Max Filippov <jcmvbkbc@gmail.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Minchan Kim <minchan@kernel.org>,
Patrick Xia <patrickx@google.com>,
Pavel Begunkov <asml.silence@gmail.com>,
Peter Xu <peterx@redhat.com>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Subject: Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
Date: Thu, 10 Mar 2022 07:50:49 -0800 [thread overview]
Message-ID: <CAAa6QmSDd6U=ZWRd-Z1W_n+UTgY3e+8W-jYoPL=9G6_vJakjfQ@mail.gmail.com> (raw)
In-Reply-To: <CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com>
On Wed, Mar 9, 2022 at 6:16 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 5:10 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 4:41 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > >
> > > > > > In madvise collapse context, we optionally want to be able to ignore
> > > > > > advice from MADV_NOHUGEPAGE-marked regions.
> > > > >
> > > > > Could you please elaborate why this usecase is valid? Typically
> > > > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > > > >
> > > >
> > > > Hey Yang, thanks for taking time to review and comment.
> > > >
> > > > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > > > the user to say "I don't want hugepages here", so that the kernel
> > > > knows not to do so when faulting memory, and khugepaged can stay away.
> > > > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > > > backed by hugepages - so presumably that is exactly what they want.
> > > >
> > > > IOW, if the user didn't want this memory to be backed by hugepages,
> > > > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > > > the user wanted collapsed, but that had some sub-areas marked
> > > > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > > > operations around the excluded regions.
> > > >
> > > > In terms of use cases, I don't have a concrete example, but a user
> > > > could hypothetically choose to exclude regions from management from
> > > > khugepaged, but still be able to collapse the memory themselves,
> > > > when/if they deem appropriate.
> > >
> > > I see. It seems you thought MADV_COLLAPSE actually unsets
> > > VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> > > right away, right? To some degree, it makes some sense.
> >
> > Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
> > ignores VM_NOHUGEPAGE, and so it's not really the same as
> > MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
> > to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
> > sure it does) then we don't want to be altering the vma flags since we
> > don't want to touch khugepaged behavior.
> >
> > > If this is the
> > > behavior you'd like to achieve, I'd suggest making it more explicit,
> > > for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> > > than ignore or change vm flags silently. When using madvise mode, but
> > > not having VM_HUGEPAGE set, the vma check should fail in the current
> > > code (I didn't look hard if you already covered this or not).
> > >
> >
> > You're correct, this will fail, since it's following the same
> > semantics as the fault path. I see what you're saying though; that
> > perhaps this is inconsistent with my above reasoning that "the user
> > asked to collapse this memory, and so we should do it". If so, then
> > perhaps MADV_COLLAPSE just ignores madise mode and VM_[NO]HUGEPAGE
> > entirely for the purposes of eligibility, and only uses it for the
> > purposes of determining gfp flags for compaction/reclaim. Pushing that
> > further, compaction/reclaim could entirely be specified by the user
> > using a process_madvise(2) flag (later in the series, we do something
> > like this).
>
> Anyway I think we could have two options for MADV_COLLAPSE:
>
> 1. Just treat it as a hint (nice to have, best effort). It should obey
> all the settings. Skip VM_NOHUGEPAGE vmas or vmas without VM_HUGEPAGE
> if madvise mode, etc.
>
> 2. Much stronger. It equals MADV_HUGEPAGE + synchronous collapse. It
> should set vma flags properly as I suggested.
>
> Either is fine to me. But I don't prefer something in between personally.
>
Makes sense to be consistent. Of these, #1 seems the most
straightforward to use. Doing an MADV_COLLAPSE on a VM_NOHUGEPAGE vma
seems like a corner case. The more likely scenario is MADV_COLLAPSE on
an unflagged (neither VM_HUGEPAGE or VM_NOHUGEPAGE) vma - in which
case it's less intrusive to not additionally set VM_HUGEPAGE (though
the user can always do so if they wish). It's a little more consistent
with "always" mode, where MADV_HUGEPAGE isn't necessary for
eligibility. It'll also reduce some code complexity.
I'll float one last option your way, however:
3. The collapsed region is always eligible, regardless of vma flags or
thp settings (except "never"?). For process_madvise(2), a flag will
explicitly specify defrag semantics.
This separates "async-hint" vs "sync-explicit" madvise requests.
MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
the kernel how to treat memory in the future. The kernel uses
VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
request, is free to define its own defrag semantics.
This would allow flexibility to separately define async vs sync thp
policies. For example, highly tuned userspace applications that are
sensitive to unexpected latency might want to manage their hugepages
utilization themselves, and ask khugepaged to stay away. There is no
way in "always" mode to do this without setting VM_NOHUGEPAGE.
> >
> >
> > > >
> > > > > >
> > > > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > > > which can be used to ignore vm flags used when considering thp
> > > > > > eligibility.
> > > > > >
> > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > > ---
> > > > > > mm/khugepaged.c | 18 ++++++++++++------
> > > > > > 1 file changed, 12 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > > > > #endif
> > > > > >
> > > > > > /*
> > > > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > > > + * collapse context.
> > > > > > */
> > > > > > static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > unsigned long address, int nr,
> > > > > > + unsigned long vm_flags_ignore,
> > > > > > struct vm_area_struct **vmap)
> > > > > > {
> > > > > > struct vm_area_struct *vma;
> > > > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > hend = vma->vm_end & HPAGE_PMD_MASK;
> > > > > > if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > > > > return SCAN_ADDRESS_RANGE;
> > > > > > - if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > > > + if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > > > > return SCAN_VMA_CHECK;
> > > > > > /* Anon VMA expected */
> > > > > > if (!vma->anon_vma || vma->vm_ops)
> > > > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > */
> > > > > >
> > > > > > static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > > > + unsigned long vm_flags_ignore,
> > > > > > struct vm_area_struct **vmap)
> > > > > > {
> > > > > > - return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > > > + return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > > > + vm_flags_ignore, vmap);
> > > > > > }
> > > > > >
> > > > > > /*
> > > > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > > > > /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > > > > if (ret & VM_FAULT_RETRY) {
> > > > > > mmap_read_lock(mm);
> > > > > > - if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > > > + if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > > > > /* vma is no longer available, don't continue to swapin */
> > > > > > trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > > > > return false;
> > > > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > > > >
> > > > > > mmap_read_lock(mm);
> > > > > > - result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > + result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > if (result) {
> > > > > > mmap_read_unlock(mm);
> > > > > > goto out_nolock;
> > > > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > */
> > > > > > mmap_write_lock(mm);
> > > > > >
> > > > > > - result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > + result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > if (result)
> > > > > > goto out_up_write;
> > > > > > /* check if the pmd is still valid */
> > > > > > --
> > > > > > 2.35.1.616.g0bdcbb4464-goog
> > > > > >
next prev parent reply other threads:[~2022-03-10 15:51 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper Zach O'Keefe
2022-03-09 22:48 ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-03-09 22:53 ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 03/14] mm/khugepaged: add __do_collapse_huge_page() helper Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 04/14] mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 05/14] mm/khugepaged: add mmap_assert_locked() checks to scan_pmd() Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count() Zach O'Keefe
2022-03-09 23:15 ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count() Zach O'Keefe
2022-03-09 23:17 ` Yang Shi
2022-03-10 0:00 ` Zach O'Keefe
2022-03-10 0:41 ` Yang Shi
2022-03-10 1:09 ` Zach O'Keefe
2022-03-10 2:16 ` Yang Shi
2022-03-10 15:50 ` Zach O'Keefe [this message]
2022-03-10 18:17 ` Yang Shi
2022-03-10 18:46 ` David Rientjes
2022-03-10 18:58 ` Zach O'Keefe
2022-03-10 19:54 ` Yang Shi
2022-03-10 20:24 ` Zach O'Keefe
2022-03-10 18:53 ` Zach O'Keefe
2022-03-10 15:56 ` David Hildenbrand
2022-03-10 18:39 ` Zach O'Keefe
2022-03-10 18:54 ` David Rientjes
2022-03-21 14:27 ` Michal Hocko
2022-03-08 21:34 ` [RFC PATCH 08/14] mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled() Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP Zach O'Keefe
2022-03-09 23:40 ` Yang Shi
2022-03-10 0:46 ` Zach O'Keefe
2022-03-10 2:05 ` Yang Shi
2022-03-10 8:37 ` Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 10/14] mm/khugepaged: rename khugepaged-specific/not functions Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-03-09 23:43 ` Yang Shi
2022-03-10 1:11 ` Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse Zach O'Keefe
2022-03-10 0:06 ` Yang Shi
2022-03-10 19:26 ` David Rientjes
2022-03-10 20:16 ` Matthew Wilcox
2022-03-11 0:06 ` Zach O'Keefe
2022-03-25 16:51 ` Zach O'Keefe
2022-03-25 19:54 ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 13/14] mm/madvise: add __madvise_collapse_*_batch() actions Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 14/14] mm/madvise: add process_madvise(MADV_COLLAPSE) Zach O'Keefe
2022-03-21 14:32 ` [RFC PATCH 00/14] mm: userspace hugepage collapse Zi Yan
2022-03-21 14:51 ` Zach O'Keefe
2022-03-21 14:37 ` Michal Hocko
2022-03-21 15:46 ` Zach O'Keefe
2022-03-22 12:11 ` Michal Hocko
2022-03-22 15:53 ` Zach O'Keefe
2022-03-29 12:24 ` Michal Hocko
2022-03-30 0:36 ` Zach O'Keefe
2022-03-22 6:40 ` Zach O'Keefe
2022-03-22 12:05 ` Michal Hocko
2022-03-23 13:30 ` Zach O'Keefe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAAa6QmSDd6U=ZWRd-Z1W_n+UTgY3e+8W-jYoPL=9G6_vJakjfQ@mail.gmail.com' \
--to=zokeefe@google.com \
--cc=James.Bottomley@hansenpartnership.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alex.shi@linux.alibaba.com \
--cc=arnd@arndb.de \
--cc=asml.silence@gmail.com \
--cc=axboe@kernel.dk \
--cc=axelrasmussen@google.com \
--cc=chris@zankel.net \
--cc=ckennelly@google.com \
--cc=david@redhat.com \
--cc=deller@gmx.de \
--cc=hughd@google.com \
--cc=ink@jurassic.park.msu.ru \
--cc=jcmvbkbc@gmail.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linmiaohe@huawei.com \
--cc=linux-mm@kvack.org \
--cc=mattst88@gmail.com \
--cc=mhocko@suse.com \
--cc=minchan@kernel.org \
--cc=pasha.tatashin@soleen.com \
--cc=patrickx@google.com \
--cc=peterx@redhat.com \
--cc=rientjes@google.com \
--cc=shy828301@gmail.com \
--cc=sj@kernel.org \
--cc=songliubraving@fb.com \
--cc=tsbogend@alpha.franken.de \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox