From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zach O'Keefe" <zokeefe@google.com>
Date: Thu, 10 Mar 2022 10:53:48 -0800
Subject: Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
To: Yang Shi
Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko, Pasha Tatashin,
 SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
 Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
 Helge Deller, Hugh Dickins, Ivan Kokshaysky, "James E.J. Bottomley",
 Jens Axboe, "Kirill A. Shutemov", Matthew Wilcox, Matt Turner, Max Filippov,
 Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
 Thomas Bogendoerfer
References: <20220308213417.1407042-1-zokeefe@google.com> <20220308213417.1407042-8-zokeefe@google.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Mar 10, 2022 at 10:17 AM Yang Shi wrote:
>
> On Thu, Mar 10, 2022 at 7:51 AM Zach O'Keefe wrote:
> >
> > On Wed, Mar 9, 2022 at 6:16 PM Yang Shi wrote:
> > >
> > > On Wed, Mar 9, 2022 at 5:10 PM Zach O'Keefe wrote:
> > > >
> > > > On Wed, Mar 9, 2022 at 4:41 PM Yang Shi wrote:
> > > > >
> > > > > On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe wrote:
> > > > > >
> > > > > > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe wrote:
> > > > > > > >
> > > > > > > > In madvise collapse context, we optionally want to be able to ignore
> > > > > > > > advice from MADV_NOHUGEPAGE-marked regions.
> > > > > > >
> > > > > > > Could you please elaborate why this usecase is valid? Typically
> > > > > > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > > > > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > > > > > >
> > > > > >
> > > > > > Hey Yang, thanks for taking time to review and comment.
> > > > > >
> > > > > > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > > > > > the user to say "I don't want hugepages here", so that the kernel
> > > > > > knows not to do so when faulting memory, and khugepaged can stay away.
> > > > > > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > > > > > backed by hugepages - so presumably that is exactly what they want.
> > > > > >
> > > > > > IOW, if the user didn't want this memory to be backed by hugepages,
> > > > > > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > > > > > the user wanted collapsed, but that had some sub-areas marked
> > > > > > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > > > > > operations around the excluded regions.
> > > > > >
> > > > > > In terms of use cases, I don't have a concrete example, but a user
> > > > > > could hypothetically choose to exclude regions from management by
> > > > > > khugepaged, but still be able to collapse the memory themselves,
> > > > > > when/if they deem appropriate.
> > > > >
> > > > > I see. It seems you thought MADV_COLLAPSE actually unsets
> > > > > VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> > > > > right away, right? To some degree, it makes some sense.
> > > >
> > > > Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
> > > > ignores VM_NOHUGEPAGE, and so it's not really the same as
> > > > MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
> > > > to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
> > > > sure it does) then we don't want to be altering the vma flags since we
> > > > don't want to touch khugepaged behavior.
> > > >
> > > > > If this is the
> > > > > behavior you'd like to achieve, I'd suggest making it more explicit,
> > > > > for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> > > > > than ignore or change vm flags silently.
> > > > > When using madvise mode, but
> > > > > not having VM_HUGEPAGE set, the vma check should fail in the current
> > > > > code (I didn't look hard if you already covered this or not).
> > > >
> > > > You're correct, this will fail, since it's following the same
> > > > semantics as the fault path. I see what you're saying though; that
> > > > perhaps this is inconsistent with my above reasoning that "the user
> > > > asked to collapse this memory, and so we should do it". If so, then
> > > > perhaps MADV_COLLAPSE just ignores madvise mode and VM_[NO]HUGEPAGE
> > > > entirely for the purposes of eligibility, and only uses it for the
> > > > purposes of determining gfp flags for compaction/reclaim. Pushing that
> > > > further, compaction/reclaim could entirely be specified by the user
> > > > using a process_madvise(2) flag (later in the series, we do something
> > > > like this).
> > >
> > > Anyway I think we could have two options for MADV_COLLAPSE:
> > >
> > > 1. Just treat it as a hint (nice to have, best effort). It should obey
> > > all the settings. Skip VM_NOHUGEPAGE vmas or vmas without VM_HUGEPAGE
> > > if madvise mode, etc.
> > >
> > > 2. Much stronger. It equals MADV_HUGEPAGE + synchronous collapse. It
> > > should set vma flags properly as I suggested.
> > >
> > > Either is fine to me. But I don't prefer something in between personally.
> > >
> >
> > Makes sense to be consistent. Of these, #1 seems the most
> > straightforward to use. Doing an MADV_COLLAPSE on a VM_NOHUGEPAGE vma
> > seems like a corner case. The more likely scenario is MADV_COLLAPSE on
> > an unflagged (neither VM_HUGEPAGE nor VM_NOHUGEPAGE) vma - in which
> > case it's less intrusive to not additionally set VM_HUGEPAGE (though
> > the user can always do so if they wish). It's a little more consistent
> > with "always" mode, where MADV_HUGEPAGE isn't necessary for
> > eligibility. It'll also reduce some code complexity.
> >
> > I'll float one last option your way, however:
> >
> > 3. The collapsed region is always eligible, regardless of vma flags or
> > thp settings (except "never"?). For process_madvise(2), a flag will
> > explicitly specify defrag semantics.
>
> This is what I meant for #2 IIUC. Defrag could follow the system's
> defrag setting rather than khugepaged's.
>
> But it may break s390 as David pointed out.
>
> >
> > This separates "async-hint" vs "sync-explicit" madvise requests.
> > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > the kernel how to treat memory in the future. The kernel uses
> > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > request, is free to define its own defrag semantics.
> >
> > This would allow flexibility to separately define async vs sync thp
> > policies. For example, highly tuned userspace applications that are
> > sensitive to unexpected latency might want to manage their hugepage
> > utilization themselves, and ask khugepaged to stay away. There is no
> > way in "always" mode to do this without setting VM_NOHUGEPAGE.
>
> I don't quite get why you set THP to always but don't want khugepaged
> to do its job. It may be slow, I think this is why you introduced
> MADV_COLLAPSE, right? But it doesn't mean khugepaged can't scan the
> same area, it just doesn't do any real work and wastes some cpu
> cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> being split, right? So khugepaged still plays a role to re-collapse
> the area without calling MADV_COLLAPSE over again and again.
>

Ya, I agree that the common case is that, if you are MADV_COLLAPSE'ing
memory, chances are you just want that memory backed by hugepages - and
so if that area were to be split, presumably we'd want khugepaged to
come and recollapse when possible.
I think the (possibly contrived) use case I was thinking about was a
program that (a) didn't have the ability to change thp settings
("always"), and (b) wanted to manage its own hugepages. If a concrete
use case like this did arise, I think David H.'s suggestion of using
prctl(2) would work.

> > > > > > > >
> > > > > > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > > > > > which can be used to ignore vm flags used when considering thp
> > > > > > > > eligibility.
> > > > > > > >
> > > > > > > > Signed-off-by: Zach O'Keefe
> > > > > > > > ---
> > > > > > > >  mm/khugepaged.c | 18 ++++++++++++------
> > > > > > > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > > > > > >  #endif
> > > > > > > >
> > > > > > > >  /*
> > > > > > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > > > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > > > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > > > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > > > > > + * collapse context.
> > > > > > > >   */
> > > > > > > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > > >  					     unsigned long address, int nr,
> > > > > > > > +					     unsigned long vm_flags_ignore,
> > > > > > > >  					     struct vm_area_struct **vmap)
> > > > > > > >  {
> > > > > > > >  	struct vm_area_struct *vma;
> > > > > > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > > >  	hend = vma->vm_end & HPAGE_PMD_MASK;
> > > > > > > >  	if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > > > > > >  		return SCAN_ADDRESS_RANGE;
> > > > > > > > -	if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > > > > > +	if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > > > > > >  		return SCAN_VMA_CHECK;
> > > > > > > >  	/* Anon VMA expected */
> > > > > > > >  	if (!vma->anon_vma || vma->vm_ops)
> > > > > > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > > >   */
> > > > > > > >
> > > > > > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > > > > > +				   unsigned long vm_flags_ignore,
> > > > > > > >  				   struct vm_area_struct **vmap)
> > > > > > > >  {
> > > > > > > > -	return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > > > > > +	return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > > > > > +						 vm_flags_ignore, vmap);
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  /*
> > > > > > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > > > > > >  		/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > > > > > >  		if (ret & VM_FAULT_RETRY) {
> > > > > > > >  			mmap_read_lock(mm);
> > > > > > > > -			if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > > > > > +			if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > > > > > >  				/* vma is no longer available, don't continue to swapin */
> > > > > > > >  				trace_mm_collapse_huge_page_swapin(mm,
> > > > > > > >  							swapped_in, referenced, 0);
> > > > > > > >  				return false;
> > > > > > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > > >  	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > > > > > >
> > > > > > > >  	mmap_read_lock(mm);
> > > > > > > > -	result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > > > +	result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > > >  	if (result) {
> > > > > > > >  		mmap_read_unlock(mm);
> > > > > > > >  		goto out_nolock;
> > > > > > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > > >  	 */
> > > > > > > >  	mmap_write_lock(mm);
> > > > > > > >
> > > > > > > > -	result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > > > +	result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > > >  	if (result)
> > > > > > > >  		goto out_up_write;
> > > > > > > >  	/* check if the pmd is still valid */
> > > > > > > > --
> > > > > > > > 2.35.1.616.g0bdcbb4464-goog
> > > > > > > >