linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Zach O'Keefe" <zokeefe@google.com>
To: David Rientjes <rientjes@google.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>,
	Alex Shi <alex.shi@linux.alibaba.com>,
	 David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Michal Hocko <mhocko@suse.com>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>,
	Peter Xu <peterx@redhat.com>,  SeongJae Park <sj@kernel.org>,
	Song Liu <songliubraving@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org,  Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Arnd Bergmann <arnd@arndb.de>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Chris Kennelly <ckennelly@google.com>,
	Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
	 Hugh Dickins <hughd@google.com>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	 "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Jens Axboe <axboe@kernel.dk>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matt Turner <mattst88@gmail.com>,
	 Max Filippov <jcmvbkbc@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	 Minchan Kim <minchan@kernel.org>,
	Patrick Xia <patrickx@google.com>,
	 Pavel Begunkov <asml.silence@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	 calling@linux.alibaba.com
Subject: Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
Date: Fri, 13 May 2022 14:06:22 -0700	[thread overview]
Message-ID: <CAAa6QmSjUe9P82bqL2UXWkcBC-TyDBvEJA=AJX-p1HsUqCH4BQ@mail.gmail.com> (raw)
In-Reply-To: <3a53435-5d5a-d6cb-5739-5c444523fc7@google.com>

On Thu, May 12, 2022 at 1:03 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 11 May 2022, Zach O'Keefe wrote:
>
> > Hey Rongwei,
> >
> > Thanks for taking the time to review!
> >
> > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> > <rongwei.wang@linux.alibaba.com> wrote:
> > >
> > > Hi, Zach
> > >
> > > Thanks for your great patchset!
> > > Recently, We also try to collapse THP in this way, likes performance
> > > degradation due to using too much hugepages in our scenes.
> > >
>
> Rongwei, could you elaborate on this?  I can understand undesired overhead
> for allocation of a hugepage at the time of fault if there is not enough
> benefit derived by the hugepages over the long term (like for a database
> workload), is this the performance degradation you're referring to?
>
> Otherwise I'm unfamiliar with performance degradation for using too much
> hugepages after they have been allocated :)  Maybe RSS grows too much and
> we run into memory pressure?
>
> It would be good to know more if you can share details here.
>
> > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> > > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
> > > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> > > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> > > something, please let me know.
> > >
> >
> > I tried to have MADV_COLLAPSE follow the same THP eligibility
> > semantics as khugepaged and at-fault: either THP=always, or
> > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> > out.
> >
> > If I understand you correctly, the usefulness of
> > process_madvise(MADV_COLLAPSE) is limited in the case where
> > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
> > behalf of another process since they don't have a way to mark the
> > target memory as eligible (which requires VM_HUGEPAGE).
> >
> > If so, I think that's a valid point, and your suggestion below of a
> > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> > the sake of exploring all options, I'll mention that there was also a
> > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> > VM_HUGEPAGE[1].
> >
>
> If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary
> to need to do MADV_HUGEPAGE before that regardless of system-wide settings
> that it may not control?
>

If THP=always, this isn't necessary - but yes, if THP=madvise then, as
proposed, we'd need MADV_HUGEPAGE to opt-in to THPs, just like
at-fault and khugepaged.

If MADV_COLLAPSE also sets VM_HUGEPAGE, then in THP=madvise mode, we
also opt-in that range to khugepaged scanning. Most likely this is
what we want anyways (if the collapsed range is split after
MADV_COLLAPSE, khugepaged could come along and fix things) but it does
couple the two more. We can always MADV_NOHUGEPAGE the range after
MADV_COLLAPSE if this was ever a concern, however.

> Same point for a root user doing this on behalf of another user.  It could
> either do MADV_HUGEPAGE and change the behavior that the user perhaps
> requested behind its back (not desired) or it could temporarily set the
> system-wide setting to allow THP always before doing the MADV_COLLAPSE
> (it's root).
>
> We can simply allow MADV_COLLAPSE to always collapse in either context
> regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?
>

I think we need to respect VM_NOHUGEPAGE as it can be set from
non-madvise code, as David H mentioned for kvm on s390 in the PATCH
RFC[1]. It's not clear to me how this would break, however. If
MADV_HUGEPAGE'ing the areas marked by
arch/s390/mm/gmap.c:thp_split_mm() would also break kvm (assuming
subsequent collapse by khugepaged), then MADV_COLLAPSE ignoring
VM_NOHUGEPAGE doesn't really increase the failure space much. Else, I
think we should really be respecting VM_NOHUGEPAGE.

[1] https://lore.kernel.org/linux-mm/30571216-5a6a-7a11-3b2c-77d914025f6d@redhat.com/


> > Since it's possible supporting MADV_[NO]HUGEPAGE for
> > process_madivse(2) has applications outside a subsequent
> > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> > be in a hot path, I'd vote in favor of your suggestion and include
> > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> >
> > Thanks again for your review and your suggestion!
> > Zach
> >
> > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> >
> > > If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> > > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> > > vma with 'hg', and the collapse process can be finished completely with
> > > the help of other processes. the latter could let some special vma avoid
> > > collapsing when setting 'THP=always'.
> > >
> > > Best regards,
> > > -wrw
> > >
> > > On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > > > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >   mm/madvise.c | 6 ++++--
> > > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 638517952bd2..08c11217025a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> > > >   }
> > > >
> > > >   static bool
> > > > -process_madvise_behavior_valid(int behavior)
> > > > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > > >   {
> > > >       switch (behavior) {
> > > >       case MADV_COLD:
> > > >       case MADV_PAGEOUT:
> > > >       case MADV_WILLNEED:
> > > >               return true;
> > > > +     case MADV_COLLAPSE:
> > > > +             return task == current || capable(CAP_SYS_ADMIN);
> > > >       default:
> > > >               return false;
> > > >       }
> > > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > > >               goto free_iov;
> > > >       }
> > > >
> > > > -     if (!process_madvise_behavior_valid(behavior)) {
> > > > +     if (!process_madvise_behavior_valid(behavior, task)) {
> > > >               ret = -EINVAL;
> > > >               goto release_task;
> > > >       }
> >


  reply	other threads:[~2022-05-13 21:07 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-05-18 18:41   ` Peter Xu
2022-05-19 21:06     ` Zach O'Keefe
2022-05-20  1:12       ` Peter Xu
2022-05-04 21:44 ` [PATCH v5 02/13] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-18 20:03   ` Peter Xu
2022-05-18 20:11     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-13 18:26     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-13 23:04     ` Zach O'Keefe
2022-05-13 23:17       ` Yang Shi
2022-05-13 23:55         ` Zach O'Keefe
2022-05-17 17:18           ` Yang Shi
2022-05-17 22:35             ` Zach O'Keefe
2022-05-25 17:58               ` Yang Shi
2022-05-25 18:27                 ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-05-12 20:03   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-05-12 20:03   ` David Rientjes
2022-05-13 18:17     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-05-05 18:50   ` Zach O'Keefe
2022-05-05 18:58     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
2022-05-12 20:03   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-05-11  0:49   ` Rongwei Wang
2022-05-11 15:34     ` Zach O'Keefe
2022-05-12 15:53       ` Rongwei Wang
2022-05-12 20:03       ` David Rientjes
2022-05-13 21:06         ` Zach O'Keefe [this message]
2022-05-16  3:56         ` Rongwei Wang
2022-05-12 20:03   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 11/13] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 12/13] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 13/13] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAAa6QmSjUe9P82bqL2UXWkcBC-TyDBvEJA=AJX-p1HsUqCH4BQ@mail.gmail.com' \
    --to=zokeefe@google.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@linux.alibaba.com \
    --cc=arnd@arndb.de \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=axelrasmussen@google.com \
    --cc=calling@linux.alibaba.com \
    --cc=chris@zankel.net \
    --cc=ckennelly@google.com \
    --cc=david@redhat.com \
    --cc=deller@gmx.de \
    --cc=hughd@google.com \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jcmvbkbc@gmail.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-mm@kvack.org \
    --cc=mattst88@gmail.com \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=patrickx@google.com \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=rongwei.wang@linux.alibaba.com \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=tsbogend@alpha.franken.de \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox