linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Zach O'Keefe" <zokeefe@google.com>
To: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>,
	Alex Shi <alex.shi@linux.alibaba.com>,
	 David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Michal Hocko <mhocko@suse.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	 SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org,  Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Arnd Bergmann <arnd@arndb.de>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Chris Kennelly <ckennelly@google.com>,
	Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
	 Hugh Dickins <hughd@google.com>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	 "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Jens Axboe <axboe@kernel.dk>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matt Turner <mattst88@gmail.com>,
	 Max Filippov <jcmvbkbc@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	 Minchan Kim <minchan@kernel.org>,
	Patrick Xia <patrickx@google.com>,
	 Pavel Begunkov <asml.silence@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Subject: Re: [PATCH v2 00/12] mm: userspace hugepage collapse
Date: Tue, 19 Apr 2022 08:55:03 -0700	[thread overview]
Message-ID: <CAAa6QmQMNG2dXrKp5LwQbsVnR-LbOfk9BtJm=6cjxitjMcWSZg@mail.gmail.com> (raw)
In-Reply-To: <8d8da2fb-aed9-96d0-47ed-94806e190250@redhat.com>

On Tue, Apr 19, 2022 at 7:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.04.22 21:26, Peter Xu wrote:
> > Hi, Zach,
> >
> > On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> >> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> >>>
> >>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> >>>> Hey Peter,
> >>>>
> >>>> Thanks for taking the time to review!
> >>>>
> >>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>
> >>>>> Hi, Zach,
> >>>>>
> >>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> >>>>>> process_madvise(2)
> >>>>>>
> >>>>>>       Performs a synchronous collapse of the native pages
> >>>>>>       mapped by the list of iovecs into transparent hugepages.
> >>>>>>
> >>>>>>       Allocation semantics are the same as khugepaged, and depend on
> >>>>>>       (1) the active sysfs settings
> >>>>>>       /sys/kernel/mm/transparent_hugepage/enabled and
> >>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> >>>>>>       the VMA flags of the memory range being collapsed.
> >>>>>>
> >>>>>>       Collapse eligibility criteria differs from khugepaged in that
> >>>>>>       the sysfs files
> >>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> >>>>>>       are ignored.
> >>>>>
> >>>>> The userspace khugepaged idea definitely makes sense to me, though I'm
> >>>>> curious how the line is drown on the different behaviors here by explicitly
> >>>>> ignoring the max_ptes_* entries.
> >>>>>
> >>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in
> >>>>> the userspace, then IMHO it makes more sense to start with all the policies
> >>>>> that applies to khugepaged already, including max_pte_*.
> >>>>>
> >>>>> I can understand the willingness to provide even stronger semantics here
> >>>>> than khugepaged since the userspace could have very clear knowledge of how
> >>>>> to provision the memories (better than a kernel scanner).  It's just that
> >>>>> IMHO it could be slightly confusing if the new interface only partially
> >>>>> apply the khugepaged rules.
> >>>>>
> >>>>> No strong opinion here.  It could already been a trade-off after the
> >>>>> discussion from the RFC with Michal which I read..  Just curious about how
> >>>>> you made that design decision so feel free to read it as a pure question.
> >>>>>
> >>>>
> >>>> Understand your point here. The allocation and max_pte_* semantics are
> >>>> split between khugepaged-like and fault-like, respectively - which
> >>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> >>>> flag to control the former's behavior, but agreed to keep things
> >>>> simple to start, and expand the interface if/when necessary. I opted
> >>>> to ignore max_ptes_* as the default since I envisioned that early
> >>>> adopters would "just want it to work". One such example would be
> >>>> backing executable text by hugepages on program load when many pages
> >>>> haven't been demand-paged in yet.
> >>>>
> >>>> What do you think?
> >>>
> >>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> >>> blurred.
> >>>
> >>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> >>> khugepaged on this range, and with current thread context".  IMHO any
> >>> feature bits then can be supplementing special needs, and I'll take the thp
> >>> backing executable example to be one of the (good?) reason we'd need an
> >>> extra flag for ignoring the max_ptes_* knobs.
> >>>
> >>> So personally if I were you maybe I'll start with the simple scheme of that
> >>> (even if it won't immediately service a thing) but then add either the
> >>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> >>> descriptions about why we need each of the feature flags.  IMHO numbers
> >>> would be even more helpful when there's specific use cases on the show.
> >>>
> >>> Or, perhaps you think all potential MADV_COLLAPSE users should literally
> >>> skip max_ptes_* limitations always?
> >>>
> >>
> >> Thanks for your time and valuable feedback here, Peter. I had a response typed
> >> up, but after a few iterations became increasingly unsatisfied with my
> >> own response.
> >>
> >> I think this feature should be able to stand on its own without
> >> consideration of a userspace khugepaged, as we have existing concrete
> >> examples where it would be useful. In these cases, and I assume almost
> >> all other use-cases outside userspace khugepaged, max_ptes_* should be
> >> ignored as the fundamental assumption of MADV_COLLAPSE is that the
> >> user knows better, and IMHO, khugepaged heuristics shouldn't tell
> >> users they are wrong.
> >
> > Valid point.  And actually right after I replied I thought similarly on
> > whether we need to connect the two interfaces at all..
> >
> > It's just that it's very easy to go think like that after reading the cover
> > letter since that's exactly what it is comparing to. :)
> >
> > There's definitely a difference view on user/kernel level of things, then
> > it sounds reasonable to me if we add a new interface it by default has a
> > stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's
> > existance.
> >
> > So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> > as you said.  And that's a real pure question I asked above, and I feel
> > like your answer is actually "yes" we should always ignore the max_ptes_*
> > fields until there's a proof that it'll be helpful.
> >
> >>
> >> But this, as you mention, unsatisfactorily blurs the semantics of
> >> MADV_COLLAPSE: "act like khugepaged here, but not here".
> >>
> >> As such, WDYT about the reverse-side of the coin of what you proposed:
> >> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> >> all? I.e. Not tie the allocation semantics to
> >> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> >> flags as necessary when/if a reimplementation of khugepaged in
> >> userspace proves fruitful.
> >
> > Let's see whether others have thoughts, but what you proposed here makes
> > sense to me.
>
>
Hey David - thanks again for taking the time to read and comment.

> Makes sense to me. IIUC, the whole handling of max_ptes_* is in place as
>  tunable because we don't know what user space is up to.
>

Agreed.

> E.g., have with a very sparse memory layout, we don't want to waste
> memory by allocating memory where we actually have no page populated yet
> -- could be user space won't reuse that memory in the foreseeable
> future. With too many swap entries, we don't want to trigger an
> eventually unnecessary overhead of swapping in entries if user space
> won't access them in the foreseeable future. Something similar applies
> to max_ptes_shared, where one might just end up wasting a lot of memory
> eventually in some applications.
>
> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> try figuring out what user space might be doing. We know exactly what
> user space asks for -- and that can be documented properly.
>

Sounds good to me. Would you also be in favor of decoupling allocation
semantics from khugepaged? I.e. we'll pick some default gfp flags and
not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?

Thanks again,

Zach

> --
> Thanks,
>
> David / dhildenb
>


  reply	other threads:[~2022-04-19 15:55 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-14 18:06 Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
2022-04-15  0:04 ` [PATCH v2 00/12] mm: userspace hugepage collapse Peter Xu
2022-04-15  0:52   ` Zach O'Keefe
2022-04-15 13:39     ` Peter Xu
2022-04-15 20:04       ` Zach O'Keefe
2022-04-16 19:26         ` Peter Xu
2022-04-17 17:23           ` Zach O'Keefe
2022-04-19 14:33           ` David Hildenbrand
2022-04-19 15:55             ` Zach O'Keefe [this message]
2022-04-19 20:02               ` David Hildenbrand
2022-04-19 22:42                 ` Zach O'Keefe
2022-04-21  0:56                   ` Yang Shi
2022-04-21  1:02                     ` Zach O'Keefe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAAa6QmQMNG2dXrKp5LwQbsVnR-LbOfk9BtJm=6cjxitjMcWSZg@mail.gmail.com' \
    --to=zokeefe@google.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@linux.alibaba.com \
    --cc=arnd@arndb.de \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=axelrasmussen@google.com \
    --cc=chris@zankel.net \
    --cc=ckennelly@google.com \
    --cc=david@redhat.com \
    --cc=deller@gmx.de \
    --cc=hughd@google.com \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jcmvbkbc@gmail.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-mm@kvack.org \
    --cc=mattst88@gmail.com \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=patrickx@google.com \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=tsbogend@alpha.franken.de \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox