From: "Zach O'Keefe" <zokeefe@google.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>,
	David Hildenbrand <david@redhat.com>,
	 David Rientjes <rientjes@google.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	 SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	Chris Kennelly <ckennelly@google.com>,
	 Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
	Hugh Dickins <hughd@google.com>,
	 Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	 "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Jens Axboe <axboe@kernel.dk>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Matt Turner <mattst88@gmail.com>,
	Max Filippov <jcmvbkbc@gmail.com>,
	 Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	 Patrick Xia <patrickx@google.com>,
	Pavel Begunkov <asml.silence@gmail.com>,
	 Peter Xu <peterx@redhat.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	 Yang Shi <shy828301@gmail.com>
Subject: Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
Date: Tue, 22 Mar 2022 08:53:35 -0700	[thread overview]
Message-ID: <CAAa6QmT988vZ8802PRd0-4i3=ME-kwPzrOz77OJrA46caeGE5A@mail.gmail.com> (raw)
In-Reply-To: <Yjm89k7jVdl7s77U@dhcp22.suse.cz>

On Tue, Mar 22, 2022 at 5:11 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> > Hey Michal, thanks for taking the time to review / comment.
> >
> > On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [ Removed  Richard Henderson from the CC list as the delivery fails for
> > >   his address]
> >
> > Thank you :)
> >
> > > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > > Introduction
> > > > --------------------------------
> > > >
> > > > This series provides a mechanism for userspace to induce a collapse of
> > > > eligible ranges of memory into transparent hugepages in process context,
> > > > thus permitting users to more tightly control their own hugepage
> > > > utilization policy at their own expense.
> > > >
> > > > This idea was previously introduced by David Rientjes, and thanks to
> > > > everyone for your patience while I prepared these patches resulting from
> > > > that discussion[1].
> > > >
> > > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > > >
> > > > Interface
> > > > --------------------------------
> > > >
> > > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > > leverages the new process_madvise(2) call.
> > > >
> > > > (*) process_madvise(2)
> > > >
> > > >         Performs a synchronous collapse of the native pages mapped by
> > > >         the list of iovecs into transparent hugepages. The default gfp
> > > >         flags used will be the same as those used at-fault for the VMA
> > > >         region(s) covered.
> > >
> > > Could you expand on reasoning here? The default allocation mode for #PF
> > > is rather light. Madvised will try harder. The reasoning is that we want
> > > to make stalls due to #PF as small as possible and only try harder for
> > > madvised areas (also a subject of configuration). Wouldn't it make more
> > > sense to try harder for an explicit calls like madvise?
> > >
> >
> > The reasoning is that the user has presumably configured system/vmas
> > to tell the kernel how badly they want thps, and so this call aligns
> > with current expectations. I.e. a user who goes to the trouble of
> > trying to fault in a thp at a given memory address likely wants a thp
> > "as badly" as the same user MADV_COLLAPSE'ing the same memory to get a
> > thp.
>
> If the syscall tries only as hard as the #PF doesn't that limit the
> functionality?

I'd argue that the various allocation semantics possible through
existing thp knobs / vma flags, in addition to the proposed
MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
work with. Relatively speaking, in what way would we be lacking
functionality?

> I mean a non #PF can consume more resources to allocate
> and collapse a THP as it won't inflict any measurable latency on the
> targeting process (except for potential CPU contention).

Sorry, I'm not sure I understand this. What latency are we discussing
here? Do you mean to say that since MADV_COLLAPSE isn't in the fault
path, it doesn't necessarily need to be fast, and so direct reclaim
wouldn't be noticed?

> From that
> perspective madvise is much more similar to khugepaged. I would even
> argue that it could try even harder because madvise is focused on a very
> specific memory range and the execution is not shared among all
> processes that are scanned by khugepaged.
>

Good point. Covered at the end.

> > If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
> > used to explicitly request the kernel to try harder, as you mention.
>
> Do we really need that? How many do_harder levels do we want to support?
>
> What would be typical usecases for #PF based and DEFRAG usages?
>

Thanks for challenging this. Covered at the end.

> [...]
>
> > > >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > > >         by default, as the user is explicitly requesting this action.
> > > >         Define two flags to control collapse semantics, passed through
> > > >         process_madvise(2)’s optional flags parameter:
> > >
> > > This part is discussed later in the thread.
> > >
> > > >
> > > >         MADV_F_COLLAPSE_LIMITS
> > > >
> > > >         If supplied, collapse respects pte collapse limits set via
> > > >         sysfs:
> > > >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > > >         Required if calling on behalf of another process and not
> > > >         CAP_SYS_ADMIN.
> > > >
> > > >         MADV_F_COLLAPSE_DEFRAG
> > > >
> > > >         If supplied, permit synchronous compaction and reclaim,
> > > >         regardless of VMA flags.
> > >
> > > Why do we need this?
> >
> > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> >
> > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > inter-process protection for collapsing memory in another process'
> > address space (which a malevolent program could exploit to cause oom
> > conditions in another memcg hierarchy, for example), but we want
> > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > utilization as they wish.
>
> Could you expand some more please? How is this any different from
> khugepaged (well, except that you can trigger the collapsing explicitly
> rather than rely on khugepaged to find that mm)?
>

MADV_F_COLLAPSE_LIMITS was motivated by the goal of replicating and
extending khugepaged in userspace, where the benefit is precisely that
we can choose the target mm/vma more intelligently.

> > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > to explicitly tell the kernel to try harder to back this by thps,
> > regardless of the current system/vma configuration.
> >
> > Note that when used together, these flags can be used to implement the
> > exact behavior of khugepaged, through MADV_COLLAPSE.
>
> IMHO this is stretching the interface and this can backfire in the
> future. The interface should be really trivial. I want to collapse a
> memory area. Let the kernel do the right thing and do not bother with
> all the implementation details. I would use the same allocation strategy
> as khugepaged as this seems to be closest from the latency and
> application awareness POV. In a way you can look at the madvise call as
> a way to trigger khugepaged functionality on the particular memory range.

Trying to summarize a few earlier comments centering on
MADV_F_COLLAPSE_DEFRAG and allocation semantics:

This series presupposes the existence of an informed userspace agent
that is aware of what processes/memory ranges would benefit most from
thps. Such an agent might either be:
(1) A system-level daemon optimizing thp utilization system-wide
(2) A highly tuned process / malloc implementation optimizing its
own thp usage

The different types of agents reflect the divide between #PF and
DEFRAG semantics.

For (1), we want to view this exactly like triggering khugepaged
functionality from userspace, and likely want DEFRAG semantics.

For (2), I was viewing this as the "live" symmetric counterpart to
at-fault thp allocation, where the process has decided, at runtime,
that this memory could benefit from thp backing, and so #PF semantics
seemed like a sane default. I'd worry that using DEFRAG semantics by
default might deter adoption by users who aren't willing to wait an
unbounded amount of time for direct reclaim.


> --
> Michal Hocko
> SUSE Labs


