Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

From: Yu Zhao <yuzhao@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Sean Christopherson <seanjc@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Paolo Bonzini <pbonzini@redhat.com>,
	Albert Ou <aou@eecs.berkeley.edu>,
	 Ankit Agrawal <ankita@nvidia.com>,
	Anup Patel <anup@brainfault.org>,
	 Atish Patra <atishp@atishpatra.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Bibo Mao <maobibo@loongson.cn>,
	Catalin Marinas <catalin.marinas@arm.com>,
	 David Matlack <dmatlack@google.com>,
	David Rientjes <rientjes@google.com>,
	 Huacai Chen <chenhuacai@kernel.org>,
	James Morse <james.morse@arm.com>,
	 Jonathan Corbet <corbet@lwn.net>, Marc Zyngier <maz@kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	 Nicholas Piggin <npiggin@gmail.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	 Palmer Dabbelt <palmer@dabbelt.com>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	 Raghavendra Rao Ananta <rananta@google.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	 Shaoqin Huang <shahuang@redhat.com>,
	Shuah Khan <shuah@kernel.org>,
	 Suzuki K Poulose <suzuki.poulose@arm.com>,
	Tianrui Zhao <zhaotianrui@loongson.cn>,
	 Will Deacon <will@kernel.org>, Zenghui Yu <yuzenghui@huawei.com>,
	kvm-riscv@lists.infradead.org,  kvm@vger.kernel.org,
	kvmarm@lists.linux.dev,  linux-arm-kernel@lists.infradead.org,
	linux-doc@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,  linux-mips@vger.kernel.org,
	linux-mm@kvack.org,  linux-riscv@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org,  loongarch@lists.linux.dev
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging
Date: Fri, 31 May 2024 00:05:48 -0600
Message-ID: <CAOUHufZq6DwpStzHtjG+TOiHaQ6FFbkTfHMCe8Yy0n_M9MKdqw@mail.gmail.com>
In-Reply-To: <CADrL8HXHWg_MkApYQTngzmN21NEGNWC6KzJDw_Lm63JHJkR=5A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8733 bytes --]

On Wed, May 29, 2024 at 7:08 PM James Houghton <jthoughton@google.com> wrote:
>
> On Wed, May 29, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, May 29, 2024, Yu Zhao wrote:
> > > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Wed, May 29, 2024, Yu Zhao wrote:
> > > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote:
> > > > > >
> > > > > > Secondary MMUs are currently consulted for access/age information at
> > > > > > eviction time, but before then, we don't get accurate age information.
> > > > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > > > generation, and then at eviction time, if KVM reports the page to be
> > > > > > young, the page will be activated/promoted back to the youngest
> > > > > > generation.
> > > > >
> > > > > Correct, and as I explained offline, this is the only reasonable
> > > > > behavior if we can't locklessly walk secondary MMUs.
> > > > >
> > > > > Just for the record, the (crude) analogy I used was:
> > > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > > > but you are only allowed to pick up 10 of them (and put them in your
> > > > > pocket). A smart move would be to survey the room *first and then*
> > > > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > > > you would just want to pick up whichever that's in front of you rather
> > > > > than walk the entire room.
> > > > >
> > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > > > done lockless. Otherwise, it should just fall back to the existing
> > > > > approach, which existed in previous versions but is removed in this
> > > > > version.
> > > >
> > > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that
> > > > don't implement a lockless walk?
> > >
> > > No, the existing approach only checks secondary MMUs for LRU folios,
> > > i.e., those at the end of the LRU list. It might not find the best
> > > candidates (the coldest ones) on the entire list, but it doesn't pay
> > > as much for the locking. MGLRU can *optionally* scan MMUs (secondary
> > > included) to find the best candidates, but it can only be a win if the
> > > scanning incurs a relatively low overhead, e.g., done locklessly for
> > > the secondary MMU. IOW, this is a balance between the cost of
> > > reclaiming not-so-cold (warm) folios and that of finding the coldest
> > > folios.
> >
> > Gotcha.
> >
> > I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler
> > _code_, but I think it increases the overall system complexity.  E.g. distros
> > will likely enable the Kconfig, and in my experience people using KVM with a
> > distro kernel usually aren't kernel experts, i.e. likely won't know that there's
> > even a decision to be made, let alone be able to make an informed decision.
> >
> > Having an mmu_notifier hook that is conditionally implemented doesn't seem overly
> > complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for
> > KVM to nullify its mmu_notifier hook during initialization.  The hardest part is
> > likely going to be figuring out the threshold for how much overhead is too much.
>
> Hi Yu, Sean,
>
> Perhaps I "simplified" this bit of the series a little bit too much.
> Being able to opportunistically do aging with KVM (even without
> setting the Kconfig) is valuable.
>
> IIUC, we have the following possibilities:
> - v4: aging with KVM is done if the new Kconfig is set.
> - v3: aging with KVM is always done.

This is not true -- in v3, MGLRU only scans secondary MMUs if it can
be done locklessly on x86. It uses a bitmap to imply this requirement.

> - v2: aging with KVM is done when the architecture reports that it can
> probably be done locklessly, set at KVM MMU init time.

Not really -- it's only done if it can be done locklessly on both x86 and arm64.

> - Another possibility?: aging with KVM is only done exactly when it
> can be done locklessly (i.e., mmu_notifier_test/clear_young() called
> such that it will not grab any locks).

This is exactly the case for v2.

> I like the v4 approach because:
> 1. We can choose whether or not to do aging with KVM no matter what
> architecture we're using (without requiring userspace be aware to
> disable the feature at runtime with sysfs to avoid regressing
> performance if they don't care about proactive reclaim).
> 2. If we check the new feature bit (0x8) in sysfs, we can know for
> sure if aging is meant to be working or not. The selftest changes I
> made won't work properly unless there is a way to be sure that aging
> is working with KVM.

I'm not convinced, but it doesn't mean your point of view is invalid.
If you fully understand the implications of your design choice and
document them, I will not object.

All optimizations in v2 were measured step by step. Even that bitmap,
which might be considered overengineered, brought a readily
measurable 4% improvement in memcached throughput on Altra Max
swapping to Optane:

Using the bitmap (64 KVM PTEs for each call)
==========================================================================================================
Type       Ops/sec   Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
----------------------------------------------------------------------------------------------------------
Sets          0.00        ---         ---           ---          ---          ---            ---      0.00
Gets    1012801.92  431436.92    14965.11       0.06246      0.04700      0.16700        4.31900  39635.83
Waits         0.00        ---         ---           ---          ---          ---            ---       ---
Totals  1012801.92  431436.92    14965.11       0.06246      0.04700      0.16700        4.31900  39635.83


Not using the bitmap (1 KVM PTE for each call)
==========================================================================================================
Type       Ops/sec   Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
----------------------------------------------------------------------------------------------------------
Sets          0.00        ---         ---           ---          ---          ---            ---      0.00
Gets     968210.02  412443.85    14303.89       0.06517      0.04700      0.15900        7.42300  37890.74
Waits         0.00        ---         ---           ---          ---          ---            ---       ---
Totals   968210.02  412443.85    14303.89       0.06517      0.04700      0.15900        7.42300  37890.74
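
For reference, the relative change between the two runs can be
recomputed from the "Gets" rows. The sketch below is only arithmetic on
the figures above; percent_change(), print_bitmap_gain() and the
constant names are illustrative and not part of any patch. By this
arithmetic the bitmap run is about 4.6% higher in Ops/sec and about 42%
lower in p99.9 latency.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the "Gets" rows of the two memcached runs. */
const double ops_with_bitmap = 1012801.92;     /* Ops/sec, 64 PTEs per call */
const double ops_without_bitmap = 968210.02;   /* Ops/sec, 1 PTE per call */
const double p999_with_bitmap = 4.31900;       /* p99.9 latency */
const double p999_without_bitmap = 7.42300;    /* p99.9 latency */

/* Relative change of `with` over `without`, in percent. */
double percent_change(double with, double without)
{
	assert(without > 0.0);
	return (with - without) / without * 100.0;
}

/* Print both deltas; a positive value means the bitmap run is higher. */
void print_bitmap_gain(void)
{
	printf("throughput change:    %+.2f%%\n",
	       percent_change(ops_with_bitmap, ops_without_bitmap));
	printf("p99.9 latency change: %+.2f%%\n",
	       percent_change(p999_with_bitmap, p999_without_bitmap));
}
```

Calling print_bitmap_gain() from a small main() is enough to reproduce
the two percentages.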


FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached.
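
As a rough illustration of why batching matters, here is a toy
userspace model, not kernel code. It assumes only what the run titles
above state: with the bitmap, one call covers 64 KVM PTEs, and without
it, one call covers a single PTE. struct toy_mmu and every function
name are hypothetical, and the model counts calls rather than measuring
locking or page-table walk cost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PAGES 256
#define BATCH 64 /* PTEs covered by one bitmap-returning call */

/* Toy secondary MMU: one accessed bit per page, plus a call counter. */
struct toy_mmu {
	bool young[NR_PAGES];
	unsigned long calls;
};

/* Hypothetical per-PTE query: one call tests and clears one page. */
bool toy_test_clear_young_one(struct toy_mmu *mmu, unsigned int pfn)
{
	bool was_young;

	assert(pfn < NR_PAGES);
	was_young = mmu->young[pfn];
	mmu->calls++;
	mmu->young[pfn] = false;
	return was_young;
}

/* Hypothetical batched query: one call reports BATCH pages as a bitmap. */
uint64_t toy_test_clear_young_batch(struct toy_mmu *mmu, unsigned int start)
{
	uint64_t bitmap = 0;

	assert(start + BATCH <= NR_PAGES);
	mmu->calls++;
	for (unsigned int i = 0; i < BATCH; i++) {
		if (mmu->young[start + i])
			bitmap |= UINT64_C(1) << i;
		mmu->young[start + i] = false;
	}
	return bitmap;
}

/* Count young pages in [start, start + BATCH) one PTE at a time. */
unsigned int count_young_one_by_one(struct toy_mmu *mmu, unsigned int start)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < BATCH; i++)
		if (toy_test_clear_young_one(mmu, start + i))
			n++;
	return n;
}

/* Same count from a single batched call. */
unsigned int count_young_batched(struct toy_mmu *mmu, unsigned int start)
{
	uint64_t bitmap = toy_test_clear_young_batch(mmu, start);
	unsigned int n = 0;

	for (; bitmap; bitmap &= bitmap - 1) /* clear lowest set bit */
		n++;
	return n;
}
```

Both counters return the same answer; the difference is 64 calls versus
one for the same 64 pages.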

What I don't think is acceptable is simplifying those optimizations
out without documenting your justifications (I would even call it a
design change, rather than simplification, from v3 to v4).

> For look-around at eviction time:
> - v4: done if the main mm PTE was young and no MMU notifiers are subscribed.
> - v2/v3: done if the main mm PTE was young or (the SPTE was young and
> the MMU notifier was lockless/fast).

The host and secondary MMUs are two *independent* cases, IMO:
1. Look around the host MMU if the PTE mapping the folio under reclaim is young.
2. Look around the secondary MMU if it can be done locklessly.

So the v2/v3 behavior sounds a lot more reasonable to me.

Also a nit -- don't use 'else' in the following case (should_look_around()):

  if (foo)
    return bar;
  else
    do_something();
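
A minimal sketch of both points, assuming simplified boolean inputs.
The struct, the enum and the function names are hypothetical stand-ins
and do not match the real should_look_around() signature. The first
function treats the host and secondary MMU checks as independent, and
the second shows the early-return form without the redundant 'else'.

```c
#include <stdbool.h>

/* Hypothetical inputs; the real function takes kernel types. */
struct look_around_ctx {
	bool host_pte_young;         /* PTE mapping the folio under reclaim */
	bool secondary_mmu_lockless; /* secondary MMU can be walked locklessly */
};

enum look_around {
	LOOK_NONE = 0,
	LOOK_HOST = 1,
	LOOK_SECONDARY = 2,
};

/* Each case is decided on its own; neither check gates the other. */
unsigned int should_look_around_sketch(const struct look_around_ctx *ctx)
{
	unsigned int which = LOOK_NONE;

	if (ctx->host_pte_young)
		which |= LOOK_HOST;
	if (ctx->secondary_mmu_lockless)
		which |= LOOK_SECONDARY;
	return which;
}

/* After an early return, the remaining code needs no 'else'. */
int early_return_style(bool foo, int bar, int *counter)
{
	if (foo)
		return bar;

	(*counter)++; /* stands in for do_something() */
	return 0;
}
```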

> I made this logic change as part of removing batching.
>
> I'd really appreciate guidance on what the correct thing to do is.
>
> In my mind, what would work great is: by default, do aging exactly
> when KVM can do it locklessly, and then have a Kconfig to always have
> MGLRU to do aging with KVM if a user really cares about proactive
> reclaim (when the feature bit is set). The selftest can check the
> Kconfig + feature bit to know for sure if aging will be done.

I still don't see how that Kconfig helps, or why the new static branch
isn't enough.

> I'm not sure what the exact right thing to do for look-around is.
>
> Thanks for the quick feedback.

[-- Attachment #2: 2.svg --]
[-- Type: image/svg+xml, Size: 71758 bytes --]

[-- Attachment #3: 1.svg --]
[-- Type: image/svg+xml, Size: 92103 bytes --]
