From: James Houghton <jthoughton@google.com>
To: Yu Zhao <yuzhao@google.com>
Cc: Sean Christopherson <seanjc@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Albert Ou <aou@eecs.berkeley.edu>,
Ankit Agrawal <ankita@nvidia.com>,
Anup Patel <anup@brainfault.org>,
Atish Patra <atishp@atishpatra.org>,
Axel Rasmussen <axelrasmussen@google.com>,
Bibo Mao <maobibo@loongson.cn>,
Catalin Marinas <catalin.marinas@arm.com>,
David Matlack <dmatlack@google.com>,
David Rientjes <rientjes@google.com>,
Huacai Chen <chenhuacai@kernel.org>,
James Morse <james.morse@arm.com>,
Jonathan Corbet <corbet@lwn.net>, Marc Zyngier <maz@kernel.org>,
Michael Ellerman <mpe@ellerman.id.au>,
Nicholas Piggin <npiggin@gmail.com>,
Oliver Upton <oliver.upton@linux.dev>,
Palmer Dabbelt <palmer@dabbelt.com>,
Paul Walmsley <paul.walmsley@sifive.com>,
Raghavendra Rao Ananta <rananta@google.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Shaoqin Huang <shahuang@redhat.com>,
Shuah Khan <shuah@kernel.org>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Tianrui Zhao <zhaotianrui@loongson.cn>,
Will Deacon <will@kernel.org>, Zenghui Yu <yuzenghui@huawei.com>,
kvm-riscv@lists.infradead.org, kvm@vger.kernel.org,
kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org, linux-mips@vger.kernel.org,
linux-mm@kvack.org, linux-riscv@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging
Date: Mon, 3 Jun 2024 15:45:59 -0700
Message-ID: <CADrL8HW44Hx_Ejx_6+FVKt1V17PdgT6rw+sNtKzumqc9UCVDfA@mail.gmail.com>
In-Reply-To: <CAOUHufZq6DwpStzHtjG+TOiHaQ6FFbkTfHMCe8Yy0n_M9MKdqw@mail.gmail.com>
On Thu, May 30, 2024 at 11:06 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, May 29, 2024 at 7:08 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Hi Yu, Sean,
> >
> > Perhaps I "simplified" this bit of the series a little bit too much.
> > Being able to opportunistically do aging with KVM (even without
> > setting the Kconfig) is valuable.
> >
> > IIUC, we have the following possibilities:
> > - v4: aging with KVM is done if the new Kconfig is set.
> > - v3: aging with KVM is always done.
>
> This is not true -- in v3, MGLRU only scans secondary MMUs if it can
> be done locklessly on x86. It uses a bitmap to imply this requirement.
>
> > - v2: aging with KVM is done when the architecture reports that it can
> > probably be done locklessly, set at KVM MMU init time.
>
> Not really -- it's only done if it can be done locklessly on both x86 and arm64.
>
> > - Another possibility?: aging with KVM is only done exactly when it
> > can be done locklessly (i.e., mmu_notifier_test/clear_young() called
> > such that it will not grab any locks).
>
> This is exactly the case for v2.
Thanks for clarifying; sorry for getting this wrong.
>
> > I like the v4 approach because:
> > 1. We can choose whether or not to do aging with KVM no matter what
> > architecture we're using (without requiring userspace be aware to
> > disable the feature at runtime with sysfs to avoid regressing
> > performance if they don't care about proactive reclaim).
> > 2. If we check the new feature bit (0x8) in sysfs, we can know for
> > sure if aging is meant to be working or not. The selftest changes I
> > made won't work properly unless there is a way to be sure that aging
> > is working with KVM.
>
> I'm not convinced, but it doesn't mean your point of view is invalid.
> If you fully understand the implications of your design choice and
> document them, I will not object.
>
> All optimizations in v2 were measured step by step. Even that bitmap,
> which might be considered overengineered, brought a readily
> measurable 4% improvement in memcached throughput on Altra Max
> swapping to Optane:
>
> Using the bitmap (64 KVM PTEs for each call)
> ===========================================================================================================
> Type        Ops/sec   Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
> -----------------------------------------------------------------------------------------------------------
> Sets           0.00        ---         ---           ---          ---          ---            ---      0.00
> Gets     1012801.92  431436.92    14965.11       0.06246      0.04700      0.16700        4.31900  39635.83
> Waits          0.00        ---         ---           ---          ---          ---            ---       ---
> Totals   1012801.92  431436.92    14965.11       0.06246      0.04700      0.16700        4.31900  39635.83
>
>
> Not using the bitmap (1 KVM PTE for each call)
> ===========================================================================================================
> Type        Ops/sec   Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
> -----------------------------------------------------------------------------------------------------------
> Sets           0.00        ---         ---           ---          ---          ---            ---      0.00
> Gets      968210.02  412443.85    14303.89       0.06517      0.04700      0.15900        7.42300  37890.74
> Waits          0.00        ---         ---           ---          ---          ---            ---       ---
> Totals    968210.02  412443.85    14303.89       0.06517      0.04700      0.15900        7.42300  37890.74
>
>
> FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached.
>
> What I don't think is acceptable is simplifying those optimizations
> out without documenting your justifications (I would even call it a
> design change, rather than simplification, from v3 to v4).
I'll put back something similar to what you had before (like a
test_clear_young() with a "fast" parameter instead of "bitmap"). I
like the idea of having a new mmu notifier, like
fast_test_clear_young(), while leaving test_young() and clear_young()
unchanged (where "fast" means "prioritize speed over accuracy"). It
seems a little more straightforward that way.
>
> > For look-around at eviction time:
> > - v4: done if the main mm PTE was young and no MMU notifiers are subscribed.
> > - v2/v3: done if the main mm PTE was young or (the SPTE was young and
> > the MMU notifier was lockless/fast).
>
> The host and secondary MMUs are two *independent* cases, IMO:
> 1. lookaround the host MMU if the PTE mapping the folio under reclaim is young.
> 2. lookaround the secondary MMU if it can be done locklessly.
>
> So the v2/v3 behavior sounds a lot more reasonable to me.
I'll restore the v2/v3 behavior. I initially removed it because,
without batching, we (mostly) lose the spatial locality that, IIUC,
look-around is designed to exploit.
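
To make the difference concrete, here is a minimal sketch of the two look-around conditions as they are described in the bullets above. It is not the code from any version of the series: the function and parameter names are hypothetical, and the real should_look_around() works on folios and PTEs rather than on plain booleans.

```c
#include <stdbool.h>

/*
 * v2/v3 as described above: look around if the host PTE was young, or
 * if the secondary MMU PTE was young and the notifier is lockless/fast.
 * All names are hypothetical and only illustrate the boolean logic.
 */
static bool look_around_v2v3(bool pte_young, bool spte_young,
			     bool notifier_fast)
{
	if (pte_young)
		return true;

	return spte_young && notifier_fast;
}

/*
 * v4 as described above: look around only if the host PTE was young
 * and no MMU notifiers are subscribed to the mm.
 */
static bool look_around_v4(bool pte_young, bool has_notifiers)
{
	return pte_young && !has_notifiers;
}
```

With these definitions, a folio that is young only in the secondary MMU still triggers look-around under the v2/v3 rule when the notifier is fast, whereas the v4 rule never looks around for an mm that has a notifier subscribed.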
>
> Also a nit -- don't use 'else' in the following case (should_look_around()):
>
>   if (foo)
>           return bar;
>   else
>           do_something();
Oh, yes, sorry. I wrote and rewrote should_look_around() quite a few
times while trying to figure out what made sense in a no-batching
series. I'll fix this.
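
For reference, a small sketch of the shape the nit asks for, assuming a function that returns bool. The foo/bar/do_something() names are the placeholders from the nit, and the call counter is a hypothetical addition that only makes the side effect observable.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the do_something() placeholder in the nit. */
static int do_something_calls;

static void do_something(void)
{
	do_something_calls++;
}

/*
 * The branch that returns needs no 'else': once 'if (foo)' has
 * returned, everything after it already runs only when foo is false.
 */
static bool without_else(bool foo, bool bar)
{
	if (foo)
		return bar;

	do_something();
	return false;
}
```

The behavior is the same as the if/else form; only the indentation of the fall-through path changes.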
>
> > I made this logic change as part of removing batching.
> >
> > I'd really appreciate guidance on what the correct thing to do is.
> >
> > In my mind, what would work great is: by default, do aging exactly
> > when KVM can do it locklessly, and then have a Kconfig that makes
> > MGLRU always do aging with KVM if a user really cares about proactive
> > reclaim (when the feature bit is set). The selftest can check the
> > Kconfig + feature bit to know for sure if aging will be done.
>
> I still don't see how that Kconfig helps. Or why the new static branch
> isn't enough?
Without a special Kconfig, the feature bit just tells us that aging
with KVM is possible, not that it will necessarily be done. For the
self-test, it'd be good to know exactly when aging is being done or
not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would
help make the self-test set the right expectations for aging.
The Kconfig would also allow a user to know that, no matter what,
we're going to get correct age data for VMs, even if, say, we're using
the shadow MMU. This is somewhat important for me/Google Cloud. Is
that reasonable? Maybe there's a better solution.
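
A rough sketch of the check the self-test could make under this proposal. It assumes the test has already read the contents of /sys/kernel/mm/lru_gen/enabled (a hex mask such as "0x0007") into a string and knows whether the proposed Kconfig is set. The 0x8 bit is the one proposed in this series, and the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Feature bit proposed in this series; the macro name is hypothetical. */
#define LRU_GEN_SECONDARY_MMU_WALK_BIT 0x8UL

/* Parse the hex mask read from the sysfs file and test one bit. */
static bool lru_gen_bit_set(const char *enabled, unsigned long bit)
{
	char *end;
	unsigned long caps = strtoul(enabled, &end, 16);

	/* Nothing parsed: treat the feature as absent. */
	if (end == enabled)
		return false;

	return (caps & bit) != 0;
}

/*
 * Aging with KVM is only guaranteed when the proposed "always walk"
 * Kconfig is set and the feature bit is enabled. The bit alone only
 * says that aging with KVM is possible.
 */
static bool expect_kvm_aging(const char *enabled, bool always_walk_kconfig)
{
	return always_walk_kconfig &&
	       lru_gen_bit_set(enabled, LRU_GEN_SECONDARY_MMU_WALK_BIT);
}
```

With only the feature bit to go on, the test could at most treat aging as something that may happen, which is why the extra Kconfig helps it set firm expectations.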