Date: Fri, 14 Jun 2024 16:17:53 -0700
From: Sean Christopherson
To: James Houghton
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
	Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon,
	Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
References: <20240611002145.2078921-1-jthoughton@google.com>
	<20240611002145.2078921-5-jthoughton@google.com>

On Fri, Jun 14, 2024, James Houghton wrote:
> On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson wrote:
> >
> > On Thu, Jun 13, 2024, James Houghton wrote:
> > > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson wrote:
> > > > A flag would also avoid an indirect call and thus a RETPOLINE when
> > > > CONFIG_RETPOLINE=y, i.e. would be a minor optimization when KVM doesn't
> > > > support fast aging.  But that's probably a pretty unlikely combination,
> > > > so it's probably not a valid argument.
> > > >
> > > > So, I guess I don't have a strong opinion?
> > >
> > > (Sorry for the somewhat delayed response... spent some time actually
> > > writing what this would look like.)
> > >
> > > I see what you mean, thanks! So has_fast_aging might be set by KVM if
> > > the architecture sets a Kconfig saying that it understands the concept
> > > of fast aging, basically what the presence of this v5's
> > > test_clear_young_fast_only() indicates.
> >
> > It would need to be a runtime setting, because KVM x86-64 with
> > tdp_mmu_enabled=false doesn't support fast aging (uses the shadow MMU even
> > for TDP).
>
> I see. I'm not sure if it makes sense to put this in `ops` as you
> originally had it then (it seems like a bit of a pain anyway). I could
> just make it a member of `struct mmu_notifier` itself.

Ah, right, because the ops are const.  Yeah, losing the const just to set a
flag would be unfortunate.
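E.g. a rough sketch, with the new field name being purely illustrative (the
rest of the struct is the current layout):

struct mmu_notifier {
	struct hlist_node hlist;
	const struct mmu_notifier_ops *ops;
	struct mm_struct *mm;
	struct rcu_head rcu;
	unsigned int users;

	/*
	 * Set by the subscriber at registration time, e.g. by KVM based on
	 * runtime state like tdp_mmu_enabled, so the const ops stay const.
	 */
	bool has_fast_aging;
};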
> > > So just to be clear, for test_young(), I intend to have a patch in v6
> > > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> > > like a pure win; no reason not to include it if we're making logic
> > > changes here anyway.
> >
> > I don't think that's correct.  The initial fast_only=false aging should
> > process shadow MMUs (nested TDP) and TDP MMUs, otherwise a future
> > fast_only=false would get a false positive on young due to failing to clear
> > the Accessed bit in the shadow MMU.  E.g. if page X is accessed by both L1
> > and L2, then aged, and never accessed again, the Accessed bit would still
> > be set in the page tables for L2.
>
> For clear_young(fast_only=false), yeah we need to check and clear
> Accessed for both MMUs. But for test_young(fast_only=false), I don't
> see why we couldn't just return early if the TDP MMU reports young.

Ooh, good point.
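E.g. a rough sketch of that early-out, written against the same __kvm_age_gfn()
shape as the diff further down (so the exact v5 context may differ);
kvm_tdp_mmu_test_age_gfn() and kvm_test_age_rmap() are the existing test-young
paths:

static int __kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
			      bool fast_only)
{
	bool young;

	/* The TDP MMU is checked locklessly; if it says young, we're done. */
	if (tdp_mmu_enabled && kvm_tdp_mmu_test_age_gfn(kvm, range))
		return 1;

	if (!kvm_memslots_have_rmaps(kvm))
		return 0;

	/* The shadow MMU takes mmu_lock for write, i.e. is never fast. */
	if (fast_only)
		return -1;

	write_lock(&kvm->mmu_lock);
	young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
	write_unlock(&kvm->mmu_lock);

	return (int)young;
}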
> > > Oh, yeah, that's a lot more intelligent than what I had. I think I
> > > fully understand your suggestion; I guess we'll see in v6. :)
> > >
> > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > is determined by how contended some lock(s) are at the time.
> >
> > No.  Just because a lock wasn't contended on the initial aging doesn't mean
> > it won't be contended on the next round.  E.g. when using KVM x86's shadow
> > MMU, which takes mmu_lock for write for all operations, an aging operation
> > could get lucky and sneak in while mmu_lock happened to be free, but then
> > get stuck behind a large queue of operations.
> >
> > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless
> > or in an MMU that takes mmu_lock for read in all but the most rare paths.
>
> Aging and look-around themselves only use the fast-only notifiers, so
> they won't ever wait on a lock (well... provided KVM is written like
> that, which I think is a given).

Regarding aging, is that actually the behavior that we want?  I thought the
plan is to have the initial test look at all MMUs, i.e. be potentially slow,
but only do the lookaround if it can be fast.  IIUC, that was Yu's intent (and
peeking back at v2, that is indeed the case, unless I'm misreading the code).

If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
end up with hot pages (used by L2) swapped out.

Or are you saying that the "test" could be slow, but not "clear"?  That's also
suboptimal, because any pages accessed by L2 will always appear hot.

> should_look_around() will use the slow notifier because it (despite its name)
> is responsible for accurately determining if a page is young lest we evict a
> young page.
>
> So in this case where "fast" means "lock not contended for now",

No.  In KVM, "fast" needs to be a property of the MMU, not a reflection of
system state at some random snapshot in time.

> I don't think it's necessarily wrong for MGLRU to attempt to find young
> pages, even if sometimes it will bail out because a lock is contended/held

lru_gen_look_around() skips lookaround if something else is waiting on the page
table lock, but that is a far cry from what KVM would be doing.  (a) the PTL is
already held, and (b) it is scoped precisely to the range being processed.  Not
looking around makes sense because there's a high probability of the PTEs in
question being modified by a different task, i.e. of the look around being a
waste of time.

In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid
waiting, and would need to bail from the middle of the aging walk if a
different task contends mmu_lock.

I agree that's not "wrong", but in part because mmu_lock is scoped to the
entire VM, it risks ending up with semi-random, hard to debug behavior.  E.g. a
user could see intermittent "failures" that come and go based on seemingly
unrelated behavior in KVM.  And implementing the "bail" behavior in the shadow
MMU would require non-trivial changes.

In other words, I would very strongly prefer that the shadow MMU be all or
nothing, i.e. is either part of look-around or isn't.  And if nested TDP
doesn't fare well with MGLRU, then we (or whoever cares) can spend the
time+effort to make it work with fast-aging.

Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the
shadow MMU would be, I'm pretty sure we can go straight there for nested TDP.
Or rather, I suspect/hope we can get close enough for an initial merge, which
would allow aging_is_fast to be a property of the mmu_notifier, i.e. would
simplify things because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST
for each notification.

Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is
done in such a way that a lockless walk would be painfully complex.  But if
there is exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points
directly at that one SPTE.  And with nested TDP, unless L1 is doing something
uncommon, e.g. mapping the same page into multiple L2s, the overwhelming
majority of rmaps have only one entry.  That's not the case for legacy shadow
paging because kernels almost always map a pfn using multiple virtual
addresses, e.g. Linux's direct map along with any userspace mappings.

E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has
multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory
during boot).

So, if we bifurcate aging behavior based on whether or not the TDP MMU is
enabled, then whether or not aging is fast is constant (after KVM loads).
Roughly, the KVM side of things would be the below, plus a bunch of conversions
to WRITE_ONCE() to ensure a stable rmap value (KVM already plays nice with
lockless accesses to SPTEs, thanks to the fast page fault path).

If KVM adds an rmap entry after the READ_ONCE(), then functionally all is
still well because the original SPTE pointer is still valid.  If the rmap
entry is removed, then KVM just needs to ensure the owning page table isn't
freed.  That could be done either with a cmpxchg() (KVM zaps leaf SPTEs before
freeing page tables, and the rmap stuff doesn't actually walk upper level
entries), or by enhancing the shadow MMU's lockless walk logic to allow
lockless walks from non-vCPU tasks.

And in the (hopefully) unlikely scenario someone has a use case where L1 maps
a gfn into multiple L2s (or aliases in bizarre ways), then we can tackle
making the nested TDP shadow MMU rmap walks always lockless.

E.g. again very roughly, if we went with the latter:

@@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
 }
 
+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+							  struct kvm_gfn_range *range,
+							  typedefme handler)
+{
+	gfn_t gfn;
+	int level;
+	u64 *spte;
+	bool ret = false;
+
+	walk_shadow_page_lockless_begin(???);
+
+	for (gfn = range->start; gfn < range->end; gfn++) {
+		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+			spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val);
+
+			/* Skip the gfn if there are multiple SPTEs. */
+			if ((unsigned long)spte & 1)
+				continue;
+
+			ret |= handler(spte);
+		}
+	}
+
+	walk_shadow_page_lockless_end(???);
+
+	return ret;
+}
+
 static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
			  bool fast_only)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
-		if (fast_only)
-			return -1;
-
-		write_lock(&kvm->mmu_lock);
-		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
-		write_unlock(&kvm->mmu_lock);
-	}
-
-	if (tdp_mmu_enabled)
+	if (tdp_mmu_enabled) {
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
+		young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);
+	} else if (!fast_only) {
+		write_lock(&kvm->mmu_lock);
+		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	return (int)young;
 }
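And a guess at the kvm_age_rmap_fast() handler referenced above (not fleshed
out in the diff), modeled on mmu_spte_age(); it assumes A/D bits are in use,
i.e. punts on access-tracked SPTEs:

static bool kvm_age_rmap_fast(u64 *sptep)
{
	u64 spte;

	/* An empty rmap yields a NULL "SPTE pointer" in the walk above. */
	if (!sptep)
		return false;

	spte = READ_ONCE(*sptep);
	if (!is_shadow_present_pte(spte) || !(spte & shadow_accessed_mask))
		return false;

	/* clear_bit() is atomic, so aging doesn't need mmu_lock. */
	clear_bit(ffs(shadow_accessed_mask) - 1, (unsigned long *)sptep);
	return true;
}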
> for a few or even a majority of the pages. Not doing look-around is the same
> as doing look-around and finding that no pages are young.

No, because the former is deterministic and predictable, the latter is not.

> Anyway, I don't think this bit is really all that important unless we
> can demonstrate that KVM participating like this actually results in a
> measurable win.

Participating like what?  You've lost me a bit.  Are we talking past each
other?

What I am saying is that we do this (note that this is slightly different than
an earlier sketch; I botched the ordering of spin_is_contended() in that one,
and didn't account for the page being young in the primary MMU).

	if (pte_young(ptep_get(pte)))
		young = 1 | MMU_NOTIFY_WAS_FAST;

	young |= ptep_clear_young_notify(vma, addr, pte);
	if (!young)
		return false;

	if (!(young & MMU_NOTIFY_WAS_FAST))
		return true;

	if (spin_is_contended(pvmw->ptl))
		return false;

	/* exclude special VMAs containing anon pages from COW */
	if (vma->vm_flags & VM_SPECIAL)
		return false;
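For completeness, a strawman of the notifier-core side that the above assumes,
purely illustrative: it reuses the has_fast_aging field sketched earlier and
assumes MMU_NOTIFY_WAS_FAST is a flag bit that doesn't collide with the young
bit:

int __mmu_notifier_clear_young(struct mm_struct *mm, unsigned long start,
			       unsigned long end)
{
	struct mmu_notifier *subscription;
	int young = 0, ret, id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(subscription,
				 &mm->notifier_subscriptions->list, hlist,
				 srcu_read_lock_held(&srcu)) {
		if (!subscription->ops->clear_young)
			continue;

		ret = subscription->ops->clear_young(subscription, mm, start, end);

		/* Let callers see whether a young result came cheap. */
		if (ret && subscription->has_fast_aging)
			ret |= MMU_NOTIFY_WAS_FAST;

		young |= ret;
	}
	srcu_read_unlock(&srcu, id);

	return young;
}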