Date: Fri, 14 Jun 2024 16:17:53 -0700
From: Sean Christopherson
To: James Houghton
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
	Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon,
	Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
References: <20240611002145.2078921-1-jthoughton@google.com>
	<20240611002145.2078921-5-jthoughton@google.com>

On Fri, Jun 14, 2024, James Houghton wrote:
> On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson wrote:
> >
> > On Thu, Jun 13, 2024, James Houghton wrote:
> > > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson wrote:
> > > > A flag would also avoid an indirect call and thus a RETPOLINE when
> > > > CONFIG_RETPOLINE=y, i.e. would be a minor optimization when KVM doesn't
> > > > support fast aging.  But that's probably a pretty unlikely combination,
> > > > so it's probably not a valid argument.
> > > >
> > > > So, I guess I don't have a strong opinion?
> > >
> > > (Sorry for the somewhat delayed response... spent some time actually
> > > writing what this would look like.)
> > >
> > > I see what you mean, thanks! So has_fast_aging might be set by KVM if
> > > the architecture sets a Kconfig saying that it understands the concept
> > > of fast aging, basically what the presence of this v5's
> > > test_clear_young_fast_only() indicates.
> >
> > It would need to be a runtime setting, because KVM x86-64 with
> > tdp_mmu_enabled=false doesn't support fast aging (uses the shadow MMU even
> > for TDP).
>
> I see. I'm not sure if it makes sense to put this in `ops` as you
> originally had it then (it seems like a bit of a pain anyway). I could
> just make it a member of `struct mmu_notifier` itself.

Ah, right, because the ops are const.  Yeah, losing the const just to set a
flag would be unfortunate.
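E.g. a rough sketch, with the new field name being purely illustrative (the
rest of the struct is the current layout):

struct mmu_notifier {
	struct hlist_node hlist;
	const struct mmu_notifier_ops *ops;
	struct mm_struct *mm;
	struct rcu_head rcu;
	unsigned int users;

	/*
	 * Set by the subscriber at registration time, e.g. by KVM based on
	 * runtime state like tdp_mmu_enabled, so the const ops stay const.
	 */
	bool has_fast_aging;
};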
> > > So just to be clear, for test_young(), I intend to have a patch in v6
> > > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> > > like a pure win; no reason not to include it if we're making logic
> > > changes here anyway.
> >
> > I don't think that's correct.  The initial fast_only=false aging should
> > process shadow MMUs (nested TDP) and TDP MMUs, otherwise a future
> > fast_only=false would get a false positive on young due to failing to clear
> > the Accessed bit in the shadow MMU.  E.g. if page X is accessed by both L1
> > and L2, then aged, and never accessed again, the Accessed bit would still
> > be set in the page tables for L2.
>
> For clear_young(fast_only=false), yeah we need to check and clear
> Accessed for both MMUs. But for test_young(fast_only=false), I don't
> see why we couldn't just return early if the TDP MMU reports young.

Ooh, good point.
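E.g. a rough sketch of that early-out, written against the same __kvm_age_gfn()
shape as the diff further down (so the exact v5 context may differ);
kvm_tdp_mmu_test_age_gfn() and kvm_test_age_rmap() are the existing test-young
paths:

static int __kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
			      bool fast_only)
{
	bool young;

	/* The TDP MMU is checked locklessly; if it says young, we're done. */
	if (tdp_mmu_enabled && kvm_tdp_mmu_test_age_gfn(kvm, range))
		return 1;

	if (!kvm_memslots_have_rmaps(kvm))
		return 0;

	/* The shadow MMU takes mmu_lock for write, i.e. is never fast. */
	if (fast_only)
		return -1;

	write_lock(&kvm->mmu_lock);
	young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
	write_unlock(&kvm->mmu_lock);

	return (int)young;
}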
> > > Oh, yeah, that's a lot more intelligent than what I had. I think I
> > > fully understand your suggestion; I guess we'll see in v6. :)
> > >
> > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > is determined by how contended some lock(s) are at the time.
> >
> > No.  Just because a lock wasn't contended on the initial aging doesn't mean
> > it won't be contended on the next round.  E.g. when using KVM x86's shadow
> > MMU, which takes mmu_lock for write for all operations, an aging operation
> > could get lucky and sneak in while mmu_lock happened to be free, but then
> > get stuck behind a large queue of operations.
> >
> > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless
> > or in an MMU that takes mmu_lock for read in all but the most rare paths.
>
> Aging and look-around themselves only use the fast-only notifiers, so
> they won't ever wait on a lock (well... provided KVM is written like
> that, which I think is a given).

Regarding aging, is that actually the behavior that we want?  I thought the
plan is to have the initial test look at all MMUs, i.e. be potentially slow,
but only do the lookaround if it can be fast.  IIUC, that was Yu's intent (and
peeking back at v2, that is indeed the case, unless I'm misreading the code).

If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
end up with hot pages (used by L2) swapped out.

Or are you saying that the "test" could be slow, but not "clear"?  That's also
suboptimal, because any pages accessed by L2 will always appear hot.

> should_look_around() will use the slow notifier because it (despite its name)
> is responsible for accurately determining if a page is young lest we evict a
> young page.
>
> So in this case where "fast" means "lock not contended for now",

No.  In KVM, "fast" needs to be a property of the MMU, not a reflection of
system state at some random snapshot in time.

> I don't think it's necessarily wrong for MGLRU to attempt to find young
> pages, even if sometimes it will bail out because a lock is contended/held

lru_gen_look_around() skips lookaround if something else is waiting on the page
table lock, but that is a far cry from what KVM would be doing.  (a) the PTL is
already held, and (b) it is scoped precisely to the range being processed.  Not
looking around makes sense because there's a high probability of the PTEs in
question being modified by a different task, i.e. of the look around being a
waste of time.

In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid
waiting, and would need to bail from the middle of the aging walk if a
different task contends mmu_lock.

I agree that's not "wrong", but in part because mmu_lock is scoped to the
entire VM, it risks ending up with semi-random, hard to debug behavior.  E.g. a
user could see intermittent "failures" that come and go based on seemingly
unrelated behavior in KVM.  And implementing the "bail" behavior in the shadow
MMU would require non-trivial changes.

In other words, I would very strongly prefer that the shadow MMU be all or
nothing, i.e. is either part of look-around or isn't.  And if nested TDP
doesn't fare well with MGLRU, then we (or whoever cares) can spend the
time+effort to make it work with fast-aging.

Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the
shadow MMU would be, I'm pretty sure we can go straight there for nested TDP.
Or rather, I suspect/hope we can get close enough for an initial merge, which
would allow aging_is_fast to be a property of the mmu_notifier, i.e. would
simplify things because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST
for each notification.

Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is
done in such a way that a lockless walk would be painfully complex.  But if
there is exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points
directly at that one SPTE.  And with nested TDP, unless L1 is doing something
uncommon, e.g. mapping the same page into multiple L2s, the overwhelming
majority of rmaps have only one entry.  That's not the case for legacy shadow
paging because kernels almost always map a pfn using multiple virtual
addresses, e.g. Linux's direct map along with any userspace mappings.

E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has
multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory
during boot).

So, if we bifurcate aging behavior based on whether or not the TDP MMU is
enabled, then whether or not aging is fast is constant (after KVM loads).
Roughly, the KVM side of things would be the below, plus a bunch of conversions
to WRITE_ONCE() to ensure a stable rmap value (KVM already plays nice with
lockless accesses to SPTEs, thanks to the fast page fault path).

If KVM adds an rmap entry after the READ_ONCE(), then functionally all is
still well because the original SPTE pointer is still valid.  If the rmap
entry is removed, then KVM just needs to ensure the owning page table isn't
freed.  That could be done either with a cmpxchg() (KVM zaps leaf SPTEs before
freeing page tables, and the rmap stuff doesn't actually walk upper level
entries), or by enhancing the shadow MMU's lockless walk logic to allow
lockless walks from non-vCPU tasks.

And in the (hopefully) unlikely scenario someone has a use case where L1 maps
a gfn into multiple L2s (or aliases in bizarre ways), then we can tackle
making the nested TDP shadow MMU rmap walks always lockless.

E.g. again very roughly, if we went with the latter:

@@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
 }
 
+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+							  struct kvm_gfn_range *range,
+							  typedefme handler)
+{
+	gfn_t gfn;
+	int level;
+	u64 *spte;
+	bool ret = false;
+
+	walk_shadow_page_lockless_begin(???);
+
+	for (gfn = range->start; gfn < range->end; gfn++) {
+		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+			spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val);
+
+			/* Skip the gfn if there are multiple SPTEs. */
+			if ((unsigned long)spte & 1)
+				continue;
+
+			ret |= handler(spte);
+		}
+	}
+
+	walk_shadow_page_lockless_end(???);
+
+	return ret;
+}
+
 static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
			  bool fast_only)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
-		if (fast_only)
-			return -1;
-
-		write_lock(&kvm->mmu_lock);
-		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
-		write_unlock(&kvm->mmu_lock);
-	}
-
-	if (tdp_mmu_enabled)
+	if (tdp_mmu_enabled) {
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
+		young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);
+	} else if (!fast_only) {
+		write_lock(&kvm->mmu_lock);
+		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	return (int)young;
 }
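And a guess at the kvm_age_rmap_fast() handler referenced above (not fleshed
out in the diff), modeled on mmu_spte_age(); it assumes A/D bits are in use,
i.e. punts on access-tracked SPTEs:

static bool kvm_age_rmap_fast(u64 *sptep)
{
	u64 spte;

	/* An empty rmap yields a NULL "SPTE pointer" in the walk above. */
	if (!sptep)
		return false;

	spte = READ_ONCE(*sptep);
	if (!is_shadow_present_pte(spte) || !(spte & shadow_accessed_mask))
		return false;

	/* clear_bit() is atomic, so aging doesn't need mmu_lock. */
	clear_bit(ffs(shadow_accessed_mask) - 1, (unsigned long *)sptep);
	return true;
}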
> for a few or even a majority of the pages. Not doing look-around is the same
> as doing look-around and finding that no pages are young.

No, because the former is deterministic and predictable, the latter is not.

> Anyway, I don't think this bit is really all that important unless we
> can demonstrate that KVM participating like this actually results in a
> measurable win.

Participating like what?  You've lost me a bit.  Are we talking past each
other?

What I am saying is that we do this (note that this is slightly different than
an earlier sketch; I botched the ordering of spin_is_contended() in that one,
and didn't account for the page being young in the primary MMU).

	if (pte_young(ptep_get(pte)))
		young = 1 | MMU_NOTIFY_WAS_FAST;

	young |= ptep_clear_young_notify(vma, addr, pte);
	if (!young)
		return false;

	if (!(young & MMU_NOTIFY_WAS_FAST))
		return true;

	if (spin_is_contended(pvmw->ptl))
		return false;

	/* exclude special VMAs containing anon pages from COW */
	if (vma->vm_flags & VM_SPECIAL)
		return false;
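For completeness, a strawman of the notifier-core side that the above assumes,
purely illustrative: it reuses the has_fast_aging field sketched earlier and
assumes MMU_NOTIFY_WAS_FAST is a flag bit that doesn't collide with the young
bit:

int __mmu_notifier_clear_young(struct mm_struct *mm, unsigned long start,
			       unsigned long end)
{
	struct mmu_notifier *subscription;
	int young = 0, ret, id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(subscription,
				 &mm->notifier_subscriptions->list, hlist,
				 srcu_read_lock_held(&srcu)) {
		if (!subscription->ops->clear_young)
			continue;

		ret = subscription->ops->clear_young(subscription, mm, start, end);

		/* Let callers see whether a young result came cheap. */
		if (ret && subscription->has_fast_aging)
			ret |= MMU_NOTIFY_WAS_FAST;

		young |= ret;
	}
	srcu_read_unlock(&srcu, id);

	return young;
}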