From: James Houghton <jthoughton@google.com>
Date: Mon, 17 Jun 2024 09:50:26 -0700
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
To: Sean Christopherson
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
	Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon,
	Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240611002145.2078921-1-jthoughton@google.com>
	<20240611002145.2078921-5-jthoughton@google.com>
On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson wrote:
>
> On Fri, Jun 14, 2024, James Houghton wrote:
> > On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson wrote:
> > >
> > > On Thu, Jun 13, 2024, James Houghton wrote:
> > > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > > is determined by how contended some lock(s) are at the time.
> > >
> > > No. Just because a lock wasn't contended on the initial aging doesn't mean it
> > > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
> > > takes mmu_lock for write for all operations, an aging operation could get lucky
> > > and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> > > queue of operations.
> > >
> > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> > > an MMU that takes mmu_lock for read in all but the most rare paths.
> >
> > Aging and look-around themselves only use the fast-only notifiers, so
> > they won't ever wait on a lock (well... provided KVM is written like
> > that, which I think is a given).
>
> Regarding aging, is that actually the behavior that we want? I thought the plan
> is to have the initial test look at all MMUs, i.e. be potentially slow, but only
> do the lookaround if it can be fast. IIUC, that was Yu's intent (and peeking back
> at v2, that is indeed the case, unless I'm misreading the code).

I believe what I said is correct. There are three separate things going
on here:

1. Aging (when we hit the low watermark, scan PTEs to find young pages)
2. Eviction (pick a page to evict; if it is definitely not young, evict it)
3. Look-around (upon finding that a page is young at eviction time, check
   whether adjacent pages are young too)

(1) and (3) both use the fast-only notifier; (2) does not.

In v2[1], this is true: in the (1) and (3) paths, the notifier being
called is the test_clear_young() notifier with the bitmap present. If the
bitmap is present, the shadow MMU is not consulted. For (2), there is an
mmu_notifier_clear_young() with no bitmap called in should_look_around(),
just like in this v5.

(2) is the only one that needs to use the slow notifier; (1) and (3) are
best-effort attempts to improve the decision of which pages to evict.

[1]: https://lore.kernel.org/linux-mm/20230526234435.662652-11-yuzhao@google.com/

> If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
> end up with hot pages (used by L2) swapped out.

The shadow MMU is consulted at eviction time -- only at eviction time. So
pages used by L2 won't be swapped out unless they're still cold at
eviction time.

In my (and Yu's) head, not being able to do aging for nested TDP is ok
because running nested VMs is much rarer than running non-nested VMs. And
in the non-nested case, being able to do aging is a strict improvement
over what we have now.
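To spell that split out in code, the callers would look roughly like the
below. This is an illustrative sketch only: the mglru_*() wrapper names
are hypothetical, and the exact return convention of the fast-only
notifier is whatever the series settles on (a negative return meaning
"couldn't answer without taking locks" matches the __kvm_age_gfn()
behavior quoted later in this thread).

	/* Sketch only -- not code from this series. */

	/* (1) aging and (3) look-around: best-effort, must never block. */
	static bool mglru_test_young_fast(struct mm_struct *mm, unsigned long addr)
	{
		int young = mmu_notifier_test_clear_young_fast_only(mm, addr,
								    addr + PAGE_SIZE);

		/* Negative means "couldn't answer fast"; treat as not young. */
		return young > 0;
	}

	/* (2) eviction: must be accurate, so use the ordinary (slow) notifier. */
	static bool mglru_test_young_accurate(struct mm_struct *mm, unsigned long addr)
	{
		return mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE) != 0;
	}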
We could look into being able to do aging with the shadow MMU, but I
don't think that should necessarily block this series.

> Or are you saying that the "test" could be slow, but not "clear"? That's also
> suboptimal, because any pages accessed by L2 will always appear hot.

No. I hope what I said above makes sense.

> > should_look_around() will use the slow notifier because it (despite its name)
> > is responsible for accurately determining if a page is young lest we evict a
> > young page.
> >
> > So in this case where "fast" means "lock not contended for now",
>
> No. In KVM, "fast" needs to be a property of the MMU, not a reflection of system
> state at some random snapshot in time.

Ok, that's fine with me.

> > I don't think it's necessarily wrong for MGLRU to attempt to find young
> > pages, even if sometimes it will bail out because a lock is contended/held
>
> lru_gen_look_around() skips lookaround if something else is waiting on the page
> table lock, but that is a far cry from what KVM would be doing. (a) the PTL is
> already held, and (b) it is scoped precisely to the range being processed. Not
> looking around makes sense because there's a high probability of the PTEs in
> question being modified by a different task, i.e. of the look around being a
> waste of time.
>
> In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid
> waiting, and would need to bail from the middle of the aging walk if a different
> task contends mmu_lock.
>
> I agree that's not "wrong", but in part because mmu_lock is scoped to the entire
> VM, it risks ending up with semi-random, hard-to-debug behavior. E.g. a user
> could see intermittent "failures" that come and go based on seemingly unrelated
> behavior in KVM. And implementing the "bail" behavior in the shadow MMU would
> require non-trivial changes.

This is a great point, thanks Sean. I won't send any patches that make
other architectures trylock() for the fast notifier.

> In other words, I would very strongly prefer that the shadow MMU be all or
> nothing, i.e. is either part of look-around or isn't. And if nested TDP doesn't
> fare well with MGLRU, then we (or whoever cares) can spend the time+effort to
> make it work with fast-aging.
>
> Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> MMU would be, I'm pretty sure we can go straight there for nested TDP. Or
> rather, I suspect/hope we can get close enough for an initial merge, which would
> allow aging_is_fast to be a property of the mmu_notifier, i.e. would simplify
> things because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each
> notification.
>
> Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is
> done in such a way that a lockless walk would be painfully complex. But if there
> is exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly
> at that one SPTE. And with nested TDP, unless L1 is doing something uncommon,
> e.g. mapping the same page into multiple L2s, the overwhelming majority of rmaps
> have only one entry. That's not the case for legacy shadow paging because
> kernels almost always map a pfn using multiple virtual addresses, e.g. Linux's
> direct map along with any userspace mappings.
>
> E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has
> multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory
> during boot).
>
> So, if we bifurcate aging behavior based on whether or not the TDP MMU is
> enabled, then whether or not aging is fast is constant (after KVM loads).
> Roughly, the KVM side of things would be the below, plus a bunch of conversions
> to WRITE_ONCE() to ensure a stable rmap value (KVM already plays nice with
> lockless accesses to SPTEs, thanks to the fast page fault path).
>
> If KVM adds an rmap entry after the READ_ONCE(), then functionally all is still
> well because the original SPTE pointer is still valid. If the rmap entry is
> removed, then KVM just needs to ensure the owning page table isn't freed. That
> could be done either with a cmpxchg() (KVM zaps leaf SPTEs before freeing page
> tables, and the rmap stuff doesn't actually walk upper level entries), or by
> enhancing the shadow MMU's lockless walk logic to allow lockless walks from
> non-vCPU tasks.
>
> And in the (hopefully) unlikely scenario someone has a use case where L1 maps a
> gfn into multiple L2s (or aliases in bizarre ways), then we can tackle making
> the nested TDP shadow MMU rmap walks always lockless.
>
> E.g. again very roughly, if we went with the latter:
>
> @@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>  	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
>  }
>
> +static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
> +							   struct kvm_gfn_range *range,
> +							   typedefme handler)
> +{
> +	gfn_t gfn;
> +	int level;
> +	u64 *spte;
> +	bool ret = false;
> +
> +	walk_shadow_page_lockless_begin(???);
> +
> +	for (gfn = range->start; gfn < range->end; gfn++) {
> +		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +			spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val);
> +
> +			/* Skip gfns with no rmap entry at this level. */
> +			if (!spte)
> +				continue;
> +
> +			/* Skip the gfn if there are multiple SPTEs. */
> +			if ((unsigned long)spte & 1)
> +				continue;
> +
> +			ret |= handler(spte);
> +		}
> +	}
> +
> +	walk_shadow_page_lockless_end(???);
> +
> +	return ret;
> +}
> +
>  static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
>  			 bool fast_only)
>  {
>  	bool young = false;
>
> -	if (kvm_memslots_have_rmaps(kvm)) {
> -		if (fast_only)
> -			return -1;
> -
> -		write_lock(&kvm->mmu_lock);
> -		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> -		write_unlock(&kvm->mmu_lock);
> -	}
> -
> -	if (tdp_mmu_enabled)
> +	if (tdp_mmu_enabled) {
>  		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> +		young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);
> +	} else if (!fast_only) {
> +		write_lock(&kvm->mmu_lock);
> +		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> +		write_unlock(&kvm->mmu_lock);
> +	}
>
>  	return (int)young;
>  }

Hmm, interesting. I need to spend a little bit more time digesting this.
Would you like to see this included in v6? (It'd be nice to avoid the
WAS_FAST stuff....) Should we leave it for a later series? I haven't
formed my own opinion yet.

> > for a few or even a majority of the pages. Not doing look-around is the same
> > as doing look-around and finding that no pages are young.
>
> No, because the former is deterministic and predictable, the latter is not.

Fair enough. I just meant in terms of the end result.

> > Anyway, I don't think this bit is really all that important unless we
> > can demonstrate that KVM participating like this actually results in a
> > measurable win.
>
> Participating like what? You've lost me a bit. Are we talking past each other?
Participating in aging and look-around by trylock()ing instead of being
lockless. All I meant was: it's not important to try to figure out how to
get this non-deterministic trylock()-based "fast" notifier implementation
to work, because I'm not actually suggesting we do this. There was some
suggestion previously to make arm64 work like this (in lieu of a lockless
implementation), but I'd need to verify that it performs/behaves well
before actually sending a patch.

> What I am saying is that we do this (note that this is slightly different than
> an earlier sketch; I botched the ordering of spin_is_contended() in that one,
> and didn't account for the page being young in the primary MMU).
>
> 	if (pte_young(ptep_get(pte)))
> 		young = 1 | MMU_NOTIFY_WAS_FAST;
>
> 	young |= ptep_clear_young_notify(vma, addr, pte);
> 	if (!young)
> 		return false;
>
> 	if (!(young & MMU_NOTIFY_WAS_FAST))
> 		return true;
>
> 	if (spin_is_contended(pvmw->ptl))
> 		return false;
>
> 	/* exclude special VMAs containing anon pages from COW */
> 	if (vma->vm_flags & VM_SPECIAL)
> 		return false;

SGTM. I think we're on the same page here.
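For concreteness, since it came up a few times: the trylock() approach I
was referring to would look roughly like the below on the KVM side. This
is a sketch only, not a proposed patch -- __kvm_age_gfn_locked() is a
hypothetical walker, and -1 reuses the "couldn't do it fast" convention
from __kvm_age_gfn() above.

	static int kvm_age_gfn_fast_only(struct kvm *kvm,
					 struct kvm_gfn_range *range)
	{
		int young;

		/* "Fast" must never wait, so bail if mmu_lock is held. */
		if (!read_trylock(&kvm->mmu_lock))
			return -1;

		young = __kvm_age_gfn_locked(kvm, range);	/* hypothetical */
		read_unlock(&kvm->mmu_lock);

		return young;
	}

Whether that ever returns -1 depends entirely on who happens to hold
mmu_lock at that instant, which is exactly the non-determinism concern
above.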