From: Yu Zhao
Date: Thu, 23 Feb 2023 12:36:26 -0700
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
To: Sean Christopherson, Johannes Weiner
Cc: Andrew Morton, Paolo Bonzini, Jonathan Corbet, Michael Larabel, kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-mm@google.com
References: <20230217041230.2417228-1-yuzhao@google.com> <20230217041230.2417228-6-yuzhao@google.com>

On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson wrote:
> > >
> > > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > >   kswapd (MGLRU before)
> > > >     100.00%  balance_pgdat
> > > >       100.00%  shrink_node
> > > >         100.00%  shrink_one
> > > >           99.97%  try_to_shrink_lruvec
> > > >             99.06%  evict_folios
> > > >               97.41%  shrink_folio_list
> > > >                 31.33%  folio_referenced
> > > >                   31.06%  rmap_walk_file
> > > >                     30.89%  folio_referenced_one
> > > >                       20.83%  __mmu_notifier_clear_flush_young
> > > >                         20.54%  kvm_mmu_notifier_clear_flush_young
> > > >   =>                      19.34%  _raw_write_lock
> > > >
> > > >   kswapd (MGLRU after)
> > > >     100.00%  balance_pgdat
> > > >       100.00%  shrink_node
> > > >         100.00%  shrink_one
> > > >           99.97%  try_to_shrink_lruvec
> > > >             99.51%  evict_folios
> > > >               71.70%  shrink_folio_list
> > > >                 7.08%  folio_referenced
> > > >                   6.78%  rmap_walk_file
> > > >                     6.72%  folio_referenced_one
> > > >                       5.60%  lru_gen_look_around
> > > >   =>                    1.53%  __mmu_notifier_test_clear_young
> > >
> > > Do you happen to know how much of the improvement is due to batching, and how
> > > much is due to using a walkless walk?
> >
> > No. I have three benchmarks running at the moment:
> > 1. Windows SQL server guest on x86 host,
> > 2. Apache Spark guest on arm64 host, and
> > 3. Memcached guest on ppc64 host.
> >
> > If you are really interested in that, I can reprioritize -- I need to
> > stop 1) and use that machine to get the number for you.
>
> After looking at the "MGLRU before" stack again, it's definitely worth getting
> those numbers. The "before" isn't just taking mmu_lock, it's taking mmu_lock for
> write _and_ flushing remote TLBs on _every_ PTE.

Correct.

> I suspect the batching is a
> tiny percentage of the overall win (might be larger with RETPOLINE and friends),

Same here.

> and that the bulk of the improvement comes from avoiding the insanity of
> kvm_mmu_notifier_clear_flush_young().
>
> Speaking of which, what would it take to drop mmu_notifier_clear_flush_young()
> entirely?

That's not my call :)

Adding Johannes.

> I.e. why can MGLRU tolerate stale information but !MGLRU cannot?

Good question. The native clear API doesn't flush:

  int ptep_clear_flush_young(struct vm_area_struct *vma,
                             unsigned long address, pte_t *ptep)
  {
          /*
           * On x86 CPUs, clearing the accessed bit without a TLB flush
           * doesn't cause data corruption. [ It could cause incorrect
           * page aging and the (mistaken) reclaim of hot pages, but the
           * chance of that should be relatively low. ]
           *
           * So as a performance optimization don't flush the TLB when
           * clearing the accessed bit, it will eventually be flushed by
           * a context switch or a VM operation anyway. [ In the rare
           * event of it not getting flushed for a long time the delay
           * shouldn't really matter because there's no real memory
           * pressure for swapout to react to. ]
           */
          return ptep_test_and_clear_young(vma, address, ptep);
  }

> If
> we simply deleted mmu_notifier_clear_flush_young() and used mmu_notifier_clear_young()
> instead, would anyone notice, let alone care?

I tend to agree.

> > > > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > > >       if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > > >               caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> > > >
> > > > +     if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > > > +             caps |= BIT(LRU_GEN_SPTE_WALK);
> > >
> > > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > > mmu_notifiers. But don't call into KVM directly.
> >
> > I'm not sure I fully understand. Let's present the problem on the MM
> > side: assuming KVM supports lockless walks, batching can still be
> > worse (very unlikely), because GFNs can exhibit no memory locality at
> > all. So this option allows userspace to disable batching.
>
> I'm asking the opposite. Is there a scenario where batching+lock is worse than
> !batching+lock? If not, then don't make batching depend on lockless walks.

Yes, absolutely. batching+lock means we take/release mmu_lock for
every single PTE in the entire VA space -- each small batch contains
64 PTEs but the entire batch is the whole KVM.

> > I fully understand why you don't want MM to call into KVM directly. No
> > acceptable ways to set up a clear interface between MM and KVM other
> > than the MMU notifier?
>
> There are several options I can think of, but before we go spend time designing
> the best API, I'd rather figure out if we care in the first place.

This is self serving -- MGLRU would be the only user in the near
future. But I never assume there will be no common ground, at least it
doesn't hurt to check.
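
For anyone following the ptep_clear_flush_young() excerpt quoted in this
message, here is a rough userspace model of the "test and clear the accessed
bit, don't flush" semantics it relies on. This is only a sketch, not the
kernel implementation: every sketch_* name is hypothetical, the PTE is modeled
as a plain 64-bit atomic word, and the bit position simply follows x86, where
the accessed flag is bit 5. No TLB is modeled at all, which is the point of
the quoted comment: a stale TLB entry can only delay the next setting of the
bit, it cannot corrupt data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: accessed flag at bit 5, as on x86. Illustrative only. */
#define SKETCH_PTE_ACCESSED (UINT64_C(1) << 5)

/* Hypothetical stand-in for a page table entry; not a kernel type. */
typedef _Atomic uint64_t sketch_pte_t;

/*
 * Atomically clear the accessed bit and report whether it was set.
 * All other bits in the entry are left untouched, and nothing here
 * corresponds to a TLB flush.
 */
static inline bool sketch_test_and_clear_young(sketch_pte_t *pte)
{
	assert(pte != NULL);
	uint64_t old = atomic_fetch_and(pte, ~SKETCH_PTE_ACCESSED);
	return (old & SKETCH_PTE_ACCESSED) != 0;
}

/*
 * Age a contiguous run of entries in one pass and return how many
 * were young. No lock is taken per entry in this model.
 */
static inline size_t sketch_clear_young_batch(sketch_pte_t *ptes, size_t n)
{
	size_t young = 0;

	for (size_t i = 0; i < n; i++) {
		if (sketch_test_and_clear_young(&ptes[i]))
			young++;
	}
	return young;
}
```

Calling sketch_test_and_clear_young() twice on the same entry returns true and
then false unless something set the bit again in between, which is the aging
signal reclaim consumes. The batch helper only shows the shape of handling
many entries per call; it says nothing about the cost of mmu_lock or of KVM's
walk, which is what the numbers discussed above would have to answer.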