From: James Houghton
Date: Mon, 22 Jul 2024 13:45:42 -0700
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging
To: Yu Zhao
Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
 Catalin Marinas, David Matlack, David Rientjes, James Morse,
 Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
 Ryan Roberts, Sean Christopherson, Shaoqin Huang, Suzuki K Poulose,
 Wei Xu, Will Deacon, Zenghui Yu, kvmarm@lists.linux.dev,
 kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
References: <20240611002145.2078921-1-jthoughton@google.com>
 <20240611002145.2078921-9-jthoughton@google.com>

On Mon, Jul 8, 2024 at 4:42 PM Yu Zhao wrote:
>
> On Mon, Jul 8, 2024 at 11:31 AM James Houghton wrote:
> >
> > On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao wrote:
> > >
> > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton wrote:
> > > > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > > >                 if (!folio)
> > > >                         continue;
> > > >
> > > > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > > > -                       VM_WARN_ON_ONCE(true);
> > > > +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > > > +               if (pte_young(ptent))
> > > > +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
> > > >
> > > >                 young++;
> > > >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> > >
> > >
> > > There are two ways to structure the test conditions in walk_pte_range():
> > > 1. a single pass into the MMU notifier (combine test/clear) which
> > > causes a cache miss from get_pfn_page() if the page is NOT young.
> > > 2. two passes into the MMU notifier (separate test/clear) if the page
> > > is young, which does NOT cause a cache miss if the page is NOT young.
> > >
> > > v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> > > twice every 64 PTEs, and therefore the second option is a clear win.
> > >
> > > But you are doing twice per PTE. So what's the rationale behind going
> > > with the second option? Was the first option considered?
> >
> > Hi Yu,
> >
> > I didn't consider changing this from your v2[1]. Thanks for bringing it up.
> >
> > The only real change I have made is that I reordered the
> > (!test_spte_young() && !pte_young()) to what it is now (!pte_young()
> > && !lru_gen_notifier_test_young()) because pte_young() can be
> > evaluated much faster.
> >
> > I am happy to change the initial test_young() notifier to a
> > clear_young() (and drop the later clear_young(). In fact, I think I
> > should. Making the condition (!pte_young() &&
> > !lru_gen_notifier_clear_young()) makes sense to me. This returns the
> > same result as if it were !lru_gen_notifier_test_young() instead,
> > there is no need for a second clear_young(), and we don't call
> > get_pfn_folio() on pages that are not young.
>
> We don't want to do that because we would lose the A-bit for a folio
> that's beyond the current reclaim scope, i.e., the cases where
> get_pfn_folio() returns NULL (a folio from another memcg, e.g.).
>
> > WDYT? Have I misunderstood your comment?
>
> I hope this is clear enough:
>
> @@ -3395,7 +3395,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> long start, unsigned long end,
>                 if (pfn == -1)
>                         continue;
>
> -               if (!pte_young(ptent)) {
> +               if (!pte_young(ptent) && !mm_has_notifiers(args->mm)) {
>                         walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
>                 }
> @@ -3404,8 +3404,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> long start, unsigned long end,
>                 if (!folio)
>                         continue;
>
> -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> -                       VM_WARN_ON_ONCE(true);
> +               if (!ptep_clear_young_notify(args->vma, addr, pte + i))

walk->mm_stats[MM_LEAF_OLD]++ should be here, I take it.

> +                       continue;
>
>                 young++;
>                 walk->mm_stats[MM_LEAF_YOUNG]++;
>
> > Also, I take it your comment was not just about walk_pte_range() but
> > about the similar bits in lru_gen_look_around() as well, so I'll make
> > whatever changes we agree on there too (or maybe factor out the common
> > bits).
> >
> > [1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
> >
> > > In addition, what about the non-lockless cases? Would this change make
> > > them worse by grabbing the MMU lock twice per PTE?
> >
> > That's a good point. Yes I think calling the notifier twice here would
> > indeed exacerbate problems with a non-lockless notifier.
>
> I think so too, but I haven't verified it. Please do?

I have some results now, sorry for the wait.

It seems like one notifier is definitely better: with two notifiers,
the aging passes marked "got longer" below take roughly 1.7x as long.
It doesn't look like the read lock actually made anything worse with
what I was testing (faulting memory in while doing aging). This is kind
of surprising, but either way, I'll change it to the single notifier in
v6. Thanks Yu!

Here are the results I'm basing this conclusion on, using the selftest
added at the end of this series.

# Use taskset to minimize NUMA concern.
# Give an extra core for the aging thread.
# THPs disabled (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

x86:

# taskset -c 0-32 ./access_tracking_perf_test -l -v 32
#
# One notifier
Populating memory             : 1.933017284s
Writing to populated memory   : 0.017323539s
Reading from populated memory : 0.013113260s
lru_gen: Aging                : 0.894133259s
lru_gen: Aging                : 0.738950525s
Writing to idle memory        : 0.059661329s
lru_gen: Aging                : 0.922719935s
lru_gen: Aging                : 0.829129877s
Reading from idle memory      : 0.059095098s
lru_gen: Aging                : 0.922689975s
#
# Two notifiers
Populating memory             : 1.842645795s
Writing to populated memory   : 0.017277075s
Reading from populated memory : 0.013047457s
lru_gen: Aging                : 0.900751764s
lru_gen: Aging                : 0.707203167s
Writing to idle memory        : 0.060663733s
lru_gen: Aging                : 1.539957250s  <------ got longer
lru_gen: Aging                : 0.797475887s
Reading from idle memory      : 0.084415591s
lru_gen: Aging                : 1.539417121s  <------ got longer

arm64*:
(*Patched to do aging; not done in v5 or v6. Doing this to see if the read
lock is made substantially worse by using two notifiers vs. one.)
# taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3
#
# One notifier
Populating memory             : 1.439261355s
Writing to populated memory   : 0.009755279s
Reading from populated memory : 0.007714120s
lru_gen: Aging                : 0.540183328s
lru_gen: Aging                : 0.455427973s
Writing to idle memory        : 0.010130399s
lru_gen: Aging                : 0.563424247s
lru_gen: Aging                : 0.500419850s
Reading from idle memory      : 0.008519640s
lru_gen: Aging                : 0.563178643s
#
# Two notifiers
Populating memory             : 1.526805625s
Writing to populated memory   : 0.009836118s
Reading from populated memory : 0.007757280s
lru_gen: Aging                : 0.537770978s
lru_gen: Aging                : 0.421915391s
Writing to idle memory        : 0.010281959s
lru_gen: Aging                : 0.971448688s  <------ got longer
lru_gen: Aging                : 0.466956547s
Reading from idle memory      : 0.008588559s
lru_gen: Aging                : 0.971030648s  <------ got longer

arm64, faulting memory in while aging:

# perf record -g -- taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3 -p
#
# One notifier
vcpu wall time                : 1.433908058s
lru_gen avg pass duration     : 0.172128073s, (passes:11, total:1.893408807s)
#
# Two notifiers
vcpu wall time                : 1.450387765s
lru_gen avg pass duration     : 0.175652974s, (passes:10, total:1.756529744s)

# perf report
#
# One notifier
-    6.25%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
   - try_to_inc_max_seq
      - 6.06% walk_page_range
           __walk_page_range
         - walk_pgd_range
            - 6.04% walk_pud_range
               - 4.73% __mmu_notifier_clear_young
                  + 4.29% kvm_mmu_notifier_clear_young
#
# Two notifiers
-    6.43%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
   - try_to_inc_max_seq
      - 6.25% walk_page_range
           __walk_page_range
         - walk_pgd_range
            - 6.23% walk_pud_range
               - 2.75% __mmu_notifier_test_young
                  + 2.48% kvm_mmu_notifier_test_young
               - 2.39% __mmu_notifier_clear_young
                  + 2.19% kvm_mmu_notifier_clear_young
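
The shape of these numbers (aging of idle memory costs about the same either way, aging of recently accessed memory gets longer with two notifiers) can be illustrated with a toy userspace model. This is only a sketch under stated assumptions, not kernel code: every `toy_*` name is hypothetical, there is one secondary-MMU accessed bit per page, and the model ignores pte_young(), the folio lookup, batching and locking. It only counts how many times the walk enters a notifier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model: one secondary-MMU accessed bit per page. */
#define TOY_NR_PAGES 8

struct toy_result {
	int young;           /* pages the walk counted as young */
	unsigned long calls; /* notifier invocations made by the walk */
};

static bool toy_spte_young[TOY_NR_PAGES];
static unsigned long toy_calls;

/* Stands in for a test_young notifier: reports the bit, leaves it set. */
static bool toy_test_young(int i)
{
	assert(i >= 0 && i < TOY_NR_PAGES);
	toy_calls++;
	return toy_spte_young[i];
}

/* Stands in for a clear_young notifier: reports the bit and clears it. */
static bool toy_clear_young(int i)
{
	bool young;

	assert(i >= 0 && i < TOY_NR_PAGES);
	toy_calls++;
	young = toy_spte_young[i];
	toy_spte_young[i] = false;
	return young;
}

/*
 * Run one aging pass over all pages.
 * two_notifiers: test first, then clear only the pages found young.
 * accessed:      whether every page was touched before the pass.
 */
struct toy_result toy_run(bool two_notifiers, bool accessed)
{
	struct toy_result res = { 0, 0 };

	for (int i = 0; i < TOY_NR_PAGES; i++)
		toy_spte_young[i] = accessed;
	toy_calls = 0;

	for (int i = 0; i < TOY_NR_PAGES; i++) {
		bool young;

		if (two_notifiers) {
			young = toy_test_young(i);
			if (young)
				toy_clear_young(i);
		} else {
			young = toy_clear_young(i);
		}
		if (young)
			res.young++;
	}
	res.calls = toy_calls;
	return res;
}
```

In this model, `toy_run(true, true)` makes two notifier calls per page while `toy_run(false, true)` makes one, and both variants make one call per page when nothing was accessed. That matches the pattern above only qualitatively; the model says nothing about the actual cost of each call.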