From: Yu Zhao
Date: Mon, 22 Jul 2024 15:23:29 -0600
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging
To: James Houghton
Cc: Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
 Catalin Marinas, David Matlack, David Rientjes, James Morse,
 Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
 Ryan Roberts, Sean Christopherson, Shaoqin Huang, Suzuki K Poulose,
 Wei Xu, Will Deacon, Zenghui Yu, kvmarm@lists.linux.dev,
 kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
References: <20240611002145.2078921-1-jthoughton@google.com>
 <20240611002145.2078921-9-jthoughton@google.com>

On Mon, Jul 22, 2024 at 2:46 PM James Houghton wrote:
>
> On Mon, Jul 8, 2024 at 4:42 PM Yu Zhao wrote:
> >
> > On Mon, Jul 8, 2024 at 11:31 AM James Houghton wrote:
> > >
> > > On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao wrote:
> > > >
> > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton wrote:
> > > > > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > > > >                         if (!folio)
> > > > >                                 continue;
> > > > >
> > > > > -                       if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > > > > -                               VM_WARN_ON_ONCE(true);
> > > > > +                       lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > > > > +                       if (pte_young(ptent))
> > > > > +                               ptep_test_and_clear_young(args->vma, addr, pte + i);
> > > > >
> > > > >                         young++;
> > > > >                         walk->mm_stats[MM_LEAF_YOUNG]++;
> > > >
> > > >
> > > > There are two ways to structure the test conditions in walk_pte_range():
> > > > 1. a single pass into the MMU notifier (combine test/clear) which
> > > > causes a cache miss from get_pfn_page() if the page is NOT young.
> > > > 2. two passes into the MMU notifier (separate test/clear) if the page
> > > > is young, which does NOT cause a cache miss if the page is NOT young.
> > > >
> > > > v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> > > > twice every 64 PTEs, and therefore the second option is a clear win.
> > > >
> > > > But you are doing twice per PTE. So what's the rationale behind going
> > > > with the second option? Was the first option considered?
> > >
> > > Hi Yu,
> > >
> > > I didn't consider changing this from your v2[1]. Thanks for bringing it up.
> > >
> > > The only real change I have made is that I reordered the
> > > (!test_spte_young() && !pte_young()) to what it is now (!pte_young()
> > > && !lru_gen_notifier_test_young()) because pte_young() can be
> > > evaluated much faster.
> > >
> > > I am happy to change the initial test_young() notifier to a
> > > clear_young() (and drop the later clear_young(). In fact, I think I
> > > should. Making the condition (!pte_young() &&
> > > !lru_gen_notifier_clear_young()) makes sense to me. This returns the
> > > same result as if it were !lru_gen_notifier_test_young() instead,
> > > there is no need for a second clear_young(), and we don't call
> > > get_pfn_folio() on pages that are not young.
> >
> > We don't want to do that because we would lose the A-bit for a folio
> > that's beyond the current reclaim scope, i.e., the cases where
> > get_pfn_folio() returns NULL (a folio from another memcg, e.g.).
> >
> > > WDYT? Have I misunderstood your comment?
> >
> > I hope this is clear enough:
> >
> > @@ -3395,7 +3395,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >                         if (pfn == -1)
> >                                 continue;
> >
> > -                       if (!pte_young(ptent)) {
> > +                       if (!pte_young(ptent) && !mm_has_notifiers(args->mm)) {
> >                                 walk->mm_stats[MM_LEAF_OLD]++;
> >                                 continue;
> >                         }
> > @@ -3404,8 +3404,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >                         if (!folio)
> >                                 continue;
> >
> > -                       if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > -                               VM_WARN_ON_ONCE(true);
> > +                       if (!ptep_clear_young_notify(args->vma, addr, pte + i))
>
> walk->mm_stats[MM_LEAF_OLD]++ should be here, I take it.
>
> > +                               continue;
> >
> >                         young++;
> >                         walk->mm_stats[MM_LEAF_YOUNG]++;
> >
> > > Also, I take it your comment was not just about walk_pte_range() but
> > > about the similar bits in lru_gen_look_around() as well, so I'll make
> > > whatever changes we agree on there too (or maybe factor out the common
> > > bits).
> > >
> > > [1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
> > >
> > > > In addition, what about the non-lockless cases? Would this change make
> > > > them worse by grabbing the MMU lock twice per PTE?
> > >
> > > That's a good point. Yes I think calling the notifier twice here would
> > > indeed exacerbate problems with a non-lockless notifier.
> >
> > I think so too, but I haven't verified it. Please do?
>
> I have some results now, sorry for the wait.
>
> It seems like one notifier is definitely better. It doesn't look like
> the read lock actually made anything worse with what I was testing
> (faulting memory in while doing aging). This is kind of surprising,

Not at all if you were only exercising the aging path, which only takes
the lock for read. Under memory pressure, we need to do both aging and
eviction, and the latter has to take the lock for write (to unmap).
That's when the real contention happens: the search space is too big
(the entire system memory for global reclaim), so unmapping can easily
collide with clearing the A-bit.

> but either way, I'll change it to the single notifier in v6. Thanks
> Yu!
>
> Here are the results I'm basing this conclusion on, using the selftest
> added at the end of this series.
>
> # Use taskset to minimize NUMA concern.
> # Give an extra core for the aging thread.
> # THPs disabled (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
>
> x86:
>
> # taskset -c 0-32 ./access_tracking_perf_test -l -v 32
> # # One notifier
> Populating memory             : 1.933017284s
> Writing to populated memory   : 0.017323539s
> Reading from populated memory : 0.013113260s
> lru_gen: Aging                : 0.894133259s
> lru_gen: Aging                : 0.738950525s
> Writing to idle memory        : 0.059661329s
> lru_gen: Aging                : 0.922719935s
> lru_gen: Aging                : 0.829129877s
> Reading from idle memory      : 0.059095098s
> lru_gen: Aging                : 0.922689975s
>
> # # Two notifiers
> Populating memory             : 1.842645795s
> Writing to populated memory   : 0.017277075s
> Reading from populated memory : 0.013047457s
> lru_gen: Aging                : 0.900751764s
> lru_gen: Aging                : 0.707203167s
> Writing to idle memory        : 0.060663733s
> lru_gen: Aging                : 1.539957250s  <------ got longer
> lru_gen: Aging                : 0.797475887s
> Reading from idle memory      : 0.084415591s
> lru_gen: Aging                : 1.539417121s  <------ got longer
>
> arm64*:
> (*Patched to do aging; not done in v5 or v6. Doing this to see if the read
> lock is made substantially worse by using two notifiers vs. one.)
>
> # taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3
> # # One notifier
> Populating memory             : 1.439261355s
> Writing to populated memory   : 0.009755279s
> Reading from populated memory : 0.007714120s
> lru_gen: Aging                : 0.540183328s
> lru_gen: Aging                : 0.455427973s
> Writing to idle memory        : 0.010130399s
> lru_gen: Aging                : 0.563424247s
> lru_gen: Aging                : 0.500419850s
> Reading from idle memory      : 0.008519640s
> lru_gen: Aging                : 0.563178643s
>
> # # Two notifiers
> Populating memory             : 1.526805625s
> Writing to populated memory   : 0.009836118s
> Reading from populated memory : 0.007757280s
> lru_gen: Aging                : 0.537770978s
> lru_gen: Aging                : 0.421915391s
> Writing to idle memory        : 0.010281959s
> lru_gen: Aging                : 0.971448688s  <------ got longer
> lru_gen: Aging                : 0.466956547s
> Reading from idle memory      : 0.008588559s
> lru_gen: Aging                : 0.971030648s  <------ got longer
>
>
> arm64, faulting memory in while aging:
>
> # perf record -g -- taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3 -p
> # # One notifier
> vcpu wall time                : 1.433908058s
> lru_gen avg pass duration     : 0.172128073s, (passes:11, total:1.893408807s)
>
> # # Two notifiers
> vcpu wall time                : 1.450387765s
> lru_gen avg pass duration     : 0.175652974s, (passes:10, total:1.756529744s)
>
> # perf report
> # # One notifier
> -    6.25%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
>    - try_to_inc_max_seq
>       - 6.06% walk_page_range
>            __walk_page_range
>          - walk_pgd_range
>             - 6.04% walk_pud_range
>                - 4.73% __mmu_notifier_clear_young
>                   + 4.29% kvm_mmu_notifier_clear_young
>
> # # Two notifiers
> -    6.43%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
>    - try_to_inc_max_seq
>       - 6.25% walk_page_range
>            __walk_page_range
>          - walk_pgd_range
>             - 6.23% walk_pud_range
>                - 2.75% __mmu_notifier_test_young
>                   + 2.48% kvm_mmu_notifier_test_young
>                - 2.39% __mmu_notifier_clear_young
>                   + 2.19% kvm_mmu_notifier_clear_young
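
For reference, the difference between the two variants discussed in this
thread can be sketched as a small userspace model. This is only an
illustration under stated assumptions: struct model_pte, struct
model_stats, model_test_young(), model_clear_young() and the two
model_walk_*() functions are hypothetical names invented for this sketch,
not kernel code. The model only counts notifier calls per PTE and tracks
which A-bits get cleared; it says nothing about locking or timing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one PTE; not a kernel structure. */
struct model_pte {
	bool young;      /* A-bit in the primary (host) page table */
	bool spte_young; /* A-bit in the secondary MMU (e.g. KVM) */
	bool in_scope;   /* false models get_pfn_folio() returning NULL */
};

struct model_stats {
	unsigned int notifier_calls;
	unsigned int leaf_young;
	unsigned int leaf_old;
};

/* Models a test_young notifier: reports the secondary A-bit, leaves it set. */
static bool model_test_young(struct model_pte *p, struct model_stats *s)
{
	s->notifier_calls++;
	return p->spte_young;
}

/* Models a clear_young notifier: reports and clears the secondary A-bit. */
static bool model_clear_young(struct model_pte *p, struct model_stats *s)
{
	bool was_young = p->spte_young;

	s->notifier_calls++;
	p->spte_young = false;
	return was_young;
}

/* Two-notifier variant: test before the folio lookup, clear after it. */
struct model_stats model_walk_two_notifiers(struct model_pte *ptes, size_t n)
{
	struct model_stats s = {0};

	for (size_t i = 0; i < n; i++) {
		struct model_pte *p = &ptes[i];

		if (!p->young && !model_test_young(p, &s)) {
			s.leaf_old++;
			continue;
		}
		if (!p->in_scope)
			continue; /* out of reclaim scope: A-bits untouched */

		model_clear_young(p, &s);
		p->young = false;
		s.leaf_young++;
	}
	return s;
}

/* Single-notifier variant: one combined clear after the folio lookup. */
struct model_stats model_walk_one_notifier(struct model_pte *ptes, size_t n,
					   bool has_notifiers)
{
	struct model_stats s = {0};

	for (size_t i = 0; i < n; i++) {
		struct model_pte *p = &ptes[i];
		bool was_young;

		/* The cheap early exit is only valid without a secondary MMU. */
		if (!p->young && !has_notifiers) {
			s.leaf_old++;
			continue;
		}
		if (!p->in_scope)
			continue; /* out of reclaim scope: A-bits untouched */

		was_young = p->young;
		p->young = false;
		if (has_notifiers)
			was_young |= model_clear_young(p, &s);

		if (!was_young) {
			s.leaf_old++;
			continue;
		}
		s.leaf_young++;
	}
	return s;
}
```

In this model, a PTE that was only accessed through the secondary MMU
costs two notifier calls in the first walk and one in the second, while
an out-of-scope PTE keeps its secondary A-bit in both walks.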