Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yu Zhao <yuzhao@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	 Ankit Agrawal <ankita@nvidia.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Catalin Marinas <catalin.marinas@arm.com>,
	David Matlack <dmatlack@google.com>,
	 David Rientjes <rientjes@google.com>,
	James Morse <james.morse@arm.com>,
	 Jonathan Corbet <corbet@lwn.net>, Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	 Raghavendra Rao Ananta <rananta@google.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	 Sean Christopherson <seanjc@google.com>,
	Shaoqin Huang <shahuang@redhat.com>,
	 Suzuki K Poulose <suzuki.poulose@arm.com>,
	Wei Xu <weixugc@google.com>,  Will Deacon <will@kernel.org>,
	Zenghui Yu <yuzenghui@huawei.com>,
	kvmarm@lists.linux.dev,  kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,  linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging
Date: Mon, 22 Jul 2024 15:23:29 -0600	[thread overview]
Message-ID: <CAOUHufapvh13G9CAwsGiap0=LjE+qor4i1hT=3APOtBSjtX4Kw@mail.gmail.com> (raw)
In-Reply-To: <CADrL8HVRSyS8ZADRTvHZ-QDKBRv1SFvVyJKkr-CW2mzpNjW5Zw@mail.gmail.com>

On Mon, Jul 22, 2024 at 2:46 PM James Houghton <jthoughton@google.com> wrote:
>
> On Mon, Jul 8, 2024 at 4:42 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jul 8, 2024 at 11:31 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote:
> > > > > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > > > >                 if (!folio)
> > > > >                         continue;
> > > > >
> > > > > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > > > > -                       VM_WARN_ON_ONCE(true);
> > > > > +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > > > > +               if (pte_young(ptent))
> > > > > +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
> > > > >
> > > > >                 young++;
> > > > >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> > > >
> > > >
> > > > There are two ways to structure the test conditions in walk_pte_range():
> > > > 1. a single pass into the MMU notifier (combine test/clear) which
> > > > causes a cache miss from get_pfn_page() if the page is NOT young.
> > > > 2. two passes into the MMU notifier (separate test/clear) if the page
> > > > is young, which does NOT cause a cache miss if the page is NOT young.
> > > >
> > > > v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> > > > twice every 64 PTEs, and therefore the second option is a clear win.
> > > >
> > > > But you are doing twice per PTE. So what's the rationale behind going
> > > > with the second option? Was the first option considered?
> > >
> > > Hi Yu,
> > >
> > > I didn't consider changing this from your v2[1]. Thanks for bringing it up.
> > >
> > > The only real change I have made is that I reordered the
> > > (!test_spte_young() && !pte_young()) to what it is now (!pte_young()
> > > && !lru_gen_notifier_test_young()) because pte_young() can be
> > > evaluated much faster.
> > >
> > > I am happy to change the initial test_young() notifier to a
> > > clear_young() (and drop the later clear_young(). In fact, I think I
> > > should. Making the condition (!pte_young() &&
> > > !lru_gen_notifier_clear_young()) makes sense to me. This returns the
> > > same result as if it were !lru_gen_notifier_test_young() instead,
> > > there is no need for a second clear_young(), and we don't call
> > > get_pfn_folio() on pages that are not young.
> >
> > We don't want to do that because we would lose the A-bit for a folio
> > that's beyond the current reclaim scope, i.e., the cases where
> > get_pfn_folio() returns NULL (a folio from another memcg, e.g.).
> >
> > > WDYT? Have I misunderstood your comment?
> >
> > I hope this is clear enough:
> >
> > @@ -3395,7 +3395,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> > long start, unsigned long end,
> >                 if (pfn == -1)
> >                         continue;
> >
> > -               if (!pte_young(ptent)) {
> > +               if (!pte_young(ptent) && !mm_has_notifiers(args->mm)) {
> >                         walk->mm_stats[MM_LEAF_OLD]++;
> >                         continue;
> >                 }
> > @@ -3404,8 +3404,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> > long start, unsigned long end,
> >                 if (!folio)
> >                         continue;
> >
> > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > -                       VM_WARN_ON_ONCE(true);
> > +               if (!ptep_clear_young_notify(args->vma, addr, pte + i))
>
> walk->mm_stats[MM_LEAF_OLD]++ should be here, I take it.
>
> > +                       continue;
> >
> >                 young++;
> >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> >
> > > Also, I take it your comment was not just about walk_pte_range() but
> > > about the similar bits in lru_gen_look_around() as well, so I'll make
> > > whatever changes we agree on there too (or maybe factor out the common
> > > bits).
> > >
> > > [1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
> > >
> > > > In addition, what about the non-lockless cases? Would this change make
> > > > them worse by grabbing the MMU lock twice per PTE?
> > >
> > > That's a good point. Yes I think calling the notifier twice here would
> > > indeed exacerbate problems with a non-lockless notifier.
> >
> > I think so too, but I haven't verified it. Please do?
>
> I have some results now, sorry for the wait.
>
> It seems like one notifier is definitely better. It doesn't look like
> the read lock actually made anything worse with what I was testing
> (faulting memory in while doing aging). This is kind of surprising,

Not at all if you were only doing the aging path, which only takes the
lock for read.

Under memory pressure, we need to both the aging and eviction, and the
latter has to take the lock for write (to unmap). And that's when the
real contention happens, because the search space is too big -- the
entire system memory for global reclaim -- unmapping can easily
collide with clearing the A-bit.

> but either way, I'll change it to the single notifier in v6. Thanks
> Yu!
>
> Here are the results I'm basing this conclusion on, using the selftest
> added at the end of this series.
>
> # Use taskset to minimize NUMA concern.
> # Give an extra core for the aging thread.
> # THPs disabled (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
>
> x86:
>
> # taskset -c 0-32 ./access_tracking_perf_test -l -v 32
> # # One notifier
> Populating memory             : 1.933017284s
> Writing to populated memory   : 0.017323539s
> Reading from populated memory : 0.013113260s
> lru_gen: Aging                : 0.894133259s
> lru_gen: Aging                : 0.738950525s
> Writing to idle memory        : 0.059661329s
> lru_gen: Aging                : 0.922719935s
> lru_gen: Aging                : 0.829129877s
> Reading from idle memory      : 0.059095098s
> lru_gen: Aging                : 0.922689975s
>
> # # Two notifiers
> Populating memory             : 1.842645795s
> Writing to populated memory   : 0.017277075s
> Reading from populated memory : 0.013047457s
> lru_gen: Aging                : 0.900751764s
> lru_gen: Aging                : 0.707203167s
> Writing to idle memory        : 0.060663733s
> lru_gen: Aging                : 1.539957250s  <------ got longer
> lru_gen: Aging                : 0.797475887s
> Reading from idle memory      : 0.084415591s
> lru_gen: Aging                : 1.539417121s  <------ got longer
>
> arm64*:
> (*Patched to do aging; not done in v5 or v6. Doing this to see if the read
> lock is made substantially worse by using two notifiers vs. one.)
>
> # taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3
> # # One notifier
> Populating memory             : 1.439261355s
> Writing to populated memory   : 0.009755279s
> Reading from populated memory : 0.007714120s
> lru_gen: Aging                : 0.540183328s
> lru_gen: Aging                : 0.455427973s
> Writing to idle memory        : 0.010130399s
> lru_gen: Aging                : 0.563424247s
> lru_gen: Aging                : 0.500419850s
> Reading from idle memory      : 0.008519640s
> lru_gen: Aging                : 0.563178643s
>
> # # Two notifiers
> Populating memory             : 1.526805625s
> Writing to populated memory   : 0.009836118s
> Reading from populated memory : 0.007757280s
> lru_gen: Aging                : 0.537770978s
> lru_gen: Aging                : 0.421915391s
> Writing to idle memory        : 0.010281959s
> lru_gen: Aging                : 0.971448688s  <------ got longer
> lru_gen: Aging                : 0.466956547s
> Reading from idle memory      : 0.008588559s
> lru_gen: Aging                : 0.971030648s  <------ got longer
>
>
> arm64, faulting memory in while aging:
>
> # perf record -g -- taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3 -p
> # # One notifier
> vcpu wall time                : 1.433908058s
> lru_gen avg pass duration     : 0.172128073s, (passes:11, total:1.893408807s)
>
> # # Two notifiers
> vcpu wall time                : 1.450387765s
> lru_gen avg pass duration     : 0.175652974s, (passes:10, total:1.756529744s)
>
> # perf report
> # # One notifier
> -    6.25%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
>    - try_to_inc_max_seq
>       - 6.06% walk_page_range
>            __walk_page_range
>          - walk_pgd_range
>             - 6.04% walk_pud_range
>                - 4.73% __mmu_notifier_clear_young
>                   + 4.29% kvm_mmu_notifier_clear_young
>
> # # Two notifiers
> -    6.43%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
>    - try_to_inc_max_seq
>       - 6.25% walk_page_range
>            __walk_page_range
>          - walk_pgd_range
>             - 6.23% walk_pud_range
>                - 2.75% __mmu_notifier_test_young
>                   + 2.48% kvm_mmu_notifier_test_young
>                - 2.39% __mmu_notifier_clear_young
>                   + 2.19% kvm_mmu_notifier_clear_young

next prev parent reply	other threads:[~2024-07-22 21:24 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-11  0:21 [PATCH v5 0/9] mm: multi-gen LRU: Walk secondary MMU page tables while aging James Houghton
2024-06-11  0:21 ` [PATCH v5 1/9] KVM: Add lockless memslot walk to KVM James Houghton
2024-06-11  0:21 ` [PATCH v5 2/9] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn James Houghton
2024-06-11  0:21 ` [PATCH v5 3/9] KVM: arm64: " James Houghton
2024-06-11  5:57   ` Oliver Upton
2024-06-11 16:52     ` James Houghton
2024-06-11  0:21 ` [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier James Houghton
2024-06-11  5:33   ` Yu Zhao
2024-06-11 16:49     ` James Houghton
2024-06-11 18:54       ` Oliver Upton
2024-06-11 19:49         ` Sean Christopherson
2024-06-13  6:52           ` Oliver Upton
2024-06-14  0:48             ` James Houghton
2024-06-11 19:42       ` Sean Christopherson
2024-06-11 23:04         ` James Houghton
2024-06-12  0:34           ` Sean Christopherson
2024-06-14  0:45             ` James Houghton
2024-06-14 16:12               ` Sean Christopherson
2024-06-14 18:23                 ` James Houghton
2024-06-14 23:17                   ` Sean Christopherson
2024-06-17 16:50                     ` James Houghton
2024-06-17 18:37                       ` Sean Christopherson
2024-06-28 23:38                         ` James Houghton
2024-07-08 16:50                           ` James Houghton
2024-07-09 17:49                             ` Sean Christopherson
2024-07-10 23:10                               ` James Houghton
2024-07-12 15:06                                 ` Sean Christopherson
2024-07-15 23:15                                   ` James Houghton
2024-06-11 20:39       ` Yu Zhao
2024-06-11  0:21 ` [PATCH v5 5/9] KVM: Add kvm_fast_age_gfn and kvm_fast_test_age_gfn James Houghton
2024-06-11  0:21 ` [PATCH v5 6/9] KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask James Houghton
2024-06-11  0:21 ` [PATCH v5 7/9] KVM: x86: Implement kvm_fast_test_age_gfn and kvm_fast_age_gfn James Houghton
2024-06-11  0:21 ` [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging James Houghton
2024-06-12 16:02   ` Sean Christopherson
2024-06-12 16:59     ` Yu Zhao
2024-06-12 17:23       ` Sean Christopherson
2024-06-13  6:49         ` Oliver Upton
2024-07-05 18:35   ` Yu Zhao
2024-07-08 17:30     ` James Houghton
2024-07-08 23:41       ` Yu Zhao
2024-07-22 20:45         ` James Houghton
2024-07-22 21:23           ` Yu Zhao [this message]
2024-06-11  0:21 ` [PATCH v5 9/9] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test James Houghton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOUHufapvh13G9CAwsGiap0=LjE+qor4i1hT=3APOtBSjtX4Kw@mail.gmail.com' \
    --to=yuzhao@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=ankita@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=dmatlack@google.com \
    --cc=james.morse@arm.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=maz@kernel.org \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=rananta@google.com \
    --cc=rientjes@google.com \
    --cc=ryan.roberts@arm.com \
    --cc=seanjc@google.com \
    --cc=shahuang@redhat.com \
    --cc=suzuki.poulose@arm.com \
    --cc=weixugc@google.com \
    --cc=will@kernel.org \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox