From: Yu Zhao <yuzhao@google.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Charan Teja Kalla <quic_charante@quicinc.com>,
Kalesh Singh <kaleshsingh@google.com>,
stable@vger.kernel.org
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache
Date: Thu, 11 Jan 2024 00:02:14 -0700 [thread overview]
Message-ID: <CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com> (raw)
In-Reply-To: <CAMgjq7BRaRgYLf2+8=+=nWtzkrHFKmudZPRm41PR6W+A+L=AKA@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 60468 bytes --]
On Wed, Jan 10, 2024 at 12:16 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Yu Zhao <yuzhao@google.com> 于2023年12月26日周二 06:01写道:
> >
> > On Mon, Dec 25, 2023 at 2:52 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Mon, Dec 25, 2023 at 5:03 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > Yu Zhao <yuzhao@google.com> 于2023年12月25日周一 14:30写道:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 1:24 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
> > > > > > Yu Zhao <yuzhao@google.com> 于2023年12月20日周三 16:17写道:
> > > > > > >
> > > > > > > On Tue, Dec 19, 2023 at 11:38 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Dec 19, 2023 at 11:58 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Yu Zhao <yuzhao@google.com> 于2023年12月19日周二 11:45写道:
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 18, 2023 at 8:21 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 18, 2023 at 11:05 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Yu Zhao <yuzhao@google.com> 于2023年12月15日周五 12:56写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Dec 14, 2023 at 04:51:00PM -0700, Yu Zhao wrote:
> > > > > > > > > > > > > > On Thu, Dec 14, 2023 at 11:38 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yu Zhao <yuzhao@google.com> 于2023年12月14日周四 11:09写道:
> > > > > > > > > > > > > > > > On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > > > > > > > > > > > > > > > > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Kairui Song <ryncsn@gmail.com> 于2023年12月12日周二 14:52写道:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Yu Zhao <yuzhao@google.com> 于2023年12月12日周二 06:07写道:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Yu Zhao <yuzhao@google.com> 于2023年12月8日周五 14:14写道:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > > > > > > > > > > > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > > > > > > > > > > > > > > > > on:
> > > > > > > > > > > > > > > > > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > > > > > > > > > > > > > > > > > rmap and flush the TLB) and have less impact on performance (don't
> > > > > > > > > > > > > > > > > > > > > > cause major PFs and can be non-blocking if needed again).
> > > > > > > > > > > > > > > > > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > > > > > > > > > > > > > > > > > client use cases like Android, its apps parse configuration files
> > > > > > > > > > > > > > > > > > > > > > and store the data in heap (anon); for server use cases like MySQL,
> > > > > > > > > > > > > > > > > > > > > > it reads from InnoDB files and holds the cached data for tables in
> > > > > > > > > > > > > > > > > > > > > > buffer pools (anon).
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > > > > > > > > > > > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > > > > > > > > > > > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > > > > > > > > > > > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > > > > > > > > > > > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > > > > > > > > > > > > > > > > moving average smooths out the spike quickly.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > To fix the problem:
> > > > > > > > > > > > > > > > > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > > > > > > > > > > > > > > > > > tracking range of tiers, i.e., five accesses through file
> > > > > > > > > > > > > > > > > > > > > > descriptors, move them to the second oldest generation to give them
> > > > > > > > > > > > > > > > > > > > > > more time to age. (Note that tiers are used by the PID controller
> > > > > > > > > > > > > > > > > > > > > > to statistically determine whether folios accessed multiple times
> > > > > > > > > > > > > > > > > > > > > > through file descriptors are worth protecting.)
> > > > > > > > > > > > > > > > > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > > > > > > > > > > > > > > > > > that they are not too close to the tail. The effect of this is
> > > > > > > > > > > > > > > > > > > > > > similar to the above.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Android, launching 55 apps sequentially:
> > > > > > > > > > > > > > > > > > > > > > Before After Change
> > > > > > > > > > > > > > > > > > > > > > workingset_refault_anon 25641024 25598972 0%
> > > > > > > > > > > > > > > > > > > > > > workingset_refault_file 115016834 106178438 -8%
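
For context, a minimal sketch of how the tiers mentioned above are computed
in mainline v6.7, assuming LRU_REFS_WIDTH == 2 (simplified from
include/linux/mm_inline.h and mm/swap.c; the helper name below is made up
for illustration, and order_base_2() is from <linux/log2.h>):

/*
 * Accesses through file descriptors are counted in folio->flags: the
 * first access sets PG_referenced, the second sets PG_workingset, and
 * further accesses increment the LRU_REFS_MASK bits.  So
 * folio_lru_refs() returns N-1 for a folio accessed N times, and
 * lru_tier_from_refs() maps that to a tier as order_base_2(refs + 1):
 *
 *   accesses   refs   tier
 *      1        0      0
 *      2        1      1
 *      3        2      2
 *      4        3      2
 *      5        4      3   beyond the tracking range; fix 1 above moves
 *                          such folios to the second oldest generation
 *                          in sort_folio()
 */
static inline int tier_from_accesses(int accesses)
{
	return order_base_2(accesses);	/* == order_base_2(refs + 1) */
}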
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thank you for your amazing work on MGLRU.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I believe this is similar to the issue I was trying to resolve previously:
> > > > > > > > > > > > > > > > > > > > > https://lwn.net/Articles/945266/
> > > > > > > > > > > > > > > > > > > > > The idea is to use the refault distance to decide whether a page should be
> > > > > > > > > > > > > > > > > > > > > placed in the oldest generation or some other gen; per my tests this
> > > > > > > > > > > > > > > > > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > > > > > > > > > > > > > > > > multiple workloads.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > There are a few issues left in my previous RFC series, e.g., anon pages
> > > > > > > > > > > > > > > > > > > > > in MGLRU shouldn't be considered. I wanted to collect feedback or test
> > > > > > > > > > > > > > > > > > > > > cases, but unfortunately it didn't seem to get much attention
> > > > > > > > > > > > > > > > > > > > > upstream.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I think both this patch and my previous series are aimed at the
> > > > > > > > > > > > > > > > > > > > > underprotected file pages issue. I did a quick test using this
> > > > > > > > > > > > > > > > > > > > > series; for the MongoDB test, refault distance still seems a better
> > > > > > > > > > > > > > > > > > > > > solution (I'm not saying these two optimizations are mutually exclusive
> > > > > > > > > > > > > > > > > > > > > though, just that they have some conflicts in implementation and solve a
> > > > > > > > > > > > > > > > > > > > > similar problem):
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Previous result:
> > > > > > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > > > > > Execution Results after 905 seconds
> > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > > > > > STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > > > TOTAL 2542 27121571486.2 0.09 txn/s
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > This patch:
> > > > > > > > > > > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > > > > > > > > > > Execution Results after 900 seconds
> > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > > > > > > > > > > STOCK_LEVEL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > > > > > > > > > > TOTAL 1594 27061522574.4 0.06 txn/s
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > The unpatched version is always around 500.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the test results!
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I think there are a few points here:
> > > > > > > > > > > > > > > > > > > > > - Refault distance makes use of page shadows so it can better
> > > > > > > > > > > > > > > > > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > > > > > > > > > > > > > > > > distance).
> > > > > > > > > > > > > > > > > > > > > - Throttled refault distance can help hold part of the workingset when
> > > > > > > > > > > > > > > > > > > > > memory is too small to hold the whole workingset.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > So maybe parts of this patch and bits of the previous series can be
> > > > > > > > > > > > > > > > > > > > > combined to work better on this issue; what do you think?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm working on V4 of the RFC now, which just updates some comments and
> > > > > > > > > > > > > > > > > > skips anon page re-activation in the refault path for MGLRU, which was
> > > > > > > > > > > > > > > > > > not very helpful; only some tiny adjustments.
> > > > > > > > > > > > > > > > > > And I found it easier to test with fio, using the following test script:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > #!/bin/bash
> > > > > > > > > > > > > > > > > > swapoff -a
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > modprobe brd rd_nr=1 rd_size=16777216
> > > > > > > > > > > > > > > > > > mkfs.ext4 /dev/ram0
> > > > > > > > > > > > > > > > > > mount /dev/ram0 /mnt
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > mkdir -p /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > > > > > cd /sys/fs/cgroup/benchmark
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > echo 4G > memory.max
> > > > > > > > > > > > > > > > > > echo $$ > cgroup.procs
> > > > > > > > > > > > > > > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > > > > > > > > > > > > > > > > > --buffered=1 --ioengine=io_uring --iodepth=128 \
> > > > > > > > > > > > > > > > > > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > > > > > > > > > > > > > > > > > --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > > > > > > > > > > > > > > > > > --time_based --ramp_time=5m --runtime=5m --group_reporting
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > zipf:0.5 is used here to simulate cached reads with a slight bias
> > > > > > > > > > > > > > > > > > towards certain pages.
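
(For reference: fio's zipf:theta distribution picks offsets with probability
roughly proportional to 1/rank^theta, so zipf:0.5 is only a mild skew --
hot blocks are re-read more often, but the 12 x 1024m data set still far
exceeds the 4G memory.max above, so reclaim decisions matter.)
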
> > > > > > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > > > > > > > > > > > > > > > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > > > > > > > > > > > > > > > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > > > > > > > > > > > > > > > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > > > > > > > > > > > > > > > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > - If I change zipf:0.5 to random:
> > > > > > > > > > > > > > > > > > Unpatched 6.7-rc4:
> > > > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > > > > > > > > > > > > > > > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Patched with RFC v4:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > > > > > > > > > > > > > > > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Patched with this series:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > > > > > > > > > > > > > > > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > MGLRU off:
> > > > > > > > > > > > > > > > > > Run status group 0 (all jobs):
> > > > > > > > > > > > > > > > > > READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > > > > > > > > > > > > > > > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > fio uses a ramdisk so LRU accuracy will have a smaller impact. The MongoDB
> > > > > > > > > > > > > > > > > > test I provided before uses a SATA SSD so it will have a much higher
> > > > > > > > > > > > > > > > > > impact. I'll provide a script to set up the test case and run it; it's
> > > > > > > > > > > > > > > > > > more complex to set up than fio since it involves setting up multiple
> > > > > > > > > > > > > > > > > > replicas and auth and hundreds of GB of test fixtures. I'm currently
> > > > > > > > > > > > > > > > > > occupied by some other tasks but will try my best to send it out as
> > > > > > > > > > > > > > > > > > soon as possible.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks! Apparently your RFC did show better IOPS with both access
> > > > > > > > > > > > > > > > > patterns, which was a surprise to me because it had higher refaults,
> > > > > > > > > > > > > > > > > and usually higher refaults result in worse performance.
> > > > > > > > > > > > >
> > > > > > > > > > > > > And thanks for providing the refaults I asked for -- your data
> > > > > > > > > > > > > below confirms what I mentioned above:
> > > > > > > > > > > > >
> > > > > > > > > > > > > For fio:
> > > > > > > > > > > > > Your RFC This series Change
> > > > > > > > > > > > > workingset_refault_file 628192729 596790506 -5%
> > > > > > > > > > > > > IOPS 1862k 1830k -2%
> > > > > > > > > > > > >
> > > > > > > > > > > > > For MongoDB:
> > > > > > > > > > > > > Your RFC This series Change
> > > > > > > > > > > > > workingset_refault_anon 10512 35277 +30%
> > > > > > > > > > > > > workingset_refault_file 22751782 20335355 -11%
> > > > > > > > > > > > > total 22762294 20370632 -11%
> > > > > > > > > > > > > TPS 0.09 0.06 -33%
> > > > > > > > > > > > >
> > > > > > > > > > > > > For MongoDB, this series should be a big win (but apparently it's not),
> > > > > > > > > > > > > especially when using zram, since an anon refault should be a lot
> > > > > > > > > > > > > cheaper than a file refault.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So, I'm baffled...
> > > > > > > > > > > > >
> > > > > > > > > > > > > One important detail I forgot to mention: based on your data from
> > > > > > > > > > > > > lru_gen_full, I think there is another difference between our Kconfigs:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Your Kconfig My Kconfig Max possible
> > > > > > > > > > > > > LRU_REFS_WIDTH 1 2 2
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the info, my fault, I forgot to update my config as I was
> > > > > > > > > > > > testing some other features.
> > > > > > > > > > > > But after I changed LRU_REFS_WIDTH to 2 by disabling IDLE_PAGE, things
> > > > > > > > > > > > got much worse for the MongoDB test:
> > > > > > > > > > > >
> > > > > > > > > > > > With LRU_REFS_WIDTH == 2:
> > > > > > > > > > > >
> > > > > > > > > > > > This patch:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 919 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > STOCK_LEVEL 488 27598136201.9 0.02 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > TOTAL 488 27598136201.9 0.02 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > memcg 86 /system.slice/docker-1c3a90be9f0a072f5719332419550cd0e1455f2cd5863bc2780ca4d3f913ece5.scope
> > > > > > > > > > > > node 0
> > > > > > > > > > > > 1 948187 0x 0x
> > > > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 2 948187 0 6051788·
> > > > > > > > > > > > 0 0r 0e 0p 11916r
> > > > > > > > > > > > 66442e 0p
> > > > > > > > > > > > 1 0r 0e 0p 903r
> > > > > > > > > > > > 16888e 0p
> > > > > > > > > > > > 2 0r 0e 0p 459r
> > > > > > > > > > > > 9764e 0p
> > > > > > > > > > > > 3 0r 0e 0p 0r
> > > > > > > > > > > > 0e 2874p
> > > > > > > > > > > > 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 3 948187 1353160 6351·
> > > > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 4 73045 23573 12·
> > > > > > > > > > > > 0 0R 0T 0 3498607R
> > > > > > > > > > > > 4868605T 0·
> > > > > > > > > > > > 1 0R 0T 0 3012246R
> > > > > > > > > > > > 3270261T 0·
> > > > > > > > > > > > 2 0R 0T 0 2498608R
> > > > > > > > > > > > 2839104T 0·
> > > > > > > > > > > > 3 0R 0T 0 0R
> > > > > > > > > > > > 1983947T 0·
> > > > > > > > > > > > 1486579L 0O 1380614Y 2945N
> > > > > > > > > > > > 2945F 2734A
> > > > > > > > > > > >
> > > > > > > > > > > > workingset_refault_anon 0
> > > > > > > > > > > > workingset_refault_file 18130598
> > > > > > > > > > > >
> > > > > > > > > > > > total used free shared buff/cache available
> > > > > > > > > > > > Mem: 31978 6705 312 20 24960 24786
> > > > > > > > > > > > Swap: 31977 4 31973
> > > > > > > > > > > >
> > > > > > > > > > > > RFC:
> > > > > > > > > > > > ==================================================================
> > > > > > > > > > > > Execution Results after 908 seconds
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > Executed Time (µs) Rate
> > > > > > > > > > > > STOCK_LEVEL 2252 27159962888.2 0.08 txn/s
> > > > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > > > TOTAL 2252 27159962888.2 0.08 txn/s
> > > > > > > > > > > >
> > > > > > > > > > > > workingset_refault_anon 22585
> > > > > > > > > > > > workingset_refault_file 22715256
> > > > > > > > > > > >
> > > > > > > > > > > > memcg 66 /system.slice/docker-0989446ff78106e32d3f400a0cf371c9a703281bded86d6d6bb1af706ebb25da.scope
> > > > > > > > > > > > node 0
> > > > > > > > > > > > 22 563007 2274 1198225·
> > > > > > > > > > > > 0 0r 1e 0p 0r
> > > > > > > > > > > > 697076e 0p
> > > > > > > > > > > > 1 0r 0e 0p 0r
> > > > > > > > > > > > 0e 325661p
> > > > > > > > > > > > 2 0r 0e 0p 0r
> > > > > > > > > > > > 0e 888728p
> > > > > > > > > > > > 3 0r 0e 0p 0r
> > > > > > > > > > > > 0e 3602238p
> > > > > > > > > > > > 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 23 532222 7525 4948747·
> > > > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 24 500367 1214667 3292·
> > > > > > > > > > > > 0 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 1 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 2 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 3 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 0 0 0 0
> > > > > > > > > > > > 0 0·
> > > > > > > > > > > > 25 469692 40797 466·
> > > > > > > > > > > > 0 0R 271T 0 0R
> > > > > > > > > > > > 1162165T 0·
> > > > > > > > > > > > 1 0R 0T 0 774028R
> > > > > > > > > > > > 1205332T 0·
> > > > > > > > > > > > 2 0R 0T 0 0R
> > > > > > > > > > > > 932484T 0·
> > > > > > > > > > > > 3 0R 1T 0 0R
> > > > > > > > > > > > 4252158T 0·
> > > > > > > > > > > > 25178380L 156515O 23953602Y 59234N
> > > > > > > > > > > > 49391F 48664A
> > > > > > > > > > > >
> > > > > > > > > > > > total used free shared buff/cache available
> > > > > > > > > > > > Mem: 31978 6968 338 5 24671 24555
> > > > > > > > > > > > Swap: 31977 1533 30444
> > > > > > > > > > > >
> > > > > > > > > > > > Using the same MongoDB config (a 3-replica cluster, all using the same config):
> > > > > > > > > > > > {
> > > > > > > > > > > > "net": {
> > > > > > > > > > > > "bindIpAll": true,
> > > > > > > > > > > > "ipv6": false,
> > > > > > > > > > > > "maxIncomingConnections": 10000,
> > > > > > > > > > > > },
> > > > > > > > > > > > "setParameter": {
> > > > > > > > > > > > "disabledSecureAllocatorDomains": "*"
> > > > > > > > > > > > },
> > > > > > > > > > > > "replication": {
> > > > > > > > > > > > "oplogSizeMB": 10480,
> > > > > > > > > > > > "replSetName": "issa-tpcc_0"
> > > > > > > > > > > > },
> > > > > > > > > > > > "security": {
> > > > > > > > > > > > "keyFile": "/data/db/keyfile"
> > > > > > > > > > > > },
> > > > > > > > > > > > "storage": {
> > > > > > > > > > > > "dbPath": "/data/db/",
> > > > > > > > > > > > "syncPeriodSecs": 60,
> > > > > > > > > > > > "directoryPerDB": true,
> > > > > > > > > > > > "wiredTiger": {
> > > > > > > > > > > > "engineConfig": {
> > > > > > > > > > > > "cacheSizeGB": 5
> > > > > > > > > > > > }
> > > > > > > > > > > > }
> > > > > > > > > > > > },
> > > > > > > > > > > > "systemLog": {
> > > > > > > > > > > > "destination": "file",
> > > > > > > > > > > > "logAppend": true,
> > > > > > > > > > > > "logRotate": "rename",
> > > > > > > > > > > > "path": "/data/db/mongod.log",
> > > > > > > > > > > > "verbosity": 0
> > > > > > > > > > > > }
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > The test environment has 32G of memory and 16 cores.
> > > > > > > > > > > >
> > > > > > > > > > > > Per my analysis, the access pattern of the MongoDB test is that pages
> > > > > > > > > > > > are re-accessed long after they are evicted, so the PID controller won't
> > > > > > > > > > > > protect the higher tiers. The RFC makes use of the long-lived shadow
> > > > > > > > > > > > entries to feed back into the PID controller/generations, so the result
> > > > > > > > > > > > is much better. It still needs more adjusting though; I will try to
> > > > > > > > > > > > rebase on top of mm-unstable, which includes your patch.
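
A rough sketch of the refault distance idea referenced above, for readers
following along (conceptual only -- not the RFC's actual code, and the
names below are made up):

/*
 * On eviction, the shadow entry left behind in the page cache records
 * a snapshot of an eviction counter.  On refault, the difference (the
 * "refault distance") approximates how much was reclaimed while the
 * page was out.  If the page could have been kept had that much more
 * memory been protected, treat it as part of the workingset and place
 * it in (or feed back to) a younger generation instead of the oldest.
 */
static bool refault_within_workingset(unsigned long evicted_snapshot,
				      unsigned long evicted_now,
				      unsigned long protectable_pages)
{
	unsigned long distance = evicted_now - evicted_snapshot;

	return distance < protectable_pages;
}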
> > > > > > > > > > > >
> > > > > > > > > > > > I've no idea why workingset_refault_* is higher in the better case;
> > > > > > > > > > > > this is clearly an IO-bound workload, memory and IO are busy while the
> > > > > > > > > > > > CPU is not fully used...
> > > > > > > > > > > >
> > > > > > > > > > > > I've uploaded my local reproducer here:
> > > > > > > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > > > > > > https://github.com/ryncsn/py-tpcc
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the repos -- I'm trying them right now. Which MongoDB
> > > > > > > > > > > version did you use? setup.sh didn't seem to install it.
> > > > > > > > > > >
> > > > > > > > > > > Also do you have a QEMU image? It'd be a lot easier for me to
> > > > > > > > > > > duplicate the exact environment by looking into it.
> > > > > > > > > >
> > > > > > > > > > I ended up using docker.io/mongodb/mongodb-community-server:latest,
> > > > > > > > > > and it's not working:
> > > > > > > > > >
> > > > > > > > > > # docker exec -it mongo-r1 mongosh --eval \
> > > > > > > > > > '"rs.initiate({
> > > > > > > > > > _id: "issa-tpcc_0",
> > > > > > > > > > members: [
> > > > > > > > > > {_id: 0, host: "mongo-r1"},
> > > > > > > > > > {_id: 1, host: "mongo-r2"},
> > > > > > > > > > {_id: 2, host: "mongo-r3"}
> > > > > > > > > > ]
> > > > > > > > > > })"'
> > > > > > > > > > Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
> > > > > > > > > > Error: can only create exec sessions on running containers: container
> > > > > > > > > > state improper
> > > > > > > > >
> > > > > > > > > Hi Yu,
> > > > > > > > >
> > > > > > > > > I've updated the test repo:
> > > > > > > > > https://github.com/ryncsn/emm-test-project/tree/master/mongo-cluster
> > > > > > > > >
> > > > > > > > > I've tested it on top of the latest Fedora Cloud Image 39 and it worked
> > > > > > > > > well for me; the README now contains detailed, easy-to-follow
> > > > > > > > > steps to reproduce this test.
> > > > > > > >
> > > > > > > > Thanks. I was following the instructions down to the letter and it
> > > > > > > > fell apart again at line 46 (./tpcc.py).
> > > > > > >
> > > > > > > I think you just broke it by
> > > > > > > https://github.com/ryncsn/py-tpcc/commit/7b9b380d636cb84faa5b11b5562e531f924eeb7e
> > > > > > >
> > > > > > > (But it's also possible you actually wanted me to use this latest
> > > > > > > commit but forgot to account for it in your instructions.)
> > > > > > >
> > > > > > > > Were you able to successfully run the benchmark on a fresh VM by
> > > > > > > > following the instructions? If not, I'd appreciate it if you could do
> > > > > > > > so and document all the missing steps.
> > > > > >
> > > > > > Ah, you are right, I attempted to convert it to Python 3 but found it
> > > > > > only brought more trouble, so I gave up and the instructions still
> > > > > > use Python 2. However I accidentally pushed the WIP Python 3 conversion
> > > > > > commit... I've reset the repo to
> > > > > > https://github.com/ryncsn/py-tpcc/commit/86e862c5cf3b2d1f51e0297742fa837c7a99ebf8,
> > > > > > which is working well. Sorry for the inconvenience.
> > > > >
> > > > > Thanks -- I was able to reproduce results similar to yours.
> > > > >
> > > >
> > > > Hi Yu,
> > > >
> > > > Thanks for the testing, and merry xmas.
> > > >
> > > > > It turned out the mystery (fewer refaults but worse performance) was caused by
> > > > > 13.89% 13.89% kswapd0 [kernel.vmlinux] [k]
> > > > > __list_del_entry_valid_or_report
> > > >
> > > > I'm not sure about this; if the task were CPU bound, this could
> > > > explain it. But it's not: the performance gap is larger when tested on
> > > > a slow IO device.
> > > >
> > > > The iostat output during my test run:
> > > > avg-cpu: %user %nice %system %iowait %steal %idle
> > > > 7.40 0.00 2.42 83.37 0.00 6.80
> > > > Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s
> > > > %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
> > > > vda 35.00 0.80 167.60 17.20 6.90 3.50
> > > > 16.47 81.40 0.47 1.62 0.02 4.79 21.50 0.63 2.27
> > > > vdb 5999.30 4.80 104433.60 84.00 0.00 8.30
> > > > 0.00 63.36 6.54 1.31 39.25 17.41 17.50 0.17 100.00
> > > > zram0 0.00 0.00 0.00 0.00 0.00 0.00
> > > > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> > >
> > > I ran the benchmark on the slowest bare metal I have that roughly
> > > matches your CPU/DRAM configurations (ThinkPad P1 G4
> > > https://support.lenovo.com/us/en/solutions/pd031426).
> > >
> > > But it seems you used a VM (vda/vdb) -- I never run performance
> > > benchmarks in VMs because the host and hypervisor can complicate
> > > things, for example, in this case, is it possible the host page cache
> > > cached your disk image containing the database files?
> > >
> > > > You can see the CPU is waiting for IO; %user is always around 10%.
> > > > The hotspot you posted only takes up 13.89% of the runtime, which
> > > > shouldn't cause such a large performance drop.
> > > >
> > > > >
> > > > > Apparently Fedora has CONFIG_DEBUG_LIST=y by default, and after I
> > > > > turned it off (the only change I made), this series showed better TPS
> > > > > (I used "--duration=10800" for more reliable results):
> > > > > v6.7-rc6 RFC [1] change
> > > > > total txns 25024 24672 +1%
> > > > > workingset_refault_anon 573668 680248 -16%
> > > > > workingset_refault_file 260631976 265808452 -2%
> > > >
> > > > I have disabled CONFIG_DEBUG_LIST when doing the performance comparison tests.
> >
> > Also I'd suggest we both use the same distro you shared with me and
> > the default .config except CONFIG_DEBUG_LIST=n, and v6.7-rc6 for now.
> >
> > (I'm attaching the default .config based on /boot/config-6.5.6-300.fc39.x86_64.)
> >
>
> Hi Yu
>
> I've been adapting and testing the refault distance series based on the
> latest 6.7. I also found a serious bug in my previous V3, so I updated
> it here with some important changes (using a separate refault
> distance model, instead of gluing it onto the active/inactive model):
> https://github.com/ryncsn/linux/commits/kasong/devel/refault-distance-2024-1/
>
> So far I can conclude that the previous result was not caused by the host
> cache. I set up a bare-metal test environment, strictly using your
> config, and gathered some data (I also updated the refault distance
> patch series -- the updated version is in the link above -- and the bare
> metal has a fast NVMe, so the performance gap wasn't as large but was
> still stably observable):
>
> With latest 6.7 (Your config):
> ==================================================================
> Execution Results after 905 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4025 27162035181.5 0.15 txn/s
> ------------------------------------------------------------------
> TOTAL 4025 27162035181.5 0.15 txn/s
>
> vmstat:
> workingset_nodes 82996
> workingset_refault_anon 269371
> workingset_refault_file 37671044
> workingset_activate_anon 121061
> workingset_activate_file 8927227
> workingset_restore_anon 121061
> workingset_restore_file 2578269
> workingset_nodereclaim 62394
>
> lru_gen_full:
> memcg 67 /machine.slice/libpod-38b33777db34724cf8edfbef1ac2e4fd0621f14151e241bbf1430d397d3dee51.scope/container
> node 0
> 34 60565 21248 1254331
> 0 0r 0e 0p 121186r
> 169948e 0p
> 1 0r 0e 0p 156224r
> 222553e 0p
> 2 0r 0e 0p 0r
> 0e 4227858p
> 3 0r 0e 0p 0r
> 0e 0p
> 0 0 0 0
> 0 0
> 35 41132 714504 4300280
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 36 20586 473476 2105
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 37 2035 817 876
> 0 6647R 9116T 0 166836R
> 871850T 0
> 1 0R 0T 0 110807R
> 296447T 0
> 2 0R 268T 0 0R
> 4655276T 0
> 3 0R 0T 0 0R
> 0T 0
> 12510062L 639646O 11048666Y 45512N
> 24520F 23613A
> iostat:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 76.29 0.00 12.09 3.50 1.44 6.69
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 16.12 684.50 36.93 0.00
> 649996 35070 0
> dm-1 0.05 1.10 0.00 0.00
> 1044 0 0
> nvme0n1 16.47 700.22 39.09 0.00 664922
> 37118 0
> nvme1n1 4905.93 205287.92 1030.70 0.00
> 194939353 978740 0
> zram0 4440.17 5356.90 12404.81 0.00
> 5086856 11779480 0
>
> free -m:
> total used free shared buff/cache available
> Mem: 31830 9475 403 0 21951 21918
> Swap: 31829 6500 25329
>
> With latest refault distance series (Your config):
> ==================================================================
> Execution Results after 902 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4260 27065448172.8 0.16 txn/s
> ------------------------------------------------------------------
> TOTAL 4260 27065448172.8 0.16 txn/s
>
> workingset_nodes 113824
> workingset_refault_anon 293426
> workingset_refault_file 42700484
> workingset_activate_anon 0
> workingset_activate_file 13410174
> workingset_restore_anon 289106
> workingset_restore_file 5592042
> workingset_nodereclaim 33249
>
> memcg 67 /machine.slice/libpod-8eff6b7b65e34fe0497ff5c0c88c750f6896c43a06bb26e8cd6470de596be76e.scope/container
> node 0
> 15 261222 266350 65081
> 0 0r 0e 0p 185212r
> 2314329e 0p
> 1 0r 0e 0p 40887r
> 710312e 0p
> 2 0r 0e 0p 0r
> 0e 5026031p
> 3 0r 0e 0p 0r
> 0e 0p
> 0 0 0 0
> 0 0
> 16 199341 267661 5034442
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 17 120655 547852 592
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 18 55172 127769 3855
> 0 1910R 2975T 0 1894614R
> 4375361T 0
> 1 0R 0T 0 2099208R
> 2861460T 0
> 2 0R 27T 0 446000R
> 5569781T 0
> 3 0R 0T 0 0R
> 0T 0
> 2817512L 35421O 2579377Y 10452N
> 5517F 5414A
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 76.34 0.00 11.25 4.22 1.29 6.90
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 12.85 563.18 30.75 0.00
> 532390 29070 0
> dm-1 0.05 1.10 0.00 0.00
> 1044 0 0
> nvme0n1 13.22 578.97 32.92 0.00
> 547315 31119 0
> nvme1n1 5384.11 229164.12 1038.95 0.00
> 216635713 982152 0
> zram0 3590.88 4730.84 9633.71 0.00
> 4472204 9107032 0
>
> total used free shared buff/cache available
> Mem: 31830 10854 520 0 20455 20541
> Swap: 31829 4508 27321
>
> You can see that refault distance is now protecting more anon pages;
> total IO on zram is lower, the workload is mostly CPU bound, and the NVMe
> is fast enough, resulting in better performance.
>
> Things get more interesting if I disable the page idle flag (so the refs
> bits are extended; in your config there is only one refs bit, so it may
> overprotect file pages):
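
(For reference, assuming the v6.7 mapping lru_tier_from_refs() ==
order_base_2(refs + 1): with one refs bit plus PG_workingset,
folio_lru_refs() tops out at 2, so the "beyond the tracking range"
threshold is hit after three accesses instead of five -- presumably the
over-protection referred to here. With two refs bits it tops out at 4,
i.e. five accesses.)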
>
> Latest 6.7 (your config with page idle flag disabled):
> ==================================================================
> Execution Results after 904 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4016 27122163703.9 0.15 txn/s
> ------------------------------------------------------------------
> TOTAL 4016 27122163703.9 0.15 txn/s
>
> workingset_nodes 99637
> workingset_refault_anon 309548
> workingset_refault_file 45896663
> workingset_activate_anon 129401
> workingset_activate_file 18461236
> workingset_restore_anon 129400
> workingset_restore_file 4963707
> workingset_nodereclaim 43970
>
> memcg 67 /machine.slice/libpod-7546463bd2b257a9b799817ca11bee1389d7deec20032529098520a89a207d7e.scope/container
> node 0
> 27 103004 328070 269733
> 0 0r 0e 0p 509949r
> 1957117e 0p
> 1 0r 0e 0p 141642r
> 319695e 0p
> 2 0r 0e 0p 777835r
> 793518e 0p
> 3 0r 0e 0p 0r
> 0e 4333835p
> 0 0 0 0
> 0 0
> 28 82361 24748 5192182
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 29 57025 786386 5681
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 30 18619 76289 1273
> 0 4295R 8888T 0 222326R
> 1044601T 0
> 1 0R 0T 0 117646R
> 301735T 0
> 2 0R 0T 0 433431R
> 825516T 0
> 3 0R 1T 0 0R
> 4076839T 0
> 13369819L 603360O 11981074Y 47388N
> 26235F 25276A
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 74.90 0.00 11.96 4.92 1.62 6.60
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 14.93 645.44 36.54 0.00
> 610150 34540 0
> dm-1 0.05 1.10 0.00 0.00
> 1044 0 0
> nvme0n1 15.30 661.23 38.71 0.00
> 625076 36589 0
> nvme1n1 6352.42 240726.35 1036.47 0.00
> 227565845 979808 0
> zram0 4189.65 4883.27 11876.36 0.00
> 4616304 11227080 0
>
> total used free shared buff/cache available
> Mem: 31830 9529 509 0 21791 21867
> Swap: 31829 6441 25388
>
> Refault distance series (your config with page idle flag disabled):
> ==================================================================
> Execution Results after 901 seconds
> ------------------------------------------------------------------
> Executed Time (µs) Rate
> STOCK_LEVEL 4268 27060267967.7 0.16 txn/s
> ------------------------------------------------------------------
> TOTAL 4268 27060267967.7 0.16 txn/s
>
> workingset_nodes 115394
> workingset_refault_anon 144774
> workingset_refault_file 41055081
> workingset_activate_anon 8
> workingset_activate_file 13194460
> workingset_restore_anon 144629
> workingset_restore_file 187419
> workingset_nodereclaim 19598
>
> memcg 66 /machine.slice/libpod-4866051af817731602b37017b0e71feb2a8f2cbaa949f577e0444af01b4f3b0c.scope/container
> node 0
> 12 213402 18054 1287510
> 0 0r 0e 0p 0r
> 15755e 0p
> 1 0r 0e 0p 0r
> 4771e 0p
> 2 0r 0e 0p 908r
> 6810e 0p
> 3 0r 0e 0p 0r
> 0e 3533888p
> 0 0 0 0
> 0 0
> 13 141209 10690 3571958
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 14 69327 1165064 34657
> 0 0 0 0 0
> 0 0
> 1 0 0 0 0
> 0 0
> 2 0 0 0 0
> 0 0
> 3 0 0 0 0
> 0 0
> 0 0 0 0
> 0 0
> 15 6404 21574 3363
> 0 953R 1773T 0 1263395R
> 3816639T 0
> 1 0R 0T 0 1164069R
> 1936973T 0
> 2 0R 0T 0 350041R
> 409121T 0
> 3 0R 3T 0 12305R
> 4767303T 0
> 3622197L 36916O 3338446Y 10409N
> 7120F 6945A
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 75.79 0.00 10.68 3.91 1.18 8.44
>
> Device tps kB_read/s kB_wrtn/s kB_dscd/s
> kB_read kB_wrtn kB_dscd
> dm-0 12.66 547.71 38.73 0.00
> 526782 37248 0
> dm-1 0.05 1.09 0.00 0.00
> 1044 0 0
> nvme0n1 13.02 563.23 40.86 0.00
> 541708 39297 0
> nvme1n1 4916.00 217529.48 1018.04 0.00
> 209217677 979136 0
> zram0 1744.90 1970.86 5009.75 0.00
> 1895556 4818328 0
>
> total used free shared buff/cache available
> Mem: 31830 11713 485 0 19630 19684
> Swap: 31829 2847 28982
>
> And the refault distance series is still better, and refaults are also
> lower for both anon/file pages.
>
> ------
> I did some more tests using MySQL and other workloads; no performance
> drop observed so far.
> And with a looped MongoDB test (keep running the 900s test in a loop)
> using my previous VM env
> (the SATA SSD vdb uses cache bypass, so it's not a host cache issue here)
> I found one interesting thing (the refs width is set to 2 in the config):
>
> Loop test using 6.7:
> STOCK_LEVEL 874 27246011368.3 0.03 txn/s
> STOCK_LEVEL 1533 27023181579.6 0.06 txn/s
> STOCK_LEVEL 1122 28044867987.6 0.04 txn/s
> STOCK_LEVEL 1032 27378070931.9 0.04 txn/s
> STOCK_LEVEL 1021 27612530579.1 0.04 txn/s
> STOCK_LEVEL 750 28076187896.3 0.03 txn/s
> STOCK_LEVEL 780 27519993034.8 0.03 txn/s
> Refault stat here:
> workingset_refault_anon 126369
> workingset_refault_file 170389428
> STOCK_LEVEL 750 27464016123.5 0.03 txn/s
> STOCK_LEVEL 780 27529550313.0 0.03 txn/s
> STOCK_LEVEL 750 28296286486.1 0.03 txn/s
> STOCK_LEVEL 690 27504193850.3 0.03 txn/s
> STOCK_LEVEL 716 28089360754.5 0.03 txn/s
> STOCK_LEVEL 607 27852180474.3 0.02 txn/s
> STOCK_LEVEL 689 27703367075.4 0.02 txn/s
> STOCK_LEVEL 630 28184685482.7 0.02 txn/s
> STOCK_LEVEL 450 28667721196.2 0.02 txn/s
> STOCK_LEVEL 450 28047985314.4 0.02 txn/s
> STOCK_LEVEL 450 28125609857.3 0.02 txn/s
> STOCK_LEVEL 420 27393478488.0 0.02 txn/s
> STOCK_LEVEL 420 27435537312.3 0.02 txn/s
> STOCK_LEVEL 420 29060748699.2 0.01 txn/s
> STOCK_LEVEL 420 28155584095.2 0.01 txn/s
> STOCK_LEVEL 420 27888635407.0 0.02 txn/s
> STOCK_LEVEL 420 27307856858.5 0.02 txn/s
> STOCK_LEVEL 420 28842280889.0 0.01 txn/s
> STOCK_LEVEL 390 27640696814.1 0.01 txn/s
> STOCK_LEVEL 420 28471605716.7 0.01 txn/s
> STOCK_LEVEL 420 27648174237.5 0.02 txn/s
> STOCK_LEVEL 420 27848217938.7 0.02 txn/s
> STOCK_LEVEL 420 27344698602.2 0.02 txn/s
> STOCK_LEVEL 420 27046819537.2 0.02 txn/s
> STOCK_LEVEL 420 27855626843.2 0.02 txn/s
> STOCK_LEVEL 420 27971873627.9 0.02 txn/s
> STOCK_LEVEL 420 28007014046.4 0.01 txn/s
> STOCK_LEVEL 420 28445164626.1 0.01 txn/s
> STOCK_LEVEL 420 27902621006.5 0.02 txn/s
> STOCK_LEVEL 420 28282574433.3 0.01 txn/s
> STOCK_LEVEL 390 27161599608.7 0.01 txn/s
>
> Using the refault distance series:
> STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
> STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
> STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
> STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
> STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
> STOCK_LEVEL 717 27455091366.3 0.03 txn/s
> STOCK_LEVEL 967 27404085208.2 0.04 txn/s
> Refault stat here:
> workingset_refault_anon 109337
> workingset_refault_file 191249716
> STOCK_LEVEL 716 27448213557.2 0.03 txn/s
> STOCK_LEVEL 807 28607974517.8 0.03 txn/s
> STOCK_LEVEL 760 28081442513.2 0.03 txn/s
> STOCK_LEVEL 745 28594555797.6 0.03 txn/s
> STOCK_LEVEL 450 27999536348.3 0.02 txn/s
> STOCK_LEVEL 598 27095531895.4 0.02 txn/s
> STOCK_LEVEL 711 27623112841.1 0.03 txn/s
> STOCK_LEVEL 540 28358770820.6 0.02 txn/s
> STOCK_LEVEL 480 27734277554.5 0.02 txn/s
> STOCK_LEVEL 450 27313906125.3 0.02 txn/s
> STOCK_LEVEL 480 27487299100.4 0.02 txn/s
> STOCK_LEVEL 480 27804589683.5 0.02 txn/s
> STOCK_LEVEL 480 28242205820.8 0.02 txn/s
> STOCK_LEVEL 480 27540680102.3 0.02 txn/s
> STOCK_LEVEL 450 27428645816.8 0.02 txn/s
> STOCK_LEVEL 480 27946866129.2 0.02 txn/s
> STOCK_LEVEL 480 27266068262.3 0.02 txn/s
> STOCK_LEVEL 450 27267487051.5 0.02 txn/s
> STOCK_LEVEL 480 27896369224.8 0.02 txn/s
> STOCK_LEVEL 480 28784662706.1 0.02 txn/s
> STOCK_LEVEL 450 27179853217.8 0.02 txn/s
> STOCK_LEVEL 480 28170594101.7 0.02 txn/s
> STOCK_LEVEL 450 28084651341.0 0.02 txn/s
> STOCK_LEVEL 480 27901608868.6 0.02 txn/s
> STOCK_LEVEL 480 27323790886.6 0.02 txn/s
> STOCK_LEVEL 480 28891008895.4 0.02 txn/s
> STOCK_LEVEL 480 27964563148.0 0.02 txn/s
> STOCK_LEVEL 450 27942421198.4 0.02 txn/s
> STOCK_LEVEL 480 28833968825.8 0.02 txn/s
> STOCK_LEVEL 480 28090975437.9 0.02 txn/s
> STOCK_LEVEL 480 27915246877.4 0.02 txn/s
>
> It seems the performance drains as the test keeps running (might be
> caused by MongoDB's anon usage rising or the DB's internal caching/logging);
> that explains why for a long-term test the performance gap seems
> smaller. The VM has poor IO performance so the test also runs much
> slower and takes a long time to warm up.
>
> But I think it's clear that the refault distance series boosts
> performance during warm-up, and for long-running workloads it also
> looks better, especially on machines with low IO performance.
>
> I still can't explain why workingset_refault is higher for the
> better case in the VM environment... I can re-setup/reboot/randomize
> the test and the performance is the same here. My guess is it may be
> related to readahead or some kernel-space IO path issue? The actual IO
> usage is lower when the refault distance series is applied.
>
> I noticed a slight performance regression (1-3%) for pure in-memory fio
> though; the "bulk series" I sent previously can help improve it.
>
> There is a bug in my previous V3 that causes the PID controller to
> lose control in the long term (due to a buggy bit operation, my bad),
> which I've fixed in the link above. I can send out a new series if you
> think it's acceptable.
Could you try the attached patch on the mainline v6.7 and see how it
compares with the results above? Thanks.
[-- Attachment #2: mglru-6.7.patch --]
[-- Type: application/octet-stream, Size: 16210 bytes --]
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f4fe593c1400..d9a8a4affaaf 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -133,18 +133,17 @@ static inline int lru_hist_from_seq(unsigned long seq)
return seq % NR_HIST_GENS;
}
-static inline int lru_tier_from_refs(int refs)
+static inline int lru_tier_from_refs(int refs, bool workingset)
{
- VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
+ VM_WARN_ON_ONCE(refs >= BIT(LRU_REFS_WIDTH));
/* see the comment in folio_lru_refs() */
- return order_base_2(refs + 1);
+ return workingset ? MAX_NR_TIERS - 1 : order_base_2(refs + 1);
}
static inline int folio_lru_refs(struct folio *folio)
{
unsigned long flags = READ_ONCE(folio->flags);
- bool workingset = flags & BIT(PG_workingset);
/*
* Return the number of accesses beyond PG_referenced, i.e., N-1 if the
@@ -152,7 +151,18 @@ static inline int folio_lru_refs(struct folio *folio)
* tier. lru_tier_from_refs() will account for this off-by-one. Also see
* the comment on MAX_NR_TIERS.
*/
- return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
+ return flags & BIT(PG_referenced) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static inline bool lru_gen_activate(struct folio *folio)
+{
+ if (!folio_test_referenced(folio) && !folio_test_workingset(folio)) {
+ folio_set_referenced(folio);
+ return false;
+ }
+
+ set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_workingset));
+ return true;
}
static inline int folio_lru_gen(struct folio *folio)
@@ -244,15 +254,15 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
* oldest generation otherwise. See lru_gen_is_active().
*/
if (folio_test_active(folio))
- seq = lrugen->max_seq;
+ seq = lrugen->max_seq - !folio_test_workingset(folio);
else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
(folio_test_reclaim(folio) &&
(folio_test_dirty(folio) || folio_test_writeback(folio))))
seq = lrugen->max_seq - 1;
- else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
+ else if (reclaiming)
seq = lrugen->min_seq[type];
else
- seq = lrugen->min_seq[type] + 1;
+ seq = lrugen->min_seq[type] + folio_test_workingset(folio);
gen = lru_gen_from_seq(seq);
flags = (gen + 1UL) << LRU_GEN_PGOFF;
@@ -303,6 +313,11 @@ static inline bool lru_gen_in_fault(void)
return false;
}
+static inline bool lru_gen_activate(struct folio *folio)
+{
+ return false;
+}
+
static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
{
return false;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9db36e197712..5c29b7693f64 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -375,6 +375,7 @@ struct page_vma_mapped_walk;
#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+#define LRU_REFS_FLAGS (LRU_REFS_MASK | BIT(PG_referenced))
#ifdef CONFIG_LRU_GEN
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..0a6f7b9cc987 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -424,19 +424,14 @@ static void folio_inc_refs(struct folio *folio)
return;
}
- if (!folio_test_workingset(folio)) {
- folio_set_workingset(folio);
- return;
- }
-
/* see the comment on MAX_NR_TIERS */
do {
- new_flags = old_flags & LRU_REFS_MASK;
- if (new_flags == LRU_REFS_MASK)
- break;
+ if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
+ folio_set_workingset(folio);
+ return;
+ }
- new_flags += BIT(LRU_REFS_PGOFF);
- new_flags |= old_flags & ~LRU_REFS_MASK;
+ new_flags = old_flags + BIT(LRU_REFS_PGOFF);
} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
}
#else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bba207f41b14..850979a45229 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -828,7 +828,6 @@ static enum folio_references folio_check_references(struct folio *folio,
referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup,
&vm_flags);
- referenced_folio = folio_test_clear_referenced(folio);
/*
* The supposedly reclaimable folio was found to be in a VM_LOCKED vma.
@@ -841,6 +840,16 @@ static enum folio_references folio_check_references(struct folio *folio,
if (referenced_ptes == -1)
return FOLIOREF_KEEP;
+ if (lru_gen_enabled()) {
+ if (!referenced_ptes)
+ return FOLIOREF_RECLAIM;
+
+ lru_gen_activate(folio);
+ return FOLIOREF_ACTIVATE;
+ }
+
+ referenced_folio = folio_test_clear_referenced(folio);
+
if (referenced_ptes) {
/*
* All mapped folios start out with page table
@@ -1046,11 +1055,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (!sc->may_unmap && folio_mapped(folio))
goto keep_locked;
- /* folio_update_gen() tried to promote this page? */
- if (lru_gen_enabled() && !ignore_references &&
- folio_mapped(folio) && folio_test_referenced(folio))
- goto keep_locked;
-
/*
* The number of dirty pages determines if a node is marked
* reclaim_congested. kswapd will stall and start writing
@@ -2557,8 +2561,6 @@ static bool should_clear_pmd_young(void)
* shorthand helpers
******************************************************************************/
-#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset))
-
#define DEFINE_MAX_SEQ(lruvec) \
unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
@@ -3076,16 +3078,18 @@ static int folio_update_gen(struct folio *folio, int gen)
VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
VM_WARN_ON_ONCE(!rcu_read_lock_held());
+ if (!folio_test_referenced(folio) && !folio_test_workingset(folio)) {
+ folio_set_referenced(folio);
+ return -1;
+ }
+
do {
/* lru_gen_del_folio() has isolated this page? */
- if (!(old_flags & LRU_GEN_MASK)) {
- /* for shrink_folio_list() */
- new_flags = old_flags | BIT(PG_referenced);
- continue;
- }
+ if (!(old_flags & LRU_GEN_MASK))
+ return -1;
- new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
- new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
+ new_flags |= ((gen + 1UL) << LRU_GEN_PGOFF) | BIT(PG_workingset);
} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
@@ -3109,7 +3113,7 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
new_gen = (old_gen + 1) % MAX_NR_GENS;
- new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
/* for folio_end_writeback() */
if (reclaiming)
@@ -3284,6 +3288,9 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
if (folio_memcg_rcu(folio) != memcg)
return NULL;
+ if (!folio_test_lru(folio))
+ return NULL;
+
/* file VMAs can contain anon pages from COW */
if (!folio_is_file_lru(folio) && !can_swap)
return NULL;
@@ -4032,11 +4039,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
continue;
}
- old_gen = folio_lru_gen(folio);
- if (old_gen < 0)
- folio_set_referenced(folio);
- else if (old_gen != new_gen)
- folio_activate(folio);
+ if (lru_gen_activate(folio)) {
+ old_gen = folio_lru_gen(folio);
+ if (old_gen >= 0 && old_gen != new_gen)
+ folio_activate(folio);
+ }
}
arch_leave_lazy_mmu_mode();
@@ -4206,7 +4213,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int zone = folio_zonenum(folio);
int delta = folio_nr_pages(folio);
int refs = folio_lru_refs(folio);
- int tier = lru_tier_from_refs(refs);
+ bool workingset = folio_test_workingset(folio);
+ int tier = lru_tier_from_refs(refs, workingset);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
@@ -4237,7 +4245,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* protected */
- if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
+ if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH)) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
gen = folio_inc_gen(lruvec, folio, false);
@@ -4288,11 +4296,10 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
/* see the comment on MAX_NR_TIERS */
if (!folio_test_referenced(folio))
- set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
+ set_mask_bits(&folio->flags, LRU_REFS_MASK, 0);
/* for shrink_folio_list() */
folio_clear_reclaim(folio);
- folio_clear_referenced(folio);
success = lru_gen_del_folio(lruvec, folio, true);
VM_WARN_ON_ONCE_FOLIO(!success, folio);
@@ -4514,20 +4521,15 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
continue;
}
- if (folio_test_reclaim(folio) &&
- (folio_test_dirty(folio) || folio_test_writeback(folio))) {
- /* restore LRU_REFS_FLAGS cleared by isolate_folio() */
- if (folio_test_workingset(folio))
- folio_set_referenced(folio);
+ if (folio_test_active(folio) ||
+ (type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
+ (folio_test_reclaim(folio) &&
+ (folio_test_dirty(folio) || folio_test_writeback(folio))))
continue;
- }
- if (skip_retry || folio_test_active(folio) || folio_test_referenced(folio) ||
- folio_mapped(folio) || folio_test_locked(folio) ||
- folio_test_dirty(folio) || folio_test_writeback(folio)) {
+ if (skip_retry) {
/* don't add rejected folios to the oldest generation */
- set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS,
- BIT(PG_active));
+ folio_set_active(folio);
continue;
}
@@ -4570,9 +4572,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
struct scan_control *sc, bool can_swap, unsigned long *nr_to_scan)
{
int gen, type, zone;
- unsigned long old = 0;
- unsigned long young = 0;
- unsigned long total = 0;
+ unsigned long size = 0;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MIN_SEQ(lruvec);
@@ -4587,50 +4587,21 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
unsigned long seq;
for (seq = min_seq[type]; seq <= max_seq; seq++) {
- unsigned long size = 0;
-
gen = lru_gen_from_seq(seq);
for (zone = 0; zone < MAX_NR_ZONES; zone++)
size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
-
- total += size;
- if (seq == max_seq)
- young += size;
- else if (seq + MIN_NR_GENS == max_seq)
- old += size;
}
}
/* try to scrape all its memory if this memcg was deleted */
if (!mem_cgroup_online(memcg)) {
- *nr_to_scan = total;
+ *nr_to_scan = size;
return false;
}
- *nr_to_scan = total >> sc->priority;
-
- /*
- * The aging tries to be lazy to reduce the overhead, while the eviction
- * stalls when the number of generations reaches MIN_NR_GENS. Hence, the
- * ideal number of generations is MIN_NR_GENS+1.
- */
- if (min_seq[!can_swap] + MIN_NR_GENS < max_seq)
- return false;
-
- /*
- * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1)
- * of the total number of pages for each generation. A reasonable range
- * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
- * aging cares about the upper bound of hot pages, while the eviction
- * cares about the lower bound of cold pages.
- */
- if (young * MIN_NR_GENS > total)
- return true;
- if (old * (MIN_NR_GENS + 2) < total)
- return true;
-
- return false;
+ *nr_to_scan = size >> sc->priority;
+ return min_seq[!can_swap] + MIN_NR_GENS == max_seq;
}
/*
diff --git a/mm/workingset.c b/mm/workingset.c
index 33baad203277..757173b4a06e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -238,7 +238,8 @@ static void *lru_gen_eviction(struct folio *folio)
int type = folio_is_file_lru(folio);
int delta = folio_nr_pages(folio);
int refs = folio_lru_refs(folio);
- int tier = lru_tier_from_refs(refs);
+ bool workingset = folio_test_workingset(folio);
+ int tier = lru_tier_from_refs(refs, workingset);
struct mem_cgroup *memcg = folio_memcg(folio);
struct pglist_data *pgdat = folio_pgdat(folio);
@@ -247,23 +248,23 @@ static void *lru_gen_eviction(struct folio *folio)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[type]);
- token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
+ token = (min_seq << LRU_REFS_WIDTH) | refs;
hist = lru_hist_from_seq(min_seq);
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
- return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
+ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
}
/*
* Tests if the shadow entry is for a folio that was recently evicted.
* Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
*/
-static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
+static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
unsigned long *token, bool *workingset)
{
int memcg_id;
- unsigned long min_seq;
+ unsigned long max_seq;
struct mem_cgroup *memcg;
struct pglist_data *pgdat;
@@ -272,8 +273,9 @@ static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
memcg = mem_cgroup_from_id(memcg_id);
*lruvec = mem_cgroup_lruvec(memcg, pgdat);
- min_seq = READ_ONCE((*lruvec)->lrugen.min_seq[file]);
- return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
+ max_seq = READ_ONCE((*lruvec)->lrugen.max_seq);
+ return abs_diff(max_seq & (EVICTION_MASK >> LRU_REFS_WIDTH),
+ *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
}
static void lru_gen_refault(struct folio *folio, void *shadow)
@@ -289,7 +291,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
rcu_read_lock();
- recent = lru_gen_test_recent(shadow, type, &lruvec, &token, &workingset);
+ recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset);
if (lruvec != folio_lruvec(folio))
goto unlock;
@@ -302,8 +304,8 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
/* see the comment in folio_lru_refs() */
- refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
- tier = lru_tier_from_refs(refs);
+ refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+ tier = lru_tier_from_refs(refs, workingset);
atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
@@ -315,10 +317,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
* 2. For pages accessed multiple times through file descriptors,
* they would have been protected by sort_folio().
*/
- if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
- set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
+ if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH) - 1) {
+ folio_set_workingset(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
- }
+ } else if (refs)
+ set_mask_bits(&folio->flags, LRU_REFS_MASK, refs << LRU_REFS_PGOFF);
unlock:
rcu_read_unlock();
}
@@ -330,7 +333,7 @@ static void *lru_gen_eviction(struct folio *folio)
return NULL;
}
-static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
+static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
unsigned long *token, bool *workingset)
{
return false;
@@ -426,7 +429,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
unsigned long eviction;
if (lru_gen_enabled())
- return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
+ return lru_gen_test_recent(shadow, &eviction_lruvec, &eviction, workingset);
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
eviction <<= bucket_order;