From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CAOUHufbN_56UJBkgA2LjAfbTt9nzPOCHaSeS4P3GHcYst+Y+eg@mail.gmail.com>
 <20220314233812.9011-1-21cnbao@gmail.com> <CAOUHufa9eY44QadfGTzsxa2=hEvqwahXd7Canck5Gt-N6c4UKA@mail.gmail.com>
 <CAGsJ_4zvj5rmz7DkW-kJx+jmUT9G8muLJ9De--NZma9ey0Oavw@mail.gmail.com>
In-Reply-To: <CAGsJ_4zvj5rmz7DkW-kJx+jmUT9G8muLJ9De--NZma9ey0Oavw@mail.gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 15 Mar 2022 23:29:39 +1300
Message-ID: <CAGsJ_4zZc0oFSmBKAN77vm7VstQH=ieaQ0cfyvcMi3OQRrEpSg@mail.gmail.com>
Subject: Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
To: Yu Zhao <yuzhao@google.com>
Cc: Konstantin Kharlamov <Hi-Angel@yandex.ru>, Michael Larabel <Michael@michaellarabel.com>, 
	Andi Kleen <ak@linux.intel.com>, Andrew Morton <akpm@linux-foundation.org>, 
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>, Jens Axboe <axboe@kernel.dk>, 
	Brian Geffon <bgeffon@google.com>, Catalin Marinas <catalin.marinas@arm.com>, 
	Jonathan Corbet <corbet@lwn.net>, Donald Carr <d@chaos-reins.com>, 
	Dave Hansen <dave.hansen@linux.intel.com>, Daniel Byrne <djbyrne@mtu.edu>, 
	Johannes Weiner <hannes@cmpxchg.org>, Hillf Danton <hdanton@sina.com>, 
	Jan Alexander Steffens <heftig@archlinux.org>, =?UTF-8?Q?Holger_Hoffst=C3=A4tte?= <holger@applied-asynchrony.com>, 
	Jesse Barnes <jsbarnes@google.com>, Linux ARM <linux-arm-kernel@lists.infradead.org>, 
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>, linux-kernel <linux-kernel@vger.kernel.org>, 
	Linux-MM <linux-mm@kvack.org>, Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@kernel.org>, 
	Oleksandr Natalenko <oleksandr@natalenko.name>, Kernel Page Reclaim v2 <page-reclaim@google.com>, 
	Rik van Riel <riel@surriel.com>, Mike Rapoport <rppt@kernel.org>, Sofia Trinh <sofia.trinh@edi.works>, 
	Steven Barrett <steven@liquorix.net>, Suleiman Souhlal <suleiman@google.com>, 
	Shuang Zhai <szhai2@cs.rochester.edu>, Linus Torvalds <torvalds@linux-foundation.org>, 
	Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>, Matthew Wilcox <willy@infradead.org>, 
	"the arch/x86 maintainers" <x86@kernel.org>, Huang Ying <ying.huang@intel.com>
Content-Type: text/plain; charset="UTF-8"
List-ID: <linux-mm.kvack.org>

On Tue, Mar 15, 2022 at 10:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 15, 2022 at 6:18 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Mar 14, 2022 at 5:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Mar 15, 2022 at 5:45 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Mon, Mar 14, 2022 at 5:12 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > > > > >
> > > > > > > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > > > > > > second time, it can be promoted
> > > > > > > > > > to active. then in recent years, we have also applied this to anon
> > > > > > > > > > pages while kernel adds
> > > > > > > > > > workingset protection for anon pages. so basically both anon and file
> > > > > > > > > > pages go into the inactive
> > > > > > > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > > > > > > the active list. if we don't access
> > > > > > > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > > > > > > we do have some special fastpath for code section, executable file
> > > > > > > > > > pages are kept on active list
> > > > > > > > > > as long as they are accessed.
> > > > > > > > >
> > > > > > > > > Yes.
> > > > > > > > >
> > > > > > > > > > so all of the above concerns are actually not that correct?
> > > > > > > > >
> > > > > > > > > They are valid concerns but I don't know any popular workloads that
> > > > > > > > > care about them.
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > > here we can get a workload in Kim's patchset while he added workingset
> > > > > > > > protection
> > > > > > > > for anon pages:
> > > > > > > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> > > > > > >
> > > > > > > Thanks. I wouldn't call that a workload because it's not a real
> > > > > > > application. By popular workloads, I mean applications that the
> > > > > > > majority of people actually run on phones, in cloud, etc.
> > > > > > >
> > > > > > > > anon pages used to go to active rather than inactive, but kim's patchset
> > > > > > > > moved to use inactive first. then only after the anon page is accessed
> > > > > > > > second time, it can move to active.
> > > > > > >
> > > > > > > Yes. To clarify, the A-bit doesn't really mean the first or second
> > > > > > > access. It can be many accesses each time it's set.
> > > > > > >
> > > > > > > > "In current implementation, newly created or swap-in anonymous page is
> > > > > > > >
> > > > > > > > started on the active list. Growing the active list results in rebalancing
> > > > > > > > active/inactive list so old pages on the active list are demoted to the
> > > > > > > > inactive list. Hence, hot page on the active list isn't protected at all.
> > > > > > > >
> > > > > > > > Following is an example of this situation.
> > > > > > > >
> > > > > > > > Assume that 50 hot pages on active list and system can contain total
> > > > > > > > 100 pages. Numbers denote the number of pages on active/inactive
> > > > > > > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > > > > > > used-once pages.
> > > > > > > >
> > > > > > > > 1. 50 hot pages on active list
> > > > > > > > 50(h) | 0
> > > > > > > >
> > > > > > > > 2. workload: 50 newly created (used-once) pages
> > > > > > > > 50(uo) | 50(h)
> > > > > > > >
> > > > > > > > 3. workload: another 50 newly created (used-once) pages
> > > > > > > > 50(uo) | 50(uo), swap-out 50(h)
> > > > > > > >
> > > > > > > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > > > > > > >
> > > > > > > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> > > > > > >
> > > > > > > I think the real question is why the 50 hot pages can be moved to the
> > > > > > > inactive list. If they are really hot, the A-bit should protect them.
> > > > > >
> > > > > > This is a good question.
> > > > > >
> > > > > > I guess it  is probably because the current lru is trying to maintain a balance
> > > > > > between the sizes of active and inactive lists. Thus, it can shrink active list
> > > > > > even though pages might be still "hot" but not the recently accessed ones.
> > > > > >
> > > > > > 1. 50 hot pages on active list
> > > > > > 50(h) | 0
> > > > > >
> > > > > > 2. workload: 50 newly created (used-once) pages
> > > > > > 50(uo) | 50(h)
> > > > > >
> > > > > > 3. workload: another 50 newly created (used-once) pages
> > > > > > 50(uo) | 50(uo), swap-out 50(h)
> > > > > >
> > > > > > the old kernel without anon workingset protection put workload 2 on active, so
> > > > > > pushed 50 hot pages from active to inactive. workload 3 would further contribute
> > > > > > to evict the 50 hot pages.
> > > > > >
> > > > > > it seems mglru doesn't demote pages from the youngest generation to older
> > > > > > generation only in order to balance the list size? so mglru is probably safe
> > > > > > in these cases.
> > > > > >
> > > > > > I will run some tests mentioned in Kim's patchset and report the result to you
> > > > > > afterwards.
> > > > > >
> > > > >
> > > > > Hi Yu,
> > > > > I did find that putting faulted pages into the youngest generation leads
> > > > > to some regression in the ebizzy case that Kim's patchset mentioned when
> > > > > he added workingset protection for anon pages.
> > > > > I made a small modification to rand_chunk() which is probably similar
> > > > > to the modification Kim mentioned in his patchset. The modification
> > > > > can be found here:
> > > > > https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef
> > > > >
> > > > > The test env is a x86 machine in which I have set memory size to 2.5GB and
> > > > > set zRAM to 2GB and disabled external disk swap.
> > > > >
> > > > > with the vanilla kernel:
> > > > > \time -v ./a.out -vv -t 4 -s 209715200 -S 200000
> > > > >
> > > > > so we have 10 chunks and 4 threads, each chunk is 209715200 bytes (200MB)
> > > > >
> > > > > typical result:
> > > > >         Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > > > >         User time (seconds): 36.19
> > > > >         System time (seconds): 229.72
> > > > >         Percent of CPU this job got: 371%
> > > > >         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
> > > > >         Average shared text size (kbytes): 0
> > > > >         Average unshared data size (kbytes): 0
> > > > >         Average stack size (kbytes): 0
> > > > >         Average total size (kbytes): 0
> > > > >         Maximum resident set size (kbytes): 2166196
> > > > >         Average resident set size (kbytes): 0
> > > > >         Major (requiring I/O) page faults: 9990128
> > > > >         Minor (reclaiming a frame) page faults: 33315945
> > > > >         Voluntary context switches: 59144
> > > > >         Involuntary context switches: 167754
> > > > >         Swaps: 0
> > > > >         File system inputs: 2760
> > > > >         File system outputs: 8
> > > > >         Socket messages sent: 0
> > > > >         Socket messages received: 0
> > > > >         Signals delivered: 0
> > > > >         Page size (bytes): 4096
> > > > >         Exit status: 0
> > > > >
> > > > > with gen_lru and lru_gen/enabled=0x3:
> > > > > typical result:
> > > > > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > > > > User time (seconds): 36.34
> > > > > System time (seconds): 276.07
> > > > > Percent of CPU this job got: 378%
> > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
> > > > >            **** 15% time +
> > > > > Average shared text size (kbytes): 0
> > > > > Average unshared data size (kbytes): 0
> > > > > Average stack size (kbytes): 0
> > > > > Average total size (kbytes): 0
> > > > > Maximum resident set size (kbytes): 2168120
> > > > > Average resident set size (kbytes): 0
> > > > > Major (requiring I/O) page faults: 13362810
> > > > >              ***** 30% page fault +
> > > > > Minor (reclaiming a frame) page faults: 33394617
> > > > > Voluntary context switches: 55216
> > > > > Involuntary context switches: 137220
> > > > > Swaps: 0
> > > > > File system inputs: 4088
> > > > > File system outputs: 8
> > > > > Socket messages sent: 0
> > > > > Socket messages received: 0
> > > > > Signals delivered: 0
> > > > > Page size (bytes): 4096
> > > > > Exit status: 0
> > > > >
> > > > > with gen_lru and lru_gen/enabled=0x7:
> > > > > typical result:
> > > > > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > > > > User time (seconds): 36.13
> > > > > System time (seconds): 251.71
> > > > > Percent of CPU this job got: 378%
> > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
> > > > >          *****better than enabled=0x3, worse than vanilla
> > > > > Average shared text size (kbytes): 0
> > > > > Average unshared data size (kbytes): 0
> > > > > Average stack size (kbytes): 0
> > > > > Average total size (kbytes): 0
> > > > > Maximum resident set size (kbytes): 2120988
> > > > > Average resident set size (kbytes): 0
> > > > > Major (requiring I/O) page faults: 12706512
> > > > > Minor (reclaiming a frame) page faults: 33422243
> > > > > Voluntary context switches: 49485
> > > > > Involuntary context switches: 126765
> > > > > Swaps: 0
> > > > > File system inputs: 2976
> > > > > File system outputs: 8
> > > > > Socket messages sent: 0
> > > > > Socket messages received: 0
> > > > > Signals delivered: 0
> > > > > Page size (bytes): 4096
> > > > > Exit status: 0
> > > > >
> > > > > I can also reproduce the problem on arm64.
> > > > >
> > > > > I am not saying this is going to block mglru from being mainlined. But  I am
> > > > > still curious if this is an issue worth being addressed somehow in mglru.
> > > >
> > > > You've missed something very important: *throughput* :)
> > > >
> > >
> > > noop :-)
> > > in the test case, there are 4 threads. they are searching a key in 10 chunks
> > > of memory. for each chunk, the size is 200MB.
> > > a "random" chunk index is returned for those threads to search. but chunk2
> > > is the hottest, and chunk3, 7, 4 are relatively hotter than others.
> > > static inline unsigned int rand_chunk(void)
> > > {
> > >         /* simulate hot and cold chunk */
> > >         unsigned int rand[16] = {2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4};
> >
> > This is sequential access, not what you claim above, because you have
> > a repeating sequence.
> >
> > In this case MGLRU is expected to be slower because it doesn't try to
> > optimize it, as discussed before [1]. The reason is, with a manageable
> > complexity, we can only optimize so many things. And MGLRU chose to
> > optimize (arguably) popular workloads, since, AFAIK, no real-world
> > applications stream anon memory.
> >
> > To verify this is indeed sequential access, you could make rand[]
> > larger, e.g., 160, with the same portions of 2s, 3s, 4s, etc, but
> > their positions are random. The following change shows MGLRU is ~20%
> > faster on my Snapdragon 7c + 2.5G DRAM + 2GB zram.
> >
> >  static inline unsigned int rand_chunk(void)
> >  {
> >         /* simulate hot and cold chunk */
> > -       unsigned int rand[16] = {2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4};
> > +       unsigned int rand[160] = {
> > +               2, 4, 7, 3, 4, 2, 7, 2, 7, 8, 6, 9, 7, 6, 5, 4,
> > +               6, 2, 6, 4, 2, 9, 2, 5, 5, 4, 7, 2, 7, 7, 5, 2,
> > +               4, 4, 3, 3, 2, 4, 2, 2, 5, 2, 4, 2, 8, 2, 2, 3,
> > +               2, 2, 2, 2, 2, 8, 4, 2, 2, 4, 2, 2, 2, 2, 3, 2,
> > +               8, 5, 2, 2, 3, 2, 8, 2, 6, 2, 4, 8, 5, 2, 9, 2,
> > +               8, 7, 9, 2, 4, 4, 3, 3, 2, 8, 2, 2, 3, 3, 2, 7,
> > +               7, 5, 2, 2, 8, 2, 2, 2, 5, 2, 4, 3, 2, 3, 6, 3,
> > +               3, 3, 9, 4, 2, 3, 9, 7, 7, 6, 2, 2, 4, 2, 6, 2,
> > +               9, 7, 7, 7, 9, 3, 4, 2, 3, 2, 7, 3, 2, 2, 2, 6,
> > +               8, 3, 7, 6, 2, 2, 2, 4, 7, 2, 5, 7, 4, 7, 9, 9,
> > +       };
> >         static int nr = 0;
> > -       return rand[nr++%16];
> > +       return rand[nr++%160];
> >  }
> >
> > Yet better, you could use some standard benchmark suites, written by
> > reputable organizations, e.g., memtier, YCSB, to generate more
> > realistic distributions, as I've suggested before [2].
> >
> > >         static int nr = 0;
> > >         return rand[nr++%16];
> > > }
> > >
> > > each thread does search_mem():
> > > static unsigned int search_mem(void)
> > > {
> > >         record_t key, *found;
> > >         record_t *src, *copy;
> > >         unsigned int chunk;
> > >         size_t copy_size = chunk_size;
> > >         unsigned int i;
> > >         unsigned int state = 0;
> > >
> > >         /* run 160 loops or till timeout */
> > >         for (i = 0; threads_go == 1 && i < 160; i++) {
> >
> > I see you've modified the original benchmark. But with "-S 200000",
> > should this test finish within an hour instead of the following?
> >     Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
> >
> > >                 chunk = rand_chunk();
> > >                 src = mem[chunk];
> > >                 ...
> > >                 copy = alloc_mem(copy_size);
> > >                 ...
> > >                 memcpy(copy, src, copy_size);
> > >
> > >                 key = rand_num(copy_size / record_size, &state);
> > >
> > >                 bsearch(&key, copy, copy_size / record_size,
> > >                         record_size, compare);
> > >
> > >                         /* Below check is mainly for memory corruption or other bug */
> > >                         if (found == NULL) {
> > >                                 fprintf(stderr, "Couldn't find key %zd\n", key);
> > >                                 exit(1);
> > >                         }
> > >                 }               /* end if ! touch_pages */
> > >
> > >                 free_mem(copy, copy_size);
> > >         }
> > >
> > >         return (i);
> > > }
> > >
> > > each thread picks up a chunk, then allocates a new memory and copies the chunk to the
> > > new allocated memory, and searches a key in the allocated memory.
> > >
> > > as i have set time to rather big by -S, so each thread actually exits while it
> > > completes 160 loops.
> > > $ \time -v ./ebizzy -t 4 -s $((200*1024*1024)) -S 6000000
> >
> > Ok, you actually used "-S 6000000".
>
> I have two exits, either 160 loops have been done or -S gets timeout.
> Since -S is very big, the process exits from the completion of 160
> loops.
>
> I am seeing mglru is getting very similar speed with vanilla lru by
> using your rand_chunk() with 160 entries. the command is like:
> \time -v ./a.out -t 4 -s $((200*1024*1024)) -S 600000 -m
>
> The time to complete jobs begins to be more random, but on average,
> mglru seems to be 5% faster. actually, i am seeing mglru can be faster
> than vanilla even with more page faults. for example,
>
> MGLRU:
>         Command being timed: "./mt.out -t 4 -s 209715200 -S 600000 -m"
>         User time (seconds): 32.68
>         System time (seconds): 227.19
>         Percent of CPU this job got: 370%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:10.23
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 2175292
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 10977244
>         Minor (reclaiming a frame) page faults: 33447638
>         Voluntary context switches: 44466
>         Involuntary context switches: 108413
>         Swaps: 0
>         File system inputs: 7704
>         File system outputs: 8
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
>
>
> VANILLA:
>         Command being timed: "./mt.out -t 4 -s 209715200 -S 600000 -m"
>         User time (seconds): 32.20
>         System time (seconds): 248.18
>         Percent of CPU this job got: 371%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15.55
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 2174384
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 10002206
>         Minor (reclaiming a frame) page faults: 33392151
>         Voluntary context switches: 76966
>         Involuntary context switches: 184841
>         Swaps: 0
>         File system inputs: 2032
>         File system outputs: 8
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
>

Here is a basic perf comparison:
vanilla:
    23.81%  [lz4_compress]  [k] LZ4_compress_fast_extState
    14.15%  [kernel]        [k] LZ4_decompress_safe
    10.48%  libc-2.33.so    [.] __memmove_avx_unaligned_erms
     2.49%  [kernel]        [k] native_queued_spin_lock_slowpath
     2.05%  [kernel]        [k] clear_page_erms
     1.69%  [kernel]        [k] native_irq_return_iret
     1.49%  [kernel]        [k] mem_cgroup_css_rstat_flush
     1.05%  [kernel]        [k] _raw_spin_lock
     1.05%  [kernel]        [k] sync_regs
     1.00%  [kernel]        [k] smp_call_function_many_cond
     0.97%  [kernel]        [k] memset_erms
     0.95%  [zram]          [k] zram_bvec_rw.constprop.0
     0.91%  [kernel]        [k] down_read_trylock
     0.90%  [kernel]        [k] memcpy_erms
     0.89%  [zram]          [k] __zram_bvec_read.constprop.0
     0.88%  [kernel]        [k] psi_group_change
     0.84%  [kernel]        [k] isolate_lru_pages
     0.78%  [kernel]        [k] zs_map_object
     0.76%  [kernel]        [k] __handle_mm_fault
     0.72%  [kernel]        [k] page_vma_mapped_walk

mglru:
    23.43%  [lz4_compress]  [k] LZ4_compress_fast_extState
    16.90%  [kernel]        [k] LZ4_decompress_safe
    12.60%  libc-2.33.so    [.] __memmove_avx_unaligned_erms
     2.26%  [kernel]        [k] clear_page_erms
     2.06%  [kernel]        [k] native_queued_spin_lock_slowpath
     1.77%  [kernel]        [k] native_irq_return_iret
     1.18%  [kernel]        [k] sync_regs
     1.12%  [zram]          [k] __zram_bvec_read.constprop.0
     0.98%  [kernel]        [k] psi_group_change
     0.97%  [zram]          [k] zram_bvec_rw.constprop.0
     0.96%  [kernel]        [k] memset_erms
     0.95%  [kernel]        [k] isolate_folios
     0.92%  [kernel]        [k] zs_map_object
     0.92%  [kernel]        [k] _raw_spin_lock
     0.87%  [kernel]        [k] memcpy_erms
     0.83%  [kernel]        [k] smp_call_function_many_cond
     0.83%  [kernel]        [k] __handle_mm_fault
     0.78%  [kernel]        [k] unmap_page_range
     0.71%  [kernel]        [k] rmqueue_bulk
     0.70%  [kernel]        [k] page_counter_uncharge

The vanilla kernel seems to spend more time in
native_queued_spin_lock_slowpath(), down_read_trylock(),
mem_cgroup_css_rstat_flush(), isolate_lru_pages() and
page_vma_mapped_walk(), while mglru spends more time in decompression,
memmove and isolate_folios().

That is probably why mglru can be a bit faster even with more major page
faults.
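
To put numbers on that for the single pair of runs quoted above, here is
a minimal sketch that computes the relative change of the mglru figures
against vanilla. pct_change() and format_delta() are hypothetical helper
names made up for this illustration; they are not part of ebizzy or the
kernel.

```c
#include <stdio.h>

/* Relative change of the mglru value against the vanilla value, in percent. */
static double pct_change(double vanilla, double mglru)
{
        return (mglru - vanilla) / vanilla * 100.0;
}

/*
 * Format one comparison line, e.g. "elapsed: -7.0%", into buf.
 * Hypothetical helper for illustration only; returns the snprintf() result.
 */
static int format_delta(char *buf, size_t len, const char *name,
                        double vanilla, double mglru)
{
        return snprintf(buf, len, "%s: %+.1f%%", name,
                        pct_change(vanilla, mglru));
}
```

With the values from the two \time -v outputs above, pct_change(75.55, 70.23)
is about -7.0 (elapsed time), pct_change(248.18, 227.19) is about -8.5
(system time) and pct_change(10002206, 10977244) is about +9.7 (major
faults). This covers one sample pair only, not an average over runs.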

>
> I guess the main cause of the regression for the previous sequence
> with 16 entries is that the ebizzy has a new allocated copy in
> search_mem(), which is mapped and used only once in each loop.
> and the temp copy can push out those hot chunks.
>
> Anyway, I understand it is a trade-off between warmly embracing new
> pages and holding old pages tightly. Real user cases from phone, server,
> desktop will be judging this better.
>
> >
> > [1] https://lore.kernel.org/linux-mm/YhNJ4LVWpmZgLh4I@google.com/
> > [2] https://lore.kernel.org/linux-mm/YgggI+vvtNvh3jBY@google.com/
>
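
For reference, the hot/cold split of the original 16-entry sequence can
be checked with a small histogram. This is only a sketch: NR_CHUNKS,
seq16 and chunk_histogram() are names I made up for the illustration,
and only the array contents come from the modified rand_chunk().

```c
#include <stddef.h>

#define NR_CHUNKS 10    /* the test allocates 10 chunks of 200MB each */

/* The 16-entry sequence from the modified rand_chunk(). */
static const unsigned int seq16[16] = {
        2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4
};

/* Count how often each chunk index appears in one pass over seq[]. */
static void chunk_histogram(const unsigned int *seq, size_t len,
                            unsigned int hist[NR_CHUNKS])
{
        size_t i;

        for (i = 0; i < NR_CHUNKS; i++)
                hist[i] = 0;

        for (i = 0; i < len; i++) {
                /* ignore any index outside the chunk array */
                if (seq[i] < NR_CHUNKS)
                        hist[seq[i]]++;
        }
}
```

Running chunk_histogram(seq16, 16, hist) gives 6 picks for chunk 2, 2
picks each for chunks 3, 4 and 7, 1 pick each for chunks 5, 6, 8 and 9,
and none for chunks 0 and 1.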

Thanks
Barry