From: Yuanchu Xie <yuanchu@google.com>
To: Henry Huang <henry.hj@antgroup.com>
Cc: yuzhao@google.com, akpm@linux-foundation.org,
谈鉴锋 <henry.tjf@antgroup.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
"朱辉(茶水)" <teawater@antgroup.com>
Subject: Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
Date: Thu, 21 Dec 2023 15:15:54 -0800
Message-ID: <CAJj2-QGqDWGVHEwU+=8+ywEAQtK9QKGZCOhkyEgp8LEWbXDggQ@mail.gmail.com>
In-Reply-To: <20231215105324.41241-1-henry.hj@antgroup.com>

Hi Henry, I have a question on memcg charging for the shared pages.
Replied inline.

On Fri, Dec 15, 2023 at 2:53 AM Henry Huang <henry.hj@antgroup.com> wrote:
>
> On Fri, Dec 15, 2023 at 14:46 Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > Thanks for replying to this RFC.
> > >
> > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google
> > > > proposed the idea [1], but we don't really use it anymore because of
> > > > its poor scalability.
> > >
> > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check
> > > whether pages were accessed during a period of time.
> >
> > Is it a production environment? If so, what's your
> > 1. scan interval
> > 2. memory size
>
> > I'm trying to understand why scalability isn't a problem for you. On
> > an average server, there are hundreds of millions of PFNs, so it'd be
> > very expensive to use that ABI even for a time interval of minutes.
>
> Thanks for replying.
>
> Our scan interval is 10 minutes and the total memory size is 512GB.
> We prefer to reclaim pages whose idle age is at least 1 hour.
>
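
(For context for other readers: below is a minimal, untested userspace sketch of
what one such scan pass over /sys/kernel/mm/page_idle/bitmap could look like. It
is not Henry's actual scanner; the PFN range, page size and interval are example
values. With 4KiB pages, 512GB is ~128M PFNs, i.e. a ~16MB bitmap per pass.)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BITMAP_PATH "/sys/kernel/mm/page_idle/bitmap"

/* Mark nr_words * 64 PFNs idle, starting at a 64-aligned first_pfn. */
static int mark_idle(int fd, uint64_t first_pfn, size_t nr_words)
{
        uint64_t ones = ~0ULL;
        off_t off = first_pfn / 64 * sizeof(uint64_t);
        size_t i;

        for (i = 0; i < nr_words; i++)
                if (pwrite(fd, &ones, sizeof(ones), off + i * 8) != 8)
                        return -1;
        return 0;
}

/* Report PFNs whose idle bit was cleared, i.e. pages touched since mark_idle(). */
static void report_accessed(int fd, uint64_t first_pfn, size_t nr_words)
{
        off_t off = first_pfn / 64 * sizeof(uint64_t);
        size_t i;
        int b;

        for (i = 0; i < nr_words; i++) {
                uint64_t word;

                if (pread(fd, &word, sizeof(word), off + i * 8) != 8)
                        break;
                for (b = 0; b < 64; b++)
                        if (!(word & (1ULL << b)))
                                printf("pfn %llu accessed\n",
                                       (unsigned long long)(first_pfn + i * 64 + b));
        }
}

int main(void)
{
        int fd = open(BITMAP_PATH, O_RDWR);

        if (fd < 0) {
                perror(BITMAP_PATH);
                return 1;
        }
        mark_idle(fd, 0, 1024);         /* first 64Ki PFNs, as an example */
        sleep(600);                     /* one 10-minute scan interval */
        report_accessed(fd, 0, 1024);
        close(fd);
        return 0;
}
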
> > > We manage all pages' idle times in userspace, then use a prediction
> > > algorithm to select pages to reclaim. These pages are more likely to
> > > stay idle for a long time.
>
> > "There is a system in place now that is based on a user-space process
> > that reads a bitmap stored in sysfs, but it has a high CPU and memory
> > overhead, so a new approach is being tried."
> > https://lwn.net/Articles/787611/
> >
> > Could you elaborate how you solved this problem?
>
> In our environment, we found that we use on average 0.4 cores and 300MB of
> memory for scanning, basic analysis, and reclaiming of idle pages.
>
> To reduce CPU & memory usage, we do the following:
> 1. We implement a ratelimiter to control the rate of scanning and reclaim.
> 2. All page info & idle ages are stored in a local DB file. Our prediction
> algorithm doesn't need all page info in memory at the same time.
>
> In our environment, about 1/3 of memory is allocated as THP where possible,
> which may save some CPU usage during scanning.
>
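
(Aside: the ratelimiter itself isn't described in detail here, so the following
is only a generic token-bucket sketch of that idea -- the structure and rates
are my own guesses, not the actual implementation.)

#include <time.h>

struct ratelimiter {
        double tokens;          /* currently available work units */
        double rate;            /* refill rate, work units per second */
        double burst;           /* bucket capacity */
        struct timespec last;
};

static void rl_init(struct ratelimiter *rl, double rate, double burst)
{
        rl->tokens = burst;
        rl->rate = rate;
        rl->burst = burst;
        clock_gettime(CLOCK_MONOTONIC, &rl->last);
}

/* Returns 1 if one unit of scan/reclaim work may proceed now, 0 to back off. */
static int rl_try_acquire(struct ratelimiter *rl)
{
        struct timespec now;
        double elapsed;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed = (now.tv_sec - rl->last.tv_sec) +
                  (now.tv_nsec - rl->last.tv_nsec) / 1e9;
        rl->last = now;

        rl->tokens += elapsed * rl->rate;
        if (rl->tokens > rl->burst)
                rl->tokens = rl->burst;
        if (rl->tokens < 1.0)
                return 0;
        rl->tokens -= 1.0;
        return 1;
}
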
> > > We only need the kernel to tell us whether a page was accessed; a boolean
> > > value in the kernel is enough for our case.
> >
> > How do you define "accessed"? I.e., through page tables or file
> > descriptors or both?
>
> both
>
> > > > 2. PG_idle/young, being a boolean value, has poor granularity. If
> > > > anyone must use page_idle/bitmap for some specific reason, I'd
> > > > recommend exporting generation numbers instead.
> > >
> > > Yes, at first we tried using the multi-gen LRU proactive scan and
> > > exporting the generation & refs numbers to do the same thing.
> > >
> > > But there are several problems:
> > >
> > > 1. Multi-gen LRU only cares about its own memcg's pages. In our environment,
> > > it's common to see processes in different memcgs sharing pages.
> >
> > This is related to my question above: are those pages mapped into
> > different memcgs or not?
>
> Here is a case:
> There are two cgroups, A and B (B is a child cgroup of A).
> A process in A creates a file and uses mmap to read/write this file.
> A process in B mmaps this file and usually reads it.
How does the shared memory get charged to the cgroups?
Does it all go to cgroup A or B exclusively, or do some pages get
charged to each one?
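
To make sure I understand the pattern, is it roughly the sketch below? (The
file path and size are made up and error handling is omitted; this is just to
make the scenario concrete.)

#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_FILE "/data/shared.bin"  /* hypothetical path */
#define FILE_SIZE   (1UL << 30)         /* 1GiB, made up */

/* Run by a process in cgroup A: creates the file and writes it through mmap. */
static void writer(void)
{
        int fd = open(SHARED_FILE, O_RDWR | O_CREAT, 0644);
        char *p;

        ftruncate(fd, FILE_SIZE);
        p = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(p, 0xab, FILE_SIZE);     /* faults the file pages in */
        munmap(p, FILE_SIZE);
        close(fd);
}

/* Run by a process in cgroup B: maps the same file and mostly just reads it. */
static void reader(void)
{
        int fd = open(SHARED_FILE, O_RDONLY);
        char *p = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        volatile char sum = 0;
        size_t i;

        for (i = 0; i < FILE_SIZE; i += 4096)
                sum += p[i];            /* touches pages already in the page cache */
        (void)sum;
        munmap(p, FILE_SIZE);
        close(fd);
}

int main(int argc, char **argv)
{
        if (argc > 1 && !strcmp(argv[1], "reader"))
                reader();
        else
                writer();
        return 0;
}
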
>
> > > We still have no idea how to solve this problem.
> > >
> > > 2. We set swappiness to 0, and use the proactive scan to select cold pages
> > > & proactive reclaim to swap anon pages. But we can't control the passive
> > > scan (can_swap = false), which can cause a cold/hot inversion of anon pages
> > > in inc_min_seq.
> >
> > There is an option to prevent the inversion; IIUC, the force_scan
> > option is what you are looking for.
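
(For other readers: going from my reading of
Documentation/admin-guide/mm/multigen_lru.rst -- please double-check the exact
syntax -- force_scan is the optional last argument of the debugfs aging command,
which can be driven from a tiny helper like the one below. The memcg id, node id
and max_gen are placeholders.)

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/kernel/debug/lru_gen", "w");

        if (!f) {
                perror("lru_gen");
                return 1;
        }
        /* "+ memcg_id node_id max_gen [can_swap [force_scan]]" */
        fprintf(f, "+ 3 0 26 1 1\n");   /* placeholder ids; force_scan=1 */
        return fclose(f) ? 1 : 0;
}
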
>
> It seems that doesn't work now.
>
> static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
> {
>         ......
>         for (type = ANON_AND_FILE - 1; type >= 0; type--) {
>                 if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
>                         continue;
>
>                 VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
>
>                 if (inc_min_seq(lruvec, type, can_swap))
>                         continue;
>
>                 spin_unlock_irq(&lruvec->lru_lock);
>                 cond_resched();
>                 goto restart;
>         }
>         ......
> }
>
> force_scan is not a parameter of inc_min_seq.
> In our environment, swappiness is 0, so can_swap would be false.
>
> static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> {
>         int zone;
>         int remaining = MAX_LRU_BATCH;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
>         if (type == LRU_GEN_ANON && !can_swap)
>                 goto done;
>         ......
> }
>
> If can_swap is false, it skips the anon LRU list.
>
> What's more, in the passive scan, force_scan is also false.
>
> static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
> {
>         ......
>         /* skip this lruvec as it's low on cold folios */
>         return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
> }
>
> Is it a good idea to add a global parameter no_inversion and modify inc_min_seq
> like this:
>
> static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> {
>         int zone;
>         int remaining = MAX_LRU_BATCH;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
> -       if (type == LRU_GEN_ANON && !can_swap)
> +       if (type == LRU_GEN_ANON && !can_swap && !no_inversion)
>                 goto done;
>         ......
> }
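
(Purely hypothetical sketch, not an existing interface: if a global no_inversion
knob were added, one way to wire it up could be a read-mostly flag plus a sysfs
attribute along the lines below.)

#include <linux/cache.h>
#include <linux/errno.h>
#include <linux/kobject.h>
#include <linux/kstrtox.h>
#include <linux/sysfs.h>
#include <linux/types.h>

/* Hypothetical: not in any tree. */
static bool no_inversion __read_mostly;

static ssize_t no_inversion_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
{
        return sysfs_emit(buf, "%d\n", no_inversion);
}

static ssize_t no_inversion_store(struct kobject *kobj,
                                  struct kobj_attribute *attr,
                                  const char *buf, size_t len)
{
        bool val;

        if (kstrtobool(buf, &val))
                return -EINVAL;
        no_inversion = val;
        return len;
}

static struct kobj_attribute lru_gen_no_inversion_attr =
        __ATTR(no_inversion, 0644, no_inversion_show, no_inversion_store);
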
>
> --
> 2.43.0
>
>