linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Wei Xu <weixugc@google.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	 Shakeel Butt <shakeelb@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	 Huang Ying <ying.huang@intel.com>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	 Yosry Ahmed <yosryahmed@google.com>,
	fvdl@google.com, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
Date: Thu, 8 Dec 2022 16:59:36 -0800	[thread overview]
Message-ID: <CAAPL-u_bEhCCCnepmCuNe7q7qJY3G3wckGvG-QsF2SYpAVXhEA@mail.gmail.com> (raw)
In-Reply-To: <Y5HQgpRvPQWteNvz@dhcp22.suse.cz>

On Thu, Dec 8, 2022 at 3:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 08-12-22 01:00:40, Mina Almasry wrote:
> > On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> > > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > > > > but I do not expect to make a big difference in practice
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 026199c047e0..1b7f2d8cb128 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > > >         LIST_HEAD(ret_folios);
> > > > >         LIST_HEAD(free_folios);
> > > > >         LIST_HEAD(demote_folios);
> > > > > -       unsigned int nr_reclaimed = 0;
> > > > > +       unsigned int nr_reclaimed = 0, nr_demoted = 0;
> > > > >         unsigned int pgactivate = 0;
> > > > >         bool do_demote_pass;
> > > > >         struct swap_iocb *plug = NULL;
> > > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > > >         }
> > > > >         /* 'folio_list' is always empty here */
> > > > >
> > > > > -       /* Migrate folios selected for demotion */
> > > > > -       nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > > > > +       /*
> > > > > +        * Migrate folios selected for demotion.
> > > > > +        * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > > > > +        * because no charges are really freed during the migration. Global
> > > > > +        * reclaim aims at releasing memory from nodes/zones so consider
> > > > > +        * demotion to reclaim memory.
> > > > > +        */
> > > > > +       nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > > > > +       if (!cgroup_reclaim(sc))
> > > > > +               nr_reclaimed += nr_demoted;
> > > > > +
> > > > >         /* Folios that could not be demoted are still in @demote_folios */
> > > > >         if (!list_empty(&demote_folios)) {
> > > > >                 /* Folios which weren't demoted go back on @folio_list for retry: */
> > > > >
> > > > > [...]
> > > >
> > > > Thank you again, but this patch breaks the memory.reclaim nodes arg
> > > > for me. This is my test case. I run it on a machine with 2 memory
> > > > tiers.
> > > >
> > > > Memory tier 1= nodes 0-2
> > > > Memory tier 2= node 3
> > > >
> > > >     mkdir -p /sys/fs/cgroup/unified/test
> > > >     cd /sys/fs/cgroup/unified/test
> > > >     echo $$ > cgroup.procs
> > > >     head -c 500m /dev/random > /tmp/testfile
> > > >     echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> > > >     echo "1m nodes=0-2" > memory.reclaim
> > > >
> > > > In my opinion the expected behavior is for the kernel to demote 1mb of
> > > > memory from nodes 0-2 to node 3.
> > > >
> > > > Actual behavior on the tip of mm-unstable is as expected.
> > > >
> > > > Actual behavior with your patch cherry-picked to mm-unstable is that
> > > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> > > > returns -EAGAIN to the user. This may be the correct behavior you're
> > > > intending, but it completely breaks the use case I implemented the
> > > > nodes= arg for and listed on the commit message of that change.
> > >
> > > Yes, strictly speaking the behavior is correct albeit unexpected. You
> > > have told the kernel to _reclaim_ that much memory but demotion are
> > > simply aging handling rather than a reclaim if the demotion target has a
> > > lot of memory free.
> >
> > Yes, by the strict definition of reclaim, you're completely correct.
> > But in reality earlier I proposed a patch to the kernel that disables
> > demotion in proactive reclaim. That would have been a correct change
> > by the strict definition of reclaim, but Johannes informed me that
> > meta already has a dependency on proactive reclaim triggering demotion
> > and directed me to add a nodes= arg to memory.reclaim to trigger
> > demotion instead, to satisfy both use cases. Seems both us and meta
> > are using this interface to trigger both reclaim and demotion, despite
> > the strict definition of the word?
>
> Well, demotion is a part of aging and that is a part of the reclaim so I
> believe we want both and demotion mostly an implementation detail. If
> you want to have a very precise control then the nodemask should drive
> you there.
>
> [...]
> > > I am worried this will popping up again and again. I thought your nodes
> > > subset approach could deal with this but I have overlooked one important
> > > thing in your patch. The user provided nodemask controls where to
> > > reclaim from but it doesn't constrain demotion targets. Is this
> > > intentional? Would it actually make more sense to control demotion by
> > > addint demotion nodes into the nodemask?
> > >
> >
> > IMO, yes it is intentional, and no I would not recommend adding
> > demotion nodes (I assume you mean adding both demote_from_nodes and
> > demote_to_nodes as arg).
>
> What I really mean is to add demotion nodes to the nodemask along with
> the set of nodes you want to reclaim from. To me that sounds like a
> more natural interface allowing for all sorts of usecases:
> - free up demotion targets (only specify demotion nodes in the mask)
> - control where to demote (e.g. select specific demotion target(s))
> - do not demote at all (skip demotion nodes from the node mask)

For clarification, do you mean to add another argument (e.g.
demotion_nodes) in addition to the "nodes" argument?   Also, which
meaning do these arguments have?

- Option 1: The "nodes" argument specifies the source nodes where
pages should be reclaimed (but not demoted) from and the
"demotion_nodes" argument specifies the source nodes where pages
should be demoted (but not reclaimed) from.

- Option 2: The "nodes" argument specifies the source nodes where
pages should be reclaimed or demoted from and the "demotion_nodes"
argument specifies the allowed demotion target nodes.

Option 1 is more flexible in allowing various use cases. Option 2 can
also support the use cases to reclaim only without demotion and
demotion with fallback to reclaim, but cannot support demotion only
without reclaim.

> > My opinion is based on 2 reasons:
> >
> > 1. We control proactive reclaim by looking for nodes/tiers approaching
> > pressure and triggering reclaim/demotion from said nodes/tiers. So we
> > know the node/tier we would like to reclaim from, but not necessarily
> > have a requirement on where the memory should go. I think it should be
> > up to the kernel.
> > 2. Currently I think most tiered machines will have 2 memory tiers,
> > but I think the code is designed to support N memory tiers. What
> > happens if N=3 and the user asks you to demote from the top tier nodes
> > to the bottom tier nodes (skipping the middle tier)? The user in this
> > case is explicitly asking to break the aging pipeline. From my short
> > while on the mailing list I see great interest in respecting the aging
> > pipeline, so it seems to me the demotion target should be decided by
> > the kernel based on the aging pipeline, and the user should not be
> > able to override it? I don't know. Maybe there is a valid use case for
> > that somewhere.
>
> I agree that the agining should be preserved as much as possible unless
> there is an explicit requirement to do otherwise which might be
> something application specific.
>
> It is really hard to assume all the usecases at this stage but we should
> keep in mind that the semantic of the interface will get cast into stone
> once it is released. As of now I do see a great confusion point in the
> nodemask semantic which pretends to allow some fine control while it is
> largerly hard to predict because it makes some assumptions about the
> reclaim while it has a very limited control of the demotion.
>
> --
> Michal Hocko
> SUSE Labs


  reply	other threads:[~2022-12-09  0:59 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-06  2:34 Mina Almasry
2022-12-06  3:13 ` Huang, Ying
2022-12-06  4:15   ` Mina Almasry
2022-12-06  5:22     ` Huang, Ying
2022-12-06 12:20 ` Michal Hocko
2022-12-06 16:06   ` Mina Almasry
2022-12-06 19:55     ` Michal Hocko
2022-12-07  1:22       ` Huang, Ying
2022-12-07  1:55       ` Mina Almasry
2022-12-07 11:12         ` Michal Hocko
2022-12-07 21:43           ` Mina Almasry
2022-12-08  8:09             ` Michal Hocko
2022-12-08  9:00               ` Mina Almasry
2022-12-08 11:54                 ` Michal Hocko
2022-12-09  0:59                   ` Wei Xu [this message]
2022-12-09  8:08                     ` Michal Hocko
2022-12-09 16:41                       ` Wei Xu
2022-12-09 21:16                         ` Michal Hocko
2022-12-09 21:39                           ` Mina Almasry
2022-12-12  8:33                             ` Michal Hocko
2022-12-10  8:01                           ` Wei Xu
2022-12-12  8:36                             ` Michal Hocko
2022-12-06 15:15 ` kernel test robot
2022-12-06 18:17 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAAPL-u_bEhCCCnepmCuNe7q7qJY3G3wckGvG-QsF2SYpAVXhEA@mail.gmail.com \
    --to=weixugc@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=fvdl@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=songmuchun@bytedance.com \
    --cc=yang.shi@linux.alibaba.com \
    --cc=ying.huang@intel.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox