From: "Huang, Ying" <ying.huang@intel.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Yang Shi <yang.shi@linux.alibaba.com>,
Yosry Ahmed <yosryahmed@google.com>,
weixugc@google.com, Tim Chen <tim.c.chen@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
Jonathan Corbet <corbet@lwn.net>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeelb@google.com>,
Muchun Song <songmuchun@bytedance.com>,
fvdl@google.com, bagasdotme@gmail.com, cgroups@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Yuanchu Xie <yuanchu@google.com>
Subject: Re: Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim")
Date: Thu, 19 Jan 2023 16:29:33 +0800
Message-ID: <87a62fdj0y.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <Y8gqkub3AM6c+Z5y@dhcp22.suse.cz> (Michal Hocko's message of "Wed, 18 Jan 2023 18:21:22 +0100")
Michal Hocko <mhocko@suse.com> writes:
> On Wed 04-01-23 16:41:50, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> [snip]
>>
>> > This really requires more discussion.
>>
>> Let's start the discussion with some summary.
>>
>> Requirements:
>>
>> - Proactive reclaim. The counting of current per-memcg proactive
>> reclaim (memory.reclaim) isn't correct. The demoted, but not
>> reclaimed pages will be counted as reclaimed. So "echo XXM >
>> memory.reclaim" may exit prematurely before the specified number of
>> memory is reclaimed.
>
> This is reportedly a problem because memory.reclaim interface cannot be
> used for proper memcg sizing IIRC.
>
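To make the counting problem concrete, here is a toy model in Python. It is
not kernel code: the function name, the batch size, and the page counts are
all made up for illustration, and it assumes a two-tier system where every
pass both demotes from the fast tier and reclaims from the slow tier.

```python
def proactive_reclaim(requested, fast_pages, slow_pages, batch=32,
                      count_demoted=True):
    """Toy model of one memory.reclaim request on a two-tier system.

    Each pass reclaims up to `batch` pages from the slow tier and demotes
    up to `batch` pages from the fast tier to the slow tier.
    Returns the number of pages that were really freed (uncharged).
    """
    progress = 0  # what the loop believes it has reclaimed so far
    freed = 0     # pages that actually left the memcg
    while progress < requested and (fast_pages or slow_pages):
        reclaimed = min(batch, slow_pages)
        slow_pages -= reclaimed
        demoted = min(batch, fast_pages)
        fast_pages -= demoted
        slow_pages += demoted  # demoted pages stay charged to the memcg
        freed += reclaimed
        # The miscount discussed above: demoted pages are added to progress.
        progress += reclaimed + (demoted if count_demoted else 0)
    return freed
```

With 1000 pages in each tier and a request for 256 pages, the model stops
after freeing only 128 pages when demoted pages are counted, and frees all
256 pages when they are not.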
>> - Proactive demote. We need an interface to do per-memcg proactive
>> demote.
>
> For the further discussion it would be useful to reference the usecase
> that is requiring this functionality. I believe this has been mentioned
> somewhere but having it in this thread would help.
Sure.

Google people in [1] and [2] requested a per-cgroup interface to demote
proactively without reclaiming.

"
For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault).
"

Meta people (Johannes) in [3] say they used per-cgroup memory.reclaim
for both proactive demotion and proactive reclaim.

[1] https://lore.kernel.org/linux-mm/CAHS8izM-XdLgFrQ1k13X-4YrK=JGayRXV_G3c3Qh4NLKP7cH_g@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAJD7tkZNW=u1TD-Fd_3RuzRNtaFjxihbGm0836QHkdp0Nn-vyQ@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
>> We may reuse memory.reclaim via extending the concept of
>> reclaiming to include demoting. Or, we can add a new interface for
>> that (for example, memory.demote). In addition to demote from fast
>> tier to slow tier, in theory, we may need to demote from a set of
>> nodes to another set of nodes for something like general node
>> balancing.
>>
>> - Proactive promote. In theory, this is possible, but there's no real
>> life requirements yet. And it should use a separate interface, so I
>> don't think we need to discuss that here.
>
> Yes, proactive promotion is not backed by any real usecase at the
> moment. We do not really have to focus on it but we should be aware of
> the possibility and allow future extensions towards that functionality.
OK.
> There is one requirement missing here.
> - Per NUMA node control - this is what makes the distinction between
> demotion and charge reclaim really semantically challenging - e.g.
> should demotions be constrained by the provided nodemask or should they
> be implicit?
Yes. We may need to specify the NUMA nodes for the demotion/reclaiming
source, target, or even path. That is, to control the proactive
demotion/reclaiming at a fine granularity.
>> Open questions:
>>
>> - Use memory.reclaim or memory.demote for proactive demote. In current
>> memcg context, reclaiming and demoting is quite different, because
>> reclaiming will uncharge, while demoting will not. But if we will add
>> per-memory-tier charging finally, the difference disappears. So the
>> question becomes whether will we add per-memory-tier charging.
>
> The question is not whether but when IMHO. We've had a similar situation
> with the swap accounting. Originally we have considered swap as a shared
> resource but cgroupv2 goes with per swap limits because contention for
> the swap space is really something people do care about.
So, when we design the user space interface for proactive demotion, we
should keep per-memory-tier charging in mind.
>> - Whether should we demote from faster tier nodes to lower tier nodes
>> during the proactive reclaiming.
>
> I thought we are aligned on that. Demotion is a part of aging and that
> is an integral part of the reclaim.
As with choices A and B in the text below: should we prefer to keep more
fast memory or more slow memory? For the original active/inactive LRU
lists, we balance the sizes of the lists. But we don't have anything
similar for the memory tiers. What is the preferred balancing policy?
Choices A and B below are two extreme policies that are clearly defined.
>> Choice A is to keep as much fast
>> memory as possible. That is, reclaim from the lowest tier nodes
>> firstly, then the secondary lowest tier nodes, and so on. Choice B is
>> to demote at the same time of reclaiming. In this way, if we
>> proactively reclaim XX MB memory, we may free XX MB memory on the
>> fastest memory nodes.
>>
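To show the difference between the two extremes, here is a small sketch in
Python. It is only my reading of choices A and B, not kernel code. The
function names are hypothetical, and `tiers` is an assumed list of the
memcg's page counts per tier, fastest tier first.

```python
def choice_a(tiers, amount):
    """Choice A: reclaim from the slowest tier first, then the next
    slowest, and so on. Fast memory is kept as long as possible."""
    tiers = list(tiers)
    for i in reversed(range(len(tiers))):
        take = min(amount, tiers[i])
        tiers[i] -= take
        amount -= take
    return tiers


def choice_b(tiers, amount):
    """Choice B: reclaim from the slowest tier and demote the same amount
    one tier down from every faster tier, so the freed space ends up on
    the fastest tier."""
    tiers = list(tiers)
    moved = min(amount, tiers[-1])
    tiers[-1] -= moved              # the real reclaim (uncharge)
    for i in reversed(range(len(tiers) - 1)):
        moved = min(moved, tiers[i])
        tiers[i] -= moved           # demote from tier i ...
        tiers[i + 1] += moved       # ... into the next slower tier
    return tiers
```

For a memcg with 100 pages in each of three tiers and a request for 40
pages, choice A leaves [100, 100, 60] and choice B leaves [60, 100, 100].
The total usage is the same, but the free space appears in different tiers.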
>> - When we proactively demote some memory from a fast memory tier, should
>> we trigger memory competition in the slower memory tiers? That is,
>> whether to wake up kswapd of the slower memory tiers nodes?
>
> Johannes made some very strong arguments that there is no other choice
> than involve kswapd (https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).
I have no objection to that either. The text below is just another option.
If people don't think it's useful, I will not insist on it.
>> If we
>> want to make per-memcg proactive demoting to be per-memcg strictly, we
>> should avoid to trigger the global behavior such as triggering memory
>> competition in the slower memory tiers. Instead, we can add a global
>> proactive demote interface for that (such as per-memory-tier or
>> per-node).
>
> I suspect we are left with a real usecase and then follow the path we
> took for the swap accounting.
Thanks for adding that.
> Other open questions I do see are
> - what to do when the memory.reclaim is constrained by a nodemask as
> mentioned above. Is the whole reclaim process (including aging) bound to
> the given nodemask or does demotion escape from it.
Per my understanding, we can use multiple node masks if necessary. For
example, for "source=<mask1>", we may demote from <mask1> to other
nodes; for "source=<mask1> destination=<mask2>", we will demote from
<mask1> to <mask2>, but will not demote to other nodes.
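The intended semantics can be sketched in Python as below. This is only an
illustration of the idea: the "source=" and "destination=" keys come from
the example above and are not an existing kernel interface, and the helper
names and the node list syntax ("0-1,3") are assumptions.

```python
def parse_nodelist(text):
    """Parse a node list such as "0-1,3" into a set of node ids."""
    nodes = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return nodes


def demotion_targets(request, all_nodes):
    """Return (source nodes, allowed target nodes) for a request string.

    Without "destination=", pages may be demoted to any node outside the
    source mask. With it, only the listed nodes are allowed targets.
    """
    args = dict(field.split("=", 1) for field in request.split())
    source = parse_nodelist(args["source"])
    if "destination" in args:
        targets = parse_nodelist(args["destination"])
    else:
        targets = set(all_nodes) - source
    return source, targets
```

On a four-node system, "source=0-1" allows demotion to nodes 2 and 3, while
"source=0-1 destination=2" allows demotion to node 2 only.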
> - should the demotion be specific to multi-tier systems or the interface
> should be just NUMA based and users could use the scheme to shuffle
> memory around and allow numa balancing from userspace that way. That
> would imply that demotion is a dedicated interface of course.
It appears that if we can force the demotion target nodes (even within the
same tier), we can implement NUMA balancing from user space?
> - there are other usecases that would like to trigger aging from
> userspace (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
> Isn't demotion just a special case of aging in general or should we
> end up with 3 different interfaces?
Thanks for the pointer! If my understanding is correct, this appears to be
a user of the proactive reclaiming/demotion interface? I have Cc'ed the
patch author for any further requirements for the interface.
Best Regards,
Huang, Ying