From: "Huang, Ying"
To: Michal Hocko
Cc: Mina Almasry, Johannes Weiner, Yang Shi, Yosry Ahmed,
 weixugc@google.com, Tim Chen, Andrew Morton, Tejun Heo, Zefan Li,
 Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
 fvdl@google.com, bagasdotme@gmail.com, cgroups@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Yuanchu Xie
Subject: Re: Proactive reclaim/demote discussion (was Re: [PATCH] Revert "mm: add nodes= arg to memory.reclaim")
References: <20221202223533.1785418-1-almasrymina@google.com>
 <20221216101820.3f4a370af2c93d3c2e78ed8a@linux-foundation.org>
 <20221219144252.f3da256e75e176905346b4d1@linux-foundation.org>
 <87lemiitdd.fsf_-_@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 19 Jan 2023 16:29:33 +0800
In-Reply-To: (Michal Hocko's message of "Wed, 18 Jan 2023 18:21:22 +0100")
Message-ID: <87a62fdj0y.fsf@yhuang6-desk2.ccr.corp.intel.com>

Michal Hocko writes:

> On Wed 04-01-23 16:41:50, Huang, Ying wrote:
>> Michal Hocko writes:
>>
>> [snip]
>>
>> > This really requires more discussion.
>>
>> Let's start the discussion with some summary.
>>
>> Requirements:
>>
>> - Proactive reclaim.  The counting of current per-memcg proactive
>>   reclaim (memory.reclaim) isn't correct.  The demoted, but not
>>   reclaimed pages will be counted as reclaimed.  So "echo XXM >
>>   memory.reclaim" may exit prematurely before the specified amount of
>>   memory is reclaimed.
>
> This is reportedly a problem because memory.reclaim interface cannot be
> used for proper memcg sizing IIRC.
>
>> - Proactive demote.  We need an interface to do per-memcg proactive
>>   demote.
>
> For the further discussion it would be useful to reference the usecase
> that is requiring this functionality. I believe this has been mentioned
> somewhere but having it in this thread would help.

Sure.
Google people in [1] and [2] requested a per-cgroup interface to demote
but not reclaim proactively:

"
For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault).
"

Meta people (Johannes) in [3] say they used per-cgroup memory.reclaim
to both demote and reclaim proactively.

[1] https://lore.kernel.org/linux-mm/CAHS8izM-XdLgFrQ1k13X-4YrK=JGayRXV_G3c3Qh4NLKP7cH_g@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAJD7tkZNW=u1TD-Fd_3RuzRNtaFjxihbGm0836QHkdp0Nn-vyQ@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/

>>   We may reuse memory.reclaim via extending the concept of
>>   reclaiming to include demoting.  Or, we can add a new interface for
>>   that (for example, memory.demote).  In addition to demoting from the
>>   fast tier to the slow tier, in theory, we may need to demote from a
>>   set of nodes to another set of nodes for something like general node
>>   balancing.
>>
>> - Proactive promote.  In theory, this is possible, but there are no
>>   real-life requirements yet.  And it should use a separate interface,
>>   so I don't think we need to discuss that here.
>
> Yes, proactive promotion is not backed by any real usecase at the
> moment. We do not really have to focus on it but we should be aware of
> the possibility and allow future extensions towards that functionality.

OK.

> There is one requirement missing here.
> - Per NUMA node control - this is what makes the distinction between
>   demotion and charge reclaim really semantically challenging - e.g.
>   should demotions be constrained by the provided nodemask or should
>   they be implicit?

Yes.  We may need to specify the NUMA nodes for the demotion/reclaiming
source, target, or even path.  That is, to control the proactive
demotion/reclaiming at a fine granularity.

>> Open questions:
>>
>> - Use memory.reclaim or memory.demote for proactive demote.
>>   In the current memcg context, reclaiming and demoting are quite
>>   different, because reclaiming will uncharge, while demoting will
>>   not.  But if we eventually add per-memory-tier charging, the
>>   difference disappears.  So the question becomes whether we will add
>>   per-memory-tier charging.
>
> The question is not whether but when IMHO. We've had a similar situation
> with the swap accounting. Originally we have considered swap as a shared
> resource but cgroupv2 goes with per swap limits because contention for
> the swap space is really something people do care about.

So, when we design the user space interface for proactive demotion, we
should keep per-memory-tier charging in mind.

>> - Whether we should demote from faster tier nodes to lower tier nodes
>>   during the proactive reclaiming.
>
> I thought we are aligned on that. Demotion is a part of aging and that
> is an integral part of the reclaim.

As with choices A and B in the text below, should we keep more fast
memory or more slow memory?  For the original active/inactive LRU
lists, we balance the sizes of the lists.  But we don't have anything
similar for the memory tiers.  What is the preferred balancing policy?
Choices A and B below are two extreme policies that are clearly
defined.

>>   Choice A is to keep as much fast memory as possible.  That is,
>>   reclaim from the lowest tier nodes first, then the second-lowest
>>   tier nodes, and so on.  Choice B is to demote at the same time as
>>   reclaiming.  In this way, if we proactively reclaim XX MB memory, we
>>   may free XX MB memory on the fastest memory nodes.
>>
>> - When we proactively demote some memory from a fast memory tier, should
>>   we trigger memory competition in the slower memory tiers?  That is,
>>   whether to wake up kswapd of the slower memory tier nodes?
>
> Johannes made some very strong arguments that there is no other choice
> than involve kswapd (https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

I have no objection to that either.
The below is just another choice.  If people don't think it's useful,
I will not insist on it.

>>   If we want to make per-memcg proactive demoting strictly per-memcg,
>>   we should avoid triggering global behavior such as memory
>>   competition in the slower memory tiers.  Instead, we can add a
>>   global proactive demote interface for that (such as per-memory-tier
>>   or per-node).
>
> I suspect we are left with a real usecase and then follow the path we
> took for the swap accounting.

Thanks for adding that.

> Other open questions I do see are
> - what to do when the memory.reclaim is constrained by a nodemask as
>   mentioned above. Is the whole reclaim process (including aging) bound to
>   the given nodemask or does demotion escape from it.

Per my understanding, we can use multiple node masks if necessary.  For
example, for "source=", we may demote from the source nodes to other
nodes; for "source= destination=", we will demote from the source nodes
to the destination nodes, but will not demote to other nodes.

> - should the demotion be specific to multi-tier systems or the interface
>   should be just NUMA based and users could use the scheme to shuffle
>   memory around and allow numa balancing from userspace that way. That
>   would imply that demotion is a dedicated interface of course.

It appears that if we can force the demotion target nodes (even in the
same tier), we can implement NUMA balancing from user space?

> - there are other usecases that would like to trigger aging from
>   userspace (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
>   Isn't demotion just a special case of aging in general or should we
>   end up with 3 different interfaces?

Thanks for the pointer!  If my understanding is correct, this appears
to be a user of the proactive reclaiming/demotion interface?  Cc'ed the
patch author for any further requirements for the interface.

Best Regards,
Huang, Ying
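
For illustration only, the "source=" / "source= destination=" node mask
semantics suggested in the reply above could be modelled roughly as
follows.  This is a toy user-space sketch, not kernel code and not an
existing interface: the function name demotion_targets(), the use of
Python sets for node masks, the node numbers, and the rule that a node
is never its own demotion target are all assumptions made for the
example.

```python
from typing import Iterable, Optional, Set


def demotion_targets(
    all_nodes: Iterable[int],
    source: Iterable[int],
    destination: Optional[Iterable[int]] = None,
) -> Set[int]:
    """Return the nodes that pages on the `source` nodes may be demoted to.

    Hypothetical model of the two cases discussed in the thread:
    - only a source mask: any node outside the source set is allowed;
    - source and destination masks: only the destination nodes are allowed.
    """
    src = set(source)
    if destination is None:
        # "source=" alone: demote from the source nodes to other nodes.
        return set(all_nodes) - src
    # "source= destination=": demote only to the destination nodes.
    # Assumption: a source node is never treated as its own target.
    return set(destination) - src


if __name__ == "__main__":
    nodes = {0, 1, 2, 3}  # assumed layout: 0-1 fast tier, 2-3 slow tier
    print(sorted(demotion_targets(nodes, {0, 1})))       # [2, 3]
    print(sorted(demotion_targets(nodes, {0, 1}, {2})))  # [2]
```

Keeping the destination mask optional mirrors the two cases in the
text: the interface stays usable with a single mask, and the second
mask only narrows where demoted pages may go.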