From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: Mina Almasry, Michal Hocko
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Roman Gushchin,
 Shakeel Butt, Muchun Song, Andrew Morton, Yang Shi, Yosry Ahmed,
 weixugc@google.com, fvdl@google.com, bagasdotme@gmail.com,
 cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
References: <20221202223533.1785418-1-almasrymina@google.com>
Date: Tue, 13 Dec 2022 14:30:57 +0800
In-Reply-To: (Mina Almasry's message of "Mon, 12 Dec 2022 16:54:27 -0800")
Message-ID: <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Mina Almasry writes:

> On
> Mon, Dec 12, 2022 at 12:55 AM Michal Hocko wrote:
>>
>> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
>> > The nodes= arg instructs the kernel to only scan the given nodes for
>> > proactive reclaim. For example use cases, consider a 2 tier memory system:
>> >
>> > nodes 0,1 -> top tier
>> > nodes 2,3 -> second tier
>> >
>> > $ echo "1m nodes=0" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
>> > Since node 0 is a top tier node, demotion will be attempted first. This
>> > is useful to direct proactive reclaim to specific nodes that are under
>> > pressure.
>> >
>> > $ echo "1m nodes=2,3" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
>> > since this tier of memory has no demotion targets the memory will be
>> > reclaimed.
>> >
>> > $ echo "1m nodes=0,1" > memory.reclaim
>> >
>> > Instructs the kernel to reclaim memory from the top tier nodes, which can
>> > be desirable according to the userspace policy if there is pressure on
>> > the top tiers. Since these nodes have demotion targets, the kernel will
>> > attempt demotion first.
>> >
>> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim""), the proactive reclaim interface memory.reclaim does both
>> > reclaim and demotion. Reclaim and demotion incur different latency costs
>> > to the jobs in the cgroup. Demoted memory would still be addressable
>> > by the userspace at a higher latency, but reclaimed memory would need to
>> > incur a pagefault.
>> >
>> > The 'nodes' arg is useful to allow the userspace to control demotion
>> > and reclaim independently according to its policy: if the memory.reclaim
>> > is called on a node with demotion targets, it will attempt demotion first;
>> > if it is called on a node without demotion targets, it will only attempt
>> > reclaim.
>> >
>> > Acked-by: Michal Hocko
>> > Signed-off-by: Mina Almasry
>>
>> After discussion in [1] I have realized that I haven't really thought
>> through all the consequences of this patch and therefore I am retracting
>> my ack here. I am not nacking the patch at this stage but I also think
>> this shouldn't be merged now and we should really consider all the
>> consequences.
>>
>> Let me summarize my main concerns here as well. The proposed
>> implementation doesn't apply the provided nodemask to the whole reclaim
>> process. This means that demotion can happen outside of the mask so the
>> user request cannot really control demotion targets and that limits
>> the interface should there be any need for a finer grained control in
>> the future (see an example in [2]).
>> Another problem is that this can limit future reclaim extensions because
>> of existing assumptions of the interface [3] - specify only top-tier
>> node to force the aging without actually reclaiming any charges and
>> (ab)use the interface only for aging on multi-tier system. A change to
>> the reclaim to not demote in some cases could break this usecase.
>>
>
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
>
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
>
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1].
> In the same email Johannes also tells
> me that Meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.
>
> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/

After these discussions, I think the solution may be to use different
interfaces for "proactive demote" and "proactive reclaim".  That is,
reconsider "memory.demote".  In this way, "memory.reclaim" will always
uncharge the cgroup, which avoids the possible confusion.  And, because
demotion is considered aging, we don't need to disable demotion for
"memory.reclaim"; we just don't count it toward the amount reclaimed.

Best Regards,
Huang, Ying
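
P.S. To make the accounting I have in mind concrete, here is a toy model
in Python.  It is only a sketch, not kernel code.  It assumes a cgroup
with just two tiers, and all names in it (Cgroup, proactive_reclaim,
proactive_demote, batch) are made up for illustration.  In particular,
"memory.demote" does not exist today, so its semantics here are an
assumption.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    """Hypothetical two-tier cgroup; page counts are the only state."""
    top_tier_pages: int  # charged pages on nodes that have a demotion target
    low_tier_pages: int  # charged pages on nodes without a demotion target

    @property
    def charged(self) -> int:
        # Demoted pages stay charged; only eviction lowers this number.
        return self.top_tier_pages + self.low_tier_pages


def proactive_reclaim(cg: Cgroup, nr_to_reclaim: int, batch: int = 32) -> int:
    """Toy "memory.reclaim": demotion still happens, but is not counted."""
    reclaimed = 0
    while reclaimed < nr_to_reclaim:
        # Aging step: demote a batch from the top tier.  The pages remain
        # charged to the cgroup, so they do not count as reclaimed.
        demoted = min(batch, cg.top_tier_pages)
        cg.top_tier_pages -= demoted
        cg.low_tier_pages += demoted

        # Reclaim step: evict from the lowest tier.  This uncharges the
        # cgroup, so only these pages count toward the request.
        freed = min(batch, nr_to_reclaim - reclaimed, cg.low_tier_pages)
        cg.low_tier_pages -= freed
        reclaimed += freed

        if demoted == 0 and freed == 0:
            break  # nothing left to age or to evict
    return reclaimed


def proactive_demote(cg: Cgroup, nr_to_demote: int) -> int:
    """Toy "memory.demote": move pages down a tier, never uncharge."""
    demoted = min(nr_to_demote, cg.top_tier_pages)
    cg.top_tier_pages -= demoted
    cg.low_tier_pages += demoted
    return demoted
```

With this split, the value returned by proactive_reclaim() always equals
the drop in cg.charged, while proactive_demote() leaves cg.charged
unchanged, so userspace can tell the two effects apart.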