Date: Wed, 14 Dec 2022 16:29:06 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Dave Hansen, "Huang, Ying", Yang Shi, Wei Xu, Andrew Morton, linux-mm@kvack.org, LKML
Subject: Re: memcg reclaim demotion wrt. isolation

On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> On Wed, Dec 14, 2022 at 10:42:56AM +0100, Michal Hocko wrote:
[...]
> > The reclaim behavior would make more sense to me if it was constrained
> > to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> > disrupted.
>
> What if the second tier is full, and the memcg you're trying to demote
> doesn't have any pages to vacate on that tier yet? Will it fail to
> demote?
>
> Does that mean that a shared second tier node is only usable for the
> cgroup that demotes to it first? And demotion stops for everybody else
> until that cgroup vacates the node voluntarily?
>
> As you can see, these would be unprecedented and quite surprising
> first-come-first-serve memory protection semantics.

This is a very good example!

> The only way to prevent cgroups from disrupting each other on NUMA
> nodes is NUMA constraints. Cgroup per-node limits. That shields not
> only from demotion, but also from DoS-mbinding, or aggressive
> promotion. All of these can result in some form of premature
> reclaim/demotion, proactive demotion isn't special in that way.

Any NUMA-based balancing is a real challenge with memcg semantics. I do
not see per-NUMA-node memcg limits happening without a major overhaul of
how we do charging, though. I am not sure this is on the table even in
the long term. Unless I am really missing something here, we have to
live with the existing semantics for the foreseeable future.

> The default behavior for cgroups is that without limits or
> protections, resource access is unconstrained and competitive. Without
> NUMA constraints, it's very much expected that cgroups compete over
> nodes, and that the hottest pages win out. Per aging rules, freshly
> demoted pages are hotter than anything else on the target node, so it
> should displace accordingly.

That is certainly a way to look at it, but I would really emphasise that
this competition depends quite significantly on higher-level balancing
on top. Memory allocations fall back to different nodes, so the resource
distribution should be roughly even in this case. If there is
competition, then it most likely means our resources are overcommitted.
The picture is slightly different with demotion for memory tiering,
IMHO, because it spills internal resource contention or explicit
userspace balancing (via proactive reclaim/demotion) to the outside: it
creates pressure on the demotion target, which is a shared resource, as
you have mentioned above.

> Consider the case where you have two lower tier nodes and there are
> cpuset isolation for the main workloads, but some maintenance thing
> runs and pollutes one of the lower tier nodes.

Well, this is not really much different from a regular NUMA system where
node-aware and constrained workloads compete with NUMA-unconstrained
workloads. This has never worked.

> Or consider the case
> where a shared lower tier node is divvied up between two cgroups using
> protection settings to allow overcommit, i.e. per-node memory.low.
> Demotions, proactive or not, MUST do global reclaim on a full node.

OK, but my concern is how to implement any userspace policy around that
behavior. If you see demotion failures, then you can trigger some
rebalancing explicitly. If those are silent, then your only option left
is to check the capacity of the demotion target regularly and play a
catch-up game. Is this sufficient?

All that being said, I can see that both approaches result in some
corner cases. I do agree that starvation is likely an easier scenario
than an actively evil container disrupting another container by pushing
its demoted pages out. So scratch the patch.

Thanks
-- 
Michal Hocko
SUSE Labs
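
For illustration only, the "check the capacity of the demotion target
regularly" option mentioned above could look roughly like the sketch
below. It is not part of any patch in this thread. It assumes the
demotion target is an ordinary NUMA node whose counters are readable
from /sys/devices/system/node/nodeN/meminfo; the function names and the
5% threshold are hypothetical, and what "rebalancing" would then do is
left open.

```python
import re

# Matches per-node meminfo lines such as "Node 1 MemFree:   1234 kB".
# Fields with parentheses in the name, or without a kB unit, are skipped.
_LINE = re.compile(r"^Node\s+\d+\s+(\w+):\s+(\d+)\s+kB$")


def parse_node_meminfo(text):
    """Return {field: value_in_kB} for the plain kB-valued lines of a node meminfo dump."""
    fields = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def free_fraction(text):
    """Fraction of the node's memory that is currently free (MemFree / MemTotal)."""
    info = parse_node_meminfo(text)
    return info["MemFree"] / info["MemTotal"]


def needs_rebalance(text, min_free=0.05):
    """Hypothetical policy: flag the node once free memory drops below min_free."""
    return free_fraction(text) < min_free


def read_node_meminfo(node):
    """Read the raw meminfo text for one NUMA node from sysfs."""
    with open(f"/sys/devices/system/node/node{node}/meminfo") as f:
        return f.read()
```

A userspace agent would call needs_rebalance(read_node_meminfo(N)) on a
timer for each demotion target N. This is exactly the catch-up game
described above: the check can only observe that the node is already
close to full, not that a particular demotion was refused.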