Date: Wed, 14 Dec 2022 18:40:19 +0100
From: Johannes Weiner
To: Michal Hocko
Cc: Dave Hansen, "Huang, Ying", Yang Shi, Wei Xu, Andrew Morton, linux-mm@kvack.org, LKML
Subject: Re: memcg reclaim demotion wrt. isolation

Hey Michal,

On Wed, Dec 14, 2022 at 04:29:06PM +0100, Michal Hocko wrote:
> On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> > The only way to prevent cgroups from disrupting each other on NUMA
> > nodes is NUMA constraints. Cgroup per-node limits. That shields not
> > only from demotion, but also from DoS-mbinding, or aggressive
> > promotion.
> > All of these can result in some form of premature
> > reclaim/demotion, proactive demotion isn't special in that way.
>
> Any numa based balancing is a real challenge with memcg semantic. I do
> not see per numa node memcg limits without a major overhaul of how we do
> charging though. I am not sure this is on the table even long term.
> Unless I am really missing something here we have to live with the
> existing semantic for a foreseeable future.

Yes, I think you're quite right.

We've been mostly skirting the NUMA issue in cgroups (and to a degree
in MM code in general) with two possible answers:

a) The NUMA distances are close enough that we ignore it and pretend
   all memory is (mostly) fungible.

b) The NUMA distances are big enough that it matters, in which case
   the best option is to avoid sharing, and use bindings to keep
   workloads/containers isolated to their own CPU+memory domains.

Tiered memory forces the issue by providing memory that must be shared
between workloads/containers, but is not fungible. At least not
without incurring priority inversions between containers, where a
lopri container promotes itself to the top and demotes the hipri
workload, while staying happily within its global memory
allowance. This applies to mbind() cases as much as it does to NUMA
balancing.

If these setups proliferate, it seems inevitable to me that sooner or
later the full problem space of memory cgroups - dividing up a shared
resource while allowing overcommit - applies not just to "RAM as a
whole", but to each memory tier individually.

Whether we need the full memcg interface per tier or per node, I'm not
sure. It might be enough to automatically apportion global allowances
to nodes; so if you have 32G toptier and 16G lowtier, and a cgroup has
a 20G allowance, it gets 13G on top and 7G on low.
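
To make the arithmetic in the 32G/16G example concrete, here is a
rough userspace sketch of one way such apportioning could work. This
is not kernel code and not an existing memcg interface: the function
name apportion() and the tier labels are made up for illustration, and
it assumes the split is simply proportional to tier size, with
leftover units handed to the tiers with the largest remainders.

```python
def apportion(allowance, tier_sizes):
    """Split a global allowance across tiers in proportion to tier size.

    Hypothetical helper for illustration only; all values are whole
    units (e.g. GiB). The per-tier shares always sum to `allowance`.
    """
    total = sum(tier_sizes.values())
    if total <= 0:
        raise ValueError("tier sizes must sum to a positive value")

    shares = {}
    remainders = {}
    for tier, size in tier_sizes.items():
        # Integer part of the proportional share, plus what was cut off.
        shares[tier], remainders[tier] = divmod(allowance * size, total)

    # Hand out the units lost to rounding, largest remainder first.
    leftover = allowance - sum(shares.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for tier in by_remainder[:leftover]:
        shares[tier] += 1
    return shares


if __name__ == "__main__":
    print(apportion(20, {"toptier": 32, "lowtier": 16}))
```

With a 20G allowance the exact proportional shares are 13.33G and
6.67G; rounding by largest remainder gives the 13G/7G split mentioned
above.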

(That, or we settle on multi-socket systems with private tiers, such
that memory continues to be unshared :-)

Either way, I expect this issue will keep coming up as we try to use
containers on such systems.