Date: Sat, 11 Nov 2023 01:16:29 -1000
From: "tj@kernel.org"
To: Gregory Price
Cc: John Groves, Gregory Price, linux-kernel@vger.kernel.org,
    linux-cxl@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-doc@vger.kernel.org, ying.huang@intel.com,
    akpm@linux-foundation.org, mhocko@kernel.org, lizefan.x@bytedance.com,
    hannes@cmpxchg.org, corbet@lwn.net, roman.gushchin@linux.dev,
    shakeelb@google.com, muchun.song@linux.dev, jgroves@micron.com
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
References: <20231109002517.106829-1-gregory.price@memverge.com>
 <0100018bb64636ef-9daaf0c0-813c-4209-94e4-96ba6854f554-000000@email.amazonses.com>
Hello,

On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote:
...
> I've been considering this as well, but there's more context here being
> lost. It's not just about being able to toggle the policy of a single
> task, or related tasks, but actually in support of a more global data
> interleaving strategy that makes use of bandwidth more effectively as
> memory expansion and bandwidth expansion begin to occur on the PCIe/CXL
> bus.
>
> If the memory landscape of a system changes, for example due to a
> hotplug event, you actually want to change the behavior of *every* task
> that is using interleaving. The fundamental bandwidth distribution of
> the entire system changed, so the behavior of every task using that
> memory should change with it.
>
> We've explored adding weights to: mempolicy, memory tiers, nodes, memcg,
> and now additionally cpusets. In the last email, I'd asked whether it
> might actually be worth adding a new mpol component of cgroups to
> aggregate these issues, rather than jamming them into either component.
> I would love your thoughts on that.

As for CXL and the changing memory landscape, I think some caution is
necessary, as with any expected "future" technology change. The recent
example with non-volatile memory isn't too far from CXL either. Note that
this is not to say that we shouldn't change anything until the hardware is
wildly popular, but rather that we need to be cognizant of the speculative
nature and the possibility of overbuilding for it.

I don't have a golden answer, but here are some general suggestions: Build
something which is small and/or useful even outside the context of the
expected hardware landscape changes. Enable the core feature which is
absolutely required in a minimal manner. Avoid being maximalist in feature
and convenience coverage.

Here, even if CXL actually becomes popular, how many users are going to
use memory hotplug and need to dynamically rebalance memory in actively
running workloads? What's the scenario? Are there going to be an army of
data center technicians going around plugging and unplugging CXL devices
depending on system memory usage? Maybe there are some cases where this is
actually useful, but for those niche use cases, isn't a per-task interface
with iteration enough? How often are these hotplug events going to happen?
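To illustrate what I mean by iteration, here is a rough userspace sketch
(not anything from the patch set): it assumes cgroup2 is mounted, takes
the cgroup directory as its argument, and uses the freezer to hold group
membership stable while walking cgroup.procs. apply_policy() is a
hypothetical stand-in for whatever per-task mempolicy interface would
actually be used.

/*
 * Sketch: freeze a cgroup, apply a per-task setting to every member,
 * then thaw.  Assumes cgroup2; the cgroup directory is argv[1].
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

/* hypothetical stand-in for a real per-task mempolicy interface */
static void apply_policy(pid_t pid)
{
	printf("would update mempolicy of pid %ld\n", (long)pid);
}

int main(int argc, char **argv)
{
	char path[PATH_MAX], line[64];
	FILE *procs;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <cgroup-dir>\n", argv[0]);
		return 1;
	}

	/* freeze so that membership can't change while we iterate */
	snprintf(path, sizeof(path), "%s/cgroup.freeze", argv[1]);
	write_str(path, "1");

	/* walk the member PIDs */
	snprintf(path, sizeof(path), "%s/cgroup.procs", argv[1]);
	procs = fopen(path, "r");
	if (!procs) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), procs))
		apply_policy((pid_t)atol(line));
	fclose(procs);

	/* thaw */
	snprintf(path, sizeof(path), "%s/cgroup.freeze", argv[1]);
	write_str(path, "0");
	return 0;
}

A real tool would also poll cgroup.events until it reports "frozen 1"
before iterating, and, if run without the freezer, would loop over
cgroup.procs until no new PID shows up.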
> > > So one concrete use case: kubernetes might like to change cpusets or
> > > move tasks from one cgroup to another, or a vm might be migrated
> > > from one set of nodes to another (technically not mutually exclusive
> > > here). Some memory policy settings (like weights) may no longer
> > > apply when this happens, so it would be preferable to have a way to
> > > change them.
> >
> > Neither covers all use cases. As you noted in your mempolicy message,
> > if the application wants finer grained control, the cgroup interface
> > isn't great. In general, any changes which are dynamically initiated
> > by the application itself aren't a great fit for cgroup.
>
> It is certainly simple enough to add weights to mempolicy, but there
> are limitations. In particular, mempolicy is extremely `current task`
> focused, and significant refactor work would need to be done to allow
> external tasks the ability to toggle a target task's mempolicy.
>
> In particular I worry about the potential concurrency issues since
> mempolicy can be in the hot allocation path.

Changing mpol from outside the task is a feature which is inherently
useful regardless of CXL, and I don't quite understand why hot-path
concurrency issues would be different whether the configuration is coming
from mempolicy or cgroup, but that could easily be me not being familiar
with the involved code.

...
> > 3. Cgroup can be convenient when group config change is necessary.
> > However, we really don't want to keep adding kernel interfaces just
> > for changing configs for a group of threads. For config changes which
> > aren't high frequency, userspace iterating the member processes and
> > applying the changes where possible is usually good enough; this
> > involves looping until no new process is found. If the looping is
> > problematic, the cgroup freezer can be used to stop all member threads
> > and provide atomicity too.
>
> If I can ask, do you think it would be out of line to propose a major
> refactor to mempolicy to give external tasks the ability to change a
> running task's mempolicy *as well as* a cgroup-wide mempolicy component?

I don't think these group configurations fit the cgroup filesystem
interface very well. As these aren't resource allocations, it's unclear
what the hierarchical relationship means. Besides, it feels awkward to
keep adding duplicate interfaces where the modality changes completely
based on the operation scope.

There are ample examples where other subsystems use cgroup membership
information, and while we haven't expanded that to syscalls yet, I don't
see why that'd be all that different. So, maybe it'd make sense to have
the new mempolicy syscall take a cgroup ID as a target identifier too,
i.e. so that the scope of the operation (e.g. task, process, cgroup) and
the content of the policy can stay orthogonal?

Thanks.

-- 
tejun