From: "Huang, Ying"
To: Gregory Price, Michal Hocko
Cc: "tj@kernel.org", John Groves, Gregory Price, "linux-kernel@vger.kernel.org", "linux-cxl@vger.kernel.org", "linux-mm@kvack.org", "cgroups@vger.kernel.org", "linux-doc@vger.kernel.org", "akpm@linux-foundation.org", "lizefan.x@bytedance.com", "hannes@cmpxchg.org", "corbet@lwn.net", "roman.gushchin@linux.dev", "shakeelb@google.com", "muchun.song@linux.dev", "jgroves@micron.com"
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
References: <20231109002517.106829-1-gregory.price@memverge.com> <0100018bb64636ef-9daaf0c0-813c-4209-94e4-96ba6854f554-000000@email.amazonses.com>
Date: Wed, 15 Nov 2023 13:56:53 +0800
Message-ID: <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one.
>> > > Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> >
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> >
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>>
>> Right, our current cpuset implementation indeed requires the callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottleneck though. If
>> anything, cpusets lock optimizations would be a win also for users who do
>> not want to use the weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until the cpusets lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications of the pull-model make me sad.
>
> Would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc.) before I go mucking around
> further.
>
> So far we have:
>
> * mempolicy: updating weights is a very complicated undertaking, and
>              there is no (good) way to do this from outside the task;
>              it would be better to have a coarser grained control.
>
>              A new syscall is likely needed to add/set weights in the
>              per-task mempolicy, or bite the bullet on set_mempolicy2
>              and make the syscall extensible for the future.
>
> * memtiers:  tier=node when devices are already interleaved or when all
>              devices are different, so why add yet another layer of
>              complexity if other constructs already exist? Additionally,
>              you lose task-placement relative weighting (or it becomes
>              very complex to implement).

Because we usually have multiple nodes in one mem-tier, I still think a
mem-tier-based interface is simpler than a node-based one. But it seems
more complex to introduce mem-tier into mempolicy, especially if we have
per-task weights. So, I am fine with going with a node-based interface.

> * cgroups:   "this doesn't involve dynamic resource accounting /
>              enforcement at all" and "these aren't resource
>              allocations, it's unclear what the hierarchical
>              relationship mean".
>
> * node:      too global; explore smaller scope first, then expand.

Why is it too global? I understand that it doesn't cover all possible
use cases (although I don't know whether those use cases are practical
or not). But it can provide a reasonable default per-node weight based
on the available node performance information (such as HMAT, CDAT,
etc.), and quite a few workloads can just use that. I think this is a
useful feature.

> For now I think there is consensus that mempolicy should have weights
> per task regardless of how the more-global mechanism is defined, so I'll
> go ahead and put up another RFC for some options on that in the next
> week or so.
>
> The limitation in the first pass will be that only the task is capable
> of re-weighting should cpusets.mems or the nodemask change.

--
Best Regards,
Huang, Ying
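
For illustration, a minimal user-space sketch of the weighted
round-robin node selection being discussed. The node count, the weight
table, and the helper name are made-up values for the example only; in
a real implementation the weights would come from the per-task
mempolicy or from per-node defaults derived from HMAT/CDAT, not from a
static array.

	/*
	 * Sketch only: weighted round-robin node selection, the core of
	 * "weighted interleave".  Weights here are invented (e.g. a DRAM
	 * node gets 4 pages for every 1 page on a CXL node).
	 */
	#include <stdio.h>

	#define NR_NODES 3

	static const unsigned int node_weight[NR_NODES] = { 4, 4, 1 };

	/* pick the node for the n-th interleaved page */
	static unsigned int weighted_interleave_node(unsigned long n)
	{
		unsigned long total = 0, pos;
		unsigned int node;

		for (node = 0; node < NR_NODES; node++)
			total += node_weight[node];

		pos = n % total;
		for (node = 0; node < NR_NODES; node++) {
			if (pos < node_weight[node])
				return node;
			pos -= node_weight[node];
		}
		return 0;	/* not reached */
	}

	int main(void)
	{
		for (unsigned long n = 0; n < 18; n++)
			printf("page %2lu -> node %u\n",
			       n, weighted_interleave_node(n));
		return 0;
	}

With the 4:4:1 weights above, every nine interleaved pages place four
on node 0, four on node 1, and one on node 2.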