From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Michal Hocko, tj@kernel.org, John Groves, Gregory Price,
 linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
 linux-mm@kvack.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 akpm@linux-foundation.org, lizefan.x@bytedance.com, hannes@cmpxchg.org,
 corbet@lwn.net, roman.gushchin@linux.dev, shakeelb@google.com,
 muchun.song@linux.dev, jgroves@micron.com
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
In-Reply-To: (Gregory Price's message of "Sun, 3 Dec 2023 22:33:08 -0500")
References: <0100018bb64636ef-9daaf0c0-813c-4209-94e4-96ba6854f554-000000@email.amazonses.com>
 <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 04 Dec 2023 16:19:02 +0800
Message-ID: <87sf4i2xe1.fsf@yhuang6-desk2.ccr.corp.intel.com>
Gregory Price writes:

> On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> Because we usually have multiple nodes in one mem-tier, I still think
>> a mem-tier-based interface is simpler than a node-based one. But it
>> seems more complex to introduce mem-tier into mempolicy, especially
>> if we have per-task weights. So, I am fine with going with a
>> node-based interface.
>>
>> > * cgroups: "this doesn't involve dynamic resource accounting /
>> >            enforcement at all" and "these aren't resource
>> >            allocations, it's unclear what the hierarchical
>> >            relationship means".
>> >
>> > * node: too global, explore smaller scope first then expand.
>>
>> Why is it too global? I understand that it doesn't cover all possible
>> use cases (although I don't know whether those use cases are
>> practical or not). But it can provide a reasonable default per-node
>> weight based on available node performance information (such as HMAT,
>> CDAT, etc.), and quite a few workloads could just use it. I think
>> this is a useful feature.
>>
>
> Have been sharing notes with more folks. Michal thinks a global set of
> weights is unintuitive and not useful, and would prefer to see the
> per-task weights first.
>
> Though this may have been in response to adding it as an attribute of
> nodes directly.
>
> Another proposal here suggested adding a new sysfs setting:
> https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a
>
> $ tree /sys/kernel/mm/interleave_weight/
> /sys/kernel/mm/interleave_weight/
> ├── enabled [1]
> ├── possible [2]
> └── node
>     ├── node0
>     │   └── interleave_weight [3]
>     └── node1
>         └── interleave_weight [3]
>
> (this could be changed to /sys/kernel/mm/mempolicy/...)
>
> I think the internal representation of this can be simplified greatly
> over what the patch provides now, but maybe this solves the "it
> doesn't belong in these other components" issue.
>
> Answer: simply leave it as a static global kobject in mempolicy, which
> also deals with many of the issues regarding race conditions.

Although personally I prefer to add the interleave weight as an
attribute of nodes.
I understand that some people think it's not appropriate to place
anything node-specific there, so somewhere under /sys/kernel/mm sounds
reasonable too.

> If a user provides weights, use those. If they do not, use globals.

Yes. That is the target use case.

> On a cpuset rebind event (container migration, mems_allowed changes),
> manually set weights would have to remain, so in a bad case the
> weights could end up very out of line with the real distribution of
> memory.
>
> Example: if your nodemask is (0,1,2) and a migration changes it to
> (3,4,5), then unfortunately your weights will likely revert to
> [1,1,1].
>
> If set with global weights, they could adjust automatically. It would
> not be perfect, but it would be better than the potential worst case
> above. If that same migration occurs, the next allocation would simply
> use whatever the target node weights are in the global config.
>
> So if globally you have weights [3,2,1,1,2,3] and you move from
> nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
> [1,2,3].

That is nice. And I prefer to emphasize the simple use case: users
don't always need to specify interleave weights. Just use the
MPOL_WEIGHTED_INTERLEAVE policy, and the system will provide reasonable
default weights.

> If the structure is built as a matrix of (cpu_node, mem_nodes), then
> you can also optimize based on the node the task is running on.

The matrix stuff makes the situation complex. If people really need
something like that, they can just use set_mempolicy2() with
user-specified weights. I still believe in "make simple stuff simple,
and complex stuff possible".

> That feels very intuitive, deals with many race condition issues, and
> the global setting can actually be implemented without the need for
> set_mempolicy2 at all - which is certainly a bonus.
>
> Would love more thoughts here. Will have a new RFC with set_mempolicy2,
> mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.

Thanks for doing all these!
--
Best Regards,
Huang, Ying