From: "Huang, Ying"
To: Tvrtko Ursulin
Cc: Tvrtko Ursulin, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 kernel-dev@igalia.com, Mel Gorman, Peter Zijlstra, Ingo Molnar,
 Rik van Riel, Johannes Weiner, "Matthew Wilcox (Oracle)", Dave Hansen,
 Andi Kleen, Michal Hocko, David Rientjes
Subject: Re: [PATCH v2] mm/numa_balancing: Teach mpol_to_str about the balancing mode
In-Reply-To: (Tvrtko Ursulin's message of "Wed, 3 Jul 2024 09:34:01 +0100")
References: <20240702150006.35206-1-tursulin@igalia.com>
 <87o77fkprp.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <2fe66068-4419-4bfc-a92b-2ece3cfcb2ad@igalia.com>
 <87ed8akivq.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 04 Jul 2024 09:23:49 +0800
Message-ID: <874j96j6fu.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Tvrtko Ursulin writes:

> On 03/07/2024 08:57, Huang, Ying wrote:
>> Tvrtko Ursulin writes:
>>
>>> On 03/07/2024 06:28, Huang, Ying wrote:
>>>> Tvrtko Ursulin writes:
>>>>
>>>>> From: Tvrtko Ursulin
>>>>>
>>>>> Since balancing mode was added in
>>>>> bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes"),
>>>>> it was possible to set this mode but it wouldn't be shown in
>>>>> /proc//numa_maps since there was no support for it in the
>>>>> mpol_to_str() helper.
>>>>>
>>>>> Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it
>>>>> would be displayed as 'default' due to a workaround introduced a few years
>>>>> earlier in
>>>>> 8790c71a18e5 ("mm/mempolicy.c: fix mempolicy printing in numa_maps").
>>>>>
>>>>> To tidy this up we implement two changes:
>>>>>
>>>>> First we introduce a new internal flag MPOL_F_KERNEL and with it mark the
>>>>> kernel's internal default and fallback policies (for tasks and/or VMAs
>>>>> with no explicit policy set). By doing this we generalise the current
>>>>> special casing and replace the incorrect 'default' with the correct
>>>>> 'bind'.
>>>>>
>>>>> Secondly, we add a string representation and corresponding handling for
>>>>> MPOL_F_NUMA_BALANCING. We do this by adding a sparse mapping array of
>>>>> flags to names. With the sparseness being the downside, but with the
>>>>> advantage of generalising and removing the "policy" from flags display.
>>>> Please split these 2 changes into 2 patches. Because we will need to
>>>> back port the first one to -stable kernel.
>>>
>>> Why two? AFAICT there wasn't an issue until bda420b98505, and to fix it
>>> all changes from this patch are needed.
>> After bda420b98505, MPOL_BIND with MPOL_F_NUMA_BALANCING will be shown
>> as "default", which is a bug. While it's a new feature to show
>> "balancing". The first fix should be back-ported to -stable kernel
>> after bda420b98505. While we don't need to do that for the second one.
>
> You lost me but it could be I am not at my best today so if you could
> please explain more precisely what you mean?
>
> When bda420b98505 got in, it added MPOL_F_NUMA_BALANCING. But there
> was no "balancing" in mpol_to_str(). That's one fix for bda420b98505.

IMO, it's not a big issue to miss "balancing" in mpol_to_str(). It's
not absolutely necessary to backport this part.

> But also it did not change the pre-existing check for MPOL_F_MORON
> added in 8790c71a18e5, many years before it, which was the thing
> causing bind+balancing to be printed as default. So that's the second
> part of the fix. But also AFAICS to tag as fixes bda420b98505.
>
> Making 8790c71a18e5 target of Fixes: does not IMO make sense though
> because *at the time* of that patch it wasn't broken. What am I
> missing?

Yes, we should use "Fixes: bda420b98505 ..." for this part. This is a
big issue, because "default" will be shown for MPOL_BIND, which is
totally wrong. We need to backport this fix. It's good for
backporting to keep it small and focused.

>>>>> End result:
>>>>>
>>>>> $ numactl -b -m 0-1,3 cat /proc/self/numa_maps
>>>>> 555559580000 bind=balancing:0-1,3 file=/usr/bin/cat mapped=3 active=0 N0=3 kernelpagesize_kB=16
>>>>> ...
>>>>>
>>>>> v2:
>>>>> * Fully fix by introducing MPOL_F_KERNEL.
>>>>>
>>>>> Signed-off-by: Tvrtko Ursulin
>>>>> Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes")
>>>>> References: 8790c71a18e5 ("mm/mempolicy.c: fix mempolicy printing in numa_maps")
>>>>> Cc: Huang Ying
>>>>> Cc: Mel Gorman
>>>>> Cc: Peter Zijlstra
>>>>> Cc: Ingo Molnar
>>>>> Cc: Rik van Riel
>>>>> Cc: Johannes Weiner
>>>>> Cc: "Matthew Wilcox (Oracle)"
>>>>> Cc: Dave Hansen
>>>>> Cc: Andi Kleen
>>>>> Cc: Michal Hocko
>>>>> Cc: David Rientjes
>>>>> ---
>>>>>  include/uapi/linux/mempolicy.h |  1 +
>>>>>  mm/mempolicy.c                 | 44 ++++++++++++++++++++++++----------
>>>>>  2 files changed, 32 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
>>>>> index 1f9bb10d1a47..bcf56ce9603b 100644
>>>>> --- a/include/uapi/linux/mempolicy.h
>>>>> +++ b/include/uapi/linux/mempolicy.h
>>>>> @@ -64,6 +64,7 @@ enum {
>>>>>  #define MPOL_F_SHARED  (1 << 0) /* identify shared policies */
>>>>>  #define MPOL_F_MOF     (1 << 3) /* this policy wants migrate on fault */
>>>>>  #define MPOL_F_MORON   (1 << 4) /* Migrate On protnone Reference On Node */
>>>>> +#define MPOL_F_KERNEL  (1 << 5) /* Kernel's internal policy */
>>>>>  /*
>>>>>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>>> index aec756ae5637..8ecc6d9f100a 100644
>>>>> --- a/mm/mempolicy.c
>>>>> +++ b/mm/mempolicy.c
>>>>> @@ -134,6 +134,7 @@ enum zone_type policy_zone = 0;
>>>>>  static struct mempolicy default_policy = {
>>>>>      .refcnt = ATOMIC_INIT(1), /* never free it */
>>>>>      .mode = MPOL_LOCAL,
>>>>> +    .flags = MPOL_F_KERNEL,
>>>>>  };
>>>>>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>>>>> @@ -3095,7 +3096,7 @@ void __init numa_policy_init(void)
>>>>>          preferred_node_policy[nid] = (struct mempolicy) {
>>>>>              .refcnt = ATOMIC_INIT(1),
>>>>>              .mode = MPOL_PREFERRED,
>>>>> -            .flags = MPOL_F_MOF | MPOL_F_MORON,
>>>>> +            .flags = MPOL_F_MOF | MPOL_F_MORON | MPOL_F_KERNEL,
>>>>>              .nodes = nodemask_of_node(nid),
>>>>>          };
>>>>>      }
>>>>> @@ -3150,6 +3151,12 @@ static const char * const policy_modes[] =
>>>>>      [MPOL_PREFERRED_MANY] = "prefer (many)",
>>>>>  };
>>>>> +static const char * const policy_flags[] = {
>>>>> +    [ilog2(MPOL_F_STATIC_NODES)] = "static",
>>>>> +    [ilog2(MPOL_F_RELATIVE_NODES)] = "relative",
>>>>> +    [ilog2(MPOL_F_NUMA_BALANCING)] = "balancing",
>>>>> +};
>>>>> +
>>>>>  #ifdef CONFIG_TMPFS
>>>>>  /**
>>>>>   * mpol_parse_str - parse string to mempolicy, for tmpfs mpol mount option.
>>>>> @@ -3293,17 +3300,18 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>>>>>   * @pol: pointer to mempolicy to be formatted
>>>>>   *
>>>>>   * Convert @pol into a string. If @buffer is too short, truncate the string.
>>>>> - * Recommend a @maxlen of at least 32 for the longest mode, "interleave", the
>>>>> - * longest flag, "relative", and to display at least a few node ids.
>>>>> + * Recommend a @maxlen of at least 42 for the longest mode, "weighted
>>>>> + * interleave", the longest flag, "balancing", and to display at least a few
>>>>> + * node ids.
>>>>>   */
>>>>>  void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>>>>>  {
>>>>>      char *p = buffer;
>>>>>      nodemask_t nodes = NODE_MASK_NONE;
>>>>>      unsigned short mode = MPOL_DEFAULT;
>>>>> -    unsigned short flags = 0;
>>>>> +    unsigned long flags = 0;
>>>>> -    if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MORON)) {
>>>>> +    if (!(pol->flags & MPOL_F_KERNEL)) {
>>>> Can we avoid introducing a new flag? Would the following code work?
>>>>     if (pol && pol != &default_policy &&
>>>>         !(pol->mode != MPOL_PREFERRED) && !(pol->flags & MPOL_F_MORON))
>>>> But I think that this is kind of fragile. A flag is better. But
>>>> personally, I don't think MPOL_F_KERNEL is a good name, maybe
>>>> MPOL_F_DEFAULT?
>>>
>>> I thought along the same lines, but as you have also shown we need to
>>> exclude both default and preferred fallbacks so naming the flag
>>> default did not feel best. MPOL_F_INTERNAL? MPOL_F_FALLBACK?
>>> MPOL_F_SHOW_AS_DEFAULT? :))
>>>
>>> What I dislike about the flag more is the fact internal flags are for
>>> some reason in the uapi headers. And presumably we cannot zap them.
>>>
>>> But I don't think we can check for MPOL_PREFERRED since it can be a
>>> legitimate user set policy.
>> It's not legitimate (yet) to use MPOL_PREFERRED +
>> MPOL_F_NUMA_BALANCING.
>>
>>>
>>> We could check for the address of preferred_node_policy[] members with
>>> a loop covering all possible nids? If that will be the consensus I am
>>> happy to change it. But flag feels more elegant and robust.
>> Yes. I think that this is doable.
>>     (unsigned long)addr >= (unsigned long)(preferred_node_policy) && \
>>     (unsigned long)addr < (unsigned long)(preferred_node_policy) + \
>>         sizeof(preferred_node_policy)
>
> Not the prettiest but at least in the spirit of the existing
> &default_policy check. I can do that, no problem. If someone has a
> different opinion please shout soon.
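
For illustration only, the address-range test discussed above can be tried
in user space. This is a minimal sketch, not kernel code: struct
demo_policy, DEMO_MAX_NODES, the two demo_* policy objects and
demo_is_internal_policy() are all made-up stand-ins for struct mempolicy,
MAX_NUMNODES, default_policy, preferred_node_policy[] and whatever helper
a later version of the patch might use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_NODES 4 /* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical, cut-down stand-in for struct mempolicy. */
struct demo_policy {
    unsigned short mode;
    unsigned short flags;
};

/* Stand-ins for the kernel's default_policy and preferred_node_policy[]. */
struct demo_policy demo_default_policy;
struct demo_policy demo_preferred_node_policy[DEMO_MAX_NODES];

/*
 * Return true if pol is NULL, the default policy, or lies inside the
 * per-node fallback array, i.e. anything that should print as "default".
 */
bool demo_is_internal_policy(const struct demo_policy *pol)
{
    uintptr_t addr = (uintptr_t)pol;
    uintptr_t start = (uintptr_t)demo_preferred_node_policy;
    uintptr_t end = start + sizeof(demo_preferred_node_policy);

    if (!pol || pol == &demo_default_policy)
        return true;

    /* Same shape as the range comparison quoted above. */
    return addr >= start && addr < end;
}
```

A policy object allocated anywhere else (for example one set up by a user
request) falls outside the array and is therefore reported with its real
mode, which is the behaviour the flag was meant to provide.
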
>
>>>>>          mode = pol->mode;
>>>>>          flags = pol->flags;
>>>>>      }
>>>>> @@ -3328,15 +3336,25 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>>>>>      p += snprintf(p, maxlen, "%s", policy_modes[mode]);
>>>>>      if (flags & MPOL_MODE_FLAGS) {
>>>>> -        p += snprintf(p, buffer + maxlen - p, "=");
>>>>> +        unsigned int bit, cnt = 0;
>>>>> -        /*
>>>>> -         * Currently, the only defined flags are mutually exclusive
>>>>> -         */
>>>>> -        if (flags & MPOL_F_STATIC_NODES)
>>>>> -            p += snprintf(p, buffer + maxlen - p, "static");
>>>>> -        else if (flags & MPOL_F_RELATIVE_NODES)
>>>>> -            p += snprintf(p, buffer + maxlen - p, "relative");
>>>>> +        for_each_set_bit(bit, &flags, ARRAY_SIZE(policy_flags)) {
>>>>> +            if (bit <= ilog2(MPOL_F_KERNEL))
>>>>> +                continue;
>>>>> +
>>>>> +            if (cnt == 0)
>>>>> +                p += snprintf(p, buffer + maxlen - p, "=");
>>>>> +            else
>>>>> +                p += snprintf(p, buffer + maxlen - p, ",");
>>>>> +
>>>>> +            if (WARN_ON_ONCE(!policy_flags[bit]))
>>>>> +                p += snprintf(p, buffer + maxlen - p, "bit%u",
>>>>> +                              bit);
>>>>> +            else
>>>>> +                p += snprintf(p, buffer + maxlen - p,
>>>>> +                              policy_flags[bit]);
>>>>> +            cnt++;
>>>>> +        }
>>>> Please refer to commit 2291990ab36b ("mempolicy: clean-up mpol-to-str()
>>>> mempolicy formatting") for the original format.
>>>
>>> That was in 2008 so a long time ago and in the meantime there were no
>>> bars. The format in this patch tries to align with the input format
>>> and I think it manages, apart from deciding to print unknown flags as
>>> bit numbers (which is most probably an irrelevant difference). Why do
>>> you think the pre-2008 format is better?
>> If you think that your format is better, please explain why you do not
>> use the original format in the patch description. You can also show
>> examples to compare.
>
> Because there is no "old" format? If you refer to the one which ended
> in 2008. Or if you refer to the one this patch replaces, then it is
> effectively the same format for a single flag. And for multiple flags
> before this patch that wasn't a possibility. So I am not sure what I
> would include as a comparison. Broken "default" vs
> "bind=balancing:0-1"? Am I missing something?

In the old format (not in the old code), it is,

bind=relative|balancing:0-1

while in your format,

bind=relative,balancing:0-1

Please explain why you make the change.

[snip]

--
Best Regards,
Huang, Ying
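
To make the two spellings concrete, here is a minimal user-space sketch
that joins flag names with a caller-chosen separator. It is an
illustration under stated assumptions, not the kernel's mpol_to_str():
the DEMO_F_* values, the demo_flag_names table, demo_append() and
demo_policy_to_str() are hypothetical names, and the table order is
picked only so that the output matches the two example strings above.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-ins for the mode flags; not taken from a kernel header. */
#define DEMO_F_NUMA_BALANCING (1u << 13)
#define DEMO_F_RELATIVE_NODES (1u << 14)
#define DEMO_F_STATIC_NODES   (1u << 15)

struct demo_flag_name {
    unsigned int flag;
    const char *name;
};

/* Table order is chosen to reproduce the example strings in this mail. */
static const struct demo_flag_name demo_flag_names[] = {
    { DEMO_F_STATIC_NODES,   "static"    },
    { DEMO_F_RELATIVE_NODES, "relative"  },
    { DEMO_F_NUMA_BALANCING, "balancing" },
};

/* Append s at offset pos without overrunning buf; returns the new offset. */
size_t demo_append(char *buf, size_t len, size_t pos, const char *s)
{
    int n;

    if (len == 0 || pos >= len)
        return pos;
    n = snprintf(buf + pos, len - pos, "%s", s);
    if (n < 0)
        return pos;
    pos += (size_t)n;
    /* On truncation, park the offset on the terminating NUL. */
    return pos < len ? pos : len - 1;
}

/* Format "<mode>[=<flag><sep><flag>...][:<nodes>]" into buf. */
void demo_policy_to_str(char *buf, size_t len, const char *mode,
                        unsigned int flags, const char *sep,
                        const char *nodes)
{
    size_t pos = 0, i;
    int first = 1;

    pos = demo_append(buf, len, pos, mode);
    for (i = 0; i < sizeof(demo_flag_names) / sizeof(demo_flag_names[0]); i++) {
        if (!(flags & demo_flag_names[i].flag))
            continue;
        /* "=" introduces the first flag, sep joins the rest. */
        pos = demo_append(buf, len, pos, first ? "=" : sep);
        pos = demo_append(buf, len, pos, demo_flag_names[i].name);
        first = 0;
    }
    if (nodes) {
        pos = demo_append(buf, len, pos, ":");
        demo_append(buf, len, pos, nodes);
    }
}
```

Called with mode "bind", flags DEMO_F_RELATIVE_NODES |
DEMO_F_NUMA_BALANCING and nodes "0-1", the separator "|" gives
bind=relative|balancing:0-1 and the separator "," gives
bind=relative,balancing:0-1, so the only difference under discussion is
the joining character.
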