From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Gregory Price, Hyeonggon Yoo, kernel_team@skhynix.com, 42.hyeyoo@gmail.com,
	rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
	akpm@linux-foundation.org, KIM HONGGYU (System SW), KIM RAKIE (System SW),
	dan.j.williams@intel.com, Jonathan.Cameron@huawei.com, dave.jiang@intel.com,
	horen.chuang@linux.dev, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [External Mail] Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning
In-Reply-To: <20250109191102.3772288-1-joshua.hahnjy@gmail.com> (Joshua Hahn's message of "Thu, 9 Jan 2025 11:10:34 -0800")
References: <20250109191102.3772288-1-joshua.hahnjy@gmail.com>
Date: Tue, 21 Jan 2025 19:01:51 +0800
Message-ID: <87msfkh1ls.fsf@DESKTOP-5N7EMDA>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Hi, Joshua,

Sorry for the late reply.

Joshua Hahn writes:

> On Thu, 9 Jan 2025 09:18:18 -0800 Joshua Hahn wrote:
>
>> On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price wrote:
>>
>> > On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
>> > > Hi, hope you all had a nice year-end holiday :)
>> > >
>> > ... snip ...
>> > > Please let me know if there's any point we discussed that I am missing.
>> > >
>> > > Additionally, I would like to mention that within an internal discussion
>> > > my colleague Honggyu suggested introducing a 'mode' parameter which can
>> > > be either 'manual' or 'auto', instead of 'use_defaults', to provide a
>> > > more intuitive interface.
>> > >
>> > > With Honggyu's suggestion and the points we've discussed,
>> > > I think the interface could be:
>> > >
>> > > # At booting, the mode is 'auto' where the kernel can automatically
>> > > # update any weights.
>> > >
>> > > mode      auto           # User hasn't specified any weight yet.
>> > > effective [2, 1, -, -]   # Using system defaults for nodes 0-1,
>> > >                          # and nodes 2-3 not populated yet.
>> > >
>> > > # When a new NUMA node is added (e.g. via hotplug) in the 'auto' mode,
>> > > # all weights are re-calculated based on the ACPI HMAT table, including
>> > > # the weight of the new node.
>> > >
>> > > mode      auto           # User hasn't specified weights yet.
>> > > effective [2, 1, 1, -]   # Using system defaults for nodes 0-2,
>> > >                          # and node 3 not populated yet.
>> > >
>> > > # When the user sets at least one weight value, change the mode to
>> > > # 'manual', where the kernel does not update any weights automatically
>> > > # without the user's consent.
>> > >
>> > > mode      manual         # User changed the weight of node 0 to 4,
>> > >                          # changing the mode to manual config mode.
>> > > effective [4, 1, 1, -]
>> > >
>> > > # When a new NUMA node is added (e.g. via hotplug) in the manual mode,
>> > > # the new node's weight is zero because it's in manual mode and the user
>> > > # did not specify the weight for the new node yet.
>> > >
>> > > mode      manual
>> > > effective [4, 1, 1, 0]
>> > >
>> >
>> > 0's cannot show up in the effective list - the allocators can never
>> > perceive a 0, as there are (race) conditions where that may cause a div0.
>> >
>> > The actual content of the list may be 0, but the allocator will see '1'.
>> >
>> > IIRC this was due to lock/sleep limitations in the allocator paths and
>> > accessing this RCU-protected memory. If someone wants to take another
>> > look at the allocator paths and characterize the risk more explicitly,
>> > this would be helpful.
>>
>> Hi Gregory and Hyeonggon,
>>
>> Based on a quick look, I see that there can be a problematic scenario
>> in alloc_pages_bulk_array_weighted_interleave, where we sum up all
>> the weights from iw_table and divide by this sum. This _can_ be
>> problematic for two reasons, one of them being the div0 mentioned.
>>
>> Currently, you can access the weights in one of two ways.
>> The first way is to call get_il_weight, which retrieves a specified
>> node's weight under an RCU read lock. Within this function, it first
>> checks if the value at iw_table[nid] is 0, and if it is, returns 1.
>> Although this prevents a div0 scenario by ensuring that all weights are
>> nonzero, there is a coherency problem, since each call to get_il_weight
>> takes its own RCU read lock. Therefore, retrieving node weights within a
>> loop creates a race condition in which the state of iw_table may change
>> in between iterations of the loop.
>>
>> The second way is to directly dereference iw_table under an RCU lock,
>> copy its contents locally, then release the lock. This is how
>> alloc_pages_bulk_array_weighted_interleave currently calculates the sum.
>> The problem here is that even though we solve the coherency issue, there
>> is no check to ensure that this sum is nonzero. Thus, while an array of
>> weights [0,0,0,0] gets translated into [1,1,1,1] when inspecting each
>> node individually using get_il_weight, it is still stored internally as 0
>> and can lead to a div0 here.
>>
>> There are a few workarounds:
>> - Check that weight_total != 0 before performing the division.
>> - During the weight sum iteration, add weights[node] ? weights[node] : 1,
>>   like it is calculated within get_il_weight.
>> - Prevent users from ever storing 0 into a node.
>>
>> Of course, we can implement all three of these changes to make sure that
>> there are no unfortunate div0s. However, there are realistic scenarios
>> where we may want the node to actually have a weight of 0, so perhaps
>> it makes sense to just do the first two checks. I can write up a quick
>> patch to perform these checks, if it looks good to everyone.
>>
>> Please let me know if I missed anything as well.
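To make the first two workarounds concrete, a small user-space model of
that summation step could look like the sketch below. It uses a plain
array instead of the RCU-protected iw_table, with made-up weights, and
only illustrates the arithmetic, not the kernel code itself:

#include <stdio.h>

int main(void)
{
	/* Stand-in for a local copy of iw_table; 0 means "unset". */
	unsigned char weights[4] = { 0, 0, 0, 0 };
	unsigned int weight_total = 0;
	int node;

	/* Workaround 2: clamp a stored 0 to 1 while summing, which is
	 * what get_il_weight() does for a single node. */
	for (node = 0; node < 4; node++)
		weight_total += weights[node] ? weights[node] : 1;

	/* Workaround 1: never divide by a zero total anyway. */
	if (weight_total == 0) {
		fprintf(stderr, "all weights zero, bail out\n");
		return 0;
	}

	printf("rounds for 1024 pages: %u\n", 1024 / weight_total);
	return 0;
}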
>
> On second thought, the second bullet point doesn't make much sense, if
> we expect nodes to have 0 as a valid value. Here is something that could
> work for the first bullet point, though. I can send this as a separate
> patch, since it is not explicitly related to this thread.
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index cb355bdcdd12..afb0f2a7bd4f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2552,10 +2552,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	 * if (rounds > 0) and (delta == 0), resume_node will always be
>  	 * the node following prev_node and its weight.
>  	 */
> -	rounds = rem_pages / weight_total;
> -	delta = rem_pages % weight_total;
>  	resume_node = next_node_in(prev_node, nodes);
>  	resume_weight = weights[resume_node];
> +	if (weight_total == 0)
> +		goto out;
> +
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
>  	for (i = 0; i < nnodes; i++) {
>  		node = next_node_in(prev_node, nodes);
>  		weight = weights[node];
> @@ -2582,6 +2585,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  			break;
>  		prev_node = node;
>  	}
> +
> +out:
>  	me->il_prev = resume_node;
>  	me->il_weight = resume_weight;
>  	kfree(weights);
>
> Of course, the only way this can happen is if a user purposefully
> sets all of the node weights to 0, so I don't think this is something
> that should ever happen naturally. Even the new reduce_interleave_weights
> function explicitly checks and makes sure that the lowest possible value
> is 1.
>
> Again, please let me know if I am missing anything!

I don't think that "0" is a valid weight value. If you don't want to
allocate pages on some nodes, just don't specify them in the node mask
parameter when you call mbind() or set_mempolicy().

---
Best Regards,
Huang, Ying
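As an illustration of that last point, below is a minimal user-space
sketch that restricts weighted interleave to nodes 0 and 2 simply by
leaving the other nodes out of the nodemask, so no node ever needs a
weight of 0. The node numbers and the fallback MPOL_WEIGHTED_INTERLEAVE
definition are assumptions for the example only, not part of the patch
under discussion:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* see uapi/linux/mempolicy.h */
#endif

int main(void)
{
	/* Nodes left out of the mask are simply never allocated from,
	 * which is the alternative to storing a weight of 0 for them. */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);

	if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
		    &nodemask, 8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return 1;
	}

	/* New anonymous allocations in this task now interleave over
	 * nodes 0 and 2 according to their configured weights. */
	printf("weighted interleave limited to nodes 0 and 2\n");
	return 0;
}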