From: Joshua Hahn
To: Gregory Price
Cc: Hyeonggon Yoo, "Huang, Ying", kernel_team@skhynix.com, 42.hyeyoo@gmail.com, "rafael@kernel.org", "lenb@kernel.org", "gregkh@linuxfoundation.org", "akpm@linux-foundation.org", 김홍규(KIM HONGGYU) System SW, 김락기(KIM RAKIE) System SW, "dan.j.williams@intel.com", "Jonathan.Cameron@huawei.com", "dave.jiang@intel.com", "horen.chuang@linux.dev", "hannes@cmpxchg.org", "linux-kernel@vger.kernel.org", "linux-acpi@vger.kernel.org", "linux-mm@kvack.org", "kernel-team@meta.com"
Subject: Re: [External Mail] Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning
Date: Thu, 9 Jan 2025 09:18:18 -0800
Message-ID: <20250109171821.3203865-1-joshua.hahnjy@gmail.com>

On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price wrote:
> On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
> > Hi, hope you all had a nice year-end holiday :)
> >
> ... snip ...
> > Please let me know if there's any point we discussed that I am missing.
> >
> > Additionally I would like to mention that within an internal discussion
> > my colleague Honggyu suggested introducing 'mode' parameter which can be
> > either 'manual' or 'auto' instead of 'use_defaults' to be provide more
> > intuitive interface.
> >
> > With Honggyu's suggestion and the points we've discussed,
> > I think the interface could be:
> >
> > # At booting, the mode is 'auto' where the kernel can automatically
> > # update any weights.
> >
> > mode auto               # User hasn't specified any weight yet.
> > effective [2, 1, -, -]  # Using system defaults for node 0-1,
> >                         # and node 2-3 not populated yet.
> >
> > # When a new NUMA node is added (e.g. via hotplug) in the 'auto' mode,
> > # all weights are re-calculated based on ACPI HMAT table, including the
> > # weight of the new node.
> >
> > mode auto               # User hasn't specified weights yet.
> > effective [2, 1, 1, -]  # Using system defaults for node 0-2,
> >                         # and node 3 not populated yet.
> >
> > # When user set at least one weight value, change the mode to 'manual'
> > # where the kernel does not update any weights automatically without
> > # user's consent.
> >
> > mode manual             # User changed the weight of node 0 to 4,
> >                         # changing the mode to manual config mode.
> > effective [4, 1, 1, -]
> >
> >
> > # When a new NUMA node is added (e.g. via hotplug) in the manual mode,
> > # the new node's weight is zero because it's in manual mode and user
> > # did not specify the weight for the new node yet.
> >
> > mode manual
> > effective [4, 1, 1, 0]
> >
>
> 0's cannot show up in the effective list - the allocators can never
> percieve a 0 as there are (race) conditions where that may cause a div0.
>
> The actual content of the list may be 0, but the allocator will see '1'.
>
> IIRC this was due to lock/sleep limitations in the allocator paths and
> accessing this RCU protected memory. If someone wants to take another
> look at the allocator paths and characterize the risk more explicitly,
> this would be helpful.

Hi Gregory and Hyeonggon,

Based on a quick look, I see that there can be a problematic scenario in
alloc_pages_bulk_array_weighted_interleave, where we sum up all the weights
from iw_table and divide by this sum. This _can_ be problematic for two
reasons, one of them being the div0 mentioned above; the other is keeping the
weights coherent while they are being read.

Currently, you can access the weights in one of two ways:

The first way is to call get_il_weight, which retrieves a specified node's
weight under an RCU read lock. This function first checks whether the value at
iw_table[nid] is 0, and if it is, returns 1. Although this prevents a div0 by
ensuring that all weights are nonzero, there is a coherency problem, since each
call to get_il_weight takes its own RCU read lock.
Therefore, retrieving node weights within a loop creates a race condition in
which the state of iw_table may change between iterations of the loop.

The second way is to directly dereference iw_table under an RCU read lock, copy
its contents locally, then release the lock. This is how
alloc_pages_bulk_array_weighted_interleave currently calculates the sum. Even
though this solves the coherency issue, there is no check to ensure that the
sum is nonzero. Thus, while an array of weights [0,0,0,0] is translated into
[1,1,1,1] when each node is inspected individually using get_il_weight, it is
still stored internally as 0 and can lead to a div0 here.

There are a few workarounds:
- Check that weight_total != 0 before performing the division.
- During the weight sum iteration, add weights[node] ? weights[node] : 1,
  as get_il_weight does.
- Prevent users from ever storing 0 as a node's weight.

Of course, we can implement all three of these changes to make sure that there
are no unfortunate div0s. However, there are realistic scenarios where we may
want a node to actually have a weight of 0, so perhaps it makes sense to do
just the first two checks. I can write up a quick patch to perform these
checks, if it looks good to everyone.

Please let me know if I missed anything as well. Hope you all have a great day!
Joshua