From: Joshua Hahn
To: Joshua Hahn
Cc: Gregory Price, Hyeonggon Yoo, "Huang, Ying", kernel_team@skhynix.com,
    42.hyeyoo@gmail.com, "rafael@kernel.org", "lenb@kernel.org",
    "gregkh@linuxfoundation.org", "akpm@linux-foundation.org",
    김홍규(KIM HONGGYU) System SW, 김락기(KIM RAKIE) System SW,
    "dan.j.williams@intel.com", "Jonathan.Cameron@huawei.com",
    "dave.jiang@intel.com", "horen.chuang@linux.dev", "hannes@cmpxchg.org",
    "linux-kernel@vger.kernel.org", "linux-acpi@vger.kernel.org",
    "linux-mm@kvack.org", "kernel-team@meta.com"
Subject: Re: [External Mail] Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning
Date: Thu, 9 Jan 2025 11:10:34 -0800
Message-ID: <20250109191102.3772288-1-joshua.hahnjy@gmail.com>
In-Reply-To: <20250109171821.3203865-1-joshua.hahnjy@gmail.com>

On Thu, 9 Jan 2025 09:18:18 -0800 Joshua Hahn wrote:

> On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price wrote:
>
> > On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
> > > Hi, hope you all had a nice year-end holiday :)
> > >
> > ... snip ...
> > > Please let me know if there's any point we discussed that I am missing.
> > >
> > > Additionally, I would like to mention that within an internal discussion
> > > my colleague Honggyu suggested introducing a 'mode' parameter, which can
> > > be either 'manual' or 'auto', instead of 'use_defaults' to provide a more
> > > intuitive interface.
> > >
> > > With Honggyu's suggestion and the points we've discussed,
> > > I think the interface could be:
> > >
> > > # At boot, the mode is 'auto', where the kernel can automatically
> > > # update any weights.
> > >
> > > mode       auto          # User hasn't specified any weight yet.
> > > effective  [2, 1, -, -]  # Using system defaults for node 0-1,
> > >                          # and node 2-3 not populated yet.
> > >
> > > # When a new NUMA node is added (e.g. via hotplug) in the 'auto' mode,
> > > # all weights are re-calculated based on the ACPI HMAT table, including
> > > # the weight of the new node.
> > >
> > > mode       auto          # User hasn't specified weights yet.
> > > effective  [2, 1, 1, -]  # Using system defaults for node 0-2,
> > >                          # and node 3 not populated yet.
> > >
> > > # When the user sets at least one weight value, change the mode to
> > > # 'manual', where the kernel does not update any weights automatically
> > > # without the user's consent.
> > >
> > > mode       manual        # User changed the weight of node 0 to 4,
> > >                          # changing the mode to manual config mode.
> > > effective  [4, 1, 1, -]
> > >
> > >
> > > # When a new NUMA node is added (e.g. via hotplug) in the manual mode,
> > > # the new node's weight is zero because it's in manual mode and the user
> > > # did not specify the weight for the new node yet.
> > >
> > > mode       manual
> > > effective  [4, 1, 1, 0]
> > >
> >
> > 0's cannot show up in the effective list - the allocators can never
> > perceive a 0 as there are (race) conditions where that may cause a div0.
> >
> > The actual content of the list may be 0, but the allocator will see '1'.
> >
> > IIRC this was due to lock/sleep limitations in the allocator paths and
> > accessing this RCU protected memory. If someone wants to take another
> > look at the allocator paths and characterize the risk more explicitly,
> > this would be helpful.
>
> Hi Gregory and Hyeonggon,
>
> Based on a quick look, I see that there can be a problematic scenario
> in alloc_pages_bulk_array_weighted_interleave where we sum up all
> the weights from iw_table and divide by this sum. This _can_ be problematic
> for two reasons, one of them being the div0 mentioned.
>
> Currently, you can access the weights in one of two ways:
> The first way is to call get_il_weight, which will retrieve a specified
> node's weight under an RCU read lock. Within this function, it first
> checks if the value at iw_table[nid] is 0, and if it is, returns 1.
> Although this prevents a div0 scenario by ensuring that all weights are
> nonzero, there is a coherency problem, since each call to get_il_weight
> takes its own RCU read lock. Therefore, retrieving node weights within a
> loop creates a race condition in which the state of iw_table may change
> in between iterations of the loop.
>
> The second way is to directly dereference iw_table under an RCU read lock,
> copy its contents locally, then release the lock. This is how
> alloc_pages_bulk_array_weighted_interleave currently calculates the sum.
> The problem here is that even though we solve the coherency issue, there
> is no check to ensure that this sum is nonzero. Thus, while an array of
> weights [0,0,0,0] gets translated into [1,1,1,1] when inspecting each
> node individually using get_il_weight, it is still stored internally as 0
> and can lead to a div0 here.
>
> There are a few workarounds:
> - Check that weight_total != 0 before performing the division.
> - During the weight sum iteration, add weights[node] ? weights[node] : 1,
>   as it is calculated within get_il_weight.
> - Prevent users from ever storing 0 into a node.
>
> Of course, we can implement all three of these changes to make sure that
> there are no unfortunate div0s. However, there are realistic scenarios
> where we may want the node to actually have a weight of 0, so perhaps
> it makes sense to just do the first two checks. I can write up a quick
> patch to perform these checks, if it looks good to everyone.
>
> Please let me know if I missed anything as well.

On second thought, the second bullet point doesn't make much sense if we
expect nodes to have 0 as a valid value. Here is something that could
work for the first bullet point, though. I can send this as a separate
patch since this is not explicitly related to this thread.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cb355bdcdd12..afb0f2a7bd4f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2552,10 +2552,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	 * if (rounds > 0) and (delta == 0), resume_node will always be
 	 * the node following prev_node and its weight.
 	 */
-	rounds = rem_pages / weight_total;
-	delta = rem_pages % weight_total;
 	resume_node = next_node_in(prev_node, nodes);
 	resume_weight = weights[resume_node];
+	if (weight_total == 0)
+		goto out;
+
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
 	for (i = 0; i < nnodes; i++) {
 		node = next_node_in(prev_node, nodes);
 		weight = weights[node];
@@ -2582,6 +2585,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 			break;
 		prev_node = node;
 	}
+
+out:
 	me->il_prev = resume_node;
 	me->il_weight = resume_weight;
 	kfree(weights);

Of course, the only way this can happen is if a user purposefully sets
all of the node weights to 0, so I don't think this is something that
should ever happen naturally. Even with the new reduce_interleave_weights
function, it manually checks and makes sure that the lowest possible
value is 1.

Again, please let me know if I am missing anything!
Joshua
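
For anyone following along, here is a rough user-space model of the two
access paths discussed above and of the weight_total guard. It is only a
sketch under stated assumptions: model_get_il_weight, model_weight_total
and model_split are hypothetical names, the table is a plain array
rather than RCU-protected memory, and none of this is the actual
mm/mempolicy.c code.

```c
#include <stddef.h>

/*
 * Hypothetical model of a per-node read in the style of get_il_weight:
 * a stored weight of 0 is reported to the caller as 1.
 */
unsigned int model_get_il_weight(const unsigned char *table, size_t nid)
{
	unsigned int weight = table[nid];

	return weight ? weight : 1;
}

/*
 * Hypothetical model of the bulk path: sum the raw stored weights from a
 * local snapshot, with no 0 -> 1 translation. The result can be 0.
 */
unsigned int model_weight_total(const unsigned char *table, size_t nnodes)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < nnodes; i++)
		total += table[i];
	return total;
}

/*
 * Split rem_pages into full rounds plus a remainder, guarding the
 * division. Returns -1 (leaving both outputs at 0) when weight_total
 * is 0, and 0 otherwise.
 */
int model_split(unsigned long rem_pages, unsigned int weight_total,
		unsigned long *rounds, unsigned long *delta)
{
	*rounds = 0;
	*delta = 0;
	if (weight_total == 0)
		return -1;

	*rounds = rem_pages / weight_total;
	*delta = rem_pages % weight_total;
	return 0;
}
```

With a stored table of [0, 0, 0, 0], model_get_il_weight reports 1 for
every node while model_weight_total returns 0, which is the mismatch
that makes the unguarded division risky.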