From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1719CC282DE for ; Mon, 10 Mar 2025 17:34:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7AA9E280025; Mon, 10 Mar 2025 13:34:24 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 759EE280024; Mon, 10 Mar 2025 13:34:24 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5FB5F280025; Mon, 10 Mar 2025 13:34:24 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 39E7F280024 for ; Mon, 10 Mar 2025 13:34:24 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id F2431B87C8 for ; Mon, 10 Mar 2025 17:34:25 +0000 (UTC) X-FDA: 83206340490.22.F0C4DBD Received: from mail-yw1-f180.google.com (mail-yw1-f180.google.com [209.85.128.180]) by imf17.hostedemail.com (Postfix) with ESMTP id 204E440015 for ; Mon, 10 Mar 2025 17:34:23 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=SHIdN0zA; spf=pass (imf17.hostedemail.com: domain of joshua.hahnjy@gmail.com designates 209.85.128.180 as permitted sender) smtp.mailfrom=joshua.hahnjy@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1741628064; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=6OqdZvwViuemiAWalUFPpXAEeEJqmTa93Po2Bd4FDRE=; b=iSizTPRVa2aZGp2YaBsyKv+pMzrX9HyJl+sj4APArtd9g2YypcEOWTiTGZr9WlQL86/qxy l6ksi6y/8AcPWLp8Ir9lrnmXrOluJaBkaf3JxhMXM+Bn5eOGWlwqqtVBBlune5aK2wgLTp yReLJ3f4bzgr3BKyuT+lF3BN9Cc3NsY= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1741628064; a=rsa-sha256; cv=none; b=fAJurEMMZ8lVkS5RhxfHxUuBW58yRvfjSq/90EMI9nFjMAzx+r5dv4Zy5NcM07Ft6zPWa5 x38WEH5jIp1+czlR2VBz9KnV9nmChsC8ZS2qsAVJHJOpM6qmlO4FvqGv+Hg/DMQjILGmCt NPI3wDyvsJzhiTvdH+Mh29zT6RszXNQ= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=SHIdN0zA; spf=pass (imf17.hostedemail.com: domain of joshua.hahnjy@gmail.com designates 209.85.128.180 as permitted sender) smtp.mailfrom=joshua.hahnjy@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-yw1-f180.google.com with SMTP id 00721157ae682-6f47ed1f40dso34900417b3.1 for ; Mon, 10 Mar 2025 10:34:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1741628063; x=1742232863; darn=kvack.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=6OqdZvwViuemiAWalUFPpXAEeEJqmTa93Po2Bd4FDRE=; b=SHIdN0zANUJO5LlMWiL+51caTx7G5NoS3Cr2bdNsGP319eiPj/+o0a6sPoymXDOb4y WhEBPCrPrF0inwUutxLDs2xaSY+rAlwiG3ADpERwhsooCMTN1t69dOh91IDxU7a59MOV 0BSNjnytZS+e9brLZfuHz2k27IgnxM3TnF6xw3poNY1RtltXJtykipE0j6FBYEiX5ND2 Juf6ibAg8vjOt+5hRxPCk48KzKLG/wB229fBw8TnaXeM0MyN0jnFons7ybU/DRqaw0G1 0Zf1DdMwU9T/BjJfjbBAcR/vVXA1/tWSqQXSRsz6WGgZJ7yIXLuuS/8W5Gu5YTLldyVI yUOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1741628063; x=1742232863; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=6OqdZvwViuemiAWalUFPpXAEeEJqmTa93Po2Bd4FDRE=; b=a48XmHIUc37ohRRDbofz7tic+VBNNcxhdz7wPiUdZ8Qol2HQZmu7UCrVFNkn8U5O5Z WlfVHYqlHGqpxA4HmP8cBVsTmvb3/kZouC6IZR68krOg0IWRcvUlrVD3v7LZ3Rxt8NS8 FyBKBgc5g3yfG6pMZmiW6/oGzDuuTQg1TUjbnddcLT5KTRwj76Gl4QY0wM7xLasHcEup tkrXNePd2OjehflNnC28leiajdvxUucCMnn2obC+69h54dNmcHnizUI2KfD3DINyKB/r /0DJEI+34BRwQgfj6stZ99djkOakcNt5GTQn9y/WZUAESTBRdnemP0yePN0WM2fWYny4 IokQ== X-Forwarded-Encrypted: i=1; AJvYcCWq9HUM0jQhkcfRxFKCJ9B0FJxWijWhGu1XcM//GcYIUsF+NZRYcXydXHajWPRZSlS8ksGu0X6uCw==@kvack.org X-Gm-Message-State: AOJu0Ywdo1ZXxcveJIv8wh2bEWMFhepoICJ3HfXwFPTbaOyH4QeF7tk3 NAdcNQzUaFXQ1XelIs8ZiqQ7d16OGjk5wji5KmZWgxpb53p6YqYv X-Gm-Gg: ASbGncvA/ZqVI3o9uX42qKss6/qa3rAMnR58qYt1fjx/7gUdD5OFBwh9Te+L/wm0QZx d5WJoU1udRQs1RXoqsksG5MynEmmcgzH1RPffXxBw2Ub+ZW/koPknqqusi8c+oi7avXD2Ym8BrF Hy7/JhczR/SoOsogbFItKqTlJNNmRNOtKUEDuRVG3BMDmraUQg5itMCUn+WlpslLAjV9+HOLQgm 1uKjHUzDV4zTrfvtyGqPgXiJsm54XdQT1LX9thOlnY+O2Lc10oBAx1VB2HsRZW4Xz5m/FVdToem SVfA9AOq3QhMzJx7c7siI/XV4S943wOHt1kPQpXD+8g= X-Google-Smtp-Source: AGHT+IFFP80rhMMdb4dQpy8xACZsyTMqZugcmMkdrMh8ivN/mf7QiBrFhw+eHS5oi/KRsnSfOfwc+g== X-Received: by 2002:a05:690c:670f:b0:6fd:32ab:ed28 with SMTP id 00721157ae682-6febf2d84b1mr217949967b3.15.1741628062939; Mon, 10 Mar 2025 10:34:22 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:8::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-6feb2c2e885sm22133987b3.70.2025.03.10.10.34.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Mar 2025 10:34:22 -0700 (PDT) From: Joshua Hahn To: "Huang, Ying" , akpm@linux-foundation.org Cc: gourry@gourry.net, harry.yoo@oracle.com, honggyu.kim@sk.com, yunjeong.mun@sk.com, gregkh@linuxfoundation.org, rakie.kim@sk.com, rafael@kernel.org, lenb@kernel.org, dan.j.williams@intel.com, Jonathan.Cameron@huawei.com, dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org, osalvador@suse.de, linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com Subject: Re: [PATCH v7] mm/mempolicy: Weighted Interleave Auto-tuning Date: Mon, 10 Mar 2025 10:33:37 -0700 Message-ID: <20250310173420.539849-1-joshua.hahnjy@gmail.com> X-Mailer: git-send-email 2.43.5 In-Reply-To: <874j0162vt.fsf@DESKTOP-5N7EMDA> References: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Queue-Id: 204E440015 X-Stat-Signature: gq6zyj9xfmidnndqk1misftfwypza1f1 X-Rspamd-Server: rspam10 X-HE-Tag: 1741628063-14096 X-HE-Meta: U2FsdGVkX18XMOCb5LuofPC1KKU/hZSqlkgpYHeFY8zudm5gnr8PwZ8mlgX5YrRzo4GzxGUqy4gpTtca9ffvBKPu6Q+zmbIcE6LxJvghUve/4GCMgNvNsD1BRpRlaOsbBvV1Vh5MJ51qj6EEae3sZ7PqAwfrUhTi6Ip3xuGiTaXmsJLmjS0R99XtzRX58smfYN1F2rlTZ5lOi88mEn6RFn+T2MQYqcxx8VyfKDH/2viFfy0AUaeuf3YsuvZgIAQJOaOv01qDnKC0PUAlu2GcvVESu8GbxE2iuWGCcj/2am22P6yTRUxiPnRQELNDiSUDr0aS6C/pvb2UeN43vfKZcQWeLuxUqclkfwrJPgPoQ2kNPu76mkxlUYigTMWSQiBzTeucZYfCax+/C6709+nwkkSoGa8SPJs8QaYczr6IMk1fyMa4ELjU8DX4063N5lv2qgaSM2rUi+xF3GPE9S+WIy5J5baBS+7ZGeh4oZHkHUSG0xrJdm9UZaxfUzLrLgnGXerBc/ap2373c8Pbr9vKjsHdzkXZYMHs40aMF+6b+BVNOJqMKyK+767dg2OMGMFv9LybGl1iGtM/WdsJP2VSd2/fawINcYTVB5N0ZIt5K2cRhaeMHrVIfd9bRv5FCuafeMmkukGyDhmOGxZVINwwWEByyE0Ade04/daNFYT5aMa/g96rRu0zp4jxflsqUmhKTTms//SqU6aMSU1mMOFFeEeJVPHQls74lMImxSQ9FVi9nYVkjvXyGe9eVl/JTFC8pw75VcynxGBdfa8oa2Aee3Kum2wh7Wt+hwr/ritymGwl6DgKa9BqrICc5ElM2SkarbYaSV1BBZgSfndipAU6/f0qVR57q6ccJQHE098R3ipdiyQrSvinOrEi1f5gvkvYd4ILq1ENIWBFnEXDw+K5fdwJz6jpQalIw/EAqfkODSTs197tGDWrqrLHdIJGIDZLIX1VXeYtz3vyaa1ZfzG cR3IdX9L dnXv3w86Xj4HxXobelmC/OH4icxxDyexv4buwLuqPtTKo5DkwJ4QUjui05ZkYg9eITGEnkURY0EEnDJaE8Sviv8aK+V89cjC7D1NZKO5eP4JJD4EZMGTw1Db23aOkf7iEM0V9CXc/MWCsQVyiAlAizdi2ZFrh/Xibi4c3Ti1iqU96wrTRb3S7BtmztREYkVxNbWJKesIyR9ROi7cSgf/skfoVtQste11eLNfqiIuUv+Nkb06hRGpv5KCkTSGkw7ALtmOjEGIugOwn9UlJYTiQjAe8tk62j1XgS1zLuUq7bI2kPdRf51LHINZK0FBwfiITfPItj7/ocA4RYxbo8GlHlJT2xAfUuwyahGuRwZ3ey26Adz69NC5n0pbHwvl4OH+4bX4bPro2G1GG/IFA2s4ZAszNQKIMkmJ4Xc24P3LACgjZvzmMOrjHlLe2G02vTaf4HJUlKpBi7A8AiqLY1NAcc0IGlLgWMYJY5ah/zHqKhgUeUiycDntHBW6tYo+L1MPL2XlnnLnk5FP6JH3eU0UHXvXDoJxwTJ4PE2nYU1TBOHgcbRv5v4K/E60znmY/bmtjK13a X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, 10 Mar 2025 10:22:30 +0800 "Huang, Ying" wrote: Hello Andrew, I'm sincerely sorry, but I think I think that there are some RCU race conditions that I overlooked in this patch. Would it be ok with you to pull the patch out of mm-unstable once more, and for me to send a v8? I think it would also be safe to wait for Ying's signature on this patch as well, since he has been reviewing this patch since the first iteration. Thank you for your help as always! > Hi, Joshua, > > Thanks for your new version. > > Joshua Hahn writes: > > > On machines with multiple memory nodes, interleaving page allocations > > across nodes allows for better utilization of each node's bandwidth. > > Previous work by Gregory Price [1] introduced weighted interleave, which > > allowed for pages to be allocated across nodes according to user-set ratios. > > > > Ideally, these weights should be proportional to their bandwidth, so > > that under bandwidth pressure, each node uses its maximal efficient > > bandwidth and prevents latency from increasing exponentially. > > > > Previously, weighted interleave's default weights were just 1s -- which > > would be equivalent to the (unweighted) interleave mempolicy, which goes > > through the nodes in a round-robin fashion, ignoring bandwidth information. > > > > This patch has two main goals: > > First, it makes weighted interleave easier to use for users who wish to > > relieve bandwidth pressure when using nodes with varying bandwidth (CXL). > > By providing a set of "real" default weights that just work out of the > > box, users who might not have the capability (or wish to) perform > > experimentation to find the most optimal weights for their system can > > still take advantage of bandwidth-informed weighted interleave. > > > > Second, it allows for weighted interleave to dynamically adjust to > > hotplugged memory with new bandwidth information. Instead of manually > > updating node weights every time new bandwidth information is reported > > or taken off, weighted interleave adjusts and provides a new set of > > default weights for weighted interleave to use when there is a change > > in bandwidth information. > > > > To meet these goals, this patch introduces an auto-configuration mode > > for the interleave weights that provides a reasonable set of default > > weights, calculated using bandwidth data reported by the system. In auto > > mode, weights are dynamically adjusted based on whatever the current > > bandwidth information reports (and responds to hotplug events). > > > > This patch still supports users manually writing weights into the nodeN > > sysfs interface by entering into manual mode. When a user enters manual > > mode, the system stops dynamically updating any of the node weights, > > even during hotplug events that shift the optimal weight distribution. > > > > A new sysfs interface "auto" is introduced, which allows users to switch > > between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The > > system also automatically enters manual mode when a nodeN interface is > > manually written to. > > > > There is one functional change that this patch makes to the existing > > weighted_interleave ABI: previously, writing 0 directly to a nodeN > > interface was said to reset the weight to the system default. Before > > this patch, the default for all weights were 1, which meant that writing > > 0 and 1 were functionally equivalent. > > Forget to describe the new functionality? Hi Ying, thank you for reviewing my patch again! Thank you for letting me know. When I re-wrote the patch letter from v5-->v6, I was reworking this portion, and tried to make it shorter and shorter... and I think I missed being explicit about what the new behavior is. [...snip...] > > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > > index 0b7972de04e9..862b19943a85 100644 > > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > > @@ -20,6 +20,34 @@ Description: Weight configuration interface for nodeN > > Minimum weight: 1 > > Maximum weight: 255 > > > > - Writing an empty string or `0` will reset the weight to the > > - system default. The system default may be set by the kernel > > - or drivers at boot or during hotplug events. > > + Writing invalid values (i.e. any values not in [1,255], > > + empty string, ...) will return -EINVAL. > > + > > + Changing the weight to a valid value will automatically > > + update the system to manual mode as well. > > + > > +What: /sys/kernel/mm/mempolicy/weighted_interleave/auto > > +Date: February 2025 > > +Contact: Linux memory management mailing list > > +Description: Auto-weighting configuration interface > > + > > + Configuration mode for weighted interleave. A 'Y' indicates > > + that the system is in auto mode, and a 'N' indicates that > > + the system is in manual mode. > > str_true_false() is used to show the attribute, so the "true/false" will > be displayed? Yep, makes sense to me! > > + > > + In auto mode, all node weights are re-calculated and overwritten > > + (visible via the nodeN interfaces) whenever new bandwidth data > > + is made available during either boot or hotplug events. > > + > > + In manual mode, node weights can only be updated by the user. > > + Note that nodes that are onlined with previously set weights > > + will inherit those weights. If they were not previously set or > > s/inherit/reuse/? > > However my English is poor, so keep it if you think that is better. Hmm, I think reuse is indeed the better word to use here. Inherit kind of makes it seeem like there is some parent-child hierarchy, which is definitely not the case here. > > + are onlined with missing bandwidth data, the weights will use > > + a default weight of 1. > > + > > + Writing Y or 1 to the interface will enable auto mode, while > > kstrtobool() is used to parser user input, so maybe something like > below? > > Writing any true value string (e.g., Y or 1) will enable auto mode. Noted, I will take this change as well. > > + writing N or 0 will enable manual mode. All other strings will > > + be ignored, and -EINVAL will be returned. > > + > > + Writing a new weight to a node directly via the nodeN interface > > + will also automatically update the system to manual mode. > > s/update/switch/? > > Again, keep your words if think that it's better. And here as well. Thank you for the suggestions! [...snip...] > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > > index ce9885e0178a..78f1299bdd42 100644 > > --- a/include/linux/mempolicy.h > > +++ b/include/linux/mempolicy.h > > @@ -11,6 +11,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -56,6 +57,11 @@ struct mempolicy { > > } w; > > }; > > > > +struct weighted_interleave_state { > > + bool mode_auto; > > Just "auto" looks more natural for me. However, I have no strong > opinion on thist. Yep, my concern was that just leaving "auto" might be a bit vague, but there is always the doucmentation for folks to reference if they are confused. > > + u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */ > > What is "null" array? You are right, there is no concept of a null array in a dynamically sized struct > IIUC, iw_state is prevous iw_table now, so we may replace this with, > > A null wi_state is interpreted as mode is "auto" and the weight of any > node is "1". Yup, this makes sense. The only 2 cases with a "null" iw_table is if wi_state itself is null, or if the length is 0 (in which case the table isn't null, it will just point to the next address in memory). I'll take your new description here. Thank you for the suggestion! > > /* > > - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes > > - * system-default value should be used. A NULL iw_table also denotes that > > - * system-default values should be used. Until the system-default table > > - * is implemented, the system-default is always 1. > > - * > > - * iw_table is RCU protected > > + * weightiness balances the tradeoff between small weights (cycles through nodes > > + * faster, more fair/even distribution) and large weights (smaller errors > > + * between actual bandwidth ratios and weight ratios). 32 is a number that has > > + * been found to perform at a reasonable compromise between the two goals. > > */ > > -static u8 __rcu *iw_table; > > -static DEFINE_MUTEX(iw_table_lock); > > +static const int weightiness = 32; > > + > > +/* wi_state is RCU protected */ > > "__rcu" below can replace the above comments? Yes, I will remove the comments above. > > +static struct weighted_interleave_state __rcu *wi_state; > > +static unsigned int *node_bw_table; > > + > > +/* > > + * wi_state_lock protects both wi_state and node_bw_table. > > + * node_bw_table is only used by writers to update wi_state. > > + */ > > +static DEFINE_MUTEX(wi_state_lock); > > > > static u8 get_il_weight(int node) > > { > > - u8 *table; > > - u8 weight; > > + u8 weight = 1; > > > > rcu_read_lock(); > > - table = rcu_dereference(iw_table); > > - /* if no iw_table, use system default */ > > - weight = table ? table[node] : 1; > > - /* if value in iw_table is 0, use system default */ > > - weight = weight ? weight : 1; > > + if (rcu_access_pointer(wi_state)) > > + weight = rcu_dereference(wi_state)->iw_table[node]; > > IIUC, wi_state may be changed between rcu_access_pointer() and > rcu_dereference(). If so, it's better to use rcu_dereference() > directly. Yes, you are correct. To be completely transparent, I had misunderstood the rcu_dereference_pointer function and had assumed NULL pointers should not be passed to it. I now understand that we should actually do the null check afterwards. There are a few other places where this is used -- I'll go and fix all of them. [...snip...] > > +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > > +{ > > + struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; > > + unsigned int *old_bw, *new_bw; > > + unsigned int bw_val; > > + int i; > > + > > + bw_val = min(coords->read_bandwidth, coords->write_bandwidth); > > + new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL); > > + if (!new_bw) > > + return -ENOMEM; > > + > > + new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids), > > + GFP_KERNEL); > > NIT: because we will always initialize new_wi_state->iw_table[] below, > we can just use kmalloc() and initailze new_wi_state->mode_auto? Yes, this makes sense to me. > > + if (!new_wi_state) { > > + kfree(new_bw); > > + return -ENOMEM; > > + } > > + for (i = 0; i < nr_node_ids; i++) > > + new_wi_state->iw_table[i] = 1; > > + > > + /* > > + * Update bandwidth info, even in manual mode. That way, when switching > > + * to auto mode in the future, iw_table can be overwritten using > > + * accurate bw data. > > + */ > > + mutex_lock(&wi_state_lock); > > + > > + old_bw = node_bw_table; > > + if (old_bw) > > + memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int)); > > I prefer > > memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw)); > > a little. But it's not a big deal. We can do this. old_bw should not be null here, anyways! > > + new_bw[node] = bw_val; > > + node_bw_table = new_bw; > > + > > + /* wi_state not initialized yet; assume auto == true */ > > + if (!rcu_access_pointer(wi_state)) > > + goto reduce; > > + > > + old_wi_state = rcu_dereference_protected(wi_state, > > + lockdep_is_held(&wi_state_lock)); > > + if (old_wi_state->mode_auto) > > Because we can use "!old_wi_state || !old_wi_state->mode_auto" here, I > don't think rcu_access_pointer() above gives us something. Sounds good as well. > > + goto reduce; > > + > > + mutex_unlock(&wi_state_lock); > > + kfree(new_wi_state); > > + kfree(old_bw); > > + return 0; > > + > > +reduce: > > + new_wi_state->mode_auto = true; > > + reduce_interleave_weights(new_bw, new_wi_state->iw_table); > > + rcu_assign_pointer(wi_state, new_wi_state); > > + > > + mutex_unlock(&wi_state_lock); > > + if (old_wi_state) { > > + synchronize_rcu(); > > + kfree(old_wi_state); > > + } > > + kfree(old_bw); > > + > > + return 0; > > +} > > + > > /** > > * numa_nearest_node - Find nearest node by state > > * @node: Node id to start the search > > @@ -1988,34 +2093,33 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) > > u8 *table; > > unsigned int weight_total = 0; > > u8 weight; > > - int nid; > > + int nid = 0; > > > > nr_nodes = read_once_policy_nodemask(pol, &nodemask); > > if (!nr_nodes) > > return numa_node_id(); > > > > rcu_read_lock(); > > - table = rcu_dereference(iw_table); > > + if (!rcu_access_pointer(wi_state)) > > + goto out; > > If wi_state == NULL, why should we always return 0? IIUC, wi_state == > NULL means the weight of any node is 1. That is true, we can still find out what the correct value should be based on just assuming all weights to be 1 -- I will make this change. > > + > > + table = rcu_dereference(wi_state)->iw_table; > > /* calculate the total weight */ > > - for_each_node_mask(nid, nodemask) { > > - /* detect system default usage */ > > - weight = table ? table[nid] : 1; > > - weight = weight ? weight : 1; > > - weight_total += weight; > > - } > > + for_each_node_mask(nid, nodemask) > > + weight_total += table ? table[nid] : 1; > > When will table be NULL here? It couldn't before. But given your feedback above, we can just set table to be null if iw_table does not exist, and the code should behave as intended. [...snip...] > > +update_wi_state: > > + rcu_assign_pointer(wi_state, new_wi_state); > > + mutex_unlock(&wi_state_lock); > > + if (old_wi_state) { > > + synchronize_rcu(); > > + kfree(old_wi_state); > > + } > > + return count; > > +} > > + > > +static struct kobj_attribute wi_attr = > > NIT: "wi_attr" appears too general for me. Maybe something like > "wi_auto_attr"? Will do! > --- > Best Regards, > Huang, Ying Thank you for all of your feedback, Ying! I will send out a v8 soon with all of your proposed changes. Have a great day! Joshua Sent using hkml (https://github.com/sjp38/hackermail)