From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Gregory Price, Srinivasulu Thanneeru
Subject: Re: [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
In-Reply-To: (Gregory Price's message of "Wed, 31 Jan 2024 12:29:37 -0500")
References: <20240130182046.74278-1-gregory.price@memverge.com>
	<20240130182046.74278-4-gregory.price@memverge.com>
	<877cjqgfzz.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87y1c5g8qw.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 01 Feb 2024 09:55:07 +0800
Message-ID: <871q9xeyo4.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory
Price writes:

> On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > I think this is handled already?  It is definitely an explicit race
>> > condition that is documented elsewhere:
>> >
>> > /*
>> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
>> >  *
>> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
>> >  * policies are protected by task->mems_allowed_seq to prevent a premature
>> >  * OOM/allocation failure due to parallel nodemask modification.
>> >  */
>>
>> Thanks for pointing this out!
>>
>> If we use the task->mems_allowed_seq reader side in
>> weighted_interleave_nodes(), we can guarantee the consistency of
>> policy->nodes.  That may not be worthwhile, though, because it's not a
>> big deal to allocate 1 page on a wrong node.
>>
>> It makes more sense to do that in
>> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
>> be allocated there.
>>
>
> To save the versioning if there are issues, here are the 3 diffs that
> I have left.  If you are good with these changes, I'll squash the first
> 2 into the third commit, keep the last one as a separate commit (it
> changes the interleave_nodes() logic too), and submit v5 w/ your
> reviewed tag on all of them.
>
>
> Fix one (pedantic?) warning from syzbot:
> ----------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b1437396c357..dfd097009606 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> 	unsigned long nr_allocated = 0;
> 	unsigned long rounds;
> 	unsigned long node_pages, delta;
> -	u8 __rcu *table, *weights, weight;
> +	u8 __rcu *table, __rcu *weights, weight;

The __rcu usage can be checked with `sparse` directly.
For example,

  make C=1 mm/mempolicy.o

More details can be found in

https://www.kernel.org/doc/html/latest/dev-tools/sparse.html

Per my understanding, we shouldn't use "__rcu" here.  Please search
"__rcu" in the following document.

https://www.kernel.org/doc/html/latest/RCU/checklist.html

> 	unsigned int weight_total = 0;
> 	unsigned long rem_pages = nr_pages;
> 	nodemask_t nodes;
>
>
> Simplifying resume_node/weight logic:
> -------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2c1aef8eab70..b0ca9bcdd64c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> 		page_array += nr_allocated;
> 		total_allocated += nr_allocated;
> 		/* if that's all the pages, no need to interleave */
> -		if (rem_pages < weight) {
> -			/* stay on current node, adjust il_weight */
> +		if (rem_pages <= weight) {
> 			me->il_weight -= rem_pages;
> 			return total_allocated;
> -		} else if (rem_pages == weight) {
> -			/* move to next node / weight */
> -			me->il_prev = next_node_in(node, nodes);
> -			me->il_weight = get_il_weight(me->il_prev);
> -			return total_allocated;
> 		}
> 		/* Otherwise we adjust remaining pages, continue from there */
> 		rem_pages -= weight;
> @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> 			node_pages += weight;
> 			delta -= weight;
> 		} else if (delta) {
> +			/* when delta is deleted, resume from that node */
                                        ~~~~~~~
                                        depleted?
> 			node_pages += delta;
> -			/* delta may deplete on a boundary or w/ a remainder */
> -			if (delta == weight) {
> -				/* boundary: resume from next node/weight */
> -				resume_node = next_node_in(node, nodes);
> -				resume_weight = weights[resume_node];
> -			} else {
> -				/* remainder: resume this node w/ remainder */
> -				resume_node = node;
> -				resume_weight = weight - delta;
> -			}
> +			resume_node = node;
> +			resume_weight = weight - delta;
> 			delta = 0;
> 		}
> 		/* node_pages can be 0 if an allocation fails and rounds == 0 */
>
>
> task->mems_allowed_seq protection (added as 4th patch)
> ------------------------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b0ca9bcdd64c..b1437396c357 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  {
> 	unsigned int node = current->il_prev;
> +	unsigned int cpuset_mems_cookie;
>
> +retry:
> +	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> +	cpuset_mems_cookie = read_mems_allowed_begin();
> 	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> 		node = next_node_in(node, policy->nodes);

"node" will be changed in the loop.  So we need to change the logic here.
> -		/* can only happen if nodemask is being rebound */
> +		if (read_mems_allowed_retry(cpuset_mems_cookie))
> +			goto retry;
> 		if (node == MAX_NUMNODES)
> 			return node;
> 		current->il_prev = node;
> @@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
> 	unsigned int nid;
> +	unsigned int cpuset_mems_cookie;
> +
> +	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> +	do {
> +		cpuset_mems_cookie = read_mems_allowed_begin();
> +		nid = next_node_in(current->il_prev, policy->nodes);
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> -	nid = next_node_in(current->il_prev, policy->nodes);
> 	if (nid < MAX_NUMNODES)
> 		current->il_prev = nid;
> +
> 	return nid;
>  }
>
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> 					struct page **page_array)
>  {
> 	struct task_struct *me = current;
> +	unsigned int cpuset_mems_cookie;
> 	unsigned long total_allocated = 0;
> 	unsigned long nr_allocated = 0;
> 	unsigned long rounds;
> @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> 	int prev_node;
> 	int i;
>
> +

Change by accident?

> 	if (!nr_pages)
> 		return 0;
>
> -	nnodes = read_once_policy_nodemask(pol, &nodes);
> +	/* read the nodes onto the stack, retry if done during rebind */
> +	do {
> +		cpuset_mems_cookie = read_mems_allowed_begin();
> +		nnodes = read_once_policy_nodemask(pol, &nodes);
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> +	/* if the nodemask has become invalid, we cannot do anything */
> 	if (!nnodes)
> 		return 0;

--
Best Regards,
Huang, Ying