Date: Fri, 27 Jun 2025 00:28:33 -0400
From: Gregory Price
To: Joshua Hahn
Cc: Andrew Morton, Alistair Popple, Byungchul Park, David Hildenbrand, Matthew Brost, Rakie Kim, Ying Huang, Zi Yan, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave
References: <20250626200936.3974420-1-joshua.hahnjy@gmail.com> <20250626200936.3974420-3-joshua.hahnjy@gmail.com>
In-Reply-To: <20250626200936.3974420-3-joshua.hahnjy@gmail.com>

On Thu, Jun 26, 2025 at 01:09:34PM -0700, Joshua Hahn wrote:
> Currently, alloc_pages_bulk_weighted_interleave can make up to nr_node_ids+1
> calls to __alloc_pages_bulk. The additional allocation can happen if the
> previous call to this function finished the weighted round robin allocation
> partially on a node. To make up for this, the next time this function is
> called, an extra allocation is made to finish cleanly on the node boundaries
> before performing the weighted round-robin cycles again.
> 

task->il_weight can be operated on by other weighted_interleave functions
in mempolicy, so it is not just alloc_pages_bulk_weighted_interleave that
affects this. The observations here are still true; this is just a
correction for clarity, i.e. the additional allocation happens for the
current il_node/il_weight.

I *think* I did it this way just because it was easier to reason about the
two chunks separately, so I don't see a reason not to improve this.

I will say that, at least at the time, I took the core math and validated
the edge conditions in a separate program. If you get the math wrong in the
kernel, you'd either fail to allocate - or, more likely, just get the wrong
distribution of pages. The latter is non-obvious unless you go looking for
it, so it might be good to at least add that test result to the changelog.
Unfortunately, this is hard to cover in LTP or a kernel selftest.

> Instead of making an additional call, we can calculate how many additional
> pages should be allocated from the first node (aka carryover) and add that
> value to the number of pages that should be allocated as part of the current
> round-robin cycle.
> 
> Running a quick benchmark by compiling the kernel shows a small increase
> in performance. These experiments were run on a machine with 2 nodes, each
> with 125GB memory and 40 CPUs.
> 
> time numactl -w 0,1 make -j$(nproc)
> 
> +----------+---------+------------+---------+
> | Time (s) | 6.16    | With patch | % Delta |
> +----------+---------+------------+---------+
> | Real     | 88.374  | 88.3356    | -0.2019 |
> | User     | 3631.7  | 3636.263   |  0.0631 |
> | Sys      | 366.029 | 363.792    | -0.7534 |
> +----------+---------+------------+---------+
> 
> Signed-off-by: Joshua Hahn
> 
> ---
>  mm/mempolicy.c | 39 ++++++++++++++++++++-------------------
>  1 file changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 78ad74a0e249..0d693f96cf66 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2569,7 +2569,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  	unsigned long node_pages, delta;
>  	u8 *weights, weight;
>  	unsigned int weight_total = 0;
> -	unsigned long rem_pages = nr_pages;
> +	unsigned long rem_pages = nr_pages, carryover = 0;
>  	nodemask_t nodes;
>  	int nnodes, node;
>  	int resume_node = MAX_NUMNODES - 1;
> @@ -2594,18 +2594,12 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  	node = me->il_prev;
>  	weight = me->il_weight;
>  	if (weight && node_isset(node, nodes)) {
> -		node_pages = min(rem_pages, weight);
> -		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> -						  page_array);
> -		page_array += nr_allocated;
> -		total_allocated += nr_allocated;
> -		/* if that's all the pages, no need to interleave */
>  		if (rem_pages <= weight) {
> -			me->il_weight -= rem_pages;
> -			return total_allocated;
> +			node_pages = rem_pages;
> +			me->il_weight -= node_pages;
> +			goto allocate;
>  		}
> -		/* Otherwise we adjust remaining pages, continue from there */
> -		rem_pages -= weight;
> +		carryover = weight;
>  	}
>  	/* clear active weight in case of an allocation failure */
>  	me->il_weight = 0;
> @@ -2614,7 +2608,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  	/* create a local copy of node weights to operate on outside rcu */
>  	weights = kzalloc(nr_node_ids, GFP_KERNEL);
>  	if (!weights)
> -		return total_allocated;
> +		return 0;

may be worth noting that this change means small bulk allocations that could
be covered by the current il_node/weight may now fail if kzalloc fails. But
then there are probably other problems. However, this is a functional
difference between the old and new state of the function.

> 
>  	rcu_read_lock();
>  	state = rcu_dereference(wi_state);
> @@ -2634,16 +2628,17 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  	/*
>  	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
>  	 * Track which node weighted interleave should resume from.
> +	 * Account for carryover. It is always allocated from the first node.
>  	 *
>  	 * if (rounds > 0) and (delta == 0), resume_node will always be
>  	 * the node following prev_node and its weight.
>  	 */
> -	rounds = rem_pages / weight_total;
> -	delta = rem_pages % weight_total;
> +	rounds = (rem_pages - carryover) / weight_total;
> +	delta = (rem_pages - carryover) % weight_total;
>  	resume_node = next_node_in(prev_node, nodes);
>  	resume_weight = weights[resume_node];
> +	node = carryover ? prev_node : next_node_in(prev_node, nodes);
>  	for (i = 0; i < nnodes; i++) {
> -		node = next_node_in(prev_node, nodes);
>  		weight = weights[node];
>  		/* when delta is depleted, resume from that node */
>  		if (delta && delta < weight) {
> @@ -2651,12 +2646,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  			resume_weight = weight - delta;
>  		}
>  		/* Add the node's portion of the delta, if there is one */
> -		node_pages = weight * rounds + min(delta, weight);
> +		node_pages = weight * rounds + min(delta, weight) + carryover;
>  		delta -= min(delta, weight);
> +		carryover = 0;
> 
>  		/* node_pages can be 0 if an allocation fails and rounds == 0 */
>  		if (!node_pages)
>  			break;
> +allocate:
>  		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
>  						  page_array);
>  		page_array += nr_allocated;
> @@ -2664,10 +2661,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		if (total_allocated == nr_pages)
>  			break;
>  		prev_node = node;
> +		node = next_node_in(prev_node, nodes);
> +	}
> +
> +	if (weights) {
> +		me->il_prev = resume_node;
> +		me->il_weight = resume_weight;
> +		kfree(weights);
>  	}
> -	me->il_prev = resume_node;
> -	me->il_weight = resume_weight;
> -	kfree(weights);
>  	return total_allocated;
>  }
> 
> --
> 2.47.1
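
As an aside, the carryover math above is easy to sanity-check in userspace.
The sketch below is not the validation program mentioned earlier in this
mail; the node count, weights, and carryover value are made-up inputs. It
performs the same rounds/delta/carryover split as the patched loop and
asserts that the per-node page counts sum to nr_pages:

#include <assert.h>
#include <stdio.h>

#define NNODES 3

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

int main(void)
{
        /* made-up inputs: three nodes with weights 4:2:1 */
        unsigned long weights[NNODES] = { 4, 2, 1 };
        unsigned long nr_pages = 1000;  /* requested bulk size */
        unsigned long carryover = 3;    /* leftover il_weight from a prior call */
        unsigned long weight_total = 0, rounds, delta, total = 0;
        int i;

        for (i = 0; i < NNODES; i++)
                weight_total += weights[i];

        /* same split as the patch: carryover is excluded from the full cycles */
        rounds = (nr_pages - carryover) / weight_total;
        delta = (nr_pages - carryover) % weight_total;

        for (i = 0; i < NNODES; i++) {
                /* only the first node visited (il_prev) receives the carryover */
                unsigned long node_pages = weights[i] * rounds +
                                           min_ul(delta, weights[i]) + carryover;

                delta -= min_ul(delta, weights[i]);
                carryover = 0;
                total += node_pages;
                printf("node %d: %lu pages\n", i, node_pages);
        }

        /* a math error shows up as a bad distribution, not an oops */
        assert(total == nr_pages);
        return 0;
}

A mistake in the carryover handling shows up here as a failed assert or a
skewed per-node count, which is exactly the kind of wrong distribution that
is hard to notice from inside the kernel.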