From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Gregory Price, ..., Srinivasulu Thanneeru
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
In-Reply-To: (Gregory Price's message of "Tue, 26 Dec 2023 02:01:57 -0500")
References: <20231223181101.1954-1-gregory.price@memverge.com> <20231223181101.1954-3-gregory.price@memverge.com> <8734vof3kq.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 02 Jan 2024 16:42:42 +0800
Message-ID: <878r58dt31.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> > +{
>> > +	nodemask_t nodemask = pol->nodes;
>> > +	unsigned int target, weight_total = 0;
>> > +	int nid;
>> > +	unsigned char weights[MAX_NUMNODES];
>>
>> MAX_NUMNODES could be as large as 1024.  1KB of stack space may be too
>> large?
>>
>
> I've been struggling with a good solution to this.  We need a local copy
> of the weights to prevent them from changing out from under us during
> allocation (which may take quite some time), but it seemed unwise to
> allocate 1KB on the heap in this particular path.
>
> Is my concern unfounded?  If so, I can go ahead and add the allocation
> code.

Please take a look at NODEMASK_ALLOC().
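For reference, NODEMASK_ALLOC() only goes to the heap when nodemask_t is
genuinely large (NODES_SHIFT > 8); otherwise it expands to an ordinary
stack variable, so small configurations pay nothing.  An untested sketch
of how it could be used here (the numa_node_id() fallback is just a
placeholder, not part of the existing code):

	NODEMASK_ALLOC(nodemask_t, nodemask, GFP_KERNEL);

	/* NULL is only possible on the kmalloc() path */
	if (!nodemask)
		return numa_node_id();	/* placeholder fallback */

	*nodemask = pol->nodes;
	/* ... walk *nodemask instead of a direct pol->nodes copy ... */
	NODEMASK_FREE(nodemask);

The weights[] copy could be handled the same way, heap-allocating only
when MAX_NUMNODES is large.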
>> > +	unsigned char weight;
>> > +
>> > +	barrier();
>>
>> Memory barrier needs comments.
>>
>
> The barrier is to stabilize the nodemask on the stack, but yes, I'll
> carry the comment from interleave_nid into this barrier as well.

Please see below.

>> > +
>> > +	/* first ensure we have a valid nodemask */
>> > +	nid = first_node(nodemask);
>> > +	if (nid == MAX_NUMNODES)
>> > +		return nid;
>>
>> It appears that this isn't necessary, because we can check whether
>> weight_total == 0 after the next loop.
>>
>
> Fair, will snip.
>
>> > +
>> > +	/* Then collect weights on stack and calculate totals */
>> > +	for_each_node_mask(nid, nodemask) {
>> > +		weight = iw_table[nid];
>> > +		weight_total += weight;
>> > +		weights[nid] = weight;
>> > +	}
>> > +
>> > +	/* Finally, calculate the node offset based on totals */
>> > +	target = (unsigned int)ilx % weight_total;
>>
>> Why use type casting?
>>
>
> Artifact of old prototypes, snipped.
>
>> > +
>> > +	/* Stabilize the nodemask on the stack */
>> > +	barrier();
>>
>> I don't think barrier() is needed to wait for memory operations on the
>> stack.  It's usually used for cross-processor memory ordering.
>>
>
> This is present in the old interleave code.  To the best of my
> understanding, the concern is mempolicy->nodemask rebinding, which can
> occur when cgroups.cpusets.mems_allowed changes.
>
> So we can't iterate over (mempolicy->nodemask); we have to take a local
> copy.
>
> My *best* understanding of the barrier here is that it prevents the
> compiler from reordering operations such that it optimizes out the
> local copy (or does a lazy fetch).
>
> It is present in the original interleave code, so I pulled it forward to
> this, but I have not tested whether this is a bit paranoid or not.
>
> From `interleave_nid`:
>
> /*
>  * The barrier will stabilize the nodemask in a register or on
>  * the stack so that it will stop changing under the code.
>  *
>  * Between first_node() and next_node(), pol->nodes could be changed
>  * by other threads.  So we put pol->nodes in a local stack.
>  */
> barrier();

Got it.  This is a kind of READ_ONCE() for a nodemask.  To avoid adding
comments all over the place, can we implement a wrapper for it?  For
example, memcpy_once().  __read_once_size() in
tools/include/linux/compiler.h can be used as a reference.
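Something like the following (an untested sketch only, modeled on the
default case of __read_once_size(); the name memcpy_once() is just a
suggestion):

	/*
	 * Sketch: copy *src exactly once, like READ_ONCE() for objects
	 * too large for a single volatile load.  The barriers keep the
	 * compiler from hoisting, sinking, or re-reading *src around
	 * the copy.
	 */
	#define memcpy_once(dst, src)					\
	do {								\
		BUILD_BUG_ON(sizeof(*(dst)) != sizeof(*(src)));		\
		barrier();						\
		memcpy((dst), (src), sizeof(*(dst)));			\
		barrier();						\
	} while (0)

Then weighted_interleave_nid() could do

	memcpy_once(&nodemask, &pol->nodes);

with a single comment at the definition instead of at every call site.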
Because node_weights[] may be changed simultaneously too, we may need to
consider a similar issue for it as well.  But RCU seems more appropriate
for node_weights[].

>> > +		/* Otherwise we adjust nr_pages down, and continue from there */
>> > +		rem_pages -= pol->wil.cur_weight;
>> > +		pol->wil.cur_weight = 0;
>> > +		prev_node = node;
>>
>> If pol->wil.cur_weight == 0, prev_node will be used without being
>> initialized below.
>>
>
> pol->wil.cur_weight is not used below.
>
>> > +	}
>> > +
>> > +	/* Now we can continue allocating as if from 0 instead of an offset */
>> > +	rounds = rem_pages / weight_total;
>> > +	delta = rem_pages % weight_total;
>> > +	for (i = 0; i < nnodes; i++) {
>> > +		node = next_node_in(prev_node, nodes);
>> > +		weight = weights[node];
>> > +		node_pages = weight * rounds;
>> > +		if (delta) {
>> > +			if (delta > weight) {
>> > +				node_pages += weight;
>> > +				delta -= weight;
>> > +			} else {
>> > +				node_pages += delta;
>> > +				delta = 0;
>> > +			}
>> > +		}
>> > +		/* We may not make it all the way around */
>> > +		if (!node_pages)
>> > +			break;
>> > +		/* If an over-allocation would occur, floor it */
>> > +		if (node_pages + total_allocated > nr_pages) {
>>
>> Why is this possible?
>>
>
> This may have been a paranoid artifact from an early prototype, will
> snip and validate.

--
Best Regards,
Huang, Ying