From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: Gregory Price <gourry.memverge@gmail.com>, <linux-mm@kvack.org>,
<linux-doc@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-api@vger.kernel.org>,
<x86@kernel.org>, <akpm@linux-foundation.org>, <arnd@arndb.de>,
<tglx@linutronix.de>, <luto@kernel.org>, <mingo@redhat.com>,
<bp@alien8.de>, <dave.hansen@linux.intel.com>, <hpa@zytor.com>,
<mhocko@kernel.org>, <tj@kernel.org>, <corbet@lwn.net>,
<rakie.kim@sk.com>, <hyeongtak.ji@sk.com>, <honggyu.kim@sk.com>,
<vtavarespetr@micron.com>, <peterz@infradead.org>,
<jgroves@micron.com>, <ravis.opensrc@micron.com>,
<sthanneeru@micron.com>, <emirakhur@micron.com>,
<Hasan.Maruf@amd.com>, <seungjun.ha@samsung.com>,
Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Tue, 02 Jan 2024 16:42:42 +0800 [thread overview]
Message-ID: <878r58dt31.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <ZYp6ZRLZQVtTHest@memverge.com> (Gregory Price's message of "Tue, 26 Dec 2023 02:01:57 -0500")
Gregory Price <gregory.price@memverge.com> writes:
> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>>
>> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> > +{
>> > + nodemask_t nodemask = pol->nodes;
>> > + unsigned int target, weight_total = 0;
>> > + int nid;
>> > + unsigned char weights[MAX_NUMNODES];
>>
>> MAX_NUMNODSE could be as large as 1024. 1KB stack space may be too
>> large?
>>
>
> I've been struggling with a good solution to this. We need a local copy
> of weights to prevent weights from changing out from under us during
> allocation (which may take quite some time), but it seemed unwise to
> to allocate 1KB heap in this particular path.
>
> Is my concern unfounded? If so, I can go ahead and add the allocation
> code.
Please take a look at NODEMASK_ALLOC().
>> > + unsigned char weight;
>> > +
>> > + barrier();
>>
>> Memory barrier needs comments.
>>
>
> Barrier is to stabilize nodemask on the stack, but yes i'll carry the
> comment from interleave_nid into this barrier as well.
Please see below.
>> > +
>> > + /* first ensure we have a valid nodemask */
>> > + nid = first_node(nodemask);
>> > + if (nid == MAX_NUMNODES)
>> > + return nid;
>>
>> It appears that this isn't necessary, because we can check whether
>> weight_total == 0 after the next loop.
>>
>
> fair, will snip.
>
>> > +
>> > + /* Then collect weights on stack and calculate totals */
>> > + for_each_node_mask(nid, nodemask) {
>> > + weight = iw_table[nid];
>> > + weight_total += weight;
>> > + weights[nid] = weight;
>> > + }
>> > +
>> > + /* Finally, calculate the node offset based on totals */
>> > + target = (unsigned int)ilx % weight_total;
>>
>> Why use type casting?
>>
>
> Artifact of old prototypes, snipped.
>
>> > +
>> > + /* Stabilize the nodemask on the stack */
>> > + barrier();
>>
>> I don't think barrier() is needed to wait for memory operations for
>> stack. It's usually used for cross-processor memory order.
>>
>
> This is present in the old interleave code. To the best of my
> understanding, the concern is for mempolicy->nodemask rebinding that can
> occur when cgroups.cpusets.mems_allowed changes.
>
> so we can't iterate over (mempolicy->nodemask), we have to take a local
> copy.
>
> My *best* understanding of the barrier here is to prevent the compiler
> from reordering operations such that it attempts to optimize out the
> local copy (or do lazy-fetch).
>
> It is present in the original interleave code, so I pulled it forward to
> this, but I have not tested whether this is a bit paranoid or not.
>
> from `interleave_nid`:
>
> /*
> * The barrier will stabilize the nodemask in a register or on
> * the stack so that it will stop changing under the code.
> *
> * Between first_node() and next_node(), pol->nodes could be changed
> * by other threads. So we put pol->nodes in a local stack.
> */
> barrier();
Got it. This is kind of READ_ONCE() for nodemask. To avoid to add
comments all over the place. Can we implement a wrapper for it? For
example, memcpy_once(). __read_once_size() in
tools/include/linux/compiler.h can be used as reference.
Because node_weights[] may be changed simultaneously too. We may need
to consider similar issue for it too. But RCU seems more appropriate
for node_weights[].
>> > + /* Otherwise we adjust nr_pages down, and continue from there */
>> > + rem_pages -= pol->wil.cur_weight;
>> > + pol->wil.cur_weight = 0;
>> > + prev_node = node;
>>
>> If pol->wil.cur_weight == 0, prev_node will be used without being
>> initialized below.
>>
>
> pol->wil.cur_weight is not used below.
>
>> > + }
>> > +
>> > + /* Now we can continue allocating as if from 0 instead of an offset */
>> > + rounds = rem_pages / weight_total;
>> > + delta = rem_pages % weight_total;
>> > + for (i = 0; i < nnodes; i++) {
>> > + node = next_node_in(prev_node, nodes);
>> > + weight = weights[node];
>> > + node_pages = weight * rounds;
>> > + if (delta) {
>> > + if (delta > weight) {
>> > + node_pages += weight;
>> > + delta -= weight;
>> > + } else {
>> > + node_pages += delta;
>> > + delta = 0;
>> > + }
>> > + }
>> > + /* We may not make it all the way around */
>> > + if (!node_pages)
>> > + break;
>> > + /* If an over-allocation would occur, floor it */
>> > + if (node_pages + total_allocated > nr_pages) {
>>
>> Why is this possible?
>>
>
> this may have been a paranoid artifact from an early prototype, will
> snip and validate.
--
Best Regards,
Huang, Ying
next prev parent reply other threads:[~2024-01-02 8:44 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
2023-12-23 18:10 ` [PATCH v5 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
2023-12-23 18:11 ` [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
2024-01-02 14:47 ` Geert Uytterhoeven
2023-12-25 7:54 ` [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Huang, Ying
2023-12-26 7:45 ` Gregory Price
2024-01-02 4:27 ` Huang, Ying
2024-01-02 19:06 ` Gregory Price
2024-01-03 3:15 ` Huang, Ying
[not found] ` <20231223181101.1954-2-gregory.price@memverge.com>
2023-12-27 6:42 ` [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Huang, Ying
2023-12-26 6:48 ` Gregory Price
2024-01-02 7:41 ` Huang, Ying
2024-01-02 19:45 ` Gregory Price
2024-01-03 2:45 ` Huang, Ying
2024-01-03 2:59 ` Gregory Price
2024-01-03 6:03 ` Huang, Ying
2024-01-03 2:46 ` Gregory Price
[not found] ` <20231223181101.1954-3-gregory.price@memverge.com>
2023-12-27 8:32 ` [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Huang, Ying
2023-12-26 7:01 ` Gregory Price
2023-12-26 8:06 ` Gregory Price
2023-12-26 11:32 ` Gregory Price
2024-01-02 8:42 ` Huang, Ying [this message]
2024-01-02 20:30 ` Gregory Price
2024-01-03 5:46 ` Huang, Ying
2024-01-03 22:09 ` Gregory Price
2024-01-04 5:39 ` Huang, Ying
2024-01-04 18:59 ` Gregory Price
2024-01-05 6:51 ` Huang, Ying
2024-01-05 7:25 ` Gregory Price
2024-01-08 7:08 ` Huang, Ying
[not found] ` <20231223181101.1954-4-gregory.price@memverge.com>
2023-12-27 8:39 ` [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Huang, Ying
2023-12-26 7:05 ` Gregory Price
2023-12-26 11:48 ` Gregory Price
2024-01-02 9:09 ` Huang, Ying
[not found] ` <20231223181101.1954-9-gregory.price@memverge.com>
2024-01-02 14:38 ` [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall Geert Uytterhoeven
[not found] ` <20231223181101.1954-10-gregory.price@memverge.com>
2024-01-02 14:46 ` [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall Geert Uytterhoeven
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=878r58dt31.fsf@yhuang6-desk2.ccr.corp.intel.com \
--to=ying.huang@intel.com \
--cc=Hasan.Maruf@amd.com \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=bp@alien8.de \
--cc=corbet@lwn.net \
--cc=dave.hansen@linux.intel.com \
--cc=emirakhur@micron.com \
--cc=gourry.memverge@gmail.com \
--cc=gregory.price@memverge.com \
--cc=honggyu.kim@sk.com \
--cc=hpa@zytor.com \
--cc=hyeongtak.ji@sk.com \
--cc=jgroves@micron.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mhocko@kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rakie.kim@sk.com \
--cc=ravis.opensrc@micron.com \
--cc=seungjun.ha@samsung.com \
--cc=sthanneeru.opensrc@micron.com \
--cc=sthanneeru@micron.com \
--cc=tglx@linutronix.de \
--cc=tj@kernel.org \
--cc=vtavarespetr@micron.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox