From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price
Cc: Gregory Price, ..., Srinivasulu Thanneeru
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
In-Reply-To: (Gregory Price's message of "Tue, 2 Jan 2024 15:30:36 -0500")
References: <20231223181101.1954-1-gregory.price@memverge.com>
 <20231223181101.1954-3-gregory.price@memverge.com>
 <8734vof3kq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <878r58dt31.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 03 Jan 2024 13:46:56 +0800
Message-ID: <87mstnc6jz.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> >> Gregory Price writes:
>> >>
>> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >> > +{
>> >> > +	nodemask_t nodemask = pol->nodes;
>> >> > +	unsigned int target, weight_total = 0;
>> >> > +	int nid;
>> >> > +	unsigned char weights[MAX_NUMNODES];
>> >>
>> >> MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
>> >> large?
>> >
>> > I've been struggling with a good solution to this.  We need a local
>> > copy of the weights to prevent them from changing out from under us
>> > during allocation (which may take quite some time), but it seemed
>> > unwise to allocate 1KB on the heap in this particular path.
>> >
>> > Is my concern unfounded?  If so, I can go ahead and add the
>> > allocation code.
>>
>> Please take a look at NODEMASK_ALLOC().
>
> This is not my question.  NODEMASK_ALLOC calls kmalloc/kfree.
>
> Some of the allocations on the stack can be replaced with a scratch
> allocation, that's no big deal.
>
> I'm specifically concerned about:
>
>   weighted_interleave_nid
>   alloc_pages_bulk_array_weighted_interleave
>
> I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> contexts.  If kmalloc/kfree is safe there, this problem is trivial.
>
> If not, there is no good solution to this without pre-allocating a
> scratch area per-task.

You need to audit all the callers to determine whether it's safe.  I
guess that these functions are called to allocate pages afterwards, so
you can use the same GFP flags for the scratch allocation here.
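That is, something like the following.  This is only an untested sketch
of the idea: the added gfp parameter, the iw_table[] lookup, and the
fallback to the first node on allocation failure are my assumptions,
not code from your patch.

static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx,
					    gfp_t gfp)
{
	nodemask_t nodemask = pol->nodes;
	unsigned int target, weight_total = 0;
	int nid;
	unsigned char *weights;

	/* stabilize the local nodemask copy, as in interleave_nid() */
	barrier();

	/* heap scratch area with the caller's GFP flags, not 1KB of stack */
	weights = kmalloc(MAX_NUMNODES, gfp);
	if (!weights)
		return first_node(nodemask);

	/* snapshot the weights and sum them up */
	for_each_node_mask(nid, nodemask) {
		weights[nid] = iw_table[nid];
		weight_total += weights[nid];
	}

	/* guard against all-zero weights to avoid dividing by zero */
	if (unlikely(!weight_total)) {
		kfree(weights);
		return first_node(nodemask);
	}

	/* map ilx onto the weighted node sequence */
	target = ilx % weight_total;
	for_each_node_mask(nid, nodemask) {
		if (target < weights[nid])
			break;
		target -= weights[nid];
	}

	kfree(weights);
	return nid;
}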
>> >> I don't think barrier() is needed to wait for memory operations on
>> >> the stack.  It's usually used for cross-processor memory ordering.
>> >
>> > This is present in the old interleave code.  To the best of my
>> > understanding, the concern is mempolicy->nodemask rebinding, which
>> > can occur when cgroups.cpusets.mems_allowed changes.
>> >
>> > So we can't iterate over (mempolicy->nodemask); we have to take a
>> > local copy.
>> >
>> > My *best* understanding of the barrier here is that it prevents the
>> > compiler from reordering operations such that it optimizes out the
>> > local copy (or does a lazy fetch).
>> >
>> > It is present in the original interleave code, so I pulled it forward
>> > to this, but I have not tested whether it is a bit paranoid or not.
>> >
>> > From `interleave_nid`:
>> >
>> > 	/*
>> > 	 * The barrier will stabilize the nodemask in a register or on
>> > 	 * the stack so that it will stop changing under the code.
>> > 	 *
>> > 	 * Between first_node() and next_node(), pol->nodes could be
>> > 	 * changed by other threads.  So we put pol->nodes in a local stack.
>> > 	 */
>> > 	barrier();
>>
>> Got it.  This is a kind of READ_ONCE() for the nodemask.  To avoid
>> adding comments all over the place, can we implement a wrapper for it?
>> For example, memcpy_once().  __read_once_size() in
>> tools/include/linux/compiler.h can be used as a reference.
>>
>> Because node_weights[] may be changed simultaneously too, we may need
>> to consider a similar issue for it.  But RCU seems more appropriate
>> for node_weights[].
>
> Weights are collected individually onto the stack because we have to sum
> them up before we actually apply the weights.
>
> A stale weight is not offensive.  RCU is not needed and doesn't help.

When you copy the weights from iw_table[] to the stack, it's possible
for the compiler to cache the contents in registers, or to merge or
split the memory operations.  At the same time, iw_table[] may be
changed simultaneously via the sysfs interface.  So we need a mechanism
to guarantee that we read the latest contents consistently.
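With RCU, for example, the sysfs writer would publish a whole new table
and the readers would take a consistent snapshot under rcu_read_lock().
A rough sketch of the idea follows; making iw_table an __rcu pointer
and adding iw_lock are my assumptions, not what your patch does.

/* assumption: the weights are published as one RCU-protected table */
static unsigned char __rcu *iw_table;
static DEFINE_MUTEX(iw_lock);	/* serializes sysfs updates */

/* reader: snapshot the weights of the nodes we care about */
static void read_weights(const nodemask_t *mask, unsigned char *weights)
{
	unsigned char *table;
	int nid;

	rcu_read_lock();
	table = rcu_dereference(iw_table);
	for_each_node_mask(nid, *mask)
		weights[nid] = table[nid];
	rcu_read_unlock();
}

/* writer: replace the whole table, free the old one after a grace period */
static int set_weight(int nid, unsigned char weight)
{
	unsigned char *new, *old;

	new = kmalloc(MAX_NUMNODES, GFP_KERNEL);
	if (!new)
		return -ENOMEM;

	mutex_lock(&iw_lock);
	old = rcu_dereference_protected(iw_table, lockdep_is_held(&iw_lock));
	memcpy(new, old, MAX_NUMNODES);
	new[nid] = weight;
	rcu_assign_pointer(iw_table, new);
	mutex_unlock(&iw_lock);

	synchronize_rcu();
	kfree(old);
	return 0;
}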
> The reason the barrier is needed is not weights, it's the nodemask.

Yes.  That is why I said that we need similar stuff for the weights.

> So you basically just want to replace barrier() with this and drop the
> copy/pasted comments:
>
> static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
> {
> 	/*
> 	 * The barrier will stabilize the nodemask in a register or on
> 	 * the stack so that it will stop changing under the code.
> 	 *
> 	 * Between first_node() and next_node(), pol->nodes could be
> 	 * changed by other threads.  So we put pol->nodes in a local stack.
> 	 */
> 	barrier();
> 	__builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
> 	barrier();
> }
>
> -	nodemask_t nodemask = pol->nodes;
> -	barrier();
> +	nodemask_t nodemask;
> +	read_once_policy_nodemask(pol, &nodemask);
>
> Is that right?

Yes.  Something like that.  Or even more general (may need to be
optimized?),

static inline void memcpy_once(void *dst, const void *src, size_t n)
{
	barrier();
	memcpy(dst, src, n);
	barrier();
}

	memcpy_once(&nodemask, &pol->nodes, sizeof(nodemask));

The comments can be based on those of READ_ONCE().

--
Best Regards,
Huang, Ying