From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: linux-mm@kvack.org
Subject: Re: [EXT] [RFC PATCH v3 4/4] mm/mempolicy: modify interleave mempolicy to use node weights
Date: Tue, 31 Oct 2023 23:22:12 +0530
Message-ID: <067c962f-1eb0-4342-a957-62f215ddf229@micron.com>
In-Reply-To: <20231031003810.4532-5-gregory.price@memverge.com>
References: <20231031003810.4532-1-gregory.price@memverge.com> <20231031003810.4532-5-gregory.price@memverge.com>


On 10/31/2023 6:08 AM, Gregory Price wrote:

The node subsystem implements interleave weighting for the purpose
of bandwidth optimization.  Each node may have different weights in
relation to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave.  By default, since all nodes default to a weight
of 1, the original interleave behavior is retained.

Examples

Weight settings:
echo 4 > node0/access0/il_weight
echo 1 > node0/access1/il_weight

echo 3 > node1/access0/il_weight
echo 2 > node1/access1/il_weight

Results:

Task A:
   cpunode:  0
   nodemask: [0,1]
   weights:  [4,3]
   allocation result: [0,0,0,0,1,1,1 repeat]

Task B:
   cpunode:  1
   nodemask: [0,1]
   weights:  [1,2]
   allocation result: [0,1,1 repeat]
   Weights are relative to the access node.
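
The weighted round-robin described above can be modeled as a small userspace sketch (node indices, weights, and the function name are illustrative, not the kernel API; the kernel keeps the equivalent state in struct mempolicy and task_struct):

```c
#include <assert.h>

/*
 * Each node in the mask is selected `weight` times before advancing
 * to the next node.  A zero weight is treated as 1, matching the
 * "at least 1 allocation is required" rule in interleave_nodes().
 */
int weighted_interleave_next(const int *weights, int nnodes,
                             int *cur_node, int *cur_weight)
{
        int node = *cur_node;

        if (*cur_weight == 0)   /* refill from this node's weight */
                *cur_weight = weights[node] ? weights[node] : 1;

        (*cur_weight)--;
        if (*cur_weight == 0)   /* weight exhausted: advance */
                *cur_node = (node + 1) % nnodes;

        return node;
}
```

With weights [4,3] this reproduces the [0,0,0,0,1,1,1 repeat] pattern from the Task A example.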

Signed-off-by: Gregory Price <gregory.price@memverge.com>
Thank you, Gregory, for the collaboration.
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
---
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 2 files changed, 112 insertions(+), 30 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..240468b669fd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,6 +48,10 @@ struct mempolicy {
        nodemask_t nodes;       /* interleave/bind/perfer */
        int home_node;          /* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */

+       /* weighted interleave settings */
+       unsigned char cur_weight;
+       unsigned char il_weights[MAX_NUMNODES];
+
        union {
                nodemask_t cpuset_mems_allowed; /* relative to these nodes */
                nodemask_t user_nodemask;       /* nodemask passed by user */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 29ebf1e7898c..d62e942a13bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -102,6 +102,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/memory-tiers.h>

 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -300,6 +301,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
        policy->mode = mode;
        policy->flags = flags;
        policy->home_node = NUMA_NO_NODE;
+       policy->cur_weight = 0;

        return policy;
 }
@@ -334,6 +336,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
                tmp = *nodes;

        pol->nodes = tmp;
+       pol->cur_weight = 0;
 }

 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -881,8 +884,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,

        old = current->mempolicy;
        current->mempolicy = new;
-       if (new && new->mode == MPOL_INTERLEAVE)
+       if (new && new->mode == MPOL_INTERLEAVE) {
                current->il_prev = MAX_NUMNODES-1;
+               new->cur_weight = 0;
+       }
+
        task_unlock(current);
        mpol_put(old);
        ret = 0;
@@ -1903,12 +1909,21 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
-       unsigned next;
+       unsigned int next;
+       unsigned char next_weight;
        struct task_struct *me = current;

        next = next_node_in(me->il_prev, policy->nodes);
-       if (next < MAX_NUMNODES)
+       if (!policy->cur_weight) {
+               /* If the node is set, at least 1 allocation is required */
+               next_weight = node_get_il_weight(next, numa_node_id());
+               policy->cur_weight = next_weight ? next_weight : 1;
+       }
+
+       policy->cur_weight--;
+       if (next < MAX_NUMNODES && !policy->cur_weight)
                me->il_prev = next;
+
        return next;
 }

@@ -1967,25 +1982,37 @@ unsigned int mempolicy_slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
        nodemask_t nodemask = pol->nodes;
-       unsigned int target, nnodes;
-       int i;
+       unsigned int target, nnodes, il_weight;
+       unsigned char weight;
        int nid;
+       int cur_node = numa_node_id();
+
        /*
         * The barrier will stabilize the nodemask in a register or on
         * the stack so that it will stop changing under the code.
         *
         * Between first_node() and next_node(), pol->nodes could be changed
         * by other threads. So we put pol->nodes in a local stack.
+        *
+        * Additionally, place the cur_node on the stack in case of a migration
         */
        barrier();

        nnodes = nodes_weight(nodemask);
        if (!nnodes)
-               return numa_node_id();
-       target = (unsigned int)n % nnodes;
+               return cur_node;
+
+       il_weight = nodes_get_il_weights(cur_node, &nodemask, pol->il_weights);
+       target = (unsigned int)n % il_weight;
        nid = first_node(nodemask);
-       for (i = 0; i < target; i++)
-               nid = next_node(nid, nodemask);
+       while (target) {
+               weight = pol->il_weights[nid];
+               if (target < weight)
+                       break;
+               target -= weight;
+               nid = next_node_in(nid, nodemask);
+       }
+
        return nid;
 }
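
For reference, the offset lookup above can be modeled as a plain function that reduces n modulo the total weight and then walks the per-node weight spans until the remainder falls inside one. This is a sketch with illustrative weights: the `total` here plays the role of the value the patch assumes nodes_get_il_weights() returns, and every weight is assumed nonzero:

```c
#include <assert.h>

/*
 * Map interleave index n to a node id.  The remainder of n modulo the
 * summed weights identifies an offset; the node whose weight span
 * covers that offset is the target, mirroring offset_il_node().
 */
int weighted_offset_node(const int *weights, int nnodes, unsigned long n)
{
        unsigned long total = 0, target;
        int nid = 0;

        for (int i = 0; i < nnodes; i++)
                total += weights[i];

        target = n % total;
        while (target >= (unsigned long)weights[nid]) {
                target -= weights[nid];
                nid = (nid + 1) % nnodes;
        }
        return nid;
}
```

With weights [4,3], offsets 0-3 map to node 0, offsets 4-6 to node 1, and offset 7 wraps back to node 0.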

@@ -2319,32 +2346,83 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
                struct mempolicy *pol, unsigned long nr_pages,
                struct page **page_array)
 {
-       int nodes;
-       unsigned long nr_pages_per_node;
-       int delta;
-       int i;
-       unsigned long nr_allocated;
+       struct task_struct *me = current;
        unsigned long total_allocated = 0;
+       unsigned long nr_allocated;
+       unsigned long rounds;
+       unsigned long node_pages, delta;
+       unsigned char weight;
+       unsigned long il_weight;
+       unsigned long req_pages = nr_pages;
+       int nnodes, node, prev_node;
+       int cur_node = numa_node_id();
+       int i;

-       nodes = nodes_weight(pol->nodes);
-       nr_pages_per_node = nr_pages / nodes;
-       delta = nr_pages - nodes * nr_pages_per_node;
-
-       for (i = 0; i < nodes; i++) {
-               if (delta) {
-                       nr_allocated = __alloc_pages_bulk(gfp,
-                                       interleave_nodes(pol), NULL,
-                                       nr_pages_per_node + 1, NULL,
-                                       page_array);
-                       delta--;
-               } else {
-                       nr_allocated = __alloc_pages_bulk(gfp,
-                                       interleave_nodes(pol), NULL,
-                                       nr_pages_per_node, NULL, page_array);
+       prev_node = me->il_prev;
+       nnodes = nodes_weight(pol->nodes);
+       /* Continue allocating from most recent node */
+       if (pol->cur_weight) {
+               node = next_node_in(prev_node, pol->nodes);
+               node_pages = pol->cur_weight;
+               if (node_pages > nr_pages)
+                       node_pages = nr_pages;
+               nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+                                                 NULL, page_array);
+               page_array += nr_allocated;
+               total_allocated += nr_allocated;
+               /* if that's all the pages, no need to interleave */
+               if (req_pages <= pol->cur_weight) {
+                       pol->cur_weight -= req_pages;
+                       return total_allocated;
                }
-
+               /* Otherwise we adjust req_pages down, and continue from there */
+               req_pages -= pol->cur_weight;
+               pol->cur_weight = 0;
+               prev_node = node;
+       }
+
+       il_weight = nodes_get_il_weights(cur_node, &pol->nodes,
+                                        pol->il_weights);
+       rounds = req_pages / il_weight;
+       delta = req_pages % il_weight;
+       for (i = 0; i < nnodes; i++) {
+               node = next_node_in(prev_node, pol->nodes);
+               weight = pol->il_weights[node];
+               node_pages = weight * rounds;
+               if (delta > weight) {
+                       node_pages += weight;
+                       delta -= weight;
+               } else if (delta) {
+                       node_pages += delta;
+                       delta = 0;
+               }
+               /* The number of requested pages may not hit every node */
+               if (!node_pages)
+                       break;
+               /* If an over-allocation would occur, floor it */
+               if (node_pages + total_allocated > nr_pages) {
+                       node_pages = nr_pages - total_allocated;
+                       delta = 0;
+               }
+               nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+                                                 NULL, page_array);
                page_array += nr_allocated;
                total_allocated += nr_allocated;
+               prev_node = node;
+       }
+
+       /*
+        * Finally, we need to update me->il_prev and pol->cur_weight
+        * If the last node allocated on has un-used weight, apply
+        * the remainder as the cur_weight, otherwise proceed to next node
+        */
+       if (node_pages) {
+               me->il_prev = prev_node;
+               node_pages %= weight;
+               pol->cur_weight = weight - node_pages;
+       } else {
+               me->il_prev = node;
+               pol->cur_weight = 0;
        }

        return total_allocated;
--
2.39.1
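
As a closing note, the rounds/delta split in the bulk path can be checked with a small userspace model (illustrative only: it omits the cur_weight continuation and the over-allocation floor from the real alloc_pages_bulk_array_interleave()):

```c
#include <assert.h>

/*
 * req_pages = rounds * total_weight + delta.  Each node receives
 * weight * rounds pages, plus its share of delta consumed in node
 * order, as in the quoted hunk.
 */
void weighted_bulk_split(const int *weights, int nnodes,
                         unsigned long req_pages,
                         unsigned long *node_pages)
{
        unsigned long total = 0, rounds, delta;

        for (int i = 0; i < nnodes; i++)
                total += weights[i];

        rounds = req_pages / total;
        delta = req_pages % total;

        for (int i = 0; i < nnodes; i++) {
                node_pages[i] = (unsigned long)weights[i] * rounds;
                if (delta > (unsigned long)weights[i]) {
                        node_pages[i] += weights[i];
                        delta -= weights[i];
                } else if (delta) {
                        node_pages[i] += delta;
                        delta = 0;
                }
        }
}
```

For weights [4,3] and 10 requested pages, total weight is 7, so rounds = 1 and delta = 3; node 0 gets 4 + 3 = 7 pages and node 1 gets 3.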

