From: Gregory Price
To: linux-kernel@vger.kernel.org
Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com,
    akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
    weixugc@google.com, apopple@nvidia.com, hannes@cmpxchg.org,
    tim.c.chen@intel.com, dave.hansen@intel.com, mhocko@kernel.org,
    shy828301@gmail.com, gregkh@linuxfoundation.org, rafael@kernel.org,
    Gregory Price
Subject: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Date: Mon, 30 Oct 2023 20:38:06 -0400
Message-Id: <20231031003810.4532-1-gregory.price@memverge.com>

This patchset implements weighted interleave and adds a new
sysfs entry: /sys/devices/system/node/nodeN/accessM/il_weight.

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked. By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.

Interleave weights may be set from 0-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.

Additionally, "node accessors" (synonymous with cpu nodes) are used
to allow for accessor-relative weighting. The "accessor" for a task
is defined as the node the task is presently running on.

  # Set node weight for node0 accessed by tasks on node0 to 5
  echo 5 > /sys/devices/system/node/node0/access0/il_weight

  # Set node weight for node0 accessed by tasks on node1 to 3
  echo 3 > /sys/devices/system/node/node0/access1/il_weight

In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex

In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].

This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.
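As a rough illustration of what a weight vector means in practice, the
sketch below turns the example weights for a task on Node 0 into the
fraction of interleaved pages each node would receive over one full
round. The helper name page_shares and the dictionary layout are
assumptions made for this example; they are not part of the patch set.

```python
from fractions import Fraction


def page_shares(weights):
    """Return the fraction of interleaved pages each node receives.

    `weights` maps node id -> interleave weight as seen by one accessor.
    One full interleave round allocates sum(weights) pages in total.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a non-zero weight")
    return {node: Fraction(w, total) for node, w in weights.items()}


# Effective weights for a task running on Node 0, from the example system.
node0_view = {0: 60, 1: 20, 2: 10, 3: 10}

for node, share in page_shares(node0_view).items():
    print(f"node{node}: {float(share):.0%} of pages")
```

With these weights one round spans 100 pages, so the local DRAM node
takes 60 of them and each CXL node takes 10.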
~Gregory

(sorry for the repeat send, automation failure)

================================================================

Version Notes:

v3: move weights into node rather than memtiers
    some additional fixes to node.c to support this
v1/v2: add weighted-interleave support to mempolicy

= v3 notes

This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.

Node was recommended by Huang, Ying
Accessor was recommended by Ravi Shankar

== Move weights into node

Originally this work was done by placing weights in the memory tier.
In this patch set we changed the weights to live in the numa node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node relative weighting.

Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight

and is set with a value between 1 and 100:

# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).

== Other Node fixes

2 other updates to node.c were required to support this:

1) The access list must be initialized prior to the node struct
   pointer being registered in the node array

2) The accessors in the list must be registered regardless of
   whether HMAT/HMEM information is reported. Presently this results
   in 0-value information being present in the various access
   subgroups

== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The node subsystem implements interleave weighting for the purpose
of bandwidth optimization. Each node may have different weights in
relation to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave.
By default, since all nodes default to a weight of 1, the original
interleave behavior is retained.

Examples

Weight settings:
  echo 4 > node0/access0/il_weight
  echo 3 > node1/access0/il_weight
  echo 2 > node1/access1/il_weight
  echo 1 > node0/access1/il_weight

Results:

Task A:
  cpunode:  0
  nodemask: [0,1]
  weights:  [4,3]
  allocation result: [0,0,0,0,1,1,1 repeat]

Task B:
  cpunode:  1
  nodemask: [0,1]
  weights:  [1,2]
  allocation result: [0,1,1 repeat]

=== original RFCs ====

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@micron.com/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/

N:M weighting in mempolicy
By: Hasan Al Maruf
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Ying Huang's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Gregory Price (4):
  base/node.c: initialize the accessor list before registering
  node: add accessors to sysfs when nodes are created
  node: add interleave weights to node accessor
  mm/mempolicy: modify interleave mempolicy to use node weights

 drivers/base/node.c       | 120 ++++++++++++++++++++++++++++++++-
 include/linux/mempolicy.h |   4 ++
 include/linux/node.h      |  17 +++++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 4 files changed, 246 insertions(+), 33 deletions(-)

--
2.39.1
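The Task A and Task B results in the weighted interleave examples can
be reproduced with a small userspace simulation. This is only a sketch
of the behavior described in this letter, not the kernel code in
mm/mempolicy.c: the table IL_WEIGHT and the function
weighted_interleave are hypothetical names, and the task is assumed to
stay on one cpu node for the whole run.

```python
from itertools import islice

# Hypothetical table keyed by (node, accessor), mirroring the
# nodeN/accessM/il_weight settings from the example.
IL_WEIGHT = {(0, 0): 4, (1, 0): 3, (1, 1): 2, (0, 1): 1}


def weighted_interleave(nodemask, cpunode, il_weight=IL_WEIGHT):
    """Yield the target node for each successive page allocation.

    Each node in `nodemask` receives `weight` consecutive pages before
    the next node is scheduled. Unset weights default to 1, which gives
    plain round-robin interleave.
    """
    weights = [(node, il_weight.get((node, cpunode), 1))
               for node in sorted(nodemask)]
    if not any(w > 0 for _, w in weights):
        raise ValueError("at least one node needs a non-zero weight")
    while True:
        for node, weight in weights:
            for _ in range(weight):
                yield node


# Task A runs on cpu node 0, Task B on cpu node 1; both use nodemask [0,1].
print(list(islice(weighted_interleave([0, 1], cpunode=0), 7)))
print(list(islice(weighted_interleave([0, 1], cpunode=1), 3)))
```

The first call prints one round of the Task A pattern and the second
one round of the Task B pattern. Passing an empty weight table makes
every node fall back to weight 1, which matches the default behavior
described above.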