From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	akpm@linux-foundation.org, sthanneeru@micron.com,
	ying.huang@intel.com, gregory.price@memverge.com,
	Ravi Jonnalagadda
Subject: [RFC PATCH v2 2/3] mm/memory-tiers: Introduce sysfs for tier interleave weights
Date: Mon, 9 Oct 2023 16:42:58 -0400
Message-Id: <20231009204259.875232-3-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231009204259.875232-1-gregory.price@memverge.com>
References: <20231009204259.875232-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Allocating pages across tiers is accomplished by provisioning
interleave weights for each tier, with the distribution based on these
weight values. Weights are relative to the node requesting them (i.e.
the weight for tier2 from node0 may differ from the weight for tier2
from node1). This allows CPU-bound tasks more precise control over the
distribution of memory.

To represent this, each tier carries an array of weights indexed by
the source node:

	tier->interleave_weight[source_node] = weight;

Weights are set with the following sysfs mechanism. For example, to
set tier4's weight from node 0 to 85:

	echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight

By default, all tiers have a weight of 1 for all source nodes, which
maintains the default interleave behavior.

Weights are effectively aligned (up) to the number of nodes in the
operating nodemask (i.e. (policy_nodes & tier_nodes)) to simplify the
allocation logic and to avoid holding the tiering semaphore for a long
period of time during bulk allocation.

Weights apply to a tier, not to each node in the tier. The weight is
split between the nodes in that tier, similar to hardware interleaving.
However, when a task defines a nodemask that splits a tier's nodes,
the weight is split between the remaining nodes, retaining the overall
weight of the tier.
Signed-off-by: Srinivasulu Thanneeru
Co-developed-by: Ravi Jonnalagadda
Co-developed-by: Gregory Price
Signed-off-by: Gregory Price
---
 include/linux/memory-tiers.h |  16 ++++
 mm/memory-tiers.c            | 140 ++++++++++++++++++++++++++++++++++-
 2 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 437441cdf78f..a000b9745543 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -19,6 +19,8 @@
  */
 #define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
 
+#define MAX_TIER_INTERLEAVE_WEIGHT 100
+
 struct memory_tier;
 struct memory_dev_type {
 	/* list of memory types that are part of same tier as this type */
@@ -36,6 +38,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
 void put_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+unsigned char memtier_get_node_weight(int from_node, int target_node,
+				      nodemask_t *pol_nodes);
+unsigned int memtier_get_total_weight(int from_node, nodemask_t *pol_nodes);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -97,5 +102,16 @@ static inline bool node_is_toptier(int node)
 {
 	return true;
 }
+
+static inline unsigned char memtier_get_node_weight(int from_node, int target_node,
+						    nodemask_t *pol_nodes)
+{
+	return 0;
+}
+
+static inline unsigned int memtier_get_total_weight(int from_node, nodemask_t *pol_nodes)
+{
+	return 0;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0a3241a2cadc..37fc4b3f69a4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -14,6 +14,11 @@ struct memory_tier {
 	struct list_head list;
 	/* list of all memory types part of this tier */
 	struct list_head memory_types;
+	/*
+	 * By default all tiers have a weight of 1, which means they
+	 * follow the default standard allocation.
+	 */
+	unsigned char interleave_weight[MAX_NUMNODES];
 	/*
 	 * start value of abstract distance. memory tier maps
 	 * an abstract distance range,
@@ -146,8 +151,72 @@ static ssize_t nodelist_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(nodelist);
 
+static ssize_t interleave_weight_show(struct device *dev,
+				      struct device_attribute *attr,
+				      char *buf)
+{
+	int ret = 0;
+	struct memory_tier *tier = to_memory_tier(dev);
+	int node;
+	int count = 0;
+
+	down_read(&memory_tier_sem);
+	for_each_online_node(node) {
+		if (count > 0)
+			ret += sysfs_emit_at(buf, ret, ",");
+		ret += sysfs_emit_at(buf, ret, "%d:%d", node,
+				     tier->interleave_weight[node]);
+		count++;
+	}
+	up_read(&memory_tier_sem);
+	ret += sysfs_emit_at(buf, ret, "\n");
+
+	return ret;
+}
+
+static ssize_t interleave_weight_store(struct device *dev,
+				       struct device_attribute *attr,
+				       const char *buf, size_t size)
+{
+	unsigned char weight;
+	unsigned int from_node;
+	int ret = 0;
+	struct memory_tier *tier;
+
+	/* expected format is "<source node>:<weight>", e.g. "0:85" */
+	if (sscanf(buf, "%u:%hhu", &from_node, &weight) != 2)
+		return -EINVAL;
+
+	if (from_node >= MAX_NUMNODES || !node_online(from_node))
+		return -EINVAL;
+
+	if (weight > MAX_TIER_INTERLEAVE_WEIGHT)
+		return -EINVAL;
+
+	down_write(&memory_tier_sem);
+	tier = to_memory_tier(dev);
+	if (tier)
+		tier->interleave_weight[from_node] = weight;
+	else
+		ret = -ENODEV;
+	up_write(&memory_tier_sem);
+
+	return ret ? ret : size;
+}
+static DEVICE_ATTR_RW(interleave_weight);
+
 static struct attribute *memtier_dev_attrs[] = {
 	&dev_attr_nodelist.attr,
+	&dev_attr_interleave_weight.attr,
 	NULL
 };
@@ -239,6 +308,72 @@ static struct memory_tier *__node_get_memory_tier(int node)
 					  lockdep_is_held(&memory_tier_sem));
 }
 
+unsigned char memtier_get_node_weight(int from_node, int target_node,
+				      nodemask_t *pol_nodes)
+{
+	struct memory_tier *tier;
+	unsigned char tier_weight, node_weight = 1;
+	int tier_nodes;
+	nodemask_t tier_nmask, tier_and_pol;
+
+	/*
+	 * If the lock is already held, revert to a low weight temporarily.
+	 * This reverts interleave behavior to basic interleave; it only
+	 * happens while weights are being updated or during init.
+	 */
+	if (!down_read_trylock(&memory_tier_sem))
+		return 1;
+
+	tier = __node_get_memory_tier(target_node);
+	if (tier) {
+		tier_nmask = get_memtier_nodemask(tier);
+		nodes_and(tier_and_pol, tier_nmask, *pol_nodes);
+		tier_nodes = nodes_weight(tier_and_pol);
+		tier_weight = tier->interleave_weight[from_node];
+		/* divide tier weight by number of nodes, take the ceiling */
+		node_weight = tier_weight / tier_nodes;
+		node_weight += (tier_weight % tier_nodes) ? 1 : 0;
+	}
+	up_read(&memory_tier_sem);
+	return node_weight;
+}
+
+unsigned int memtier_get_total_weight(int from_node, nodemask_t *pol_nodes)
+{
+	unsigned int weight = 0;
+	struct memory_tier *tier;
+	unsigned int min = nodes_weight(*pol_nodes);
+	int node;
+	nodemask_t tier_nmask, tier_and_pol;
+	int tier_nodes;
+	unsigned int tier_weight;
+
+	/*
+	 * If the lock is already held, revert to a low weight temporarily.
+	 * This reverts interleave behavior to basic interleave; it only
+	 * happens while weights are being updated or during init.
+	 */
+	if (!down_read_trylock(&memory_tier_sem))
+		return nodes_weight(*pol_nodes);
+
+	for_each_node_mask(node, *pol_nodes) {
+		tier = __node_get_memory_tier(node);
+		if (!tier) {
+			weight += 1;
+			continue;
+		}
+		tier_nmask = get_memtier_nodemask(tier);
+		nodes_and(tier_and_pol, tier_nmask, *pol_nodes);
+		tier_nodes = nodes_weight(tier_and_pol);
+		/* divide tier weight by number of nodes, take the ceiling */
+		tier_weight = tier->interleave_weight[from_node];
+		weight += tier_weight / tier_nodes;
+		weight += (tier_weight % tier_nodes) ? 1 : 0;
+	}
+	up_read(&memory_tier_sem);
+
+	return weight >= min ? weight : min;
+}
+
 #ifdef CONFIG_MIGRATION
 bool node_is_toptier(int node)
 {
@@ -490,8 +625,11 @@ static struct memory_tier *set_node_memory_tier(int node)
 	memtype = node_memory_types[node].memtype;
 	node_set(node, memtype->nodes);
 	memtier = find_create_memory_tier(memtype);
-	if (!IS_ERR(memtier))
+	if (!IS_ERR(memtier)) {
 		rcu_assign_pointer(pgdat->memtier, memtier);
+		memset(memtier->interleave_weight, 1,
+		       sizeof(memtier->interleave_weight));
+	}
 	return memtier;
 }
-- 
2.39.1