Date: Tue, 31 Oct 2023 11:21:42 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: Gregory Price, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
    linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org,
    aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com,
    tim.c.chen@intel.com, dave.hansen@intel.com, shy828301@gmail.com,
    gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Message-ID: <20231031152142.GA3029315@cmpxchg.org>
References: <20231031003810.4532-1-gregory.price@memverge.com>

On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > This patchset implements weighted interleave and adds a new sysfs
> > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> >
> > The il_weight of a node is used by mempolicy to implement weighted
> > interleave when `numactl --interleave=...` is invoked. By default
> > il_weight for a node is always 1, which preserves the default round
> > robin interleave behavior.
> >
> > Interleave weights may be set from 0-100, and denote the number of
> > pages that should be allocated from the node when interleaving
> > occurs.
> >
> > For example, if a node's interleave weight is set to 5, 5 pages
> > will be allocated from that node before the next node is scheduled
> > for allocations.
>
> I find this semantic rather weird TBH. First of all why do you think it
> makes sense to have those weights global for all users? What if
> different applications have different view on how to spred their
> interleaved memory?
>
> I do get that you might have a different tiers with largerly different
> runtime characteristics but why would you want to interleave them into a
> single mapping and have hard to predict runtime behavior?
>
> [...]
> > In this way it becomes possible to set an interleaving strategy
> > that fits the available bandwidth for the devices available on
> > the system. An example system:
> >
> > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> >
> > In this setup, the effective weights for nodes 0-3 for a task
> > running on Node 0 may be [60, 20, 10, 10].
> >
> > This spreads memory out across devices which all have different
> > latency and bandwidth attributes at a way that can maximize the
> > available resources.
>
> OK, so why is this any better than not using any memory policy rely
> on demotion to push out cold memory down the tier hierarchy?
>
> What is the actual real life usecase and what kind of benefits you can
> present?

There are two things CXL gives you: additional capacity and additional
bus bandwidth.

The promotion/demotion mechanism is good for the capacity usecase,
where you have a nice hot/cold gradient in the workingset and want
placement accordingly across faster and slower memory.

The interleaving is useful when you have a flatter workingset
distribution and poorer access locality. In that case, the CPU caches
are less effective and the workload can be bus-bound. The workload
might fit entirely into DRAM, but concentrating it there is
suboptimal. Fanning it out in proportion to the relative performance
of each memory tier gives better results.

We experimented with datacenter workloads on such machines last year
and found significant performance benefits:

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

This hopefully also explains why it's a global setting. The usecase is
different from conventional NUMA interleaving, which is used as a
locality measure: spread shared data evenly between compute
nodes. This one isn't about locality - the CXL tier doesn't have local
compute. Instead, the optimal spread is based on hardware parameters,
which is a global property rather than a per-workload one.
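
To make "fanning out in proportion to performance" concrete, here is a
rough userspace sketch, not the patchset's code. It assumes the
semantics from the cover letter (a node with weight w receives w
consecutive pages before the next node is scheduled). Both function
names are made up for illustration, and the bandwidth figures are the
ones from the quoted example as seen from Node 0; treating Node 3 as a
flat 64GB/s is a simplification.

```python
def weighted_interleave(weights, npages):
    """Yield the node index chosen for each of npages allocations.

    Assumed semantics: a node with weight w gets w consecutive pages
    before the next node is scheduled; weight 0 skips the node.
    """
    if not any(w > 0 for w in weights):
        raise ValueError("at least one node needs a non-zero weight")
    placed = 0
    while placed < npages:
        for node, weight in enumerate(weights):
            for _ in range(weight):
                if placed == npages:
                    return
                yield node
                placed += 1


def bandwidth_weights(bandwidths, total=100):
    """Hypothetical helper: scale per-node bandwidth to integer weights.

    Each weight is the node's share of the summed bandwidth, rounded to
    the nearest integer, so the result only sums to roughly `total`.
    """
    bw_sum = sum(bandwidths)
    return [round(total * bw / bw_sum) for bw in bandwidths]


if __name__ == "__main__":
    # Bandwidth in GB/s for nodes 0-3 as seen from a task on Node 0.
    weights = bandwidth_weights([400, 200, 64, 64])
    pages = list(weighted_interleave(weights, 1000))
    for node, weight in enumerate(weights):
        print(f"node {node}: weight {weight}, pages {pages.count(node)}")
```

A purely bandwidth-proportional split is only one way to pick weights;
the [60, 20, 10, 10] in the cover letter is a different choice for the
same machine, and which one is better would have to be measured.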