Date: Tue, 31 Oct 2023 12:22:16 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: Gregory Price, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
    linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org,
    aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com,
    tim.c.chen@intel.com, dave.hansen@intel.com, shy828301@gmail.com,
    gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Message-ID: <20231031162216.GB3029315@cmpxchg.org>
References: <20231031003810.4532-1-gregory.price@memverge.com>
 <20231031152142.GA3029315@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > > This patchset implements weighted interleave and adds a new sysfs
> > > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > > >
> > > > The il_weight of a node is used by mempolicy to implement weighted
> > > > interleave when `numactl --interleave=...` is invoked. By default
> > > > the il_weight for a node is always 1, which preserves the default
> > > > round-robin interleave behavior.
> > > >
> > > > Interleave weights may be set from 0-100, and denote the number of
> > > > pages that should be allocated from the node when interleaving
> > > > occurs.
> > > >
> > > > For example, if a node's interleave weight is set to 5, 5 pages
> > > > will be allocated from that node before the next node is scheduled
> > > > for allocations.
> > >
> > > I find this semantic rather weird, TBH. First of all, why do you
> > > think it makes sense to have those weights global for all users?
> > > What if different applications have a different view on how to
> > > spread their interleaved memory?
> > >
> > > I do get that you might have different tiers with largely different
> > > runtime characteristics, but why would you want to interleave them
> > > into a single mapping and have hard-to-predict runtime behavior?
> > >
> > > [...]
> > > > In this way it becomes possible to set an interleaving strategy
> > > > that fits the available bandwidth for the devices available on
> > > > the system. An example system:
> > > >
> > > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > > Node 2 - CXL Memory, 64GB/s BW, on Node 0 root complex
> > > > Node 3 - CXL Memory, 64GB/s BW, on Node 1 root complex
> > > >
> > > > In this setup, the effective weights for nodes 0-3 for a task
> > > > running on Node 0 may be [60, 20, 10, 10].
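(A quick back-of-the-envelope check on those numbers - this is just an
illustrative sketch, not code from the patchset: normalizing the
bandwidth visible from node 0 gives roughly those proportions, and the
"N pages from a node before moving on" semantic can be simulated in a
few lines.)

```python
# Illustrative sketch only - not from the patchset. Derive interleave
# weights proportional to the bandwidth visible from a task on node 0,
# per the example system quoted above.
bandwidth = {0: 400, 1: 200, 2: 64, 3: 64}  # GB/s; node 1 is cross-socket

total = sum(bandwidth.values())
# Scale to a 0-100 range, matching the il_weight bounds.
weights = {node: round(100 * bw / total) for node, bw in bandwidth.items()}
print(weights)  # {0: 55, 1: 27, 2: 9, 3: 9} - close to [60, 20, 10, 10]

def interleave_sequence(weights, npages):
    """Weighted round-robin: `weight` consecutive pages are taken from
    each node before the next node is scheduled for allocations."""
    seq = []
    while len(seq) < npages:
        for node, weight in sorted(weights.items()):
            seq.extend([node] * weight)
    return seq[:npages]

# With a weight of 5 on node 0 and 1 elsewhere, node 0 gets 5 pages
# per round before nodes 1-3 get one each.
print(interleave_sequence({0: 5, 1: 1, 2: 1, 3: 1}, 8))
# [0, 0, 0, 0, 0, 1, 2, 3]
```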
> > > >
> > > > This spreads memory out across devices which all have different
> > > > latency and bandwidth attributes, in a way that can maximize the
> > > > available resources.
> > >
> > > OK, so why is this any better than not using any memory policy and
> > > relying on demotion to push out cold memory down the tier hierarchy?
> > >
> > > What is the actual real-life usecase, and what kind of benefits can
> > > you present?
> >
> > There are two things CXL gives you: additional capacity and
> > additional bus bandwidth.
> >
> > The promotion/demotion mechanism is good for the capacity usecase,
> > where you have a nice hot/cold gradient in the workingset and want
> > placement accordingly across faster and slower memory.
> >
> > The interleaving is useful when you have a flatter workingset
> > distribution and poorer access locality. In that case, the CPU
> > caches are less effective and the workload can be bus-bound. The
> > workload might fit entirely into DRAM, but concentrating it there is
> > suboptimal. Fanning it out in proportion to the relative performance
> > of each memory tier gives better results.
> >
> > We experimented with datacenter workloads on such machines last year
> > and found significant performance benefits:
> >
> > https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
>
> Thanks, this is a useful insight.
>
> > This hopefully also explains why it's a global setting. The usecase
> > is different from conventional NUMA interleaving, which is used as a
> > locality measure: spread shared data evenly between compute nodes.
> > This one isn't about locality - the CXL tier doesn't have local
> > compute. Instead, the optimal spread is based on hardware
> > parameters, which is a global property rather than a per-workload
> > one.
>
> Well, I am not convinced about that, TBH. Sure, it is probably a good
> fit for this specific CXL usecase, but it just doesn't fit many others
> I can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.
>
> Is there any specific reason for not having a new interleave
> interface which defines weights for the nodemask? Is this because the
> policy itself is very dynamic, or is this more driven by simplicity
> of use?

A downside of *requiring* weights to be paired with the mempolicy is
that it's then the application that has to figure out the weights
dynamically, instead of relying on a static host configuration. A
policy of "I want to be spread for optimal bus bandwidth" translates
between different hardware configurations, but the optimal weights
will vary depending on the type of machine a job runs on.

That doesn't mean there couldn't be usecases for having weights in the
policy as well, in other scenarios like the ones you allude to above.
It's just that so far such usecases haven't really materialized or
been spelled out concretely. Maybe we just want both - a global
default, and the ability to override it locally.

Could you elaborate on the "get what you pay for" usecase you
mentioned?
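For completeness, the static host configuration referred to above
would, with the sysfs interface proposed in this patchset, look
something like the following. This is a sketch only: the access class
index (access0) and the workload binary name are assumptions, and the
weight values depend on the machine.

```shell
# Sketch of a static host configuration using the proposed
# /sys/devices/system/node/nodeN/accessM/il_weight interface.
# Node numbering follows the example system from this thread;
# the access class index (access0) is an assumption.
echo 60 > /sys/devices/system/node/node0/access0/il_weight
echo 20 > /sys/devices/system/node/node1/access0/il_weight
echo 10 > /sys/devices/system/node/node2/access0/il_weight
echo 10 > /sys/devices/system/node/node3/access0/il_weight

# Any task that interleaves across these nodes then picks up the
# weights without carrying them in its own mempolicy:
numactl --interleave=0,1,2,3 ./bus_bound_workload  # hypothetical binary
```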