Date: Wed, 7 May 2025 12:38:18 -0400
From: Gregory Price <gourry@gourry.net>
To: rakie.kim@sk.com
Cc: joshua.hahnjy@gmail.com, akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	dan.j.williams@intel.com, ying.huang@linux.alibaba.com,
	kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in
 weighted interleave
References: <20250507093517.184-1-rakie.kim@sk.com>
In-Reply-To: <20250507093517.184-1-rakie.kim@sk.com>
On Wed, May 07, 2025 at 06:35:16PM +0900, rakie.kim@sk.com wrote:
> Hi Gregory, Joshua,
> 
> I hope this message finds you well. I'm writing to discuss a feature I
> believe would enhance the flexibility of the weighted interleave policy:
> support for per-socket weighting in multi-socket systems.
> 
> ---
> 
> While reviewing the early versions of the weighted interleave patches,
> I noticed that a source-aware weighting structure was included in v1:
> 
> https://lore.kernel.org/all/20231207002759.51418-1-gregory.price@memverge.com/
> 
> However, this structure was removed in a later version:
> 
> https://lore.kernel.org/all/20231209065931.3458-1-gregory.price@memverge.com/
> 
> Unfortunately, I was unable to participate in the discussion at that
> time, and I sincerely apologize for missing it.
> 
> From what I understand, there may have been valid reasons for removing
> the source-relative design, including:
> 
> 1. Increased complexity in mempolicy internals. Adding source awareness
>    introduces challenges around dynamic nodemask changes, task policy
>    sharing during fork(), mbind(), rebind(), etc.
> 
> 2. A lack of concrete, motivating use cases. At that stage, it might
>    have been more pragmatic to focus on a 1D flat weight array.
> 
> If there were additional reasons, I would be grateful to learn them.
> 

x. Task-local weights would have required additional syscalls, and there
   were too few active users to warrant the extra complexity.

y. NUMA interfaces don't capture cross-socket interconnect information,
   and as a result they actually hide the "true" bandwidth values from
   the perspective of a given socket.

As a result, mempolicy just isn't well positioned to deal with this
as-designed, and introducing per-task weights with the additional
extensions was a bridge too far.

Global weights are sufficient if you combine cpusets/core-pinning with a
nodemask that excludes cross-socket nodes (i.e. "don't use cross-socket
memory"). For workloads that do scale up to use both sockets and both
devices, you either want to spread the data out according to the global
weights or use region-specific (mbind) weighted interleave anyway.

> ---
> 
> Scenario 1: Adapt weighting based on the task's execution node
> 
> Many applications can achieve reasonable performance just by using the
> CXL memory on their local socket. However, most workloads do not pin
> tasks to a specific CPU node, and the current implementation does not
> adjust weights based on where the task is running.
> 

"Most workloads don't..." - but they can, and fairly cleanly, via
cgroups/cpusets.

> If per-source-node weighting were available, the following matrix could
> be used:
> 
>        0  1  2  3
>   0    3  0  1  0
>   1    0  3  0  1
> 
> This flexibility is currently not possible with a single flat weight
> array.

This can be done with a mempolicy that omits undesired nodes from the
nodemask - without requiring any changes.

> Scenario 2: Reflect relative memory access performance
> 
> Remote memory access (e.g., from node0 to node3) incurs a real bandwidth
> penalty. Ideally, weights should reflect this.
> For example:
> 
> Bandwidth-based matrix:
> 
>        0  1  2  3
>   0    6  3  2  1
>   1    3  6  1  2
> 
> Or DRAM + local CXL only:
> 
>        0  1  2  3
>   0    6  0  2  1
>   1    0  6  1  2
> 
> While scenario 1 is probably more common in practice, both can be
> expressed within the same design if per-socket weights are supported.
> 

The core issue here is actually that NUMA doesn't have a good way to
represent the cross-socket interconnect bandwidth - and the fact that it
abstracts all devices behind it (both DRAM and CXL). So reasoning about
this problem in terms of NUMA is trying to fit a square peg into a round
hole. I think it's the wrong tool - maybe we need a new one. I don't
know what that looks like.

> ---
> 
> Instead of removing the current sysfs interface or flat weight logic, I
> propose introducing an optional "multi" mode for per-socket weights.
> This would allow users to opt into source-aware behavior.
> (The name 'multi' is just an example and should be changed to a more
> appropriate name in the future.)
> 
> Draft sysfs layout:
> 
> /sys/kernel/mm/mempolicy/weighted_interleave/
>   +-- multi (bool: enable per-socket mode)
>   +-- node0 (flat weight for legacy/default mode)
>   +-- node_groups/
>       +-- node0_group/
>       |     +-- node0 (weight of node0 when running on node0)
>       |     +-- node1
>       +-- node1_group/
>             +-- node0
>             +-- node1
> 

This is starting to look like memory-tiers.c, which is largely useless
at the moment. Maybe we implement such logic in memory-tiers, and then
extend mempolicy to have a MPOL_MEMORY_TIER or MPOL_F_MEMORY_TIER?

That would give us better flexibility to design the mempolicy interface
without having to be bound by the NUMA infrastructure it presently
depends on. We can figure out how to collect cross-socket interconnect
information in memory-tiers, and see what issues we'll have with
engaging that information from the mempolicy/page allocator path.

You'll see that in very early versions of weighted interleave I
originally implemented it via memory-tiers.
You might look there for inspiration.

> 1. Compatibility: The proposal avoids breaking the current interface or
>    behavior and remains backward-compatible.
> 
> 2. Auto-tuning: Scenario 1 (local CXL + DRAM) likely works with minimal
>    change. Scenario 2 (bandwidth-aware tuning) would require more
>    development, and I would welcome Joshua's input on this.
> 
> 3. Zero weights: Currently the minimum weight is 1. We may want to allow
>    zero to fully support asymmetric exclusion.
> 

I think we need to explore different changes here - it has become fairly
clear when discussing tiering at LSFMM that NUMA is a dated abstraction
that is showing its limits here. Let's ask what information we want and
how to structure/interact with it first, before designing the sysfs
interface for it.

~Gregory
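P.S. To make the "global weights + restricted nodemask" point concrete,
here is a toy userspace model of weighted round-robin placement. This is
illustrative only - the node IDs, weight values, and the interleave()
helper are all made up for the sketch, and this is not the kernel's
actual allocator path:

```python
# Toy model of weighted interleave placement (NOT kernel code).
# weights plays the role of the global sysfs weights; nodemask plays
# the role of the task mempolicy's allowed-node mask.
from itertools import cycle
from collections import Counter

def interleave(weights, nodemask, npages):
    """Return the node chosen for each of npages allocations,
    round-robin weighted over the allowed nodes only."""
    order = []
    for node in sorted(nodemask):
        order.extend([node] * weights[node])
    spin = cycle(order)
    return [next(spin) for _ in range(npages)]

# Hypothetical 2-socket box: nodes 0/1 = DRAM, nodes 2/3 = CXL.
global_weights = {0: 3, 1: 3, 2: 1, 3: 1}

# Task pinned to socket 0 with a nodemask excluding cross-socket nodes:
pages = interleave(global_weights, {0, 2}, 8)
print(Counter(pages))  # -> Counter({0: 6, 2: 2}), a 3:1 local split
```

Restricting the nodemask to the local socket's nodes ({0, 2} here)
already reproduces the 3:1 DRAM:CXL split that scenario 1's matrix row
asks for, with no new interface.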
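P.S. For comparison, the per-source semantics the proposed "multi" mode
implies, in the same toy form. The matrix values are the ones from your
scenario 1; the lookup helper is hypothetical:

```python
# Toy model of the proposed per-source ("multi") weight matrix:
# the weight row used depends on the node the task is executing on.
# Matrix values are from scenario 1 of the RFC; the rest is made up.
matrix = {
    0: {0: 3, 1: 0, 2: 1, 3: 0},   # task executing on node 0
    1: {0: 0, 1: 3, 2: 0, 3: 1},   # task executing on node 1
}

def row_for_task(exec_node):
    """Select the weight row by execution node, dropping zero-weight
    nodes (the asymmetric exclusion from point 3)."""
    return {n: w for n, w in matrix[exec_node].items() if w > 0}

print(row_for_task(0))  # -> {0: 3, 2: 1}
```

Note that the row for node 0, once zero weights are dropped, is just
"nodemask {0, 2} with the existing global weights" - which is why the
current interface already covers this case.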