Date: Fri, 24 Jan 2025 10:48:16 -0500
From: Gregory Price
To: Matthew Wilcox
Cc: Joshua Hahn, hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com,
	rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
	akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
	dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
	dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
References: <20250115185854.1991771-1-joshua.hahnjy@gmail.com>
On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
> 
> I still don't get it.  You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down.  But you're
> not really talking about NUMA, are you?  You're talking about CXL.
> 
> And CXL is terrible for bandwidth.  I just ran the numbers.
> 
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
> 
> For each CXL lane, you take a lane of PCIe gen5 away.  So that's
> notionally 32Gbit/s, or 4GB/s per lane.  But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth.  On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
> 

From the original - the performance tests show considerable opportunity
in the scenarios where DRAM bandwidth is pressured - as you can either:

1) Lower the DRAM bandwidth pressure by offloading some cachelines to
   CXL - reducing latency on DRAM and reducing average latency overall.
   The latency cost on CXL lines gets amortized over all DRAM fetches
   no longer hitting stalls.

2) Under full-pressure scenarios (DRAM and CXL are saturated), the
   additional lanes / buffers provide more concurrent fetches - i.e.
   you're just doing more work (and avoiding going to storage).  This
   is the weaker of the two scenarios.

No one is proposing we switch the default policy to weighted interleave.
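For reference, here is a minimal sketch (not taken from the patch) of the
per-range mechanism being measured: mbind(2) with MPOL_WEIGHTED_INTERLEAVE.
It assumes a v6.9+ kernel, that MPOL_WEIGHTED_INTERLEAVE is uapi value 6
(guarded below in case numaif.h does not define it yet), and that per-node
weights have already been written to
/sys/kernel/mm/mempolicy/weighted_interleave/nodeN.  The node numbers are
illustrative (they mirror the XSBench topology quoted further down), and
this is roughly the shape of the "mbind weights" configuration referenced
in the Stream results below.

/* Build: cc wi_sketch.c -lnuma */
#define _GNU_SOURCE
#include <numaif.h>		/* mbind() wrapper from libnuma */
#include <sys/mman.h>
#include <stdio.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* uapi value on current kernels */
#endif

int main(void)
{
	unsigned long len = 1UL << 30;	/* 1 GiB bandwidth-bound buffer */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Spread just this range across node 0 (DRAM) and node 2 (CXL)
	 * in proportion to their sysfs weights, rather than changing
	 * the task-wide policy.
	 */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);
	if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");

	/* ... stream through buf here ... */
	munmap(buf, len);
	return 0;
}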
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda
Hardware: Single socket, multiple CXL memory expanders.

Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave) (GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x

Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave) (GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x

=====================================================================
Performance test - Stream
From - Gregory Price
Hardware: Single socket, single CXL expander

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times

Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind weights      : +2.5% to +4% (consistently better than DRAM)

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji
Hardware: Single socket, single CXL memory expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000

Summary: +19% over DRAM. +47% over default interleave.

> Putting page cache in the CXL seems like nonsense to me.  I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it.  But I just can't see the point of putting
> pagecache on CXL.

No one said anything about page cache - but it depends.

If you can keep your entire working set in-memory and on-CXL, as opposed
to swapping to disk - you win.  "Swapping to CXL" incurs a bunch of page
faults, which sounds like a lose.

However - the Stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc.) is at
best a wash:

Global weighting   : -6% to +4% (workload dependent)

But targeting specific regions can provide a modest bump:

mbind weights      : +2.5% to +4% (consistently better than DRAM)

~Gregory