Date: Sat, 24 Jan 2026 19:35:43 -0800 (PST)
From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron, Joshua Hahn, Raghavendra K T, "Rao, Bharata Bhasker", SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan
Cc: linux-mm@kvack.org
Subject: [Linux Memory Hotness and Promotion] Notes from January 15, 2026
Message-ID: <684fb18e-6367-a043-3ee5-dd435da30b91@google.com>
Hi everybody,

Here are the notes from the last Linux Memory Hotness and Promotion call that happened on Thursday, January 15.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend the call, as well as to keep the conversation going in between meetings.

----->o-----

We started by chatting about benchmarks and workloads that are useful for evaluating different approaches to native memory tiering support in the kernel.  Wei Xu noted that there has been heavy reliance on memcached and specjbb, which have been useful for evaluating policy decisions.  I brought up the previous discussion in this series of meetings back in November where Jonathan noted that memcached was not ideal because it's too predictable.

Yiannis reiterated that they've used a mixture of workloads that were oversubscribed; it's never a single workload.  He questioned how much this represents real production-like scenarios, however, and wanted the community to provide direction on how to evaluate this.  Jonathan said oversubscribed, multi-tenant containers would give temporal variation -- he gave the example of webservers that are sometimes busy and sometimes idle.  This may give the dynamics and variability needed to evaluate different approaches.

Gregory compared this to using microbenchmarks that initially get scheduled on CXL memory nodes so that the hot memory must then be found and promoted to the top tier; we can simply randomize which pages are actually located on the CXL device (see the sketch below).  He suggested that this was more functional testing than production-representative workloads.  His plan is to run workloads with synthetic data and then share that with the group as something that more closely resembles real-world workloads.

Gregory noted that preventing churn, however, is the hard thing to actually measure in situations where there is more warm/hot memory than top tier memory.  He suggested monitoring bandwidth stats: back off if bandwidth is high across all the devices.  I further suggested that performance consistency is likely more important than small slices of time with optimal performance that may turn out to be inconsistent.  We want to avoid always optimizing memory locality only to take it away later when another workload schedules or spikes in memory usage.

Gregory connected that to general QoS, which has two forms: limiting the variance and minimizing the variance.  Minimizing the variance can degenerate very quickly; limiting the variance is likely the goal.  Jonathan said that, today, the guard rail for limiting the variance is page faulting, which is pretty slow.  Gregory said we lack this on multi-tenant systems because we don't have a sense of reclaim fairness, so we can't limit the downside of any given workload.
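Tying back to Gregory's functional-testing idea above, here is a rough sketch of how a microbenchmark could randomize which of its pages start out on a CXL node using move_pages(2).  The node id and the 50/50 split are illustrative assumptions on my part, not something discussed on the call:

	/*
	 * Sketch: scatter a random subset of a buffer's pages onto a
	 * (presumed) CXL node so that a promotion policy has to find the
	 * hot ones.  CXL_NODE and the 50% split are assumptions.
	 * Build with: gcc -O2 sketch.c -lnuma
	 */
	#include <numaif.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define CXL_NODE 1	/* assumed node id of the CXL memory */

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		size_t nr_pages = 4096, nr_moved = 0;
		char *buf;

		if (posix_memalign((void **)&buf, page_size, nr_pages * page_size))
			return 1;
		memset(buf, 0, nr_pages * page_size);	/* fault everything in */

		void **pages = calloc(nr_pages, sizeof(void *));
		int *nodes = calloc(nr_pages, sizeof(int));
		int *status = calloc(nr_pages, sizeof(int));

		srand(0);
		for (size_t i = 0; i < nr_pages; i++) {
			if (rand() % 2)		/* randomize which pages land on CXL */
				continue;
			pages[nr_moved] = buf + i * page_size;
			nodes[nr_moved] = CXL_NODE;
			nr_moved++;
		}

		/* pid 0: move pages of the calling process */
		if (move_pages(0, nr_moved, pages, nodes, status, MPOL_MF_MOVE) < 0)
			perror("move_pages");

		/* ... run the access pattern under test and watch promotions ... */
		return 0;
	}

As Gregory characterized it, this is scaffolding for functional testing of promotion rather than a production-representative workload.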
We'll need this kind of reclaim fairness to provide a consistent quality of service.

Wei said that when we do promotion, the allocation does not trigger direct reclaim, so if there is no space in top tier memory we simply fail the promotion.  Promotion itself will not cause this thrashing; the question is whether we want to aggressively demote to make room for promotion.  He preferred to focus on getting a promotion story upstream beyond just today's NUMA Balancing support.

Gregory said that the promotion rate is a function of the demotion rate when capacity is fully used; thus, promotion will not occur if top tier capacity is fully utilized.  Demotion will only occur if new allocation pressure happens.  So there is a guard rail, but the demotion policy has to be placed on the user.  If there is some amount of proactive demotion, then that is the possible rate of churn.  Capturing this as part of the story is imperative; we won't be able to sell this without a comprehensive story.

----->o-----

I suggested that performance consistency is imperative; we don't want to free a lot of top tier memory and then suddenly promote everything from the bottom tier, only to find, when we land another workload, that the performance of the first workload ends up tanking.  Wei said that we need to ensure that we only demote memory that matches the coldness definition that the user asserts.  Gregory noted that, in this example, the original workload is optimized for some level of consistent upward performance, and the second workload necessarily ends up in the opposite situation.

There was discussion about per-node memory limits, an idea that has always been met with resistance upstream.  Getting performance consistency for a single workload across hosts, regardless of other tenants on the system, has required a static allocation for each memory tier.  I suggested that we could proactively demote or avoid promotion of warm memory to ensure that there isn't transient performance improvement for a customer VM based on other VMs that were running on the same host.  This could be handled with userspace policy through a memory.reclaim-like interface for demotion.
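As a minimal sketch of what such a userspace policy could look like today, the snippet below periodically writes to the existing cgroup v2 memory.reclaim file, which (with demotion enabled, e.g. via /sys/kernel/mm/numa/demotion_enabled) tends to demote top-tier pages rather than discard them.  The cgroup path, the 256M step, and the 60-second pacing are assumptions for illustration; a tier- or node-scoped selector would need the kind of interface extension being discussed here:

	/*
	 * Sketch of a userspace proactive-demotion loop built on the
	 * cgroup v2 memory.reclaim interface.  Path and amounts are
	 * illustrative assumptions, not a proposed interface.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path = "/sys/fs/cgroup/workload.slice/memory.reclaim";
		const char *req = "256M";	/* amount to try to reclaim/demote per pass */

		for (;;) {
			int fd = open(path, O_WRONLY);
			if (fd < 0) {
				perror("open memory.reclaim");
				return 1;
			}
			/* the write may fail if the full amount cannot be reclaimed */
			if (write(fd, req, strlen(req)) < 0)
				perror("write memory.reclaim");
			close(fd);
			sleep(60);	/* pacing is policy; 60s is arbitrary here */
		}
	}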
----->o-----

Next meeting will be on Thursday, January 29 at 8:30am PST (UTC-8); everybody is welcome: https://meet.google.com/jak-ytdx-hnm

Topics for the next meeting:

 - updates on Bharata's patch series with new benchmarks and consolidation of tunables
 - avoiding noisy neighbor situations, especially for cloud workloads, based on the amount of hot memory that may saturate the top tier
 - later: Gregory's analysis of more production-like workloads
 - discuss a generalized subsystem for providing bandwidth information independent of the underlying platform, ideally through resctrl; otherwise, utilizing bandwidth information will be challenging
   + preferably this bandwidth monitoring is not per NUMA node but rather per slow and fast memory
 - similarly, discuss a generalized subsystem for providing memory hotness information
 - determine a minimal viable upstream opportunity to optimize for tiering that is extensible for future use cases and optimizations
   + extensible for multiple tiers
   + suggestion: limited to 8 bits per page to start, add a precision mode later
   + limited to 64 bits per page as a ceiling, may be less
   + must be possible to disable with no memory or performance overhead
 - update on non-temporal stores enlightenment for memory tiering
 - enlightening migrate_pages() for hardware assists and how this work will be charged to userspace, including for memory compaction

Please let me know if you'd like to propose additional topics for discussion, thank you!