Date: Thu, 25 Sep 2025 16:00:58 +0100
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Yiannis Nikolakopoulos
Cc: Wei Xu, David Rientjes, Gregory Price, Matthew Wilcox, Bharata B Rao,
 Adam Manzanares, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Message-ID: <20250925160058.00002645@huawei.com>
In-Reply-To: <5A7E0646-0324-4463-8D93-A1105C715EB3@gmail.com>
References: <20250910144653.212066-1-bharata@amd.com>
 <7e3e7327-9402-bb04-982e-0fb9419d1146@google.com>
 <20250917174941.000061d3@huawei.com>
 <5A7E0646-0324-4463-8D93-A1105C715EB3@gmail.com>

On Thu, 25 Sep 2025 16:03:46 +0200
Yiannis Nikolakopoulos wrote:

Hi Yiannis,

> > On 17 Sep 2025, at 18:49, Jonathan Cameron wrote:
> >
> > On Tue, 16 Sep 2025 17:30:46 -0700
> > Wei Xu wrote:
> >
> >> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes wrote:
> >>>
> >>> On Wed, 10 Sep 2025, Gregory Price wrote:
> >>>
> >>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> >>>>>> This patchset introduces a new subsystem for hot page tracking
> >>>>>> and promotion (pghot) that consolidates memory access information
> >>>>>> from various sources and enables centralized promotion of hot
> >>>>>> pages across memory tiers.
> >>>>>
> >>>>> Just to be clear, I continue to believe this is a terrible idea and we
> >>>>> should not do this. If systems will be built with CXL (and given the
> >>>>> horrendous performance, I cannot see why they would be), the kernel
> >>>>> should not be migrating memory around like this.
> >>>>
> >>>> I've been considering this problem from the opposite approach since LSFMM.
> >>>>
> >>>> Rather than decide how to move stuff around, what if instead we just
> >>>> decide not to ever put certain classes of memory on CXL. Right now, so
> >>>> long as CXL is in the page allocator, it's the wild west - any page can
> >>>> end up anywhere.
> >>>>
> >>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
> >>>> workloads to show local CXL expansion is valuable and performant enough
> >>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
> >>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
> >>>> CXL, but allows any given user-driven page allocation (including page
> >>>> cache, file, and anon mappings) to land there.
> >>>>
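
(As an aside for anyone wanting to reproduce the ZONE_MOVABLE-only setup
described above: it largely comes down to how the expander memory blocks are
onlined. Below is a minimal userspace sketch; the block number 42 and the
hand-rolled sysfs write are purely illustrative, and in practice this is
normally left to udev auto-online rules or daxctl rather than done by hand.)

/*
 * Online a hot-added memory block as ZONE_MOVABLE so that unmovable
 * GFP_KERNEL allocations (slab metadata etc.) can never land on it,
 * while movable user pages (page cache, anon) still can.
 *
 * Illustrative only: memory block 42 is made up.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        const char *state = "/sys/devices/system/memory/memory42/state";
        FILE *f = fopen(state, "w");

        if (!f) {
                perror(state);
                return EXIT_FAILURE;
        }
        /* "online_movable" restricts this block to ZONE_MOVABLE */
        if (fprintf(f, "online_movable\n") < 0 || fclose(f) != 0) {
                perror("online_movable");
                return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
}
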
> >>>
> [snip]
> >>> There's also some feature support that is possible with these CXL memory
> >>> expansion devices that have started to pop up in labs that can also
> >>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
> >>> chime in as well.
> >>>
> >>> This topic seems due for an alignment session as well, so will look to get
> >>> that scheduled in the coming weeks if people are up for it.
> >>
> >> Our experience is that workloads in hyper-scale data centers such as
> >> Google often have significant cold memory. Offloading this to CXL memory
> >> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> >> hardware compression), can be a practical approach to reduce overall
> >> TCO. Page promotion and demotion are then critical for such a tiered
> >> memory system.
> >
> > For the hardware compression devices, how are you dealing with capacity
> > variation / overcommit?
> I understand that this is indeed one of the key questions from the upstream
> kernel's perspective.
> So, I am jumping in to answer w.r.t. what we do in ZeroPoint; obviously I
> cannot speak for other solutions/deployments. However, our HW interface
> follows existing open specifications from OCP [1], so what I am describing
> below is more widely applicable.
>
> At a very high level, the way our HW works is that the DPA is indeed
> overcommitted. Then, there is a control plane over CXL.io (PCIe) which
> exposes the real remaining capacity, as well as MSI-X interrupts that raise
> warnings when the capacity crosses certain configurable thresholds.
>
> Last year I presented this interface at LSF/MM [2]. Based on the feedback I
> got there, we have an early prototype that acts as the *last* memory tier
> before reclaim (kind of a "compressed tier in lieu of discard", as was
> suggested to me by Dan).
>
> What is different from standard tiering is that the control plane is
> checked on demotion to make sure there is still capacity left. If not, the
> demotion fails. While this seems stable so far, a missing piece is to
> ensure that this tier is mainly written by demotions and not arbitrary
> kernel allocations (at least as a starting point). I want to explore how
> mempolicies can help there, or something of the sort that Gregory described.
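
Just to check I follow the shape of that: is it roughly the sketch below, i.e.
a device-supplied capacity callback consulted before the folio is actually
migrated, with the demotion failing cleanly otherwise? The struct folio /
folio_size() bits are the existing kernel ones, but every other name here is
invented purely for illustration - nothing like this exists upstream today.

#include <linux/device.h>
#include <linux/migrate.h>
#include <linux/mm.h>

/* Hypothetical ops a compressed-memory expander driver might provide. */
struct compressed_tier_ops {
        /* True if the device still has @bytes of real backing capacity. */
        bool (*has_capacity)(struct device *dev, size_t bytes);
};

/*
 * Ask the device's control plane (CXL.io mailbox / threshold state) before
 * demoting, so an overcommitted device can refuse the folio up front rather
 * than running out of real capacity asynchronously later on.
 */
static int demote_folio_to_compressed_tier(struct folio *folio,
                                           struct device *dev,
                                           const struct compressed_tier_ops *ops)
{
        if (!ops->has_capacity(dev, folio_size(folio)))
                return -ENOSPC; /* demotion fails; reclaim takes its normal path */

        /* ... hand the folio to the usual migrate_pages() demotion path ... */
        return 0;
}
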
>
> This early prototype still needs quite some work in order to find the right
> abstractions. Hopefully, I will be able to push an RFC in the near future
> (a couple of months).
>
> > Whilst there have been some discussions on that, without a backing store
> > of flash or similar it seems challenging to use compressed memory in a
> > tiering system (i.e. as 'normalish' memory) unless you don't mind
> > occasionally and unexpectedly running out of memory (in nasty async ways
> > as dirty cache lines get written back).
> There are several things that may be done on the device side. For now, I
> think the kernel should be unaware of these. But with what I described
> above, the goal is to have the capacity thresholds configured in a way
> that we can absorb the occasional dirty cache lines that are written back.

In the worst case they are far from occasional. It's not hard to imagine a
malicious program that ensures that all L3 in a system (say 256MiB+) is full
of cache lines from the far compressed memory, all of which are changed in a
fashion that makes the allocation much less compressible. If you are doing
compression at cache line granularity that's not so bad, because only a
256MiB margin would be needed. If the system in question is doing large block
size compression, say 4KiB, then we have a 64x write amplification multiplier
(each dirtied 64-byte cache line can force a whole 4KiB block to be
recompressed: 4096 / 64 = 64). If such a program streams over memory, the
lines evicted as new lines are fetched have all been made much less
compressible.

Add an accelerator (say DPDK or other zero copy into userspace buffers) into
the mix and you have a mess. You'll need to be extremely careful with what
goes into this compressed memory, or hold enormous buffer capacity against
fast changes in compressibility.

The key point is that all software is potentially malicious (sometimes
accidentally so ;)

Now, if we can put this into a special pool where it is acceptable to drop
the writes and return poison (so the application crashes) then that may be
fine. Or block writes. Running compressed memory as read-only CoW is one way
to avoid this problem.

> >
> > Or do you mean zswap type use with a hardware offload of the actual
> > compression?
> I would categorize this as a completely different discussion (and product
> line for us).
>
> [1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf
> [2] https://www.youtube.com/watch?v=tXWEbaJmZ_s
>
> Thanks,
> Yiannis
>
> PS: Sending from a personal email address to avoid issues with
> confidentiality footers of the corporate domain.