From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nhat Pham <nphamcs@gmail.com>
Date: Wed, 2 Apr 2025 13:09:29 -0700
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for
 compressed data
To: Johannes Weiner
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, yosry.ahmed@linux.dev,
 chengming.zhou@linux.dev, sj@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, gourry@gourry.net,
 ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
 dan.j.williams@intel.com, linux-cxl@vger.kernel.org, minchan@kernel.org,
 senozhatsky@chromium.org
In-Reply-To: <20250402195741.GD198651@cmpxchg.org>
References: <20250402191145.2841864-1-nphamcs@gmail.com>
 <20250402195741.GD198651@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner wrote:
>
> On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > any policy for the allocation of memory for the compressed data,
> > instead just adopting the memory policy of the task entering reclaim,
> > or the default policy (prefer local node) if no such policy is
> > specified. This can lead to several pathological behaviors in
> > multi-node NUMA systems:
> >
> > 1. Systems with CXL-based memory tiering can encounter the following
> >    inversion with zswap: the coldest pages demoted to the CXL tier
> >    can return to the high tier when they are zswapped out, creating
> >    memory pressure on the high tier.
> >
> > 2. Consider a direct reclaimer scanning nodes in order of allocation
> >    preference. If it ventures into remote nodes, the memory it
> >    compresses there should stay there. Trying to shift those contents
> >    over to the reclaiming thread's preferred node further *increases*
> >    its local pressure, provoking more spills. The remote node is
> >    also the most likely to refault this data again. This undesirable
> >    behavior was pointed out by Johannes Weiner in [1].
> >
> > 3. For zswap writeback, the zswap entries are organized in
> >    node-specific LRUs, based on the node placement of the original
> >    pages, allowing for targeted zswap writeback for specific nodes.
> >
> >    However, the compressed data of a zswap entry can be placed on a
> >    different node from the LRU it is placed on. This means that
> >    reclaim targeted at one node might not free up memory used for
> >    zswap entries in that node, but instead free up memory in a
> >    different node.
> >
> > All of these issues will be resolved if the compressed data go to the
> > same node as the original page. This patch encourages this behavior
> > by having zswap pass the node of the original page to zsmalloc, and
> > having zsmalloc prefer the specified node if we need to allocate new
> > (zs)pages for the compressed data.
> >
> > Note that we are not strictly binding the allocation to the preferred
> > node. We still allow the allocation to fall back to other nodes when
> > the preferred node is full, or if we have zspages with slots
> > available on a different node. This is OK, and still a strict
> > improvement over the status quo:
> >
> > 1. On a system with demotion enabled, we will generally prefer
> >    demotions over zswapping, and only zswap when pages have
> >    already gone to the lowest tier. This patch should achieve the
> >    desired effect for the most part.
> >
> > 2. If the preferred node is out of memory, letting the compressed
> >    data go to other nodes can be better than the alternative (OOMs,
> >    keeping cold memory unreclaimed, disk swapping, etc.).
> >
> > 3. If the allocation goes to a separate node because we have a zspage
> >    with slots available, at least we're not creating extra immediate
> >    memory pressure (since the space is already allocated).
> >
> > 4. While there can be some mixing, we generally reclaim pages in
> >    same-node batches, which encourages zspage grouping that is more
> >    likely to go to the right node.
> >
> > 5. A strict binding would require partitioning zsmalloc by node,
> >    which is more complicated, and more prone to regression, since it
> >    reduces the storage density of zsmalloc. We need to evaluate the
> >    tradeoff and benchmark carefully before adopting such an involved
> >    solution.
> >
> > This patch does not fix zram, leaving its memory allocation behavior
> > unchanged. We leave this effort to future work.
>
> zram's zs_malloc() calls all have page context. It seems a lot easier
> to just fix the bug for them as well than to have two allocation APIs
> and verbose commentary?

I think the recompress path doesn't quite have the context at the
callsite:

static int recompress_slot(struct zram *zram, u32 index, struct page *page,
                           u64 *num_recomp_pages, u32 threshold, u32 prio,
                           u32 prio_max)

Note that the "page" argument here is allocated by zram internally, and
is not the original page. We can get the original page's node by asking
zsmalloc to return it when it returns the compressed data, but that's
quite involved, and potentially requires further zsmalloc API changes.
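
For illustration only, here is a rough sketch of the "prefer the
original page's node, but allow fallback" idea described above. This is
not the actual patch: zs_alloc_page_prefer_node() and the extra nid
argument to zs_malloc() are hypothetical names, just to show the shape
of the change on both sides.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, sketch only: prefer @nid for a newly allocated
 * zspage page, but do NOT pass __GFP_THISNODE, so the allocation can
 * still fall back to other nodes when @nid is full.
 */
static struct page *zs_alloc_page_prefer_node(gfp_t gfp, int nid)
{
	return alloc_pages_node(nid, gfp, 0);
}

/*
 * On the zswap store side, the preferred node would simply be the node
 * of the original (uncompressed) page, e.g. (hypothetical extra arg):
 *
 *	int nid = page_to_nid(page);
 *	handle = zs_malloc(pool, dlen, gfp, nid);
 */

Because the helper only expresses a preference (no __GFP_THISNODE), the
out-of-memory and "existing zspage with free slots on another node"
cases described in the commit message keep working as before.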