Date: Wed, 2 Apr 2025 16:24:16 -0400
From: Johannes Weiner
To: Nhat Pham
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, sj@kernel.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, gourry@gourry.net,
	ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
	dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
	minchan@kernel.org, senozhatsky@chromium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the the original page's node for compressed data
Message-ID: <20250402202416.GE198651@cmpxchg.org>
References: <20250402191145.2841864-1-nphamcs@gmail.com>
	<20250402195741.GD198651@cmpxchg.org>

On Wed, Apr 02, 2025 at 01:09:29PM -0700, Nhat Pham wrote:
> On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner wrote:
> >
> > On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > > any policy for the allocation of memory for the compressed data,
> > > instead just adopting the memory policy of the task entering reclaim,
> > > or the default policy (prefer local node) if no such policy is
> > > specified. This can lead to several pathological behaviors in
> > > multi-node NUMA systems:
> > >
> > > 1. Systems with CXL-based memory tiering can encounter the following
> > >    inversion with zswap: the coldest pages demoted to the CXL tier
> > >    can return to the high tier when they are zswapped out, creating
> > >    memory pressure on the high tier.
> > >
> > > 2. Consider a direct reclaimer scanning nodes in order of allocation
> > >    preference. If it ventures into remote nodes, the memory it
> > >    compresses there should stay there. Trying to shift those contents
> > >    over to the reclaiming thread's preferred node further *increases*
> > >    its local pressure, and provoking more spills. The remote node is
> > >    also the most likely to refault this data again. This undesirable
> > >    behavior was pointed out by Johannes Weiner in [1].
> > >
> > > 3. For zswap writeback, the zswap entries are organized in
> > >    node-specific LRUs, based on the node placement of the original
> > >    pages, allowing for targeted zswap writeback for specific nodes.
> > >
> > >    However, the compressed data of a zswap entry can be placed on a
> > >    different node from the LRU it is placed on. This means that reclaim
> > >    targeted at one node might not free up memory used for zswap entries
> > >    in that node, but instead reclaiming memory in a different node.
> > >
> > > All of these issues will be resolved if the compressed data go to the
> > > same node as the original page. This patch encourages this behavior by
> > > having zswap pass the node of the original page to zsmalloc, and have
> > > zsmalloc prefer the specified node if we need to allocate new (zs)pages
> > > for the compressed data.
> > >
> > > Note that we are not strictly binding the allocation to the preferred
> > > node. We still allow the allocation to fall back to other nodes when
> > > the preferred node is full, or if we have zspages with slots available
> > > on a different node. This is OK, and still a strict improvement over
> > > the status quo:
> > >
> > > 1. On a system with demotion enabled, we will generally prefer
> > >    demotions over zswapping, and only zswap when pages have
> > >    already gone to the lowest tier. This patch should achieve the
> > >    desired effect for the most part.
> > >
> > > 2. If the preferred node is out of memory, letting the compressed data
> > >    going to other nodes can be better than the alternative (OOMs,
> > >    keeping cold memory unreclaimed, disk swapping, etc.).
> > >
> > > 3. If the allocation go to a separate node because we have a zspage
> > >    with slots available, at least we're not creating extra immediate
> > >    memory pressure (since the space is already allocated).
> > >
> > > 3. While there can be mixings, we generally reclaim pages in
> > >    same-node batches, which encourage zspage grouping that is more
> > >    likely to go to the right node.
> > >
> > > 4. A strict binding would require partitioning zsmalloc by node, which
> > >    is more complicated, and more prone to regression, since it reduces
> > >    the storage density of zsmalloc. We need to evaluate the tradeoff
> > >    and benchmark carefully before adopting such an involved solution.
> > >
> > > This patch does not fix zram, leaving its memory allocation behavior
> > > unchanged. We leave this effort to future work.
> >
> > zram's zs_malloc() calls all have page context. It seems a lot easier
> > to just fix the bug for them as well than to have two allocation APIs
> > and verbose commentary?
>
> I think the recompress path doesn't quite have the context at the callsite:
>
> static int recompress_slot(struct zram *zram, u32 index, struct page *page,
>                            u64 *num_recomp_pages, u32 threshold, u32 prio,
>                            u32 prio_max)
>
> Note that the "page" argument here is allocated by zram internally,
> and not the original page. We can get the original page's node by
> asking zsmalloc to return it when it returns the compressed data, but
> that's quite involved, and potentially requires further zsmalloc API
> change.

Yeah, that path currently allocates the target page on the node of
whoever is writing to the "recompress" file. I think it's fine to use
page_to_nid() on that one. It's no worse than the current behavior.

Add an /* XXX */ to recompress_store() and, should somebody care to
make that path generally NUMA-aware, they can do so without having to
garbage-collect dependencies in zsmalloc code.
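
For illustration only, and not kernel code: a minimal userspace sketch of
the "prefer the original page's node, fall back when it is full" policy
discussed in this thread. The names toy_node and toy_alloc_on_node are
hypothetical and exist only in this example. In the kernel, the preferred
node would come from page_to_nid() on the page being compressed, and the
fallback order would follow the allocator's own node ordering rather than
the simple ascending scan assumed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a NUMA node: it only tracks free pages. */
struct toy_node {
	unsigned long free_pages;
};

/*
 * Take one page from the preferred node if it has any left. Otherwise
 * fall back to the first node (in ascending id order, a simplification)
 * that still has a free page. Returns the id of the node the page came
 * from, or -1 if every node is exhausted.
 */
static int toy_alloc_on_node(struct toy_node *nodes, size_t nr_nodes,
			     int preferred_nid)
{
	size_t i;

	assert(nodes != NULL);
	assert(preferred_nid >= 0 && (size_t)preferred_nid < nr_nodes);

	/* Preferred node first: this is a preference, not a strict binding. */
	if (nodes[preferred_nid].free_pages > 0) {
		nodes[preferred_nid].free_pages--;
		return preferred_nid;
	}

	/* Preferred node is full: any node with space beats failing outright. */
	for (i = 0; i < nr_nodes; i++) {
		if (nodes[i].free_pages > 0) {
			nodes[i].free_pages--;
			return (int)i;
		}
	}

	return -1;
}
```

A caller would pass the node id of the original page as preferred_nid and
treat a different return value as "the data landed on another node", which
is the non-strict behavior the patch description argues is acceptable.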