From: Nhat Pham <nphamcs@gmail.com>
Date: Wed, 2 Apr 2025 13:25:39 -0700
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
To: Johannes Weiner
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, sj@kernel.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, gourry@gourry.net,
	ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
	dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
	minchan@kernel.org, senozhatsky@chromium.org
In-Reply-To: <20250402202416.GE198651@cmpxchg.org>
References: <20250402191145.2841864-1-nphamcs@gmail.com>
	<20250402195741.GD198651@cmpxchg.org>
	<20250402202416.GE198651@cmpxchg.org>
On Wed, Apr 2, 2025 at 1:24 PM Johannes Weiner wrote:
>
> On Wed, Apr 02, 2025 at 01:09:29PM -0700, Nhat Pham wrote:
> > On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner wrote:
> > >
> > > On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > > > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > > > any policy for the allocation of memory for the compressed data,
> > > > instead just adopting the memory policy of the task entering reclaim,
> > > > or the default policy (prefer local node) if no such policy is
> > > > specified. This can lead to several pathological behaviors in
> > > > multi-node NUMA systems:
> > > >
> > > > 1. Systems with CXL-based memory tiering can encounter the following
> > > >    inversion with zswap: the coldest pages demoted to the CXL tier
> > > >    can return to the high tier when they are zswapped out, creating
> > > >    memory pressure on the high tier.
> > > >
> > > > 2. Consider a direct reclaimer scanning nodes in order of allocation
> > > >    preference. If it ventures into remote nodes, the memory it
> > > >    compresses there should stay there. Trying to shift those contents
> > > >    over to the reclaiming thread's preferred node further *increases*
> > > >    its local pressure, and provokes more spills. The remote node is
> > > >    also the most likely to refault this data again. This undesirable
> > > >    behavior was pointed out by Johannes Weiner in [1].
> > > >
> > > > 3. For zswap writeback, the zswap entries are organized in
> > > >    node-specific LRUs, based on the node placement of the original
> > > >    pages, allowing for targeted zswap writeback for specific nodes.
> > > >
> > > >    However, the compressed data of a zswap entry can be placed on a
> > > >    different node from the LRU it is placed on. This means that reclaim
> > > >    targeted at one node might not free up memory used for zswap entries
> > > >    in that node, but instead reclaim memory in a different node.
> > > >
> > > > All of these issues will be resolved if the compressed data go to the
> > > > same node as the original page. This patch encourages this behavior by
> > > > having zswap pass the node of the original page to zsmalloc, and having
> > > > zsmalloc prefer the specified node if we need to allocate new (zs)pages
> > > > for the compressed data.
> > > >
> > > > Note that we are not strictly binding the allocation to the preferred
> > > > node. We still allow the allocation to fall back to other nodes when
> > > > the preferred node is full, or if we have zspages with slots available
> > > > on a different node. This is OK, and still a strict improvement over
> > > > the status quo:
> > > >
> > > > 1. On a system with demotion enabled, we will generally prefer
> > > >    demotions over zswapping, and only zswap when pages have
> > > >    already gone to the lowest tier. This patch should achieve the
> > > >    desired effect for the most part.
> > > >
> > > > 2. If the preferred node is out of memory, letting the compressed data
> > > >    go to other nodes can be better than the alternatives (OOMs,
> > > >    keeping cold memory unreclaimed, disk swapping, etc.).
> > > >
> > > > 3. If the allocation goes to a separate node because we have a zspage
> > > >    with slots available, at least we're not creating extra immediate
> > > >    memory pressure (since the space is already allocated).
> > > >
> > > > 4. While there can be mixing, we generally reclaim pages in
> > > >    same-node batches, which encourages zspage grouping that is more
> > > >    likely to go to the right node.
> > > >
> > > > 5. A strict binding would require partitioning zsmalloc by node, which
> > > >    is more complicated, and more prone to regression, since it reduces
> > > >    the storage density of zsmalloc. We need to evaluate the tradeoff
> > > >    and benchmark carefully before adopting such an involved solution.
> > > >
> > > > This patch does not fix zram, leaving its memory allocation behavior
> > > > unchanged. We leave this effort to future work.
> > >
> > > zram's zs_malloc() calls all have page context. It seems a lot easier
> > > to just fix the bug for them as well than to have two allocation APIs
> > > and verbose commentary?
> >
> > I think the recompress path doesn't quite have the context at the callsite:
> >
> > static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> >                            u64 *num_recomp_pages, u32 threshold, u32 prio,
> >                            u32 prio_max)
> >
> > Note that the "page" argument here is allocated by zram internally,
> > and not the original page. We can get the original page's node by
> > asking zsmalloc to return it when it returns the compressed data, but
> > that's quite involved, and potentially requires further zsmalloc API
> > changes.
>
> Yeah, that path currently allocates the target page on the node of
> whoever is writing to the "recompress" file.
>
> I think it's fine to use page_to_nid() on that one. It's no worse than
> the current behavior. Add an /* XXX */ to recompress_store() and
> should somebody care to make that path generally NUMA-aware they can
> do so without having to garbage-collect dependencies in zsmalloc code.

SGTM. I'll fix that.
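
Concretely, something like the below for the recompress path (untested
sketch; it assumes the new nid argument to zs_malloc() from this patch,
and the variable names are from my reading of the current
recompress_slot(), so treat them as illustrative):

	/*
	 * XXX: "page" here is zram's own buffer page, allocated in
	 * recompress_store(), so page_to_nid() reflects the node of
	 * whoever wrote to the "recompress" attribute, not the node
	 * of the originally compressed data. This is no worse than
	 * the current behavior; making this path properly NUMA-aware
	 * is left for future work.
	 */
	handle_new = zs_malloc(zram->mem_pool, comp_len_new,
			       GFP_NOIO | __GFP_NOWARN |
			       __GFP_HIGHMEM | __GFP_MOVABLE,
			       page_to_nid(page));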
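
For reference, the zsmalloc side of the preference is deliberately not
a strict binding. The rough idea (a sketch, not the exact diff) is to
route new zspage memory through alloc_pages_node() with the caller's
preferred node:

	/*
	 * Prefer nid for the backing pages, but don't pass
	 * __GFP_THISNODE: if the preferred node is full, spilling to
	 * a remote node beats failing the store. nid can also be
	 * NUMA_NO_NODE (no preference), which allocates locally.
	 */
	page = alloc_pages_node(nid, gfp, 0);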