From: Nhat Pham <nphamcs@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, sj@kernel.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, gourry@gourry.net,
	ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
	dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
	minchan@kernel.org, senozhatsky@chromium.org
Subject: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
Date: Wed, 2 Apr 2025 12:11:45 -0700
Message-ID: <20250402191145.2841864-1-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Currently, zsmalloc, zswap's backend memory allocator, does not enforce
any policy for allocating the memory that holds the compressed data. It
simply adopts the memory policy of the task entering reclaim, or the
default policy (prefer the local node) if no such policy is specified.
This can lead to several pathological behaviors in multi-node NUMA
systems:

1. Systems with CXL-based memory tiering can encounter the following
   inversion with zswap: the coldest pages demoted to the CXL tier can
   return to the high tier when they are zswapped out, creating memory
   pressure on the high tier.

2. Consider a direct reclaimer scanning nodes in order of allocation
   preference. If it ventures into remote nodes, the memory it
   compresses there should stay there. Trying to shift those contents
   over to the reclaiming thread's preferred node further *increases*
   its local pressure, provoking more spills. The remote node is also
   the most likely to refault this data again. This undesirable
   behavior was pointed out by Johannes Weiner in [1].

3. For zswap writeback, zswap entries are organized in node-specific
   LRUs, based on the node placement of the original pages, allowing
   for targeted zswap writeback on specific nodes. However, the
   compressed data of a zswap entry can end up on a different node from
   the LRU the entry is on. This means that reclaim targeted at one
   node might not free up the memory used for zswap entries on that
   node, but instead free up memory on a different node.

All of these issues will be resolved if the compressed data goes to the
same node as the original page. This patch encourages this behavior by
having zswap pass the node of the original page to zsmalloc, and having
zsmalloc prefer the specified node if we need to allocate new (zs)pages
for the compressed data.

Note that we are not strictly binding the allocation to the preferred
node. We still allow the allocation to fall back to other nodes when
the preferred node is full, or if we have zspages with slots available
on a different node. This is OK, and still a strict improvement over
the status quo:

1. On a system with demotion enabled, we will generally prefer
   demotions over zswapping, and only zswap when pages have already
   gone to the lowest tier. This patch should achieve the desired
   effect for the most part.

2. If the preferred node is out of memory, letting the compressed data
   go to other nodes can be better than the alternatives (OOMs, keeping
   cold memory unreclaimed, disk swapping, etc.).

3. If the allocation goes to a separate node because we have a zspage
   with slots available there, at least we are not creating extra
   immediate memory pressure (since the space is already allocated).

4. While there can be some mixing, we generally reclaim pages in
   same-node batches, which encourages zspage grouping that is more
   likely to go to the right node.

5. A strict binding would require partitioning zsmalloc by node, which
   is more complicated and more prone to regression, since it reduces
   the storage density of zsmalloc. We need to evaluate the tradeoff
   and benchmark carefully before adopting such an involved solution.
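
For reference, this is roughly what the allocation path looks like with
the patch applied, condensed from the diff below (indentation denotes
the call chain when a new zspage has to be allocated; this is an
illustration only, not compilable code):

  zswap_compress(page, ...)
    zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page))
      zs_zpool_malloc(pool, size, gfp, handle, nid)
        zs_malloc_node(pool, size, gfp, &nid)
          alloc_zspage(pool, class, gfp, nid)
            alloc_zpdesc(gfp, nid)       /* alloc_pages_node(*nid, gfp, 0) */

  zs_malloc(pool, size, gfp)             /* zram path, behavior unchanged */
    zs_malloc_node(pool, size, gfp, NULL)
      ...
        alloc_zpdesc(gfp, NULL)          /* alloc_page(gfp) */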

This patch does not fix zram, leaving its memory allocation behavior
unchanged. We leave this effort to future work.

[1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/

Suggested-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/zpool.h |  4 +--
 mm/zpool.c            |  8 +++---
 mm/zsmalloc.c         | 60 ++++++++++++++++++++++++++++++-------------
 mm/zswap.c            |  2 +-
 4 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index 52f30e526607..697525cb00bd 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
 void zpool_destroy_pool(struct zpool *pool);
 
 int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
-		unsigned long *handle);
+		unsigned long *handle, const int nid);
 
 void zpool_free(struct zpool *pool, unsigned long handle);
 
@@ -64,7 +64,7 @@ struct zpool_driver {
	void (*destroy)(void *pool);
 
	int (*malloc)(void *pool, size_t size, gfp_t gfp,
-			unsigned long *handle);
+			unsigned long *handle, const int nid);
	void (*free)(void *pool, unsigned long handle);
 
	void *(*obj_read_begin)(void *pool, unsigned long handle,
diff --git a/mm/zpool.c b/mm/zpool.c
index 6d6d88930932..b99a7c03e735 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool)
  * @size: The amount of memory to allocate.
  * @gfp: The GFP flags to use when allocating memory.
  * @handle: Pointer to the handle to set
+ * @nid: The preferred node id.
  *
  * This allocates the requested amount of memory from the pool.
  * The gfp flags will be used when allocating memory, if the
  * implementation supports it. The provided @handle will be
- * set to the allocated object handle.
+ * set to the allocated object handle. The allocation will
+ * prefer the NUMA node specified by @nid.
  *
  * Implementations must guarantee this to be thread-safe.
  *
  * Returns: 0 on success, negative value on error.
  */
 int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
-		unsigned long *handle)
+		unsigned long *handle, const int nid)
 {
-	return zpool->driver->malloc(zpool->pool, size, gfp, handle);
+	return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
 }
 
 /**
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 961b270f023c..0b8a8c445fc2 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -243,9 +243,24 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
	dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
 }
 
-static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
+static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int *nid)
 {
-	struct page *page = alloc_page(gfp);
+	struct page *page;
+
+	if (nid)
+		page = alloc_pages_node(*nid, gfp, 0);
+	else {
+		/*
+		 * XXX: this is the zram path. We should consider fixing zram to also
+		 * use alloc_pages_node() and prefer the same node as the original page.
+		 *
+		 * Note that alloc_pages_node(NUMA_NO_NODE, gfp, 0) is not equivalent
+		 * to alloc_page(gfp). The former will prefer the local/closest node,
+		 * whereas the latter will try to follow the memory policy of the current
+		 * process.
+		 */
+		page = alloc_page(gfp);
+	}
 
	return page_zpdesc(page);
 }
@@ -461,10 +476,13 @@ static void zs_zpool_destroy(void *pool)
	zs_destroy_pool(pool);
 }
 
+static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
+				gfp_t gfp, const int *nid);
+
 static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
-			unsigned long *handle)
+			unsigned long *handle, const int nid)
 {
-	*handle = zs_malloc(pool, size, gfp);
+	*handle = zs_malloc_node(pool, size, gfp, &nid);
 
	if (IS_ERR_VALUE(*handle))
		return PTR_ERR((void *)*handle);
@@ -1044,7 +1062,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
  */
 static struct zspage *alloc_zspage(struct zs_pool *pool,
			struct size_class *class,
-			gfp_t gfp)
+			gfp_t gfp, const int *nid)
 {
	int i;
	struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
@@ -1061,7 +1079,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
	for (i = 0; i < class->pages_per_zspage; i++) {
		struct zpdesc *zpdesc;
 
-		zpdesc = alloc_zpdesc(gfp);
+		zpdesc = alloc_zpdesc(gfp, nid);
		if (!zpdesc) {
			while (--i >= 0) {
				zpdesc_dec_zone_page_state(zpdescs[i]);
@@ -1331,17 +1349,8 @@ static unsigned long obj_malloc(struct zs_pool *pool,
 }
 
-/**
- * zs_malloc - Allocate block of given size from pool.
- * @pool: pool to allocate from
- * @size: size of block to allocate
- * @gfp: gfp flags when allocating object
- *
- * On success, handle to the allocated object is returned,
- * otherwise an ERR_PTR().
- * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
- */
-unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
+static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
+				gfp_t gfp, const int *nid)
 {
	unsigned long handle;
	struct size_class *class;
@@ -1376,7 +1385,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
	spin_unlock(&class->lock);
 
-	zspage = alloc_zspage(pool, class, gfp);
+	zspage = alloc_zspage(pool, class, gfp, nid);
	if (!zspage) {
		cache_free_handle(pool, handle);
		return (unsigned long)ERR_PTR(-ENOMEM);
@@ -1397,6 +1406,21 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
	return handle;
 }
+
+/**
+ * zs_malloc - Allocate block of given size from pool.
+ * @pool: pool to allocate from
+ * @size: size of block to allocate
+ * @gfp: gfp flags when allocating object
+ *
+ * On success, handle to the allocated object is returned,
+ * otherwise an ERR_PTR().
+ * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
+ */
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
+{
+	return zs_malloc_node(pool, size, gfp, NULL);
+}
 EXPORT_SYMBOL_GPL(zs_malloc);
 
 static void obj_free(int class_size, unsigned long obj)
diff --git a/mm/zswap.c b/mm/zswap.c
index 204fb59da33c..455e9425c5f5 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
	zpool = pool->zpool;
	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
-	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
+	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
	if (alloc_ret)
		goto unlock;
 

base-commit: 8c65b3b82efb3b2f0d1b6e3b3e73c6f0fd367fb5
--
2.47.1