Date: Wed, 2 Apr 2025 15:57:41 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, sj@kernel.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, gourry@gourry.net,
	ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
	dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
	minchan@kernel.org, senozhatsky@chromium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for compressed data
Message-ID: <20250402195741.GD198651@cmpxchg.org>
In-Reply-To: <20250402191145.2841864-1-nphamcs@gmail.com>
References: <20250402191145.2841864-1-nphamcs@gmail.com>

On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> any policy for the allocation of memory for the compressed data,
> instead just adopting the memory policy of the task entering reclaim,
> or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
>
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap: the coldest pages demoted to the CXL tier
>    can return to the high tier when they are zswapped out, creating
>    memory pressure on the high tier.
>
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, provoking more spills. The remote node is
>    also the most likely to refault this data again. This undesirable
>    behavior was pointed out by Johannes Weiner in [1].
>
> 3. For zswap writeback, the zswap entries are organized in
>    node-specific LRUs, based on the node placement of the original
>    pages, allowing for targeted zswap writeback for specific nodes.
>
>    However, the compressed data of a zswap entry can be placed on a
>    different node from the LRU it is placed on. This means that reclaim
>    targeted at one node might not free up memory used for zswap entries
>    in that node, but instead reclaim memory in a different node.
>
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap pass the node of the original page to zsmalloc, and having
> zsmalloc prefer the specified node if we need to allocate new (zs)pages
> for the compressed data.
>
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
>
> 1. On a system with demotion enabled, we will generally prefer
>    demotions over zswapping, and only zswap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
>
> 2. If the preferred node is out of memory, letting the compressed data
>    go to other nodes can be better than the alternative (OOMs,
>    keeping cold memory unreclaimed, disk swapping, etc.).
>
> 3. If the allocation goes to a separate node because we have a zspage
>    with slots available, at least we're not creating extra immediate
>    memory pressure (since the space is already allocated).
>
> 4. While there can be mixing, we generally reclaim pages in
>    same-node batches, which encourages zspage grouping that is more
>    likely to go to the right node.
>
> 5. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated, and more prone to regression, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
>
> This patch does not fix zram, leaving its memory allocation behavior
> unchanged. We leave this effort to future work.

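(For the mechanism described above - zswap forwarding the node of the
original page down to the allocator - the zswap side presumably boils
down to something like the following; a rough sketch only, with the call
site and variable names assumed rather than taken from the patch:)

        /* in zswap's compress path, where the original page is in scope */
        ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
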
zram's zs_malloc() calls all have page context. It seems a lot easier
to just fix the bug for them as well than to have two allocation APIs
and verbose commentary?

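E.g. in zram_write_page() - a hypothetical, untested sketch that assumes
zs_malloc() simply grows a nid argument (names taken from current zram
code, not from this patch):

        /* the source page is right there, so its node can be passed down */
        handle = zs_malloc(zram->mem_pool, comp_len,
                           GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE,
                           page_to_nid(page));
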
> -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
> +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int *nid)
> {
> -	struct page *page = alloc_page(gfp);
> +	struct page *page;
> +
> +	if (nid)
> +		page = alloc_pages_node(*nid, gfp, 0);
> +	else {
> +		/*
> +		 * XXX: this is the zram path. We should consider fixing zram to also
> +		 * use alloc_pages_node() and prefer the same node as the original page.
> +		 *
> +		 * Note that alloc_pages_node(NUMA_NO_NODE, gfp, 0) is not equivalent
> +		 * to allloc_page(gfp). The former will prefer the local/closest node,
> +		 * whereas the latter will try to follow the memory policy of the current
> +		 * process.
> +		 */
> +		page = alloc_page(gfp);
> +	}
>
> 	return page_zpdesc(page);
> }
> @@ -461,10 +476,13 @@ static void zs_zpool_destroy(void *pool)
> 	zs_destroy_pool(pool);
> }
>
> +static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
> +				    gfp_t gfp, const int *nid);
> +
> static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> -			   unsigned long *handle)
> +			   unsigned long *handle, const int nid)
> {
> -	*handle = zs_malloc(pool, size, gfp);
> +	*handle = zs_malloc_node(pool, size, gfp, &nid);
>
> 	if (IS_ERR_VALUE(*handle))
> 		return PTR_ERR((void *)*handle);
> }
>
>
> -/**
> - * zs_malloc - Allocate block of given size from pool.
> - * @pool: pool to allocate from
> - * @size: size of block to allocate
> - * @gfp: gfp flags when allocating object
> - *
> - * On success, handle to the allocated object is returned,
> - * otherwise an ERR_PTR().
> - * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> - */
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
> +				    gfp_t gfp, const int *nid)
> {
> 	unsigned long handle;
> 	struct size_class *class;
> @@ -1397,6 +1406,21 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>
> 	return handle;
> }
> +
> +/**
> + * zs_malloc - Allocate block of given size from pool.
> + * @pool: pool to allocate from
> + * @size: size of block to allocate
> + * @gfp: gfp flags when allocating object
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise an ERR_PTR().
> + * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> + */
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +{
> +	return zs_malloc_node(pool, size, gfp, NULL);
> +}
> EXPORT_SYMBOL_GPL(zs_malloc);
>
> static void obj_free(int class_size, unsigned long obj)
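
With zram converted, there also wouldn't be any need for the nid pointer
and the NULL special case - a single allocation path could take a plain
int. A minimal sketch of what I mean (not a tested patch):

        static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, int nid)
        {
                struct page *page = alloc_pages_node(nid, gfp, 0);

                return page_zpdesc(page);
        }

        /* one public entry point instead of zs_malloc()/zs_malloc_node(): */
        unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp, int nid);

Callers that genuinely don't care could still pass NUMA_NO_NODE and get
the local-node preference from alloc_pages_node().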