From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 22 Apr 2025 04:47:19 -0700
From: Yosry Ahmed
To: Nhat Pham
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, chengming.zhou@linux.dev,
	sj@kernel.org, linux-mm@kvack.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, gourry@gourry.net,
	ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
	dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
	minchan@kernel.org, senozhatsky@chromium.org
Subject: Re: [PATCH v2] zsmalloc: prefer the original page's node for compressed data
References: <20250402204416.3435994-1-nphamcs@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20250402204416.3435994-1-nphamcs@gmail.com>

On Wed, Apr 02, 2025 at 01:44:16PM -0700, Nhat Pham wrote:
> Currently, zsmalloc, the backend memory allocator for zswap and zram,
> does not enforce any policy for the allocation of memory for the
> compressed data. It simply adopts the memory policy of the task
> entering reclaim, or the default policy (prefer the local node) if no
> such policy is specified. This can lead to several pathological
> behaviors on multi-node NUMA systems:
>
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap/zram: the coldest pages demoted to the CXL
>    tier can return to the high tier when they are reclaimed to
>    compressed swap, creating memory pressure on the high tier.
>
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, provoking more spills. The remote node is also
>    the most likely to refault this data again.
>    This undesirable behavior was pointed out by Johannes Weiner in [1].
>
> 3. For zswap writeback, zswap entries are organized in node-specific
>    LRUs, based on the node placement of the original pages, allowing
>    for targeted zswap writeback on specific nodes.
>
>    However, the compressed data of a zswap entry can reside on a
>    different node from the LRU the entry is placed on. This means that
>    reclaim targeted at one node might not free up the memory used for
>    that node's zswap entries, but instead reclaim memory on a
>    different node.
>
> All of these issues are resolved if the compressed data go to the same
> node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and having zsmalloc prefer the specified node when it needs to
> allocate new (zs)pages for the compressed data.
>
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
>
> 1. On a system with demotion enabled, we will generally prefer
>    demotion over compressed swapping, and only swap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
>
> 2. If the preferred node is out of memory, letting the compressed data
>    go to other nodes can be better than the alternatives (OOMs,
>    keeping cold memory unreclaimed, disk swapping, etc.).
>
> 3. If the allocation goes to a separate node because we have a zspage
>    with slots available there, at least we are not creating extra
>    immediate memory pressure (since the space is already allocated).
>
> 4. While there can be mixing, we generally reclaim pages in same-node
>    batches, which encourages zspage grouping that is more likely to go
>    to the right node.
>
> 5. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated and more prone to regressions, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
>
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
>
> Suggested-by: Gregory Price
> Signed-off-by: Nhat Pham

For the zswap/zsmalloc bits:

Acked-by: Yosry Ahmed
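
As an aside, for anyone following along: the "prefer but don't strictly
bind" behavior described above is roughly what a node-preferred page
allocation without __GFP_THISNODE gives you. A minimal sketch of the
idea (hypothetical helper name, not the actual patch code):

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/*
	 * Hypothetical sketch, not the patch itself: allocate a backing
	 * page for compressed data, preferring the node of the original
	 * page but allowing fallback to other nodes under pressure.
	 */
	static struct page *alloc_compressed_page(struct page *origin, gfp_t gfp)
	{
		int nid = page_to_nid(origin);	/* node of the original page */

		/*
		 * Without __GFP_THISNODE, nid is only a preference: the
		 * allocation can still spill to other nodes if nid is full.
		 */
		return alloc_pages_node(nid, gfp, 0);
	}

In the real code path, zswap/zram would hand the node id down into
zsmalloc's zspage allocation rather than allocating pages directly, but
the preferred-versus-bound distinction is the same.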