Date: Thu, 3 Apr 2025 16:55:54 +0100
From: Jonathan Cameron
To: Nhat Pham
Subject: Re: [PATCH v2] zsmalloc: prefer the the original page's node for compressed data
Message-ID: <20250403165554.00004dd3@huawei.com>
In-Reply-To: <20250402204416.3435994-1-nphamcs@gmail.com>
References: <20250402204416.3435994-1-nphamcs@gmail.com>

On Wed, 2 Apr 2025 13:44:16 -0700
Nhat Pham wrote:

> Currently, zsmalloc, zswap's and zram's backend memory allocator, does
> not enforce any policy for the allocation of memory for the compressed
> data, instead just adopting the memory policy of the task entering
> reclaim, or the default policy (prefer local node) if no such policy
> is specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
>
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap/zram: the coldest pages demoted to the CXL
>    tier can return to the high tier when they are reclaimed to
>    compressed swap, creating memory pressure on the high tier.
>
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, provoking more spills. The remote node is also
>    the most likely to refault this data again. This undesirable
>    behavior was pointed out by Johannes Weiner in [1].
>
> 3. For zswap writeback, the zswap entries are organized in
>    node-specific LRUs, based on the node placement of the original
>    pages, allowing for targeted zswap writeback for specific nodes.
>
>    However, the compressed data of a zswap entry can be placed on a
>    different node from the LRU it is placed on.
>    This means that reclaim targeted at one node might not free up
>    memory used for zswap entries in that node, but instead reclaim
>    memory in a different node.
>
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and having zsmalloc prefer the specified node if we need to allocate
> new (zs)pages for the compressed data.
>
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
>
> 1. On a system with demotion enabled, we will generally prefer
>    demotions over compressed swapping, and only swap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
>
> 2. If the preferred node is out of memory, letting the compressed data
>    go to other nodes can be better than the alternative (OOMs, keeping
>    cold memory unreclaimed, disk swapping, etc.).
>
> 3. If the allocation goes to a separate node because we have a zspage
>    with slots available, at least we're not creating extra immediate
>    memory pressure (since the space is already allocated).
>
> 4. While there can be some mixing, we generally reclaim pages in
>    same-node batches, which encourages zspage grouping that is more
>    likely to go to the right node.
>
> 5. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated and more prone to regression, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
>
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
>
> Suggested-by: Gregory Price
> Signed-off-by: Nhat Pham

Makes sense. The formatting suggestions in the other review would be
nice to have, though.

Reviewed-by: Jonathan Cameron
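
For illustration only, here is a minimal sketch of the allocation-side
idea described above: the caller passes the originating page's node
down, and the allocator treats it as a preference rather than a strict
binding. This is not the actual patch; the helper name
alloc_compressed_page_near() is hypothetical, while page_to_nid() and
alloc_pages_node() are the standard kernel primitives such a scheme
would rely on.

/*
 * Illustrative sketch, not the patch itself: allocate a backing page
 * for compressed data, preferring the node of the original page but
 * still allowing fallback to other nodes.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *alloc_compressed_page_near(struct page *origin, gfp_t gfp)
{
	int nid = page_to_nid(origin);	/* node of the page being compressed */

	/*
	 * No __GFP_THISNODE: alloc_pages_node() starts the zonelist walk
	 * at @nid, so the preferred node is tried first, but the
	 * allocation may still fall back to other nodes if it is full.
	 */
	return alloc_pages_node(nid, gfp, 0);
}

On the caller side, zswap or zram would look up page_to_nid() of the
page being compressed and hand that node down to zsmalloc, which is the
"prefer, don't bind" behavior the commit message argues for.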