Date: Tue, 30 May 2023 00:13:41 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Chris Li
Cc: Domenico Cerasuolo, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
	yosryahmed@google.com, kernel-team@fb.com
Subject: Re: [PATCH] mm: zswap: shrink until can accept
Message-ID: <20230530041341.GB84971@cmpxchg.org>
References: <20230524065051.6328-1-cerasuolodomenico@gmail.com>
On Mon, May 29, 2023 at 02:00:46PM -0700, Chris Li wrote:
> On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote:
> > On Sat, May 27, 2023 at 1:05 AM Chris Li wrote:
> > >
> > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote:
> > > > This update addresses an issue with the zswap reclaim mechanism, which
> > > > hinders the efficient offloading of cold pages to disk, thereby
> > > > compromising the preservation of the LRU order and consequently
> > > > diminishing, if not inverting, its performance benefits.
> > > >
> > > > The functioning of the zswap shrink worker was found to be inadequate,
> > > > as shown by a basic benchmark test. For the test, a kernel build was
> > > > used as a reference, with its memory confined to 1G via a cgroup and
> > > > a 5G swap file provided. The results below are averages of three runs
> > > > without the use of zswap:
> > > >
> > > > real 46m26s
> > > > user 35m4s
> > > > sys 7m37s
> > > >
> > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > > system), the results changed to:
> > > >
> > > > real 56m4s
> > > > user 35m13s
> > > > sys 8m43s
> > > >
> > > > written_back_pages: 18
> > > > reject_reclaim_fail: 0
> > > > pool_limit_hit: 1478
> > > >
> > > > Besides the evident regression, one thing to notice from this data is
> > > > the extremely low number of written_back_pages and pool_limit_hit.
> > > >
> > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > > when zswap is completely full, doesn't account for a particular
> > > > scenario: once zswap hits its limit, zswap_pool_reached_full is set to
> > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > > still above the acceptance threshold. Once we include the rejections due
> > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > > 1478 to a significant 21578266.
> > > >
> > > > Zswap is stuck in an undesirable state where it rejects pages because
> > > > it's above the acceptance threshold, yet fails to attempt memory
> > > > reclamation. This happens because the shrink work is only queued when
> > > > zswap_frontswap_store detects that it's full, and the work itself only
> > > > reclaims one page per run.
> > > >
> > > > This state results in hot pages getting written directly to disk,
> > > > while cold ones remain in memory, waiting only to be invalidated. The
> > > > LRU order is completely broken and zswap ends up being just an
> > > > overhead without providing any benefits.
> > > >
> > > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > > pages until the acceptance threshold is met and b) the task is also
> > > > enqueued when zswap is not full but still above the threshold.
> > > >
> > > > Testing this suggested update showed much better numbers:
> > > >
> > > > real 36m37s
> > > > user 35m8s
> > > > sys 9m32s
> > > >
> > > > written_back_pages: 10459423
> > > > reject_reclaim_fail: 12896
> > > > pool_limit_hit: 75653
> > > >
> > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > > Signed-off-by: Domenico Cerasuolo
> > > > ---
> > > >  mm/zswap.c | 10 +++++++---
> > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 59da2a415fbb..2ee0775d8213 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > > >  {
> > > >  	struct zswap_pool *pool = container_of(w, typeof(*pool),
> > > >  						shrink_work);
> > > > +	int ret;
> > >
> > > Very minor nitpick: you can move the declaration inside the do
> > > statement where it gets used.
> > >
> > > >
> > > > -	if (zpool_shrink(pool->zpool, 1, NULL))
> > > > -		zswap_reject_reclaim_fail++;
> > > > +	do {
> > > > +		ret = zpool_shrink(pool->zpool, 1, NULL);
> > > > +		if (ret)
> > > > +			zswap_reject_reclaim_fail++;
> > > > +	} while (!zswap_can_accept() && ret != -EINVAL);
> > >
> > > As others point out, this while loop can be problematic.
> >
> > Do you have some specific concern that hasn't already been addressed
> > following other reviewers' suggestions?
>
> Mostly that, as it is, the outer loop seems the wrong place to do the
> retry: it lacks the error context and can't properly continue the
> iteration.
>
> > > Have you found out what the common reason for the reclaim failures
> > > was? Inside the shrink function there is a while loop that would be
> > > the place to apply try-harder conditions. For example, if all the
> > > pages in the LRU have already been tried once, there's no reason to
> > > keep on calling the shrink function. The outer loop doesn't have
> > > this kind of visibility.
> >
> > The most common cause I saw during testing was concurrent operations
> > on the swap entry: if an entry is being loaded/invalidated at the same
> > time as the zswap writeback, then errors will be returned. This
> > scenario doesn't seem harmful at all, because the failure doesn't
> > indicate that memory cannot be allocated, just that that particular
> > page should not be written back.
> >
> > As far as I understood the voiced concerns, the problem could arise if
> > the writeback fails due to an inability to allocate memory; that could
> > indicate that the system is under extremely high memory pressure, and
> > this loop could aggravate the situation by adding more contention on
> > the already scarce available memory.
>
> It seems to me the inner loop should continue if page writeback fails
> due to a load/invalidate race, and abort if it fails due to memory
> pressure.
>
> The key difference is that one is transient while the other might be
> more permanent.
>
> You definitely don't want to retry if it is already ENOMEM.

This is actually a separate problem that Domenico was preparing a
patch for. Writeback is a reclaim operation, so it should have access
to the memory reserves (PF_MEMALLOC). Persistent -ENOMEM shouldn't
exist.

It actually used to have reserve access when writeback happened
directly from inside the store (reclaim -> swapout path). When
writeback was moved out to the async worker, it lost reserve access as
an unintended side effect. This should be fixed, so we don't OOM while
we try to free memory.
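For concreteness, here is a minimal sketch of what restoring reserve
access to the worker could look like, assuming the fix is simply to
bracket the reclaim loop with memalloc_noreclaim_save()/restore(). This
is an illustration of the idea, not Domenico's queued patch:

	#include <linux/sched/mm.h>	/* memalloc_noreclaim_save/restore */

	static void shrink_worker(struct work_struct *w)
	{
		struct zswap_pool *pool = container_of(w, typeof(*pool),
						       shrink_work);
		unsigned int noreclaim_flag;
		int ret;

		/*
		 * PF_MEMALLOC marks this task as performing reclaim, so
		 * allocations made during writeback may dip into the
		 * reserves instead of failing under pressure.
		 */
		noreclaim_flag = memalloc_noreclaim_save();
		do {
			ret = zpool_shrink(pool->zpool, 1, NULL);
			if (ret)
				zswap_reject_reclaim_fail++;
		} while (!zswap_can_accept() && ret != -EINVAL);
		memalloc_noreclaim_restore(noreclaim_flag);

		zswap_pool_put(pool);
	}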
Should it be fixed before merging this patch?

I don't think the ordering matters. Right now the -ENOMEM case invokes
OOM, so it isn't really persistent either. Retrying a few times in that
case certainly doesn't seem to make things worse.

> > As I was writing to Yosry, that differentiation would be a great
> > improvement here. I have a patch set in the queue that moves the
> > inner reclaim loop from the zpool driver up to zswap. With that,
> > updating the error handling would be more convenient, as it would be
> > done in one place instead of three.
>
> This has tricky complications as well. The current shrink interface
> doesn't support continuing from the previous error position. If you
> want to avoid a repeat attempt when a page has a writeback error, you
> kind of need a way to skip that page.

A page that fails to reclaim is put back to the tail of the LRU, so for
all intents and purposes it will be skipped. In the rare and extreme
case where it's the only page left on the list, I again don't see how
retrying a few times will make the situation worse.

In practice, IMO there is little upside in trying to be more discerning
about the error codes. Simple seems better here.
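FWIW, if we ever want a safety cap on the retries without getting into
error code semantics, the loop could grow a simple failure counter
along these lines (ZSWAP_MAX_RECLAIM_RETRIES is a hypothetical
constant; sketch only):

	#define ZSWAP_MAX_RECLAIM_RETRIES 8	/* hypothetical cap */

	static void shrink_worker(struct work_struct *w)
	{
		struct zswap_pool *pool = container_of(w, typeof(*pool),
						       shrink_work);
		int ret, failures = 0;

		do {
			ret = zpool_shrink(pool->zpool, 1, NULL);
			if (ret) {
				zswap_reject_reclaim_fail++;
				/*
				 * The failed page went back to the tail
				 * of the LRU; give up after a few
				 * failures rather than spinning on a
				 * pool that won't shrink.
				 */
				if (++failures == ZSWAP_MAX_RECLAIM_RETRIES)
					break;
			}
			cond_resched();
		} while (!zswap_can_accept());
		zswap_pool_put(pool);
	}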