From: Yosry Ahmed
Date: Wed, 24 May 2023 17:58:37 -0700
Subject: Re: [PATCH] mm: zswap: shrink until can accept
To: Domenico Cerasuolo
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, sjenning@redhat.com,
    ddstreet@ieee.org, vitaly.wool@konsulko.com, hannes@cmpxchg.org,
    kernel-team@fb.com
References: <20230524065051.6328-1-cerasuolodomenico@gmail.com>
In-Reply-To: <20230524065051.6328-1-cerasuolodomenico@gmail.com>

Hi Domenico,

On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo wrote:
>
> This update addresses an issue with the zswap reclaim mechanism, which
> hinders the efficient offloading of cold pages to disk, thereby
> compromising the preservation of the LRU order and consequently
> diminishing, if not inverting, its performance benefits.
>
> The functioning of the zswap shrink worker was found to be inadequate,
> as shown by a basic benchmark test.
> For the test, a kernel build was
> utilized as a reference, with its memory confined to 1G via a cgroup and
> a 5G swap file provided. The results are presented below; these are
> averages of three runs without the use of zswap:
>
> real 46m26s
> user 35m4s
> sys 7m37s
>
> With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> system), the results changed to:
>
> real 56m4s
> user 35m13s
> sys 8m43s
>
> written_back_pages: 18
> reject_reclaim_fail: 0
> pool_limit_hit: 1478
>
> Besides the evident regression, one thing to notice from this data is
> the extremely low number of written_back_pages and pool_limit_hit.
>
> The pool_limit_hit counter, which is increased in zswap_frontswap_store
> when zswap is completely full, doesn't account for a particular
> scenario: once zswap hits its limit, zswap_pool_reached_full is set to
> true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> still above the acceptance threshold. Once we include the rejections due
> to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> 1478 to a significant 21578266.
>
> Zswap is stuck in an undesirable state where it rejects pages because
> it's above the acceptance threshold, yet fails to attempt memory
> reclamation. This happens because the shrink work is only queued when
> zswap_frontswap_store detects that it's full and the work itself only
> reclaims one page per run.
>
> This state results in hot pages getting written directly to disk,
> while cold ones remain in memory, waiting only to be invalidated. The LRU
> order is completely broken and zswap ends up being just an overhead
> without providing any benefits.
>
> This commit applies 2 changes: a) the shrink worker is set to reclaim
> pages until the acceptance threshold is met and b) the task is also
> enqueued when zswap is not full but still above the threshold.
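
The stuck state described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct model_zswap,
model_store() and the patched flag are hypothetical names invented here, the
thresholds are plain page counts, and the real store path does considerably
more work than this.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the zswap store path (not kernel code). */
struct model_zswap {
	long pool_pages;    /* pages currently held in the pool */
	long max_pages;     /* hard limit, derived from max_pool_percent */
	long accept_pages;  /* acceptance threshold, below max_pages */
	bool reached_full;  /* plays the role of zswap_pool_reached_full */
	long shrink_queued; /* how many times the shrink work was queued */
	bool patched;       /* true: also queue the work above the threshold */
};

enum model_store_result { STORE_OK, STORE_REJECTED };

enum model_store_result model_store(struct model_zswap *z)
{
	/* Completely full: reject, remember it, and queue the shrink work. */
	if (z->pool_pages >= z->max_pages) {
		z->reached_full = true;
		z->shrink_queued++;
		return STORE_REJECTED;
	}

	if (z->reached_full) {
		if (z->pool_pages >= z->accept_pages) {
			/*
			 * Not full, but still above the acceptance threshold.
			 * Without change (b) nothing is queued here, so the
			 * pool never shrinks and every store is rejected.
			 */
			if (z->patched)
				z->shrink_queued++;
			return STORE_REJECTED;
		}
		z->reached_full = false;
	}

	z->pool_pages++;
	return STORE_OK;
}
```

In this model, an unpatched pool sitting between accept_pages and max_pages
rejects every store without ever queuing work, which matches the large number
of uncounted rejections reported in the patch description.
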
>
> Testing this suggested update showed much better numbers:
>
> real 36m37s
> user 35m8s
> sys 9m32s
>
> written_back_pages: 10459423
> reject_reclaim_fail: 12896
> pool_limit_hit: 75653

Impressive numbers, and great find in general!

I wonder how other workloads benefit/regress from this change.
Anything else that you happened to run? :)

>
> Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> Signed-off-by: Domenico Cerasuolo
> ---
>  mm/zswap.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..2ee0775d8213 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
>  {
>         struct zswap_pool *pool = container_of(w, typeof(*pool),
>                                                 shrink_work);
> +       int ret;
>
> -       if (zpool_shrink(pool->zpool, 1, NULL))
> -               zswap_reject_reclaim_fail++;
> +       do {
> +               ret = zpool_shrink(pool->zpool, 1, NULL);
> +               if (ret)
> +                       zswap_reject_reclaim_fail++;
> +       } while (!zswap_can_accept() && ret != -EINVAL);

One question/comment here about the retry logic.

So I looked through the awfully convoluted writeback code, and there
are multiple layers, and some of them tend to overwrite the return
value of the layer underneath :(

For zsmalloc (as an example):
zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().

First of all, that zpool_ops is an unnecessarily confusing
indirection, but anyway.

- zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error
- zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
  is empty, and -EAGAIN on other errors.
- zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
  return value of zs_reclaim_page()/zbud_reclaim_page().
- zpool_shrink() will return -EINVAL if the driver does not support
  shrinking, otherwise it will propagate the return value from the
  driver.

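
The return-value chain listed above can be modeled roughly as follows. This
is a userspace sketch, not kernel code: the model_* functions and the enum are
hypothetical stand-ins, and the mapping only encodes the behaviour summarized
in the list (the real drivers have more paths than this).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one writeback attempt (illustration only). */
enum model_writeback_result { WB_OK, WB_RACED, WB_NOMEM };

/* Stand-in for zswap_writeback_entry(): 0, -EEXIST or -ENOMEM. */
int model_writeback_entry(enum model_writeback_result r)
{
	switch (r) {
	case WB_RACED:
		return -EEXIST;
	case WB_NOMEM:
		return -ENOMEM;
	default:
		return 0;
	}
}

/*
 * Stand-in for zs_reclaim_page()/zbud_reclaim_page(): -EINVAL when the
 * LRU is empty; any eviction error is collapsed into -EAGAIN, so the
 * original errno from the layer below is lost.
 */
int model_reclaim_page(bool lru_empty, enum model_writeback_result r)
{
	if (lru_empty)
		return -EINVAL;
	return model_writeback_entry(r) ? -EAGAIN : 0;
}

/*
 * Stand-in for zpool_shrink(): -EINVAL when the driver cannot shrink,
 * otherwise the driver's return value is passed through unchanged.
 */
int model_zpool_shrink(bool driver_can_shrink, bool lru_empty,
		       enum model_writeback_result r)
{
	if (!driver_can_shrink)
		return -EINVAL;
	return model_reclaim_page(lru_empty, r);
}
```

Under these assumptions the caller only ever sees 0, -EAGAIN, or -EINVAL,
which is why checking for -EAGAIN and checking for "not -EINVAL" behave the
same today.
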
So it looks like we will get -EINVAL only if the driver lru is empty
or the driver does not support writeback, so rightfully we should not
retry.

If zswap_writeback_entry() returns -EEXIST, it probably means that we
raced with another task decompressing the page, so rightfully we
should retry and write back another page instead.

If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
mean that we couldn't allocate memory (unfortunately); it looks like
we will return -ENOMEM in other cases as well. Arguably, we can retry
in all cases, because even if we were actually out of memory, we are
trying to make an allocation that will eventually free more memory.

In all cases, I think it would be nicer if we retry if ret ==
-EAGAIN instead. It is semantically more sane. In this specific case
it is functionally a NOP, as zs_reclaim_page()/zbud_reclaim_page() will
mostly return -EAGAIN anyway, but in case someone tries to use saner
errnos in the future, this will age better.

Also, do we intentionally want to keep retrying forever on failure? Do
we want to add a max number of retries?

>         zswap_pool_put(pool);
>  }
>
> @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>         if (zswap_pool_reached_full) {
>                 if (!zswap_can_accept()) {
>                         ret = -ENOMEM;
> -                       goto reject;
> +                       goto shrink;
>                 } else
>                         zswap_pool_reached_full = false;
>         }
> --
> 2.34.1
>
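
For reference, the two suggestions above (retry only on -EAGAIN, and cap the
number of failed attempts) could look roughly like the userspace model below.
It is a sketch under stated assumptions, not a proposed patch:
MODEL_MAX_RETRIES, struct model_pool and the model_* helpers are hypothetical
names, and failures are injected through a simple counter instead of coming
from a real zpool driver.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_RETRIES 16 /* hypothetical cap, value chosen arbitrarily */

struct model_pool {
	int pages;        /* pages currently stored */
	int accept_pages; /* the pool can accept again below this count */
	int fail_budget;  /* upcoming reclaim attempts that fail with -EAGAIN */
	int reclaim_fail; /* plays the role of zswap_reject_reclaim_fail */
};

bool model_can_accept(const struct model_pool *p)
{
	return p->pages < p->accept_pages;
}

/* Reclaim one page: -EINVAL if nothing is left, -EAGAIN on injected failure. */
int model_shrink_one(struct model_pool *p)
{
	if (p->pages == 0)
		return -EINVAL;
	if (p->fail_budget > 0) {
		p->fail_budget--;
		return -EAGAIN;
	}
	p->pages--;
	return 0;
}

/* Shrink until the pool can accept, retrying only on -EAGAIN, with a cap. */
int model_shrink_worker(struct model_pool *p)
{
	int ret, failures = 0;

	do {
		ret = model_shrink_one(p);
		if (ret) {
			p->reclaim_fail++;
			if (ret != -EAGAIN)
				break; /* not a transient error, give up */
			if (++failures == MODEL_MAX_RETRIES)
				break; /* too many failed attempts */
		}
	} while (!model_can_accept(p));

	return ret;
}
```

The cap counts failed attempts only, so a long run of successful reclaims is
never cut short; only repeated -EAGAIN results end the loop early.
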