From: Nhat Pham <nphamcs@gmail.com>
Date: Mon, 29 Jul 2024 17:07:11 -0700
Subject: Re: [PATCH 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
To: Yosry Ahmed
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, shakeelb@google.com, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, flintglass@gmail.com
References: <20240725232813.2260665-1-nphamcs@gmail.com> <20240725232813.2260665-2-nphamcs@gmail.com>
On Mon, Jul 29, 2024 at 4:07 PM Nhat Pham wrote:
>
> On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed wrote:
> >
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham wrote:
> > >
> > > The current zswap shrinker's heuristics to prevent overshrinking are
> > > brittle and inaccurate, specifically in the way we decay the
> > > protection size (i.e. making pages in the zswap LRU eligible for
> > > reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
>
> That is certainly the intention :) I'm not a huge fan of those
> heuristics either - they seemed fairly brittle and arbitrary to me even
> back then (as you have pointed out).
>
> >
> > >
> > > We currently decay protection aggressively in zswap_lru_add() calls.
> > > This leads to the following unfortunate effect: when a new batch of
> > > pages enters zswap, the protection size rapidly decays to below 25%
> > > of the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with
> > > the zswap shrinker: the rate of shrinking shoots up massively right
> > > after a new batch of zswap stores. This is somewhat the opposite of
> > > what we originally want - when new pages enter zswap, we want to
> > > protect both these new pages AND the pages that are already
> > > protected in the zswap LRU.
> > >
> > > Replace the existing heuristics with a second chance algorithm:
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference
> > >    bit is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the
> > >    reference bit set, give it a second chance - only flip the
> > >    reference bit and rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At first look, this is similar to the reclaim algorithm. A fundamental
> > difference here is that the reference bit is only set once, when the
> > entry is created. It is different from the conventional second chance
> > page reclaim/replacement algorithm.
>
> I suppose, yeah. We no longer have non-exclusive loading (in most
> scenarios), so the reference bit is only set once, on the store attempt.
>
> The interpretation/justification is still there though (somewhat). We
> are still giving each object another "chance" to stay in the zswap
> pool, in case it is needed soon, before it is reclaimed.
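For illustration, here is a minimal, self-contained userspace sketch of the
three steps above. It is not the actual zswap code - the array-based pool
and the names are made up, and the LRU rotation is elided - but it shows how
a freshly stored entry survives one full shrinker pass and is only reclaimed
on the second encounter:

#include <stdbool.h>
#include <stdio.h>

#define POOL 8

struct entry {
        bool present;
        bool referenced;        /* set once, when the entry is stored */
};

static struct entry lru[POOL];

/* Step 1: a new store starts out with its reference bit set. */
static void store(int i)
{
        lru[i].present = true;
        lru[i].referenced = true;
}

/* One shrinker visit to slot i; returns true if the entry was reclaimed. */
static bool shrink_one(int i)
{
        if (!lru[i].present)
                return false;
        if (lru[i].referenced) {
                /* Step 2: second chance - clear the bit, keep the entry. */
                lru[i].referenced = false;
                return false;
        }
        /* Step 3: seen again with the bit clear - write it back. */
        lru[i].present = false;
        return true;
}

int main(void)
{
        int i, pass;

        for (i = 0; i < POOL; i++)
                store(i);

        for (pass = 0; pass < 2; pass++) {
                int reclaimed = 0;

                for (i = 0; i < POOL; i++)
                        reclaimed += shrink_one(i);
                printf("pass %d reclaimed %d entries\n", pass, reclaimed);
        }
        return 0;
}

The first pass reclaims nothing (it only ages the entries); the second pass
reclaims whatever was not touched in between. In real zswap, entries that
get loaded in the meantime are invalidated and drop off the LRU before the
shrinker sees them a second time.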
> > Taking a step back, what we really want is to write back zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way up to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries; it subtracts the protected part (or recent swapins)
>
> Note that this "protection" decays rather aggressively with zswap
> stores. From zswap_lru_add():
>
> new = old > lru_size / 4 ? old / 2 : old;
>
> IOW, if the protection size exceeds 25% of the LRU size, halve it. A
> couple of zswap_stores() could easily slash this to 25% of the LRU (or
> even below) rapidly.
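To put numbers on how fast that rule bites, here is a tiny worked example
(standalone C, not the zswap code; lru_size is held constant purely for
illustration, even though stores would also grow the LRU):

#include <stdio.h>

int main(void)
{
        unsigned long lru_size = 1000000;
        unsigned long nr_protected = 900000;    /* 90% of the LRU starts out protected */
        int store;

        for (store = 1; store <= 4; store++) {
                /* the decay rule quoted above: new = old > lru_size / 4 ? old / 2 : old; */
                if (nr_protected > lru_size / 4)
                        nr_protected /= 2;
                printf("after store %d: protected = %lu (%lu%% of the LRU)\n",
                       store, nr_protected, nr_protected * 100 / lru_size);
        }
        return 0;
}

Two stores take the protection from 90% of the LRU down to about 22%, which
is the "rapid decay to below 25%" behavior described in the patch.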
> I guess I can fiddle with this decaying attempt + decouple the decaying
> from storing (i.e. move it somewhere else). But any formula I can come
> up with (another multiplicative factor?) seems as arbitrary as this
> one, and wherever I place the decaying could potentially couple two
> actions, which leads to unintended effects.
>
> The second chance algorithm creates a two-section LRU configuration
> similar to the old scheme - protected/unprotected (or, analogous to the
> page reclaim algorithm, active/inactive). However, it bypasses the need
> for such an artificially constructed formula, which you can think of as
> the rate at which objects age. It naturally ties the aging of objects
> (i.e. moving them from one section to the other) in the zswap pool to
> memory pressure, which makes sense to me, FWIW.
>
> > and scales by the compression ratio. So this looks reasonable at first
> > sight; perhaps we want to tune the seek to slow down writeback if we
> > think it's too much, but that doesn't explain the scenario you are
> > describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. That does not
> > apply to zswap, because the objects can be swapped in.
> >
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and write back this
> > amount, we will keep carrying this value forward in subsequent reclaim
> > operations, even if the number of existing zswap entries has decreased
> > due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps being carried forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
>
> Interesting theory. I wonder if I can do some tracing to investigate
> this (or maybe just spam trace_printk() lol).

OK, I found this tracepoint:

sudo echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable

I'll give this a shot, and report any interesting findings from the data
collected :)

> But yeah, this nr_deferred part seems suspicious. The number of
> freeable objects returned by the shrinker_count() function (at least
> the one implemented by zswap) doesn't really track which objects have
> become freeable since the last shrinker_count() invocation. So an
> object can be accounted for in a previous invocation, fail to be
> reclaimed in that invocation, yet still count towards future shrinker
> invocations? That sounds wrong :) Even without zswapin, that's still
> double counting, no? A future shrinker_count() call will still consider
> the previously-failed-to-reclaim object. Do other shrinkers somehow
> track the objects that have been accounted for in previous shrinker
> attempts, and use this in the freeable computation?
>
> That said, how bad this is in practice depends on the rate of reclaim
> failure, since nr_deferred only increments when we fail to scan - and
> that rate is actually not very high. For instance, these are the
> statistics from a couple of my benchmark runs (using the old zswap
> shrinker protection scheme, i.e. the status quo):
>
> /sys/kernel/debug/zswap/reject_reclaim_fail:292
> /sys/kernel/debug/zswap/written_back_pages:6947364
>
> IIRC, exclusive loading really lowers the number of reclaim rejections
> (you're less likely to observe a page that is already loaded into
> memory and in the swap cache - its zswap entry is already invalidated).

Actually, on closer inspection, we also increase the deferred count when
we short-circuit in each of the following scenarios:

a) When we disable the zswap shrinker, or zswap writeback in general. I
don't think this matters too much in practice - we're not reclaiming
here anyway :)

b) When we detect that there are fewer unprotected objects than the
batch size.

FWIW, the second case also goes away with the new scheme :)

> This could be a problem, and probably should be fixed (for instance,
> by adding a no-defer mode for shrinkers similar to zswap), but there
> seems to be an additional/orthogonal reason for overshrinking.

I can also just implement this and benchmark it :)
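As a rough, deliberately simplified illustration of the double-counting
worry (this is not do_shrink_slab() - the kernel scales these values by the
reclaim priority and clamps them, all of which is omitted here), consider a
pool whose entries are counted on every invocation but never successfully
reclaimed:

#include <stdio.h>

int main(void)
{
        long pool = 1000;       /* zswap entries; assume none is ever reclaimed */
        long deferred = 0;      /* work carried over between shrinker invocations */
        int round;

        for (round = 0; round < 5; round++) {
                long freeable = pool;                   /* the count callback sees all of them, again */
                long target = freeable + deferred;      /* this invocation's scan target */
                long reclaimed = 0;                     /* every writeback attempt is rejected */

                deferred = target - reclaimed;          /* unfinished work is deferred once more */
                printf("round %d: pool=%ld target=%ld deferred=%ld\n",
                       round, pool, target, deferred);
        }
        return 0;
}

The pool never grows, yet the scan target keeps climbing because the same
objects show up both in the fresh count and in the carried-over deferred
total - the kind of overshrinking pressure suspected above.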