From: Nhat Pham <nphamcs@gmail.com>
Date: Mon, 29 Jul 2024 17:07:11 -0700
Subject: Re: [PATCH 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
To: Yosry Ahmed
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, shakeelb@google.com, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, flintglass@gmail.com
References: <20240725232813.2260665-1-nphamcs@gmail.com> <20240725232813.2260665-2-nphamcs@gmail.com>
On Mon, Jul 29, 2024 at 4:07 PM Nhat Pham wrote:
>
> On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed wrote:
> >
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham wrote:
> > >
> > > The current zswap shrinker's heuristics to prevent overshrinking are
> > > brittle and inaccurate, specifically in the way we decay the
> > > protection size (i.e. making pages in the zswap LRU eligible for
> > > reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
>
> That is certainly the intention :) I'm not a huge fan of those
> heuristics either - they seemed fairly brittle and arbitrary to me even
> back then (as you have pointed out).
>
> >
> > >
> > > We currently decay protection aggressively in zswap_lru_add() calls.
> > > This leads to the following unfortunate effect: when a new batch of
> > > pages enters zswap, the protection size rapidly decays to below 25%
> > > of the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with
> > > the zswap shrinker: the rate of shrinking shoots up massively right
> > > after a new batch of zswap stores. This is somewhat the opposite of
> > > what we originally want - when new pages enter zswap, we want to
> > > protect both these new pages AND the pages that are already
> > > protected in the zswap LRU.
> > >
> > > Replace the existing heuristics with a second chance algorithm:
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference
> > >    bit is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the
> > >    reference bit set, give it a second chance - only flip the
> > >    reference bit and rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At first look, this is similar to the reclaim algorithm. A fundamental
> > difference here is that the reference bit is only set once, when the
> > entry is created. It is different from the conventional second chance
> > page reclaim/replacement algorithm.
>
> I suppose, yeah. We no longer have non-exclusive loading (in most
> scenarios), so the reference bit is only set once, on the store attempt.
>
> The interpretation/justification is still there though (somewhat). We
> are still giving each object another "chance" to stay in the zswap
> pool, in case it is needed soon, before it is reclaimed.
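For illustration, here is a minimal, self-contained userspace sketch of the
three steps above. It is not the actual zswap code - the array-based pool
and the names are made up, and the LRU rotation is elided - but it shows how
a freshly stored entry survives one full shrinker pass and is only reclaimed
on the second encounter:

#include <stdbool.h>
#include <stdio.h>

#define POOL 8

struct entry {
        bool present;
        bool referenced;        /* set once, when the entry is stored */
};

static struct entry lru[POOL];

/* Step 1: a new store starts out with its reference bit set. */
static void store(int i)
{
        lru[i].present = true;
        lru[i].referenced = true;
}

/* One shrinker visit to slot i; returns true if the entry was reclaimed. */
static bool shrink_one(int i)
{
        if (!lru[i].present)
                return false;
        if (lru[i].referenced) {
                /* Step 2: second chance - clear the bit, keep the entry. */
                lru[i].referenced = false;
                return false;
        }
        /* Step 3: seen again with the bit clear - write it back. */
        lru[i].present = false;
        return true;
}

int main(void)
{
        int i, pass;

        for (i = 0; i < POOL; i++)
                store(i);

        for (pass = 0; pass < 2; pass++) {
                int reclaimed = 0;

                for (i = 0; i < POOL; i++)
                        reclaimed += shrink_one(i);
                printf("pass %d reclaimed %d entries\n", pass, reclaimed);
        }
        return 0;
}

The first pass reclaims nothing (it only ages the entries); the second pass
reclaims whatever was not touched in between. In real zswap, entries that
get loaded in the meantime are invalidated and drop off the LRU before the
shrinker sees them a second time.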
> > Taking a step back, what we really want is to write back zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way up to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries; it subtracts the protected part (or recent swapins)
>
> Note that this "protection" decays rather aggressively with zswap
> stores. From zswap_lru_add():
>
> new = old > lru_size / 4 ? old / 2 : old;
>
> IOW, if the protection size exceeds 25% of the LRU size, halve it. A
> couple of zswap_stores() could easily slash this to 25% of the LRU (or
> even below) rapidly.
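To put numbers on how fast that rule bites, here is a tiny worked example
(standalone C, not the zswap code; lru_size is held constant purely for
illustration, even though stores would also grow the LRU):

#include <stdio.h>

int main(void)
{
        unsigned long lru_size = 1000000;
        unsigned long nr_protected = 900000;    /* 90% of the LRU starts out protected */
        int store;

        for (store = 1; store <= 4; store++) {
                /* the decay rule quoted above: new = old > lru_size / 4 ? old / 2 : old; */
                if (nr_protected > lru_size / 4)
                        nr_protected /= 2;
                printf("after store %d: protected = %lu (%lu%% of the LRU)\n",
                       store, nr_protected, nr_protected * 100 / lru_size);
        }
        return 0;
}

Two stores take the protection from 90% of the LRU down to about 22%, which
is the "rapid decay to below 25%" behavior described in the patch.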
> I guess I can fiddle with this decaying attempt + decouple the decaying
> from storing (i.e. move it somewhere else). But any formula I can come
> up with (another multiplicative factor?) seems as arbitrary as this
> one, and wherever I place the decaying could potentially couple two
> actions, which leads to unintended effects.
>
> The second chance algorithm creates a two-section LRU configuration
> similar to the old scheme - protected/unprotected (or, analogous to the
> page reclaim algorithm, active/inactive). However, it bypasses the need
> for such an artificially constructed formula, which you can think of as
> the rate at which objects age. It naturally ties the aging of objects
> (i.e. moving them from one section to the other) in the zswap pool to
> memory pressure, which makes sense to me, FWIW.
>
> > and scales by the compression ratio. So this looks reasonable at first
> > sight; perhaps we want to tune the seek to slow down writeback if we
> > think it's too much, but that doesn't explain the scenario you are
> > describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. That does not
> > apply to zswap, because the objects can be swapped in.
> >
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and write back this
> > amount, we will keep carrying this value forward in subsequent reclaim
> > operations, even if the number of existing zswap entries has decreased
> > due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps being carried forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
>
> Interesting theory. I wonder if I can do some tracing to investigate
> this (or maybe just spam trace_printk() lol).

OK, I found this tracepoint:

sudo echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable

I'll give this a shot, and report any interesting findings from the data
collected :)

> But yeah, this nr_deferred part seems suspicious. The number of
> freeable objects returned by the shrinker_count() function (at least
> the one implemented by zswap) doesn't really track which objects have
> become freeable since the last shrinker_count() invocation. So an
> object can be accounted for in a previous invocation, fail to be
> reclaimed in that invocation, yet still count towards future shrinker
> invocations? That sounds wrong :) Even without zswapin, that's still
> double counting, no? A future shrinker_count() call will still consider
> the previously-failed-to-reclaim object. Do other shrinkers somehow
> track the objects that have been accounted for in previous shrinker
> attempts, and use this in the freeable computation?
>
> That said, how bad this is in practice depends on the rate of reclaim
> failure, since nr_deferred only increments when we fail to scan - and
> that rate is actually not very high. For instance, these are the
> statistics from a couple of my benchmark runs (using the old zswap
> shrinker protection scheme, i.e. the status quo):
>
> /sys/kernel/debug/zswap/reject_reclaim_fail:292
> /sys/kernel/debug/zswap/written_back_pages:6947364
>
> IIRC, exclusive loading really lowers the number of reclaim rejections
> (you're less likely to observe a page that is already loaded into
> memory and in the swap cache - its zswap entry is already invalidated).

Actually, on closer inspection, we also increase the deferred count when
we short-circuit in each of the following scenarios:

a) When we disable the zswap shrinker, or zswap writeback in general. I
don't think this matters too much in practice - we're not reclaiming
here anyway :)

b) When we detect that there are fewer unprotected objects than the
batch size.

FWIW, the second case also goes away with the new scheme :)

> This could be a problem, and probably should be fixed (for instance,
> by adding a no-defer mode for shrinkers similar to zswap), but there
> seems to be an additional/orthogonal reason for overshrinking.

I can also just implement this and benchmark it :)
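As a rough, deliberately simplified illustration of the double-counting
worry (this is not do_shrink_slab() - the kernel scales these values by the
reclaim priority and clamps them, all of which is omitted here), consider a
pool whose entries are counted on every invocation but never successfully
reclaimed:

#include <stdio.h>

int main(void)
{
        long pool = 1000;       /* zswap entries; assume none is ever reclaimed */
        long deferred = 0;      /* work carried over between shrinker invocations */
        int round;

        for (round = 0; round < 5; round++) {
                long freeable = pool;                   /* the count callback sees all of them, again */
                long target = freeable + deferred;      /* this invocation's scan target */
                long reclaimed = 0;                     /* every writeback attempt is rejected */

                deferred = target - reclaimed;          /* unfinished work is deferred once more */
                printf("round %d: pool=%ld target=%ld deferred=%ld\n",
                       round, pool, target, deferred);
        }
        return 0;
}

The pool never grows, yet the scan target keeps climbing because the same
objects show up both in the fresh count and in the carried-over deferred
total - the kind of overshrinking pressure suspected above.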