From: Nhat Pham <nphamcs@gmail.com>
Date: Mon, 5 Aug 2024 16:11:20 -0700
Subject: Re: [PATCH v2 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
To: Yosry Ahmed
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, flintglass@gmail.com, chengming.zhou@linux.dev
References: <20240730222707.2324536-1-nphamcs@gmail.com> <20240730222707.2324536-2-nphamcs@gmail.com>

On Thu, Aug 1, 2024 at 12:57 PM Yosry Ahmed wrote:
>
> On Tue, Jul 30, 2024 at 3:27 PM Nhat Pham wrote:
> >
> > Current zswap shrinker's heuristics to prevent overshrinking are brittle
> > and inaccurate, specifically in the way we decay the protection size
> > (i.e. making pages in the zswap LRU eligible for reclaim).
> >
> > We currently decay protection aggressively in zswap_lru_add() calls.
> > This leads to the following unfortunate effect: when a new batch of
> > pages enters zswap, the protection size rapidly decays to below 25% of
> > the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the
> > zswap shrinker: the rate of shrinking shoots up massively right after a
> > new batch of zswap stores. This is somewhat the opposite of what we
> > originally want - when new pages enter zswap, we want to protect both
> > these new pages AND the pages that are already protected in the zswap
> > LRU.
> >
> > Replace the existing heuristics with a second chance algorithm:
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> >    is set.
>
> Probably worth mentioning that this is added in a hole and doesn't
> consume any extra memory.

Will do!

>
> > 2. When the zswap shrinker encounters a zswap entry with the reference
> >    bit set, give it a second chance - only flip the reference bit and
> >    rotate it in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its
> >    reference bit unset, then it can reclaim the entry.
> >
> > In this manner, the aging of the pages in the zswap LRUs is decoupled
> > from zswap stores, and picks up the pace with increasing memory
> > pressure (which is what we want).
> >
> > The second chance scheme allows us to modulate the writeback rate based
> > on recent pool activities. Entries that recently entered the pool will
> > be protected, so if the pool is dominated by such entries the writeback
> > rate will reduce proportionally, protecting the workload's workingset.
> > On the other hand, stale entries will be written back quickly, which
> > increases the effective writeback rate.
> >
> > We will still maintain the count of swapins, which is consumed and
> > subtracted from the LRU size in zswap_shrinker_count(), to further
> > penalize past overshrinking that led to disk swapins. The idea is that
> > had we considered this many more pages in the LRU active/protected,
> > they would not have been written back and we would not have had to
> > swap them in.
> >
> > To test these new heuristics, I built the kernel under a cgroup with
> > memory.max set to 2G, on a host with 36 cores:
> >
> > With the old shrinker:
> >
> > real: 263.89s
> > user: 4318.11s
> > sys: 673.29s
> > swapins: 227300.5
> >
> > With the second chance algorithm:
> >
> > real: 244.85s
> > user: 4327.22s
> > sys: 664.39s
> > swapins: 94663
> >
> > (average over 5 runs)
> >
> > We observe a 1.3% reduction in kernel CPU usage, and around a 7.2%
> > reduction in real time. Note that the number of swapped-in pages
> > dropped by 58%.
> >
> > Suggested-by: Johannes Weiner
> > Signed-off-by: Nhat Pham
> > ---
> >  include/linux/zswap.h |  16 +++---
> >  mm/zswap.c            | 110 ++++++++++++++++++++++++------------------
> >  2 files changed, 70 insertions(+), 56 deletions(-)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 6cecb4a4f68b..b94b6ae262d5 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
> >
> >  struct zswap_lruvec_state {
> >         /*
> > -        * Number of pages in zswap that should be protected from the shrinker.
> > -        * This number is an estimate of the following counts:
> > +        * Number of swapped in pages, i.e not found in the zswap pool.
>
> With the next patch, this should be "Number of swapped in pages from
> disk". Without the "from disk", the second part about not being found
> in the zswap pool doesn't really make sense.
>
> Maybe also the variable name should be changed to nr_disk_swapins or
> similar.

Sounds good :)

>
> >          *
> > -        * a) Recent page faults.
> > -        * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> > -        *    as well as recent zswap LRU rotations.
> > -        *
> > -        * These pages are likely to be warm, and might incur IO if the are
> > -        * written to swap.
> > +        * This is consumed and subtracted from the lru size in
> > +        * zswap_shrinker_count() to penalize past overshrinking that led to disk
> > +        * swapins. The idea is that had we considered this many more pages in the
> > +        * LRU active/protected and not written them back, we would not have had to
> > +        * swapped them in.
> >          */
> > -       atomic_long_t nr_zswap_protected;
> > +       atomic_long_t nr_swapins;
> >  };
> >
> >  unsigned long zswap_total_pages(void);
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index adeaf9c97fde..f4e001c9e7e0 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -184,6 +184,10 @@ static struct shrinker *zswap_shrinker;
> >   * page within zswap.
> >   *
> >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> > + * referenced - true if the entry recently entered the zswap pool. Unset by the
> > + *              dynamic shrinker. The entry is only reclaimed by the dynamic
> > + *              shrinker if referenced is unset. See comments in the shrinker
> > + *              section for context.
> >   * length - the length in bytes of the compressed page data. Needed during
> >   *          decompression. For a same value filled page length is 0, and both
> >   *          pool and lru are invalid and must be ignored.
> > @@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
> >  struct zswap_entry {
> >         swp_entry_t swpentry;
> >         unsigned int length;
> > +       bool referenced;
> >         struct zswap_pool *pool;
> >         union {
> >                 unsigned long handle;
> > @@ -700,11 +705,10 @@ static inline int entry_to_nid(struct zswap_entry *entry)
> >
> >  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> >  {
> > -       atomic_long_t *nr_zswap_protected;
> > -       unsigned long lru_size, old, new;
> >         int nid = entry_to_nid(entry);
> >         struct mem_cgroup *memcg;
> > -       struct lruvec *lruvec;
> > +
> > +       entry->referenced = true;
>
> Would it be clearer to initialize this in zswap_store() with the rest
> of the zswap_entry initialization?

Sure thing!

>
> >
> >         /*
> >          * Note that it is safe to use rcu_read_lock() here, even in the face of
> > @@ -722,19 +726,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> >         memcg = mem_cgroup_from_entry(entry);
> >         /* will always succeed */
> >         list_lru_add(list_lru, &entry->lru, nid, memcg);
> > -
> > -       /* Update the protection area */
> > -       lru_size = list_lru_count_one(list_lru, nid, memcg);
> > -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> > -       nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> > -       old = atomic_long_inc_return(nr_zswap_protected);
> > -       /*
> > -        * Decay to avoid overflow and adapt to changing workloads.
> > -        * This is based on LRU reclaim cost decaying heuristics.
> > -        */
> > -       do {
> > -               new = old > lru_size / 4 ? old / 2 : old;
> > -       } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
> >         rcu_read_unlock();
> >  }
> >
> > @@ -752,7 +743,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> >
> >  void zswap_lruvec_state_init(struct lruvec *lruvec)
> >  {
> > -       atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
> > +       atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
> >  }
> >
> >  void zswap_folio_swapin(struct folio *folio)
> > @@ -761,7 +752,7 @@ void zswap_folio_swapin(struct folio *folio)
> >
> >         if (folio) {
> >                 lruvec = folio_lruvec(folio);
> > -               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
> >         }
> >  }
> >
> > @@ -1082,6 +1073,28 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> >                                        spinlock_t *lock, void *arg)
> >  {
> > @@ -1091,6 +1104,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >         enum lru_status ret = LRU_REMOVED_RETRY;
> >         int writeback_result;
> >
> > +       /*
> > +        * Second chance algorithm: if the entry has its referenced bit set, give it
> > +        * a second chance. Only clear the referenced bit and rotate it in the
> > +        * zswap's LRU list.
> > +        */
> > +       if (entry->referenced) {
> > +               entry->referenced = false;
> > +               return LRU_ROTATE;
> > +       }
> > +
> >         /*
> >          * As soon as we drop the LRU lock, the entry can be freed by
> >          * a concurrent invalidation.
> >          * This means the following:
> > @@ -1157,8 +1180,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >  static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >                                          struct shrink_control *sc)
> >  {
> > -       struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > -       unsigned long shrink_ret, nr_protected, lru_size;
> > +       unsigned long shrink_ret;
> >         bool encountered_page_in_swapcache = false;
> >
> >         if (!zswap_shrinker_enabled ||
> > @@ -1167,25 +1189,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >                 return SHRINK_STOP;
> >         }
> >
> > -       nr_protected =
> > -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > -       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> > -
> > -       /*
> > -        * Abort if we are shrinking into the protected region.
> > -        *
> > -        * This short-circuiting is necessary because if we have too many multiple
> > -        * concurrent reclaimers getting the freeable zswap object counts at the
> > -        * same time (before any of them made reasonable progress), the total
> > -        * number of reclaimed objects might be more than the number of unprotected
> > -        * objects (i.e the reclaimers will reclaim into the protected area of the
> > -        * zswap LRU).
> > -        */
> > -       if (nr_protected >= lru_size - sc->nr_to_scan) {
> > -               sc->nr_scanned = 0;
> > -               return SHRINK_STOP;
> > -       }
> > -
>
> Do we need a similar mechanism to protect against concurrent shrinkers
> quickly consuming nr_swapins?

Not for nr_swapins consumption per se - the original reason why I
included this (racy) check was just so that concurrent reclaimers do
not disrespect the protection scheme. We had no guarantee that we
wouldn't just reclaim into the protected region (well, even with this
racy check, technically).
With the second chance scheme, a "protected" page (i.e. one with its
referenced bit set) would not be reclaimed right away - a shrinker
encountering it would have to "age" it first (by unsetting the
referenced bit), so the intended protection is enforced.

That said, I do believe we need a mechanism to limit the concurrency
here. The amount of pages aged/reclaimed should scale (linearly?
proportionally?) with the reclaim pressure, i.e. more reclaimers ==
more pages reclaimed/aged, so the current behavior is desired.
However, at some point, if we have more shrinkers than there is work
assigned to each of them, we might be unnecessarily wasting resources
(and potentially building up the nr_deferred counter that we discussed
in v1 of the patch series). Additionally, we might be overshrinking in
a very short amount of time, without letting the system have the
chance to react and provide feedback (through swapins/refaults) to the
memory reclaimers.

But let's do this as follow-up work :) It seems orthogonal to what we
have here.

> > -        * Subtract the lru size by an estimate of the number of pages
> > -        * that should be protected.
> > +        * Subtract the lru size by the number of pages that are recently swapped
>
> nit: I don't think "subtract by" is correct, it's usually "subtract
> from". So maybe "Subtract the number of pages that are recently
> swapped in from the lru size"? Also, should we remain consistent about
> mentioning that these are disk swapins throughout all the comments to
> keep things clear?

Yeah, I should be clearer here - it should be "swapped in from disk",
or more generally (accurately?) "swapped in from the backing swap
device" (but the latter can change once we decouple swap from zswap).
Or maybe "swapped in from the secondary tier"? Let's just not
overthink it and go with "swapped in from disk" for now :)

> > +        * in. The idea is that had we protect the zswap's LRU by this amount of
> > +        * pages, these swap in would not have happened.
> >          */
> > -       nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +       nr_swapins_cur = atomic_long_read(nr_swapins);
> > +       do {
> > +               if (lru_size >= nr_swapins_cur)
> > +                       nr_remain = 0;
> > +               else
> > +                       nr_remain = nr_swapins_cur - lru_size;
> > +       } while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
> > +
> > +       lru_size -= nr_swapins_cur - nr_remain;
>
> It's a little bit weird that we reduce the variable named "lru_size"
> by the consumed swapins. Maybe we should keep this named as
> "nr_freeable", or add another variable here to hold the value after
> subtraction?

Hmmmm, yeah, now I remember why I called it nr_freeable back then.
Seems like past Nhat is a bit smarter than present Nhat :) I'll just
revert this part then.