From: Nhat Pham <nphamcs@gmail.com>
Date: Thu, 20 Jun 2024 15:45:17 -0700
Subject: Re: [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink
To: Takero Funaki
Cc: Yosry Ahmed, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
 Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
MIME-Version: 1.0
References: <20240608155316.451600-1-flintglass@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Wed, Jun 19, 2024 at 6:03 PM Takero Funaki wrote:
>
> Hello,
>
> Sorry for the late reply. I am currently investigating a responsiveness
> issue I found while benchmarking with this series, possibly related to
> concurrent zswap writeback and pageouts.
>
> This series cannot be applied until the root cause is identified,
> unfortunately. Thank you all for taking the time to review.
>
> The responsiveness issue was confirmed on 6.10-rc2 with all 3 patches
> applied. Without patch 3, it still happens but is less likely.
>
> When allocating much more memory than zswap can buffer, and writeback
> and rejection by pool_limit_hit happen simultaneously, the system stops
> responding. I do not see this freeze when zswap is disabled or when
> there is no pool_limit_hit. The proactive shrinking itself seems to
> work as expected as long as writeback and pageout do not occur
> simultaneously.
>
> I suspect this issue exists in the current code but was not visible
> without this series, since the global shrinker did not write back a
> considerable amount of pages.
>
> On Sat, Jun 15, 2024 at 7:48 Nhat Pham wrote:
> >
> > BTW, I'm curious. Have you experimented with increasing the pool size?
> > That 20% number is plenty for our use cases, but maybe yours needs a
> > different cap?
> >
>
> Probably we can allocate a bit more zswap pool size, but that will keep
> more old pages once the pool limit is hit. If we can ensure no pool
> limit hits and zero writeback by allocating more memory, I will try the
> same amount of zramswap.
>
> > Also, have you experimented with the dynamic zswap shrinker? :) I'm
> > actually curious how it works out in the small machine regime, with
> > whatever workload you are running.
> >
>
> It seems the dynamic shrinker is trying to evict all pages. That does
> not fit my use case, which prefers balanced swap-in and swap-out
> performance.

Hmm, not quite. As you have noted earlier, it (tries to) shrink the
unprotected pages only.

> On Sat, Jun 15, 2024 at 9:20 Yosry Ahmed wrote:
> >
> > > 1.
> > > The visible issue is that pageout/in operations from active
> > > processes are slow when zswap is near its max pool size. This is
> > > particularly significant on small memory systems, where total swap
> > > usage exceeds what zswap can store. This means that old pages occupy
> > > most of the zswap pool space, and recent pages use the swap disk
> > > directly.
> >
> > This should be a transient state though, right? Once the shrinker
> > kicks in, it should write back the old pages and make space for the
> > hot ones. Which takes us to our next point.
> >
> > > 2.
> > > This issue is caused by zswap keeping the pool size near 100%. Since
> > > the shrinker fails to shrink the pool to accept_thr_percent and
> > > zswap rejects incoming pages, rejection occurs more frequently than
> > > it should. The rejected pages are directly written to disk while
> > > zswap protects old pages from eviction, leading to slow pageout/in
> > > performance for recent pages on the swap disk.
> >
> > Why is the shrinker failing? IIUC the first two patches fix two cases
> > where the shrinker stumbles upon offline memcgs, or memcgs with no
> > zswapped pages. Are these cases common enough in your use case that
> > every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> > putting the zswap usage below accept_thr_percent?
> >
> > This would be surprising given that we should be restarting the
> > shrinker with every swapout attempt until we can accept pages again.
> >
> > I guess one could construct a malicious case where there are some
> > sticky offline memcgs, and all the memcgs that actually have zswap
> > pages come after it in the iteration order.
> >
> > Could you shed more light on this? What does the setup look like? How
> > many memcgs are there, how many of them use zswap, and how many
> > offline memcgs are you observing?
> >
>
> Example from Ubuntu 22.04 using zswap:
>
> root@ctl:~# find /sys/fs/cgroup/ -wholename \*service/memory.zswap.current | xargs grep . | wc
>      31      31    2557
> root@ctl:~# find /sys/fs/cgroup/ -wholename \*service/memory.zswap.current | xargs grep ^0 | wc
>      11      11     911
>
> This indicates 11 out of 31 services have no pages in zswap. Without
> patch 2, shrink_worker() aborts shrinking in the second tree walk,
> before evicting about 40 pages from the services. The number varies,
> but I think it is common to see a few memcgs that have no zswap pages.
>
> > I am not saying we shouldn't fix these problems anyway, I am just
> > trying to understand how we got into this situation to begin with.
> >
> > > 3.
> > > If the pool size were shrunk proactively, rejection by pool limit
> > > hits would be less likely. New incoming pages could be accepted as
> > > the pool gains some space in advance, while older pages are written
> > > back in the background. zswap would then be filled with recent
> > > pages, as expected by the LRU logic.
> >
> > I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> > from reclaim should be doing this sort of "proactive shrinking".
> >
> > I agree that the current hysteresis around accept_thr_percent is not
> > good enough, but I am surprised you are hitting the pool limit if the
> > shrinker is being run during reclaim.
> >
> > > Patches 1 and 2 make the shrinker reduce the pool to
> > > accept_thr_percent. Patch 3 makes zswap_store trigger the shrinker
> > > before reaching the max pool size. With this series, zswap will
> > > prepare some space to reduce the probability of the problematic
> > > pool_limit_hit situation, thus reducing slow reclaim and the page
> > > priority inversion against the LRU.
> > >
> > > 4.
> > > Once proactive shrinking reduces the pool size, pageouts complete
> > > instantly as long as the space prepared by shrinking can store the
> > > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > > accept_threshold_percent will improve active process performance.
> >
> > I agree that proactive shrinking is preferable to waiting until we
> > hit the pool limit and then refusing pages until the acceptance
> > threshold is reached. I am just trying to understand whether such a
> > proactive shrinking mechanism will be needed if the reclaim shrinker
> > for zswap is being used, and how the two would work together.
>
> For my workload, the dynamic shrinker (reclaim shrinker) is disabled.
> The proposed global shrinker and the existing dynamic shrinker are both
> proactive, but their goals are different.
>
> The global shrinker starts shrinking when the zswap pool exceeds
> accept_thr_percent + 1%, then stops when it reaches accept_thr_percent.
> Pages below accept_thr_percent are protected from shrinking.
>
> The dynamic shrinker starts shrinking based on memory pressure,
> regardless of the zswap pool size, and stops when the LRU size is
> reduced to 1/4. Its goal is to wipe out all pages from zswap. It
> prefers swapout performance only.
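Just to make sure we are talking about the same hysteresis, here is a
rough sketch of the start/stop conditions as I read them. This is
illustrative only; the variable names and the percentage math below are
placeholders, not the actual code in this series.

static unsigned long zswap_pool_pages;      /* current pool size, in pages */
static unsigned long zswap_max_pool_pages;  /* limit derived from max_pool_percent */
static unsigned int accept_thr_percent = 90;    /* module parameter default */

static int global_shrink_should_start(void)
{
	/* wake the shrink worker once the pool exceeds accept_thr_percent + 1% */
	return zswap_pool_pages * 100 >
	       zswap_max_pool_pages * (accept_thr_percent + 1);
}

static int global_shrink_should_stop(void)
{
	/* stop at accept_thr_percent; pages below that stay protected */
	return zswap_pool_pages * 100 <=
	       zswap_max_pool_pages * accept_thr_percent;
}

If that matches your intent, the 1% gap is what keeps the worker from
flapping right at the threshold.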
>
> I think the current LRU logic decreases nr_zswap_protected too quickly
> for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to
> between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin()
> increments nr_zswap_protected when page-ins of evicted pages occur
> later, this technically has no effect while reclaim is in progress.
>
> While zswap_store() and zswap_lru_add() are called, the dynamic
> shrinker is likely running due to the pressure. The dynamic shrinker
> reduces the LRU size to 1/4, and then a few subsequent zswap_store()
> calls reduce the protected count to 1/4 of the LRU size. The stored
> pages will be reduced to zero through a few shrinker_scan calls.

Ah, this is a fair point. We've been observing this in
production/experiments as well - there seems to be a positive
correlation between the zswpout rate and the zswap_written_back rate.
Whenever there is a spike in zswpout, you also see a spike in
written-back pages - it looks like the flood of zswpouts weakens zswap's
LRU protection, which is not quite the intended effect.

We're working to improve this situation. We have a couple of ideas
floating around, none of which are too complicated to implement, but
they need experiments to validate before sending upstream :)
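To spell out the decay you are describing, here is a simplified model of
the bookkeeping. It is not the literal mm/zswap.c code (which uses
atomics and per-lruvec state); the names below just mirror the concepts.

static unsigned long nr_zswap_protected; /* pages shielded from the dynamic shrinker */

static void model_zswap_lru_add(unsigned long lru_size)
{
	nr_zswap_protected++;

	/*
	 * Once the protected count grows past a quarter of the LRU it is
	 * halved, so a steady stream of stores keeps it between lru_size/8
	 * and lru_size/4 (the "1/4 and 1/8" window mentioned above).
	 */
	if (nr_zswap_protected > lru_size / 4)
		nr_zswap_protected /= 2;
}

static void model_zswap_folio_swapin(void)
{
	/* a page-in of a previously evicted page bumps protection back up */
	nr_zswap_protected++;
}

Under pressure the LRU itself is shrinking at the same time, so the
protected count gets squeezed from both sides, which lines up with the
behavior you describe.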