From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 79D5BC3DA41
	for <linux-mm@archiver.kernel.org>; Wed, 10 Jul 2024 21:21:24 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id E33526B0099; Wed, 10 Jul 2024 17:21:23 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id DE2356B00B0; Wed, 10 Jul 2024 17:21:23 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C5D186B00B1; Wed, 10 Jul 2024 17:21:23 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14])
	by kanga.kvack.org (Postfix) with ESMTP id A3D7E6B0099
	for <linux-mm@kvack.org>; Wed, 10 Jul 2024 17:21:23 -0400 (EDT)
Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay10.hostedemail.com (Postfix) with ESMTP id 55EA7C0278
	for <linux-mm@kvack.org>; Wed, 10 Jul 2024 21:21:23 +0000 (UTC)
X-FDA: 82325114046.13.D60C0DD
Received: from mail-yb1-f176.google.com (mail-yb1-f176.google.com [209.85.219.176])
	by imf17.hostedemail.com (Postfix) with ESMTP id 78D0C4000F
	for <linux-mm@kvack.org>; Wed, 10 Jul 2024 21:21:21 +0000 (UTC)
Authentication-Results: imf17.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=Fw2OFbxd;
	spf=pass (imf17.hostedemail.com: domain of flintglass@gmail.com designates 209.85.219.176 as permitted sender) smtp.mailfrom=flintglass@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1720646450;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=JWwzDBJy8f277WXqXl6R3RMuPRUwh+qJXOSNVB7ffMc=;
	b=pis/myA5ZO5cYpEqrUW00Hf4Ln4mHYSx0E80MbjNEv32pCCSSzHb8jQS3c/kKoc6G2QvRq
	clm8y2W+PMK2cdOxlprfTw7FCgH61zj4w5WiIx6OM3wgSeA1kEfDAZa/yrEI9YfKX8DpN9
	3EcT35ZKr19bMFlmWD8wMUmovJ54rgw=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1720646450; a=rsa-sha256;
	cv=none;
	b=2DDKvNdScwHmicjp/OjXJBbjykprkLPI/z80RPmmO7QgHjqP+n3W+m6herT+4cUC7FS7+n
	b4mhz9FLalvHf4wKDhcy8idNR9NKJgTBaH78faaMY3ckNS6Ypi6fASsXHXtH3r8NUKq13T
	RTQqGH3v+lxf9BKxou2KW6vrlDpf3IM=
ARC-Authentication-Results: i=1;
	imf17.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=Fw2OFbxd;
	spf=pass (imf17.hostedemail.com: domain of flintglass@gmail.com designates 209.85.219.176 as permitted sender) smtp.mailfrom=flintglass@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
Received: by mail-yb1-f176.google.com with SMTP id 3f1490d57ef6-e036d1ce4f7so243825276.0
        for <linux-mm@kvack.org>; Wed, 10 Jul 2024 14:21:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1720646480; x=1721251280; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=JWwzDBJy8f277WXqXl6R3RMuPRUwh+qJXOSNVB7ffMc=;
        b=Fw2OFbxdRfjCJaO9JjDmGOmwkxJ0KV9Xb9BmP5pJ/cjlr27vBSU8T0QowQq6bSQWKX
         qkAScBTehef+1LpdyGYXwzXss531uiOR6+FBkU1Rt73Uug5yn1BjeAJ3pbR4fqoQg0Et
         rD3o/puccMF6MowFDOqSAS0UMY8Eel9Pqt5CnUnoXEOvJHZkc4Ixf5BOfzEvZKmwxYPX
         3CfbaBxIKoFfUbfjMQkNKizvd5ZutFaqvXGTqs7VvOSXvKZvbs+mdD5ZpXeX8EKE0UJz
         30LqpTZSXm0cDc9Ap132Ti4uNK9CWkp2MkXlyicKpFXB6wyuAPBtfZ5Xj9deTCafI20i
         FiPQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1720646480; x=1721251280;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=JWwzDBJy8f277WXqXl6R3RMuPRUwh+qJXOSNVB7ffMc=;
        b=m/TVe7kuNg65Dvnr2Pmbg+qusCXhR7giGxWGRBpgOMPj6WM/Ni/u4SSY4S7ON8xScH
         OSkmxKMnas1hHxhUCIKfXTMGrzOC6cTietJxf2+m/x8sVLptH6LCvf/1gWh3cNzGFBz/
         o1audbfwDL+91FSz+nCaumLGeqJqWmigtMZzPIkqOwLxw3k4BgiOXp3m9nTESFKzRmAf
         13KDqeLnzOYjmDCJOkAuAce0xf0E/kigRUiRZ2n9TTA2bYNLf6RI3d5J5jY4/BztoBc8
         CoLwFmhm87PZVZl2ElYA1Lqrv5b9wsU3gU95M3ArCSiaspVXLM+UFRNPE5IFSOOBHe/K
         ya3A==
X-Forwarded-Encrypted: i=1; AJvYcCWzaX6bNKkzIRR0Ugjm3IvJoJJc5yKgRinyph1G/avHhtE7c6TyEA/Kxeq2tn0kquSPQ/hfxVI1GokPs/fojmj4w+I=
X-Gm-Message-State: AOJu0YzT+zvb/qaeNCYZXp0/Z44Q2HVPz9T164tXuVwRVrxDxcxKXfpY
	exISKjt0aAswoByJNMNauAmK/iyJC41DptxgvlUGsoi4UFe5b1TKDpx6izWEaik1Z1PKbYxiCyp
	oJ7SShNa3PhlEpZbvK4j9q3QtGpY=
X-Google-Smtp-Source: AGHT+IH6bYgfsoNbVGtdjaiR7EunA/HaBDZYQjDSrczVs0R6MHfQSww7S19jaEyCiiHD4LMGaazU7obguZVGu0/l9pM=
X-Received: by 2002:a25:7d02:0:b0:e03:5b97:cbec with SMTP id
 3f1490d57ef6-e0577fb65e5mr780329276.10.1720646480390; Wed, 10 Jul 2024
 14:21:20 -0700 (PDT)
MIME-Version: 1.0
References: <20240706022523.1104080-1-flintglass@gmail.com>
 <20240706022523.1104080-7-flintglass@gmail.com> <CAKEwX=NJjDL3aW3hXioxh=yASSsHbDBWubV9cE2RiH+tSXpscw@mail.gmail.com>
In-Reply-To: <CAKEwX=NJjDL3aW3hXioxh=yASSsHbDBWubV9cE2RiH+tSXpscw@mail.gmail.com>
From: Takero Funaki <flintglass@gmail.com>
Date: Thu, 11 Jul 2024 06:21:09 +0900
Message-ID: <CAPpoddfpU1rN5ST49vBBJ_=MHKehQQrwsz_hwBd6xyzi4-uQkQ@mail.gmail.com>
Subject: Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while
 pagein/out IO
To: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Yosry Ahmed <yosryahmed@google.com>, 
	Chengming Zhou <chengming.zhou@linux.dev>, Jonathan Corbet <corbet@lwn.net>, 
	Andrew Morton <akpm@linux-foundation.org>, 
	Domenico Cerasuolo <cerasuolodomenico@gmail.com>, linux-mm@kvack.org, linux-doc@vger.kernel.org, 
	linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspamd-Queue-Id: 78D0C4000F
X-Stat-Signature: 3oy1wofmy55eh7ey1wwt5bc7tu34ktrt
X-Rspamd-Server: rspam09
X-Rspam-User: 
X-HE-Tag: 1720646481-920764
X-HE-Meta: U2FsdGVkX18JuW+JrhrkIu7dajavsdWws43e84wFTbyUUJ/Rgvcc/rltGC8Z82gosmBUOQoka8cxnLvi5w5yJtLTZIrjpBfP0QCtN6E6JxnNBsG+jL8R3dLQV9ZIvTlc3pKx4qHeRt6hZ4y+3+PDwKfj6ZoaOjQbW42J+UKASeHVPk7rBOobvtLqWg4IMbkF6eUcg+2n7Xgzz5nGS/n9dLHTzRdCZZjkyUW+i4/g7gyGChhsT5DFH0OP/T4/2zeHrTR6DUMhUDE5mJsqtjO+tJSrwv4n2ipcmzh5TMMYL/MUQDfC9nV4YgalnH/IzlSGXJQN1UmhL8YN8apyyedC8yFKfMkNina4VWDflJhqUdRWsU44htwJo7pTD75INsFyJJbXoFX8Mb+vuUSfT9Bkb99FJnczAMLgcRva68+nhdB7ITGxFegrx0ur2f4G6gQFTOSN42nboSGQv+ZwlDXIEPbMbY6pOBwAwGpFCrNbyMx3evaETy5WKGzM3usv05RvHbgWKsAczk1NPOkr74upuJOy3TK+rkBtoREuBqYAOx4EvQVVunzMsVTYy/94AENjbGTDa5h+TqczIGoqr7WLbXnxRM8Ktf1szIqTkCXIuvod69tBwMv8tIcapSWAMyGeXUc3FSco2QQ9jQztyOcTBpaz7ogn3XNialrTv5bVATKFy3gZAqf46Xbe+u4GgmFFR240jGdz4ImhJGfqhnUZxS18C1LuyrwxQBEIh3QWDHXW9mwdD21+4K1mNFaovD/ctClNEOJLcploiCAny78LUF6YsWA2QRlx282X3s95sqiTliQPyOO2n+wgf71M/FObIbzMavgzdbMbhnOurqhWp/vsUCCTksbJrrHWyGCtPa258NTfDR8ZYPQ4hszdvKP3gCSSK1EvbDXZL5XmXYWnQagjczCjCperzKAdP4cHcbnUkvkzvgA74uSrWLsimPnZFMeRh1d7eEh+Bb4ejhJ
 Ne08/ySv
 4BGNGjOhwmPBpvKh7SMdDGxKbYuhrfAwlN3WD45rDUunbxU1YT/BMua/VWuZHsOYANhSvX8nSDB0dvckZ9jXIKh+BrNEQ4esRLHCrF++S5kcUMHwLKwhJHDSfDjGnMGrRRlrQrW2amHuxnl3tZ5ICJEpKcyte6qiwljjc7Q4wrXsDKY6Fk7Xy7S1JPGuP9cYQC/yU5YM1TCes99fnBSEn1LQd8Zl12ghce6Cr/jSFaz3C8yMFZBOa1PSfIFAC6/vWP6UGr/8/cDbDwJaEeQ30xEemM92hfxseij2RefU8OH5zJid/U1+jQP39arJNcj4QrQvD
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

2024=E5=B9=B47=E6=9C=889=E6=97=A5(=E7=81=AB) 4:17 Nhat Pham <nphamcs@gmail.=
com>:

>
> Do you see this problem actually manifesting in real life? Does not
> sound infeasible to me, but I wonder how likely this is the case.
>
> Do you have any userspace-visible metrics, or any tracing logs etc.
> that proves that it actually happens?
>
> This might also affect the dynamic shrinker as well FWIW.
>

Although it is rare, on a small VM with 0.5GB RAM, performing `apt
upgrade` for ubuntu kernel update degrades system responsiveness.
Since kernel upgrade is memory consuming for zstd compressed
initramfs, there is heavy memory pressure like the benchmark.

Unfortunately I could not get evidence that clearly indicates the
contention. Perhaps IO latency can be a metric?
While allocating large memory, perf showed that __swap_writepage() was
consuming time and was called mostly from kswapd and some fraction
from user faults of python script and from shrink_worker. CPU was
mostly idling even in a single CPU system, so lock contention and
compression should not be the reason. I believe these behaviors
suggest contention on writeback IO.
As shown in the benchmark,  reducing shrinker writeback by patches 3
to 6 reduced elapsed time, which also indicates IO contention.

> > +/*
> > + * To avoid IO contention between pagein/out and global shrinker write=
back,
> > + * track the last jiffies of pagein/out and delay the writeback.
> > + * Default to 500msec in alignment with mq-deadline read timeout.
>
> If there is a future version, could you include the reason why you
> select 500msec in the patch's changelog as well?
>

The 500ms can be any value longer than the average interval of each
pageout/in and is not significant for behavior. If subsequent pageout
rejection occurs while the shrinker is sleeping, writeback will be
delayed again by 500ms from the last timestamp update. If pageout
occurs at a 1ms interval on average, the minimal delay should be 1+ms.

I chose 500ms from the mq-deadline scheduler that tries to perform
read IO in a 500ms timeframe by default (bfq for HDD uses a shorter
timeout).
When the shrinker performs writeback IO with a 500ms delay from the
last pagein, the write IO will be of lower priority than the read IO
waiting in the queue, as the pagein read becomes the highest priority
by the deadline. This logic emulates low-priority write IO by
voluntarily delaying IO.


>
> Hmmm is there a reason why we do not just do:
>
> zswap_shrinker_delay_start =3D jiffies;
>
> or, more unnecessarily:
>
> unsigned long now =3D jiffies;
>
> zswap_shrinker_delay_start =3D now;
>
> IOW, why the branching here? Does not seem necessary to me, but
> perhaps there is a correctness/compiler reason I'm not seeing?
>
> In fact, if it's the first version, then we could manually inline it.
>

That was to avoid invalidating the CPU cache of the shared variable
unnecessarily. Removing the branch and manually inlining it for v3.


> Additionally/alternatively, I wonder if it is more convenient to do it
> at the mm/page_io.c zswap callsites, i.e whenever zswap_store() and
> zswap_load() returns false, then delay the shrinker before proceeding
> with the IO steps.
>

Should we expose the timestamp variable? It is only used in zswap
internally, and the timestamp is not required when zswap is disabled.

> >         do {
> > +               /*
> > +                * delay shrinking to allow the last rejected page comp=
letes
> > +                * its writeback
> > +                */
> > +               sleepuntil =3D delay + READ_ONCE(zswap_shrinker_delay_s=
tart);
>
> I assume we do not care about racy access here right? Same goes for
> updates - I don't see any locks protecting these operations (but I
> could be missing something).
>

Right. Do we need atomic or spinlock for safety?
 I think the bare store/load of unsigned long is sufficient here. The
possible deviation by concurrent updates is mostly +/-1 jiffy. Sleep
does not need ms accuracy.

Ah, I found a mistake here. v2 missed continue statement in the loop.
The delay should be extended if zswap_store() rejects another page. In
v2, one writeback was allowed per 500ms, which was not my intended
behavior.
The corrected logic for v3 should be:

               if (time_before(now, sleepuntil) &&
                               time_before(sleepuntil, now + delay + 1)) {
                       fsleep(jiffies_to_usecs(sleepuntil - now));
                       /* check if subsequent pageout/in extended delay */
                       continue;
               }


2024=E5=B9=B47=E6=9C=889=E6=97=A5(=E7=81=AB) 9:57 Nhat Pham <nphamcs@gmail.=
com>:
>
> Hmm what about this scenario: when we disable zswap writeback on a
> cgroup, if zswap_store() fails, we are delaying the global shrinker
> for no gain essentially. There is no subsequent IO. I don't think we
> are currently handling this, right?
>
> >
> > The same logic applies to zswap_load(). When zswap cannot find requeste=
d
> > page from pool and read IO is performed, shrinker should be interrupted=
.
> >
>
> Yet another (less concerning IMHO) scenario is when a cgroup disables
> zswap by setting zswap.max =3D 0 (for instance, if the sysadmin knows
> that this cgroup's data are really cold, and/or that the workload is
> latency-tolerant, and do not want it to take up valuable memory
> resources of other cgroups). Every time this cgroup reclaims memory,
> it would disable the global shrinker (including the new proactive
> behavior) for other cgroup, correct? And, when they do need to swap
> in, it would further delay the global shrinker. Would this break of
> isolation be a problem?
>
> There are other concerns I raised in the cover letter's response as
> well - please take a look :)

I haven't considered these cases much, but I suppose the global
shrinker should be delayed in both cases as well. In general, any
pagein/out should be prefered over shrinker writeback throughput.

When zswap writeback was disabled for a memcg
(memcg.zswap.writeback=3D0), I suppose disabling/delaying writeback is
harmless.
If the rejection incurs no IO, there is no more memory pressure and
shrinking is not urgent. We can postpone the shrinker writeback. If
the rejection incurs IO (i.e. mm choose another page from a memcg with
writeback enabled), again we should delay the shrinker.

For pageout from latency-tolerant memcg (zswap.max=3D0), I think pageout
latency may affect other memcgs.
For example, when a latency-sensitive workload tries to allocate
memory, mm might choose to swap out pages from zswap-disabled memcg.
The slow direct pageout may indirectly delay allocation of the
latency-sensitive workload. IOW, we cannot determine which workload
would be blocked by a slow pageout based on which memcg owns the page.
In this case, it would be better to delay shrinker writeback even if
the owner is latency tolerant memcg.
Also for pagein, we cannot determine how urgent the pagein is.

Delaying shrinker on any pagein/out diminishes proactive shrinking
progress, but that is still better than the existing shrinker that
cannot shrink.