From: Nhat Pham <nphamcs@gmail.com>
Date: Wed, 10 Jul 2024 15:10:00 -0700
Subject: Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
To: Takero Funaki <flintglass@gmail.com>
Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
 Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20240706022523.1104080-1-flintglass@gmail.com>
 <20240706022523.1104080-7-flintglass@gmail.com>
On Wed, Jul 10, 2024 at 2:21 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 4:17 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Do you see this problem actually manifesting in real life? It does
> > not sound infeasible to me, but I wonder how likely it is in
> > practice.
> >
> > Do you have any userspace-visible metrics, or any tracing logs, etc.
> > that prove it actually happens?
> >
> > This might affect the dynamic shrinker as well, FWIW.
> >
>
> Although it is rare, on a small VM with 0.5GB of RAM, performing `apt
> upgrade` for an Ubuntu kernel update degrades system responsiveness.
> Since a kernel upgrade consumes a lot of memory building the
> zstd-compressed initramfs, it creates heavy memory pressure similar
> to the benchmark.
>
> Unfortunately, I could not get evidence that clearly indicates the
> contention. Perhaps IO latency could be a metric?

IO latency, IO pressure, etc. I'm not sure either, as I do not have the
benchmark setup at hand - just throwing out ideas :)

> While allocating large amounts of memory, perf showed that
> __swap_writepage() was consuming time and was called mostly from
> kswapd, with some fraction from user faults of the python script and
> from shrink_worker. The CPU was mostly idle even on a single-CPU
> system, so lock contention and compression should not be the reason.
> I believe this behavior suggests contention on writeback IO.

Sounds fair to me.

> As shown in the benchmark, reducing shrinker writeback via patches 3
> to 6 reduced the elapsed time, which also indicates IO contention.
>
> > > +/*
> > > + * To avoid IO contention between pagein/out and global shrinker writeback,
> > > + * track the last jiffies of pagein/out and delay the writeback.
> > > + * Default to 500msec in alignment with mq-deadline read timeout.
> >
> > If there is a future version, could you include the reason why you
> > selected 500msec in the patch's changelog as well?
> >
>
> The 500ms can be any value longer than the average interval between
> pageout/in events and is not significant for the behavior. If a
> subsequent pageout rejection occurs while the shrinker is sleeping,
> writeback will be delayed again by 500ms from the last timestamp
> update. If pageouts occur at a 1ms interval on average, the minimal
> delay need only be a bit over 1ms.
>
> I chose 500ms based on the mq-deadline scheduler, which by default
> tries to complete read IO within a 500ms timeframe (bfq for HDDs uses
> a shorter timeout).
> When the shrinker performs writeback IO with a 500ms delay from the
> last pagein, the write IO is effectively of lower priority than any
> read IO waiting in the queue, as the pagein read becomes the highest
> priority once its deadline is reached. This logic emulates
> low-priority write IO by voluntarily delaying it.
>
> >
> > Hmmm, is there a reason why we do not just do:
> >
> > zswap_shrinker_delay_start = jiffies;
> >
> > or, more unnecessarily:
> >
> > unsigned long now = jiffies;
> >
> > zswap_shrinker_delay_start = now;
> >
> > IOW, why the branching here? It does not seem necessary to me, but
> > perhaps there is a correctness/compiler reason I'm not seeing?
> >
> > In fact, if it's the first version, then we could manually inline it.
> >
>
> That was to avoid invalidating the CPU cache line of the shared
> variable unnecessarily. I will remove the branch and manually inline
> it for v3.

Ah, I see. Seems a tad overengineered IMHO, but I'll leave this up to
you. If you keep the current version, please add a comment.
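[ For reference, a minimal sketch of the guarded update being debated
  above. Only zswap_shrinker_delay_start is from the patch; the helper
  name and the READ_ONCE/WRITE_ONCE pairing are illustrative
  assumptions, not the actual submission. ]

/* Sketch only: guarded timestamp update on the pagein/out path. */
static unsigned long zswap_shrinker_delay_start;

static inline void zswap_shrinker_delay_update(void)
{
        unsigned long now = jiffies;

        /*
         * Store only when the value actually changes, so frequent
         * pagein/out on many CPUs does not repeatedly invalidate the
         * shared cacheline. The racy load/store is tolerated: the
         * worst case is a timestamp stale by roughly one jiffy.
         */
        if (READ_ONCE(zswap_shrinker_delay_start) != now)
                WRITE_ONCE(zswap_shrinker_delay_start, now);
}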
> >
> > Additionally/alternatively, I wonder if it is more convenient to do
> > it at the mm/page_io.c zswap callsites, i.e. whenever zswap_store()
> > or zswap_load() returns false, delay the shrinker before proceeding
> > with the IO steps.
> >
>
> Should we expose the timestamp variable? It is only used internally
> by zswap, and the timestamp is not required when zswap is disabled.

Not necessarily the timestamp. If you keep the helper, you can just
expose that helper (and have a no-op when !CONFIG_ZSWAP in the zswap.h
header file).

> > > do {
> > > +        /*
> > > +         * Delay shrinking to allow the last rejected page to
> > > +         * complete its writeback.
> > > +         */
> > > +        sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
> >
> > I assume we do not care about racy access here, right? Same goes for
> > updates - I don't see any locks protecting these operations (but I
> > could be missing something).
> >
>
> Right. Do we need an atomic or a spinlock for safety?
> I think a bare store/load of an unsigned long is sufficient here. The
> possible deviation from concurrent updates is mostly +/- 1 jiffy, and
> the sleep does not need millisecond accuracy.
>
> Ah, I found a mistake here. v2 missed a continue statement in the
> loop. The delay should be extended if zswap_store() rejects another
> page. In v2, one writeback was allowed per 500ms, which was not my
> intended behavior.
> The corrected logic for v3 should be:
>
> if (time_before(now, sleepuntil) &&
>     time_before(sleepuntil, now + delay + 1)) {
>         fsleep(jiffies_to_usecs(sleepuntil - now));
>         /* check if a subsequent pageout/in extended the delay */
>         continue;
> }
>
> On Tue, Jul 9, 2024 at 9:57 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Hmm, what about this scenario: when we disable zswap writeback on a
> > cgroup, if zswap_store() fails, we are delaying the global shrinker
> > for essentially no gain. There is no subsequent IO. I don't think
> > we are currently handling this, right?
> >
> > >
> > > The same logic applies to zswap_load(). When zswap cannot find the
> > > requested page in the pool and read IO is performed, the shrinker
> > > should be interrupted.
> > >
> >
> > Yet another (less concerning, IMHO) scenario is when a cgroup
> > disables zswap by setting zswap.max = 0 (for instance, if the
> > sysadmin knows that this cgroup's data are really cold, and/or that
> > the workload is latency-tolerant, and does not want it to take up
> > valuable memory resources of other cgroups). Every time this cgroup
> > reclaims memory, it would disable the global shrinker (including the
> > new proactive behavior) for the other cgroups, correct? And, when it
> > does need to swap in, it would further delay the global shrinker.
> > Would this break of isolation be a problem?
> >
> > There are other concerns I raised in the cover letter's response as
> > well - please take a look :)
>
> I haven't considered these cases much, but I suppose the global
> shrinker should be delayed in both cases as well. In general, any
> pagein/out should be preferred over shrinker writeback throughput.
>
> When zswap writeback is disabled for a memcg
> (memcg.zswap.writeback=0), I suppose disabling/delaying writeback is
> harmless.
> If the rejection incurs no IO, there is no more memory pressure and
> shrinking is not urgent, so we can postpone the shrinker writeback.
> If the rejection does incur IO (i.e. mm chooses another page from a
> memcg with writeback enabled), again we should delay the shrinker.

You are delaying writeback globally, right? IOW, other cgroups are
also affected?
Say we are under memory pressure, with two cgroups under reclaim - one
with zswap writeback disabled. The one with writeback disabled will
constantly fail at zswap_store() and delay the global shrinking action,
which could have targeted the other cgroup (and which should proceed at
full speed, because there is no contention here - the first cgroup is
not swapping either).

>
> For pageout from a latency-tolerant memcg (zswap.max=0), I think the
> pageout latency may affect other memcgs.
> For example, when a latency-sensitive workload tries to allocate
> memory, mm might choose to swap out pages from the zswap-disabled
> memcg. The slow direct pageout may indirectly delay the allocation of
> the latency-sensitive workload. IOW, we cannot determine which
> workload would be blocked by a slow pageout based on which memcg owns
> the page. In this case, it would be better to delay shrinker writeback
> even if the owner is a latency-tolerant memcg.
> Also, for pagein, we cannot determine how urgent the pagein is.
>
> Delaying the shrinker on any pagein/out diminishes proactive
> shrinking progress, but that is still better than the existing
> shrinker that cannot shrink at all.

This is fair though :)
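[ For reference, a minimal sketch of how the corrected v3 check quoted
  above would sit inside the global shrinker loop. It assumes the names
  from this thread (delay, sleepuntil, zswap_shrinker_delay_start);
  zswap_shrinker_have_work() is a hypothetical placeholder for the
  shrinker's real continuation condition, and this is an illustration,
  not the actual patch. ]

static void shrink_worker_sketch(void)
{
        /* ~500ms, per the mq-deadline rationale discussed above */
        unsigned long delay = msecs_to_jiffies(500);
        unsigned long sleepuntil, now;

        do {
                sleepuntil = delay +
                        READ_ONCE(zswap_shrinker_delay_start);
                now = jiffies;
                /*
                 * Sleep only if a pagein/out happened recently enough
                 * that the deadline is still ahead of us, and
                 * sanity-check that it is no more than one full delay
                 * away.
                 */
                if (time_before(now, sleepuntil) &&
                    time_before(sleepuntil, now + delay + 1)) {
                        fsleep(jiffies_to_usecs(sleepuntil - now));
                        /*
                         * A subsequent pageout/in may have extended
                         * the delay; re-evaluate instead of writing
                         * back (the continue missing in v2).
                         */
                        continue;
                }

                /* ... perform one batch of shrinker writeback ... */
        } while (zswap_shrinker_have_work());
}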