From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 79D5BC3DA41 for ; Wed, 10 Jul 2024 21:21:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E33526B0099; Wed, 10 Jul 2024 17:21:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DE2356B00B0; Wed, 10 Jul 2024 17:21:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C5D186B00B1; Wed, 10 Jul 2024 17:21:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id A3D7E6B0099 for ; Wed, 10 Jul 2024 17:21:23 -0400 (EDT) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 55EA7C0278 for ; Wed, 10 Jul 2024 21:21:23 +0000 (UTC) X-FDA: 82325114046.13.D60C0DD Received: from mail-yb1-f176.google.com (mail-yb1-f176.google.com [209.85.219.176]) by imf17.hostedemail.com (Postfix) with ESMTP id 78D0C4000F for ; Wed, 10 Jul 2024 21:21:21 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Fw2OFbxd; spf=pass (imf17.hostedemail.com: domain of flintglass@gmail.com designates 209.85.219.176 as permitted sender) smtp.mailfrom=flintglass@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1720646450; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=JWwzDBJy8f277WXqXl6R3RMuPRUwh+qJXOSNVB7ffMc=; b=pis/myA5ZO5cYpEqrUW00Hf4Ln4mHYSx0E80MbjNEv32pCCSSzHb8jQS3c/kKoc6G2QvRq clm8y2W+PMK2cdOxlprfTw7FCgH61zj4w5WiIx6OM3wgSeA1kEfDAZa/yrEI9YfKX8DpN9 3EcT35ZKr19bMFlmWD8wMUmovJ54rgw= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1720646450; a=rsa-sha256; cv=none; b=2DDKvNdScwHmicjp/OjXJBbjykprkLPI/z80RPmmO7QgHjqP+n3W+m6herT+4cUC7FS7+n b4mhz9FLalvHf4wKDhcy8idNR9NKJgTBaH78faaMY3ckNS6Ypi6fASsXHXtH3r8NUKq13T RTQqGH3v+lxf9BKxou2KW6vrlDpf3IM= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Fw2OFbxd; spf=pass (imf17.hostedemail.com: domain of flintglass@gmail.com designates 209.85.219.176 as permitted sender) smtp.mailfrom=flintglass@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-yb1-f176.google.com with SMTP id 3f1490d57ef6-e036d1ce4f7so243825276.0 for ; Wed, 10 Jul 2024 14:21:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1720646480; x=1721251280; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=JWwzDBJy8f277WXqXl6R3RMuPRUwh+qJXOSNVB7ffMc=; b=Fw2OFbxdRfjCJaO9JjDmGOmwkxJ0KV9Xb9BmP5pJ/cjlr27vBSU8T0QowQq6bSQWKX qkAScBTehef+1LpdyGYXwzXss531uiOR6+FBkU1Rt73Uug5yn1BjeAJ3pbR4fqoQg0Et rD3o/puccMF6MowFDOqSAS0UMY8Eel9Pqt5CnUnoXEOvJHZkc4Ixf5BOfzEvZKmwxYPX 3CfbaBxIKoFfUbfjMQkNKizvd5ZutFaqvXGTqs7VvOSXvKZvbs+mdD5ZpXeX8EKE0UJz 30LqpTZSXm0cDc9Ap132Ti4uNK9CWkp2MkXlyicKpFXB6wyuAPBtfZ5Xj9deTCafI20i FiPQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1720646480; x=1721251280; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JWwzDBJy8f277WXqXl6R3RMuPRUwh+qJXOSNVB7ffMc=; b=m/TVe7kuNg65Dvnr2Pmbg+qusCXhR7giGxWGRBpgOMPj6WM/Ni/u4SSY4S7ON8xScH OSkmxKMnas1hHxhUCIKfXTMGrzOC6cTietJxf2+m/x8sVLptH6LCvf/1gWh3cNzGFBz/ o1audbfwDL+91FSz+nCaumLGeqJqWmigtMZzPIkqOwLxw3k4BgiOXp3m9nTESFKzRmAf 13KDqeLnzOYjmDCJOkAuAce0xf0E/kigRUiRZ2n9TTA2bYNLf6RI3d5J5jY4/BztoBc8 CoLwFmhm87PZVZl2ElYA1Lqrv5b9wsU3gU95M3ArCSiaspVXLM+UFRNPE5IFSOOBHe/K ya3A== X-Forwarded-Encrypted: i=1; AJvYcCWzaX6bNKkzIRR0Ugjm3IvJoJJc5yKgRinyph1G/avHhtE7c6TyEA/Kxeq2tn0kquSPQ/hfxVI1GokPs/fojmj4w+I= X-Gm-Message-State: AOJu0YzT+zvb/qaeNCYZXp0/Z44Q2HVPz9T164tXuVwRVrxDxcxKXfpY exISKjt0aAswoByJNMNauAmK/iyJC41DptxgvlUGsoi4UFe5b1TKDpx6izWEaik1Z1PKbYxiCyp oJ7SShNa3PhlEpZbvK4j9q3QtGpY= X-Google-Smtp-Source: AGHT+IH6bYgfsoNbVGtdjaiR7EunA/HaBDZYQjDSrczVs0R6MHfQSww7S19jaEyCiiHD4LMGaazU7obguZVGu0/l9pM= X-Received: by 2002:a25:7d02:0:b0:e03:5b97:cbec with SMTP id 3f1490d57ef6-e0577fb65e5mr780329276.10.1720646480390; Wed, 10 Jul 2024 14:21:20 -0700 (PDT) MIME-Version: 1.0 References: <20240706022523.1104080-1-flintglass@gmail.com> <20240706022523.1104080-7-flintglass@gmail.com> In-Reply-To: From: Takero Funaki Date: Thu, 11 Jul 2024 06:21:09 +0900 Message-ID: Subject: Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO To: Nhat Pham Cc: Johannes Weiner , Yosry Ahmed , Chengming Zhou , Jonathan Corbet , Andrew Morton , Domenico Cerasuolo , linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 78D0C4000F X-Stat-Signature: 3oy1wofmy55eh7ey1wwt5bc7tu34ktrt X-Rspamd-Server: rspam09 X-Rspam-User: X-HE-Tag: 1720646481-920764 X-HE-Meta: U2FsdGVkX18JuW+JrhrkIu7dajavsdWws43e84wFTbyUUJ/Rgvcc/rltGC8Z82gosmBUOQoka8cxnLvi5w5yJtLTZIrjpBfP0QCtN6E6JxnNBsG+jL8R3dLQV9ZIvTlc3pKx4qHeRt6hZ4y+3+PDwKfj6ZoaOjQbW42J+UKASeHVPk7rBOobvtLqWg4IMbkF6eUcg+2n7Xgzz5nGS/n9dLHTzRdCZZjkyUW+i4/g7gyGChhsT5DFH0OP/T4/2zeHrTR6DUMhUDE5mJsqtjO+tJSrwv4n2ipcmzh5TMMYL/MUQDfC9nV4YgalnH/IzlSGXJQN1UmhL8YN8apyyedC8yFKfMkNina4VWDflJhqUdRWsU44htwJo7pTD75INsFyJJbXoFX8Mb+vuUSfT9Bkb99FJnczAMLgcRva68+nhdB7ITGxFegrx0ur2f4G6gQFTOSN42nboSGQv+ZwlDXIEPbMbY6pOBwAwGpFCrNbyMx3evaETy5WKGzM3usv05RvHbgWKsAczk1NPOkr74upuJOy3TK+rkBtoREuBqYAOx4EvQVVunzMsVTYy/94AENjbGTDa5h+TqczIGoqr7WLbXnxRM8Ktf1szIqTkCXIuvod69tBwMv8tIcapSWAMyGeXUc3FSco2QQ9jQztyOcTBpaz7ogn3XNialrTv5bVATKFy3gZAqf46Xbe+u4GgmFFR240jGdz4ImhJGfqhnUZxS18C1LuyrwxQBEIh3QWDHXW9mwdD21+4K1mNFaovD/ctClNEOJLcploiCAny78LUF6YsWA2QRlx282X3s95sqiTliQPyOO2n+wgf71M/FObIbzMavgzdbMbhnOurqhWp/vsUCCTksbJrrHWyGCtPa258NTfDR8ZYPQ4hszdvKP3gCSSK1EvbDXZL5XmXYWnQagjczCjCperzKAdP4cHcbnUkvkzvgA74uSrWLsimPnZFMeRh1d7eEh+Bb4ejhJ Ne08/ySv 4BGNGjOhwmPBpvKh7SMdDGxKbYuhrfAwlN3WD45rDUunbxU1YT/BMua/VWuZHsOYANhSvX8nSDB0dvckZ9jXIKh+BrNEQ4esRLHCrF++S5kcUMHwLKwhJHDSfDjGnMGrRRlrQrW2amHuxnl3tZ5ICJEpKcyte6qiwljjc7Q4wrXsDKY6Fk7Xy7S1JPGuP9cYQC/yU5YM1TCes99fnBSEn1LQd8Zl12ghce6Cr/jSFaz3C8yMFZBOa1PSfIFAC6/vWP6UGr/8/cDbDwJaEeQ30xEemM92hfxseij2RefU8OH5zJid/U1+jQP39arJNcj4QrQvD X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: 2024=E5=B9=B47=E6=9C=889=E6=97=A5(=E7=81=AB) 4:17 Nhat Pham : > > Do you see this problem actually manifesting in real life? Does not > sound infeasible to me, but I wonder how likely this is the case. > > Do you have any userspace-visible metrics, or any tracing logs etc. > that proves that it actually happens? > > This might also affect the dynamic shrinker as well FWIW. > Although it is rare, on a small VM with 0.5GB RAM, performing `apt upgrade` for ubuntu kernel update degrades system responsiveness. Since kernel upgrade is memory consuming for zstd compressed initramfs, there is heavy memory pressure like the benchmark. Unfortunately I could not get evidence that clearly indicates the contention. Perhaps IO latency can be a metric? While allocating large memory, perf showed that __swap_writepage() was consuming time and was called mostly from kswapd and some fraction from user faults of python script and from shrink_worker. CPU was mostly idling even in a single CPU system, so lock contention and compression should not be the reason. I believe these behaviors suggest contention on writeback IO. As shown in the benchmark, reducing shrinker writeback by patches 3 to 6 reduced elapsed time, which also indicates IO contention. > > +/* > > + * To avoid IO contention between pagein/out and global shrinker write= back, > > + * track the last jiffies of pagein/out and delay the writeback. > > + * Default to 500msec in alignment with mq-deadline read timeout. > > If there is a future version, could you include the reason why you > select 500msec in the patch's changelog as well? > The 500ms can be any value longer than the average interval of each pageout/in and is not significant for behavior. If subsequent pageout rejection occurs while the shrinker is sleeping, writeback will be delayed again by 500ms from the last timestamp update. If pageout occurs at a 1ms interval on average, the minimal delay should be 1+ms. I chose 500ms from the mq-deadline scheduler that tries to perform read IO in a 500ms timeframe by default (bfq for HDD uses a shorter timeout). When the shrinker performs writeback IO with a 500ms delay from the last pagein, the write IO will be of lower priority than the read IO waiting in the queue, as the pagein read becomes the highest priority by the deadline. This logic emulates low-priority write IO by voluntarily delaying IO. > > Hmmm is there a reason why we do not just do: > > zswap_shrinker_delay_start =3D jiffies; > > or, more unnecessarily: > > unsigned long now =3D jiffies; > > zswap_shrinker_delay_start =3D now; > > IOW, why the branching here? Does not seem necessary to me, but > perhaps there is a correctness/compiler reason I'm not seeing? > > In fact, if it's the first version, then we could manually inline it. > That was to avoid invalidating the CPU cache of the shared variable unnecessarily. Removing the branch and manually inlining it for v3. > Additionally/alternatively, I wonder if it is more convenient to do it > at the mm/page_io.c zswap callsites, i.e whenever zswap_store() and > zswap_load() returns false, then delay the shrinker before proceeding > with the IO steps. > Should we expose the timestamp variable? It is only used in zswap internally, and the timestamp is not required when zswap is disabled. > > do { > > + /* > > + * delay shrinking to allow the last rejected page comp= letes > > + * its writeback > > + */ > > + sleepuntil =3D delay + READ_ONCE(zswap_shrinker_delay_s= tart); > > I assume we do not care about racy access here right? Same goes for > updates - I don't see any locks protecting these operations (but I > could be missing something). > Right. Do we need atomic or spinlock for safety? I think the bare store/load of unsigned long is sufficient here. The possible deviation by concurrent updates is mostly +/-1 jiffy. Sleep does not need ms accuracy. Ah, I found a mistake here. v2 missed continue statement in the loop. The delay should be extended if zswap_store() rejects another page. In v2, one writeback was allowed per 500ms, which was not my intended behavior. The corrected logic for v3 should be: if (time_before(now, sleepuntil) && time_before(sleepuntil, now + delay + 1)) { fsleep(jiffies_to_usecs(sleepuntil - now)); /* check if subsequent pageout/in extended delay */ continue; } 2024=E5=B9=B47=E6=9C=889=E6=97=A5(=E7=81=AB) 9:57 Nhat Pham : > > Hmm what about this scenario: when we disable zswap writeback on a > cgroup, if zswap_store() fails, we are delaying the global shrinker > for no gain essentially. There is no subsequent IO. I don't think we > are currently handling this, right? > > > > > The same logic applies to zswap_load(). When zswap cannot find requeste= d > > page from pool and read IO is performed, shrinker should be interrupted= . > > > > Yet another (less concerning IMHO) scenario is when a cgroup disables > zswap by setting zswap.max =3D 0 (for instance, if the sysadmin knows > that this cgroup's data are really cold, and/or that the workload is > latency-tolerant, and do not want it to take up valuable memory > resources of other cgroups). Every time this cgroup reclaims memory, > it would disable the global shrinker (including the new proactive > behavior) for other cgroup, correct? And, when they do need to swap > in, it would further delay the global shrinker. Would this break of > isolation be a problem? > > There are other concerns I raised in the cover letter's response as > well - please take a look :) I haven't considered these cases much, but I suppose the global shrinker should be delayed in both cases as well. In general, any pagein/out should be prefered over shrinker writeback throughput. When zswap writeback was disabled for a memcg (memcg.zswap.writeback=3D0), I suppose disabling/delaying writeback is harmless. If the rejection incurs no IO, there is no more memory pressure and shrinking is not urgent. We can postpone the shrinker writeback. If the rejection incurs IO (i.e. mm choose another page from a memcg with writeback enabled), again we should delay the shrinker. For pageout from latency-tolerant memcg (zswap.max=3D0), I think pageout latency may affect other memcgs. For example, when a latency-sensitive workload tries to allocate memory, mm might choose to swap out pages from zswap-disabled memcg. The slow direct pageout may indirectly delay allocation of the latency-sensitive workload. IOW, we cannot determine which workload would be blocked by a slow pageout based on which memcg owns the page. In this case, it would be better to delay shrinker writeback even if the owner is latency tolerant memcg. Also for pagein, we cannot determine how urgent the pagein is. Delaying shrinker on any pagein/out diminishes proactive shrinking progress, but that is still better than the existing shrinker that cannot shrink.