From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 18 Oct 2023 18:11:41 -0700
Subject: Re: [PATCH v3 2/5] zswap: make shrinking memcg-aware
To: Nhat Pham
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
 sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
 muchun.song@linux.dev, linux-mm@kvack.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kselftest@vger.kernel.org, shuah@kernel.org
References: <20231017232152.2605440-1-nphamcs@gmail.com> <20231017232152.2605440-3-nphamcs@gmail.com>

On Wed, Oct 18, 2023 at 4:47 PM Nhat Pham wrote:
>
> On Wed, Oct 18, 2023 at 4:20 PM Yosry Ahmed wrote:
> >
> > On Tue, Oct 17, 2023 at 4:21 PM Nhat Pham wrote:
> > >
> > > From: Domenico Cerasuolo
> > >
> > > Currently, we only have a single global LRU for zswap. This makes it
> > > impossible to perform workload-specific shrinking - a memcg cannot
> > > determine which pages in the pool it owns, and often ends up writing
> > > pages from other memcgs. This issue has been previously observed in
> > > practice and mitigated by simply disabling memcg-initiated shrinking:
> > >
> > > https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
> > >
> > > This patch fully resolves the issue by replacing the global zswap LRU
> > > with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
> > >
> > > a) When a store attempt hits a memcg limit, it now triggers a
> > > synchronous reclaim attempt that, if successful, allows the new
> > > hotter page to be accepted by zswap.
> > > b) If the store attempt instead hits the global zswap limit, it will
> > > trigger an asynchronous reclaim attempt, in which a memcg is
> > > selected for reclaim in a round-robin-like fashion.
> >
> > Could you explain the rationale behind the difference in behavior here
> > between the global limit and the memcg limit?
>
> The global limit hit reclaim behavior was previously asynchronous too.
> We just added the round-robin part because now the zswap LRU is
> cgroup-aware :)
>
> For the cgroup limit hit, however, we cannot make it asynchronous,
> as it is a bit hairy to add a per-cgroup shrink_work. So, we just
> perform the reclaim synchronously.
>
> The question is whether it makes sense to make the global limit
> reclaim synchronous too. That is a task of its own IMO.

Let's add such context to the commit log, and perhaps an XXX comment in
the code asking whether we should consider doing the reclaim
synchronously for the global limit too.

> (FWIW, this somewhat mirrors the direct reclaimer vs. kswapd
> story to me, but don't quote me too hard on this).
>
[..]
> > >
> > >                 /* Hold a reference to prevent a free during writeback */
> > >                 zswap_entry_get(entry);
> > >                 spin_unlock(&tree->lock);
> > >
> > > -               ret = zswap_writeback_entry(entry, tree);
> > > +               writeback_result = zswap_writeback_entry(entry, tree);
> > >
> > >                 spin_lock(&tree->lock);
> > > -               if (ret) {
> > > -                       /* Writeback failed, put entry back on LRU */
> > > -                       spin_lock(&pool->lru_lock);
> > > -                       list_move(&entry->lru, &pool->lru);
> > > -                       spin_unlock(&pool->lru_lock);
> > > +               if (writeback_result) {
> > > +                       zswap_reject_reclaim_fail++;
> > > +                       memcg = get_mem_cgroup_from_entry(entry);
> > > +                       spin_lock(lock);
> > > +                       /* we cannot use zswap_lru_add here, because it increments node's lru count */
> > > +                       list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
> > > +                       spin_unlock(lock);
> > > +                       mem_cgroup_put(memcg);
> > > +                       ret = LRU_RETRY;
> > >                         goto put_unlock;
> > >                 }
> > > +               zswap_written_back_pages++;
> >
> > Why is this moved here from zswap_writeback_entry()? Also, why is
> > zswap_reject_reclaim_fail incremented here instead of inside
> > zswap_writeback_entry()?
>
> Domenico should know this better than me, but my understanding
> is that moving it here protects concurrent modifications of
> zswap_written_back_pages with the tree lock.
>
> Was writeback single-threaded in the past?
> This counter is non-atomic,
> and doesn't seem to be protected by any locks...
>
> There definitely can be concurrent stores now though - with
> a synchronous reclaim from a cgroup-limit hit and another
> from the old shrink worker.
>
> (and with the new zswap shrinker, concurrent reclaim is
> the expectation!)

The comment above the stats definitions states that they are left
unprotected purposefully. If we want to fix that, let's do it
separately. If this patch makes it significantly worse, such that it
would cause a regression, let's at least do it in a separate patch.
The diff here is too large already.

> zswap_reject_reclaim_fail was previously incremented in
> shrink_worker, I think. We need it to be incremented
> for the shrinker as well, so we might as well move it here.

Wouldn't moving it inside zswap_writeback_entry(), next to the
zswap_written_back_pages increment, make it easier to follow?
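Roughly what I have in mind is sketched below. This is a standalone
illustration, not the actual zswap code; writeback_entry() and
simulate_failure are made-up names, just to show both counters being
bumped at the point where success or failure is decided rather than at
the caller:

#include <stdio.h>

/*
 * Stand-ins for zswap_written_back_pages and zswap_reject_reclaim_fail;
 * like the real stats, they are plain (unprotected) counters.
 */
static unsigned long written_back_pages;
static unsigned long reject_reclaim_fail;

/*
 * Stand-in for zswap_writeback_entry(): the failure and success
 * counters live side by side in the function that knows the outcome,
 * so callers don't have to duplicate the accounting.
 */
static int writeback_entry(int simulate_failure)
{
	if (simulate_failure) {
		reject_reclaim_fail++;
		return -1;
	}
	written_back_pages++;
	return 0;
}

int main(void)
{
	writeback_entry(0);
	writeback_entry(1);
	printf("written back: %lu, rejected: %lu\n",
	       written_back_pages, reject_reclaim_fail);
	return 0;
}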