From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 19 Oct 2023 09:28:09 -0700
Subject: Re: [PATCH v3 2/5] zswap: make shrinking memcg-aware
To: Domenico Cerasuolo
Cc: Nhat Pham <nphamcs@gmail.com>, akpm@linux-foundation.org, hannes@cmpxchg.org,
 sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
 muchun.song@linux.dev, linux-mm@kvack.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kselftest@vger.kernel.org, shuah@kernel.org
References: <20231017232152.2605440-1-nphamcs@gmail.com> <20231017232152.2605440-3-nphamcs@gmail.com>

On Thu, Oct 19, 2023 at 5:47 AM Domenico Cerasuolo wrote:
>
> On Thu, Oct 19, 2023 at 3:12 AM Yosry Ahmed wrote:
> >
> > On Wed, Oct 18, 2023 at 4:47 PM Nhat Pham wrote:
> > >
> > > On Wed, Oct 18, 2023 at 4:20 PM Yosry Ahmed wrote:
> > > >
> > > > On Tue, Oct 17, 2023 at 4:21 PM Nhat Pham wrote:
> > > > >
> > > > > From: Domenico Cerasuolo
> > > > >
> > > > > Currently, we only have a single global LRU for zswap.
> > > > > This makes it impossible to perform workload-specific shrinking -
> > > > > a memcg cannot determine which pages in the pool it owns, and
> > > > > often ends up writing pages from other memcgs. This issue has been
> > > > > previously observed in practice and mitigated by simply disabling
> > > > > memcg-initiated shrinking:
> > > > >
> > > > > https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
> > > > >
> > > > > This patch fully resolves the issue by replacing the global zswap
> > > > > LRU with memcg- and NUMA-specific LRUs, and modifying the reclaim
> > > > > logic:
> > > > >
> > > > > a) When a store attempt hits a memcg limit, it now triggers a
> > > > >    synchronous reclaim attempt that, if successful, allows the new
> > > > >    hotter page to be accepted by zswap.
> > > > > b) If the store attempt instead hits the global zswap limit, it
> > > > >    will trigger an asynchronous reclaim attempt, in which a memcg
> > > > >    is selected for reclaim in a round-robin-like fashion.
> > > >
> > > > Could you explain the rationale behind the difference in behavior
> > > > here between the global limit and the memcg limit?
> > >
> > > The global limit hit reclaim behavior was previously asynchronous too.
> > > We just added the round-robin part because now the zswap LRU is
> > > cgroup-aware :)
> > >
> > > For the cgroup limit hit, however, we cannot make it asynchronous,
> > > as it is a bit hairy to add a per-cgroup shrink_work. So, we just
> > > perform the reclaim synchronously.
> > >
> > > The question is whether it makes sense to make the global limit
> > > reclaim synchronous too. That is a task of its own IMO.
> >
> > Let's add such context to the commit log, and perhaps an XXX comment
> > in the code asking whether we should consider doing the reclaim
> > synchronously for the global limit too.
>
> Makes sense, I wonder if the original reason for switching from a synchronous
> to asynchronous reclaim will still be valid with the shrinker in place.
>

Seems like it was done as part of the hysteresis that stops accepting
pages into zswap once full until it reaches a certain threshold, commit
45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is
hit"). I guess the point is that zswap will stop accepting pages when
the limit is hit anyway, so there is no need to synchronously shrink it
since we can't store soon anyway. More useful context for the commit log.

> >
> > >
> > > (FWIW, this somewhat mirrors the direct reclaimer vs. kswapd
> > > story to me, but don't quote me too hard on this).
> > >
> >
[..]
> >
> > > > >
> > > > >                 /* Hold a reference to prevent a free during writeback */
> > > > >                 zswap_entry_get(entry);
> > > > >                 spin_unlock(&tree->lock);
> > > > >
> > > > > -               ret = zswap_writeback_entry(entry, tree);
> > > > > +               writeback_result = zswap_writeback_entry(entry, tree);
> > > > >
> > > > >                 spin_lock(&tree->lock);
> > > > > -               if (ret) {
> > > > > -                       /* Writeback failed, put entry back on LRU */
> > > > > -                       spin_lock(&pool->lru_lock);
> > > > > -                       list_move(&entry->lru, &pool->lru);
> > > > > -                       spin_unlock(&pool->lru_lock);
> > > > > +               if (writeback_result) {
> > > > > +                       zswap_reject_reclaim_fail++;
> > > > > +                       memcg = get_mem_cgroup_from_entry(entry);
> > > > > +                       spin_lock(lock);
> > > > > +                       /* we cannot use zswap_lru_add here, because it increments node's lru count */
> > > > > +                       list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
> > > > > +                       spin_unlock(lock);
> > > > > +                       mem_cgroup_put(memcg);
> > > > > +                       ret = LRU_RETRY;
> > > > >                         goto put_unlock;
> > > > >                 }
> > > > > +               zswap_written_back_pages++;
> > > >
> > > > Why is this moved here from zswap_writeback_entry()? Also why is
> > > > zswap_reject_reclaim_fail incremented here instead of inside
> > > > zswap_writeback_entry()?
> > >
> > > Domenico should know this better than me, but my understanding
> > > is that moving it here protects concurrent modifications of
> > > zswap_written_back_pages with the tree lock.
> > >
> > > Was writeback single-threaded in the past? This counter is non-atomic,
> > > and doesn't seem to be protected by any locks...
> > >
> > > There definitely can be concurrent stores now though - with
> > > a synchronous reclaim from a cgroup-limit hit and another
> > > from the old shrink worker.
> > >
> > > (and with the new zswap shrinker, concurrent reclaim is
> > > the expectation!)
> >
> > The comment above the stats definition states that they are left
> > unprotected purposefully. If we want to fix that let's do it
> > separately. If this patch makes it significantly worse such that it
> > would cause a regression, let's at least do it in a separate patch.
> > The diff here is too large already.
> > >
> > > zswap_reject_reclaim_fail was previously incremented in
> > > shrink_worker I think. We need it to be incremented
> > > for the shrinker as well, so might as well move it here.
> >
> > Wouldn't moving it inside zswap_writeback_entry() near incrementing
> > zswap_written_back_pages make it easier to follow?
>
> As Nhat said, zswap_reject_reclaim_fail++ had to be moved, I naturally
> moved it here because it's where we act upon the result of the
> writeback. I then noticed that zswap_written_back_pages++ was elsewhere
> and decided to move that as well so that they're in the same place and
> at least they're under the tree lock.
>
> It's not meant to fix the unprotected counters, it's just a mitigation
> since we are forced to move at least one of them.

I see. Just for my own understanding, it would be correct to update
them in zswap_writeback_entry(), but we choose to do it here as we
happen to hold the lock so we get the free synchronization? I think it
could be beneficial as we may see increased concurrent writeback with
this series.
Probably something to call out in the commit log as well.