From: Takero Funaki <flintglass@gmail.com>
Date: Wed, 29 May 2024 21:42:23 +0900
Subject: Re: [PATCH 1/3] mm: zswap: fix global shrinker memcg iteration
To: Nhat Pham
Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
	Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed, May 29, 2024 at 0:10, Nhat Pham wrote:
>
> On Mon, May 27, 2024 at 9:34 PM Takero Funaki wrote:
> >
> > This patch fixes an issue where the zswap global shrinker stopped
> > iterating through the memcg tree.
>
> Did you observe this problem in practice?

Thank you for your comments.

I think this situation is rare, but I need to address it to ensure
patch 2 will not stop at the offline memcg.
The main issue I am seeing in versions 6.9.0 to 6.10.0-rc1 is that the
shrinker does not shrink down to accept_thr_percent, and
written_back_pages is smaller than max_limit_hit. This can be
explained by the shrinker stopping too early or looping over only
part of the tree.

> >
> > The problem was that `shrink_worker()` would stop iterating when a memcg
> > was being offlined and restart from the tree root. Now, it properly
> > handles the offlining memcg and continues shrinking with the next memcg.
> >
> > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > Signed-off-by: Takero Funaki
> > ---
> >  mm/zswap.c | 76 ++++++++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 56 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index a50e2986cd2f..0b1052cee36c 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -775,12 +775,27 @@ void zswap_folio_swapin(struct folio *folio)
> >         }
> >  }
> >
> > +/*
> > + * This function should be called when a memcg is being offlined.
> > + *
> > + * Since the global shrinker shrink_worker() may hold a reference
> > + * of the memcg, we must check and release the reference in
> > + * zswap_next_shrink.
> > + *
> > + * shrink_worker() must handle the case where this function releases
> > + * the reference of memcg being shrunk.
> > + */
> >  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> >  {
> >         /* lock out zswap shrinker walking memcg tree */
> >         spin_lock(&zswap_shrink_lock);
> > -       if (zswap_next_shrink == memcg)
> > -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > +
> > +       if (READ_ONCE(zswap_next_shrink) == memcg) {
> > +               /* put back reference and advance the cursor */
> > +               memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > +               WRITE_ONCE(zswap_next_shrink, memcg);
> > +       }
>
> Hmm could you expand on how your version differs from what's existing?
> What's the point of all these extra steps?
> The whole thing is done
> under a big spin lock anyway.

I agree that the extra code is not strictly necessary. The ONCE
annotations are there to document what is expected and to align with
shrink_worker(), where READ_ONCE is required to reload the shared
variable. They also serve as a safeguard for me, since I sometimes
forget that we must not read zswap_next_shrink before acquiring the
lock.

> > +
> >         spin_unlock(&zswap_shrink_lock);
> >  }
> >
> > @@ -1312,25 +1327,38 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> >
> >  static void shrink_worker(struct work_struct *w)
> >  {
> > -       struct mem_cgroup *memcg;
> > +       struct mem_cgroup *memcg = NULL;
> > +       struct mem_cgroup *next_memcg;
> >         int ret, failures = 0;
> >         unsigned long thr;
> >
> >         /* Reclaim down to the accept threshold */
> >         thr = zswap_accept_thr_pages();
> >
> > -       /* global reclaim will select cgroup in a round-robin fashion. */
> > +       /* global reclaim will select cgroup in a round-robin fashion.
> > +        *
> > +        * We save iteration cursor memcg into zswap_next_shrink,
> > +        * which can be modified by the offline memcg cleaner
> > +        * zswap_memcg_offline_cleanup().
> > +        */
>
> I feel like the only difference between this loop and the old loop, is
> that if we fail to get an online reference to memcg, we're trying from
> the next one (given by mem_cgroup_iter()) rather than from the top
> (NULL).
>
> For instance, consider the first two steps:
>
> 1. First, we check if memcg is the same as zswap_next_shrink. if not,
> then reset it to zswap_next_shrink.
> 2. Advance memcg, then store the result at zswap_next_shrink.
>
> Under the big zswap_shrink_lock, this is the same as:
>
> 1. Advance zswap_next_shrink.
> 2. Look up zswap_next_shrink, then store it under the local memcg variable.
>
> which is what we were previously doing.
>
> The next step is shared - check for null memcg (which again, is the
> same as zswap_next_shrink now), and attempt to get an online
> reference.
> The only difference is when we fail to get the online reference -
> instead of starting from the top, we advance memcg again.
>
> Can't we just drop the lock, then add a continue statement? That will
> reacquire the lock, advance zswap_next_shrink, look up the result and
> store it at memcg, which is what you're trying to achieve?
>

If I understand correctly, in this offlining pattern we are not
allowed to leave an offline memcg in zswap_next_shrink more than once.
While an offline memcg can appear in memcg_iter_next repeatedly, the
cleaner is called only once per offline memcg (or is there some retry
logic?).

If shrink_worker() finds an offline memcg in the iteration AFTER the
cleaner was called, we must put back the offline memcg reference
inside shrink_worker() BEFORE going to return/sleep. Otherwise, the
offline memcg reference is kept in zswap_next_shrink until the next
shrink_worker() is requeued, and there is no rescue chance from the
cleaner again.

This means shrink_worker() has to check:

1. We take the lock. Check whether the memcg is online while locked.
2a. If it is online, it can still be offlined while we hold the lock,
    but the lock assures us that the cleaner is waiting for the lock
    just behind us. We can rely on this.
2b. If it is offline, we cannot determine whether the cleaner has
    already run or is about to run. We have to put back the offline
    memcg reference ourselves, assuming no cleaner is available.

If we get an offline memcg AND abort the shrinker because of the max
failure limit, that breaks condition 2b. Thus we need to
unconditionally restart from the beginning of the do block. I will
add these expected situations to the comments.

For unlocking, should we rewrite

	goto iternext;

to

	spin_unlock();
	goto before_spin_lock; /* just after do { */

I think that would minimize the critical section and satisfy
condition 2b.

For the memcg iteration,

> 2. Advance memcg, then store the result at zswap_next_shrink.

should be done only if `(memcg == zswap_next_shrink)`.
Say we are iterating A -> B -> C, and A is being offlined. While we
hold memcg = A with the lock dropped, the cleaner may advance
zswap_next_shrink from A to B. If we do not check
`memcg != next_memcg`, we advance zswap_next_shrink from B to C,
forgetting B. That is the reason I added the `memcg != next_memcg`
check, although the effect can be negligible since it only skips one
memcg per cleaner callback.

This is my understanding. Am I missing any crucial points? Any
comments or advice would be greatly appreciated.

> >         do {
> >                 spin_lock(&zswap_shrink_lock);
> > -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > -               memcg = zswap_next_shrink;
> > +               next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > +               if (memcg != next_memcg) {
> > +                       /*
> > +                        * Ours was released by offlining.
> > +                        * Use the saved memcg reference.
> > +                        */
> > +                       memcg = next_memcg;
> > +               } else {
> > +iternext:
> > +                       /* advance cursor */
> > +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > +                       WRITE_ONCE(zswap_next_shrink, memcg);
> > +               }
> >
> >                 /*
> > -                * We need to retry if we have gone through a full round trip, or if we
> > -                * got an offline memcg (or else we risk undoing the effect of the
> > -                * zswap memcg offlining cleanup callback). This is not catastrophic
> > -                * per se, but it will keep the now offlined memcg hostage for a while.
> > -                *
>
> Why are we removing this comment?

The existing comment about aborting the loop on the offlining memcg
is no longer valid with this patch; the offline memcg is just skipped.
I think the new behavior is described in the comment at the beginning
of the loop and in the !mem_cgroup_tryget_online branch. Please let
me know if you have any suggestions.
> >                  * Note that if we got an online memcg, we will keep the extra
> >                  * reference in case the original reference obtained by mem_cgroup_iter
> >                  * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1345,16 +1373,18 @@ static void shrink_worker(struct work_struct *w)
> >                 }
> >
> >                 if (!mem_cgroup_tryget_online(memcg)) {
> > -                       /* drop the reference from mem_cgroup_iter() */
> > -                       mem_cgroup_iter_break(NULL, memcg);
> > -                       zswap_next_shrink = NULL;
> > -                       spin_unlock(&zswap_shrink_lock);
> > -
> > -                       if (++failures == MAX_RECLAIM_RETRIES)
> > -                               break;
> > -
> > -                       goto resched;
>
> I think we should still increment the failure counter, to guard
> against long running/infinite loops.

Since we skip the offline memcg instead of restarting from the root,
the new iteration terminates when it reaches the tree root in a
finite number of steps. I am afraid that counting this as a failure
would terminate the shrinker too easily.

I think shrink_worker() running longer is not an issue because the
amount of work per loop iteration is limited by rescheduling. As it
has a dedicated WQ_MEM_RECLAIM workqueue, returning from the function
is not necessary; cond_resched() should allow the other workqueues to
run. The next patch also adds a progress check per walk.

> > +                       /*
> > +                        * It is an offline memcg which we cannot shrink
> > +                        * until its pages are reparented.
> > +                        * Put back the memcg reference before cleanup
> > +                        * function reads it from zswap_next_shrink.
> > +                        */
> > +                       goto iternext;
> >                 }
> > +               /*
> > +                * We got an extra memcg reference before unlocking.
> > +                * The cleaner cannot free it using zswap_next_shrink.
> > +                */
> >                 spin_unlock(&zswap_shrink_lock);
> >
> >                 ret = shrink_memcg(memcg);
> > @@ -1368,6 +1398,12 @@ static void shrink_worker(struct work_struct *w)
> >  resched:
> >                 cond_resched();
> >         } while (zswap_total_pages() > thr);
> > +
> > +       /*
> > +        * We can still hold the original memcg reference.
> > +        * The reference is stored in zswap_next_shrink, and then reused
> > +        * by the next shrink_worker().
> > +        */
> >  }
> >
> >  /*********************************
> > --
> > 2.43.0
> >

--