From: Joanne Koong <joannelkoong@gmail.com>
Date: Tue, 15 Oct 2024 09:59:28 -0700
Subject: Re: [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
To: Shakeel Butt
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, josef@toxicpanda.com, bernd.schubert@fastmail.fm, jefflexu@linux.alibaba.com, hannes@cmpxchg.org, linux-mm@kvack.org, kernel-team@meta.com

On Mon, Oct 14, 2024 at 4:57 PM Shakeel Butt wrote:
>
> On Mon, Oct 14, 2024 at 02:04:07PM GMT, Joanne Koong wrote:
> > On Mon, Oct 14, 2024 at 11:38 AM Shakeel Butt wrote:
> > >
> > > On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote:
> > > > Currently in shrink_folio_list(), reclaim for folios under writeback
> > > > falls into 3 different cases:
> > > > 1) Reclaim is encountering an excessive number of folios under
> > > >    writeback and this folio has both the writeback and reclaim flags
> > > >    set
> > > > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> > > >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> > > >    if reclaim is on the root cgroup), or if the folio is not marked for
> > > >    immediate reclaim, or if the caller does not have __GFP_FS (or
> > > >    __GFP_IO if it's going to swap) set
> > > > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> > > >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> > > >
> > > > In cases 1) and 2), we activate the folio and skip reclaiming it, while
> > > > in case 3) we wait for writeback to finish on the folio and then try
> > > > to reclaim the folio again. In case 3 we wait on writeback because
> > > > cgroupv1 does not have dirty folio throttling; as such, this is a
> > > > mitigation against the case where there are too many folios under
> > > > writeback with nothing else to reclaim.
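
For readers following along in the source, the three cases described above correspond to the writeback-handling block near the top of shrink_folio_list() in mm/vmscan.c. A lightly abridged sketch of that block (paraphrased from recent kernels; exact details vary by kernel version):

	if (folio_test_writeback(folio)) {
		/* Case 1 above: kswapd sees too many folios under
		 * writeback, and this one is already marked for reclaim */
		if (current_is_kswapd() &&
		    folio_test_reclaim(folio) &&
		    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
			stat->nr_immediate += nr_pages;
			goto activate_locked;

		/* Case 2 above: dirty throttling is sane, or the folio is
		 * not marked for immediate reclaim, or the caller lacks
		 * __GFP_FS / __GFP_IO */
		} else if (writeback_throttling_sane(sc) ||
			   !folio_test_reclaim(folio) ||
			   !may_enter_fs(folio, sc->gfp_mask)) {
			folio_set_reclaim(folio);
			stat->nr_writeback += nr_pages;
			goto activate_locked;

		/* Case 3 above: legacy cgroup v1; wait for writeback,
		 * then retry the same folio on the next pass */
		} else {
			folio_unlock(folio);
			folio_wait_writeback(folio);
			list_add_tail(&folio->lru, folio_list);
			continue;
		}
	}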
> > > >
> > > > The issue is that for filesystems where writeback may block, sub-optimal
> > > > workarounds need to be put in place to avoid potential deadlocks that may
> > > > arise from the case where reclaim waits on writeback. (Even though case
> > > > 3 above is rare given that legacy cgroupv1 is on its way to being
> > > > deprecated, this case still needs to be accounted for.)
> > > >
> > > > For example, for FUSE filesystems, when writeback is triggered on a
> > > > folio, a temporary folio is allocated and the pages are copied over to
> > > > this temporary folio so that writeback can be immediately cleared on the
> > > > original folio. This additionally requires an internal rb tree to keep
> > > > track of writeback state on the temporary folios. Benchmarks show
> > > > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > > > block size writes. The temporary folio is needed here in order to avoid
> > > > the following deadlock if reclaim waits on writeback:
> > > > * a single-threaded FUSE server is in the middle of handling a request
> > > >   that needs a memory allocation
> > > > * the memory allocation triggers direct reclaim
> > > > * direct reclaim waits on a folio under writeback (e.g. falls into case 3
> > > >   above) that needs to be written back to the FUSE server
> > > > * the FUSE server can't write back the folio since it's stuck in direct
> > > >   reclaim
> > > >
> > > > This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
> > > > mapping_flags", which filesystems can set to signify that reclaim
> > > > should not happen when a folio is already under writeback. This only has
> > > > an effect on the case where cgroupv1 memcg encounters a folio under
> > > > writeback that already has the reclaim flag set (e.g. case 3 above), and
> > > > it allows the suboptimal workarounds added to address the "reclaim waits
> > > > on writeback" deadlock scenario to be removed.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > >  include/linux/pagemap.h | 11 +++++++++++
> > > >  mm/vmscan.c             |  6 ++++--
> > > >  2 files changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > index 68a5f1ff3301..513a72b8451b 100644
> > > > --- a/include/linux/pagemap.h
> > > > +++ b/include/linux/pagemap.h
> > > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > >  	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
> > > >  				   folio contents */
> > > >  	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> > > > +	AS_NO_WRITEBACK_RECLAIM = 9,	/* Do not reclaim folios under writeback */
> > >
> > > Isn't it "Do not wait for writeback completion for folios of this
> > > mapping during reclaim"?
> >
> > I think if we make this "don't wait for writeback completion for
> > folios of this mapping during reclaim", then the
> > mapping_no_writeback_reclaim check in shrink_folio_list() below would
> > need to be something like this instead:
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 885d496ae652..37108d633d21 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1190,7 +1190,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                         /* Case 3 above */
> >                         } else {
> >                                 folio_unlock(folio);
> > -                               folio_wait_writeback(folio);
> > +                               if (mapping && !mapping_no_writeback_reclaim(mapping))
> > +                                       folio_wait_writeback(folio);
> >                                 /* then go back and try same folio again */
> >                                 list_add_tail(&folio->lru, folio_list);
> >                                 continue;
>
> The difference between the outcomes for Case 2 and Case 3 is that in Case
> 2 the kernel puts the folio on an active list, and thus will not try to
> reclaim it in the near future, but in Case 3 the kernel puts it back on
> the list from which it is currently reclaiming, meaning the next
> iteration will try to reclaim the same folio.
>
> We definitely don't want it in Case 3.
>
> > which I'm not sure would be the correct logic here or not.
> > I'm not too familiar with vmscan, but it seems like if we are going to
> > reclaim the folio then we should wait on it, or else we would just keep
> > trying the same folio again and again, wasting cpu cycles. In this
> > current patch (if I'm understanding this mm code correctly), we skip
> > reclaiming the folio altogether if it's under writeback.
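
A side note for readers: the pagemap.h hunk above shows only the enum addition, while the discussion references a mapping_no_writeback_reclaim() check. Presumably the full patch adds accessors following the usual mapping-flag helper convention in include/linux/pagemap.h; a sketch under that assumption (the setter name here is a guess, not taken from the patch):

	/* Hypothetical accessors, modeled on existing helpers such as
	 * mapping_set_unevictable() / mapping_unevictable(): */
	static inline void mapping_set_no_writeback_reclaim(struct address_space *mapping)
	{
		set_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
	}

	static inline bool mapping_no_writeback_reclaim(struct address_space *mapping)
	{
		return test_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
	}

	/* An opted-in filesystem (e.g. FUSE) would then call, at inode
	 * mapping setup time:
	 *
	 *	mapping_set_no_writeback_reclaim(inode->i_mapping);
	 */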
> >
> > Either one (don't wait for writeback during reclaim, or don't reclaim
> > under writeback) works for mitigating the potential FUSE deadlock,
> > but I was thinking "don't reclaim under writeback" might also be more
> > generalizable to other filesystems.
> >
> > I'm happy to go with whichever you think would be best.
>
> Just to be clear, we are on the same page that this scenario should
> be handled in Case 2. Our difference is in how to describe the scenario.
> To me, the reason we are taking the path of Case 2 is that we don't
> want what Case 3 is doing, and thus I wrote that. Anyways, I don't think
> it is that important; use whatever wording seems reasonable to you.

Gotcha, thanks for clarifying. Your point makes sense to me - if we go
this route, we should probably also change the name to
AS_NO_RECLAIM_WAIT_WRITEBACK or something like that to make it more
congruent. For now, I'll keep it as AS_NO_WRITEBACK_RECLAIM because I
think that might be a more generalizable use case for other
filesystems too.

>
> BTW you will need to update the comment for Case 2, which is above the
> code block.

Great point, I will do this in v3.

Thanks,
Joanne

>
> thanks,
> Shakeel
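
For context on what "skip reclaiming the folio altogether" means above: going by the patch description and the diffstat (mm/vmscan.c | 6 ++++--), the v2 change to Case 3 presumably activates and skips the folio, as Case 2 does, whenever the mapping opts in. A reconstruction for illustration only, not the actual hunk:

			/* Case 3 above */
			} else if (mapping && mapping_no_writeback_reclaim(mapping)) {
				/* the filesystem asked reclaim not to touch
				 * folios under writeback: activate and skip,
				 * as in Case 2, instead of waiting */
				folio_set_reclaim(folio);
				stat->nr_writeback += nr_pages;
				goto activate_locked;
			} else {
				folio_unlock(folio);
				folio_wait_writeback(folio);
				/* then go back and try same folio again */
				list_add_tail(&folio->lru, folio_list);
				continue;
			}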