From: Joanne Koong <joannelkoong@gmail.com>
Date: Tue, 15 Oct 2024 09:59:28 -0700
Subject: Re: [PATCH v2 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
To: Shakeel Butt
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, josef@toxicpanda.com, bernd.schubert@fastmail.fm, jefflexu@linux.alibaba.com, hannes@cmpxchg.org, linux-mm@kvack.org, kernel-team@meta.com

On Mon, Oct 14, 2024 at 4:57 PM Shakeel Butt wrote:
>
> On Mon, Oct 14, 2024 at 02:04:07PM GMT, Joanne Koong wrote:
> > On Mon, Oct 14, 2024 at 11:38 AM Shakeel Butt wrote:
> > >
> > > On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote:
> > > > Currently in shrink_folio_list(), reclaim for folios under writeback
> > > > falls into 3 different cases:
> > > > 1) Reclaim is encountering an excessive number of folios under
> > > >    writeback and this folio has both the writeback and reclaim flags
> > > >    set
> > > > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> > > >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> > > >    if reclaim is on the root cgroup), or if the folio is not marked for
> > > >    immediate reclaim, or if the caller does not have __GFP_FS (or
> > > >    __GFP_IO if it's going to swap) set
> > > > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> > > >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> > > >
> > > > In cases 1) and 2), we activate the folio and skip reclaiming it, while
> > > > in case 3) we wait for writeback to finish on the folio and then try
> > > > to reclaim the folio again. In case 3 we wait on writeback because
> > > > cgroupv1 does not have dirty folio throttling; as such, this is a
> > > > mitigation against the case where there are too many folios under
> > > > writeback with nothing else to reclaim.
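
For readers following along in the source, the three cases described above correspond to the writeback-handling block near the top of shrink_folio_list() in mm/vmscan.c. A lightly abridged sketch of that block (paraphrased from recent kernels; exact details vary by kernel version):

	if (folio_test_writeback(folio)) {
		/* Case 1 above: kswapd sees too many folios under
		 * writeback, and this one is already marked for reclaim */
		if (current_is_kswapd() &&
		    folio_test_reclaim(folio) &&
		    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
			stat->nr_immediate += nr_pages;
			goto activate_locked;

		/* Case 2 above: dirty throttling is sane, or the folio is
		 * not marked for immediate reclaim, or the caller lacks
		 * __GFP_FS / __GFP_IO */
		} else if (writeback_throttling_sane(sc) ||
			   !folio_test_reclaim(folio) ||
			   !may_enter_fs(folio, sc->gfp_mask)) {
			folio_set_reclaim(folio);
			stat->nr_writeback += nr_pages;
			goto activate_locked;

		/* Case 3 above: legacy cgroup v1; wait for writeback,
		 * then retry the same folio on the next pass */
		} else {
			folio_unlock(folio);
			folio_wait_writeback(folio);
			list_add_tail(&folio->lru, folio_list);
			continue;
		}
	}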
> > > >
> > > > The issue is that for filesystems where writeback may block, sub-optimal
> > > > workarounds need to be put in place to avoid potential deadlocks that may
> > > > arise from the case where reclaim waits on writeback. (Even though case
> > > > 3 above is rare given that legacy cgroupv1 is on its way to being
> > > > deprecated, this case still needs to be accounted for.)
> > > >
> > > > For example, for FUSE filesystems, when writeback is triggered on a
> > > > folio, a temporary folio is allocated and the pages are copied over to
> > > > this temporary folio so that writeback can be immediately cleared on the
> > > > original folio. This additionally requires an internal rb tree to keep
> > > > track of writeback state on the temporary folios. Benchmarks show
> > > > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > > > block size writes. The temporary folio is needed here in order to avoid
> > > > the following deadlock if reclaim waits on writeback:
> > > > * a single-threaded FUSE server is in the middle of handling a request
> > > >   that needs a memory allocation
> > > > * the memory allocation triggers direct reclaim
> > > > * direct reclaim waits on a folio under writeback (e.g. falls into case 3
> > > >   above) that needs to be written back to the FUSE server
> > > > * the FUSE server can't write back the folio since it's stuck in direct
> > > >   reclaim
> > > >
> > > > This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum
> > > > mapping_flags", which filesystems can set to signify that reclaim
> > > > should not happen when a folio is already under writeback. This only has
> > > > an effect on the case where cgroupv1 memcg encounters a folio under
> > > > writeback that already has the reclaim flag set (e.g. case 3 above), and
> > > > it allows the suboptimal workarounds added to address the "reclaim waits
> > > > on writeback" deadlock scenario to be removed.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > >  include/linux/pagemap.h | 11 +++++++++++
> > > >  mm/vmscan.c             |  6 ++++--
> > > >  2 files changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > index 68a5f1ff3301..513a72b8451b 100644
> > > > --- a/include/linux/pagemap.h
> > > > +++ b/include/linux/pagemap.h
> > > > @@ -210,6 +210,7 @@ enum mapping_flags {
> > > >  	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
> > > >  				   folio contents */
> > > >  	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> > > > +	AS_NO_WRITEBACK_RECLAIM = 9,	/* Do not reclaim folios under writeback */
> > >
> > > Isn't it "Do not wait for writeback completion for folios of this
> > > mapping during reclaim"?
> >
> > I think if we make this "don't wait for writeback completion for
> > folios of this mapping during reclaim", then the
> > mapping_no_writeback_reclaim check in shrink_folio_list() below would
> > need to be something like this instead:
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 885d496ae652..37108d633d21 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1190,7 +1190,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                         /* Case 3 above */
> >                         } else {
> >                                 folio_unlock(folio);
> > -                               folio_wait_writeback(folio);
> > +                               if (mapping && !mapping_no_writeback_reclaim(mapping))
> > +                                       folio_wait_writeback(folio);
> >                                 /* then go back and try same folio again */
> >                                 list_add_tail(&folio->lru, folio_list);
> >                                 continue;
>
> The difference between the outcomes for Case 2 and Case 3 is that in Case
> 2 the kernel puts the folio on an active list, and thus will not try to
> reclaim it in the near future, but in Case 3 the kernel puts it back on
> the list from which it is currently reclaiming, meaning the next
> iteration will try to reclaim the same folio.
>
> We definitely don't want it in Case 3.
>
> > which I'm not sure would be the correct logic here or not.
> > I'm not too familiar with vmscan, but it seems like if we are going to
> > reclaim the folio then we should wait on it, or else we would just keep
> > trying the same folio again and again, wasting cpu cycles. In this
> > current patch (if I'm understanding this mm code correctly), we skip
> > reclaiming the folio altogether if it's under writeback.
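
A side note for readers: the pagemap.h hunk above shows only the enum addition, while the discussion references a mapping_no_writeback_reclaim() check. Presumably the full patch adds accessors following the usual mapping-flag helper convention in include/linux/pagemap.h; a sketch under that assumption (the setter name here is a guess, not taken from the patch):

	/* Hypothetical accessors, modeled on existing helpers such as
	 * mapping_set_unevictable() / mapping_unevictable(): */
	static inline void mapping_set_no_writeback_reclaim(struct address_space *mapping)
	{
		set_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
	}

	static inline bool mapping_no_writeback_reclaim(struct address_space *mapping)
	{
		return test_bit(AS_NO_WRITEBACK_RECLAIM, &mapping->flags);
	}

	/* An opted-in filesystem (e.g. FUSE) would then call, at inode
	 * mapping setup time:
	 *
	 *	mapping_set_no_writeback_reclaim(inode->i_mapping);
	 */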
> >
> > Either one (don't wait for writeback during reclaim, or don't reclaim
> > under writeback) works for mitigating the potential FUSE deadlock,
> > but I was thinking "don't reclaim under writeback" might also be more
> > generalizable to other filesystems.
> >
> > I'm happy to go with whichever you think would be best.
>
> Just to be clear, we are on the same page that this scenario should
> be handled in Case 2. Our difference is in how to describe the scenario.
> To me, the reason we are taking the path of Case 2 is that we don't
> want what Case 3 is doing, and thus I wrote that. Anyways, I don't think
> it is that important; use whatever wording seems reasonable to you.

Gotcha, thanks for clarifying. Your point makes sense to me - if we go
this route, we should probably also change the name to
AS_NO_RECLAIM_WAIT_WRITEBACK or something like that to make it more
congruent. For now, I'll keep it as AS_NO_WRITEBACK_RECLAIM because I
think that might be a more generalizable use case for other
filesystems too.

>
> BTW you will need to update the comment for Case 2, which is above the
> code block.

Great point, I will do this in v3.

Thanks,
Joanne

>
> thanks,
> Shakeel
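
For context on what "skip reclaiming the folio altogether" means above: going by the patch description and the diffstat (mm/vmscan.c | 6 ++++--), the v2 change to Case 3 presumably activates and skips the folio, as Case 2 does, whenever the mapping opts in. A reconstruction for illustration only, not the actual hunk:

			/* Case 3 above */
			} else if (mapping && mapping_no_writeback_reclaim(mapping)) {
				/* the filesystem asked reclaim not to touch
				 * folios under writeback: activate and skip,
				 * as in Case 2, instead of waiting */
				folio_set_reclaim(folio);
				stat->nr_writeback += nr_pages;
				goto activate_locked;
			} else {
				folio_unlock(folio);
				folio_wait_writeback(folio);
				/* then go back and try same folio again */
				list_add_tail(&folio->lru, folio_list);
				continue;
			}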