From: Shakeel Butt
To: Joanne Koong
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, josef@toxicpanda.com, bernd.schubert@fastmail.fm, jefflexu@linux.alibaba.com, hannes@cmpxchg.org, linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
Date: Sat, 12 Oct 2024 21:54:50 -0700
References: <20241011223434.1307300-1-joannelkoong@gmail.com> <20241011223434.1307300-2-joannelkoong@gmail.com>
In-Reply-To: <20241011223434.1307300-2-joannelkoong@gmail.com>

On Fri, Oct 11, 2024 at 03:34:33PM GMT, Joanne Koong wrote:
> Currently in shrink_folio_list(), reclaim for folios under writeback
> falls into 3 different cases:
> 1) Reclaim is encountering an excessive number of folios under
>    writeback and this folio has both the writeback and reclaim flags
>    set
> 2) Dirty throttling is enabled (this happens if reclaim through cgroup
>    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
>    if reclaim is on the root cgroup), or if the folio is not marked for
>    immediate reclaim, or if the caller does not have __GFP_FS (or
>    __GFP_IO if it's going to swap) set
> 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
>    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
>
> In cases 1) and 2), we activate the folio and skip reclaiming it while
> in case 3), we wait for writeback to finish on the folio and then try
> to reclaim the folio again. In case 3, we wait on writeback because
> cgroupv1 does not have dirty folio throttling, as such this is a
> mitigation against the case where there are too many folios in writeback
> with nothing else to reclaim.
>
> The issue is that for filesystems where writeback may block, sub-optimal
> workarounds need to be put in place to avoid potential deadlocks that may
> arise from the case where reclaim waits on writeback. (Even though case
> 3 above is rare given that legacy cgroupv1 is on its way to being
> deprecated, this case still needs to be accounted for)
>
> For example, for FUSE filesystems, when a writeback is triggered on a
> folio, a temporary folio is allocated and the pages are copied over to
> this temporary folio so that writeback can be immediately cleared on the
> original folio.
> This additionally requires an internal rb tree to keep
> track of writeback state on the temporary folios. Benchmarks show
> roughly a ~20% decrease in throughput from the overhead incurred with 4k
> block size writes. The temporary folio is needed here in order to avoid
> the following deadlock if reclaim waits on writeback:
> * single-threaded FUSE server is in the middle of handling a request that
>   needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback (eg falls into case 3
>   above) that needs to be written back to the fuse server
> * the FUSE server can't write back the folio since it's stuck in direct
>   reclaim
>
> This commit allows filesystems to set a ASOP_NO_RECLAIM_IN_WRITEBACK
> flag in the address_space_operations struct to signify that reclaim
> should not happen when the folio is already in writeback. This only has
> effects on the case where cgroupv1 memcg encounters a folio under
> writeback that already has the reclaim flag set (eg case 3 above), and
> allows for the suboptimal workarounds added to address the "reclaim wait
> on writeback" deadlock scenario to be removed.
>
> Signed-off-by: Joanne Koong
> ---
>  include/linux/fs.h | 14 ++++++++++++++
>  mm/vmscan.c        |  6 ++++--
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e3c603d01337..808164e3dd84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -394,7 +394,10 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
>  	return kiocb->ki_complete == NULL;
>  }
>
> +typedef unsigned int __bitwise asop_flags_t;
> +
>  struct address_space_operations {
> +	asop_flags_t asop_flags;
>  	int (*writepage)(struct page *page, struct writeback_control *wbc);
>  	int (*read_folio)(struct file *, struct folio *);
>
> @@ -438,6 +441,12 @@ struct address_space_operations {
>  	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
>  };
>
> +/**
> + * This flag is only to be used by filesystems whose folios cannot be
> + * reclaimed when in writeback (eg fuse)
> + */
> +#define ASOP_NO_RECLAIM_IN_WRITEBACK ((__force asop_flags_t)(1 << 0))
> +
>  extern const struct address_space_operations empty_aops;
>
>  /**
> @@ -586,6 +595,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
>  	atomic_inc(&mapping->i_mmap_writable);
>  }
>
> +static inline bool mapping_no_reclaim_in_writeback(struct address_space *mapping)
> +{
> +	return mapping->a_ops->asop_flags & ASOP_NO_RECLAIM_IN_WRITEBACK;

Any reason not to add this flag in enum mapping_flags and use
mapping->flags field instead of adding a field in struct
address_space_operations?

> +}
> +
>  /*
>   * Use sequence counter to get consistent i_size on 32-bit processors.
>   */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..2beffbdae572 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (writeback && folio_test_reclaim(folio))
>  			stat->nr_congested += nr_pages;
>
> +		mapping = folio_mapping(folio);
> +
>  		/*
>  		 * If a folio at the tail of the LRU is under writeback, there
>  		 * are three cases to consider.
> @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  			/* Case 2 above */
>  			} else if (writeback_throttling_sane(sc) ||
>  			    !folio_test_reclaim(folio) ||
> -			    !may_enter_fs(folio, sc->gfp_mask)) {
> +			    !may_enter_fs(folio, sc->gfp_mask) ||
> +			    (mapping && mapping_no_reclaim_in_writeback(mapping))) {
>  				/*
>  				 * This is slightly racy -
>  				 * folio_end_writeback() might have
> @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (folio_maybe_dma_pinned(folio))
>  			goto activate_locked;
>
> -		mapping = folio_mapping(folio);
>  		if (folio_test_dirty(folio)) {
>  			/*
>  			 * Only kswapd can writeback filesystem folios
> --
> 2.43.5
>
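To make the mapping->flags question above concrete, here is a rough
userspace sketch of what a per-mapping flag might look like. It is not
kernel code and not a proposed patch. The bit name
AS_NO_RECLAIM_IN_WRITEBACK, its value, the setter, and
skip_wait_on_writeback() are all hypothetical; the enum is a trimmed
stand-in for enum mapping_flags; and plain bit operations stand in for
the kernel's set_bit()/test_bit().

```c
/* Userspace mock, not kernel code: models mapping->flags as a bitmask. */
#include <stdbool.h>
#include <stddef.h>

/* Trimmed stand-in for enum mapping_flags; bit values are illustrative. */
enum mapping_flags {
	AS_EIO = 0,
	AS_ENOSPC = 1,
	AS_UNEVICTABLE = 3,
	AS_NO_RECLAIM_IN_WRITEBACK = 8,	/* hypothetical new bit */
};

/* Minimal stand-in for struct address_space: only the flags word. */
struct address_space {
	unsigned long flags;
};

/* A filesystem such as fuse would call this once when setting up the mapping. */
static inline void mapping_set_no_reclaim_in_writeback(struct address_space *mapping)
{
	mapping->flags |= 1UL << AS_NO_RECLAIM_IN_WRITEBACK;
}

/* Same name as the helper in the patch, but reading mapping->flags instead. */
static inline bool mapping_no_reclaim_in_writeback(const struct address_space *mapping)
{
	return mapping->flags & (1UL << AS_NO_RECLAIM_IN_WRITEBACK);
}

/* Mirrors the extra condition the patch adds to "Case 2" in shrink_folio_list(). */
static inline bool skip_wait_on_writeback(const struct address_space *mapping)
{
	return mapping && mapping_no_reclaim_in_writeback(mapping);
}
```

With this shape the check in shrink_folio_list() would stay the same as
in the patch; only where the bit is stored changes, from the shared
a_ops table to the individual mapping.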