From: Joanne Koong
Date: Mon, 14 Oct 2024 10:18:05 -0700
Subject: Re: [PATCH 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
To: Shakeel Butt
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, josef@toxicpanda.com, bernd.schubert@fastmail.fm, jefflexu@linux.alibaba.com, hannes@cmpxchg.org, linux-mm@kvack.org, kernel-team@meta.com
References: <20241011223434.1307300-1-joannelkoong@gmail.com> <20241011223434.1307300-2-joannelkoong@gmail.com>

On Sat, Oct 12, 2024 at 9:55 PM Shakeel Butt wrote:
>
> On Fri, Oct 11, 2024 at 03:34:33PM GMT, Joanne Koong wrote:
> > Currently in shrink_folio_list(), reclaim for folios under writeback
> > falls into 3 different cases:
> > 1) Reclaim is encountering an excessive number of folios under
> >    writeback and this folio has both the writeback and reclaim flags
> >    set
> > 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> >    is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> >    if reclaim is on the root cgroup), or if the folio is not marked for
> >    immediate reclaim, or if the caller does not have __GFP_FS (or
> >    __GFP_IO if it's going to swap) set
> > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> >    set and the caller did not have __GFP_FS (or __GFP_IO if swap) set
> >
> > In cases 1) and 2), we activate the folio and skip reclaiming it while
> > in case 3), we wait for writeback to finish on the folio and then try
> > to reclaim the folio again. In case 3, we wait on writeback because
> > cgroupv1 does not have dirty folio throttling, as such this is a
> > mitigation against the case where there are too many folios in writeback
> > with nothing else to reclaim.
> >
> > The issue is that for filesystems where writeback may block, sub-optimal
> > workarounds need to be put in place to avoid potential deadlocks that may
> > arise from the case where reclaim waits on writeback. (Even though case
> > 3 above is rare given that legacy cgroupv1 is on its way to being
> > deprecated, this case still needs to be accounted for)
> >
> > For example, for FUSE filesystems, when a writeback is triggered on a
> > folio, a temporary folio is allocated and the pages are copied over to
> > this temporary folio so that writeback can be immediately cleared on the
> > original folio. This additionally requires an internal rb tree to keep
> > track of writeback state on the temporary folios. Benchmarks show
> > roughly a ~20% decrease in throughput from the overhead incurred with 4k
> > block size writes.
> > The temporary folio is needed here in order to avoid
> > the following deadlock if reclaim waits on writeback:
> > * single-threaded FUSE server is in the middle of handling a request that
> >   needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback (eg falls into case 3
> >   above) that needs to be written back to the fuse server
> > * the FUSE server can't write back the folio since it's stuck in direct
> >   reclaim
> >
> > This commit allows filesystems to set a ASOP_NO_RECLAIM_IN_WRITEBACK
> > flag in the address_space_operations struct to signify that reclaim
> > should not happen when the folio is already in writeback. This only has
> > effects on the case where cgroupv1 memcg encounters a folio under
> > writeback that already has the reclaim flag set (eg case 3 above), and
> > allows for the suboptimal workarounds added to address the "reclaim wait
> > on writeback" deadlock scenario to be removed.
> >
> > Signed-off-by: Joanne Koong
> > ---
> >  include/linux/fs.h | 14 ++++++++++++++
> >  mm/vmscan.c        |  6 ++++--
> >  2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e3c603d01337..808164e3dd84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -394,7 +394,10 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
> >         return kiocb->ki_complete == NULL;
> >  }
> >
> > +typedef unsigned int __bitwise asop_flags_t;
> > +
> >  struct address_space_operations {
> > +       asop_flags_t asop_flags;
> >         int (*writepage)(struct page *page, struct writeback_control *wbc);
> >         int (*read_folio)(struct file *, struct folio *);
> >
> > @@ -438,6 +441,12 @@ struct address_space_operations {
> >         int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
> >  };
> >
> > +/**
> > + * This flag is only to be used by filesystems whose folios cannot be
> > + * reclaimed when in writeback (eg fuse)
> > + */
> > +#define ASOP_NO_RECLAIM_IN_WRITEBACK ((__force asop_flags_t)(1 << 0))
> > +
> >  extern const struct address_space_operations empty_aops;
> >
> >  /**
> > @@ -586,6 +595,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
> >         atomic_inc(&mapping->i_mmap_writable);
> >  }
> >
> > +static inline bool mapping_no_reclaim_in_writeback(struct address_space *mapping)
> > +{
> > +       return mapping->a_ops->asop_flags & ASOP_NO_RECLAIM_IN_WRITEBACK;
>
> Any reason not to add this flag in enum mapping_flags and use
> mapping->flags field instead of adding a field in struct
> address_space_operations?

No, thanks for the suggestion - I really like your idea of adding this
to enum mapping_flags instead as AS_NO_WRITEBACK_RECLAIM. I don't know
why I didn't see mapping_flags when I was looking at this. I'll make
this change for v2.

Thanks,
Joanne

>
> > +}
> > +
> >  /*
> >   * Use sequence counter to get consistent i_size on 32-bit processors.
> >   */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..2beffbdae572 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                 if (writeback && folio_test_reclaim(folio))
> >                         stat->nr_congested += nr_pages;
> >
> > +               mapping = folio_mapping(folio);
> > +
> >                 /*
> >                  * If a folio at the tail of the LRU is under writeback, there
> >                  * are three cases to consider.
> > @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                         /* Case 2 above */
> >                         } else if (writeback_throttling_sane(sc) ||
> >                             !folio_test_reclaim(folio) ||
> > -                           !may_enter_fs(folio, sc->gfp_mask)) {
> > +                           !may_enter_fs(folio, sc->gfp_mask) ||
> > +                           (mapping && mapping_no_reclaim_in_writeback(mapping))) {
> >                                 /*
> >                                  * This is slightly racy -
> >                                  * folio_end_writeback() might have
> > @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                 if (folio_maybe_dma_pinned(folio))
> >                         goto activate_locked;
> >
> > -               mapping = folio_mapping(folio);
> >                 if (folio_test_dirty(folio)) {
> >                         /*
> >                          * Only kswapd can writeback filesystem folios
> > --
> > 2.43.5
> >
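
For reference, the mapping_flags variant discussed above could be
modelled roughly as the user-space sketch below. This is not the v2
patch, only an illustration of the idea. Only the name
AS_NO_WRITEBACK_RECLAIM comes from this thread; its bit position, the
helpers mapping_set_no_writeback_reclaim() and
mapping_no_writeback_reclaim(), struct wb_ctx and
classify_writeback_folio() are all hypothetical, and the mock
struct address_space keeps only the flags word. The classifier mirrors
the three-way branch in shrink_folio_list() as quoted in the diff, with
the inputs reduced to plain booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space mock: only the field this sketch needs. */
struct address_space {
	unsigned long flags;	/* bit numbers taken from enum mapping_flags */
};

/* Hypothetical bit position; the real enum carries other AS_* bits. */
enum mapping_flags {
	AS_NO_WRITEBACK_RECLAIM = 0,
};

/* Hypothetical setter; kernel code would use an atomic set_bit(). */
static inline void mapping_set_no_writeback_reclaim(struct address_space *mapping)
{
	assert(mapping != NULL);
	mapping->flags |= 1UL << AS_NO_WRITEBACK_RECLAIM;
}

/* Hypothetical test helper, replacing the a_ops->asop_flags lookup. */
static inline bool mapping_no_writeback_reclaim(const struct address_space *mapping)
{
	return mapping->flags & (1UL << AS_NO_WRITEBACK_RECLAIM);
}

enum wb_action {
	WB_ACTIVATE_CASE1,	/* activate the folio, skip reclaim */
	WB_ACTIVATE_CASE2,	/* activate the folio, skip reclaim */
	WB_WAIT_CASE3,		/* wait on writeback, then retry reclaim */
};

/* State that reclaim derives from the folio and scan_control (assumed names). */
struct wb_ctx {
	bool excessive_writeback;	/* case 1 precondition from the commit message */
	bool folio_reclaim_flag;	/* folio_test_reclaim(folio) */
	bool throttling_sane;		/* writeback_throttling_sane(sc) */
	bool may_enter_fs;		/* may_enter_fs(folio, sc->gfp_mask) */
};

static enum wb_action classify_writeback_folio(const struct wb_ctx *c,
					       const struct address_space *mapping)
{
	if (c->excessive_writeback && c->folio_reclaim_flag)
		return WB_ACTIVATE_CASE1;

	/* The flag check is the only new condition: it diverts case 3 to case 2. */
	if (c->throttling_sane || !c->folio_reclaim_flag || !c->may_enter_fs ||
	    (mapping && mapping_no_writeback_reclaim(mapping)))
		return WB_ACTIVATE_CASE2;

	return WB_WAIT_CASE3;
}
```

With this model, a cgroupv1-style context (throttling not sane, reclaim
flag set, may_enter_fs true) waits on writeback for an ordinary mapping
but falls back to the activate-and-skip path once the mapping has the
flag set, which is the behaviour the patch is after for FUSE.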