From: Shakeel Butt <shakeel.butt@linux.dev>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
josef@toxicpanda.com, bernd.schubert@fastmail.fm,
jefflexu@linux.alibaba.com, hannes@cmpxchg.org,
linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH 1/2] mm: skip reclaiming folios in writeback contexts that may trigger deadlock
Date: Sat, 12 Oct 2024 21:54:50 -0700
Message-ID: <prx7opxb3zqofuejohnqikxqbau6mde3lqxkistcwqun2xzr36@rpxky5oltnvs>
In-Reply-To: <20241011223434.1307300-2-joannelkoong@gmail.com>

On Fri, Oct 11, 2024 at 03:34:33PM GMT, Joanne Koong wrote:
> Currently in shrink_folio_list(), reclaim for folios under writeback
> falls into 3 different cases:
> 1) Reclaim is encountering an excessive number of folios under
> writeback and this folio has both the writeback and reclaim flags
> set
> 2) Dirty throttling is enabled (this happens if reclaim through cgroup
> is not enabled, if reclaim through cgroupv2 memcg is enabled, or
> if reclaim is on the root cgroup), or if the folio is not marked for
> immediate reclaim, or if the caller does not have __GFP_FS (or
> __GFP_IO if it's going to swap) set
> 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
> set and the caller has __GFP_FS (or __GFP_IO if swap) set
>
> In cases 1) and 2), we activate the folio and skip reclaiming it while
> in case 3), we wait for writeback to finish on the folio and then try
> to reclaim the folio again. In case 3, we wait on writeback because
> cgroupv1 does not have dirty folio throttling, as such this is a
> mitigation against the case where there are too many folios in writeback
> with nothing else to reclaim.
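
To keep the three cases straight while reviewing, here is a minimal userspace sketch of the decision, following the condition visible in the mm/vmscan.c hunk further down, where the wait path is reached only when none of the case 2 conditions hold. The struct, the enum and the excessive_writeback field are hypothetical stand-ins and not kernel names; the real code reads this state from the folio, the scan_control and the node.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified inputs for one folio under writeback. */
struct wb_ctx {
	bool excessive_writeback; /* reclaim sees too many folios in writeback */
	bool folio_reclaim;       /* folio_test_reclaim(folio) */
	bool throttling_sane;     /* writeback_throttling_sane(sc) */
	bool may_enter_fs;        /* may_enter_fs(folio, sc->gfp_mask) */
};

enum wb_action {
	WB_SKIP_CASE1, /* activate the folio and skip it */
	WB_SKIP_CASE2, /* activate the folio and skip it */
	WB_WAIT_CASE3, /* wait for writeback, then retry reclaim */
};

static enum wb_action classify_writeback_folio(const struct wb_ctx *c)
{
	/* Case 1: excessive writeback and the folio is marked for reclaim. */
	if (c->excessive_writeback && c->folio_reclaim)
		return WB_SKIP_CASE1;

	/* Case 2: throttling works, or the folio is unmarked, or no FS entry. */
	if (c->throttling_sane || !c->folio_reclaim || !c->may_enter_fs)
		return WB_SKIP_CASE2;

	/* Case 3: legacy cgroupv1 path, the only one that blocks. */
	return WB_WAIT_CASE3;
}

/* Usage example: a cgroupv1-style reclaimer hitting a marked folio. */
static void classify_example(void)
{
	struct wb_ctx c = {
		.excessive_writeback = false,
		.folio_reclaim = true,
		.throttling_sane = false,
		.may_enter_fs = true,
	};

	assert(classify_writeback_folio(&c) == WB_WAIT_CASE3);
}
```

Only the last return blocks on writeback, so it is the only path that can produce the deadlock described below.
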
>
> The issue is that for filesystems where writeback may block, sub-optimal
> workarounds need to be put in place to avoid potential deadlocks that may
> arise from the case where reclaim waits on writeback. (Even though case
> 3 above is rare given that legacy cgroupv1 is on its way to being
> deprecated, this case still needs to be accounted for)
>
> For example, for FUSE filesystems, when a writeback is triggered on a
> folio, a temporary folio is allocated and the pages are copied over to
> this temporary folio so that writeback can be immediately cleared on the
> original folio. This additionally requires an internal rb tree to keep
> track of writeback state on the temporary folios. Benchmarks show
> roughly a 20% decrease in throughput from the overhead incurred with 4k
> block size writes. The temporary folio is needed here in order to avoid
> the following deadlock if reclaim waits on writeback:
> * single-threaded FUSE server is in the middle of handling a request that
> needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback (eg falls into case 3
> above) that needs to be written back to the fuse server
> * the FUSE server can't write back the folio since it's stuck in direct
> reclaim
>
> This commit allows filesystems to set an ASOP_NO_RECLAIM_IN_WRITEBACK
> flag in the address_space_operations struct to signify that reclaim
> should not happen when the folio is already in writeback. This only has
> effects on the case where cgroupv1 memcg encounters a folio under
> writeback that already has the reclaim flag set (eg case 3 above), and
> allows for the suboptimal workarounds added to address the "reclaim wait
> on writeback" deadlock scenario to be removed.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/fs.h | 14 ++++++++++++++
> mm/vmscan.c | 6 ++++--
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e3c603d01337..808164e3dd84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -394,7 +394,10 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
> return kiocb->ki_complete == NULL;
> }
>
> +typedef unsigned int __bitwise asop_flags_t;
> +
> struct address_space_operations {
> + asop_flags_t asop_flags;
> int (*writepage)(struct page *page, struct writeback_control *wbc);
> int (*read_folio)(struct file *, struct folio *);
>
> @@ -438,6 +441,12 @@ struct address_space_operations {
> int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
> };
>
> +/**
> + * This flag is only to be used by filesystems whose folios cannot be
> + * reclaimed when in writeback (eg fuse)
> + */
> +#define ASOP_NO_RECLAIM_IN_WRITEBACK ((__force asop_flags_t)(1 << 0))
> +
> extern const struct address_space_operations empty_aops;
>
> /**
> @@ -586,6 +595,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
> atomic_inc(&mapping->i_mmap_writable);
> }
>
> +static inline bool mapping_no_reclaim_in_writeback(struct address_space *mapping)
> +{
> + return mapping->a_ops->asop_flags & ASOP_NO_RECLAIM_IN_WRITEBACK;

Is there a reason not to add this flag to enum mapping_flags and use the
mapping->flags field, rather than adding a new field to struct
address_space_operations?
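
To make the question concrete, here is a rough userspace sketch of what a per-mapping flag could look like. Everything in it is an assumption for illustration: the *_sketch types stand in for the kernel's enum mapping_flags and struct address_space, the bit number is arbitrary, and the setter name is hypothetical. In the kernel the helpers would presumably use set_bit() and test_bit() on mapping->flags, like the existing mapping flag helpers do.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for enum mapping_flags; name and bit number are hypothetical. */
enum mapping_flags_sketch {
	AS_NO_RECLAIM_IN_WRITEBACK_SKETCH = 0,
};

/* Stand-in for struct address_space, reduced to the flags word. */
struct address_space_sketch {
	unsigned long flags;
};

/* Hypothetical setter a filesystem would call when setting up an inode. */
static inline void
mapping_set_no_reclaim_in_writeback(struct address_space_sketch *mapping)
{
	/* Plain OR here; kernel code would want an atomic set_bit(). */
	mapping->flags |= 1UL << AS_NO_RECLAIM_IN_WRITEBACK_SKETCH;
}

/* Same name as the helper in the patch, but reading the mapping itself. */
static inline bool
mapping_no_reclaim_in_writeback(const struct address_space_sketch *mapping)
{
	return (mapping->flags & (1UL << AS_NO_RECLAIM_IN_WRITEBACK_SKETCH)) != 0;
}

/* Usage example: the flag is off by default and on after the setter. */
static void mapping_flag_example(void)
{
	struct address_space_sketch mapping = { .flags = 0 };

	assert(!mapping_no_reclaim_in_writeback(&mapping));
	mapping_set_no_reclaim_in_writeback(&mapping);
	assert(mapping_no_reclaim_in_writeback(&mapping));
}
```

With this shape the check in shrink_folio_list() would not need to dereference a_ops, and struct address_space_operations would stay unchanged.
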
> +}
> +
> /*
> * Use sequence counter to get consistent i_size on 32-bit processors.
> */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..2beffbdae572 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (writeback && folio_test_reclaim(folio))
> stat->nr_congested += nr_pages;
>
> + mapping = folio_mapping(folio);
> +
> /*
> * If a folio at the tail of the LRU is under writeback, there
> * are three cases to consider.
> @@ -1165,7 +1167,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> /* Case 2 above */
> } else if (writeback_throttling_sane(sc) ||
> !folio_test_reclaim(folio) ||
> - !may_enter_fs(folio, sc->gfp_mask)) {
> + !may_enter_fs(folio, sc->gfp_mask) ||
> + (mapping && mapping_no_reclaim_in_writeback(mapping))) {
> /*
> * This is slightly racy -
> * folio_end_writeback() might have
> @@ -1320,7 +1323,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (folio_maybe_dma_pinned(folio))
> goto activate_locked;
>
> - mapping = folio_mapping(folio);
> if (folio_test_dirty(folio)) {
> /*
> * Only kswapd can writeback filesystem folios
> --
> 2.43.5
>