From: Jan Kara <jack@suse.cz>
To: Tal Zussman <tz2294@columbia.edu>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 Vlastimil Babka <vbabka@suse.cz>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	 Brendan Jackman <jackmanb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	 Jens Axboe <axboe@kernel.dk>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	 Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Christoph Hellwig <hch@infradead.org>,
	 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,  linux-block@vger.kernel.org
Subject: Re: [PATCH RFC v3 1/2] filemap: defer dropbehind invalidation from IRQ context
Date: Mon, 2 Mar 2026 10:11:19 +0100	[thread overview]
Message-ID: <wen63cjbk3k54mjzgw7zftsuze6bzxmdk5u5wdjabzdiqg645k@67666k5lrevh> (raw)
In-Reply-To: <20260227-blk-dontcache-v3-1-cd309ccd5868@columbia.edu>

On Fri 27-02-26 11:41:07, Tal Zussman wrote:
> folio_end_dropbehind() is called from folio_end_writeback(), which can
> run in IRQ context through buffer_head completion.
> 
> Previously, when folio_end_dropbehind() detected !in_task(), it skipped
> the invalidation entirely. This meant that folios marked for dropbehind
> via RWF_DONTCACHE would remain in the page cache after writeback when
> completed from IRQ context, defeating the purpose of using it.
> 
> Fix this by adding folio_end_dropbehind_irq() which defers the
> invalidation to a workqueue. The folio is added to a per-cpu folio_batch
> protected by a local_lock, and a work item pinned to that CPU drains the
> batch. folio_end_writeback() dispatches between the task and IRQ paths
> based on in_task().
> 
> A CPU hotplug dead callback drains any remaining folios from the
> departing CPU's batch to avoid leaking folio references.
> 
> This unblocks enabling RWF_DONTCACHE for block devices and other
> buffer_head-based I/O.
> 
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>

Thanks for the patch. Couple of comments below:

> @@ -1613,26 +1617,131 @@ static void filemap_end_dropbehind(struct folio *folio)
>   * If folio was marked as dropbehind, then pages should be dropped when writeback
>   * completes. Do that now. If we fail, it's likely because of a big folio -
>   * just reset dropbehind for that case and later completions should invalidate.
> + *
> + * When called from IRQ context (e.g. buffer_head completion), we cannot lock
> + * the folio and invalidate. Defer to a workqueue so that callers like
> + * end_buffer_async_write() that complete in IRQ context still get their folios
> + * pruned.
> + */
> +struct dropbehind_batch {
> +	local_lock_t lock_irq;
> +	struct folio_batch fbatch;
> +	struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct dropbehind_batch, dropbehind_batch) = {
> +	.lock_irq = INIT_LOCAL_LOCK(lock_irq),
> +};
> +
> +static void dropbehind_work_fn(struct work_struct *w)
> +{
> +	struct dropbehind_batch *db_batch;
> +	struct folio_batch fbatch;
> +
> +again:
> +	local_lock_irq(&dropbehind_batch.lock_irq);
> +	db_batch = this_cpu_ptr(&dropbehind_batch);
> +	fbatch = db_batch->fbatch;
> +	folio_batch_reinit(&db_batch->fbatch);
> +	local_unlock_irq(&dropbehind_batch.lock_irq);
> +
> +	for (int i = 0; i < folio_batch_count(&fbatch); i++) {
> +		struct folio *folio = fbatch.folios[i];
> +
> +		if (folio_trylock(folio)) {
> +			filemap_end_dropbehind(folio);
> +			folio_unlock(folio);
> +		}
> +		folio_put(folio);
> +	}

This logic of taking the folio batch and calling filemap_end_dropbehind() for
each folio repeats twice in this patch - perhaps we could factor it out into a
helper function such as fbatch_end_dropbehind()?
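
For illustration, an untested sketch of such a helper (the name
fbatch_end_dropbehind() is only a suggestion; callers would reinit the batch
as needed):

```c
/* Untested sketch: common loop shared by the work fn and the hotplug drain. */
static void fbatch_end_dropbehind(struct folio_batch *fbatch)
{
	for (int i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		if (folio_trylock(folio)) {
			filemap_end_dropbehind(folio);
			folio_unlock(folio);
		}
		folio_put(folio);
	}
}
```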

> +
> +	/* Drain folios that were added while we were processing. */
> +	local_lock_irq(&dropbehind_batch.lock_irq);
> +	if (folio_batch_count(&db_batch->fbatch)) {
> +		local_unlock_irq(&dropbehind_batch.lock_irq);
> +		goto again;

I'm somewhat nervous about this potentially unbounded loop: if someone is able
to feed folios into db_batch fast enough, it could hog the CPU for quite a
long time, causing all sorts of interesting effects. If nothing else we should
abort the loop when need_resched() is true.
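
E.g. something like this (untested, and I haven't thought hard about whether
requeueing from the work fn itself is the nicest way to pick up the remainder):

```c
	/* Drain folios added while we were processing, but don't hog the CPU. */
	local_lock_irq(&dropbehind_batch.lock_irq);
	if (folio_batch_count(&db_batch->fbatch)) {
		local_unlock_irq(&dropbehind_batch.lock_irq);
		if (!need_resched())
			goto again;
		/* Let someone else run; requeue ourselves for the rest. */
		schedule_work(&db_batch->work);
		return;
	}
	local_unlock_irq(&dropbehind_batch.lock_irq);
```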

> +	}
> +	local_unlock_irq(&dropbehind_batch.lock_irq);
> +}
> +
> +/*
> + * Drain a dead CPU's dropbehind batch. The CPU is already dead so no
> + * locking is needed.
> + */
> +void dropbehind_drain_cpu(int cpu)
> +{
> +	struct dropbehind_batch *db_batch = per_cpu_ptr(&dropbehind_batch, cpu);
> +	struct folio_batch *fbatch = &db_batch->fbatch;
> +
> +	for (int i = 0; i < folio_batch_count(fbatch); i++) {
> +		struct folio *folio = fbatch->folios[i];
> +
> +		if (folio_trylock(folio)) {
> +			filemap_end_dropbehind(folio);
> +			folio_unlock(folio);
> +		}
> +		folio_put(folio);
> +	}
> +	folio_batch_reinit(fbatch);
> +}
> +
> +static void __init dropbehind_init(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct dropbehind_batch *db_batch = per_cpu_ptr(&dropbehind_batch, cpu);
> +
> +		folio_batch_init(&db_batch->fbatch);
> +		INIT_WORK(&db_batch->work, dropbehind_work_fn);
> +	}
> +}
> +
> +/*
> + * Must be called from task context. Use folio_end_dropbehind_irq() for
> + * IRQ context (e.g. buffer_head completion).
>   */
>  void folio_end_dropbehind(struct folio *folio)
>  {
>  	if (!folio_test_dropbehind(folio))
>  		return;
>  
> -	/*
> -	 * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
> -	 * but can happen if normal writeback just happens to find dirty folios
> -	 * that were created as part of uncached writeback, and that writeback
> -	 * would otherwise not need non-IRQ handling. Just skip the
> -	 * invalidation in that case.
> -	 */
> -	if (in_task() && folio_trylock(folio)) {
> +	if (folio_trylock(folio)) {
>  		filemap_end_dropbehind(folio);
>  		folio_unlock(folio);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(folio_end_dropbehind);
>  
> +/*
> + * In IRQ context we cannot lock the folio or call into the invalidation
> + * path. Defer to a workqueue. This happens for buffer_head-based writeback
> + * which runs from bio IRQ context.
> + */
> +static void folio_end_dropbehind_irq(struct folio *folio)
> +{
> +	struct dropbehind_batch *db_batch;
> +	unsigned long flags;
> +
> +	if (!folio_test_dropbehind(folio))
> +		return;
> +
> +	local_lock_irqsave(&dropbehind_batch.lock_irq, flags);
> +	db_batch = this_cpu_ptr(&dropbehind_batch);
> +
> +	/* If there is no space in the folio_batch, skip the invalidation. */
> +	if (!folio_batch_space(&db_batch->fbatch)) {
> +		local_unlock_irqrestore(&dropbehind_batch.lock_irq, flags);
> +		return;

Folio batches are relatively small (31 folios). With 4k folios it is very easy
to overflow the batch with a single IO completion. Large folios will obviously
make this less likely, but I'm not sure reasonable dropbehind behavior should
depend on large folios... I'm not sure how best to address this though. We
could use larger batches, but that would mean using our own array of folios
instead of a folio_batch.
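
Roughly something like this (untested sketch, the size is picked out of thin
air and would need measurement):

```c
/* Untested: open-coded per-cpu array instead of struct folio_batch. */
#define DROPBEHIND_BATCH_SIZE	128	/* arbitrary, needs measurement */

struct dropbehind_batch {
	local_lock_t lock_irq;
	unsigned int nr;
	struct folio *folios[DROPBEHIND_BATCH_SIZE];
	struct work_struct work;
};
```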

> +	}
> +
> +	folio_get(folio);
> +	folio_batch_add(&db_batch->fbatch, folio);
> +	local_unlock_irqrestore(&dropbehind_batch.lock_irq, flags);
> +
> +	schedule_work_on(smp_processor_id(), &db_batch->work);
> +}
> +
>  /**
>   * folio_end_writeback_no_dropbehind - End writeback against a folio.
>   * @folio: The folio.
> @@ -1685,7 +1794,10 @@ void folio_end_writeback(struct folio *folio)
>  	 */
>  	folio_get(folio);
>  	folio_end_writeback_no_dropbehind(folio);
> -	folio_end_dropbehind(folio);
> +	if (in_task())
> +		folio_end_dropbehind(folio);
> +	else
> +		folio_end_dropbehind_irq(folio);

I think it would be more elegant to have folio_end_dropbehind() itself decide,
based on context, whether to offload to the workqueue. folio_end_dropbehind()
is never safe in IRQ context, so I don't think it makes sense to give callers
the option of calling it in the wrong context.
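
I.e. roughly (untested):

```c
void folio_end_dropbehind(struct folio *folio)
{
	if (!folio_test_dropbehind(folio))
		return;

	if (!in_task()) {
		/* Cannot lock the folio here; punt to the workqueue. */
		folio_end_dropbehind_irq(folio);
		return;
	}

	if (folio_trylock(folio)) {
		filemap_end_dropbehind(folio);
		folio_unlock(folio);
	}
}
```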

>  	folio_put(folio);
>  }
>  EXPORT_SYMBOL(folio_end_writeback);

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


Thread overview: 5+ messages
2026-02-27 16:41 [PATCH RFC v3 0/2] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-02-27 16:41 ` [PATCH RFC v3 1/2] filemap: defer dropbehind invalidation from IRQ context Tal Zussman
2026-03-02  9:11   ` Jan Kara [this message]
2026-02-27 16:41 ` [PATCH RFC v3 2/2] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-03-02  9:13   ` Jan Kara
