[RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range()

linux-mm.kvack.org archive mirror

* [RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range()
@ 2026-03-25 11:36 Rushil Patel
  2026-03-25 11:36 ` [RFC PATCH 1/1] " Rushil Patel
  2026-03-25 19:15 ` [RFC PATCH 0/1] " Matthew Wilcox
  0 siblings, 2 replies; 3+ messages in thread
From: Rushil Patel @ 2026-03-25 11:36 UTC
  To: Matthew Wilcox, Andrew Morton
  Cc: linux-fsdevel, linux-mm, linux-kernel, Rushil Patel

We run Slurm on compute nodes with NFS mounts (NFSv4.1, NetApp).
When a job is cancelled, processes with dirty NFS pages get stuck
in D-state inside folio_wait_bit_common() because
__filemap_fdatawait_range() uses folio_wait_writeback(), which is
TASK_UNINTERRUPTIBLE. If the filer is slow to respond these processes are
unkillable - we've found the only recovery in practice is rebooting
the node.

The patch switches to folio_wait_writeback_killable() so SIGKILL can
interrupt the wait. Writeback itself continues on the server, we just stop
waiting for the ack. All 6 callers of __filemap_fdatawait_range() detect
errors independently via errseq_t / filemap_check_errors(), so the early
return doesn't suppress error reporting.
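
To make the errseq_t argument above concrete, here is a toy userspace
model in C. It is not the kernel's implementation: the real errseq_t packs
an error, a "seen" flag, and a counter into a single value, while the
struct and helpers below (toy_errseq, toy_set, toy_sample,
toy_check_and_advance) are hypothetical names invented for this sketch.
It only illustrates the idea that an observer compares its own cursor
against a shared record, so one waiter returning early does not by itself
erase an error recorded later. It does not show whether every caller
actually performs such a check.

```c
#include <errno.h>

/* Toy stand-in for mapping->wb_err: latest error plus a change counter. */
struct toy_errseq {
	int err;          /* most recent negative errno, or 0 */
	unsigned int seq; /* bumped every time an error is recorded */
};

/* Called by the (simulated) writeback completion path. */
static void toy_set(struct toy_errseq *e, int err)
{
	e->err = err;
	e->seq++;
}

/* An observer remembers the counter value when it starts watching. */
static unsigned int toy_sample(const struct toy_errseq *e)
{
	return e->seq;
}

/* Report an error recorded since *since, then advance the cursor. */
static int toy_check_and_advance(const struct toy_errseq *e,
				 unsigned int *since)
{
	if (*since == e->seq)
		return 0;
	*since = e->seq;
	return e->err;
}
```

In this model, an observer that sampled before toy_set(&wb, -EIO) gets
-EIO from its next toy_check_and_advance() call and 0 from the call after
that, regardless of what any other observer did in between.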

The tricky part is a re-entry through do_exit(). Making the wait killable
alone isn't enough - we hit this in testing:

  1. SIGKILL wakes the killable wait, signal is consumed by get_signal()
  2. do_exit() -> exit_signals() sets PF_EXITING
  3. do_exit() -> exit_files() -> nfs4_file_flush() -> nfs_wb_all()
     re-enters __filemap_fdatawait_range()
  4. wants_signal() checks PF_EXITING *before* the SIGKILL special case
     (kernel/signal.c:951 vs 954), so it returns false
  5. No signal can wake the second wait -> stuck in D-state again
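
The ordering problem in step 4 can be sketched as a small userspace model
in C. This is a simplification for illustration, not the kernel source:
toy_task, TOY_PF_EXITING and toy_wants_signal() are hypothetical names,
the flag value is arbitrary, and the kernel's final "is the task running
or free of pending signals" test is left out.

```c
#include <signal.h>
#include <stdbool.h>

#define TOY_PF_EXITING 0x1u /* arbitrary value, for this model only */

struct toy_task {
	unsigned int flags; /* stands in for task_struct::flags */
	bool sig_blocked;   /* true if the signal is in the blocked set */
	bool stopped;       /* true if the task is stopped or traced */
};

/* Simplified model of the check order described in step 4. */
static bool toy_wants_signal(int sig, const struct toy_task *t)
{
	if (t->sig_blocked)
		return false;

	/* Evaluated before the SIGKILL case, so it applies to SIGKILL too. */
	if (t->flags & TOY_PF_EXITING)
		return false;

	if (sig == SIGKILL)
		return true;

	if (t->stopped)
		return false;

	return true; /* remaining kernel checks omitted in this model */
}
```

With the flag clear, toy_wants_signal(SIGKILL, &t) returns true; once
TOY_PF_EXITING is set it returns false, which mirrors why a second
killable wait entered from do_exit() has nothing left to wake it.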

The PF_EXITING check at the top of the function avoids re-entering the
wait entirely. This is the same pattern used in mm/oom_kill.c,
mm/memcontrol.c, block/blk-ioc.c, and io_uring/.

Reproduced with iptables DROP on port 2049, confirmed the killable-only
revision gets stuck on re-entry, and the PF_EXITING + killable revision
kills cleanly.

Sending as RFC because this touches the generic writeback sync path in
mm/filemap.c rather than being NFS-specific. NFS can't really fix this on
its own - it reaches __filemap_fdatawait_range() through
filemap_write_and_wait() and doesn't own the wait. But I wanted to get
guidance on whether this is the right place for the fix, or if you'd prefer
a different approach.

Best regards,

Rushil

Rushil Patel (1):
  mm/filemap: make writeback wait killable in
    __filemap_fdatawait_range()

 mm/filemap.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

-- 
2.47.3

For details of how GSA uses your personal information, please see our Privacy Notice here: https://www.gsacapital.com/privacy-notice 

This email and any files transmitted with it contain confidential and proprietary information and is solely for the use of the intended recipient.
If you are not the intended recipient please return the email to the sender and delete it from your computer and you must not use, disclose, distribute, copy, print or rely on this email or its contents.
This communication is for informational purposes only.
It is not intended as an offer or solicitation for the purchase or sale of any financial instrument or as an official confirmation of any transaction.
Any comments or statements made herein do not necessarily reflect those of GSA Capital.
GSA Capital Partners LLP is authorised and regulated by the Financial Conduct Authority and is registered in England and Wales at Stratton House, 5 Stratton Street, London W1J 8LA, number OC309261.
GSA Capital Services Limited is registered in England and Wales at the same address, number 5320529.

* [RFC PATCH 1/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range()
  2026-03-25 11:36 [RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range() Rushil Patel
@ 2026-03-25 11:36 ` Rushil Patel
  2026-03-25 19:15 ` [RFC PATCH 0/1] " Matthew Wilcox
  1 sibling, 0 replies; 3+ messages in thread
From: Rushil Patel @ 2026-03-25 11:36 UTC
  To: Matthew Wilcox, Andrew Morton
  Cc: linux-fsdevel, linux-mm, linux-kernel, Rushil Patel

__filemap_fdatawait_range() waits for writeback using
folio_wait_writeback(), which sleeps in TASK_UNINTERRUPTIBLE. On network
filesystems (NFS, CIFS, 9P) the server can stall for minutes or hours,
leaving processes in unkillable D-state. The only recovery is a node
reboot.

Replace folio_wait_writeback() with folio_wait_writeback_killable() so
that SIGKILL can interrupt the wait. On fatal signal, release the folio
batch and return early. The writeback itself continues on the server --
only the client-side wait is interrupted. The dying process never
inspects the return value, and other processes get correct error
reporting through errseq_t (mapping->wb_err) which is set by the
writeback completion path independently of this wait.

Additionally, skip the writeback wait entirely when the calling process
has PF_EXITING set. This handles a re-entry trap discovered in testing:
when SIGKILL wakes a process from the killable wait, do_exit() sets
PF_EXITING and then exit_files() closes file descriptors, which on NFS
triggers nfs4_file_flush() -> nfs_wb_all() -> filemap_write_and_wait()
-> __filemap_fdatawait_range(). At that point wants_signal()
(kernel/signal.c) rejects all signals for PF_EXITING tasks -- including
SIGKILL -- because the PF_EXITING check precedes the SIGKILL exception.
The second killable wait would therefore block forever. Checking
PF_EXITING at entry avoids this: the process is dying and has no use for
writeback confirmation.

Signed-off-by: Rushil Patel <rushil.patel@gsacapital.com>
---
 mm/filemap.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 406cef06b684..d348f1dd75f9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -515,6 +515,15 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
 	struct folio_batch fbatch;
 	unsigned nr_folios;

+	/*
+	 * If the process is exiting (PF_EXITING), skip the writeback wait.
+	 * During do_exit(), nfs4_file_flush() re-enters this function, but
+	 * wants_signal() rejects signals for PF_EXITING tasks so a second
+	 * SIGKILL cannot wake us from the TASK_KILLABLE wait below.
+	 */
+	if (current->flags & PF_EXITING)
+		return;
+
 	folio_batch_init(&fbatch);

 	while (index <= end) {
@@ -529,7 +538,10 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
 		for (i = 0; i < nr_folios; i++) {
 			struct folio *folio = fbatch.folios[i];

-			folio_wait_writeback(folio);
+			if (folio_wait_writeback_killable(folio)) {
+				folio_batch_release(&fbatch);
+				return;
+			}
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
-- 
2.47.3

* Re: [RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range()
  2026-03-25 11:36 [RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range() Rushil Patel
  2026-03-25 11:36 ` [RFC PATCH 1/1] " Rushil Patel
@ 2026-03-25 19:15 ` Matthew Wilcox
  1 sibling, 0 replies; 3+ messages in thread
From: Matthew Wilcox @ 2026-03-25 19:15 UTC
  To: Rushil Patel
  Cc: Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, linux-nfs,
	Trond Myklebust, Anna Schumaker

On Wed, Mar 25, 2026 at 11:36:15AM +0000, Rushil Patel wrote:
> We run Slurm on compute nodes with NFS mounts (NFSv4.1, NetApp).
> When a job is cancelled, processes with dirty NFS pages get stuck
> in D-state inside folio_wait_bit_common() because
> __filemap_fdatawait_range() uses folio_wait_writeback(), which is
> TASK_UNINTERRUPTIBLE. If the filer is slow to respond these processes are
> unkillable - we've found the only recovery in practice is rebooting
> the node.

Hi Rushil.  Thanks for the patch!  I have a lot of sympathy for the
problem you're trying to solve.  It was something similar which led
to me introducing the TASK_KILLABLE infrastructure back in 2007.
My problem was read-only though, and while I had an initial attempt
to also handle write workloads, it didn't work and I didn't have a
personal need for it, so I abandoned it.  Now you have a real need, so
let's make it work.

> The patch switches to folio_wait_writeback_killable() so SIGKILL can
> interrupt the wait. Writeback itself continues on the server, we just stop
> waiting for the ack. All 6 callers of __filemap_fdatawait_range() detect
> errors independently via errseq_t / filemap_check_errors(), so the early
> return doesn't suppress error reporting.

Well ... I'm not entirely sure it doesn't suppress error reporting.
But I think I see what you're trying to say, and I think the change
of behaviour is one that was never guaranteed anyway.

> The tricky part is a re-entry through do_exit(). Making the wait killable
> alone isn't enough - we hit this in testing:
> 
>   1. SIGKILL wakes the killable wait, signal is consumed by get_signal()
>   2. do_exit() -> exit_signals() sets PF_EXITING
>   3. do_exit() -> exit_files() -> nfs4_file_flush() -> nfs_wb_all()
>      re-enters __filemap_fdatawait_range()
>   4. wants_signal() checks PF_EXITING *before* the SIGKILL special case
>      (kernel/signal.c:951 vs 954), so it returns false
>   5. No signal can wake the second wait -> stuck in D-state again

Yes, this was where I got stuck too!

> The PF_EXITING check at the top of the function avoids re-entering the
> wait entirely. This is the same pattern used in mm/oom_kill.c,
> mm/memcontrol.c, block/blk-ioc.c, and io_uring/.

I'm not entirely comfortable with the location of the check.  I feel
that __filemap_fdatawait_range() is a bit too low level for a check
of PF_EXITING.  I could see there being other places
which really do want to wait, even in the presence of an exiting task.
Maybe I'm being overly paranoid there, but I would suppress the call
from nfs_wb_all().  Maybe something like this?

-	ret = filemap_write_and_wait(inode->i_mapping);
+	if (current->flags & PF_EXITING)
+		ret = filemap_fdatawrite(inode->i_mapping);
+	else
+		ret = filemap_write_and_wait(inode->i_mapping);

What held me up from doing this though was the next part of
nfs_wb_all():

        ret = nfs_commit_inode(inode, FLUSH_SYNC);

I didn't trace through exactly what this would do, but I inferred from
the FLUSH_SYNC that it would also wait for the file server to finish
the write of the inode ...

> Reproduced with iptables DROP on port 2049, confirmed the killable-only
> revision gets stuck on re-entry, and the PF_EXITING + killable revision
> kills cleanly.

... but if your testing shows that it works, I must be mistaken about
that.

> Sending as RFC because this touches the generic writeback sync path in
> mm/filemap.c rather than being NFS-specific. NFS can't really fix this on
> its own - it reaches __filemap_fdatawait_range() through
> filemap_write_and_wait() and doesn't own the wait. But I wanted to get
> guidance on whether this is the right place for the fix, or if you'd prefer
> a different approach.

Appreciate your flexibility on this ... sounds like you considered
doing it this way, but didn't know about filemap_fdatawrite()?

Anyway, adding the NFS people for their opinions.  Other filesystems
don't do this flush-on-close behaviour (for various reasons, but
basically NFS has a close-to-open consistency model).  I believe
we can break this guarantee in this case as it's not an orderly close
but an involuntary termination of the process.