From: Mike Snitzer <snitzer@kernel.org>
To: Zorro Lang <zlang@redhat.com>
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org, linux-xfs@vger.kernel.org
Subject: Re: [Bug report][xfstests generic/751] hang on nfs writeback
Date: Wed, 29 Oct 2025 16:50:13 -0400
Message-ID: <aQJ-BfBblQ5hlgm1@kernel.org>
In-Reply-To: <aPkFLDsQ71BrTXt7@kernel.org>

On Wed, Oct 22, 2025 at 12:24:12PM -0400, Mike Snitzer wrote:
> On Tue, Oct 21, 2025 at 01:14:08PM +0800, Zorro Lang wrote:
> > Hi,
> > 
> > When I did xfstests regression testing on nfs, it hung on generic/751 many
> > times; see [1] (on x86_64) and [2] (on aarch64). I've hit it 8 times so far.
> 
> You've hit it 8 times, all on the same kernel?  Any idea if this is a
> regression specific to 6.18-rc1 and later?
>  
> > I tested on the latest linux v6.18-rc2. My underlying test device isn't a
> > loopback device, it's a regular disk, with xfs made on it:
> >   meta-data=/dev/sda5              isize=512    agcount=4, agsize=983040 blks
> >            =                       sectsz=4096  attr=2, projid32bit=1
> >            =                       crc=1        finobt=1, sparse=1, rmapbt=1
> >            =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> >            =                       exchange=0   metadir=0
> >   data     =                       bsize=4096   blocks=3932160, imaxpct=25
> >            =                       sunit=0      swidth=0 blks
> >   naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> >   log      =internal log           bsize=4096   blocks=54324, version=2
> >            =                       sectsz=4096  sunit=1 blks, lazy-count=1
> >   realtime =none                   extsz=4096   blocks=0, rtextents=0
> >            =                       rgcount=0    rgsize=0 extents
> >            =                       zoned=0      start=0 reserved=0
> > 
> > Two xfs filesystems are mounted on /mnt/xfstests/test and /mnt/xfstests/scratch
> > separately, then exported as:
> >   # cat /etc/exports
> >   /mnt/xfstests/test/nfs-server *(rw,insecure,no_root_squash)
> >   /mnt/xfstests/scratch/nfs-server *(rw,insecure,no_root_squash)
> > 
> > The nfs mount option is just "-o vers=4.2". BTW, the nfs server and client
> > are on the same machine/system.
> 
> Since the NFS client and server are on the same localhost, I see that
> NFS's LOCALIO is being used (nfslocaliod is mentioned in the "locks
> held" section), so IO bypasses going over the wire to NFSD and is
> instead issued directly from the NFS client through the LOCALIO
> workqueue and then down to XFS.
> 
> I'll try to make sense of the locking (reported below) but this will
> take some time.
> 
> Cc'ing the XFS list because this is NFS LOCALIO on top of XFS; it
> could be that something fundamental in LOCALIO's nfslocaliod
> workqueue is setting us up for deadlock if XFS takes a specific
> path...
> 
> NOTE: in fs/nfs/inode.c:nfsiod_start(), XFS's nature has already
> influenced LOCALIO's workqueue config hacks... there could be dragons
> lurking:
> 
> /*
>  * Start the nfsiod workqueues
>  */
> static int nfsiod_start(void)
> {
>         dprintk("RPC:       creating workqueue nfsiod\n");
>         nfsiod_workqueue = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>         if (nfsiod_workqueue == NULL)
>                 return -ENOMEM;
> #if IS_ENABLED(CONFIG_NFS_LOCALIO)
>         /*
>          * localio writes need to use a normal (non-memreclaim) workqueue.
>          * When we start getting low on space, XFS goes and calls flush_work() on
>          * a non-memreclaim work queue, which causes a priority inversion problem.
>          */
>         dprintk("RPC:       creating workqueue nfslocaliod\n");
>         nfslocaliod_workqueue = alloc_workqueue("nfslocaliod", WQ_UNBOUND, 0);
>         if (unlikely(nfslocaliod_workqueue == NULL)) {
>                 nfsiod_stop();
>                 return -ENOMEM;
>         }
> #endif /* CONFIG_NFS_LOCALIO */
>         return 0;
> }
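
(Aside, for anyone not steeped in workqueue internals: below is a toy
sketch, with made-up names and not the actual NFS or XFS code, of the
flush dependency that comment is guarding against -- a WQ_MEM_RECLAIM
context waiting on work that lives on a plain workqueue, which is the
pattern check_flush_dependency() in kernel/workqueue.c exists to warn
about:)

// SPDX-License-Identifier: GPL-2.0
/*
 * Toy sketch only (made-up names, not NFS/XFS code): a WQ_MEM_RECLAIM
 * workqueue keeps a rescuer thread so it can make forward progress
 * under memory pressure; flushing work that lives on a plain workqueue
 * from that context is the inversion described above.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;	/* WQ_MEM_RECLAIM: has a rescuer */
static struct workqueue_struct *plain_wq;	/* no rescuer: may stall when low on memory */

static void io_done_fn(struct work_struct *w)
{
	/* stands in for an IO completion; runs on plain_wq */
}
static DECLARE_WORK(io_done_work, io_done_fn);

static void reclaim_side_fn(struct work_struct *w)
{
	/*
	 * Runs on reclaim_wq, i.e. possibly the only context able to make
	 * progress when memory is tight.  Waiting here on work queued to a
	 * !WQ_MEM_RECLAIM queue is the hazard: that queue may need to create
	 * new worker threads (i.e. allocate memory) before io_done_fn can run.
	 */
	flush_work(&io_done_work);
}
static DECLARE_WORK(reclaim_side_work, reclaim_side_fn);

static int __init wq_demo_init(void)
{
	reclaim_wq = alloc_workqueue("demo_reclaim", WQ_MEM_RECLAIM, 0);
	plain_wq = alloc_workqueue("demo_plain", WQ_UNBOUND, 0);
	if (!reclaim_wq || !plain_wq) {
		if (reclaim_wq)
			destroy_workqueue(reclaim_wq);
		if (plain_wq)
			destroy_workqueue(plain_wq);
		return -ENOMEM;
	}

	queue_work(plain_wq, &io_done_work);
	queue_work(reclaim_wq, &reclaim_side_work);
	return 0;
}

static void __exit wq_demo_exit(void)
{
	destroy_workqueue(plain_wq);
	destroy_workqueue(reclaim_wq);
}

module_init(wq_demo_init);
module_exit(wq_demo_exit);
MODULE_LICENSE("GPL");
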
> 
> Mike
> 
>  
> > # cat local.config
> > export FSTYP=nfs
> > export TEST_DEV=xxxx-xxx-xxxx.xxxx.xxx.xxx:/mnt/xfstests/test/nfs-server
> > export TEST_DIR=/mnt/xfstests/test/nfs-client
> > export SCRATCH_DEV=xxxx-xxx-xxxx.xxxx.xxx.xxx:/mnt/xfstests/scratch/nfs-server
> > export SCRATCH_MNT=/mnt/xfstests/scratch/nfs-client
> > export MOUNT_OPTIONS="-o vers=4.2"
> > export TEST_FS_MOUNT_OPTS="-o vers=4.2"
> > 
> > 
> > [1]
> > [23369.572660] run fstests generic/751 at 2025-10-20 23:15:24 
> > [-- MARK -- Tue Oct 21 03:20:00 2025] 
> > [-- MARK -- Tue Oct 21 03:25:00 2025] 
> > [-- MARK -- Tue Oct 21 03:30:00 2025] 
> > [-- MARK -- Tue Oct 21 03:35:00 2025] 
> > [-- MARK -- Tue Oct 21 03:40:00 2025] 
> > [25069.500900] INFO: task kworker/u9:73:825900 blocked for more than 122 seconds. 
> > [25069.501484]       Not tainted 6.18.0-rc2 #1 
> > [25069.501822] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> > [25069.502385] task:kworker/u9:73   state:D stack:24712 pid:825900 tgid:825900 ppid:2      task_flags:0x4208060 flags:0x00080000 
> > [25069.503258] Workqueue: writeback wb_workfn (flush-0:48) 
> > [25069.782725] Call Trace: 
> > [25069.787191]  <TASK> 
> > [25069.787368]  __schedule+0x838/0x1890 
> > [25069.799533]  ? __pfx___schedule+0x10/0x10 
> > [25069.799852]  ? __blk_flush_plug+0x27b/0x4d0 
> > [25069.800617]  ? find_held_lock+0x32/0x90 
> > [25069.801292]  ? __lock_release.isra.0+0x1a4/0x2c0 
> > [25069.801639]  schedule+0xd4/0x260 
> > [25069.801902]  io_schedule+0x8f/0xf0 
> > [25069.802162]  folio_wait_bit_common+0x2d9/0x780 
> > [25069.802919]  ? folio_wait_bit_common+0x1dd/0x780 
> > [25069.803324]  ? __pfx_folio_wait_bit_common+0x10/0x10 
> > [25069.803684]  ? nfs_page_clear_headlock+0x31/0x80 [nfs] 
> > [25069.804254]  ? __pfx_wake_page_function+0x10/0x10 
> > [25069.804602]  ? __pfx___might_resched+0x10/0x10 
> > [25069.805160]  writeback_get_folio+0x3f9/0x500 
> > [25069.806221]  writeback_iter+0x136/0x720 
> > [25069.806628]  nfs_writepages+0x4f8/0x9b0 [nfs] 
> > [25069.807100]  ? mark_held_locks+0x40/0x70 
> > [25069.807403]  ? __pfx_nfs_writepages+0x10/0x10 [nfs] 
> > [25069.807846]  ? virtqueue_notify+0x68/0xc0 
> > [25069.808346]  ? virtio_queue_rq+0x2b1/0x650 [virtio_blk] 
> > [25069.808925]  ? __lock_acquire+0x57c/0xbd0 
> > [25069.809295]  do_writepages+0x21f/0x560 
> > [25069.809576]  ? __pfx_do_writepages+0x10/0x10 
> > [25069.809953]  ? rcu_is_watching+0x15/0xb0 
> > [25069.810587]  __writeback_single_inode+0xe2/0x5f0 
> > [25069.810982]  ? __lock_release.isra.0+0x1a4/0x2c0 
> > [25069.811373]  ? __pfx___writeback_single_inode+0x10/0x10 
> > [25069.811815]  ? writeback_sb_inodes+0x416/0xd00 
> > [25069.812194]  writeback_sb_inodes+0x535/0xd00 
> > [25069.812533]  ? __pfx_stack_trace_save+0x10/0x10 
> > [25069.814193]  ? local_clock_noinstr+0xd/0xe0 
> > [25069.814506]  ? __pfx_writeback_sb_inodes+0x10/0x10 
> > [25069.814910]  ? __lock_release.isra.0+0x1a4/0x2c0 
> > [25069.815407]  ? lock_acquire+0x10b/0x150 
> > [25069.815703]  ? down_read_trylock+0x4b/0x60 
> > [25069.816077]  __writeback_inodes_wb+0xf4/0x270 
> > [25069.816441]  ? __pfx___writeback_inodes_wb+0x10/0x10 
> > [25069.816864]  ? queue_io+0x329/0x510 
> > [25069.817148]  wb_writeback+0x70a/0x9c0 
> > [25069.817448]  ? __pfx_wb_writeback+0x10/0x10 
> > [25069.817831]  ? get_nr_dirty_inodes+0xcb/0x180 
> > [25069.818264]  wb_do_writeback+0x5d4/0x8e0 
> > [25069.818575]  ? __pfx_wb_do_writeback+0x10/0x10 
> > [25069.818975]  ? set_worker_desc+0x16e/0x190 
> > [25069.819313]  ? __pfx_set_worker_desc+0x10/0x10 
> > [25069.819669]  wb_workfn+0x7c/0x200 
> > [25069.819997]  process_one_work+0xd8b/0x1320 
> > [25069.820342]  ? __pfx_process_one_work+0x10/0x10 
> > [25069.820694]  ? assign_work+0x16c/0x240 
> > [25069.821053]  worker_thread+0x5f3/0xfe0 
> > [25069.821364]  ? __pfx_worker_thread+0x10/0x10 
> > [25069.821697]  kthread+0x3b4/0x770 
> > [25069.822122]  ? kvm_sched_clock_read+0x11/0x20 
> > [25069.822457]  ? local_clock_noinstr+0xd/0xe0 
> > [25069.822790]  ? __pfx_kthread+0x10/0x10 
> > [25069.823079]  ? __lock_release.isra.0+0x1a4/0x2c0 
> > [25069.823430]  ? rcu_is_watching+0x15/0xb0 
> > [25069.823729]  ? __pfx_kthread+0x10/0x10 
> > [25069.824035]  ret_from_fork+0x393/0x480 
> > [25069.864794]  ? __pfx_kthread+0x10/0x10 
> > [25069.865076]  ? __pfx_kthread+0x10/0x10 
> > [25069.865371]  ret_from_fork_asm+0x1a/0x30 
> > [25069.879905]  </TASK> 
> > [25069.880145]  
> > [25069.880145] Showing all locks held in the system: 
> > [25069.889550] 1 lock held by khungtaskd/37: 
> > [25069.889886]  #0: ffffffffa29309e0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire.constprop.0+0x7/0x30 
> > [25069.890603] 2 locks held by 751/825733: 
> > [25069.890942]  #0: ffff888101e36440 (sb_writers#16){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 
> > [25069.891589]  #1: ffffffffa2b97fd0 (split_debug_mutex){+.+.}-{4:4}, at: split_huge_pages_write+0x124/0x430 
> > [25069.892449] 3 locks held by kworker/u9:73/825900: 
> > [25069.892818]  #0: ffff8881029df958 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x7f5/0x1320 
> > [25069.893530]  #1: ffffc9000ad9fd10 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0xd3f/0x1320 
> > [25069.894333]  #2: ffff88810b1480e8 (&type->s_umount_key#68){++++}-{4:4}, at: super_trylock_shared+0x1c/0xa0 
> > [25069.895056] 2 locks held by kworker/u10:2/826054: 
> > [25069.895418]  #0: ffff888103ec2158 ((wq_completion)nfsiod){+.+.}-{0:0}, at: process_one_work+0x7f5/0x1320 
> > [25069.896123]  #1: ffffc90001e1fd10 ((work_completion)(&ctx->work)){+.+.}-{0:0}, at: process_one_work+0xd3f/0x1320 
> > [25069.896952]  
> > [25069.897085] ============================================= 

I was able to test xfstests generic/751 on a server in the Red Hat
lab, but so far I used a different baseline kernel (it's 6.12.53-based
but has all of the NFSD, NFS and SUNRPC changes from 6.18-rc1 backported).

The test runs without deadlock, but there is one lockdep splat due to
this missing commit:

commit 45f69d091bab64a332fe751da9829dcd136348fd
Author: Long Li <leo.lilong@huawei.com>
Date:   Sat Jun 22 16:26:31 2024 +0800

    xfs: eliminate lockdep false positives in xfs_attr_shortform_list

    xfs_attr_shortform_list() only called from a non-transactional context, it
    hold ilock before alloc memory and maybe trapped in memory reclaim. Since
    commit 204fae32d5f7("xfs: clean up remaining GFP_NOFS users") removed
    GFP_NOFS flag, lockdep warning will be report as [1]. Eliminate lockdep
    false positives by use __GFP_NOLOCKDEP to alloc memory
    in xfs_attr_shortform_list().

    [1] https://lore.kernel.org/linux-xfs/000000000000e33add0616358204@google.com/

    Reported-by: syzbot+4248e91deb3db78358a2@syzkaller.appspotmail.com
    Signed-off-by: Long Li <leo.lilong@huawei.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Carlos Maiolino <cem@kernel.org>
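
(For context, the fix amounts to tagging that allocation with
__GFP_NOLOCKDEP so lockdep skips tracking it. A minimal sketch of the
pattern, with made-up names rather than the actual
xfs_attr_shortform_list() change:)

#include <linux/mutex.h>
#include <linux/slab.h>

/* Made-up stand-in for the inode; the mutex plays the role of the ilock. */
struct foo_inode {
	struct mutex ilock;
};

static int foo_list_attrs(struct foo_inode *fip, size_t bufsize)
{
	void *sbuf;

	mutex_lock(&fip->ilock);

	/*
	 * GFP_KERNEL may recurse into fs reclaim while the lock is held,
	 * so lockdep would record ilock -> fs_reclaim; combined with
	 * reclaim paths that also take the lock, that looks like an
	 * inversion.  __GFP_NOLOCKDEP tells lockdep not to track this
	 * particular allocation.
	 */
	sbuf = kmalloc(bufsize, GFP_KERNEL | __GFP_NOLOCKDEP);
	if (!sbuf) {
		mutex_unlock(&fip->ilock);
		return -ENOMEM;
	}

	/* ... fill and consume sbuf while still holding the lock ... */

	kfree(sbuf);
	mutex_unlock(&fip->ilock);
	return 0;
}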

After applying that commit, a generic/751 retest comes up clean. I'll
now switch to testing the latest v6.18-rcX kernel.

Mike

