From: Yeoreum Yun <yeoreum.yun@arm.com>
To: Byungchul Park <byungchul@sk.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com,
	clameter@sgi.com, kravetz@us.ibm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, max.byungchul.park@gmail.com,
	kernel_team@skhynix.com, harry.yoo@oracle.com,
	gwan-gyeong.mun@intel.com, syzkaller@googlegroups.com,
	ysk@kzalloc.com
Subject: Re: [RFC] mm/migrate: make sure folio_unlock() before folio_wait_writeback()
Date: Thu, 2 Oct 2025 12:42:48 +0100	[thread overview]
Message-ID: <aN5lOFVFfewXUijF@e129823.arm.com> (raw)
In-Reply-To: <20251002081612.53281-1-byungchul@sk.com>

Hi Byungchul,

> DEPT(Dependency Tracker) reported a deadlock:
>
>    ===================================================
>    DEPT: Circular dependency has been detected.
>    6.15.11-00046-g2c223fa7bd9a-dirty #13 Not tainted
>    ---------------------------------------------------
>    summary
>    ---------------------------------------------------
>    *** DEADLOCK ***
>
>    context A
>       [S] (unknown)(pg_locked_map:0)
>       [W] dept_page_wait_on_bit(pg_writeback_map:0)
>       [E] dept_page_clear_bit(pg_locked_map:0)
>
>    context B
>       [S] (unknown)(pg_writeback_map:0)
>       [W] dept_page_wait_on_bit(pg_locked_map:0)
>       [E] dept_page_clear_bit(pg_writeback_map:0)
>
>    [S]: start of the event context
>    [W]: the wait blocked
>    [E]: the event not reachable
>    ---------------------------------------------------
>    context A's detail
>    ---------------------------------------------------
>    context A
>       [S] (unknown)(pg_locked_map:0)
>       [W] dept_page_wait_on_bit(pg_writeback_map:0)
>       [E] dept_page_clear_bit(pg_locked_map:0)
>
>    [S] (unknown)(pg_locked_map:0):
>    (N/A)
>
>    [W] dept_page_wait_on_bit(pg_writeback_map:0):
>    [<ffff800080589c94>] folio_wait_bit+0x2c/0x38
>    stacktrace:
>          folio_wait_bit_common+0x824/0x8b8
>          folio_wait_bit+0x2c/0x38
>          folio_wait_writeback+0x5c/0xa4
>          migrate_pages_batch+0x5e4/0x1788
>          migrate_pages+0x15c4/0x1840
>          compact_zone+0x9c8/0x1d20
>          compact_node+0xd4/0x27c
>          sysctl_compaction_handler+0x104/0x194
>          proc_sys_call_handler+0x25c/0x3f8
>          proc_sys_write+0x20/0x2c
>          do_iter_readv_writev+0x350/0x448
>          vfs_writev+0x1ac/0x44c
>          do_pwritev+0x100/0x15c
>          __arm64_sys_pwritev2+0x6c/0xcc
>          invoke_syscall.constprop.0+0x64/0x18c
>          el0_svc_common.constprop.0+0x80/0x198
>
>    [E] dept_page_clear_bit(pg_locked_map:0):
>    [<ffff800080700914>] migrate_folio_undo_src+0x1b4/0x200
>    stacktrace:
>          migrate_folio_undo_src+0x1b4/0x200
>          migrate_pages_batch+0x1578/0x1788
>          migrate_pages+0x15c4/0x1840
>          compact_zone+0x9c8/0x1d20
>          compact_node+0xd4/0x27c
>          sysctl_compaction_handler+0x104/0x194
>          proc_sys_call_handler+0x25c/0x3f8
>          proc_sys_write+0x20/0x2c
>          do_iter_readv_writev+0x350/0x448
>          vfs_writev+0x1ac/0x44c
>          do_pwritev+0x100/0x15c
>          __arm64_sys_pwritev2+0x6c/0xcc
>          invoke_syscall.constprop.0+0x64/0x18c
>          el0_svc_common.constprop.0+0x80/0x198
>          do_el0_svc+0x28/0x3c
>          el0_svc+0x50/0x220
>    ---------------------------------------------------
>    context B's detail
>    ---------------------------------------------------
>    context B
>       [S] (unknown)(pg_writeback_map:0)
>       [W] dept_page_wait_on_bit(pg_locked_map:0)
>       [E] dept_page_clear_bit(pg_writeback_map:0)
>
>    [S] (unknown)(pg_writeback_map:0):
>    (N/A)
>
>    [W] dept_page_wait_on_bit(pg_locked_map:0):
>    [<ffff80008081e478>] bdev_getblk+0x58/0x120
>    stacktrace:
>          find_get_block_common+0x224/0xbc4
>          bdev_getblk+0x58/0x120
>          __ext4_get_inode_loc+0x194/0x98c
>          ext4_get_inode_loc+0x4c/0xcc
>          ext4_reserve_inode_write+0x74/0x158
>          __ext4_mark_inode_dirty+0xd4/0x4e0
>          __ext4_ext_dirty+0x118/0x164
>          ext4_ext_map_blocks+0x1578/0x2ca8
>          ext4_map_blocks+0x2a4/0xa60
>          ext4_convert_unwritten_extents+0x1b0/0x3c0
>          ext4_convert_unwritten_io_end_vec+0x90/0x1a0
>          ext4_end_io_end+0x58/0x194
>          ext4_end_io_rsv_work+0xc4/0x150
>          process_one_work+0x3b4/0xac0
>          worker_thread+0x2b0/0x53c
>          kthread+0x1a0/0x33c
>
>    [E] dept_page_clear_bit(pg_writeback_map:0):
>    [<ffff8000809dfc5c>] ext4_finish_bio+0x638/0x820
>    stacktrace:
>          folio_end_writeback+0x140/0x488
>          ext4_finish_bio+0x638/0x820
>          ext4_release_io_end+0x74/0x188
>          ext4_end_io_end+0xa0/0x194
>          ext4_end_io_rsv_work+0xc4/0x150
>          process_one_work+0x3b4/0xac0
>          worker_thread+0x2b0/0x53c
>          kthread+0x1a0/0x33c
>          ret_from_fork+0x10/0x20
>
> To simplify the scenario:
>
>    context X (wq worker)	context Y (process context)
>
> 				migrate_pages_batch()
>    ext4_end_io_end()		  ...
>      ...			  migrate_folio_unmap()
>      ext4_get_inode_loc()	    ...
>        ...			    folio_lock() // hold the folio lock
>        bdev_getblk()		    ...
>          ...			    folio_wait_writeback() // wait forever
>          __find_get_block_slow()
>            ...			    ...
>            folio_lock() // wait forever
>            folio_unlock()	  migrate_folio_undo_src()
> 				    ...
>      ...			    folio_unlock() // never reachable
>      ext4_finish_bio()
> 	...
> 	folio_end_writeback() // never reachable
>
> context X is waiting for the folio lock to be released by context Y,
> while context Y is waiting for the writeback to end in context X.
> Ultimately, the two contexts are each waiting for an event that will
> never happen, in other words, a deadlock.
>
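
Just to make the cycle above concrete, here is a tiny userspace model of
it (purely illustrative and full of assumptions: a pthread mutex stands
in for the folio lock bit, a condition variable for PG_writeback, and
nothing here is kernel code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t folio_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wb_done = PTHREAD_COND_INITIALIZER;
static int writeback = 1;		/* PG_writeback is set */
static atomic_int migrator_has_lock;

/* context Y: migrate_folio_unmap() */
static void *migrator(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&folio_lock);	/* folio_lock() */
	atomic_store(&migrator_has_lock, 1);

	/* folio_wait_writeback(), still holding the folio lock */
	pthread_mutex_lock(&wb_lock);
	while (writeback)
		pthread_cond_wait(&wb_done, &wb_lock);
	pthread_mutex_unlock(&wb_lock);

	pthread_mutex_unlock(&folio_lock);	/* never reached */
	return NULL;
}

/* context X: XXX_io_end() */
static void *io_end(void *arg)
{
	(void)arg;
	while (!atomic_load(&migrator_has_lock))
		;	/* let the migrator take the folio lock first */

	/* __find_get_block_slow() -> folio_lock(): blocks forever */
	pthread_mutex_lock(&folio_lock);
	pthread_mutex_unlock(&folio_lock);

	/* folio_end_writeback(): never reached */
	pthread_mutex_lock(&wb_lock);
	writeback = 0;
	pthread_cond_broadcast(&wb_done);
	pthread_mutex_unlock(&wb_lock);
	return NULL;
}

int main(void)
{
	pthread_t x, y;

	pthread_create(&y, NULL, migrator, NULL);
	pthread_create(&x, NULL, io_end, NULL);

	sleep(2);	/* both threads are blocked by now */
	puts("deadlock: migrator holds the folio lock waiting for writeback,");
	puts("          io_end waits for the folio lock to end writeback");
	return 1;
}

The point the model captures is that folio_wait_writeback() does not
drop the folio lock while waiting, so the side that would clear
PG_writeback can never get past folio_lock().
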
> *Only one* of the following two conditions should be allowed, or we
> cannot avoid this kind of deadlock:
>
>    1. while holding a folio lock (and heading for folio_unlock()),
>       waiting for a writeback to end,
>    2. while heading for the writeback end, waiting for the folio lock to
>       be released,
>
> Since allowing 2 and avoiding 1 sounds more sensible than the opposite,
> remove the first condition by making sure folio_unlock() is called
> before folio_wait_writeback() in migrate_folio_unmap().
>
> Fixes: 49d2e9cc45443 ("[PATCH] Swap Migration V5: migrate_pages() function")
> Reported-by: Yunseong Kim <ysk@kzalloc.com>
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Tested-by: Yunseong Kim <ysk@kzalloc.com>
> ---
>
> Hi,
>
> Thanks to Yunseong for reporting the issue, testing, and confirming
> that this patch resolves it.  We used the latest version of DEPT to
> detect the issue:
>
>    https://lore.kernel.org/all/20251002081247.51255-1-byungchul@sk.com/
>
> As I mentioned in the commit message above:
>
>    *Only one* of the following two conditions should be allowed, or we
>    cannot avoid this kind of deadlock:
>
>       1. while holding a folio lock (and heading for folio_unlock()),
>          waiting for a writeback to end,
>       2. while heading for the writeback end, waiting for the folio lock
>          to be released,
>
> Honestly, I'm not convinced which of the two we should choose; I chose
> 'allowing 2 and avoiding 1' to resolve this issue, though.
>
> However, please let me know if I was wrong and we should go for
> 'allowing 1 and avoiding 2' instead.  If so, I should try a different
> approach, for example, avoiding folio_lock() or using folio_trylock()
> on the path heading for the writeback end in ext4_end_io_end() or the
> like.
>
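
For what it's worth, 'allowing 1 and avoiding 2' would roughly mean that
the path heading for the writeback end never blocks on the folio lock,
e.g. trylock and defer.  Here is a rough userspace sketch of that shape
(everything below is hypothetical and only for illustration: the stub
folio, requeue_io_end_work() and finish_io_end() are made up, not the
actual ext4 code):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* stub "folio": a plain mutex stands in for the folio lock bit */
struct folio {
	pthread_mutex_t lock;
};

static bool folio_trylock(struct folio *folio)
{
	return pthread_mutex_trylock(&folio->lock) == 0;
}

static void folio_unlock(struct folio *folio)
{
	pthread_mutex_unlock(&folio->lock);
}

/* hypothetical stand-ins for the real io-end machinery */
static void requeue_io_end_work(struct folio *folio)
{
	(void)folio;
	printf("folio lock contended: requeue the io-end work, don't block\n");
}

static void finish_io_end(struct folio *folio)
{
	(void)folio;
	printf("finished the io-end work under the folio lock\n");
}

/* the io-end step that needs the folio lock never waits for it */
static void io_end_step(struct folio *folio)
{
	if (!folio_trylock(folio)) {
		/* e.g. migration holds the lock; deferring here is what
		 * keeps condition 2 from ever arising */
		requeue_io_end_work(folio);
		return;
	}
	finish_io_end(folio);
	folio_unlock(folio);
}

int main(void)
{
	struct folio f = { .lock = PTHREAD_MUTEX_INITIALIZER };

	pthread_mutex_lock(&f.lock);	/* someone (migration) holds the lock */
	io_end_step(&f);		/* -> requeued, no blocking */
	pthread_mutex_unlock(&f.lock);

	io_end_step(&f);		/* -> finishes under the lock */
	return 0;
}

Whether the io-end work could really be requeued at that point is a
separate question, of course.
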
> To Yunseong,
>
> The link you shared for a system hang is:
>
>    https://gist.github.com/kzall0c/a6091bb2fd536865ca9aabfd017a1fc5
>
> I think an important stacktrace for this issue, that is, the one
> waiting for PG_writeback, is missing from the log.
>
> 	Byungchul
>
> ---
>  mm/migrate.c | 57 +++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 41 insertions(+), 16 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9e5ef39ce73a..60b0b054f27a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1215,6 +1215,17 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>
>  	dst->private = NULL;
>
> +retry_wait_writeback:
> +	/*
> +	 * Only in the case of a full synchronous migration is it
> +	 * necessary to wait for PageWriteback.  In the async case, the
> +	 * retry loop is too short and in the sync-light case, the
> +	 * overhead of stalling is too much.  Plus, do not write-back if
> +	 * it's in the middle of direct compaction
> +	 */
> +	if (folio_test_writeback(src) && mode == MIGRATE_SYNC)
> +		folio_wait_writeback(src);
> +
>  	if (!folio_trylock(src)) {
>  		if (mode == MIGRATE_ASYNC)
>  			goto out;
> @@ -1245,27 +1256,41 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>
>  		folio_lock(src);
>  	}
> -	locked = true;
> -	if (folio_test_mlocked(src))
> -		old_page_state |= PAGE_WAS_MLOCKED;
>
>  	if (folio_test_writeback(src)) {
> -		/*
> -		 * Only in the case of a full synchronous migration is it
> -		 * necessary to wait for PageWriteback. In the async case,
> -		 * the retry loop is too short and in the sync-light case,
> -		 * the overhead of stalling is too much
> -		 */
> -		switch (mode) {
> -		case MIGRATE_SYNC:
> -			break;
> -		default:
> -			rc = -EBUSY;
> -			goto out;
> +		if (mode == MIGRATE_SYNC) {
> +			/*
> +			 * folio_unlock() is required before trying
> +			 * folio_wait_writeback().  Or it leads to a
> +			 * deadlock like:
> +			 *
> +			 *   context x		context y
> +			 *   in XXX_io_end()	in migrate_folio_unmap()
> +			 *
> +			 *   ...		...
> +			 *   bdev_getblk();	folio_lock();
> +			 *
> +			 *     // wait forever	// wait forever
> +			 *     folio_lock();	folio_wait_writeback();
> +			 *
> +			 *     ...		...
> +			 *     folio_unlock();
> +			 *   ...		// never reachable
> +			 *			folio_unlock();
> +			 *   // never reachable
> +			 *   folio_end_writeback();
> +			 */
> +			folio_unlock(src);
> +			goto retry_wait_writeback;
>  		}
> -		folio_wait_writeback(src);
> +		rc = -EBUSY;
> +		goto out;
>  	}
>
> +	locked = true;
> +	if (folio_test_mlocked(src))
> +		old_page_state |= PAGE_WAS_MLOCKED;
> +
>  	/*
>  	 * By try_to_migrate(), src->mapcount goes down to 0 here. In this case,
>  	 * we cannot notice that anon_vma is freed while we migrate a page.

Hmm, I still have concerns about this change.
Disallowing (1) seems to imply that the use of WB_SYNC_ALL by
mpage_writebacks() is also incorrect.  In addition, this change could
introduce another theoretical livelock when the folio enters writeback
frequently.
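
To illustrate that concern, below is a minimal userspace model of the
unlock-and-retry loop the patch introduces (again only a sketch with
big assumptions: C11 atomics stand in for PG_writeback and the folio
lock bit, and a "flusher" thread re-sets writeback as soon as it
clears).  Under that adversarial timing the migrator typically makes no
forward progress before the time budget runs out:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static atomic_bool writeback = true;		/* PG_writeback stand-in */
static atomic_flag folio_locked = ATOMIC_FLAG_INIT;
static atomic_bool stop;

/* writeback keeps completing and being re-queued with almost no gap */
static void *flusher(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		atomic_store(&writeback, false);	/* folio_end_writeback() */
		atomic_store(&writeback, true);		/* re-dirtied, queued again */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	unsigned long retries = 0;
	time_t deadline = time(NULL) + 2;

	pthread_create(&t, NULL, flusher, NULL);

	while (time(NULL) < deadline) {
		/* retry_wait_writeback: wait with the folio unlocked */
		while (atomic_load(&writeback) && time(NULL) < deadline)
			;				/* folio_wait_writeback() */
		if (atomic_flag_test_and_set(&folio_locked))
			continue;			/* folio_trylock() failed */
		if (atomic_load(&writeback)) {
			/* under writeback again: unlock and start over */
			atomic_flag_clear(&folio_locked);
			retries++;
			continue;
		}
		printf("migration proceeded after %lu retries\n", retries);
		atomic_flag_clear(&folio_locked);
		goto done;
	}
	printf("no progress after %lu retries (theoretical livelock)\n", retries);
done:
	atomic_store(&stop, true);
	pthread_join(t, NULL);
	return 0;
}

Real timing is of course far less hostile; the point is only that
nothing bounds the number of retries.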

AFAIK, while a folio is under writeback, its related buffers won't be
freed by migration, and since try_to_free_buffers() checks the
writeback state first, taking folio_lock() shouldn't be necessary in
bdev_getblk().

Therefore, it seems sufficient to check whether the folio is under
writeback in __find_get_block_slow(), e.g.:

diff --git a/fs/buffer.c b/fs/buffer.c
index 6a8752f7bbed..804d33df6b0f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -194,6 +194,12 @@ __find_get_block_slow(struct block_device *bdev, sector_t block, bool atomic)
        if (IS_ERR(folio))
                goto out;

+       /* a folio under writeback keeps its buffers; skip the folio lock */
+       if (folio_test_writeback(folio)) {
+               folio_put(folio);
+               goto out;
+       }
+
        /*
         * Folio lock protects the buffers. Callers that cannot block
         * will fallback to serializing vs try_to_free_buffers() via

Am I missing something?

--
Sincerely,
Yeoreum Yun

