* [PATCH] mm: memcg: fix split queue list crash when large folio migration
From: Baolin Wang @ 2023-12-20  6:51 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, roman.gushchin, shakeelb, muchun.song, nphamcs,
	david, ying.huang, shy828301, ziy, baolin.wang, linux-mm,
	linux-kernel, cgroups

When running autonuma with multi-size THP enabled, I encountered the
following kernel crash:

[  134.290216] list_del corruption. prev->next should be fffff9ad42e1c490,
but was dead000000000100. (prev=fffff9ad42399890)
[  134.290877] kernel BUG at lib/list_debug.c:62!
[  134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted:
G            E      6.7.0-rc4+ #20
[  134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0
......
[  134.294252] Call Trace:
[  134.294362]  <TASK>
[  134.294440]  ? die+0x33/0x90
[  134.294561]  ? do_trap+0xe0/0x110
......
[  134.295681]  ? __list_del_entry_valid_or_report+0x97/0xb0
[  134.295842]  folio_undo_large_rmappable+0x99/0x100
[  134.296003]  destroy_large_folio+0x68/0x70
[  134.296172]  migrate_folio_move+0x12e/0x260
[  134.296264]  ? __pfx_remove_migration_pte+0x10/0x10
[  134.296389]  migrate_pages_batch+0x495/0x6b0
[  134.296523]  migrate_pages+0x1d0/0x500
[  134.296646]  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
[  134.296799]  migrate_misplaced_folio+0x12d/0x2b0
[  134.296953]  do_numa_page+0x1f4/0x570
[  134.297121]  __handle_mm_fault+0x2b0/0x6c0
[  134.297254]  handle_mm_fault+0x107/0x270
[  134.300897]  do_user_addr_fault+0x167/0x680
[  134.304561]  exc_page_fault+0x65/0x140
[  134.307919]  asm_exc_page_fault+0x22/0x30

The reason for the crash is that commit 85ce2c517ade ("memcontrol: only
transfer the memcg data for migration") removed the charging and uncharging
of the migration folios and cleared the memcg data of the old folio.

When the old large folio is subsequently released in destroy_large_folio()
and needs to be removed from the split queue, the wrong split queue
(pgdat->deferred_split_queue) is looked up, because the old folio's memcg is
NULL by now. The list is then manipulated under the wrong split queue lock,
resulting in the list corruption shown above.
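
For reference, the split queue lookup that goes wrong here looks roughly like
the following (a simplified sketch of the get_deferred_split_queue() helper in
mm/huge_memory.c, CONFIG_MEMCG case; shown only for illustration, not part of
this patch):

static struct deferred_split *get_deferred_split_queue(struct folio *folio)
{
	struct mem_cgroup *memcg = folio_memcg(folio);
	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));

	/*
	 * With the old folio's memcg_data already cleared, memcg is NULL
	 * and we silently fall back to the node's queue (and its lock).
	 */
	if (memcg)
		return &memcg->deferred_split_queue;
	else
		return &pgdat->deferred_split_queue;
}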

After the migration, the old folio is going to be freed anyway, so we can
remove it from the split queue in mem_cgroup_migrate() a bit earlier, before
the memcg data is cleared, to avoid looking up the wrong split queue. Since
folio_undo_large_rmappable() can now be called a second time from
destroy_large_folio(), also switch list_del() to list_del_init() so that the
second call finds the deferred list empty and does nothing.

Fixes: 85ce2c517ade ("memcontrol: only transfer the memcg data for migration")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/huge_memory.c |  2 +-
 mm/memcontrol.c  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6be1a380a298..c50dc2e1483f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3124,7 +3124,7 @@ void folio_undo_large_rmappable(struct folio *folio)
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
-		list_del(&folio->_deferred_list);
+		list_del_init(&folio->_deferred_list);
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae8c62c7aa53..e66e0811cccc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7575,6 +7575,17 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
+	/*
+	 * If the old folio a large folio and is in the split queue, it needs
+	 * to be removed from the split queue now, in case getting an incorrect
+	 * split queue in destroy_large_folio() after the memcg of the old folio
+	 * is cleared.
+	 *
+	 * In addition, the old folio is about to be freed after migration, so
+	 * removing from the split queue a bit earlier seems reasonable.
+	 */
+	if (folio_test_large(old) && folio_test_large_rmappable(old))
+		folio_undo_large_rmappable(old);
 	old->memcg_data = 0;
 }
 
-- 
2.39.3




* Re: [PATCH] mm: memcg: fix split queue list crash when large folio migration
From: Nhat Pham @ 2023-12-20 18:00 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeelb, muchun.song,
	david, ying.huang, shy828301, ziy, linux-mm, linux-kernel,
	cgroups

On Tue, Dec 19, 2023 at 10:52 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> When running autonuma with enabling multi-size THP, I encountered the following
> kernel crash issue:
>
> [  134.290216] list_del corruption. prev->next should be fffff9ad42e1c490,
> but was dead000000000100. (prev=fffff9ad42399890)
> [  134.290877] kernel BUG at lib/list_debug.c:62!
> [  134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [  134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted:
> G            E      6.7.0-rc4+ #20
> [  134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0
> ......
> [  134.294252] Call Trace:
> [  134.294362]  <TASK>
> [  134.294440]  ? die+0x33/0x90
> [  134.294561]  ? do_trap+0xe0/0x110
> ......
> [  134.295681]  ? __list_del_entry_valid_or_report+0x97/0xb0
> [  134.295842]  folio_undo_large_rmappable+0x99/0x100
> [  134.296003]  destroy_large_folio+0x68/0x70
> [  134.296172]  migrate_folio_move+0x12e/0x260
> [  134.296264]  ? __pfx_remove_migration_pte+0x10/0x10
> [  134.296389]  migrate_pages_batch+0x495/0x6b0
> [  134.296523]  migrate_pages+0x1d0/0x500
> [  134.296646]  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
> [  134.296799]  migrate_misplaced_folio+0x12d/0x2b0
> [  134.296953]  do_numa_page+0x1f4/0x570
> [  134.297121]  __handle_mm_fault+0x2b0/0x6c0
> [  134.297254]  handle_mm_fault+0x107/0x270
> [  134.300897]  do_user_addr_fault+0x167/0x680
> [  134.304561]  exc_page_fault+0x65/0x140
> [  134.307919]  asm_exc_page_fault+0x22/0x30
>
> The reason for the crash is that, the commit 85ce2c517ade ("memcontrol: only
> transfer the memcg data for migration") removed the charging and uncharging
> operations of the migration folios and cleared the memcg data of the old folio.
>
> During the subsequent release process of the old large folio in destroy_large_folio(),
> if the large folio needs to be removed from the split queue, an incorrect split
> queue can be obtained (which is pgdat->deferred_split_queue) because the old
> folio's memcg is NULL now. This can lead to list operations being performed
> under the wrong split queue lock protection, resulting in a list crash as above.

Ah this is tricky. I think you're right - the old folio's memcg is
used to get the deferred split queue, and we cleared it here :)
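
To spell it out, the release path that trips over this looks roughly like the
sketch below (paraphrased from destroy_large_folio() around v6.7-rc, heavily
simplified, freeing tail elided):

void destroy_large_folio(struct folio *folio)
{
	if (folio_test_hugetlb(folio)) {
		free_huge_folio(folio);
		return;
	}

	/*
	 * old->memcg_data is already 0 at this point, so the memcg-based
	 * split queue lookup inside folio_undo_large_rmappable() falls back
	 * to pgdat->deferred_split_queue -- the wrong queue, protected by
	 * the wrong lock.
	 */
	if (folio_test_large_rmappable(folio))
		folio_undo_large_rmappable(folio);

	/* ... uncharge and free the underlying pages ... */
}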

>
> After the migration, the old folio is going to be freed, so we can remove it
> from the split queue in mem_cgroup_migrate() a bit earlier before clearing the
> memcg data to avoid getting incorrect split queue.
>
> Fixes: 85ce2c517ade ("memcontrol: only transfer the memcg data for migration")
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/huge_memory.c |  2 +-
>  mm/memcontrol.c  | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6be1a380a298..c50dc2e1483f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3124,7 +3124,7 @@ void folio_undo_large_rmappable(struct folio *folio)
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (!list_empty(&folio->_deferred_list)) {
>                 ds_queue->split_queue_len--;
> -               list_del(&folio->_deferred_list);
> +               list_del_init(&folio->_deferred_list);
>         }
>         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae8c62c7aa53..e66e0811cccc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7575,6 +7575,17 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>         /* Transfer the charge and the css ref */
>         commit_charge(new, memcg);
> +       /*
> +        * If the old folio a large folio and is in the split queue, it needs
> +        * to be removed from the split queue now, in case getting an incorrect
> +        * split queue in destroy_large_folio() after the memcg of the old folio
> +        * is cleared.
> +        *
> +        * In addition, the old folio is about to be freed after migration, so
> +        * removing from the split queue a bit earlier seems reasonable.
> +        */
> +       if (folio_test_large(old) && folio_test_large_rmappable(old))
> +               folio_undo_large_rmappable(old);

This looks reasonable to me :)
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

>         old->memcg_data = 0;
>  }
>
> --
> 2.39.3
>



* Re: [PATCH] mm: memcg: fix split queue list crash when large folio migration
From: Yang Shi @ 2023-12-20 23:31 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeelb, muchun.song,
	nphamcs, david, ying.huang, ziy, linux-mm, linux-kernel, cgroups

On Tue, Dec 19, 2023 at 10:52 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> When running autonuma with enabling multi-size THP, I encountered the following
> kernel crash issue:
>
> [  134.290216] list_del corruption. prev->next should be fffff9ad42e1c490,
> but was dead000000000100. (prev=fffff9ad42399890)
> [  134.290877] kernel BUG at lib/list_debug.c:62!
> [  134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [  134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted:
> G            E      6.7.0-rc4+ #20
> [  134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0
> ......
> [  134.294252] Call Trace:
> [  134.294362]  <TASK>
> [  134.294440]  ? die+0x33/0x90
> [  134.294561]  ? do_trap+0xe0/0x110
> ......
> [  134.295681]  ? __list_del_entry_valid_or_report+0x97/0xb0
> [  134.295842]  folio_undo_large_rmappable+0x99/0x100
> [  134.296003]  destroy_large_folio+0x68/0x70
> [  134.296172]  migrate_folio_move+0x12e/0x260
> [  134.296264]  ? __pfx_remove_migration_pte+0x10/0x10
> [  134.296389]  migrate_pages_batch+0x495/0x6b0
> [  134.296523]  migrate_pages+0x1d0/0x500
> [  134.296646]  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
> [  134.296799]  migrate_misplaced_folio+0x12d/0x2b0
> [  134.296953]  do_numa_page+0x1f4/0x570
> [  134.297121]  __handle_mm_fault+0x2b0/0x6c0
> [  134.297254]  handle_mm_fault+0x107/0x270
> [  134.300897]  do_user_addr_fault+0x167/0x680
> [  134.304561]  exc_page_fault+0x65/0x140
> [  134.307919]  asm_exc_page_fault+0x22/0x30
>
> The reason for the crash is that, the commit 85ce2c517ade ("memcontrol: only
> transfer the memcg data for migration") removed the charging and uncharging
> operations of the migration folios and cleared the memcg data of the old folio.
>
> During the subsequent release process of the old large folio in destroy_large_folio(),
> if the large folio needs to be removed from the split queue, an incorrect split
> queue can be obtained (which is pgdat->deferred_split_queue) because the old
> folio's memcg is NULL now. This can lead to list operations being performed
> under the wrong split queue lock protection, resulting in a list crash as above.
>
> After the migration, the old folio is going to be freed, so we can remove it
> from the split queue in mem_cgroup_migrate() a bit earlier before clearing the
> memcg data to avoid getting incorrect split queue.

Nice catch! The fix looks good to me.
Reviewed-by: Yang Shi <shy828301@gmail.com>

>
> Fixes: 85ce2c517ade ("memcontrol: only transfer the memcg data for migration")
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/huge_memory.c |  2 +-
>  mm/memcontrol.c  | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6be1a380a298..c50dc2e1483f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3124,7 +3124,7 @@ void folio_undo_large_rmappable(struct folio *folio)
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (!list_empty(&folio->_deferred_list)) {
>                 ds_queue->split_queue_len--;
> -               list_del(&folio->_deferred_list);
> +               list_del_init(&folio->_deferred_list);
>         }
>         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae8c62c7aa53..e66e0811cccc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7575,6 +7575,17 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>         /* Transfer the charge and the css ref */
>         commit_charge(new, memcg);
> +       /*
> +        * If the old folio a large folio and is in the split queue, it needs
> +        * to be removed from the split queue now, in case getting an incorrect
> +        * split queue in destroy_large_folio() after the memcg of the old folio
> +        * is cleared.
> +        *
> +        * In addition, the old folio is about to be freed after migration, so
> +        * removing from the split queue a bit earlier seems reasonable.
> +        */
> +       if (folio_test_large(old) && folio_test_large_rmappable(old))
> +               folio_undo_large_rmappable(old);
>         old->memcg_data = 0;
>  }
>
> --
> 2.39.3
>



* Re: [PATCH] mm: memcg: fix split queue list crash when large folio migration
From: Zi Yan @ 2023-12-27 21:41 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeelb, muchun.song,
	nphamcs, david, ying.huang, shy828301, linux-mm, linux-kernel,
	cgroups

On 20 Dec 2023, at 1:51, Baolin Wang wrote:

> When running autonuma with enabling multi-size THP, I encountered the following
> kernel crash issue:
>
> [  134.290216] list_del corruption. prev->next should be fffff9ad42e1c490,
> but was dead000000000100. (prev=fffff9ad42399890)
> [  134.290877] kernel BUG at lib/list_debug.c:62!
> [  134.291052] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [  134.291210] CPU: 56 PID: 8037 Comm: numa01 Kdump: loaded Tainted:
> G            E      6.7.0-rc4+ #20
> [  134.291649] RIP: 0010:__list_del_entry_valid_or_report+0x97/0xb0
> ......
> [  134.294252] Call Trace:
> [  134.294362]  <TASK>
> [  134.294440]  ? die+0x33/0x90
> [  134.294561]  ? do_trap+0xe0/0x110
> ......
> [  134.295681]  ? __list_del_entry_valid_or_report+0x97/0xb0
> [  134.295842]  folio_undo_large_rmappable+0x99/0x100
> [  134.296003]  destroy_large_folio+0x68/0x70
> [  134.296172]  migrate_folio_move+0x12e/0x260
> [  134.296264]  ? __pfx_remove_migration_pte+0x10/0x10
> [  134.296389]  migrate_pages_batch+0x495/0x6b0
> [  134.296523]  migrate_pages+0x1d0/0x500
> [  134.296646]  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
> [  134.296799]  migrate_misplaced_folio+0x12d/0x2b0
> [  134.296953]  do_numa_page+0x1f4/0x570
> [  134.297121]  __handle_mm_fault+0x2b0/0x6c0
> [  134.297254]  handle_mm_fault+0x107/0x270
> [  134.300897]  do_user_addr_fault+0x167/0x680
> [  134.304561]  exc_page_fault+0x65/0x140
> [  134.307919]  asm_exc_page_fault+0x22/0x30
>
> The reason for the crash is that, the commit 85ce2c517ade ("memcontrol: only
> transfer the memcg data for migration") removed the charging and uncharging
> operations of the migration folios and cleared the memcg data of the old folio.
>
> During the subsequent release process of the old large folio in destroy_large_folio(),
> if the large folio needs to be removed from the split queue, an incorrect split
> queue can be obtained (which is pgdat->deferred_split_queue) because the old
> folio's memcg is NULL now. This can lead to list operations being performed
> under the wrong split queue lock protection, resulting in a list crash as above.
>
> After the migration, the old folio is going to be freed, so we can remove it
> from the split queue in mem_cgroup_migrate() a bit earlier before clearing the
> memcg data to avoid getting incorrect split queue.
>
> Fixes: 85ce2c517ade ("memcontrol: only transfer the memcg data for migration")
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/huge_memory.c |  2 +-
>  mm/memcontrol.c  | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6be1a380a298..c50dc2e1483f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3124,7 +3124,7 @@ void folio_undo_large_rmappable(struct folio *folio)
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	if (!list_empty(&folio->_deferred_list)) {
>  		ds_queue->split_queue_len--;
> -		list_del(&folio->_deferred_list);
> +		list_del_init(&folio->_deferred_list);
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae8c62c7aa53..e66e0811cccc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7575,6 +7575,17 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>  	/* Transfer the charge and the css ref */
>  	commit_charge(new, memcg);
> +	/*
> +	 * If the old folio a large folio and is in the split queue, it needs

should be:
If the old folio is a large folio and in the split queue


> +	 * to be removed from the split queue now, in case getting an incorrect
> +	 * split queue in destroy_large_folio() after the memcg of the old folio
> +	 * is cleared.
> +	 *
> +	 * In addition, the old folio is about to be freed after migration, so
> +	 * removing from the split queue a bit earlier seems reasonable.
               ^
               it
> +	 */
> +	if (folio_test_large(old) && folio_test_large_rmappable(old))
> +		folio_undo_large_rmappable(old);
>  	old->memcg_data = 0;
>  }
>
> -- 
> 2.39.3

Otherwise, LGTM. Thanks. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
