* [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
@ 2025-12-16 20:07 Bijan Tabatabai
  2025-12-17  0:07 ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 10+ messages in thread
From: Bijan Tabatabai @ 2025-12-16 20:07 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, shivankg, Bijan Tabatabai

Currently, folio_expected_ref_count() only adds references for the swap
cache if the folio is anonymous. However, according to the comment above
the definition of PG_swapcache in enum pageflags, shmem folios can also
have PG_swapcache set. This patch makes sure references for the swap
cache are added if folio_test_swapcache(folio) is true.

This issue was found when trying to hot-unplug memory in a QEMU/KVM
virtual machine. When initiating hot-unplug when most of the guest
memory is allocated, hot-unplug hangs partway through removal due to
migration failures. The following message would be printed several
times, and would be printed again about every five seconds:

[   49.641309] migrating pfn b12f25 failed ret:7
[   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
[   49.641311] aops:swap_aops
[   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
[   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
[   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
[   49.641315] page dumped because: migration failure

When debugging this, I found that these migration failures were due to
__migrate_folio() returning -EAGAIN for a small set of folios because
the expected reference count it calculates via folio_expected_ref_count()
is one less than the actual reference count of the folios. Furthermore,
none of the affected folios were anonymous, but all had the PG_swapcache
flag set, which inspired this patch. After applying this patch, memory
hot-unplug behaves as expected.
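
To make the off-by-one concrete: the failing folios are order-0, unmapped,
in the swap cache, and not (yet) anonymous, so folio->mapping is NULL. If I
read the migration code correctly, __migrate_folio() compares the folio's
actual reference count against folio_expected_ref_count() plus one for the
reference held by the caller, roughly (simplified sketch, not the exact code):

	expected = folio_expected_ref_count(folio) + 1;	/* +1: isolation ref held by the caller */
	if (folio_ref_count(folio) != expected)
		return -EAGAIN;

	/*
	 * Before this patch, for such a folio: not anon, folio->mapping == NULL,
	 * no PG_private -> expected = 0 + 1 = 1, but the actual refcount is 2
	 * (swap cache + caller), so migration keeps failing with -EAGAIN.
	 * With this patch the swap cache reference is counted and expected = 2.
	 */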

I tested this on a machine running Ubuntu 24.04 with kernel version
6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
mm-unstable branch as of Dec 16, 2025 was also tested and behaves the
same) and 48GB of memory. The libvirt XML definition for the VM can be
found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
the guest kernel so the hot-pluggable memory is automatically onlined.

Below are the steps to reproduce this behavior:

1) Define and start the virtual machine
  host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
  host$ virsh -c qemu:///system start test_vm

2) Setup swap in the guest
  guest$ sudo fallocate -l 32G /swapfile
  guest$ sudo chmod 0600 /swapfile
  guest$ sudo mkswap /swapfile
  guest$ sudo swapon /swapfile

3) Use alloc_data [2] to allocate most of the remaining guest memory
  guest$ ./alloc_data 45

4) In a separate guest terminal, monitor the amount of used memory
  guest$ watch -n1 free -h

5) When alloc_data has finished allocating, initiate the memory
hot-unplug using the provided xml file [3]
  host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live

After initiating the memory hot-unplug, you should see the amount of
available memory in the guest decrease, and the amount of used swap data
increase. If everything works as expected, when all of the memory is
unplugged, there should be around 8.5-9GB of data in swap. If the
unplugging is unsuccessful, the amount of used swap data will settle
below that. If that happens, you should be able to see log messages in
dmesg similar to the one posted above.
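
For reference, alloc_data [2] is just a small allocate-and-touch helper; it
is roughly along the lines of the sketch below (the real source is at [2],
this is only an approximation):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		size_t gib = argc > 1 ? strtoull(argv[1], NULL, 10) : 1;
		size_t len = gib << 30;
		char *buf = malloc(len);

		if (!buf) {
			perror("malloc");
			return 1;
		}

		/* Touch every byte so the pages are actually populated. */
		memset(buf, 0x5a, len);

		printf("allocated %zu GiB, sleeping\n", gib);
		pause();	/* keep the memory resident until killed */
		return 0;
	}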

[1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
[2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
[3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml

Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
---

I am not very familiar with the memory hot-(un)plug or swapping code, so
I am not 100% certain if this patch actually solves the root of the
problem. I believe the issue is from shmem folios, in which case I believe
this patch is correct. However, I couldn't think of an easy way to confirm
that the affected folios were from shmem. I guess it is possible that the
root cause is some bug where some anonymous pages do not return true from
folio_test_anon(). I don't think that's the case, but figured the MM
maintainers would have a better idea of what's going on.

---
 include/linux/mm.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15076261d0c2..6f959d8ca4b4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2459,10 +2459,10 @@ static inline int folio_expected_ref_count(const struct folio *folio)
 	if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio)))
 		return 0;
 
-	if (folio_test_anon(folio)) {
-		/* One reference per page from the swapcache. */
-		ref_count += folio_test_swapcache(folio) << order;
-	} else {
+	/* One reference per page from the swapcache. */
+	ref_count += folio_test_swapcache(folio) << order;
+
+	if (!folio_test_anon(folio)) {
 		/* One reference per page from the pagecache. */
 		ref_count += !!folio->mapping << order;
 		/* One reference from PG_private. */
-- 
2.43.0




* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-16 20:07 [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count() Bijan Tabatabai
@ 2025-12-17  0:07 ` David Hildenbrand (Red Hat)
  2025-12-17  0:34   ` Zi Yan
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-17  0:07 UTC (permalink / raw)
  To: Bijan Tabatabai, linux-mm, linux-kernel
  Cc: akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, shivankg

On 12/16/25 21:07, Bijan Tabatabai wrote:
> Currently, folio_expected_ref_count() only adds references for the swap
> cache if the folio is anonymous. However, according to the comment above
> the definition of PG_swapcache in enum pageflags, shmem folios can also
> have PG_swapcache set. This patch makes sure references for the swap
> cache are added if folio_test_swapcache(folio) is true.
> 
> This issue was found when trying to hot-unplug memory in a QEMU/KVM
> virtual machine. When initiating hot-unplug when most of the guest
> memory is allocated, hot-unplug hangs partway through removal due to
> migration failures. The following message would be printed several
> times, and would be printed again about every five seconds:
> 
> [   49.641309] migrating pfn b12f25 failed ret:7
> [   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
> [   49.641311] aops:swap_aops
> [   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
> [   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
> [   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
> [   49.641315] page dumped because: migration failure
> 
> When debugging this, I found that these migration failures were due to
> __migrate_folio() returning -EAGAIN for a small set of folios because
> the expected reference count it calculates via folio_expected_ref_count()
> is one less than the actual reference count of the folios. Furthermore,
> all of the affected folios were not anonymous, but had the PG_swapcache
> flag set, inspiring this patch. After applying this patch, the memory
> hot-unplug behaves as expected.
> 
> I tested this on a machine running Ubuntu 24.04 with kernel version
> 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
> and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
> mm-unstable branch as a Dec 16, 2025 was also tested and behaves the
> same) and 48GB of memory. The libvirt XML definition for the VM can be
> found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
> the guest kernel so the hot-pluggable memory is automatically onlined.
> 
> Below are the steps to reproduce this behavior:
> 
> 1) Define and start and virtual machine
>    host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
>    host$ virsh -c qemu:///system start test_vm
> 
> 2) Setup swap in the guest
>    guest$ sudo fallocate -l 32G /swapfile
>    guest$ sudo chmod 0600 /swapfile
>    guest$ sudo mkswap /swapfile
>    guest$ sudo swapon /swapfile
> 
> 3) Use alloc_data [2] to allocate most of the remaining guest memory
>    guest$ ./alloc_data 45
> 
> 4) In a separate guest terminal, monitor the amount of used memory
>    guest$ watch -n1 free -h
> 
> 5) When alloc_data has finished allocating, initiate the memory
> hot-unplug using the provided xml file [3]
>    host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
> 
> After initiating the memory hot-unplug, you should see the amount of
> available memory in the guest decrease, and the amount of used swap data
> increase. If everything works as expected, when all of the memory is
> unplugged, there should be around 8.5-9GB of data in swap. If the
> unplugging is unsuccessful, the amount of used swap data will settle
> below that. If that happens, you should be able to see log messages in
> dmesg similar to the one posted above.
> 
> [1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
> [2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
> [3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
> 
> Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
> Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
> ---
> 
> I am not very familiar with the memory hot-(un)plug or swapping code, so
> I am not 100% certain if this patch actually solves the root of the
> problem. I believe the issue is from shmem folios, in which case I believe
> this patch is correct. However, I couldn't think of an easy way to confirm
> that the affected folios were from shmem. I guess it could be possible that
> the root cause could be from some bug where some anonymous pages do not
> return true to folio_test_anon(). I don't think that's the case, but
> figured the MM maintainers would have a better idea of what's going on.
> 
> ---
>   include/linux/mm.h | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 15076261d0c2..6f959d8ca4b4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2459,10 +2459,10 @@ static inline int folio_expected_ref_count(const struct folio *folio)
>   	if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio)))
>   		return 0;
>   
> -	if (folio_test_anon(folio)) {
> -		/* One reference per page from the swapcache. */
> -		ref_count += folio_test_swapcache(folio) << order;
> -	} else {
> +	/* One reference per page from the swapcache. */
> +	ref_count += folio_test_swapcache(folio) << order;
> +
> +	if (!folio_test_anon(folio)) {
>   		/* One reference per page from the pagecache. */
>   		ref_count += !!folio->mapping << order;
>   		/* One reference from PG_private. */

We discussed that recently [1] and I think Zi wanted to send a patch. We 
were a bit confused about the semantics of folio_test_swapcache(), but 
concluded that it should be fine when called against pagecache folios.

So far I thought 86ebd50224c0 did not result in the issue because it 
replaced

-static int folio_expected_refs(struct address_space *mapping,
-               struct folio *folio)
-{
-       int refs = 1;
-       if (!mapping)
-               return refs;
-
-       refs += folio_nr_pages(folio);
-       if (folio_test_private(folio))
-               refs++;
-
-       return refs;
-}

in migration code, where !mapping would only have returned 1 (the
reference held by the caller), which folio_expected_ref_count() now
expects to be added in the caller.


But looking again, in the caller, we obtain

	mapping = folio_mapping(src)

Which returns the swap_address_space() for folios in the swapcache.


So it indeed looks like 86ebd50224c0 introduced the issue.
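
To spell it out, the relevant bits in the migration path now look roughly
like this (paraphrased from memory, not the exact code):

	mapping = folio_mapping(src);	/* swap_address_space() for swapcache folios */
	...
	expected_count = folio_expected_ref_count(src) + 1;	/* +1: reference held by the caller */
	if (folio_ref_count(src) != expected_count)
		return -EAGAIN;

So for a swapcache folio that is not (yet) anon, folio_expected_ref_count()
misses the swapcache reference and expected_count comes out one too low.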

Thanks!

We should cc: stable


Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>


[1] https://lore.kernel.org/all/33A929D1-7438-43C1-AA4A-398183976F8F@nvidia.com/
[2] https://lore.kernel.org/all/66C159D8-D267-4B3B-9384-1CE94533990E@nvidia.com/

-- 
Cheers

David



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-17  0:07 ` David Hildenbrand (Red Hat)
@ 2025-12-17  0:34   ` Zi Yan
  2025-12-17  1:04     ` David Hildenbrand (Red Hat)
  2025-12-17  6:04     ` Kairui Song
  0 siblings, 2 replies; 10+ messages in thread
From: Zi Yan @ 2025-12-17  0:34 UTC (permalink / raw)
  To: Bijan Tabatabai, David Hildenbrand (Red Hat)
  Cc: linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, shivankg, Baolin Wang,
	Hugh Dickins, Chris Li, Kairui Song

On 16 Dec 2025, at 19:07, David Hildenbrand (Red Hat) wrote:

> On 12/16/25 21:07, Bijan Tabatabai wrote:
>> Currently, folio_expected_ref_count() only adds references for the swap
>> cache if the folio is anonymous. However, according to the comment above
>> the definition of PG_swapcache in enum pageflags, shmem folios can also
>> have PG_swapcache set. This patch makes sure references for the swap
>> cache are added if folio_test_swapcache(folio) is true.
>>
>> This issue was found when trying to hot-unplug memory in a QEMU/KVM
>> virtual machine. When initiating hot-unplug when most of the guest
>> memory is allocated, hot-unplug hangs partway through removal due to
>> migration failures. The following message would be printed several
>> times, and would be printed again about every five seconds:
>>
>> [   49.641309] migrating pfn b12f25 failed ret:7
>> [   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
>> [   49.641311] aops:swap_aops
>> [   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
>> [   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
>> [   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
>> [   49.641315] page dumped because: migration failure
>>
>> When debugging this, I found that these migration failures were due to
>> __migrate_folio() returning -EAGAIN for a small set of folios because
>> the expected reference count it calculates via folio_expected_ref_count()
>> is one less than the actual reference count of the folios. Furthermore,
>> all of the affected folios were not anonymous, but had the PG_swapcache
>> flag set, inspiring this patch. After applying this patch, the memory
>> hot-unplug behaves as expected.
>>
>> I tested this on a machine running Ubuntu 24.04 with kernel version
>> 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
>> and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
>> mm-unstable branch as a Dec 16, 2025 was also tested and behaves the
>> same) and 48GB of memory. The libvirt XML definition for the VM can be
>> found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
>> the guest kernel so the hot-pluggable memory is automatically onlined.
>>
>> Below are the steps to reproduce this behavior:
>>
>> 1) Define and start and virtual machine
>>    host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
>>    host$ virsh -c qemu:///system start test_vm
>>
>> 2) Setup swap in the guest
>>    guest$ sudo fallocate -l 32G /swapfile
>>    guest$ sudo chmod 0600 /swapfile
>>    guest$ sudo mkswap /swapfile
>>    guest$ sudo swapon /swapfile
>>
>> 3) Use alloc_data [2] to allocate most of the remaining guest memory
>>    guest$ ./alloc_data 45
>>
>> 4) In a separate guest terminal, monitor the amount of used memory
>>    guest$ watch -n1 free -h
>>
>> 5) When alloc_data has finished allocating, initiate the memory
>> hot-unplug using the provided xml file [3]
>>    host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
>>
>> After initiating the memory hot-unplug, you should see the amount of
>> available memory in the guest decrease, and the amount of used swap data
>> increase. If everything works as expected, when all of the memory is
>> unplugged, there should be around 8.5-9GB of data in swap. If the
>> unplugging is unsuccessful, the amount of used swap data will settle
>> below that. If that happens, you should be able to see log messages in
>> dmesg similar to the one posted above.
>>
>> [1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
>> [2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
>> [3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
>>
>> Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
>> Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
>> ---
>>
>> I am not very familiar with the memory hot-(un)plug or swapping code, so
>> I am not 100% certain if this patch actually solves the root of the
>> problem. I believe the issue is from shmem folios, in which case I believe
>> this patch is correct. However, I couldn't think of an easy way to confirm
>> that the affected folios were from shmem. I guess it could be possible that
>> the root cause could be from some bug where some anonymous pages do not
>> return true to folio_test_anon(). I don't think that's the case, but
>> figured the MM maintainers would have a better idea of what's going on.

I am not sure if shmem in the swapcache causes the issue, since
the above setup does not involve shmem. +Baolin and Hugh for some insight.

But David also mentioned that in __read_swap_cache_async() there is a chance
that an anon folio in the swapcache can have the anon flag not set yet. +Chris
and Kairui for more analysis.

>>
>> ---
>>   include/linux/mm.h | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 15076261d0c2..6f959d8ca4b4 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2459,10 +2459,10 @@ static inline int folio_expected_ref_count(const struct folio *folio)
>>   	if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio)))
>>   		return 0;
>>  -	if (folio_test_anon(folio)) {
>> -		/* One reference per page from the swapcache. */
>> -		ref_count += folio_test_swapcache(folio) << order;
>> -	} else {
>> +	/* One reference per page from the swapcache. */
>> +	ref_count += folio_test_swapcache(folio) << order;
>> +
>> +	if (!folio_test_anon(folio)) {
>>   		/* One reference per page from the pagecache. */
>>   		ref_count += !!folio->mapping << order;
>>   		/* One reference from PG_private. */

This change is almost the same as what I proposed in [1] during my discussion
with David.

>
> We discussed that recently [1] and I think Zi wanted to send a patch. We were a bit confused about the semantics of folio_test_swapcache(), but concluded that it should be fine when called against pagecache folios.
>
> So far I thought 86ebd50224c0 did not result in the issue because it replaced
>
> -static int folio_expected_refs(struct address_space *mapping,
> -               struct folio *folio)
> -{
> -       int refs = 1;
> -       if (!mapping)
> -               return refs;
> -
> -       refs += folio_nr_pages(folio);
> -       if (folio_test_private(folio))
> -               refs++;
> -
> -       return refs;
> -}
>
> in migration code where !mapping would have only have returned 1 (reference held by the caller) that folio_expected_ref_count() now expects to be added in the caller.
>
>
> But looking again, in the caller, we obtain
>
> 	mapping = folio_mapping(src)
>
> Which returns the swap_address_space() for folios in the swapcache.
>
>
> So it indeed looks like 86ebd50224c0 introduced the issue.
>
> Thanks!
>
> We should cc: stable
>
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
>
>
> [1] https://lore.kernel.org/all/33A929D1-7438-43C1-AA4A-398183976F8F@nvidia.com/
> [2] https://lore.kernel.org/all/66C159D8-D267-4B3B-9384-1CE94533990E@nvidia.com/
>

I agree with David. Acked-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-17  0:34   ` Zi Yan
@ 2025-12-17  1:04     ` David Hildenbrand (Red Hat)
  2025-12-17  3:09       ` Baolin Wang
  2025-12-19  0:21       ` Wei Yang
  2025-12-17  6:04     ` Kairui Song
  1 sibling, 2 replies; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-17  1:04 UTC (permalink / raw)
  To: Zi Yan, Bijan Tabatabai
  Cc: linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, shivankg, Baolin Wang,
	Hugh Dickins, Chris Li, Kairui Song

>>>
>>> I am not very familiar with the memory hot-(un)plug or swapping code, so
>>> I am not 100% certain if this patch actually solves the root of the
>>> problem. I believe the issue is from shmem folios, in which case I believe
>>> this patch is correct. However, I couldn't think of an easy way to confirm
>>> that the affected folios were from shmem. I guess it could be possible that
>>> the root cause could be from some bug where some anonymous pages do not
>>> return true to folio_test_anon(). I don't think that's the case, but
>>> figured the MM maintainers would have a better idea of what's going on.
> 
> I am not sure about if shmem in swapcache causes the issue, since
> the above setup does not involve shmem. +Baolin and Hugh for some insight.

I think we might just push out another unrelated shmem page to swap as we
create memory pressure in the system.

> 
> But David also mentioned that in __read_swap_cache_async() there is a chance
> that anon folio in swapcache can have anon flag not set yet. +Chris and Kairui
> for more analysis.

Right, when we swap in an anon folio and have not mapped it into the page
table yet. We can likely trigger something similar when we proactively
read a shmem page from swap into the swapcache.

So it's unclear "where" a swapcache page belongs until we move it to
its owner (anon / shmem), which is also why I cannot easily judge from
[   49.641309] migrating pfn b12f25 failed ret:7
[   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
[   49.641311] aops:swap_aops
[   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
[   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
[   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
[   49.641315] page dumped because: migration failure

what exactly that was.

It was certainly an order-0 folio.

[...]

> 
> I agree with David. Acked-by: Zi Yan <ziy@nvidia.com>

Thanks for the fast review :)

-- 
Cheers

David



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-17  1:04     ` David Hildenbrand (Red Hat)
@ 2025-12-17  3:09       ` Baolin Wang
  2025-12-19  0:21       ` Wei Yang
  1 sibling, 0 replies; 10+ messages in thread
From: Baolin Wang @ 2025-12-17  3:09 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Zi Yan, Bijan Tabatabai
  Cc: linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, shivankg, Hugh Dickins, Chris Li,
	Kairui Song



On 2025/12/17 09:04, David Hildenbrand (Red Hat) wrote:
>>>>
>>>> I am not very familiar with the memory hot-(un)plug or swapping 
>>>> code, so
>>>> I am not 100% certain if this patch actually solves the root of the
>>>> problem. I believe the issue is from shmem folios, in which case I 
>>>> believe
>>>> this patch is correct. However, I couldn't think of an easy way to 
>>>> confirm
>>>> that the affected folios were from shmem. I guess it could be 
>>>> possible that
>>>> the root cause could be from some bug where some anonymous pages do not
>>>> return true to folio_test_anon(). I don't think that's the case, but
>>>> figured the MM maintainers would have a better idea of what's going on.
>>
>> I am not sure about if shmem in swapcache causes the issue, since
>> the above setup does not involve shmem. +Baolin and Hugh for some 
>> insight.
> 
> We might just push out another unrelated shmem page to swap as we create 
> memory pressure in the system I think.
> 
>>
>> But David also mentioned that in __read_swap_cache_async() there is a 
>> chance
>> that anon folio in swapcache can have anon flag not set yet. +Chris 
>> and Kairui
>> for more analysis.
> 
> Right, when we swapin an anon folio and did not map it into the page 
> table yet. Likely we can trigger something similar when we proactively 
> read a shmem page from swap into the swapcache.
> 
> So it's unclear "where" a swapcache page belongs to until we move it to 
> its owner (anon / shmem), which is also why I cannot judge easily from
> 
> [   49.641309] migrating pfn b12f25 failed ret:7
> [   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
> [   49.641311] aops:swap_aops
> [   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
> [   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
> [   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
> [   49.641315] page dumped because: migration failure
> 
> What exactly that was.
> 
> It was certainly an order-0 folio.

Thanks David for the explanation, it makes complete sense to me. So
feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-17  0:34   ` Zi Yan
  2025-12-17  1:04     ` David Hildenbrand (Red Hat)
@ 2025-12-17  6:04     ` Kairui Song
  1 sibling, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-12-17  6:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: Bijan Tabatabai, David Hildenbrand (Red Hat),
	linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, shivankg, Baolin Wang,
	Hugh Dickins, Chris Li

On Wed, Dec 17, 2025 at 8:34 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 16 Dec 2025, at 19:07, David Hildenbrand (Red Hat) wrote:
>
> > On 12/16/25 21:07, Bijan Tabatabai wrote:
> >> Currently, folio_expected_ref_count() only adds references for the swap
> >> cache if the folio is anonymous. However, according to the comment above
> >> the definition of PG_swapcache in enum pageflags, shmem folios can also
> >> have PG_swapcache set. This patch makes sure references for the swap
> >> cache are added if folio_test_swapcache(folio) is true.
> >>
> >> This issue was found when trying to hot-unplug memory in a QEMU/KVM
> >> virtual machine. When initiating hot-unplug when most of the guest
> >> memory is allocated, hot-unplug hangs partway through removal due to
> >> migration failures. The following message would be printed several
> >> times, and would be printed again about every five seconds:
> >>
> >> [   49.641309] migrating pfn b12f25 failed ret:7
> >> [   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
> >> [   49.641311] aops:swap_aops
> >> [   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
> >> [   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
> >> [   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
> >> [   49.641315] page dumped because: migration failure
> >>
> >> When debugging this, I found that these migration failures were due to
> >> __migrate_folio() returning -EAGAIN for a small set of folios because
> >> the expected reference count it calculates via folio_expected_ref_count()
> >> is one less than the actual reference count of the folios. Furthermore,
> >> all of the affected folios were not anonymous, but had the PG_swapcache
> >> flag set, inspiring this patch. After applying this patch, the memory
> >> hot-unplug behaves as expected.
> >>
> >> I tested this on a machine running Ubuntu 24.04 with kernel version
> >> 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
> >> and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
> >> mm-unstable branch as a Dec 16, 2025 was also tested and behaves the
> >> same) and 48GB of memory. The libvirt XML definition for the VM can be
> >> found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
> >> the guest kernel so the hot-pluggable memory is automatically onlined.
> >>
> >> Below are the steps to reproduce this behavior:
> >>
> >> 1) Define and start and virtual machine
> >>    host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
> >>    host$ virsh -c qemu:///system start test_vm
> >>
> >> 2) Setup swap in the guest
> >>    guest$ sudo fallocate -l 32G /swapfile
> >>    guest$ sudo chmod 0600 /swapfile
> >>    guest$ sudo mkswap /swapfile
> >>    guest$ sudo swapon /swapfile
> >>
> >> 3) Use alloc_data [2] to allocate most of the remaining guest memory
> >>    guest$ ./alloc_data 45
> >>
> >> 4) In a separate guest terminal, monitor the amount of used memory
> >>    guest$ watch -n1 free -h
> >>
> >> 5) When alloc_data has finished allocating, initiate the memory
> >> hot-unplug using the provided xml file [3]
> >>    host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
> >>
> >> After initiating the memory hot-unplug, you should see the amount of
> >> available memory in the guest decrease, and the amount of used swap data
> >> increase. If everything works as expected, when all of the memory is
> >> unplugged, there should be around 8.5-9GB of data in swap. If the
> >> unplugging is unsuccessful, the amount of used swap data will settle
> >> below that. If that happens, you should be able to see log messages in
> >> dmesg similar to the one posted above.
> >>
> >> [1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
> >> [2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
> >> [3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
> >>
> >> Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
> >> Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
> >> ---
> >>
> >> I am not very familiar with the memory hot-(un)plug or swapping code, so
> >> I am not 100% certain if this patch actually solves the root of the
> >> problem. I believe the issue is from shmem folios, in which case I believe
> >> this patch is correct. However, I couldn't think of an easy way to confirm
> >> that the affected folios were from shmem. I guess it could be possible that
> >> the root cause could be from some bug where some anonymous pages do not
> >> return true to folio_test_anon(). I don't think that's the case, but
> >> figured the MM maintainers would have a better idea of what's going on.
>
> I am not sure about if shmem in swapcache causes the issue, since
> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>
> But David also mentioned that in __read_swap_cache_async() there is a chance
> that anon folio in swapcache can have anon flag not set yet. +Chris and Kairui
> for more analysis.

Yeah, that's possible. A typical case is swap readahead: it will allocate
folios and add them to the swap cache, but won't add them to the anon/shmem
mapping. Anon/shmem will use the folio in the swapcache upon page fault,
and make it an anon/shmem folio at that point.
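
In other words, during that window such a folio looks roughly like this
(schematically; exact refcounts depend on who else holds references):

	folio_test_swapcache(folio)	== true		/* PG_swapcache set */
	folio_test_anon(folio)		== false	/* folio->mapping still NULL */
	folio_mapped(folio)		== false	/* not in any page table yet */
	folio_ref_count(folio)		>= 1		/* at least the swap cache reference */

which is exactly the kind of folio the old folio_expected_ref_count() got
wrong.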

This change looks good to me too, thanks for Ccing me.



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-17  1:04     ` David Hildenbrand (Red Hat)
  2025-12-17  3:09       ` Baolin Wang
@ 2025-12-19  0:21       ` Wei Yang
  2025-12-19  1:42         ` Baolin Wang
  2025-12-19  2:35         ` Kairui Song
  1 sibling, 2 replies; 10+ messages in thread
From: Wei Yang @ 2025-12-19  0:21 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	shivankg, Baolin Wang, Hugh Dickins, Chris Li, Kairui Song

On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
>> > > 
>> > > I am not very familiar with the memory hot-(un)plug or swapping code, so
>> > > I am not 100% certain if this patch actually solves the root of the
>> > > problem. I believe the issue is from shmem folios, in which case I believe
>> > > this patch is correct. However, I couldn't think of an easy way to confirm
>> > > that the affected folios were from shmem. I guess it could be possible that
>> > > the root cause could be from some bug where some anonymous pages do not
>> > > return true to folio_test_anon(). I don't think that's the case, but
>> > > figured the MM maintainers would have a better idea of what's going on.
>> 
>> I am not sure about if shmem in swapcache causes the issue, since
>> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>
>We might just push out another unrelated shmem page to swap as we create
>memory pressure in the system I think.
>

One trivial question: currently we only put anon/shmem folios in the
swapcache, right?

>> 
>> But David also mentioned that in __read_swap_cache_async() there is a chance
>> that anon folio in swapcache can have anon flag not set yet. +Chris and Kairui
>> for more analysis.
>
>Right, when we swapin an anon folio and did not map it into the page table
>yet. Likely we can trigger something similar when we proactively read a shmem
>page from swap into the swapcache.
>
>So it's unclear "where" a swapcache page belongs to until we move it to its
>owner (anon / shmem), which is also why I cannot judge easily from
>
>[   49.641309] migrating pfn b12f25 failed ret:7
>[   49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
>[   49.641311] aops:swap_aops
>[   49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
>[   49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
>[   49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
>[   49.641315] page dumped because: migration failure
>
>What exactly that was.
>
>It was certainly an order-0 folio.
>
>[...]
>
>> 
>> I agree with David. Acked-by: Zi Yan <ziy@nvidia.com>
>
>Thanks for the fast review :)
>
>-- 
>Cheers
>
>David

-- 
Wei Yang
Help you, Help me



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-19  0:21       ` Wei Yang
@ 2025-12-19  1:42         ` Baolin Wang
  2025-12-19  2:35         ` Kairui Song
  1 sibling, 0 replies; 10+ messages in thread
From: Baolin Wang @ 2025-12-19  1:42 UTC (permalink / raw)
  To: Wei Yang, David Hildenbrand (Red Hat)
  Cc: Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	shivankg, Hugh Dickins, Chris Li, Kairui Song



On 2025/12/19 08:21, Wei Yang wrote:
> On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>>
>>>>> I am not very familiar with the memory hot-(un)plug or swapping code, so
>>>>> I am not 100% certain if this patch actually solves the root of the
>>>>> problem. I believe the issue is from shmem folios, in which case I believe
>>>>> this patch is correct. However, I couldn't think of an easy way to confirm
>>>>> that the affected folios were from shmem. I guess it could be possible that
>>>>> the root cause could be from some bug where some anonymous pages do not
>>>>> return true to folio_test_anon(). I don't think that's the case, but
>>>>> figured the MM maintainers would have a better idea of what's going on.
>>>
>>> I am not sure about if shmem in swapcache causes the issue, since
>>> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>>
>> We might just push out another unrelated shmem page to swap as we create
>> memory pressure in the system I think.
>>
> 
> One trivial question: currently we only put anon/shmem folio in swapcache,
> right?

AFAICT, Yes (note a special case for anonymous folios: lazyfree 
anonymous folios will be directly freed instead of being swapped out).



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-19  0:21       ` Wei Yang
  2025-12-19  1:42         ` Baolin Wang
@ 2025-12-19  2:35         ` Kairui Song
  2025-12-20  0:47           ` Wei Yang
  1 sibling, 1 reply; 10+ messages in thread
From: Kairui Song @ 2025-12-19  2:35 UTC (permalink / raw)
  To: Wei Yang
  Cc: David Hildenbrand (Red Hat),
	Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	shivankg, Baolin Wang, Hugh Dickins, Chris Li

On Fri, Dec 19, 2025 at 8:21 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
> >> > >
> >> > > I am not very familiar with the memory hot-(un)plug or swapping code, so
> >> > > I am not 100% certain if this patch actually solves the root of the
> >> > > problem. I believe the issue is from shmem folios, in which case I believe
> >> > > this patch is correct. However, I couldn't think of an easy way to confirm
> >> > > that the affected folios were from shmem. I guess it could be possible that
> >> > > the root cause could be from some bug where some anonymous pages do not
> >> > > return true to folio_test_anon(). I don't think that's the case, but
> >> > > figured the MM maintainers would have a better idea of what's going on.
> >>
> >> I am not sure about if shmem in swapcache causes the issue, since
> >> the above setup does not involve shmem. +Baolin and Hugh for some insight.
> >
> >We might just push out another unrelated shmem page to swap as we create
> >memory pressure in the system I think.
> >
>
> One trivial question: currently we only put anon/shmem folio in swapcache,
> right?

For swapout, yes: the entry point for moving a folio into swap space is
folio_alloc_swap(), and only anon and shmem can do that (vmscan.c ->
folio_test_anon && folio_test_swapbacked, and shmem.c).

Swapin is a bit different because of readahead: readahead folios are
not marked as anon / shmem (via folio->mapping) until they are used. They
do belong to anon / shmem, but we don't add them to the mapping until
that mapping does a swap cache lookup and uses the cached folio.

Also maybe worth mentioning: the swap cache lookup convention requires the
caller to lock the folio and double check that the folio still matches the
swap entry before use (folio_matches_swap_entry). Folios there are unstable
and may no longer be valid swap cache folios unless locked.
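
i.e. something like (simplified, arguments elided):

	folio = swap_cache_get_folio(entry);	/* may return a folio that is about to go away */
	if (folio) {
		folio_lock(folio);
		if (!folio_matches_swap_entry(folio, entry)) {
			/* raced with reclaim/reuse, the folio is no longer ours */
			folio_unlock(folio);
			folio_put(folio);
			goto retry;
		}
		/* now it is safe to treat folio as the swapcache folio for entry */
	}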



* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
  2025-12-19  2:35         ` Kairui Song
@ 2025-12-20  0:47           ` Wei Yang
  0 siblings, 0 replies; 10+ messages in thread
From: Wei Yang @ 2025-12-20  0:47 UTC (permalink / raw)
  To: Kairui Song
  Cc: Wei Yang, David Hildenbrand (Red Hat),
	Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	shivankg, Baolin Wang, Hugh Dickins, Chris Li

On Fri, Dec 19, 2025 at 10:35:05AM +0800, Kairui Song wrote:
>On Fri, Dec 19, 2025 at 8:21 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
>> >> > >
>> >> > > I am not very familiar with the memory hot-(un)plug or swapping code, so
>> >> > > I am not 100% certain if this patch actually solves the root of the
>> >> > > problem. I believe the issue is from shmem folios, in which case I believe
>> >> > > this patch is correct. However, I couldn't think of an easy way to confirm
>> >> > > that the affected folios were from shmem. I guess it could be possible that
>> >> > > the root cause could be from some bug where some anonymous pages do not
>> >> > > return true to folio_test_anon(). I don't think that's the case, but
>> >> > > figured the MM maintainers would have a better idea of what's going on.
>> >>
>> >> I am not sure about if shmem in swapcache causes the issue, since
>> >> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>> >
>> >We might just push out another unrelated shmem page to swap as we create
>> >memory pressure in the system I think.
>> >
>>
>> One trivial question: currently we only put anon/shmem folio in swapcache,
>> right?
>
>For swapout, yes, the entry point to move a folio to swap space is
>folio_alloc_swap, only anon and shmem can do that (vmscan.c ->
>folio_test_anon && folio_test_swapbacked, and shmem.c).
>

Thanks for this information.

>Swapin is a bit different because of readahead, readahead folios are
>not marked as anon / shmem (folio->mapping) until used, they do belong
>to anon / shmem though, but we don't add them to the mapping until
>that mapping does a swap cache lookup and use the cached folio.
>

I saw this. So there can be folios which are in the swapcache but are
not yet known to be anon/shmem.

>Also maybe worth mentioning, swap cache lookup convention requires the
>caller to lock the folio and double check folio still matches the swap
>entry before use (folio_matches_swap_entry), folios there are unstable
>and could no longer be a valid swap cache folio unless locked.

Thanks for this notice, will pay attention to this.

-- 
Wei Yang
Help you, Help me


