* [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
@ 2025-12-16 20:07 Bijan Tabatabai
2025-12-17 0:07 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 10+ messages in thread
From: Bijan Tabatabai @ 2025-12-16 20:07 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, shivankg, Bijan Tabatabai
Currently, folio_expected_ref_count() only adds references for the swap
cache if the folio is anonymous. However, according to the comment above
the definition of PG_swapcache in enum pageflags, shmem folios can also
have PG_swapcache set. This patch makes sure references for the swap
cache are added if folio_test_swapcache(folio) is true.
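For illustration, a simplified sketch of what folio_expected_ref_count()
computes with this change applied (a sketch only: the hugetlb/page-type
sanity checks of the real function are elided, see the diff below for the
actual hunk):

static inline int folio_expected_ref_count_sketch(const struct folio *folio)
{
	const int order = folio_order(folio);
	int ref_count = 0;

	/* One reference per page from the swapcache (anon *or* shmem). */
	ref_count += folio_test_swapcache(folio) << order;

	if (!folio_test_anon(folio)) {
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}

	/* One reference per page table mapping. */
	return ref_count + folio_mapcount(folio);
}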
This issue was found when trying to hot-unplug memory in a QEMU/KVM
virtual machine. When hot-unplug is initiated while most of the guest
memory is allocated, it hangs partway through removal due to migration
failures. The following message would be printed several times, and
would then be repeated about every five seconds:
[ 49.641309] migrating pfn b12f25 failed ret:7
[ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
[ 49.641311] aops:swap_aops
[ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
[ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
[ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
[ 49.641315] page dumped because: migration failure
When debugging this, I found that these migration failures were due to
__migrate_folio() returning -EAGAIN for a small set of folios because
the expected reference count it calculates via folio_expected_ref_count()
is one less than the actual reference count of the folios. Furthermore,
all of the affected folios were not anonymous, but had the PG_swapcache
flag set, inspiring this patch. After applying this patch, the memory
hot-unplug behaves as expected.
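For reference, the check that kept failing is roughly the following (a
sketch of the logic described above, not a verbatim copy of mm/migrate.c):
the migration caller holds one reference of its own, and every other
reference must be explained by folio_expected_ref_count(), otherwise the
folio is considered pinned and -EAGAIN is returned:

static int migrate_refcount_check_sketch(struct folio *src)
{
	/* One extra reference is held by the migration caller itself. */
	int expected_count = folio_expected_ref_count(src) + 1;

	if (folio_ref_count(src) != expected_count)
		return -EAGAIN;		/* unaccounted reference, retry later */

	return 0;
}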
I tested this on a machine running Ubuntu 24.04 with kernel version
6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
mm-unstable branch as of Dec 16, 2025 was also tested and behaves the
same) and 48GB of memory. The libvirt XML definition for the VM can be
found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
the guest kernel so the hot-pluggable memory is automatically onlined.
Below are the steps to reproduce this behavior:
1) Define and start the virtual machine
host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
host$ virsh -c qemu:///system start test_vm
2) Set up swap in the guest
guest$ sudo fallocate -l 32G /swapfile
guest$ sudo chmod 0600 /swapfile
guest$ sudo mkswap /swapfile
guest$ sudo swapon /swapfile
3) Use alloc_data [2] to allocate most of the remaining guest memory
guest$ ./alloc_data 45
4) In a separate guest terminal, monitor the amount of used memory
guest$ watch -n1 free -h
5) When alloc_data has finished allocating, initiate the memory
hot-unplug using the provided xml file [3]
host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
After initiating the memory hot-unplug, you should see the amount of
available memory in the guest decrease, and the amount of used swap data
increase. If everything works as expected, when all of the memory is
unplugged, there should be around 8.5-9GB of data in swap. If the
unplugging is unsuccessful, the amount of used swap data will settle
below that. If that happens, you should be able to see log messages in
dmesg similar to the one posted above.
[1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
[2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
[3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
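(alloc_data itself is only available at [2]; as a rough idea of the
workload, a hypothetical stand-in that allocates and touches the requested
number of GiB could look like the sketch below. This is an assumed
example, not the actual file from [2].)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t gib = argc > 1 ? strtoull(argv[1], NULL, 0) : 1;
	size_t chunk = 1UL << 30;	/* allocate 1 GiB at a time */

	for (size_t done = 0; done < gib; done++) {
		char *buf = malloc(chunk);

		if (!buf) {
			perror("malloc");
			return 1;
		}
		memset(buf, 0xab, chunk);	/* fault the pages in */
		printf("allocated %zu GiB\n", done + 1);
	}
	pause();	/* hold the memory while the hot-unplug runs */
	return 0;
}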
Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
---
I am not very familiar with the memory hot-(un)plug or swapping code, so
I am not 100% certain that this patch addresses the root of the problem.
I believe the issue comes from shmem folios, in which case this patch
should be correct. However, I couldn't think of an easy way to confirm
that the affected folios were from shmem. It is also possible that the
root cause is some bug where anonymous pages do not return true from
folio_test_anon(). I don't think that's the case, but figured the MM
maintainers would have a better idea of what's going on.
---
include/linux/mm.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15076261d0c2..6f959d8ca4b4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2459,10 +2459,10 @@ static inline int folio_expected_ref_count(const struct folio *folio)
if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio)))
return 0;
- if (folio_test_anon(folio)) {
- /* One reference per page from the swapcache. */
- ref_count += folio_test_swapcache(folio) << order;
- } else {
+ /* One reference per page from the swapcache. */
+ ref_count += folio_test_swapcache(folio) << order;
+
+ if (!folio_test_anon(folio)) {
/* One reference per page from the pagecache. */
ref_count += !!folio->mapping << order;
/* One reference from PG_private. */
--
2.43.0
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-16 20:07 [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count() Bijan Tabatabai
@ 2025-12-17 0:07 ` David Hildenbrand (Red Hat)
2025-12-17 0:34 ` Zi Yan
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-17 0:07 UTC (permalink / raw)
To: Bijan Tabatabai, linux-mm, linux-kernel
Cc: akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, shivankg
On 12/16/25 21:07, Bijan Tabatabai wrote:
> Currently, folio_expected_ref_count() only adds references for the swap
> cache if the folio is anonymous. However, according to the comment above
> the definition of PG_swapcache in enum pageflags, shmem folios can also
> have PG_swapcache set. This patch makes sure references for the swap
> cache are added if folio_test_swapcache(folio) is true.
>
> This issue was found when trying to hot-unplug memory in a QEMU/KVM
> virtual machine. When initiating hot-unplug when most of the guest
> memory is allocated, hot-unplug hangs partway through removal due to
> migration failures. The following message would be printed several
> times, and would be printed again about every five seconds:
>
> [ 49.641309] migrating pfn b12f25 failed ret:7
> [ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
> [ 49.641311] aops:swap_aops
> [ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
> [ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
> [ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
> [ 49.641315] page dumped because: migration failure
>
> When debugging this, I found that these migration failures were due to
> __migrate_folio() returning -EAGAIN for a small set of folios because
> the expected reference count it calculates via folio_expected_ref_count()
> is one less than the actual reference count of the folios. Furthermore,
> all of the affected folios were not anonymous, but had the PG_swapcache
> flag set, inspiring this patch. After applying this patch, the memory
> hot-unplug behaves as expected.
>
> I tested this on a machine running Ubuntu 24.04 with kernel version
> 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
> and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
> mm-unstable branch as a Dec 16, 2025 was also tested and behaves the
> same) and 48GB of memory. The libvirt XML definition for the VM can be
> found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
> the guest kernel so the hot-pluggable memory is automatically onlined.
>
> Below are the steps to reproduce this behavior:
>
> 1) Define and start and virtual machine
> host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
> host$ virsh -c qemu:///system start test_vm
>
> 2) Setup swap in the guest
> guest$ sudo fallocate -l 32G /swapfile
> guest$ sudo chmod 0600 /swapfile
> guest$ sudo mkswap /swapfile
> guest$ sudo swapon /swapfile
>
> 3) Use alloc_data [2] to allocate most of the remaining guest memory
> guest$ ./alloc_data 45
>
> 4) In a separate guest terminal, monitor the amount of used memory
> guest$ watch -n1 free -h
>
> 5) When alloc_data has finished allocating, initiate the memory
> hot-unplug using the provided xml file [3]
> host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
>
> After initiating the memory hot-unplug, you should see the amount of
> available memory in the guest decrease, and the amount of used swap data
> increase. If everything works as expected, when all of the memory is
> unplugged, there should be around 8.5-9GB of data in swap. If the
> unplugging is unsuccessful, the amount of used swap data will settle
> below that. If that happens, you should be able to see log messages in
> dmesg similar to the one posted above.
>
> [1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
> [2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
> [3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
>
> Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
> Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
> ---
>
> I am not very familiar with the memory hot-(un)plug or swapping code, so
> I am not 100% certain if this patch actually solves the root of the
> problem. I believe the issue is from shmem folios, in which case I believe
> this patch is correct. However, I couldn't think of an easy way to confirm
> that the affected folios were from shmem. I guess it could be possible that
> the root cause could be from some bug where some anonymous pages do not
> return true to folio_test_anon(). I don't think that's the case, but
> figured the MM maintainers would have a better idea of what's going on.
>
> ---
> include/linux/mm.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 15076261d0c2..6f959d8ca4b4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2459,10 +2459,10 @@ static inline int folio_expected_ref_count(const struct folio *folio)
> if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio)))
> return 0;
>
> - if (folio_test_anon(folio)) {
> - /* One reference per page from the swapcache. */
> - ref_count += folio_test_swapcache(folio) << order;
> - } else {
> + /* One reference per page from the swapcache. */
> + ref_count += folio_test_swapcache(folio) << order;
> +
> + if (!folio_test_anon(folio)) {
> /* One reference per page from the pagecache. */
> ref_count += !!folio->mapping << order;
> /* One reference from PG_private. */
We discussed that recently [1] and I think Zi wanted to send a patch. We
were a bit confused about the semantics of folio_test_swapcache(), but
concluded that it should be fine when called against pagecache folios.
So far I thought 86ebd50224c0 did not result in the issue because it
replaced
-static int folio_expected_refs(struct address_space *mapping,
- struct folio *folio)
-{
- int refs = 1;
- if (!mapping)
- return refs;
-
- refs += folio_nr_pages(folio);
- if (folio_test_private(folio))
- refs++;
-
- return refs;
-}
in migration code, where !mapping would only have returned 1 (the
reference held by the caller), a reference that folio_expected_ref_count()
now expects the caller to add itself.
But looking again, in the caller, we obtain
mapping = folio_mapping(src)
Which returns the swap_address_space() for folios in the swapcache.
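(Roughly, the special case in folio_mapping() looks like the sketch below;
not a verbatim copy of mm/util.c, but it shows why migration sees a
non-NULL mapping here even though folio->mapping is not a pagecache
mapping.)

static struct address_space *folio_mapping_sketch(struct folio *folio)
{
	/* Swapcache folios report the swap address space. */
	if (folio_test_swapcache(folio))
		return swap_address_space(folio->swap);

	/* Anon/movable folios encode flag bits in the mapping pointer. */
	if ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS)
		return NULL;

	return folio->mapping;	/* pagecache mapping, possibly NULL */
}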
So it indeed looks like 86ebd50224c0 introduced the issue.
Thanks!
We should cc: stable
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
[1]
https://lore.kernel.org/all/33A929D1-7438-43C1-AA4A-398183976F8F@nvidia.com/
[2]
https://lore.kernel.org/all/66C159D8-D267-4B3B-9384-1CE94533990E@nvidia.com/
--
Cheers
David
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-17 0:07 ` David Hildenbrand (Red Hat)
@ 2025-12-17 0:34 ` Zi Yan
2025-12-17 1:04 ` David Hildenbrand (Red Hat)
2025-12-17 6:04 ` Kairui Song
0 siblings, 2 replies; 10+ messages in thread
From: Zi Yan @ 2025-12-17 0:34 UTC (permalink / raw)
To: Bijan Tabatabai, David Hildenbrand (Red Hat)
Cc: linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, shivankg, Baolin Wang,
Hugh Dickins, Chris Li, Kairui Song
On 16 Dec 2025, at 19:07, David Hildenbrand (Red Hat) wrote:
> On 12/16/25 21:07, Bijan Tabatabai wrote:
>> Currently, folio_expected_ref_count() only adds references for the swap
>> cache if the folio is anonymous. However, according to the comment above
>> the definition of PG_swapcache in enum pageflags, shmem folios can also
>> have PG_swapcache set. This patch makes sure references for the swap
>> cache are added if folio_test_swapcache(folio) is true.
>>
>> This issue was found when trying to hot-unplug memory in a QEMU/KVM
>> virtual machine. When initiating hot-unplug when most of the guest
>> memory is allocated, hot-unplug hangs partway through removal due to
>> migration failures. The following message would be printed several
>> times, and would be printed again about every five seconds:
>>
>> [ 49.641309] migrating pfn b12f25 failed ret:7
>> [ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
>> [ 49.641311] aops:swap_aops
>> [ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
>> [ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
>> [ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
>> [ 49.641315] page dumped because: migration failure
>>
>> When debugging this, I found that these migration failures were due to
>> __migrate_folio() returning -EAGAIN for a small set of folios because
>> the expected reference count it calculates via folio_expected_ref_count()
>> is one less than the actual reference count of the folios. Furthermore,
>> all of the affected folios were not anonymous, but had the PG_swapcache
>> flag set, inspiring this patch. After applying this patch, the memory
>> hot-unplug behaves as expected.
>>
>> I tested this on a machine running Ubuntu 24.04 with kernel version
>> 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
>> and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
>> mm-unstable branch as a Dec 16, 2025 was also tested and behaves the
>> same) and 48GB of memory. The libvirt XML definition for the VM can be
>> found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
>> the guest kernel so the hot-pluggable memory is automatically onlined.
>>
>> Below are the steps to reproduce this behavior:
>>
>> 1) Define and start and virtual machine
>> host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
>> host$ virsh -c qemu:///system start test_vm
>>
>> 2) Setup swap in the guest
>> guest$ sudo fallocate -l 32G /swapfile
>> guest$ sudo chmod 0600 /swapfile
>> guest$ sudo mkswap /swapfile
>> guest$ sudo swapon /swapfile
>>
>> 3) Use alloc_data [2] to allocate most of the remaining guest memory
>> guest$ ./alloc_data 45
>>
>> 4) In a separate guest terminal, monitor the amount of used memory
>> guest$ watch -n1 free -h
>>
>> 5) When alloc_data has finished allocating, initiate the memory
>> hot-unplug using the provided xml file [3]
>> host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
>>
>> After initiating the memory hot-unplug, you should see the amount of
>> available memory in the guest decrease, and the amount of used swap data
>> increase. If everything works as expected, when all of the memory is
>> unplugged, there should be around 8.5-9GB of data in swap. If the
>> unplugging is unsuccessful, the amount of used swap data will settle
>> below that. If that happens, you should be able to see log messages in
>> dmesg similar to the one posted above.
>>
>> [1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
>> [2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
>> [3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
>>
>> Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
>> Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
>> ---
>>
>> I am not very familiar with the memory hot-(un)plug or swapping code, so
>> I am not 100% certain if this patch actually solves the root of the
>> problem. I believe the issue is from shmem folios, in which case I believe
>> this patch is correct. However, I couldn't think of an easy way to confirm
>> that the affected folios were from shmem. I guess it could be possible that
>> the root cause could be from some bug where some anonymous pages do not
>> return true to folio_test_anon(). I don't think that's the case, but
>> figured the MM maintainers would have a better idea of what's going on.
I am not sure whether shmem in the swapcache causes the issue, since
the above setup does not involve shmem. +Baolin and Hugh for some insight.
But David also mentioned that in __read_swap_cache_async() there is a chance
that an anon folio in the swapcache can have the anon flag not set yet.
+Chris and Kairui for more analysis.
>>
>> ---
>> include/linux/mm.h | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 15076261d0c2..6f959d8ca4b4 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2459,10 +2459,10 @@ static inline int folio_expected_ref_count(const struct folio *folio)
>> if (WARN_ON_ONCE(page_has_type(&folio->page) && !folio_test_hugetlb(folio)))
>> return 0;
>> - if (folio_test_anon(folio)) {
>> - /* One reference per page from the swapcache. */
>> - ref_count += folio_test_swapcache(folio) << order;
>> - } else {
>> + /* One reference per page from the swapcache. */
>> + ref_count += folio_test_swapcache(folio) << order;
>> +
>> + if (!folio_test_anon(folio)) {
>> /* One reference per page from the pagecache. */
>> ref_count += !!folio->mapping << order;
>> /* One reference from PG_private. */
This change is almost the same as what I proposed in [1] during my discussion
with David.
>
> We discussed that recently [1] and I think Zi wanted to send a patch. We were a bit confused about the semantics of folio_test_swapcache(), but concluded that it should be fine when called against pagecache folios.
>
> So far I thought 86ebd50224c0 did not result in the issue because it replaced
>
> -static int folio_expected_refs(struct address_space *mapping,
> - struct folio *folio)
> -{
> - int refs = 1;
> - if (!mapping)
> - return refs;
> -
> - refs += folio_nr_pages(folio);
> - if (folio_test_private(folio))
> - refs++;
> -
> - return refs;
> -}
>
> in migration code where !mapping would have only have returned 1 (reference held by the caller) that folio_expected_ref_count() now expects to be added in the caller.
>
>
> But looking again, in the caller, we obtain
>
> mapping = folio_mapping(src)
>
> Which returns the swap_address_space() for folios in the swapcache.
>
>
> So it indeed looks like 86ebd50224c0 introduced the issue.
>
> Thanks!
>
> We should cc: stable
>
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
>
>
> [1] https://lore.kernel.org/all/33A929D1-7438-43C1-AA4A-398183976F8F@nvidia.com/
> [2] https://lore.kernel.org/all/66C159D8-D267-4B3B-9384-1CE94533990E@nvidia.com/
>
I agree with David. Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-17 0:34 ` Zi Yan
@ 2025-12-17 1:04 ` David Hildenbrand (Red Hat)
2025-12-17 3:09 ` Baolin Wang
2025-12-19 0:21 ` Wei Yang
2025-12-17 6:04 ` Kairui Song
1 sibling, 2 replies; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-17 1:04 UTC (permalink / raw)
To: Zi Yan, Bijan Tabatabai
Cc: linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, shivankg, Baolin Wang,
Hugh Dickins, Chris Li, Kairui Song
>>>
>>> I am not very familiar with the memory hot-(un)plug or swapping code, so
>>> I am not 100% certain if this patch actually solves the root of the
>>> problem. I believe the issue is from shmem folios, in which case I believe
>>> this patch is correct. However, I couldn't think of an easy way to confirm
>>> that the affected folios were from shmem. I guess it could be possible that
>>> the root cause could be from some bug where some anonymous pages do not
>>> return true to folio_test_anon(). I don't think that's the case, but
>>> figured the MM maintainers would have a better idea of what's going on.
>
> I am not sure about if shmem in swapcache causes the issue, since
> the above setup does not involve shmem. +Baolin and Hugh for some insight.
I think we might just push out an unrelated shmem page to swap as we
create memory pressure in the system.
>
> But David also mentioned that in __read_swap_cache_async() there is a chance
> that anon folio in swapcache can have anon flag not set yet. +Chris and Kairui
> for more analysis.
Right, when we swap in an anon folio and have not mapped it into the page
table yet. Likely we can trigger something similar when we proactively
read a shmem page from swap into the swapcache.
So it's unclear "where" a swapcache page belongs until we move it to
its owner (anon / shmem), which is also why I cannot easily judge from
[ 49.641309] migrating pfn b12f25 failed ret:7
[ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
[ 49.641311] aops:swap_aops
[ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
[ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
[ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
[ 49.641315] page dumped because: migration failure
What exactly that was.
It was certainly an order-0 folio.
[...]
>
> I agree with David. Acked-by: Zi Yan <ziy@nvidia.com>
Thanks for the fast review :)
--
Cheers
David
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-17 1:04 ` David Hildenbrand (Red Hat)
@ 2025-12-17 3:09 ` Baolin Wang
2025-12-19 0:21 ` Wei Yang
1 sibling, 0 replies; 10+ messages in thread
From: Baolin Wang @ 2025-12-17 3:09 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), Zi Yan, Bijan Tabatabai
Cc: linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, shivankg, Hugh Dickins, Chris Li,
Kairui Song
On 2025/12/17 09:04, David Hildenbrand (Red Hat) wrote:
>>>>
>>>> I am not very familiar with the memory hot-(un)plug or swapping
>>>> code, so
>>>> I am not 100% certain if this patch actually solves the root of the
>>>> problem. I believe the issue is from shmem folios, in which case I
>>>> believe
>>>> this patch is correct. However, I couldn't think of an easy way to
>>>> confirm
>>>> that the affected folios were from shmem. I guess it could be
>>>> possible that
>>>> the root cause could be from some bug where some anonymous pages do not
>>>> return true to folio_test_anon(). I don't think that's the case, but
>>>> figured the MM maintainers would have a better idea of what's going on.
>>
>> I am not sure about if shmem in swapcache causes the issue, since
>> the above setup does not involve shmem. +Baolin and Hugh for some
>> insight.
>
> We might just push out another unrelated shmem page to swap as we create
> memory pressure in the system I think.
>
>>
>> But David also mentioned that in __read_swap_cache_async() there is a
>> chance
>> that anon folio in swapcache can have anon flag not set yet. +Chris
>> and Kairui
>> for more analysis.
>
> Right, when we swapin an anon folio and did not map it into the page
> table yet. Likely we can trigger something similar when we proactively
> read a shmem page from swap into the swapcache.
>
> So it's unclear "where" a swapcache page belongs to until we move it to
> its owner (anon / shmem), which is also why I cannot judge easily from
>
> [ 49.641309] migrating pfn b12f25 failed ret:7
> [ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2
> index:0x7f404d925 pfn:0xb12f25
> [ 49.641311] aops:swap_aops
> [ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|
> reclaim|swapbacked|node=0|zone=3)
> [ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8
> 0000000000000000
> [ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff
> 0000000000000000
> [ 49.641315] page dumped because: migration failure
>
> What exactly that was.
>
> It was certainly an order-0 folio.
Thanks David for the explanation. It completely makes sense to me. So
feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-17 0:34 ` Zi Yan
2025-12-17 1:04 ` David Hildenbrand (Red Hat)
@ 2025-12-17 6:04 ` Kairui Song
1 sibling, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-12-17 6:04 UTC (permalink / raw)
To: Zi Yan
Cc: Bijan Tabatabai, David Hildenbrand (Red Hat),
linux-mm, linux-kernel, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, shivankg, Baolin Wang,
Hugh Dickins, Chris Li
On Wed, Dec 17, 2025 at 8:34 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 16 Dec 2025, at 19:07, David Hildenbrand (Red Hat) wrote:
>
> > On 12/16/25 21:07, Bijan Tabatabai wrote:
> >> Currently, folio_expected_ref_count() only adds references for the swap
> >> cache if the folio is anonymous. However, according to the comment above
> >> the definition of PG_swapcache in enum pageflags, shmem folios can also
> >> have PG_swapcache set. This patch makes sure references for the swap
> >> cache are added if folio_test_swapcache(folio) is true.
> >>
> >> This issue was found when trying to hot-unplug memory in a QEMU/KVM
> >> virtual machine. When initiating hot-unplug when most of the guest
> >> memory is allocated, hot-unplug hangs partway through removal due to
> >> migration failures. The following message would be printed several
> >> times, and would be printed again about every five seconds:
> >>
> >> [ 49.641309] migrating pfn b12f25 failed ret:7
> >> [ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
> >> [ 49.641311] aops:swap_aops
> >> [ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
> >> [ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
> >> [ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
> >> [ 49.641315] page dumped because: migration failure
> >>
> >> When debugging this, I found that these migration failures were due to
> >> __migrate_folio() returning -EAGAIN for a small set of folios because
> >> the expected reference count it calculates via folio_expected_ref_count()
> >> is one less than the actual reference count of the folios. Furthermore,
> >> all of the affected folios were not anonymous, but had the PG_swapcache
> >> flag set, inspiring this patch. After applying this patch, the memory
> >> hot-unplug behaves as expected.
> >>
> >> I tested this on a machine running Ubuntu 24.04 with kernel version
> >> 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt
> >> and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the
> >> mm-unstable branch as a Dec 16, 2025 was also tested and behaves the
> >> same) and 48GB of memory. The libvirt XML definition for the VM can be
> >> found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in
> >> the guest kernel so the hot-pluggable memory is automatically onlined.
> >>
> >> Below are the steps to reproduce this behavior:
> >>
> >> 1) Define and start and virtual machine
> >> host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
> >> host$ virsh -c qemu:///system start test_vm
> >>
> >> 2) Setup swap in the guest
> >> guest$ sudo fallocate -l 32G /swapfile
> >> guest$ sudo chmod 0600 /swapfile
> >> guest$ sudo mkswap /swapfile
> >> guest$ sudo swapon /swapfile
> >>
> >> 3) Use alloc_data [2] to allocate most of the remaining guest memory
> >> guest$ ./alloc_data 45
> >>
> >> 4) In a separate guest terminal, monitor the amount of used memory
> >> guest$ watch -n1 free -h
> >>
> >> 5) When alloc_data has finished allocating, initiate the memory
> >> hot-unplug using the provided xml file [3]
> >> host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live
> >>
> >> After initiating the memory hot-unplug, you should see the amount of
> >> available memory in the guest decrease, and the amount of used swap data
> >> increase. If everything works as expected, when all of the memory is
> >> unplugged, there should be around 8.5-9GB of data in swap. If the
> >> unplugging is unsuccessful, the amount of used swap data will settle
> >> below that. If that happens, you should be able to see log messages in
> >> dmesg similar to the one posted above.
> >>
> >> [1] https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml
> >> [2] https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c
> >> [3] https://github.com/BijanT/linux_patch_files/blob/main/remove.xml
> >>
> >> Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
> >> Signed-off-by: Bijan Tabatabai <bijan311@gmail.com>
> >> ---
> >>
> >> I am not very familiar with the memory hot-(un)plug or swapping code, so
> >> I am not 100% certain if this patch actually solves the root of the
> >> problem. I believe the issue is from shmem folios, in which case I believe
> >> this patch is correct. However, I couldn't think of an easy way to confirm
> >> that the affected folios were from shmem. I guess it could be possible that
> >> the root cause could be from some bug where some anonymous pages do not
> >> return true to folio_test_anon(). I don't think that's the case, but
> >> figured the MM maintainers would have a better idea of what's going on.
>
> I am not sure about if shmem in swapcache causes the issue, since
> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>
> But David also mentioned that in __read_swap_cache_async() there is a chance
> that anon folio in swapcache can have anon flag not set yet. +Chris and Kairui
> for more analysis.
Yeah, that's possible. A typical case is swap readahead: it will allocate
and add folios into the swap cache, but won't add them to the anon/shmem
mapping. Anon/shmem will use the folio in the swapcache upon page fault,
and make it an anon/shmem folio by then.
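Illustrative only (this is not an upstream helper): the folios the
pre-patch folio_expected_ref_count() missed are exactly those sitting in
the swap cache, which holds one reference per page, without being marked
anon -- e.g. the readahead window described above, or a shmem folio on its
way to or from swap.

static bool folio_missed_by_old_ref_count(const struct folio *folio)
{
	return folio_test_swapcache(folio) && !folio_test_anon(folio);
}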
This change looks good to me too, thanks for Ccing me.
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-17 1:04 ` David Hildenbrand (Red Hat)
2025-12-17 3:09 ` Baolin Wang
@ 2025-12-19 0:21 ` Wei Yang
2025-12-19 1:42 ` Baolin Wang
2025-12-19 2:35 ` Kairui Song
1 sibling, 2 replies; 10+ messages in thread
From: Wei Yang @ 2025-12-19 0:21 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
shivankg, Baolin Wang, Hugh Dickins, Chris Li, Kairui Song
On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
>> > >
>> > > I am not very familiar with the memory hot-(un)plug or swapping code, so
>> > > I am not 100% certain if this patch actually solves the root of the
>> > > problem. I believe the issue is from shmem folios, in which case I believe
>> > > this patch is correct. However, I couldn't think of an easy way to confirm
>> > > that the affected folios were from shmem. I guess it could be possible that
>> > > the root cause could be from some bug where some anonymous pages do not
>> > > return true to folio_test_anon(). I don't think that's the case, but
>> > > figured the MM maintainers would have a better idea of what's going on.
>>
>> I am not sure about if shmem in swapcache causes the issue, since
>> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>
>We might just push out another unrelated shmem page to swap as we create
>memory pressure in the system I think.
>
One trivial question: currently we only put anon/shmem folios in the
swapcache, right?
>>
>> But David also mentioned that in __read_swap_cache_async() there is a chance
>> that anon folio in swapcache can have anon flag not set yet. +Chris and Kairui
>> for more analysis.
>
>Right, when we swapin an anon folio and did not map it into the page table
>yet. Likely we can trigger something similar when we proactively read a shmem
>page from swap into the swapcache.
>
>So it's unclear "where" a swapcache page belongs to until we move it to its
>owner (anon / shmem), which is also why I cannot judge easily from
>
>[ 49.641309] migrating pfn b12f25 failed ret:7
>[ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2
>index:0x7f404d925 pfn:0xb12f25
>[ 49.641311] aops:swap_aops
>[ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
>[ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8
>0000000000000000
>[ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff
>0000000000000000
>[ 49.641315] page dumped because: migration failure
>
>What exactly that was.
>
>It was certainly an order-0 folio.
>
>[...]
>
>>
>> I agree with David. Acked-by: Zi Yan <ziy@nvidia.com>
>
>Thanks for the fast review :)
>
>--
>Cheers
>
>David
--
Wei Yang
Help you, Help me
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-19 0:21 ` Wei Yang
@ 2025-12-19 1:42 ` Baolin Wang
2025-12-19 2:35 ` Kairui Song
1 sibling, 0 replies; 10+ messages in thread
From: Baolin Wang @ 2025-12-19 1:42 UTC (permalink / raw)
To: Wei Yang, David Hildenbrand (Red Hat)
Cc: Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
shivankg, Hugh Dickins, Chris Li, Kairui Song
On 2025/12/19 08:21, Wei Yang wrote:
> On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>>
>>>>> I am not very familiar with the memory hot-(un)plug or swapping code, so
>>>>> I am not 100% certain if this patch actually solves the root of the
>>>>> problem. I believe the issue is from shmem folios, in which case I believe
>>>>> this patch is correct. However, I couldn't think of an easy way to confirm
>>>>> that the affected folios were from shmem. I guess it could be possible that
>>>>> the root cause could be from some bug where some anonymous pages do not
>>>>> return true to folio_test_anon(). I don't think that's the case, but
>>>>> figured the MM maintainers would have a better idea of what's going on.
>>>
>>> I am not sure about if shmem in swapcache causes the issue, since
>>> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>>
>> We might just push out another unrelated shmem page to swap as we create
>> memory pressure in the system I think.
>>
>
> One trivial question: currently we only put anon/shmem folio in swapcache,
> right?
AFAICT, yes (note a special case for anonymous folios: lazyfree
anonymous folios are freed directly instead of being swapped out).
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-19 0:21 ` Wei Yang
2025-12-19 1:42 ` Baolin Wang
@ 2025-12-19 2:35 ` Kairui Song
2025-12-20 0:47 ` Wei Yang
1 sibling, 1 reply; 10+ messages in thread
From: Kairui Song @ 2025-12-19 2:35 UTC (permalink / raw)
To: Wei Yang
Cc: David Hildenbrand (Red Hat),
Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
shivankg, Baolin Wang, Hugh Dickins, Chris Li
On Fri, Dec 19, 2025 at 8:21 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
> >> > >
> >> > > I am not very familiar with the memory hot-(un)plug or swapping code, so
> >> > > I am not 100% certain if this patch actually solves the root of the
> >> > > problem. I believe the issue is from shmem folios, in which case I believe
> >> > > this patch is correct. However, I couldn't think of an easy way to confirm
> >> > > that the affected folios were from shmem. I guess it could be possible that
> >> > > the root cause could be from some bug where some anonymous pages do not
> >> > > return true to folio_test_anon(). I don't think that's the case, but
> >> > > figured the MM maintainers would have a better idea of what's going on.
> >>
> >> I am not sure about if shmem in swapcache causes the issue, since
> >> the above setup does not involve shmem. +Baolin and Hugh for some insight.
> >
> >We might just push out another unrelated shmem page to swap as we create
> >memory pressure in the system I think.
> >
>
> One trivial question: currently we only put anon/shmem folio in swapcache,
> right?
For swapout, yes: the entry point for moving a folio to swap space is
folio_alloc_swap(), and only anon and shmem can do that (vmscan.c ->
folio_test_anon && folio_test_swapbacked, and shmem.c).
Swapin is a bit different because of readahead: readahead folios are
not marked as anon / shmem (folio->mapping) until used. They do belong
to anon / shmem, but we don't add them to the mapping until that
mapping does a swap cache lookup and uses the cached folio.
Also maybe worth mentioning: the swap cache lookup convention requires
the caller to lock the folio and double check that it still matches the
swap entry before use (folio_matches_swap_entry()); folios there are
unstable and may no longer be valid swap cache folios unless locked.
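(A sketch of that lookup convention, with the names taken from this mail;
the exact signatures and reference-count semantics may differ between
kernel versions.)

static struct folio *swap_cache_lookup_locked_sketch(swp_entry_t entry)
{
	/* Assumes the lookup returns a referenced folio (or NULL). */
	struct folio *folio = swap_cache_get_folio(entry);

	if (!folio)
		return NULL;

	folio_lock(folio);
	if (!folio_matches_swap_entry(folio, entry)) {
		/* Raced: this folio no longer backs the swap entry. */
		folio_unlock(folio);
		folio_put(folio);
		return NULL;
	}

	return folio;	/* locked and verified swap cache folio */
}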
* Re: [PATCH] mm: Consider non-anon swap cache folios in folio_expected_ref_count()
2025-12-19 2:35 ` Kairui Song
@ 2025-12-20 0:47 ` Wei Yang
0 siblings, 0 replies; 10+ messages in thread
From: Wei Yang @ 2025-12-20 0:47 UTC (permalink / raw)
To: Kairui Song
Cc: Wei Yang, David Hildenbrand (Red Hat),
Zi Yan, Bijan Tabatabai, linux-mm, linux-kernel, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
shivankg, Baolin Wang, Hugh Dickins, Chris Li
On Fri, Dec 19, 2025 at 10:35:05AM +0800, Kairui Song wrote:
>On Fri, Dec 19, 2025 at 8:21 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Wed, Dec 17, 2025 at 02:04:16AM +0100, David Hildenbrand (Red Hat) wrote:
>> >> > >
>> >> > > I am not very familiar with the memory hot-(un)plug or swapping code, so
>> >> > > I am not 100% certain if this patch actually solves the root of the
>> >> > > problem. I believe the issue is from shmem folios, in which case I believe
>> >> > > this patch is correct. However, I couldn't think of an easy way to confirm
>> >> > > that the affected folios were from shmem. I guess it could be possible that
>> >> > > the root cause could be from some bug where some anonymous pages do not
>> >> > > return true to folio_test_anon(). I don't think that's the case, but
>> >> > > figured the MM maintainers would have a better idea of what's going on.
>> >>
>> >> I am not sure about if shmem in swapcache causes the issue, since
>> >> the above setup does not involve shmem. +Baolin and Hugh for some insight.
>> >
>> >We might just push out another unrelated shmem page to swap as we create
>> >memory pressure in the system I think.
>> >
>>
>> One trivial question: currently we only put anon/shmem folio in swapcache,
>> right?
>
>For swapout, yes, the entry point to move a folio to swap space is
>folio_alloc_swap, only anon and shmem can do that (vmscan.c ->
>folio_test_anon && folio_test_swapbacked, and shmem.c).
>
Thanks for this information.
>Swapin is a bit different because of readahead, readahead folios are
>not marked as anon / shmem (folio->mapping) until used, they do belong
>to anon / shmem though, but we don't add them to the mapping until
>that mapping does a swap cache lookup and use the cached folio.
>
I saw this. So there can be folios which are in the swapcache but are not
yet known to be anon/shmem.
>Also maybe worth mentioning, swap cache lookup convention requires the
>caller to lock the folio and double check folio still matches the swap
>entry before use (folio_matches_swap_entry), folios there are unstable
>and could no longer be a valid swap cache folio unless locked.
Thanks for this notice, will pay attention to this.
--
Wei Yang
Help you, Help me