[PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
@ 2025-03-03  6:09 Jinjiang Tu
  2025-03-03 10:25 ` David Hildenbrand
  2025-03-03 13:45 ` David Hildenbrand
  0 siblings, 2 replies; 9+ messages in thread
From: Jinjiang Tu @ 2025-03-03  6:09 UTC (permalink / raw)
  To: akpm, muchun.song, david, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong, tujinjiang

In dissolve_free_huge_page(), free huge pages are dissolved without
adjusting surplus count. However, free huge pages may be accounted as
surplus pages, and will lead to wrong surplus count.

Steps to reproduce:
1) create 64 huge pages for node1
2) run a program to reserve (don't consume) all the huge pages
3) echo 0 > nr_huge_pages for node1. After this step, free huge pages in
Node1 are surplus.
4) create 80 huge pages for node0
5) offline memory of node1 and kill the program in step 2)

The result:
           Node0     Node1
total       80        0
free        80        0
surplus     0         61

To fix it, adjust surplus when destroying huge pages if the node has
surplus pages in dissolve_free_hugetlb_folio().

The result with this patch:
           Node0     Node1
total       80        0
free        80        0
surplus     0         0

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/hugetlb.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 163190e89ea1..2a24ade9d157 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2135,6 +2135,8 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
 
 	if (!folio_ref_count(folio)) {
 		struct hstate *h = folio_hstate(folio);
+		bool adjust_surplus = false;
+
 		if (!available_huge_pages(h))
 			goto out;
 
@@ -2157,7 +2159,9 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
 			goto retry;
 		}
 
-		remove_hugetlb_folio(h, folio, false);
+		if (h->surplus_huge_pages_node[folio_nid(folio)])
+			adjust_surplus = true;
+		remove_hugetlb_folio(h, folio, adjust_surplus);
 		h->max_huge_pages--;
 		spin_unlock_irq(&hugetlb_lock);
 
@@ -2177,7 +2181,7 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
 			rc = hugetlb_vmemmap_restore_folio(h, folio);
 			if (rc) {
 				spin_lock_irq(&hugetlb_lock);
-				add_hugetlb_folio(h, folio, false);
+				add_hugetlb_folio(h, folio, adjust_surplus);
 				h->max_huge_pages++;
 				goto out;
 			}
-- 
2.43.0



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03  6:09 [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page() Jinjiang Tu
@ 2025-03-03 10:25 ` David Hildenbrand
  2025-03-03 11:16   ` Jinjiang Tu
  2025-03-03 13:45 ` David Hildenbrand
  1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-03-03 10:25 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong

On 03.03.25 07:09, Jinjiang Tu wrote:
> In dissolve_free_huge_page(), free huge pages are dissolved without
> adjusting surplus count. However, free huge pages may be accounted as
> surplus pages, and will lead to wrong surplus count.
> 
> Steps to reproduce:
> 1) create 64 huge pages for node1
> 2) run a program to reserve (don't consume) all the huge pages
> 3) echo 0 > nr_huge_pages for node1. After this step, free huge pages in
> Node1 are surplus.
> 4) create 80 huge pages for node0
> 5) offline memory of node1 and kill the program in step 2)

Can you elaborate the "offline memory" part? How much are you offlining 
(the complete node vs a single memory block?)

If I skip the offlining part, it all works as expected on a 6.12 kernel.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03 10:25 ` David Hildenbrand
@ 2025-03-03 11:16   ` Jinjiang Tu
  2025-03-03 11:23     ` Jinjiang Tu
  0 siblings, 1 reply; 9+ messages in thread
From: Jinjiang Tu @ 2025-03-03 11:16 UTC (permalink / raw)
  To: David Hildenbrand, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong


在 2025/3/3 18:25, David Hildenbrand 写道:
> On 03.03.25 07:09, Jinjiang Tu wrote:
>> In dissolve_free_huge_page(), free huge pages are dissolved without
>> adjusting surplus count. However, free huge pages may be accounted as
>> surplus pages, and will lead to wrong surplus count.
>>
>> Steps to reproduce:
>> 1) create 64 huge pages for node1
>> 2) run a program to reserve (don't consume) all the huge pages
>> 3) echo 0 > nr_huge_pages for node1. After this step, free huge pages in
>> Node1 are surplus.
>> 4) create 80 huge pages for node0
>> 5) offline memory of node1 and kill the program in step 2)
>
> Can you elaborate the "offline memory" part? How much are you 
> offlining (the complete node vs a single memory block?)
I reproduce this issue on qemu. The memory onlining and offlining 
operations are:

1）Node1 is memory-less at first. Hot-add memory to node1 by executing 
the two commands in qemu monitor
object_add memory-backend-ram,id=mem1,size=1G
device_add pc-dimm,id=dimm1,memdev=mem1,node=1

2）online one memory block of Node1 with：
echo online_movable > /sys/devices/system/node/node1/memoryX/state

3) offline the memory block of Node1 with：
echo offline > /sys/devices/system/node/node1/memoryX/state
>
> If I skip the offlining part, it all works as expected on a 6.12 kernel.
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03 11:16   ` Jinjiang Tu
@ 2025-03-03 11:23     ` Jinjiang Tu
  2025-03-03 11:47       ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread
From: Jinjiang Tu @ 2025-03-03 11:23 UTC (permalink / raw)
  To: David Hildenbrand, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong


在 2025/3/3 19:16, Jinjiang Tu 写道:
>
> 在 2025/3/3 18:25, David Hildenbrand 写道:
>> On 03.03.25 07:09, Jinjiang Tu wrote:
>>> In dissolve_free_huge_page(), free huge pages are dissolved without
>>> adjusting surplus count. However, free huge pages may be accounted as
>>> surplus pages, and will lead to wrong surplus count.
>>>
>>> Steps to reproduce:
>>> 1) create 64 huge pages for node1
>>> 2) run a program to reserve (don't consume) all the huge pages
>>> 3) echo 0 > nr_huge_pages for node1. After this step, free huge 
>>> pages in
>>> Node1 are surplus.
>>> 4) create 80 huge pages for node0
>>> 5) offline memory of node1 and kill the program in step 2)
>>
>> Can you elaborate the "offline memory" part? How much are you 
>> offlining (the complete node vs a single memory block?)
> I reproduce this issue on qemu. The memory onlining and offlining 
> operations are:
>
> 1）Node1 is memory-less at first. Hot-add memory to node1 by executing 
> the two commands in qemu monitor
> object_add memory-backend-ram,id=mem1,size=1G
> device_add pc-dimm,id=dimm1,memdev=mem1,node=1
>
> 2）online one memory block of Node1 with：
> echo online_movable > /sys/devices/system/node/node1/memoryX/state
>
> 3) offline the memory block of Node1 with：
> echo offline > /sys/devices/system/node/node1/memoryX/state
The memory range to offline contains the free surplus huge pages created 
in step1) ~ step3)
>>
>> If I skip the offlining part, it all works as expected on a 6.12 kernel.
>>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03 11:23     ` Jinjiang Tu
@ 2025-03-03 11:47       ` David Hildenbrand
  2025-03-03 12:10         ` Jinjiang Tu
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-03-03 11:47 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong

On 03.03.25 12:23, Jinjiang Tu wrote:
> 
> 在 2025/3/3 19:16, Jinjiang Tu 写道:
>>
>> 在 2025/3/3 18:25, David Hildenbrand 写道:
>>> On 03.03.25 07:09, Jinjiang Tu wrote:
>>>> In dissolve_free_huge_page(), free huge pages are dissolved without
>>>> adjusting surplus count. However, free huge pages may be accounted as
>>>> surplus pages, and will lead to wrong surplus count.
>>>>
>>>> Steps to reproduce:
>>>> 1) create 64 huge pages for node1
>>>> 2) run a program to reserve (don't consume) all the huge pages
>>>> 3) echo 0 > nr_huge_pages for node1. After this step, free huge
>>>> pages in
>>>> Node1 are surplus.
>>>> 4) create 80 huge pages for node0
>>>> 5) offline memory of node1 and kill the program in step 2)
>>>
>>> Can you elaborate the "offline memory" part? How much are you
>>> offlining (the complete node vs a single memory block?)
>> I reproduce this issue on qemu. The memory onlining and offlining
>> operations are:
>>
>> 1）Node1 is memory-less at first. Hot-add memory to node1 by executing
>> the two commands in qemu monitor
>> object_add memory-backend-ram,id=mem1,size=1G
>> device_add pc-dimm,id=dimm1,memdev=mem1,node=1
>>
>> 2）online one memory block of Node1 with：
>> echo online_movable > /sys/devices/system/node/node1/memoryX/state
>>
>> 3) offline the memory block of Node1 with：
>> echo offline > /sys/devices/system/node/node1/memoryX/state
> The memory range to offline contains the free surplus huge pages created
> in step1) ~ step3)

Okay, that makes it a lot clearer. Can you make that clearer in the 
patch description?

The problem appears when memory offlining stumbles over a free hugetlb 
folio. We'll call 
dissolve_free_hugetlb_folios()->dissolve_free_hugetlb_folio() to 
dissolve it, but don't adjust accounting with surplus pages properly.

I'll note that there is another dissolve_free_hugetlb_folio() caller in 
the memory-failure path that might similarly be affected.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03 11:47       ` David Hildenbrand
@ 2025-03-03 12:10         ` Jinjiang Tu
  0 siblings, 0 replies; 9+ messages in thread
From: Jinjiang Tu @ 2025-03-03 12:10 UTC (permalink / raw)
  To: David Hildenbrand, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong


在 2025/3/3 19:47, David Hildenbrand 写道:
> On 03.03.25 12:23, Jinjiang Tu wrote:
>>
>> 在 2025/3/3 19:16, Jinjiang Tu 写道:
>>>
>>> 在 2025/3/3 18:25, David Hildenbrand 写道:
>>>> On 03.03.25 07:09, Jinjiang Tu wrote:
>>>>> In dissolve_free_huge_page(), free huge pages are dissolved without
>>>>> adjusting surplus count. However, free huge pages may be accounted as
>>>>> surplus pages, and will lead to wrong surplus count.
>>>>>
>>>>> Steps to reproduce:
>>>>> 1) create 64 huge pages for node1
>>>>> 2) run a program to reserve (don't consume) all the huge pages
>>>>> 3) echo 0 > nr_huge_pages for node1. After this step, free huge
>>>>> pages in
>>>>> Node1 are surplus.
>>>>> 4) create 80 huge pages for node0
>>>>> 5) offline memory of node1 and kill the program in step 2)
>>>>
>>>> Can you elaborate the "offline memory" part? How much are you
>>>> offlining (the complete node vs a single memory block?)
>>> I reproduce this issue on qemu. The memory onlining and offlining
>>> operations are:
>>>
>>> 1）Node1 is memory-less at first. Hot-add memory to node1 by executing
>>> the two commands in qemu monitor
>>> object_add memory-backend-ram,id=mem1,size=1G
>>> device_add pc-dimm,id=dimm1,memdev=mem1,node=1
>>>
>>> 2）online one memory block of Node1 with：
>>> echo online_movable > /sys/devices/system/node/node1/memoryX/state
>>>
>>> 3) offline the memory block of Node1 with：
>>> echo offline > /sys/devices/system/node/node1/memoryX/state
>> The memory range to offline contains the free surplus huge pages created
>> in step1) ~ step3)
>
> Okay, that makes it a lot clearer. Can you make that clearer in the 
> patch description?
Sure, I will send a v2 patch later.
>
> The problem appears when memory offlining stumbles over a free hugetlb 
> folio. We'll call 
> dissolve_free_hugetlb_folios()->dissolve_free_hugetlb_folio() to 
> dissolve it, but don't adjust accounting with surplus pages properly.
>
> I'll note that there is another dissolve_free_hugetlb_folio() caller 
> in the memory-failure path that might similarly be affected.
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03  6:09 [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page() Jinjiang Tu
  2025-03-03 10:25 ` David Hildenbrand
@ 2025-03-03 13:45 ` David Hildenbrand
  2025-03-04  3:50   ` Jinjiang Tu
  1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-03-03 13:45 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong

> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2135,6 +2135,8 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
>   
>   	if (!folio_ref_count(folio)) {
>   		struct hstate *h = folio_hstate(folio);
> +		bool adjust_surplus = false;
> +
>   		if (!available_huge_pages(h))
>   			goto out;
>   
> @@ -2157,7 +2159,9 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
>   			goto retry;
>   		}
>   
> -		remove_hugetlb_folio(h, folio, false);
> +		if (h->surplus_huge_pages_node[folio_nid(folio)])
> +			adjust_surplus = true;

This change looks good to me

> +		remove_hugetlb_folio(h, folio, adjust_surplus);
>   		h->max_huge_pages--;
>   		spin_unlock_irq(&hugetlb_lock);
>   
> @@ -2177,7 +2181,7 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
>   			rc = hugetlb_vmemmap_restore_folio(h, folio);
>   			if (rc) {
>   				spin_lock_irq(&hugetlb_lock);
> -				add_hugetlb_folio(h, folio, false);
> +				add_hugetlb_folio(h, folio, adjust_surplus);

I'm not quite sure here, though. We dropped the hugetlb_lock, can't some 
weird concurrent action result in us not having to adjust surplus page 
anymore?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-03 13:45 ` David Hildenbrand
@ 2025-03-04  3:50   ` Jinjiang Tu
  2025-03-04 10:15     ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread
From: Jinjiang Tu @ 2025-03-04  3:50 UTC (permalink / raw)
  To: David Hildenbrand, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong


在 2025/3/3 21:45, David Hildenbrand 写道:
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2135,6 +2135,8 @@ int dissolve_free_hugetlb_folio(struct folio 
>> *folio)
>>         if (!folio_ref_count(folio)) {
>>           struct hstate *h = folio_hstate(folio);
>> +        bool adjust_surplus = false;
>> +
>>           if (!available_huge_pages(h))
>>               goto out;
>>   @@ -2157,7 +2159,9 @@ int dissolve_free_hugetlb_folio(struct folio 
>> *folio)
>>               goto retry;
>>           }
>>   -        remove_hugetlb_folio(h, folio, false);
>> +        if (h->surplus_huge_pages_node[folio_nid(folio)])
>> +            adjust_surplus = true;
>
> This change looks good to me
>
>> +        remove_hugetlb_folio(h, folio, adjust_surplus);
>>           h->max_huge_pages--;
>>           spin_unlock_irq(&hugetlb_lock);
>>   @@ -2177,7 +2181,7 @@ int dissolve_free_hugetlb_folio(struct folio 
>> *folio)
>>               rc = hugetlb_vmemmap_restore_folio(h, folio);
>>               if (rc) {
>>                   spin_lock_irq(&hugetlb_lock);
>> -                add_hugetlb_folio(h, folio, false);
>> +                add_hugetlb_folio(h, folio, adjust_surplus);
>
> I'm not quite sure here, though. We dropped the hugetlb_lock, can't 
> some weird concurrent action result in us not having to adjust surplus 
> page anymore?
In this case, we will get a free surplus folio without reserved.

The existing code has similar logic. In free_huge_folio()，when 
h->surplus_huge_pages_node[nid] != 0, 
update_and_free_hugetlb_folio->__update_and_free_hugetlb_folio
may fail due tohugetlb_vmemmap_restore_folio() too, and will treat it as 
surplus folio.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  2025-03-04  3:50   ` Jinjiang Tu
@ 2025-03-04 10:15     ` David Hildenbrand
  0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-03-04 10:15 UTC (permalink / raw)
  To: Jinjiang Tu, akpm, muchun.song, osalvador
  Cc: linux-mm, wangkefeng.wang, sunnanyong

On 04.03.25 04:50, Jinjiang Tu wrote:
> 
> 在 2025/3/3 21:45, David Hildenbrand 写道:
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -2135,6 +2135,8 @@ int dissolve_free_hugetlb_folio(struct folio
>>> *folio)
>>>          if (!folio_ref_count(folio)) {
>>>            struct hstate *h = folio_hstate(folio);
>>> +        bool adjust_surplus = false;
>>> +
>>>            if (!available_huge_pages(h))
>>>                goto out;
>>>    @@ -2157,7 +2159,9 @@ int dissolve_free_hugetlb_folio(struct folio
>>> *folio)
>>>                goto retry;
>>>            }
>>>    -        remove_hugetlb_folio(h, folio, false);
>>> +        if (h->surplus_huge_pages_node[folio_nid(folio)])
>>> +            adjust_surplus = true;
>>
>> This change looks good to me
>>
>>> +        remove_hugetlb_folio(h, folio, adjust_surplus);
>>>            h->max_huge_pages--;
>>>            spin_unlock_irq(&hugetlb_lock);
>>>    @@ -2177,7 +2181,7 @@ int dissolve_free_hugetlb_folio(struct folio
>>> *folio)
>>>                rc = hugetlb_vmemmap_restore_folio(h, folio);
>>>                if (rc) {
>>>                    spin_lock_irq(&hugetlb_lock);
>>> -                add_hugetlb_folio(h, folio, false);
>>> +                add_hugetlb_folio(h, folio, adjust_surplus);
>>
>> I'm not quite sure here, though. We dropped the hugetlb_lock, can't
>> some weird concurrent action result in us not having to adjust surplus
>> page anymore?
> In this case, we will get a free surplus folio without reserved.
> 
> The existing code has similar logic. In free_huge_folio()，when
> h->surplus_huge_pages_node[nid] != 0,
> update_and_free_hugetlb_folio->__update_and_free_hugetlb_folio
> may fail due tohugetlb_vmemmap_restore_folio() too, and will treat it as
> surplus folio.

Right, in that scenario we'll even unconditionally treat it as a surplus 
page.

So I guess it's fine (and it's a corner case either way ...)

So with an improved patch description:

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-03-04 10:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-03  6:09 [PATCH] mm/hugetlb: fix surplus pages in dissolve_free_huge_page() Jinjiang Tu
2025-03-03 10:25 ` David Hildenbrand
2025-03-03 11:16   ` Jinjiang Tu
2025-03-03 11:23     ` Jinjiang Tu
2025-03-03 11:47       ` David Hildenbrand
2025-03-03 12:10         ` Jinjiang Tu
2025-03-03 13:45 ` David Hildenbrand
2025-03-04  3:50   ` Jinjiang Tu
2025-03-04 10:15     ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox