* [Question] performance regression after VM migration due to anon THP split in CoW
@ 2024-06-29 9:18 Jinjiang Tu
2024-06-29 9:45 ` David Hildenbrand
0 siblings, 1 reply; 4+ messages in thread
From: Jinjiang Tu @ 2024-06-29 9:18 UTC (permalink / raw)
To: akpm, kirill.shutemov, ziy, william.kucharski, yang.shi
Cc: aarcange, jhubbard, mike.kravetz, rcampbell, Kefeng Wang,
Nanyong Sun, baohua, David Hildenbrand, baolin.wang, linux-mm
Hi,
We noticed a performance regression in the memtester[1] benchmark after
upgrading the kernel. THP is enabled by default
(/sys/kernel/mm/transparent_hugepage/enabled is set to "always"). The issue
arises when we migrate a virtual machine that has 125G total memory and
124G free memory to another host and then run `memtester 120G` in the VM.
The benchmark takes about 20 seconds to consume 120G of memory on v4.18,
but about 160 seconds on v5.10. The issue exists in the mainline kernel too.
We find that commit 3917c80280c9 ("thp: change CoW semantics for anon-THP")
leads to the performance regression. Since this commit, when we trigger a
write fault on an anon THP, we split the PMD and allocate a 4K page instead
of allocating a full anon THP. When a VM is migrated (based on qemu[2]),
pages marked as zero pages in the source VM are populated on the destination
by calling mmap and reading the region, which leaves the region mapped by
the huge zero page. When we run memtester in the destination VM after the
migration finishes, memtester (in the VM) allocates large amounts of free
memory and writes to it, causing CoW faults and THP splits, which in turn
cause the performance regression. After reverting this commit, the
regression disappears.
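For reference, the split behavior described above can be observed from inside
the VM via the THP counters in /proc/vmstat (a rough sketch; counter names
can vary slightly by kernel version, and the memtester invocation is only an
example workload):

```shell
# Snapshot the THP counters before and after the workload. On an affected
# kernel, thp_split_pmd grows roughly in step with the number of
# huge-zeropage-backed PMDs that take a write fault.
grep -E 'thp_fault_alloc|thp_split_pmd' /proc/vmstat > /tmp/thp.before
# ... run the workload here, e.g.: memtester 120G 1
grep -E 'thp_fault_alloc|thp_split_pmd' /proc/vmstat > /tmp/thp.after
diff /tmp/thp.before /tmp/thp.after || true
```

A large jump in thp_split_pmd without a matching jump in thp_fault_alloc is
the signature of the CoW-split path rather than fresh THP allocation.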
This commit optimises some scenarios such as Redis, but may lead to
performance regression in some other scenarios, such as VM migration.
How could we solve this issue? Maybe we could add a new sysctl to let users
decide whether to CoW the full anon THP or not?
Thanks.
[1] https://github.com/jnavila/memtester/tree/master
[2] https://github.com/qemu/qemu/blob/master/migration/ram.c
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Question] performance regression after VM migration due to anon THP split in CoW
2024-06-29 9:18 [Question] performance regression after VM migration due to anon THP split in CoW Jinjiang Tu
@ 2024-06-29 9:45 ` David Hildenbrand
2024-07-04 13:31 ` Jinjiang Tu
From: David Hildenbrand @ 2024-06-29 9:45 UTC (permalink / raw)
To: Jinjiang Tu
Cc: Kefeng Wang, Nanyong Sun, aarcange, akpm, baohua, baolin.wang,
jhubbard, kirill.shutemov, linux-mm, mike.kravetz, rcampbell,
william.kucharski, yang.shi, ziy
Hi,
Likely the mailing lists won't like my mail from this Google Mail client ;)
Jinjiang Tu <tujinjiang@huawei.com> wrote on Sat, 29 Jun 2024 at 11:18:
> Hi,
>
> We noticed a performance regression in benchmark memtester[1] after
> upgrading the kernel. THP is enabled by default
> (/sys/kernel/mm/transparent_hugepage/enabled
> is set to "always"). The issue arises when we migrate a virtual machine
> that has 125G total memory and 124G free memory to another host. And then,
> we run the command `memtester 120G` in the VM. The benchmark takes about
> 20 seconds to consume 120G memory in v4.18, but takes about 160 seconds in
> v5.10. This issue exists in mainline kernel too.
>
Simple: use preallocation in QEMU, e.g. "prealloc=on" for host memory
backends.
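For example (a sketch only; the backend type, object id, and sizes here are
illustrative and need to be adapted to the actual VM configuration):

```shell
# Back guest RAM with a preallocated host memory backend, so all pages
# are touched (and can be THP-backed) before the guest starts, instead
# of being populated lazily via CoW faults after migration:
qemu-system-x86_64 \
  -m 125G \
  -object memory-backend-ram,id=mem0,size=125G,prealloc=on \
  -machine memory-backend=mem0 \
  ...
```

The trade-off, as noted below, is that the full 125G is consumed on the host
up front rather than on first guest access.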
> We find commit 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> leads to the performance regression. Since this commit, When we trigger a
> write fault on a anon THP, we split the PMD and allocate a 4K page, instead
> of allocating the full anon THP. When a VM is migrating (based on qemu[2]),
> if the page is marked zero page in the source VM, the destination VM will
> call mmap and read the region to allocate memory, making the region mapped
> by the zero THP. When we run memtester in the destination VM after VM
> migration finishes, memtester(in VM) will allocate large amounts of free
> memory and write to them, cause CoW of anon THP and THP split, further
> cause performance regression. After reverting this commit, performance
> regression disappears.
You talk about CoW of anon THP, but your scenario really only relies on
CoW of the huge zeropage.
Wouldn't you get a similar result when disabling the huge zeropage?
>
> This commit optimises some scenarios such as Redis, but may lead to
> performance regression in some other scenarios, such as VM migration.
> How could we solve this issue? Maybe we could add a new sysctl to let users
> decide whether to CoW the full anon THP or not?
>
I'm not convinced the use case you present really warrants a toggle for
that. In your case you only want to change the semantics of CoW faults on
the huge zeropage. But …
Using preallocation in QEMU will give you all anon THPs right from the
start, avoiding any CoW. Sure, you consume all memory right away, but after
all, that's what your use case triggers either way. And it might all be even
faster. :)
Cheers!
> Thanks.
>
> [1] https://github.com/jnavila/memtester/tree/master
> [2] https://github.com/qemu/qemu/blob/master/migration/ram.c
>
>
>
* Re: [Question] performance regression after VM migration due to anon THP split in CoW
2024-06-29 9:45 ` David Hildenbrand
@ 2024-07-04 13:31 ` Jinjiang Tu
2024-07-04 13:55 ` David Hildenbrand
From: Jinjiang Tu @ 2024-07-04 13:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kefeng Wang, Nanyong Sun, aarcange, akpm, baohua, baolin.wang,
jhubbard, kirill.shutemov, linux-mm, mike.kravetz, rcampbell,
william.kucharski, yang.shi, ziy
On 2024/6/29 17:45, David Hildenbrand wrote:
> Hi,
>
> Likely the mailing lists won‘t like my mail from this Google Mail
> client ;)
>
> Jinjiang Tu <tujinjiang@huawei.com> wrote on Sat, 29 Jun 2024 at 11:18:
>
> Hi,
>
> We noticed a performance regression in benchmark memtester[1] after
> upgrading the kernel. THP is enabled by default
> (/sys/kernel/mm/transparent_hugepage/enabled
> is set to "always"). The issue arises when we migrate a virtual
> machine
> that has 125G total memory and 124G free memory to another host.
> And then,
> we run the command `memtester 120G` in the VM. The benchmark takes
> about
> 20 seconds to consume 120G memory in v4.18, but takes about 160
> seconds in
> v5.10. This issue exists in mainline kernel too.
>
>
> Simple: use preallocation in QEMU. „prealloc=on“ for host memory
> backends, for example.
>
>
> We find commit 3917c80280c9 ("thp: change CoW semantics for anon-THP")
> leads to the performance regression. Since this commit, When we
> trigger a
> write fault on a anon THP, we split the PMD and allocate a 4K
> page, instead
> of allocating the full anon THP. When a VM is migrating (based on
> qemu[2]),
> if the page is marked zero page in the source VM, the destination
> VM will
> call mmap and read the region to allocate memory, making the
> region mapped
> by the zero THP. When we run memtester in the destination VM after VM
> migration finishes, memtester(in VM) will allocate large amounts
> of free
> memory and write to them, cause CoW of anon THP and THP split, further
> cause performance regression. After reverting this commit, performance
> regression disappears.
>
>
> You talk about COW of anon THP, whereby your scenario really only
> relied on COW of the huge zeropage.
>
> Wouldn’t you would get a similar result when disabling the huge zeropage?
>
>
>
> This commit optimises some scenarios such as Redis, but may lead to
> performance regression in some other scenarios, such as VM migration.
> How could we solve this issue? Maybe we could add a new sysctl to
> let users
> decide whether to CoW the full anon THP or not?
>
>
> I‘m not convinced the use case you present really warrants a toggle
> for that. In your case you only want to change semantics on COW fault
> to the huge zeropage. But …
>
> Using preallocation in QEMU will give you all anon THP right from the
> start, avoiding any cow. Sure, you consume all memory right away, but
> after all that‘s what your use case triggers either way. And it might
> all be even faster. :)
>
> Cheers!
>
Thanks for the reply. Both methods work, but they both lead to large
memory consumption even though the VM doesn't need that much memory right now.
>
>
> Thanks.
>
> [1] https://github.com/jnavila/memtester/tree/master
> [2] https://github.com/qemu/qemu/blob/master/migration/ram.c
>
>
* Re: [Question] performance regression after VM migration due to anon THP split in CoW
2024-07-04 13:31 ` Jinjiang Tu
@ 2024-07-04 13:55 ` David Hildenbrand
From: David Hildenbrand @ 2024-07-04 13:55 UTC (permalink / raw)
To: Jinjiang Tu
Cc: Kefeng Wang, Nanyong Sun, aarcange, akpm, baohua, baolin.wang,
jhubbard, kirill.shutemov, linux-mm, mike.kravetz, rcampbell,
william.kucharski, yang.shi, ziy
On 04.07.24 15:31, Jinjiang Tu wrote:
>
> On 2024/6/29 17:45, David Hildenbrand wrote:
>> Hi,
>>
>> Likely the mailing lists won‘t like my mail from this Google Mail
>> client ;)
>>
>> Jinjiang Tu <tujinjiang@huawei.com> wrote on Sat, 29 Jun 2024 at 11:18:
>>
>> Hi,
>>
>> We noticed a performance regression in benchmark memtester[1] after
>> upgrading the kernel. THP is enabled by default
>> (/sys/kernel/mm/transparent_hugepage/enabled
>> is set to "always"). The issue arises when we migrate a virtual
>> machine
>> that has 125G total memory and 124G free memory to another host.
>> And then,
>> we run the command `memtester 120G` in the VM. The benchmark takes
>> about
>> 20 seconds to consume 120G memory in v4.18, but takes about 160
>> seconds in
>> v5.10. This issue exists in mainline kernel too.
>>
>>
>> Simple: use preallocation in QEMU. „prealloc=on“ for host memory
>> backends, for example.
>>
>>
>> We find commit 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>> leads to the performance regression. Since this commit, When we
>> trigger a
>> write fault on a anon THP, we split the PMD and allocate a 4K
>> page, instead
>> of allocating the full anon THP. When a VM is migrating (based on
>> qemu[2]),
>> if the page is marked zero page in the source VM, the destination
>> VM will
>> call mmap and read the region to allocate memory, making the
>> region mapped
>> by the zero THP. When we run memtester in the destination VM after VM
>> migration finishes, memtester(in VM) will allocate large amounts
>> of free
>> memory and write to them, cause CoW of anon THP and THP split, further
>> cause performance regression. After reverting this commit, performance
>> regression disappears.
>>
>>
>> You talk about COW of anon THP, whereby your scenario really only
>> relied on COW of the huge zeropage.
>>
>> Wouldn’t you would get a similar result when disabling the huge zeropage?
>>
>>
>>
>> This commit optimises some scenarios such as Redis, but may lead to
>> performance regression in some other scenarios, such as VM migration.
>> How could we solve this issue? Maybe we could add a new sysctl to
>> let users
>> decide whether to CoW the full anon THP or not?
>>
>>
>> I‘m not convinced the use case you present really warrants a toggle
>> for that. In your case you only want to change semantics on COW fault
>> to the huge zeropage. But …
>>
>> Using preallocation in QEMU will give you all anon THP right from the
>> start, avoiding any cow. Sure, you consume all memory right away, but
>> after all that‘s what your use case triggers either way. And it might
>> all be even faster. :)
>>
>> Cheers!
>>
> Thanks for reply. The two methods both work. But they both lead to large
> memory consumption even though the VM doesn't need so much memory right now.
Please see
https://lkml.kernel.org/r/1cfae0c0-96a2-4308-9c62-f7a640520242@arm.com
for a related discussion.
--
Cheers,
David / dhildenb