From: David Hildenbrand
Date: Sat, 29 Jun 2024 11:45:22 +0200
Subject: Re: [Question] performance regression after VM migration due to anon THP split in CoW
To: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Kefeng Wang, Nanyong Sun, aarcange@redhat.com, akpm@linux-foundation.org, baohua@kernel.org, baolin.wang@linux.alibaba.com, jhubbard@nvidia.com, kirill.shutemov@linux.intel.com, linux-mm@kvack.org, mike.kravetz@oracle.com, rcampbell@nvidia.com, william.kucharski@oracle.com, yang.shi@linux.alibaba.com, ziy@nvidia.com

Hi,

Likely the
mailing lists won't like my mail from this Google Mail client ;)

Jinjiang Tu <tujinjiang@huawei.com> wrote on Sat, 29 June 2024 at 11:18:

> Hi,
>
> We noticed a performance regression in the memtester[1] benchmark after
> upgrading the kernel. THP is enabled by default
> (/sys/kernel/mm/transparent_hugepage/enabled is set to "always"). The
> issue arises when we migrate a virtual machine that has 125G of total
> memory and 124G of free memory to another host and then run the command
> `memtester 120G` in the VM. The benchmark takes about 20 seconds to
> consume 120G of memory on v4.18, but about 160 seconds on v5.10. The
> issue exists in the mainline kernel, too.

Simple: use preallocation in QEMU. "prealloc=on" for host memory
backends, for example.

> We find that commit 3917c80280c9 ("thp: change CoW semantics for
> anon-THP") leads to the performance regression. Since this commit, when
> we trigger a write fault on an anon THP, we split the PMD and allocate a
> 4K page instead of allocating a full anon THP. When a VM is migrated
> (based on qemu[2]), if a page is marked as a zero page in the source VM,
> the destination VM will call mmap and read the region to allocate
> memory, leaving the region mapped by the zero THP. When we run memtester
> in the destination VM after the migration finishes, memtester (in the
> VM) allocates large amounts of free memory and writes to it, causing CoW
> of anon THPs and THP splits, and hence the performance regression. After
> reverting this commit, the regression disappears.

You talk about CoW of anon THP, but your scenario really only relies on
CoW of the huge zeropage.

Wouldn't you get a similar result when disabling the huge zeropage?

> This commit optimises some scenarios such as Redis, but may lead to
> performance regression in some other scenarios, such as VM migration.
> How could we solve this issue?
> Maybe we could add a new sysctl to let users decide whether to CoW the
> full anon THP or not?

I'm not convinced the use case you present really warrants a toggle for
that. In your case you only want to change the semantics of CoW faults on
the huge zeropage. But ...

Using preallocation in QEMU will give you all anon THPs right from the
start, avoiding any CoW. Sure, you consume all the memory right away, but
after all, that's what your use case triggers either way. And it might
all be even faster. :)

Cheers!

> Thanks.
>
> [1] https://github.com/jnavila/memtester/tree/master
> [2] https://github.com/qemu/qemu/blob/master/migration/ram.c
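In case it helps, here is roughly what the suggested setup could look like. This is only a sketch: the backend id and sizes are placeholders to adapt to the actual VM, and the sysfs knob shown is the one you would use for the huge-zeropage comparison:

```shell
# Check whether the huge zeropage is currently in use (1 = enabled):
cat /sys/kernel/mm/transparent_hugepage/use_zero_page

# Temporarily disable it for the comparison experiment:
echo 0 | sudo tee /sys/kernel/mm/transparent_hugepage/use_zero_page

# Preallocate guest RAM so it is THP-backed right from the start;
# "prealloc=on" is the relevant bit, id/size are placeholders:
qemu-system-x86_64 \
    -m 125G \
    -object memory-backend-ram,id=ram0,size=125G,prealloc=on \
    -machine memory-backend=ram0   # ... plus the remaining VM options
```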