From: David Hildenbrand
Date: Sat, 29 Jun 2024 11:45:22 +0200
Subject: Re: [Question] performance regression after VM migration due to anon THP split in CoW
To: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Kefeng Wang, Nanyong Sun, aarcange@redhat.com, akpm@linux-foundation.org, baohua@kernel.org, baolin.wang@linux.alibaba.com, jhubbard@nvidia.com, kirill.shutemov@linux.intel.com, linux-mm@kvack.org, mike.kravetz@oracle.com, rcampbell@nvidia.com, william.kucharski@oracle.com, yang.shi@linux.alibaba.com, ziy@nvidia.com

Hi,

Likely the
mailing lists won't like my mail from this Google Mail client ;)

Jinjiang Tu <tujinjiang@huawei.com> wrote on Sat, 29 June 2024 at 11:18:

> Hi,
>
> We noticed a performance regression in the memtester[1] benchmark after
> upgrading the kernel. THP is enabled by default
> (/sys/kernel/mm/transparent_hugepage/enabled is set to "always"). The
> issue arises when we migrate a virtual machine that has 125G of total
> memory and 124G of free memory to another host and then run the command
> `memtester 120G` in the VM. The benchmark takes about 20 seconds to
> consume 120G of memory on v4.18, but about 160 seconds on v5.10. The
> issue exists in the mainline kernel, too.

Simple: use preallocation in QEMU. "prealloc=on" for host memory
backends, for example.

> We find that commit 3917c80280c9 ("thp: change CoW semantics for
> anon-THP") leads to the performance regression. Since this commit, when
> we trigger a write fault on an anon THP, we split the PMD and allocate a
> 4K page instead of allocating a full anon THP. When a VM is migrated
> (based on qemu[2]), if a page is marked as a zero page in the source VM,
> the destination VM will call mmap and read the region to allocate
> memory, leaving the region mapped by the zero THP. When we run memtester
> in the destination VM after the migration finishes, memtester (in the
> VM) allocates large amounts of free memory and writes to it, causing CoW
> of anon THPs and THP splits, and hence the performance regression. After
> reverting this commit, the regression disappears.

You talk about CoW of anon THP, but your scenario really only relies on
CoW of the huge zeropage.

Wouldn't you get a similar result when disabling the huge zeropage?

> This commit optimises some scenarios such as Redis, but may lead to
> performance regression in some other scenarios, such as VM migration.
> How could we solve this issue?
> Maybe we could add a new sysctl to let users decide whether to CoW the
> full anon THP or not?

I'm not convinced the use case you present really warrants a toggle for
that. In your case you only want to change the semantics of CoW faults on
the huge zeropage. But ...

Using preallocation in QEMU will give you all anon THPs right from the
start, avoiding any CoW. Sure, you consume all the memory right away, but
after all, that's what your use case triggers either way. And it might
all be even faster. :)

Cheers!

> Thanks.
>
> [1] https://github.com/jnavila/memtester/tree/master
> [2] https://github.com/qemu/qemu/blob/master/migration/ram.c
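In case it helps, here is roughly what the suggested setup could look like. This is only a sketch: the backend id and sizes are placeholders to adapt to the actual VM, and the sysfs knob shown is the one you would use for the huge-zeropage comparison:

```shell
# Check whether the huge zeropage is currently in use (1 = enabled):
cat /sys/kernel/mm/transparent_hugepage/use_zero_page

# Temporarily disable it for the comparison experiment:
echo 0 | sudo tee /sys/kernel/mm/transparent_hugepage/use_zero_page

# Preallocate guest RAM so it is THP-backed right from the start;
# "prealloc=on" is the relevant bit, id/size are placeholders:
qemu-system-x86_64 \
    -m 125G \
    -object memory-backend-ram,id=ram0,size=125G,prealloc=on \
    -machine memory-backend=ram0   # ... plus the remaining VM options
```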