From: Jinjiang Tu
Date: Sat, 29 Jun 2024 17:18:33 +0800
Subject: [Question] performance regression after VM migration due to anon THP split in CoW
Message-ID: <740d7379-3e3d-4c8c-4350-6c496969db1f@huawei.com>
CC: Kefeng Wang, Nanyong Sun, David Hildenbrand

Hi,

We noticed a performance regression in the memtester[1] benchmark after upgrading the kernel. THP is enabled by default (/sys/kernel/mm/transparent_hugepage/enabled is set to "always").

The issue arises after we migrate a virtual machine with 125G of total memory and 124G of free memory to another host and then run `memtester 120G` in the VM. The benchmark takes about 20 seconds to consume 120G of memory on v4.18, but about 160 seconds on v5.10. The issue also exists in the mainline kernel.
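
For scale, the figures above can be worked through with a few lines of Python. This is only an illustrative sketch and not part of the original measurement: the helper names are hypothetical, "120G" is assumed to mean 120 GiB, and 2 MiB PMD-sized THPs and 4 KiB base pages are assumed (the x86-64 defaults). The fault counts assume each first-touch write fault maps exactly one unit of the given size; they are not measured values.

```python
GIB = 1 << 30
THP_SIZE = 2 << 20    # assumption: PMD-sized THP is 2 MiB
BASE_PAGE = 4 << 10   # assumption: base page is 4 KiB


def active_thp_mode(sysfs_text):
    """Return the bracketed (active) value of transparent_hugepage/enabled,
    e.g. "always" for the text "[always] madvise never"."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active mode found in %r" % sysfs_text)


def fill_rate_gib_per_s(size_gib, seconds):
    """Average rate at which the benchmark consumed memory."""
    return size_gib / seconds


def faults_needed(size_bytes, unit):
    """Write faults needed if each fault maps one unit of `unit` bytes."""
    return size_bytes // unit


old_rate = fill_rate_gib_per_s(120, 20)     # v4.18 figure from the report
new_rate = fill_rate_gib_per_s(120, 160)    # v5.10 figure from the report
slowdown = old_rate / new_rate
thp_faults = faults_needed(120 * GIB, THP_SIZE)
page_faults = faults_needed(120 * GIB, BASE_PAGE)
```

Under these assumptions the reported times correspond to about 6 GiB/s versus 0.75 GiB/s (an 8x slowdown), and filling the region one 4 KiB page at a time takes 512 times as many write faults as filling it one 2 MiB THP at a time. To check the THP mode on a live system, pass the contents of /sys/kernel/mm/transparent_hugepage/enabled to active_thp_mode().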

We found that commit 3917c80280c9 ("thp: change CoW semantics for anon-THP") causes the performance regression. Since this commit, when a write fault triggers CoW on an anon THP, the kernel splits the PMD and allocates a 4K page instead of allocating a full anon THP.

When a VM is migrated (with qemu[2]) and a page is marked as a zero page in the source VM, the destination calls mmap and reads the region to allocate memory, so the region ends up mapped by the zero THP. When we run memtester in the destination VM after the migration finishes, memtester (inside the VM) allocates a large amount of free memory and writes to it. These writes trigger CoW of the anon THP and a THP split, which in turn causes the performance regression. After reverting this commit, the regression disappears.

This commit optimises some scenarios, such as Redis, but may cause performance regressions in others, such as VM migration. How could we solve this issue? Maybe we could add a new sysctl to let users decide whether to CoW the full anon THP or not?

Thanks.

[1] https://github.com/jnavila/memtester/tree/master
[2] https://github.com/qemu/qemu/blob/master/migration/ram.c
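
The access pattern described above (mmap, read, then write) can be approximated in a single userspace process. The sketch below is an assumption-laden stand-in, not QEMU's migration code and not memtester: the function names are hypothetical, it assumes Linux with THP set to "always", 2 MiB THPs and 4 KiB base pages, and it only mimics the order of accesses. Whether the read pass really maps the zero THP depends on the kernel configuration and on the alignment of the mapping.

```python
import mmap
import time

HUGE = 2 << 20   # assumption: PMD-sized THP is 2 MiB
PAGE = 4 << 10   # assumption: base page is 4 KiB


def touch(region, size, step, write):
    """Touch one byte every `step` bytes; return the number of touches."""
    count = 0
    for off in range(0, size, step):
        if write:
            region[off] = 1      # write fault on first touch
        else:
            _ = region[off]      # read fault on first touch
        count += 1
    return count


def run(size, read_first):
    """Map `size` bytes of private anonymous memory, optionally read it
    first (as the migration destination is described to do), then time
    a pass that writes one byte to every base page."""
    region = mmap.mmap(-1, size,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        reads = touch(region, size, HUGE, write=False) if read_first else 0
        begin = time.perf_counter()
        writes = touch(region, size, PAGE, write=True)
        elapsed = time.perf_counter() - begin
    finally:
        region.close()
    return reads, writes, elapsed
```

Comparing the elapsed time of run(1 << 30, True) with run(1 << 30, False) on kernels before and after the commit should show whether the read-then-write case is the slow one. The Python loop adds its own overhead, so only the relative difference between the two calls is meaningful.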