From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 30 Oct 2024 11:04:46 +0800
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying", David Hildenbrand
CC: Andrew Morton, Matthew Wilcox, Muchun Song, linux-mm@kvack.org, Zi Yan
In-Reply-To: <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed

On 2024/10/30
9:04, Huang, Ying wrote:
> David Hildenbrand writes:
>
>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>
>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>> shouldn't
>>>>>>>>>
>>>>>>>>> Yes, let's fix the bug first,
>>>>>>>>>
>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>
>>>>>>>>> Not sure about this part, it uses nr_pages as the end and
>>>>>>>>> calculates the 'base'.
>>>>>>>>
>>>>>>>> It should be using folio_nr_pages().
>>>>>>>
>>>>>>> But process_huge_page() has no explicit folio argument; I'd like to
>>>>>>> move the aligned-address calculation into folio_zero_user() and
>>>>>>> copy_user_large_folio() (will rename it to folio_copy_user()) in the
>>>>>>> follow-up cleanup patches, or should it be done in the fix patches?
>>>>>>
>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>> folio*? Because we like our code to be extra complicated to understand?
>>>>>> Or am I missing something important?
>>>>>
>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before, and
>>>>> since anon mTHP support it is used for order-2 to PMD-order THP and
>>>>> HugeTLB, so it won't process a small folio, if I understand correctly.
>>>>
>>>> And unfortunately neither the documentation nor the function name
>>>> expresses that :(
>>>>
>>>> I'm happy to review any patches that improve the situation here :)
>>>>
>>> Actually, could we drop process_huge_page() entirely? From my
>>> testcase [1], process_huge_page() is no better than clearing/copying
>>> the page from start to end, and sequential clearing/copying may be
>>> more beneficial to hardware prefetching. Is there a way to let LKP
>>> run tests to check the performance? Since process_huge_page() was
>>> submitted by Ying, what's your opinion?
>
> I don't think that it's a good idea to revert the commit without
> studying and root-causing the issues. I can work together with you on
> that.
> If we have solid and well explained data to prove
> process_huge_page() isn't beneficial, we can revert the commit.

Take 'fallocate 20G' as an example. Before:

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,118.94 msec task-clock                #    0.999 CPUs utilized
                30      context-switches          #    0.010 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               136      page-faults               #    0.044 K/sec
     8,092,075,873      cycles                    #    2.594 GHz                        (92.82%)
     1,624,587,663      instructions              #    0.20  insn per cycle             (92.83%)
       395,341,850      branches                  #  126.755 M/sec                      (92.82%)
         3,872,302      branch-misses             #    0.98% of all branches            (92.83%)
     1,398,066,701      L1-dcache-loads           #  448.251 M/sec                      (92.82%)
        58,124,626      L1-dcache-load-misses     #    4.16% of all L1-dcache accesses  (92.82%)
         1,032,527      LLC-loads                 #    0.331 M/sec                      (92.82%)
           498,684      LLC-load-misses           #   48.30% of all LL-cache accesses   (92.84%)
       473,689,004      L1-icache-loads           #  151.875 M/sec                      (92.82%)
           356,721      L1-icache-load-misses     #    0.08% of all L1-icache accesses  (92.85%)
     1,947,644,987      dTLB-loads                #  624.458 M/sec                      (92.95%)
            10,185      dTLB-load-misses          #    0.00% of all dTLB cache accesses (92.96%)
       474,622,896      iTLB-loads                #  152.174 M/sec                      (92.95%)
                94      iTLB-load-misses          #    0.00% of all iTLB cache accesses (85.69%)

       3.122844830 seconds time elapsed
       0.000000000 seconds user
       3.107259000 seconds sys

And after (clearing from start to end):

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          1,135.53 msec task-clock                #    0.999 CPUs utilized
                10      context-switches          #    0.009 K/sec
                 1      cpu-migrations            #    0.001 K/sec
               137      page-faults               #    0.121 K/sec
     2,946,673,587      cycles                    #    2.595 GHz                        (92.67%)
     1,620,704,205      instructions              #    0.55  insn per cycle             (92.61%)
       394,595,772      branches                  #  347.499 M/sec                      (92.60%)
           130,756      branch-misses             #    0.03% of all branches            (92.84%)
     1,396,726,689      L1-dcache-loads           # 1230.022 M/sec                      (92.96%)
           338,344      L1-dcache-load-misses     #    0.02% of all L1-dcache accesses  (92.95%)
           111,737      LLC-loads                 #    0.098 M/sec                      (92.96%)
            67,486      LLC-load-misses           #   60.40% of all LL-cache accesses   (92.96%)
       418,198,663      L1-icache-loads           #  368.285 M/sec                      (92.96%)
           173,764      L1-icache-load-misses     #    0.04% of all L1-icache accesses  (92.96%)
     2,203,364,632      dTLB-loads                # 1940.385 M/sec                      (92.96%)
            17,195      dTLB-load-misses          #    0.00% of all dTLB cache accesses (92.95%)
       418,198,365      iTLB-loads                #  368.285 M/sec                      (92.96%)
                79      iTLB-load-misses          #    0.00% of all iTLB cache accesses (85.34%)

       1.137015760 seconds time elapsed
       0.000000000 seconds user
       1.131266000 seconds sys

The IPC improves a lot, with fewer LLC-loads and more L1-dcache-loads
per second, but this depends on the implementation of the
microarchitecture.

1) I will run some random-access tests to check the performance
   difference, as David suggested.
2) I hope LKP can run more tests, since it is very useful (more test
   sets and different machines).

>
>> I questioned that just recently [1], and Ying assumed that it still
>> applies [2].
>>
>> c79b57e462b5 ("mm: hugetlb: clear target sub-page last when clearing
>> huge page") documents the scenario where this matters -- anon-w-seq,
>> which you also run below.
>>
>> If there is no performance benefit anymore, we should rip that
>> out. But likely we should check on multiple micro-architectures with
>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>> cannot replicate that?
>>
>> [1] https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>> [2] https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>>> [1] https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>> case-anon-w-seq-mt (tried 2M PMD THP / 64K mTHP)
>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>
>> But these are sequential, not random. I'd have thought access +
>> zeroing would be sequential either way. Did you run with random
>> access as well?

Will do.

> --
> Best Regards,
> Huang, Ying