From: "Huang, Ying"
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song, Zi Yan
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
In-Reply-To: (Kefeng Wang's message of "Wed, 30 Oct 2024 11:04:46 +0800")
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 30 Oct 2024 11:21:18 +0800
Message-ID: <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>

Kefeng Wang writes:

> On 2024/10/30 9:04, Huang, Ying wrote:
>> David Hildenbrand writes:
>>
>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>
>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>> shouldn't
>>>>>>>>>>
>>>>>>>>>> Yes, let's fix the bug first,
>>>>>>>>>>
>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>
>>>>>>>>>> Not sure about this part, it uses nr_pages as the end and
>>>>>>>>>> calculates the 'base'.
>>>>>>>>>
>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>
>>>>>>>> But process_huge_page() doesn't take an explicit folio argument.
>>>>>>>> I'd like to move the aligned address calculation into
>>>>>>>> folio_zero_user() and copy_user_large_folio() (will rename it to
>>>>>>>> folio_copy_user()) in the following cleanup patches, or should I
>>>>>>>> do it in the fix patches?
>>>>>>>
>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a
>>>>>>> small folio*? Because we like our code to be extra complicated to
>>>>>>> understand? Or am I missing something important?
>>>>>>
>>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before;
>>>>>> after anon mTHP was supported, it is used for order-2 to PMD-order
>>>>>> THP and HugeTLB, so it won't process a small folio if I understand
>>>>>> correctly.
>>>>>
>>>>> And unfortunately neither the documentation nor the function name
>>>>> expresses that :(
>>>>>
>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>
>>>> Actually, could we drop process_huge_page() entirely? From my
>>>> testcase[1], process_huge_page() is no better than clearing/copying
>>>> the page from start to end, and sequential clearing/copying may be
>>>> more beneficial to the hardware prefetcher. Is there a way to let LKP
>>>> run tests to check the performance? Since process_huge_page() was
>>>> submitted by Ying, what's your opinion?

>> I don't think that it's a good idea to revert the commit without
>> studying and root-causing the issues. I can work together with you on
>> that. If we have solid and well-explained data proving that
>> process_huge_page() isn't beneficial, we can revert the commit.
>
> Take 'fallocate 20G' as an example, before
>
> Performance counter stats for 'taskset -c 10 fallocate -l 20G
> /mnt/hugetlbfs/test':

IIUC, fallocate will zero the pages but will not touch them at all,
right? If so, there is no cache benefit from clearing the referenced
page last.
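To make the ordering question concrete, here is a minimal user-space
sketch (not the kernel implementation; the sizes, names and the exact
convergence order below are made up purely for illustration) of "clear
from start to end" versus "clear the target sub-page last". The second
order can only pay off when the target sub-page is touched right after
clearing, as on the page fault path; a pure zeroing path like the
fallocate case above never touches the page again.

/*
 * Minimal user-space sketch of the two clearing orders under discussion.
 * This is NOT the kernel code: function names, sizes and the convergence
 * order are made up for illustration only.
 *
 *   clear_sequential()  - clear sub-pages from start to end, like walking
 *                         the folio linearly.
 *   clear_target_last() - clear the sub-pages far from the faulting
 *                         address first and the target sub-page last, the
 *                         idea behind process_huge_page(): the target's
 *                         cache lines are still hot when user space
 *                         touches it right after the fault.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL
#define HUGE_SIZE	(2UL * 1024 * 1024)	/* pretend 2M huge page */
#define NR_SUBPAGES	(HUGE_SIZE / SUBPAGE_SIZE)

static void clear_sequential(char *huge)
{
	for (unsigned long i = 0; i < NR_SUBPAGES; i++)
		memset(huge + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

static void clear_target_last(char *huge, unsigned long target)
{
	unsigned long lo = 0, hi = NR_SUBPAGES - 1;

	/* Always clear whichever end is farther from 'target' first. */
	while (lo < hi) {
		unsigned long far = (target - lo >= hi - target) ? lo++ : hi--;

		memset(huge + far * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
	}
	/* lo == hi == target here: the faulting sub-page is cleared last. */
	memset(huge + target * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

int main(void)
{
	char *huge = aligned_alloc(HUGE_SIZE, HUGE_SIZE);
	unsigned long target = NR_SUBPAGES / 2;	/* pretend faulting address */

	if (!huge)
		return 1;

	clear_sequential(huge);
	clear_target_last(huge, target);

	/*
	 * Only an access like this one, right after clearing, can benefit
	 * from the target sub-page still sitting in the CPU cache.
	 */
	huge[target * SUBPAGE_SIZE] = 1;
	printf("%d\n", huge[target * SUBPAGE_SIZE]);

	free(huge);
	return 0;
}

Built with something like gcc -O2, the only access that can see the
cache benefit is the final store to huge[target * SUBPAGE_SIZE];
everything before that point is pure clearing, which is what the
fallocate numbers below measure.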
>       3,118.94 msec task-clock            #    0.999 CPUs utilized
>             30      context-switches      #    0.010 K/sec
>              1      cpu-migrations        #    0.000 K/sec
>            136      page-faults           #    0.044 K/sec
>  8,092,075,873      cycles                #    2.594 GHz                        (92.82%)
>  1,624,587,663      instructions          #    0.20 insn per cycle              (92.83%)
>    395,341,850      branches              #  126.755 M/sec                      (92.82%)
>      3,872,302      branch-misses         #    0.98% of all branches            (92.83%)
>  1,398,066,701      L1-dcache-loads       #  448.251 M/sec                      (92.82%)
>     58,124,626      L1-dcache-load-misses #    4.16% of all L1-dcache accesses  (92.82%)
>      1,032,527      LLC-loads             #    0.331 M/sec                      (92.82%)
>        498,684      LLC-load-misses       #   48.30% of all LL-cache accesses   (92.84%)
>    473,689,004      L1-icache-loads       #  151.875 M/sec                      (92.82%)
>        356,721      L1-icache-load-misses #    0.08% of all L1-icache accesses  (92.85%)
>  1,947,644,987      dTLB-loads            #  624.458 M/sec                      (92.95%)
>         10,185      dTLB-load-misses      #    0.00% of all dTLB cache accesses (92.96%)
>    474,622,896      iTLB-loads            #  152.174 M/sec                      (92.95%)
>             94      iTLB-load-misses      #    0.00% of all iTLB cache accesses (85.69%)
>
>    3.122844830 seconds time elapsed
>
>    0.000000000 seconds user
>    3.107259000 seconds sys
>
> and after (clear from start to end)
>
> Performance counter stats for 'taskset -c 10 fallocate -l 20G
> /mnt/hugetlbfs/test':
>
>       1,135.53 msec task-clock            #    0.999 CPUs utilized
>             10      context-switches      #    0.009 K/sec
>              1      cpu-migrations        #    0.001 K/sec
>            137      page-faults           #    0.121 K/sec
>  2,946,673,587      cycles                #    2.595 GHz                        (92.67%)
>  1,620,704,205      instructions          #    0.55 insn per cycle              (92.61%)
>    394,595,772      branches              #  347.499 M/sec                      (92.60%)
>        130,756      branch-misses         #    0.03% of all branches            (92.84%)
>  1,396,726,689      L1-dcache-loads       # 1230.022 M/sec                      (92.96%)
>        338,344      L1-dcache-load-misses #    0.02% of all L1-dcache accesses  (92.95%)
>        111,737      LLC-loads             #    0.098 M/sec                      (92.96%)
>         67,486      LLC-load-misses       #   60.40% of all LL-cache accesses   (92.96%)
>    418,198,663      L1-icache-loads       #  368.285 M/sec                      (92.96%)
>        173,764      L1-icache-load-misses #    0.04% of all L1-icache accesses  (92.96%)
>  2,203,364,632      dTLB-loads            # 1940.385 M/sec                      (92.96%)
>         17,195      dTLB-load-misses      #    0.00% of all dTLB cache accesses (92.95%)
>    418,198,365      iTLB-loads            #  368.285 M/sec                      (92.96%)
>             79      iTLB-load-misses      #    0.00% of all iTLB cache accesses (85.34%)
>
>    1.137015760 seconds time elapsed
>
>    0.000000000 seconds user
>    1.131266000 seconds sys
>
> The IPC improved a lot, with fewer LLC-loads and more L1-dcache-loads,
> but this depends on the implementation of the microarchitecture.

Anyway, we need to avoid (or at least reduce) the regression in pure
memory clearing performance. Have you double-checked whether
process_huge_page() is inlined? The perf profile can be used to check
this too. When you say "from start to end", do you mean using
clear_gigantic_page() directly, or changing process_huge_page() to
clear the page from start to end?

> 1) Will run some random-access tests to check the difference in
> performance, as David suggested.
>
> 2) Hope LKP can run more tests since it is very useful (more test sets
> and different machines).

I'm starting to use LKP to test.

--
Best Regards,
Huang, Ying

>
>>
>>> I questioned that just recently [1], and Ying assumed that it still
>>> applies [2].
>>>
>>> c79b57e462b5 ("mm: hugetlb: clear target sub-page last when clearing
>>> huge page") documents the scenario where this matters -- anon-w-seq
>>> which you also run below.
>>>
>>> If there is no performance benefit anymore, we should rip that
>>> out.
>>> But likely we should check on multiple micro-architectures with
>>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>>> v3 2699 with 72 processes on 2 NUMA nodes; maybe your test environment
>>> cannot replicate that?
>>>
>>> [1] https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>>> [2] https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>>
>>>> [1] https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>>> case-anon-w-seq-mt (tried 2M PMD THP / 64K mTHP)
>>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>>
>>> But these are sequential, not random. I'd have thought access +
>>> zeroing would be sequential either way. Did you run with random
>>> access as well?
>
> Will do.
>
>> --
>> Best Regards,
>> Huang, Ying