Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying"
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
 linux-mm@kvack.org, Zi Yan
Date: Wed, 30 Oct 2024 13:05:33 +0800
Message-ID: <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
In-Reply-To: <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 2024/10/30 11:21, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/10/30 9:04, Huang, Ying wrote:
>>> David Hildenbrand writes:
>>>
>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>> shouldn't
>>>>>>>>>>>
>>>>>>>>>>> Yes, let's fix the bug first,
>>>>>>>>>>>
>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>
>>>>>>>>>>> Not sure about this part; it uses nr_pages as the end and
>>>>>>>>>>> calculates the 'base'.
>>>>>>>>>>
>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>
>>>>>>>>> But process_huge_page() has no explicit folio argument. I'd like to
>>>>>>>>> move the aligned-address calculation into folio_zero_user() and
>>>>>>>>> copy_user_large_folio() (which will be renamed to folio_copy_user())
>>>>>>>>> in the follow-up cleanup patches -- or should it be done in the fix
>>>>>>>>> patches?
>>>>>>>>
>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a
>>>>>>>> small folio*? Because we like our code to be extra complicated to
>>>>>>>> understand? Or am I missing something important?
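
For reference, the aligned-address fix being discussed boils down to
deriving the folio-aligned base address from the (possibly unaligned)
faulting address hint inside the clearing helper itself. A minimal
sketch with illustrative naming, not the exact patch:

    #include <linux/highmem.h>
    #include <linux/mm.h>

    /*
     * Sketch only: align the address hint down to the folio boundary,
     * then clear every subpage relative to that base, instead of
     * trusting the caller to pass an aligned address.
     */
    static void clear_gigantic_page_sketch(struct folio *folio,
                                           unsigned long addr_hint)
    {
            unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
            unsigned int i, nr_pages = folio_nr_pages(folio);

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    clear_user_highpage(folio_page(folio, i),
                                        addr + i * PAGE_SIZE);
            }
    }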
>>>>>>>
>>>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before, and
>>>>>>> since anon mTHP support it is used for order-2 to PMD-order THP and
>>>>>>> HugeTLB, so it won't process a small folio, if I understand correctly.
>>>>>>
>>>>>> And unfortunately neither the documentation nor the function name
>>>>>> expresses that :(
>>>>>>
>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>
>>>>> Actually, could we drop process_huge_page() entirely? From my
>>>>> testcase[1], process_huge_page() is no better than clearing/copying the
>>>>> page from start to end, and sequential clearing/copying may be more
>>>>> beneficial for hardware prefetching. And is there a way to have LKP run
>>>>> tests to check the performance? Since process_huge_page() was submitted
>>>>> by Ying, what's your opinion?
>>>
>>> I don't think that it's a good idea to revert the commit without
>>> studying and root-causing the issues. I can work together with you on
>>> that. If we have solid and well-explained data to prove that
>>> process_huge_page() isn't beneficial, we can revert the commit.
>>
>> Take 'fallocate 20G' as an example. Before:
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':
>
> IIUC, fallocate will zero pages, but will not touch them at all, right?
> If so, there is no cache benefit from clearing the referenced page last.

Yes, in this case we only clear the page.

>
>>          3,118.94 msec task-clock               #    0.999 CPUs utilized
>>                30      context-switches         #    0.010 K/sec
>>                 1      cpu-migrations           #    0.000 K/sec
>>               136      page-faults              #    0.044 K/sec
>>     8,092,075,873      cycles                   #    2.594 GHz                        (92.82%)
>>     1,624,587,663      instructions             #    0.20  insn per cycle             (92.83%)
>>       395,341,850      branches                 #  126.755 M/sec                      (92.82%)
>>         3,872,302      branch-misses            #    0.98% of all branches            (92.83%)
>>     1,398,066,701      L1-dcache-loads          #  448.251 M/sec                      (92.82%)
>>        58,124,626      L1-dcache-load-misses    #    4.16% of all L1-dcache accesses  (92.82%)
>>         1,032,527      LLC-loads                #    0.331 M/sec                      (92.82%)
>>           498,684      LLC-load-misses          #   48.30% of all LL-cache accesses   (92.84%)
>>       473,689,004      L1-icache-loads          #  151.875 M/sec                      (92.82%)
>>           356,721      L1-icache-load-misses    #    0.08% of all L1-icache accesses  (92.85%)
>>     1,947,644,987      dTLB-loads               #  624.458 M/sec                      (92.95%)
>>            10,185      dTLB-load-misses         #    0.00% of all dTLB cache accesses (92.96%)
>>       474,622,896      iTLB-loads               #  152.174 M/sec                      (92.95%)
>>                94      iTLB-load-misses         #    0.00% of all iTLB cache accesses (85.69%)
>>
>>        3.122844830 seconds time elapsed
>>
>>        0.000000000 seconds user
>>        3.107259000 seconds sys
>>
>> And after (clearing from start to end):
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':
>>
>>          1,135.53 msec task-clock               #    0.999 CPUs utilized
>>                10      context-switches         #    0.009 K/sec
>>                 1      cpu-migrations           #    0.001 K/sec
>>               137      page-faults              #    0.121 K/sec
>>     2,946,673,587      cycles                   #    2.595 GHz                        (92.67%)
>>     1,620,704,205      instructions             #    0.55  insn per cycle             (92.61%)
>>       394,595,772      branches                 #  347.499 M/sec                      (92.60%)
>>           130,756      branch-misses            #    0.03% of all branches            (92.84%)
>>     1,396,726,689      L1-dcache-loads          # 1230.022 M/sec                      (92.96%)
>>           338,344      L1-dcache-load-misses    #    0.02% of all L1-dcache accesses  (92.95%)
>>           111,737      LLC-loads                #    0.098 M/sec                      (92.96%)
>>            67,486      LLC-load-misses          #   60.40% of all LL-cache accesses   (92.96%)
>>       418,198,663      L1-icache-loads          #  368.285 M/sec                      (92.96%)
>>           173,764      L1-icache-load-misses    #    0.04% of all L1-icache accesses  (92.96%)
>>     2,203,364,632      dTLB-loads               # 1940.385 M/sec                      (92.96%)
>>            17,195      dTLB-load-misses         #    0.00% of all dTLB cache accesses (92.95%)
>>       418,198,365      iTLB-loads               #  368.285 M/sec                      (92.96%)
>>                79      iTLB-load-misses         #    0.00% of all iTLB cache accesses (85.34%)
>>
>>        1.137015760 seconds time elapsed
>>
>>        0.000000000 seconds user
>>        1.131266000 seconds sys
>>
>> The IPC improved a lot: fewer LLC-loads and more L1-dcache-loads, but
>> this depends on the implementation of the microarchitecture.
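
To make the comparison concrete, the two clearing orders are roughly
the following -- a simplified sketch, not the actual kernel code (the
real process_huge_page() is more elaborate and walks toward the target
subpage from both ends):

    /* a) Sequential: clear subpages 0..nr-1 in order, as
     *    clear_gigantic_page() does.
     */
    static void clear_in_order(struct folio *folio, unsigned long base)
    {
            unsigned int i, nr = folio_nr_pages(folio);

            for (i = 0; i < nr; i++) {
                    cond_resched();
                    clear_user_highpage(folio_page(folio, i),
                                        base + i * PAGE_SIZE);
            }
    }

    /* b) Target-last: clear every other subpage first, then the
     *    faulting subpage, so the data the user touches next is
     *    still cache-hot.
     */
    static void clear_target_last(struct folio *folio, unsigned long base,
                                  unsigned long addr_hint)
    {
            unsigned int nr = folio_nr_pages(folio);
            unsigned int target = (addr_hint - base) / PAGE_SIZE;
            unsigned int i;

            for (i = 0; i < nr; i++) {
                    if (i == target)
                            continue;
                    cond_resched();
                    clear_user_highpage(folio_page(folio, i),
                                        base + i * PAGE_SIZE);
            }
            clear_user_highpage(folio_page(folio, target),
                                base + target * PAGE_SIZE);
    }

For fallocate there is no subsequent user access to the cleared pages,
so ordering b) cannot help and only makes the access pattern less
friendly to hardware prefetching.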
>
> Anyway, we need to avoid (or at least reduce) the pure memory clearing
> performance regression. Have you double-checked whether
> process_huge_page() is inlined? A perf-profile result can be used to
> check this too.

Yes, I'm sure process_huge_page() is inlined.

> When you say from start to end, do you mean using clear_gigantic_page()
> directly, or changing process_huge_page() to clear the page from start
> to end?

Both using clear_gigantic_page() directly and changing
process_huge_page() to clear the page from start to end are good for
performance with sequential clearing, but I have not run a random-access
test so far.

>> 1) Will run some random-access tests to check the performance
>> difference, as David suggested.
>>
>> 2) Hope LKP can run more tests, since it is very useful (more test
>> sets and different machines).
>
> I'm starting to use LKP to test.

Great.
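
For the random-access test in 1), something along these lines might do:
a user-space sketch with made-up parameters (2MB PMD-sized THPs, first
touch at a random subpage so the fault path runs folio_zero_user()):

    #include <stdlib.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE  (2UL << 20)   /* assume 2MB PMD-sized THP */
    #define NR_HPAGES   4096UL        /* illustrative count */

    int main(void)
    {
            size_t len = NR_HPAGES * HPAGE_SIZE;
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            /* ask for THPs; the kernel may still fall back to 4K pages */
            madvise(buf, len, MADV_HUGEPAGE);

            for (size_t i = 0; i < NR_HPAGES; i++) {
                    /* first touch at a random 4K subpage of each huge page */
                    size_t sub = (size_t)rand() % (HPAGE_SIZE / 4096);
                    buf[i * HPAGE_SIZE + sub * 4096] = 1;
            }
            return 0;
    }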