From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xavier Xia
To: ryan.roberts@arm.com, will@kernel.org, 21cnbao@gmail.com,
 ioworker0@gmail.com, dev.jain@arm.com
Cc: akpm@linux-foundation.org, catalin.marinas@arm.com, david@redhat.com,
 gshan@redhat.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, willy@infradead.org,
 xavier_qy@163.com, ziy@nvidia.com, Xavier Xia, Barry Song
Subject: [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
Date: Tue, 24 Jun 2025 23:25:49 +0800
Message-Id: <20250624152549.2647828-1-xavier.qyxia@gmail.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop:
 owner-majordomo@kvack.org
List-ID:
List-Subscribe:
List-Unsubscribe:

This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
functions by adding early termination logic. It checks whether the dirty
and young bits of orig_pte are already set and skips the redundant
bit-setting operations for the rest of the loop. This reduces unnecessary
iterations and improves performance.

To verify the performance of the optimization, a test function was
written. Its execution time and instruction counts were traced with perf;
the following results were measured on a Qualcomm mobile phone chip:

Test Code:

#include <stdlib.h>
#include <sys/mman.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8

/* Write and read one byte per page so that every PTE is dirty and young. */
void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}

/* Clear the young and dirty state of every PTE in the range. */
void clear_young_dirty(char *buf)
{
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
	if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
		perror("madvise cold failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}

/* Read one page (index YOUNG_BIT) in each block of CONT_PTES pages. */
void set_one_young(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
		volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
	}
}

void test_contpte_perf()
{
	char *buf;
	int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
				 TEST_SIZE);
	/* The buffer must be aligned to a whole contpte block. */
	if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
	clear_young_dirty(buf);
#endif
#if TEST_CASE2
	set_one_young(buf);
#endif

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		munlock(buf, TEST_SIZE);
	}
	free(buf);
}

int main(void)
{
	test_contpte_perf();
	return 0;
}

Descriptions of three test scenarios

Scenario 1
All 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0

Scenario 2
Among the 16 PTEs, only the one at index 8 (YOUNG_BIT) is young, and none
are dirty.
#define TEST_CASE2 1
#define TEST_CASE3 0

Scenario 3
Among the 16 PTEs, none are young and none are dirty.
#define TEST_CASE2 0
#define TEST_CASE3 1

Test results

|Scenario 1         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    37912436160|     18731580031|
|test time          |         4.2797|          2.2949|
|overhead of        |               |                |
|contpte_ptep_get() |         21.31%|           4.80%|

|Scenario 2         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36701270862|     36115790086|
|test time          |         3.2335|          3.0874|
|overhead of        |               |                |
|contpte_ptep_get() |         32.26%|          33.57%|

|Scenario 3         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36706279735|     36750881878|
|test time          |         3.2008|          3.1249|
|overhead of        |               |                |
|contpte_ptep_get() |         31.94%|          34.59%|

For Scenario 1, the optimized code achieves an instruction count benefit
of 50.59% and a time benefit of 46.38%.
For Scenario 2, the optimized code achieves an instruction count benefit
of 1.6% and a time benefit of 4.5%.
For Scenario 3, since none of the PTEs has the young or the dirty flag,
the branches taken by the optimized code should be the same as those of
the original code. Indeed, the test results of the optimized code are
close to those of the original code.

Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.

Scenario 1: reduced to 56% of baseline execution time
Scenario 2: reduced to 89% of baseline execution time
Scenario 3: reduced to 91% of baseline execution time

The test results show that the optimization for contpte_ptep_get is
effective. Since the logic of contpte_ptep_get_lockless is similar to
that of contpte_ptep_get, the same optimization scheme is also adopted
for it.
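
The benefit percentages quoted for Scenarios 1 and 2 can be reproduced
from the table values as (1 - optimized / original) * 100. The following
is a minimal sketch of that arithmetic only; benefit_pct is a hypothetical
helper introduced here for illustration and is not part of the patch or of
the test program.

```c
/*
 * Hypothetical helper (an assumption for illustration, not part of the
 * patch): percentage saved by the optimized run relative to the original
 * run, for any "lower is better" metric such as instructions or time.
 */
static double benefit_pct(double original, double optimized)
{
	return (1.0 - optimized / original) * 100.0;
}
```

With the Scenario 1 figures, benefit_pct(37912436160.0, 18731580031.0)
evaluates to roughly 50.59 and benefit_pct(4.2797, 2.2949) to roughly
46.38, which agrees with the numbers stated in the text.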

Reviewed-by: Ryan Roberts
Tested-by: Ryan Roberts
Reviewed-by: Barry Song
Signed-off-by: Xavier Xia
---
Changes in v7:
- Update the header files and main function of the test program, as well
  as Ryan's validation data.
- Link to v6: https://lore.kernel.org/all/20250510125948.2383778-1-xavier_qy@163.com/

Changes in v6:
- Move prot = pte_pgprot(pte_mkold(pte_mkclean(pte))) into
  contpte_is_consistent(), as suggested by Barry.
- Link to v5: https://lore.kernel.org/all/20250509122728.2379466-1-xavier_qy@163.com/

Changes in v5:
- Replace macro CHECK_CONTPTE_CONSISTENCY with inline function
  contpte_is_consistent for improved readability and clarity, as
  suggested by Barry.
- Link to v4: https://lore.kernel.org/all/20250508070353.2370826-1-xavier_qy@163.com/

Changes in v4:
- Convert macro CHECK_CONTPTE_FLAG to an internal loop for better
  readability.
- Refactor contpte_ptep_get_lockless using the same optimization logic,
  as suggested by Ryan.
- Link to v3: https://lore.kernel.org/all/3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com/
---
 arch/arm64/mm/contpte.c | 74 +++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..71efe7dff0ad 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,17 +169,46 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
-
-		if (pte_young(pte))
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
+
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			for (; i < CONT_PTES; i++, ptep++) {
+				pte = __ptep_get(ptep);
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}
 
 	return orig_pte;
 }
 EXPORT_SYMBOL_GPL(contpte_ptep_get);
 
+static inline bool contpte_is_consistent(pte_t pte, unsigned long pfn,
+					 pgprot_t orig_prot)
+{
+	pgprot_t prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+	return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
+	       pgprot_val(prot) == pgprot_val(orig_prot);
+}
+
 pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 {
 	/*
@@ -202,7 +231,6 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 	pgprot_t orig_prot;
 	unsigned long pfn;
 	pte_t orig_pte;
-	pgprot_t prot;
 	pte_t *ptep;
 	pte_t pte;
 	int i;
@@ -219,18 +247,44 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 
 	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
 		pte = __ptep_get(ptep);
-		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
 
-		if (!pte_valid_cont(pte) ||
-		   pte_pfn(pte) != pfn ||
-		   pgprot_val(prot) != pgprot_val(orig_prot))
+		if (!contpte_is_consistent(pte, pfn, orig_prot))
 			goto retry;
 
-		if (pte_dirty(pte))
+		if (pte_dirty(pte)) {
 			orig_pte = pte_mkdirty(orig_pte);
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+
+				if (!contpte_is_consistent(pte, pfn, orig_prot))
+					goto retry;
+
+				if (pte_young(pte)) {
+					orig_pte = pte_mkyoung(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 
-		if (pte_young(pte))
+		if (pte_young(pte)) {
 			orig_pte = pte_mkyoung(orig_pte);
+			i++;
+			ptep++;
+			pfn++;
+			for (; i < CONT_PTES; i++, ptep++, pfn++) {
+				pte = __ptep_get(ptep);
+
+				if (!contpte_is_consistent(pte, pfn, orig_prot))
+					goto retry;
+
+				if (pte_dirty(pte)) {
+					orig_pte = pte_mkdirty(orig_pte);
+					break;
+				}
+			}
+			break;
+		}
 	}
 
 	return orig_pte;
-- 
2.34.1
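
For readers who want to experiment with the control flow outside the
kernel, the following is a minimal user-space sketch of the
early-termination scan used by the patched contpte_ptep_get. It is a model
under stated assumptions, not kernel code: F_DIRTY, F_YOUNG, scan_block and
the plain array of flag words are hypothetical stand-ins for the arm64 PTE
accessors, and the reads counter exists only to make the saved work
visible.

```c
#define CONT_PTES 16

/* Hypothetical flag bits for this model; not the arm64 PTE layout. */
#define F_DIRTY 0x1u
#define F_YOUNG 0x2u

/*
 * Fold the dirty/young flags of one block of CONT_PTES entries into a
 * single value, following the same branch structure as the patch.
 * *reads is set to the number of entry reads performed.
 */
static unsigned int scan_block(const unsigned int *ptes, int *reads)
{
	unsigned int acc = 0;
	int i;

	*reads = 0;
	for (i = 0; i < CONT_PTES; i++) {
		unsigned int pte = ptes[i];

		(*reads)++;
		if (pte & F_DIRTY) {
			acc |= F_DIRTY;
			/* Dirty is settled: look only for young, from entry i. */
			for (; i < CONT_PTES; i++) {
				(*reads)++;
				if (ptes[i] & F_YOUNG) {
					acc |= F_YOUNG;
					break;
				}
			}
			break;
		}

		if (pte & F_YOUNG) {
			acc |= F_YOUNG;
			/* Young is settled: look only for dirty, after entry i. */
			for (i++; i < CONT_PTES; i++) {
				(*reads)++;
				if (ptes[i] & F_DIRTY) {
					acc |= F_DIRTY;
					break;
				}
			}
			break;
		}
	}

	return acc;
}
```

In this model, a block whose entries all carry both flags is resolved
after 2 reads, while a block with only entry 8 young, or with no flags at
all, still needs 16 reads. That is consistent with the large gain reported
for Scenario 1 and the small differences reported for Scenarios 2 and 3.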