From: Xavier Xia
Date: Wed, 2 Jul 2025 17:00:14 +0800
Subject: Re: [PATCH v7] arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
To: Catalin Marinas
Cc: ryan.roberts@arm.com, will@kernel.org, 21cnbao@gmail.com, ioworker0@gmail.com, dev.jain@arm.com, akpm@linux-foundation.org, david@redhat.com, gshan@redhat.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, willy@infradead.org, xavier_qy@163.com, ziy@nvidia.com, Barry Song
References: <20250624152549.2647828-1-xavier.qyxia@gmail.com>

Hi Catalin,

On Tue, Jul 1, 2025 at 9:59 PM Catalin Marinas wrote:
>
> On Tue, Jun 24, 2025 at 11:25:49PM +0800, Xavier Xia wrote:
> > This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> > function by adding early termination logic.
> > It checks if the dirty and
> > young bits of orig_pte are already set and skips redundant bit-setting
> > operations during the loop. This reduces unnecessary iterations and
> > improves performance.
> >
> > In order to verify the optimization performance, a test function has been
> > designed. The function's execution time and instruction statistics have
> > been traced using perf, and the following are the operation results on a
> > certain Qualcomm mobile phone chip:
> >
> > Test Code:
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/mman.h>
> >
> > #define PAGE_SIZE 4096
> > #define CONT_PTES 16
> > #define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
> > #define YOUNG_BIT 8
> > void rwdata(char *buf)
> > {
> >         for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> >                 buf[i] = 'a';
> >                 volatile char c = buf[i];
> >         }
> > }
> > void clear_young_dirty(char *buf)
> > {
> >         if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
> >                 perror("madvise free failed");
> >                 free(buf);
> >                 exit(EXIT_FAILURE);
> >         }
> >         if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
> >                 perror("madvise cold failed");
> >                 free(buf);
> >                 exit(EXIT_FAILURE);
> >         }
> > }
> > void set_one_young(char *buf)
> > {
> >         for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
> >                 volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
> >         }
> > }
> >
> > void test_contpte_perf() {
> >         char *buf;
> >         int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
> >                                  TEST_SIZE);
> >         if ((ret != 0) || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
> >                 perror("posix_memalign failed");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         rwdata(buf);
> > #if TEST_CASE2 || TEST_CASE3
> >         clear_young_dirty(buf);
> > #endif
> > #if TEST_CASE2
> >         set_one_young(buf);
> > #endif
> >
> >         for (int j = 0; j < 500; j++) {
> >                 mlock(buf, TEST_SIZE);
> >
> >                 munlock(buf, TEST_SIZE);
> >         }
> >         free(buf);
> > }
> >
> > int main(void)
> > {
> >         test_contpte_perf();
> >         return 0;
> > }
> >
> > Descriptions of three test scenarios
> >
> > Scenario 1
> >         The data of all 16 PTEs are both dirty and young.
> >         #define TEST_CASE2 0
> >         #define TEST_CASE3 0
> >
> > Scenario 2
> >         Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
> >         #define TEST_CASE2 1
> >         #define TEST_CASE3 0
> >
> > Scenario 3
> >         Among the 16 PTEs, there are neither young nor dirty ones.
> >         #define TEST_CASE2 0
> >         #define TEST_CASE3 1
> >
> > Test results
> >
> > |Scenario 1         |       Original|       Optimized|
> > |-------------------|---------------|----------------|
> > |instructions       |    37912436160|     18731580031|
> > |test time          |         4.2797|          2.2949|
> > |Overhead of        |               |                |
> > |contpte_ptep_get() |         21.31%|           4.80%|
> >
> > |Scenario 2         |       Original|       Optimized|
> > |-------------------|---------------|----------------|
> > |instructions       |    36701270862|     36115790086|
> > |test time          |         3.2335|          3.0874|
> > |Overhead of        |               |                |
> > |contpte_ptep_get() |         32.26%|          33.57%|
> >
> > |Scenario 3         |       Original|       Optimized|
> > |-------------------|---------------|----------------|
> > |instructions       |    36706279735|     36750881878|
> > |test time          |         3.2008|          3.1249|
> > |Overhead of        |               |                |
> > |contpte_ptep_get() |         31.94%|          34.59%|
> >
> > For Scenario 1, optimized code can achieve an instruction benefit of 50.59%
> > and a time benefit of 46.38%.
> > For Scenario 2, optimized code can achieve an instruction count benefit of
> > 1.6% and a time benefit of 4.5%.
> > For Scenario 3, since all the PTEs have neither the young nor the dirty
> > flag, the branches taken by optimized code should be the same as those of
> > the original code. In fact, the test results of optimized code seem to be
> > closer to those of the original code.
> >
> > Ryan re-ran these tests on Apple M2 with 4K base pages + 64K mTHP.
> >
> > Scenario 1: reduced to 56% of baseline execution time
> > Scenario 2: reduced to 89% of baseline execution time
> > Scenario 3: reduced to 91% of baseline execution time
>
> Still not keen on microbenchmarks to justify such change but at least
> the code is more readable than the macro approach in some earlier
> version.
>
> Do you have any numbers to see how it compares with your v1:
>
> https://lore.kernel.org/all/20250407092243.2207837-1-xavier_qy@163.com/
>
> That patch was a lot simpler.
>

You can check the comparison data via:
https://lore.kernel.org/all/3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com/

v1 only optimizes the Scenario 1 case (where all PTEs are both young and
dirty), and it degrades performance in the other scenarios. The current
version is more complex, but its performance gains are significant.

--
Thanks,
Xavier
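
For readers following the thread, the early-termination idea described in
the quoted commit message can be illustrated with a small user-space sketch.
This is not the kernel code from the patch: FLAG_YOUNG, FLAG_DIRTY and
gather_flags() are hypothetical names chosen for illustration, and plain
unsigned integers stand in for PTEs. It only shows the general technique of
stopping the scan once both bits have been collected.

```c
#include <assert.h>
#include <stddef.h>

#define CONT_PTES  16
#define FLAG_YOUNG (1u << 0)  /* hypothetical bit position, not the arm64 layout */
#define FLAG_DIRTY (1u << 1)  /* hypothetical bit position, not the arm64 layout */

/*
 * OR together the young/dirty flags of a block of CONT_PTES entries.
 * Stops as soon as both flags have been seen, since reading further
 * entries cannot change the result. *visited receives the number of
 * entries actually read.
 */
static unsigned int gather_flags(const unsigned int *ptes, size_t *visited)
{
        const unsigned int both = FLAG_YOUNG | FLAG_DIRTY;
        unsigned int flags = 0;
        size_t i;

        assert(ptes != NULL && visited != NULL);

        for (i = 0; i < CONT_PTES; i++) {
                flags |= ptes[i] & both;
                if (flags == both) {
                        i++;    /* count the entry that completed the set */
                        break;
                }
        }

        *visited = i;
        return flags;
}
```

In this sketch, a block where every entry is young and dirty (as in
Scenario 1) is resolved after reading one entry, while a block with a single
young entry or with no flags at all (as in Scenarios 2 and 3) still requires
reading all 16. That is consistent with the large gain appearing only in
Scenario 1 in the quoted results.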