From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <3d52c5bb-60c5-4bfd-958e-737a8cc90853@linux.alibaba.com>
Date: Fri, 9 May 2025 20:25:52 +0800
Subject: Re: [PATCH v3] mm: mincore: use pte_batch_hint() to batch process large folios
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand, akpm@linux-foundation.org
Cc: 21cnbao@gmail.com, ryan.roberts@arm.com, dev.jain@arm.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <99cb00ee626ceb6e788102ca36821815cd832237.1746697240.git.baolin.wang@linux.alibaba.com>
On 2025/5/9 15:51, David Hildenbrand wrote:
> On 09.05.25 02:45, Baolin Wang wrote:
>> When I tested the mincore() syscall, I observed that it takes longer with
>> 64K mTHP
>> enabled on my Arm64 server. The reason is that mincore_pte_range()
>> still checks each PTE individually, even when the PTEs are contiguous,
>> which is not efficient.
>>
>> Thus we can use pte_batch_hint() to get the batch number of the present
>> contiguous PTEs, which can improve the performance. I tested the
>> mincore() syscall with 1G of anonymous memory populated with 64K mTHP,
>> and observed an obvious performance improvement:
>>
>> w/o patch        w/ patch        changes
>> 6022us           549us           +91%
>>
>> Moreover, I also tested mincore() with mTHP/THP disabled, and did not
>> see any obvious regression for base pages.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> Changes from v2:
>> - Re-calculate the max_nr, per Barry.
>> Changes from v1:
>> - Change to use pte_batch_hint() to get the batch number, per Ryan.
>>
>> Note: I observed that min_t() can introduce a slight performance
>> regression for base pages, so I changed to add a batch size check for
>> base pages, which resolves the performance regression issue.
>> ---
>>   mm/mincore.c | 22 +++++++++++++++++-----
>>   1 file changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index 832f29f46767..42d6c9c8da86 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -21,6 +21,7 @@
>>   #include
>>   #include "swap.h"
>> +#include "internal.h"
>>
>>   static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>>               unsigned long end, struct mm_walk *walk)
>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>       pte_t *ptep;
>>       unsigned char *vec = walk->private;
>>       int nr = (end - addr) >> PAGE_SHIFT;
>> +    int step, i;
>>
>>       ptl = pmd_trans_huge_lock(pmd, vma);
>>       if (ptl) {
>> @@ -118,16 +120,26 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>           walk->action = ACTION_AGAIN;
>>           return 0;
>>       }
>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>           pte_t pte = ptep_get(ptep);
>> +        step = 1;
>>
>>           /* We need to do cache lookup too for pte markers */
>>           if (pte_none_mostly(pte))
>>               __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>                            vma, vec);
>> -        else if (pte_present(pte))
>> -            *vec = 1;
>> -        else { /* pte is a swap entry */
>> +        else if (pte_present(pte)) {
>> +            unsigned int batch = pte_batch_hint(ptep, pte);
>> +
>> +            if (batch > 1) {
>> +                unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
>
> Nit: probably would have called this max_step to match step.

OK. If I need to respin the patch, I'll rename it.
>> +
>> +                step = min_t(unsigned int, batch, max_nr);
>> +            }
>> +
>> +            for (i = 0; i < step; i++)
>> +                vec[i] = 1;
>
> I'm surprised this micro-optimization matters that much. Probably the

Me too.

> compiler defers the calculation of max_nr. I am not convinced we need
> that level of micro-optimization in this code ...
>
> But if we're already micro-optimizing, you could have optimized out the
> loop as well for order-0:
>
>     unsigned int batch = pte_batch_hint(ptep, pte);
>
>     if (batch > 1) {
>         unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
>
>         step = min_t(unsigned int, batch, max_nr);
>         for (i = 0; i < step; i++)
>             vec[i] = 1;
>     } else {
>         *vec = 1;
>     }

I tried this method, and it had no impact on performance.

> In any case
>
> Acked-by: David Hildenbrand

Thanks.