Message-ID: <0a22ab94-f63e-4959-a2cd-8b8a3ae210c3@linux.alibaba.com>
Date: Thu, 8 May 2025 15:57:49 +0800
Subject: Re: [PATCH v2] mm: mincore: use pte_batch_hint() to batch process large folios
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, david@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
From: Baolin Wang <baolin.wang@linux.alibaba.com>

On 2025/5/8 15:42, Barry Song wrote:
> On Thu, May 8, 2025 at 4:09 PM Baolin Wang wrote:
>>
>> When I tested the mincore() syscall, I observed that it takes longer with
>> 64K mTHP enabled on my Arm64 server.
>> The reason is the mincore_pte_range()
>> still checks each PTE individually, even when the PTEs are contiguous,
>> which is not efficient.
>>
>> Thus we can use pte_batch_hint() to get the batch number of the present
>> contiguous PTEs, which can improve the performance. I tested the mincore()
>> syscall with 1G anonymous memory populated with 64K mTHP, and observed an
>> obvious performance improvement:
>>
>> w/o patch		w/ patch		changes
>> 6022us			549us			+91%
>>
>> Moreover, I also tested mincore() with disabling mTHP/THP, and did not
>> see any obvious regression for base pages.
>>
>> Signed-off-by: Baolin Wang
>> ---
>> Changes from v1:
>>  - Change to use pte_batch_hint() to get the batch number, per Ryan.
>>
>> Note: I observed the min_t() can introduce a slight performance regression
>> for base pages, so I change to add a batch size check for base pages,
>> which can resolve the performance regression issue.
>> ---
>>  mm/mincore.c | 19 ++++++++++++++-----
>>  1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index 832f29f46767..2e6a9123305e 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -21,6 +21,7 @@
>>
>>  #include
>>  #include "swap.h"
>> +#include "internal.h"
>>
>>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>>  			unsigned long end, struct mm_walk *walk)
>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>  	pte_t *ptep;
>>  	unsigned char *vec = walk->private;
>>  	int nr = (end - addr) >> PAGE_SHIFT;
>> +	int step, i;
>>
>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>  	if (ptl) {
>> @@ -118,16 +120,23 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>  		walk->action = ACTION_AGAIN;
>>  		return 0;
>>  	}
>> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>  		pte_t pte = ptep_get(ptep);
>>
>> +		step = 1;
>>  		/* We need to do cache lookup too for pte markers */
>>  		if (pte_none_mostly(pte))
>>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>  						 vma, vec);
>> -		else if (pte_present(pte))
>> -			*vec = 1;
>> -		else { /* pte is a swap entry */
>> +		else if (pte_present(pte)) {
>> +			unsigned int batch = pte_batch_hint(ptep, pte);
>> +
>> +			if (batch > 1)
>> +				step = min_t(unsigned int, batch, nr);
>
> Not quite sure if nr should be (end - addr) / PAGE_SIZE as nr
> is always the initial value. For example, nr = 50, and we have
> scanned 48 PTEs, then we have 2 ptes left. No?

Ah, you are right. I missed this part when I revised the original
patch[1]. Thanks for pointing this out.

[1] https://lore.kernel.org/all/6a8418ba-dbd1-489f-929b-e31831bea0cf@linux.alibaba.com/
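
To make the arithmetic in Barry's example concrete, here is a small
user-space sketch, not kernel code. fake_batch_hint(), min_uint() and
walk() are hypothetical names made up for this illustration, and the
constant batch of 16 assumes 64K mTHP on 4K base pages; the real
pte_batch_hint() depends on the PTE and its position in the block.

```c
/*
 * Hypothetical stand-in for pte_batch_hint(): pretend every present PTE
 * reports a contiguous batch of 16 entries (assumption: 64K mTHP, 4K pages).
 */
static unsigned int fake_batch_hint(void)
{
	return 16;
}

/* Plain helper standing in for the kernel's min_t(unsigned int, ...). */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Walk nr entries and return how many vec slots would be written.
 *
 * clamp_to_remaining == 0 models the quoted v2 hunk, which clamps the
 * batch against the initial nr.
 * clamp_to_remaining == 1 clamps against the entries still left, which
 * is what Barry's comment asks for.
 *
 * The loop uses "done < nr" so the model terminates even when it
 * overshoots; the kernel loop tests "addr != end" instead.
 */
static unsigned int walk(unsigned int nr, int clamp_to_remaining)
{
	unsigned int done = 0, written = 0;

	while (done < nr) {
		unsigned int batch = fake_batch_hint();
		unsigned int limit = clamp_to_remaining ? nr - done : nr;
		unsigned int step = 1;

		if (batch > 1)
			step = min_uint(batch, limit);

		written += step;	/* slots of vec[] that would be set */
		done += step;
	}
	return written;
}
```

With nr = 50, walk(50, 0) takes four steps of 16 and would touch 64
slots, 14 past the end of the range, while walk(50, 1) takes 16, 16, 16
and then 2, touching exactly 50. In the kernel loop the equivalent
bound would presumably be derived from (end - addr), as Barry suggests.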