From: Ryan Roberts <ryan.roberts@arm.com>
Date: Mon, 14 Apr 2025 14:46:56 +0100
Subject: Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
To: Baolin Wang, David Hildenbrand, akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, 21cnbao@gmail.com, ziy@nvidia.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <7ad05bc9299de5d954fb21a2da57f46dd6ec59d0.1742960003.git.baolin.wang@linux.alibaba.com>
 <54886038-3707-4ea0-bd84-00a8f4a19a6a@arm.com>
 <145ec273-7223-45b8-a7f6-4e593a3cc8ee@arm.com>

On 07/04/2025 07:33, Baolin Wang wrote:
>
>
> On 2025/4/1 21:04, David Hildenbrand wrote:
>> On 01.04.25 12:45, Ryan Roberts wrote:
>>> On 30/03/2025 15:57, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/3/27 22:08, Ryan Roberts wrote:
>>>>> On 25/03/2025 23:38, Baolin Wang wrote:
>>>>>> When I tested the mincore() syscall, I observed that it takes longer with
>>>>>> 64K mTHP enabled on my Arm64 server. The reason is that mincore_pte_range()
>>>>>> still checks each PTE individually, even when the PTEs are contiguous,
>>>>>> which is not efficient.
>>>>>>
>>>>>> Thus we can use folio_pte_batch() to get the number of contiguous present
>>>>>> PTEs in a batch, which can improve the performance. I tested the mincore()
>>>>>> syscall with 1G of anonymous memory populated with 64K mTHP, and observed an
>>>>>> obvious performance improvement:
>>>>>>
>>>>>>  w/o patch    w/ patch    changes
>>>>>>  6022us       1115us      +81%
>>>>>>
>>>>>> Moreover, I also tested mincore() with mTHP/THP disabled, and did not
>>>>>> see any obvious regression.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang
>>>>>> ---
>>>>>>  mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>>>  1 file changed, 22 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>>>> index 832f29f46767..88be180b5550 100644
>>>>>> --- a/mm/mincore.c
>>>>>> +++ b/mm/mincore.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>>  
>>>>>>  #include <linux/uaccess.h>
>>>>>>  #include "swap.h"
>>>>>> +#include "internal.h"
>>>>>>  
>>>>>>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>>>>>>             unsigned long end, struct mm_walk *walk)
>>>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>>>>>      pte_t *ptep;
>>>>>>      unsigned char *vec = walk->private;
>>>>>>      int nr = (end - addr) >> PAGE_SHIFT;
>>>>>> +    int step, i;
>>>>>>  
>>>>>>      ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>>      if (ptl) {
>>>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>>>>>          walk->action = ACTION_AGAIN;
>>>>>>          return 0;
>>>>>>      }
>>>>>> -    for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>>>> +    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>>>          pte_t pte = ptep_get(ptep);
>>>>>>  
>>>>>> +        step = 1;
>>>>>>          /* We need to do cache lookup too for pte markers */
>>>>>>          if (pte_none_mostly(pte))
>>>>>>              __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>>>                           vma, vec);
>>>>>> -        else if (pte_present(pte))
>>>>>> -            *vec = 1;
>>>>>> -        else { /* pte is a swap entry */
>>>>>> +        else if (pte_present(pte)) {
>>>>>> +            if (pte_batch_hint(ptep, pte) > 1) {
>>>>>> +                struct folio *folio = vm_normal_folio(vma, addr, pte);
>>>>>> +
>>>>>> +                if (folio && folio_test_large(folio)) {
>>>>>> +                    const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>> +                                FPB_IGNORE_SOFT_DIRTY;
>>>>>> +                    int max_nr = (end - addr) / PAGE_SIZE;
>>>>>> +
>>>>>> +                    step = folio_pte_batch(folio, addr, ptep, pte,
>>>>>> +                            max_nr, fpb_flags, NULL, NULL, NULL);
>>>>>> +                }
>>>>>> +            }
>>>>>
>>>>> You could simplify to the following, I think, to avoid needing to grab the
>>>>> folio and call folio_pte_batch():
>>>>>
>>>>>         else if (pte_present(pte)) {
>>>>>             int max_nr = (end - addr) / PAGE_SIZE;
>>>>>             step = min(pte_batch_hint(ptep, pte), max_nr);
>>>>>         } ...
>>>>>
>>>>> I expect the regression you are seeing here is all due to calling ptep_get()
>>>>> for every pte in the contpte batch, which will cause 16 memory reads per pte
>>>>> (to gather the access/dirty bits). For small folios it's just 1 read per pte.
>>>>
>>>> Right.
>>>>
>>>>> pte_batch_hint() will skip forward in blocks of 16, so you now end up with
>>>>> the same number as for the small folio case. You don't need all the fancy
>>>>> extras that folio_pte_batch() gives you here.
>>>>
>>>> Sounds reasonable. Your suggestion looks simple, but my method can batch the
>>>> whole large folio (such as large folios containing more than 16 contiguous
>>>> PTEs) at once.
>>>
>>> Sure, but folio_pte_batch() just implements that with another loop that calls
>>> pte_batch_hint(), so it all amounts to the same thing.
>>> In fact there are some extra checks in folio_pte_batch() that you don't need,
>>> so it might be a bit slower.
>
> Right. I tested your suggestion, yes, much better.
>
>> I don't enjoy open-coding the batching, especially if only cont-pte users will
>> benefit from it. But I also don't enjoy the open-coded pte_batch_hint() :)

I'm not quite sure what you are saying here? Is:

    else if (pte_present(pte)) {
        int max_nr = (end - addr) / PAGE_SIZE;
        step = min(pte_batch_hint(ptep, pte), max_nr);
    }

really to be considered open-coding? pte_batch_hint() is a generic API and it feels
pretty reasonable to use it in this situation?

>>
>> But we really don't need the folio here, so I assume the short variant you
>> (Ryan) suggest is alright to just avoid the ptep_get().

Yes, that would get my vote.

>>
>> As Oscar says, these details might soon be hidden inside a new page table
>> walker API (even though it will likely end up using folio_pte_batch()
>> internally, TBD).
>
> OK. I can drop this patch if it will be addressed in the following patches.

I'm assuming a large chunk of the speedup is actually fixing a regression (it would
be good to see the numbers for non-mTHP mappings for comparison), so personally I'd
prefer we put this patch in now rather than waiting for the new API to land and then
waiting for someone to get around to converting this user.

Thanks,
Ryan
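
For concreteness, a minimal sketch of how the mincore_pte_range() loop might end up
looking with the pte_batch_hint() variant folded into the structure of the posted
patch. This is untested and only illustrative; in particular, the way vec is filled
and advanced for a whole batch is assumed here, since that part of the diff is not
quoted above:

    for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
        pte_t pte = ptep_get(ptep);

        step = 1;
        /* We need to do cache lookup too for pte markers */
        if (pte_none_mostly(pte)) {
            __mincore_unmapped_range(addr, addr + PAGE_SIZE,
                         vma, vec);
        } else if (pte_present(pte)) {
            int max_nr = (end - addr) / PAGE_SIZE;

            /*
             * A contpte-backed mapping hints a batch of PTEs, so a
             * single ptep_get() covers the whole block; clamp to the
             * remaining range so we never step past 'end'.
             */
            step = min(pte_batch_hint(ptep, pte), max_nr);

            /* Assumed vec handling: mark every page of the batch present. */
            for (i = 0; i < step; i++)
                vec[i] = 1;
        } else {
            /* pte is a swap entry: handled as in the existing code */
        }
        vec += step;
    }

On architectures that provide no batch hint, pte_batch_hint() returns 1 and this
degenerates to the existing one-PTE-at-a-time behaviour; for arm64 contpte mappings
it advances a whole contpte block (the 16 PTEs discussed above) per ptep_get().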