Message-ID: <6e0be4f6-0ac2-4861-b25d-3d94c6f35a9f@arm.com>
Date: Wed, 7 May 2025 12:14:51 +0100
From: Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios
To: Baolin Wang, David Hildenbrand, Dev Jain, akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, 21cnbao@gmail.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <7ad05bc9299de5d954fb21a2da57f46dd6ec59d0.1742960003.git.baolin.wang@linux.alibaba.com> <17289428-894a-4397-9d61-c8500d032b28@arm.com> <6a8418ba-dbd1-489f-929b-e31831bea0cf@linux.alibaba.com>
On 07/05/2025 11:03, Baolin Wang wrote:
> On 2025/5/7 17:54, David Hildenbrand wrote:
>> On 07.05.25 11:48, Baolin Wang wrote:
>>> On 2025/5/7 13:12, Dev Jain wrote:
>>>> On 26/03/25 9:08 am, Baolin Wang wrote:
>>>>> When I tested the mincore() syscall, I observed that it takes longer with
>>>>> 64K mTHP enabled on my Arm64 server.
>>>>> The reason is that mincore_pte_range() still checks each PTE
>>>>> individually, even when the PTEs are contiguous, which is not
>>>>> efficient.
>>>>>
>>>>> Thus we can use folio_pte_batch() to get the number of present,
>>>>> contiguous PTEs in a batch, which can improve the performance. I
>>>>> tested the mincore() syscall with 1G of anonymous memory populated
>>>>> with 64K mTHP, and observed an obvious performance improvement:
>>>>>
>>>>> w/o patch        w/ patch        changes
>>>>> 6022us           1115us          +81%
>>>>>
>>>>> Moreover, I also tested mincore() with mTHP/THP disabled, and did not
>>>>> see any obvious regression.
>>>>>
>>>>> Signed-off-by: Baolin Wang
>>>>> ---
>>>>>  mm/mincore.c | 27 ++++++++++++++++++++++-----
>>>>>  1 file changed, 22 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>>>> index 832f29f46767..88be180b5550 100644
>>>>> --- a/mm/mincore.c
>>>>> +++ b/mm/mincore.c
>>>>> @@ -21,6 +21,7 @@
>>>>>  #include
>>>>>  #include "swap.h"
>>>>> +#include "internal.h"
>>>>>
>>>>>  static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>>>>>  			unsigned long end, struct mm_walk *walk)
>>>>> @@ -105,6 +106,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>>>>  	pte_t *ptep;
>>>>>  	unsigned char *vec = walk->private;
>>>>>  	int nr = (end - addr) >> PAGE_SHIFT;
>>>>> +	int step, i;
>>>>>
>>>>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>  	if (ptl) {
>>>>> @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>>>>  		walk->action = ACTION_AGAIN;
>>>>>  		return 0;
>>>>>  	}
>>>>> -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>>>>> +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
>>>>>  		pte_t pte = ptep_get(ptep);
>>>>> +		step = 1;
>>>>>
>>>>>  		/* We need to do cache lookup too for pte markers */
>>>>>  		if (pte_none_mostly(pte))
>>>>>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>>>>>  						 vma, vec);
>>>>> -		else if (pte_present(pte))
>>>>> -			*vec = 1;
>>>>> -		else { /* pte is a swap entry */
>>>>> +		else if (pte_present(pte)) {
>>>>> +			if (pte_batch_hint(ptep, pte) > 1) {
>>>>> +				struct folio *folio = vm_normal_folio(vma, addr, pte);
>>>>> +
>>>>> +				if (folio && folio_test_large(folio)) {
>>>>> +					const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>> +								FPB_IGNORE_SOFT_DIRTY;
>>>>> +					int max_nr = (end - addr) / PAGE_SIZE;
>>>>> +
>>>>> +					step = folio_pte_batch(folio, addr, ptep, pte,
>>>>> +							max_nr, fpb_flags, NULL, NULL, NULL);
>>>>> +				}
>>>>> +			}
>>>>
>>>> Can we go ahead with this along with [1]? That will help us generalise
>>>> it for all arches.
>>>>
>>>> [1] https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
>>>> (Please replace PAGE_SIZE with 1)
>>>
>>> As discussed with Ryan, we don't need to call folio_pte_batch()
>>> (something like the code below), so your patch seems unnecessarily
>>> complicated. However, David is unhappy about the open-coded
>>> pte_batch_hint().
>>
>> I can live with the below :)
>>
>> Having something more universal maybe does not make sense here. Any form
>> of batching contiguous PTEs (contiguous PFNs) -- whether with folios or
>> not -- is not required here, as we really only want to
>>
>> (a) identify pte_present() PTEs; and
>> (b) avoid the cost of repeated ptep_get() with cont-PTEs.
>
> Good. I will change the patch and resend it. Thanks.

Agreed. Thanks Baolin!