From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7392a21b-10bf-4ce9-a6fd-807ed954c138@linux.alibaba.com>
Date: Wed, 16 Apr 2025 16:41:31 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v3] mempolicy: Optimize queue_folios_pte_range by PTE batching
To: David Hildenbrand, Dev Jain, akpm@linux-foundation.org
Cc: ryan.roberts@arm.com, willy@infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, hughd@google.com, vishal.moola@gmail.com,
 yang@os.amperecomputing.com, ziy@nvidia.com
References: <20250416053048.96479-1-dev.jain@arm.com>
 <7f96283b-11b3-49ee-9d2d-5ad977325cb0@linux.alibaba.com>
 <019d1c4a-ffd0-4602-b2ba-cf07379dab17@redhat.com>
From: Baolin Wang <baolin.wang@linux.alibaba.com>
In-Reply-To: <019d1c4a-ffd0-4602-b2ba-cf07379dab17@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
On 2025/4/16 16:22, David Hildenbrand wrote:
> On 16.04.25 08:32, Baolin Wang wrote:
>>
>>
>> On 2025/4/16 13:30, Dev Jain wrote:
>>> After the check for queue_folio_required(), the code only cares about
>>> the folio in the for loop, i.e. the PTEs are redundant. Therefore,
>>> optimize this loop by skipping over a PTE batch mapping the same folio.
>>>
>>> With a test program migrating pages of the calling process, which
>>> includes a mapped VMA of size 4GB with pte-mapped large folios of
>>> order-9, and migrating once back and forth between node-0 and node-1,
>>> the average execution time reduces from 7.5 to 4 seconds, giving an
>>> approx 47% speedup.
>>>
>>> v2->v3:
>>>    - Don't use assignment in if condition
>>>
>>> v1->v2:
>>>    - Follow reverse xmas tree declarations
>>>    - Don't initialize nr
>>>    - Move folio_pte_batch() immediately after retrieving a normal folio
>>>    - Increment nr_failed in one shot
>>>
>>> Acked-by: David Hildenbrand
>>> Signed-off-by: Dev Jain
>>> ---
>>>   mm/mempolicy.c | 12 ++++++++++--
>>>   1 file changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index b28a1e6ae096..4d2dc8b63965 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -566,6 +566,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
>>>   static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>               unsigned long end, struct mm_walk *walk)
>>>   {
>>> +    const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>       struct vm_area_struct *vma = walk->vma;
>>>       struct folio *folio;
>>>       struct queue_pages *qp = walk->private;
>>> @@ -573,6 +574,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>       pte_t *pte, *mapped_pte;
>>>       pte_t ptent;
>>>       spinlock_t *ptl;
>>> +    int max_nr, nr;
>>>
>>>       ptl = pmd_trans_huge_lock(pmd, vma);
>>>       if (ptl) {
>>> @@ -586,7 +588,9 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>           walk->action = ACTION_AGAIN;
>>>           return 0;
>>>       }
>>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>>> +    for (; addr != end; pte += nr, addr += nr * PAGE_SIZE) {
>>> +        max_nr = (end - addr) >> PAGE_SHIFT;
>>> +        nr = 1;
>>>           ptent = ptep_get(pte);
>>>           if (pte_none(ptent))
>>>               continue;
>>> @@ -598,6 +602,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>           folio = vm_normal_folio(vma, addr, ptent);
>>>           if (!folio || folio_is_zone_device(folio))
>>>               continue;
>>> +        if (folio_test_large(folio) && max_nr != 1)
>>> +            nr = folio_pte_batch(folio, addr, pte, ptent,
>>> +                         max_nr, fpb_flags,
>>> +                         NULL, NULL, NULL);
>>>           /*
>>>            * vm_normal_folio() filters out zero pages, but there might
>>>            * still be reserved folios to skip, perhaps in a VDSO.
>>> @@ -630,7 +638,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>           if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
>>>               !vma_migratable(vma) ||
>>>               !migrate_folio_add(folio, qp->pagelist, flags)) {
>>> -            qp->nr_failed++;
>>> +            qp->nr_failed += nr;
>>
>> Sorry for chiming in late, but I am not convinced that 'qp->nr_failed'
>> should add 'nr' when isolation fails.
>
> This patch does not change the existing behavior. But I stumbled over
> that as well ... and scratched my head.
>
>> From the comments of queue_pages_range():
>> "
>> * >0 - this number of misplaced folios could not be queued for moving
>> *      (a hugetlbfs page or a transparent huge page being counted as 1).
>> "
>>
>> That means if a large folio fails to be isolated, we should only add '1'
>> to qp->nr_failed instead of the number of pages in this large folio.
>> Right?
>
> I think what the doc really meant is "PMD-mapped THP". PTE-mapped THPs
> always had the same behavior: per PTE of the THP we would increment
> nr_failed by 1. No?

For pte-mapped THPs, it only adds 1 for the large folio, since we have
the below check in queue_folios_pte_range():

	if (folio == qp->large)
		continue;

Or did I miss anything else?

> I assume returning "1" for PMD-mapped THPs was wrong from the beginning;
> it might only have been right for hugetlb pages.
>
> With COW and similar things (VMA splits), achieving "count each folio
> only once" reliably is a very hard thing to achieve.
>
> Let's explore how "nr_failed" will get used.
>
> 1) do_mbind()
>
> Only cares if "any failed", not the exact number.
>
> 2) migrate_pages()
>
> Will return the number to user space, where the documentation says:
>
> "On success migrate_pages() returns the number of pages that could not
> be moved (i.e., a return of zero means that all pages were successfully
> moved)."
>
> The man page does not document THP specifics AFAIK. I would assume most
> users care about "all migrated vs. any not migrated".
>
> I would even feel confident changing the THP PMD-handling to return the
> actual *pages*.
>
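[Editor's illustration] The batch-skip and per-PTE accounting discussed above can be sketched as a small userspace model. This is not kernel code: `nr_failed_batched`, the `pte[]` array of folio ids, and the "negative id means isolation fails" convention are all invented for illustration. A run of equal ids stands in for a PTE-mapped large folio; the inner loop plays the role of folio_pte_batch(), the outer step mirrors `pte += nr`, and the counter grows by the whole batch, matching `qp->nr_failed += nr`.

```c
#include <stddef.h>

/*
 * Userspace sketch of the batching in queue_folios_pte_range().
 * pte[] holds folio ids standing in for page-table entries; a run of
 * equal ids models a PTE-mapped large folio. A negative id models a
 * folio whose isolation fails. Illustration only, not a kernel API.
 */
size_t nr_failed_batched(const int *pte, size_t nptes)
{
	size_t nr_failed = 0;
	size_t i = 0;

	while (i < nptes) {
		size_t nr = 1;

		/* folio_pte_batch() analogue: extend over same-folio PTEs */
		while (i + nr < nptes && pte[i + nr] == pte[i])
			nr++;

		if (pte[i] < 0)			/* isolation failed for this folio */
			nr_failed += nr;	/* qp->nr_failed += nr, in one shot */

		i += nr;	/* pte += nr, addr += nr * PAGE_SIZE */
	}
	return nr_failed;
}
```

For `pte = {-1, -1, -1, 2, -3, -3, 4}` the walk takes four steps instead of seven and reports 5 failed PTEs (3 for the first failing folio, 2 for the second), i.e. the per-PTE accounting the thread debates is preserved even though the loop now advances folio-wise.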