From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Mon, 21 Apr 2025 14:30:14 +0800
Subject: Re: [PATCH v3] mempolicy: Optimize queue_folios_pte_range by PTE batching
To: David Hildenbrand, Dev Jain, akpm@linux-foundation.org
Cc: ryan.roberts@arm.com, willy@infradead.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, hughd@google.com, vishal.moola@gmail.com,
    yang@os.amperecomputing.com, ziy@nvidia.com
References: <20250416053048.96479-1-dev.jain@arm.com>
 <7f96283b-11b3-49ee-9d2d-5ad977325cb0@linux.alibaba.com>
 <019d1c4a-ffd0-4602-b2ba-cf07379dab17@redhat.com>
 <7392a21b-10bf-4ce9-a6fd-807ed954c138@linux.alibaba.com>
 <8b387a53-40e0-40d1-8bfa-b7524657a7dd@redhat.com>

On 2025/4/16 16:56, David Hildenbrand wrote:
> On 16.04.25 10:51, David Hildenbrand wrote:
>> On 16.04.25 10:41, Baolin Wang wrote:
>>>
>>> On 2025/4/16 16:22, David Hildenbrand wrote:
>>>> On 16.04.25 08:32, Baolin Wang wrote:
>>>>>
>>>>> On 2025/4/16 13:30, Dev Jain wrote:
>>>>>> After the check for queue_folio_required(), the code only cares
>>>>>> about the folio in the for loop, i.e. the PTEs are redundant.
>>>>>> Therefore, optimize this loop by skipping over a PTE batch
>>>>>> mapping the same folio.
>>>>>>
>>>>>> With a test program migrating pages of the calling process, which
>>>>>> includes a mapped VMA of size 4GB with pte-mapped large folios of
>>>>>> order-9, and migrating once back and forth between node-0 and
>>>>>> node-1, the average execution time reduces from 7.5 to 4 seconds,
>>>>>> giving an approx 47% speedup.
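
[The test program is not part of the quoted patch; below is a minimal
sketch of the kind of reproducer described above, assuming the
migrate_pages() wrapper from libnuma's numaif.h. The 4GB mapping size
and the node-0/node-1 round trip come from the description; the
MADV_HUGEPAGE hint and the omission of error handling are assumptions,
and the exact setup that gets the folios pte-mapped at order-9 (e.g.
mTHP sysfs settings) is not shown.]

#include <numaif.h>		/* migrate_pages(); build with -lnuma */
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define SIZE	(4UL << 30)	/* 4GB VMA, as in the description */

int main(void)
{
	unsigned long node0 = 1UL << 0, node1 = 1UL << 1;
	unsigned long maxnode = 8 * sizeof(unsigned long);
	char *buf;

	buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* assumption: ask for large folios for this range */
	madvise(buf, SIZE, MADV_HUGEPAGE);
	memset(buf, 1, SIZE);	/* fault the whole mapping in */

	/* one round trip: node-0 -> node-1 -> node-0 */
	migrate_pages(getpid(), maxnode, &node0, &node1);
	migrate_pages(getpid(), maxnode, &node1, &node0);
	return 0;
}

[Presumably timing the two migrate_pages() calls is what yields the
7.5s vs 4s comparison quoted above.]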
>>>>>>
>>>>>> v2->v3:
>>>>>>   - Don't use assignment in if condition
>>>>>>
>>>>>> v1->v2:
>>>>>>   - Follow reverse xmas tree declarations
>>>>>>   - Don't initialize nr
>>>>>>   - Move folio_pte_batch() immediately after retrieving a normal folio
>>>>>>   - Increment nr_failed in one shot
>>>>>>
>>>>>> Acked-by: David Hildenbrand
>>>>>> Signed-off-by: Dev Jain
>>>>>> ---
>>>>>>  mm/mempolicy.c | 12 ++++++++++--
>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>>>> index b28a1e6ae096..4d2dc8b63965 100644
>>>>>> --- a/mm/mempolicy.c
>>>>>> +++ b/mm/mempolicy.c
>>>>>> @@ -566,6 +566,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
>>>>>>  static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>>>>  		unsigned long end, struct mm_walk *walk)
>>>>>>  {
>>>>>> +	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>>>>  	struct vm_area_struct *vma = walk->vma;
>>>>>>  	struct folio *folio;
>>>>>>  	struct queue_pages *qp = walk->private;
>>>>>> @@ -573,6 +574,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>>>>  	pte_t *pte, *mapped_pte;
>>>>>>  	pte_t ptent;
>>>>>>  	spinlock_t *ptl;
>>>>>> +	int max_nr, nr;
>>>>>>
>>>>>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>>  	if (ptl) {
>>>>>> @@ -586,7 +588,9 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>>>>  		walk->action = ACTION_AGAIN;
>>>>>>  		return 0;
>>>>>>  	}
>>>>>> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
>>>>>> +	for (; addr != end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>> +		max_nr = (end - addr) >> PAGE_SHIFT;
>>>>>> +		nr = 1;
>>>>>>  		ptent = ptep_get(pte);
>>>>>>  		if (pte_none(ptent))
>>>>>>  			continue;
>>>>>> @@ -598,6 +602,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>>>>  		folio = vm_normal_folio(vma, addr, ptent);
>>>>>>  		if (!folio || folio_is_zone_device(folio))
>>>>>>  			continue;
>>>>>> +		if (folio_test_large(folio) && max_nr != 1)
>>>>>> +			nr = folio_pte_batch(folio, addr, pte, ptent,
>>>>>> +					     max_nr, fpb_flags,
>>>>>> +					     NULL, NULL, NULL);
>>>>>>  		/*
>>>>>>  		 * vm_normal_folio() filters out zero pages, but there might
>>>>>>  		 * still be reserved folios to skip, perhaps in a VDSO.
>>>>>> @@ -630,7 +638,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>>>>>>  		if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
>>>>>>  		    !vma_migratable(vma) ||
>>>>>>  		    !migrate_folio_add(folio, qp->pagelist, flags)) {
>>>>>> -			qp->nr_failed++;
>>>>>> +			qp->nr_failed += nr;
>>>>>
>>>>> Sorry for chiming in late, but I am not convinced that 'qp->nr_failed'
>>>>> should add 'nr' when isolation fails.
>>>>
>>>> This patch does not change the existing behavior. But I stumbled over
>>>> that as well ... and scratched my head.
>>>>
>>>>> From the comments of queue_pages_range():
>>>>> "
>>>>>  * >0 - this number of misplaced folios could not be queued for moving
>>>>>  *      (a hugetlbfs page or a transparent huge page being counted as 1).
>>>>> " >>>>> >>>>> That means if a large folio is failed to isolate, we should only >>>>> add '1' >>>>> for qp->nr_failed instead of the number of pages in this large folio. >>>>> Right? >>>> >>>> I think what the doc really meant is "PMD-mapped THP". PTE-mapped THPs >>>> always had the same behavior: per PTE of the THP we would increment >>>> nr_failed by 1. >>> >>> No? For pte-mapped THPs, it only adds 1 for the large folio, since we >>> have below check in queue_folios_pte_range(). >>> >>> if (folio == qp->large) >>>     continue; >>> >>> Or I missed anything else? >> >> Ah, I got confused by that and thought it would only be for LRU >> isolation purposes. >> >> Yeah, it will kind-of work for now and I think you are correct that we >> would only increment nr_failed by 1. >> >> I still think that counting nr_failed that way is dubious. We should be >> counting pages, which is something that user space from migrate_pages() >> could understand. Having it count arbitrary THPs/large folio sizes is >> really questionable. >> >> But that is indeed a separate issue to resolve. > > Digging into it: > > commit 1cb5d11a370f661c5d0d888bb0cfc2cdc5791382 > Author: Hugh Dickins > Date:   Tue Oct 3 02:17:43 2023 -0700 > >     mempolicy: fix migrate_pages(2) syscall return nr_failed >     "man 2 migrate_pages" says "On success migrate_pages() returns the > number >     of pages that could not be moved".  Although 5.3 and 5.4 commits fixed >     mbind(MPOL_MF_STRICT|MPOL_MF_MOVE*) to fail with EIO when not all > pages >     could be moved (because some could not be isolated for migration), >     migrate_pages(2) was left still reporting only those pages failing > at the >     migration stage, forgetting those failing at the earlier isolation > stage. >     Fix that by accumulating a long nr_failed count in struct queue_pages, >     returned by queue_pages_range() when it's not returning an error, for >     adding on to the nr_failed count from migrate_pages() in > mm/migrate.c.  A >     count of pages?  It's more a count of folios, but changing it to pages >     would entail more work (also in mm/migrate.c): does not seem > justified. > > Yeah, we should be counting pages, but likely nobody really cares, > because we > only care if everything was migrated or something was not migrated. Agree. Like you said, we need a separate patch to do some cleanup for this.