From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7f96283b-11b3-49ee-9d2d-5ad977325cb0@linux.alibaba.com>
Date: Wed, 16 Apr 2025 14:32:27 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v3] mempolicy: Optimize queue_folios_pte_range by PTE batching
To: Dev Jain, akpm@linux-foundation.org
Cc: ryan.roberts@arm.com, david@redhat.com, willy@infradead.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, hughd@google.com,
 vishal.moola@gmail.com, yang@os.amperecomputing.com, ziy@nvidia.com
References: <20250416053048.96479-1-dev.jain@arm.com>
From: Baolin Wang
In-Reply-To: <20250416053048.96479-1-dev.jain@arm.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 2025/4/16 13:30, Dev Jain wrote:
> After the check for queue_folio_required(), the code only cares about the
> folio in the for loop, i.e the PTEs are redundant. Therefore, optimize
> this loop by skipping over a PTE batch mapping the same folio.
>
> With a test program migrating pages of the calling process, which includes
> a mapped VMA of size 4GB with pte-mapped large folios of order-9, and
> migrating once back and forth node-0 and node-1, the average execution
> time reduces from 7.5 to 4 seconds, giving an approx 47% speedup.
>
> v2->v3:
>  - Don't use assignment in if condition
>
> v1->v2:
>  - Follow reverse xmas tree declarations
>  - Don't initialize nr
>  - Move folio_pte_batch() immediately after retrieving a normal folio
>  - increment nr_failed in one shot
>
> Acked-by: David Hildenbrand
> Signed-off-by: Dev Jain
> ---
>  mm/mempolicy.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b28a1e6ae096..4d2dc8b63965 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -566,6 +566,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
>  static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  			unsigned long end, struct mm_walk *walk)
>  {
> +	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	struct vm_area_struct *vma = walk->vma;
>  	struct folio *folio;
>  	struct queue_pages *qp = walk->private;
> @@ -573,6 +574,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  	pte_t *pte, *mapped_pte;
>  	pte_t ptent;
>  	spinlock_t *ptl;
> +	int max_nr, nr;
>
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> @@ -586,7 +588,9 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  		walk->action = ACTION_AGAIN;
>  		return 0;
>  	}
> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +	for (; addr != end; pte += nr, addr += nr * PAGE_SIZE) {
> +		max_nr = (end - addr) >> PAGE_SHIFT;
> +		nr = 1;
>  		ptent = ptep_get(pte);
>  		if (pte_none(ptent))
>  			continue;
> @@ -598,6 +602,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  		folio = vm_normal_folio(vma, addr, ptent);
>  		if (!folio || folio_is_zone_device(folio))
>  			continue;
> +		if (folio_test_large(folio) && max_nr != 1)
> +			nr = folio_pte_batch(folio, addr, pte, ptent,
> +					     max_nr, fpb_flags,
> +					     NULL, NULL, NULL);
>  		/*
>  		 * vm_normal_folio() filters out zero pages, but there might
>  		 * still be reserved folios to skip, perhaps in a VDSO.
> @@ -630,7 +638,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  		if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
>  		    !vma_migratable(vma) ||
>  		    !migrate_folio_add(folio, qp->pagelist, flags)) {
> -			qp->nr_failed++;
> +			qp->nr_failed += nr;

Sorry for chiming in late, but I am not convinced that 'qp->nr_failed'
should add 'nr' when isolation fails.

From the comments of queue_pages_range():
"
 * >0 - this number of misplaced folios could not be queued for moving
 *      (a hugetlbfs page or a transparent huge page being counted as 1).
"

That means if a large folio fails to be isolated, we should only add '1'
to qp->nr_failed instead of the number of pages in this large folio.
Right?
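
To make the question concrete, here is a small userspace model of the
batched loop, not kernel code. Everything in it is a hypothetical
stand-in: each array slot plays the role of one PTE and holds the id of
the folio it maps (0 meaning nothing is mapped), model_pte_batch() only
loosely imitates what folio_pte_batch() returns, and every isolation
attempt is assumed to fail. It tallies failures both ways so the two
accounting choices can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: ptes[i] is the id of the folio mapped by slot i,
 * or 0 if the slot is empty (the pte_none() case in the real loop).
 */

/*
 * Count consecutive slots, starting at idx, that map the same folio,
 * capped at max_nr. A loose stand-in for folio_pte_batch().
 */
static size_t model_pte_batch(const int *ptes, size_t idx, size_t max_nr)
{
	size_t nr = 1;

	assert(max_nr >= 1);
	while (nr < max_nr && ptes[idx + nr] == ptes[idx])
		nr++;
	return nr;
}

struct model_counts {
	size_t per_pte;		/* nr_failed += nr, as in the quoted hunk */
	size_t per_folio;	/* nr_failed += 1, one per failed folio */
};

/* Walk the slots in batches, assuming every isolation attempt fails. */
static struct model_counts model_count_failures(const int *ptes, size_t n)
{
	struct model_counts c = { 0, 0 };
	size_t nr;

	for (size_t i = 0; i < n; i += nr) {
		nr = 1;
		if (ptes[i] == 0)
			continue;	/* empty slot: advance by one */
		nr = model_pte_batch(ptes, i, n - i);
		c.per_pte += nr;
		c.per_folio += 1;
	}
	return c;
}
```

For the slot array {1, 1, 1, 1, 0, 2, 3, 3}, model_count_failures()
yields per_pte == 7 and per_folio == 3. The per_folio tally is the one
that matches the wording of the queue_pages_range() comment quoted
above; the model does not settle which one callers actually rely on.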