From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lokesh Gidra <lokeshgidra@google.com>
Date: Mon, 4 Aug 2025 23:30:35 -0700
Subject: Re: [PATCH] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, aarcange@redhat.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, ngeoffray@google.com, Suren Baghdasaryan,
	Kalesh Singh, Barry Song, David Hildenbrand, Peter Xu
References: <20250731104726.103071-1-lokeshgidra@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Mon, Aug 4, 2025 at 9:35 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:47 PM Lokesh Gidra wrote:
> >
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > includes TLB-flush overhead.
> >
> > Cc: Suren Baghdasaryan
> > Cc: Kalesh Singh
> > Cc: Barry Song
> > Cc: David Hildenbrand
> > Cc: Peter Xu
> > Signed-off-by: Lokesh Gidra
> > ---
> >  mm/userfaultfd.c | 179 +++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 127 insertions(+), 52 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 8253978ee0fb..2465fb234671 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1026,18 +1026,62 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> >                 pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> >  }
> >
> > -static int move_present_pte(struct mm_struct *mm,
> > -                            struct vm_area_struct *dst_vma,
> > -                            struct vm_area_struct *src_vma,
> > -                            unsigned long dst_addr, unsigned long src_addr,
> > -                            pte_t *dst_pte, pte_t *src_pte,
> > -                            pte_t orig_dst_pte, pte_t orig_src_pte,
> > -                            pmd_t *dst_pmd, pmd_t dst_pmdval,
> > -                            spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                            struct folio *src_folio)
> > +/*
> > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > + * move. If so, then returns pointer to the folio, after locking it. Otherwise,
> > + * returns NULL.
> > + */
> > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > +                                                 unsigned long src_addr,
> > +                                                 pte_t *src_pte, pte_t *dst_pte)
> > +{
> > +        pte_t orig_dst_pte, orig_src_pte;
> > +        struct folio *folio;
> > +
> > +        orig_dst_pte = ptep_get(dst_pte);
> > +        if (!pte_none(orig_dst_pte))
> > +                return NULL;
> > +
> > +        orig_src_pte = ptep_get(src_pte);
> > +        if (pte_none(orig_src_pte))
> > +                return NULL;
> > +        if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > +                return NULL;
> > +
> > +        folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > +        if (!folio || !folio_trylock(folio))
> > +                return NULL;
> > +        if (!PageAnonExclusive(&folio->page) || folio_test_large(folio)) {
> > +                folio_unlock(folio);
> > +                return NULL;
> > +        }
> > +        return folio;
> > +}
> > +
> > +static long move_present_ptes(struct mm_struct *mm,
> > +                              struct vm_area_struct *dst_vma,
> > +                              struct vm_area_struct *src_vma,
> > +                              unsigned long dst_addr, unsigned long src_addr,
> > +                              pte_t *dst_pte, pte_t *src_pte,
> > +                              pte_t orig_dst_pte, pte_t orig_src_pte,
> > +                              pmd_t *dst_pmd, pmd_t dst_pmdval,
> > +                              spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > +                              struct folio *src_folio, unsigned long len)
> >  {
> >         int err = 0;
> > +        unsigned long src_start = src_addr;
> > +        unsigned long addr_end;
> > +
> > +        if (len > PAGE_SIZE) {
> > +                addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > +                if (dst_addr + len > addr_end)
> > +                        len = addr_end - dst_addr;
> >
> > +                addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > +                if (src_addr + len > addr_end)
> > +                        len = addr_end - src_addr;
> > +        }
> > +        flush_cache_range(src_vma, src_addr, src_addr + len);
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1051,31 +1095,60 @@ static int move_present_pte(struct mm_struct *mm,
> >                 err = -EBUSY;
> >                 goto out;
> >         }
> > +        /* Avoid batching overhead for single page case */
> > +        if (len > PAGE_SIZE) {
> > +                flush_tlb_batched_pending(mm);
>
> What's confusing to me is that they track the unmapping of multiple
> consecutive PTEs and defer TLB invalidation until later.
> In contrast, you're not tracking anything and instead call
> flush_tlb_range() directly, which triggers the flush immediately.
>
> It seems you might be combining two different batching approaches.

These changes are in line with how mremap() does batching; see
move_ptes() in mm/mremap.c.

From the comment in flush_tlb_batched_pending() [1] it seems necessary
in this case too. Please correct me if I'm wrong. I'll be happy to
remove it if it's not required.

[1] https://elixir.bootlin.com/linux/v6.16/source/mm/rmap.c#L728

> From what I can tell, you're essentially using flush_range
> as a replacement for flushing each entry individually.

That's correct. The idea is to reduce the number of IPIs required for
flushing the TLB entries. Since it is quite common for the ioctl to be
invoked with several pages in one go, the batching helps considerably.

>
> > +                arch_enter_lazy_mmu_mode();
> > +                orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > +        } else
> > +                orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
> > +
> > +        addr_end = src_start + len;
> > +        do {
> > +                /* Folio got pinned from under us. Put it back and fail the move. */
> > +                if (folio_maybe_dma_pinned(src_folio)) {
> > +                        set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> > +                        err = -EBUSY;
> > +                        break;
> > +                }
> >
> > -        orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
> > -        /* Folio got pinned from under us. Put it back and fail the move. */
> > -        if (folio_maybe_dma_pinned(src_folio)) {
> > -                set_pte_at(mm, src_addr, src_pte, orig_src_pte);
> > -                err = -EBUSY;
> > -                goto out;
> > -        }
> > -
> > -        folio_move_anon_rmap(src_folio, dst_vma);
> > -        src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +                folio_move_anon_rmap(src_folio, dst_vma);
> > +                src_folio->index = linear_page_index(dst_vma, dst_addr);
> >
> > -        orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> > -        /* Set soft dirty bit so userspace can notice the pte was moved */
> > +                orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
> > +                /* Set soft dirty bit so userspace can notice the pte was moved */
> >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > -        orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> > +                orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
> >  #endif
> > -        if (pte_dirty(orig_src_pte))
> > -                orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > -        orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > +                if (pte_dirty(orig_src_pte))
> > +                        orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > +                orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > +                set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> > +
> > +                src_addr += PAGE_SIZE;
> > +                if (src_addr == addr_end)
> > +                        break;
> > +                src_pte++;
> > +                dst_pte++;
> >
> > -        set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> > +                folio_unlock(src_folio);
> > +                src_folio = check_ptes_for_batched_move(src_vma, src_addr, src_pte, dst_pte);
> > +                if (!src_folio)
> > +                        break;
> > +                orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > +                dst_addr += PAGE_SIZE;
> > +        } while (true);
> > +
> > +        if (len > PAGE_SIZE) {
> > +                arch_leave_lazy_mmu_mode();
> > +                if (src_addr > src_start)
> > +                        flush_tlb_range(src_vma, src_start, src_addr);
> > +        }
>
> Can't we just remove the `if (len > PAGE_SIZE)` check and unify the
> handling for both single-page and multi-page cases?

We certainly can. Initially it seemed to me that lazy/batched
invalidation has its own overhead, and I wanted to avoid that in the
single-page case because the ioctl does get called for single pages
quite a bit, and in time-sensitive code paths at that.

However, on a closer look, the overhead isn't really that different.
I'll unify the two cases in the next version. Thanks for the
suggestion.

>
> Thanks
> Barry
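
For anyone skimming the archive, the core of the idea being discussed
is: clear a run of present PTEs without flushing each one, then issue a
single ranged TLB flush for the whole run instead of one
ptep_clear_flush() per page. Below is a rough, hypothetical sketch of
such a unified path. It is illustrative only, not the actual patch or
its follow-up: the function name is made up, it assumes the lazy-MMU
batching cost is acceptable even for a single page, and it elides the
PTL locking, folio checks and destination-PTE installation that the
real code performs. It relies on the same mm-internal helpers the patch
above already uses.

/*
 * Hypothetical illustration only: move a run of present PTEs and issue
 * one ranged TLB invalidation for the batch, rather than one flush per
 * PTE.
 */
static void sketch_move_present_ptes(struct mm_struct *mm,
                                     struct vm_area_struct *src_vma,
                                     unsigned long src_addr,
                                     unsigned long len,
                                     pte_t *src_pte)
{
        unsigned long start = src_addr, end = src_addr + len;

        flush_cache_range(src_vma, start, end);
        /* Finish any reclaim-side deferred flushes before touching PTEs. */
        flush_tlb_batched_pending(mm);
        arch_enter_lazy_mmu_mode();

        for (; src_addr < end; src_addr += PAGE_SIZE, src_pte++) {
                /* Clear without flushing; invalidation is batched below. */
                pte_t pte = ptep_get_and_clear(mm, src_addr, src_pte);

                /* pte would be installed at the destination here (elided). */
                (void)pte;
        }

        arch_leave_lazy_mmu_mode();
        /* One ranged invalidation covers the whole batch. */
        flush_tlb_range(src_vma, start, end);
}

Compared with calling ptep_clear_flush() inside the loop, which pays a
TLB invalidation (and, depending on the architecture, a round of IPIs)
for every page, the single flush_tlb_range() at the end amortizes that
cost across the batch; that per-page flush is the cost the commit
message measures at over 40% of move_pages_pte() without batching.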