From: Jann Horn <jannh@google.com>
Date: Fri, 8 Nov 2024 00:35:38 +0100
Subject: Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
To: Qi Zheng
Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org, zokeefe@google.com, rientjes@google.com, peterx@redhat.com, catalin.marinas@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng wrote:
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).

How does this interact with move_pages_pte()? I am looking at your
series applied on top of next-20241106, and it looks to me like
move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
PMD entry can't change. You can clearly see this assumption at the
WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
think for example move_present_pte() can end up moving a present PTE
into a page table that has already been scheduled for RCU freeing.
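
To spell out the interleaving I'm worried about (going off my reading
of next-20241106 with this series applied, so I might be missing a
recheck somewhere):

  madvise(MADV_DONTNEED)                 userfaultfd MOVE
  ======================                 ================
  zap_pte_range()
    clears all PTEs, drops the PTE lock
                                         move_pages_pte()
                                           pte_offset_map_rw_nolock(dst_pmd)
                                           [dst_pmd still looks valid]
  try_to_free_pte()
    takes pmd lock + PTE lock,
    sees only pte_none() entries
    pmd_clear()
    free_pte() - PTE page is queued
    for RCU freeing
                                           takes the PTE lock, without
                                           re-validating *dst_pmd
                                           move_present_pte() installs a
                                           present PTE into the detached
                                           PTE page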
> Signed-off-by: Qi Zheng
> ---
>  include/linux/mm.h |  1 +
>  mm/Kconfig         | 15 ++++++++++
>  mm/Makefile        |  1 +
>  mm/internal.h      | 23 ++++++++++++++++
>  mm/madvise.c       |  4 ++-
>  mm/memory.c        | 45 +++++++++++++++++++++++++++++-
>  mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 155 insertions(+), 2 deletions(-)
>  create mode 100644 mm/pt_reclaim.c
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3e4bb43035953..ce3936590fe72 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
>  struct zap_details {
>          struct folio *single_folio;     /* Locked folio to be unmapped */
>          bool even_cows;                 /* Zap COWed private pages too? */
> +        bool reclaim_pt;                /* Need reclaim page tables? */
>          zap_flags_t zap_flags;          /* Extra flags for zapping */
>  };
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 84000b0168086..681909e0a9fa3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>            The architecture has hardware support for userspace shadow call
>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>
> +config ARCH_SUPPORTS_PT_RECLAIM
> +        def_bool n
> +
> +config PT_RECLAIM
> +        bool "reclaim empty user page table pages"
> +        default y
> +        depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> +        select MMU_GATHER_RCU_TABLE_FREE
> +        help
> +          Try to reclaim empty user page table pages in paths other that munmap

nit: s/other that/other than/

> +          and exit_mmap path.
> +
> +          Note: now only empty user PTE page table pages will be reclaimed.
[...]

> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0ceae57da7dad..ee88652761d45 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>                                          unsigned long start, unsigned long end)
>  {
> -        zap_page_range_single(vma, start, end - start, NULL);
> +        struct zap_details details = {.reclaim_pt = true,};
> +
> +        zap_page_range_single(vma, start, end - start, &details);
>          return 0;
>  }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 002aa4f454fa0..c4a8c18fbcfd7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  static inline bool should_zap_cows(struct zap_details *details)
>  {
>          /* By default, zap all pages */
> -        if (!details)
> +        if (!details || details->reclaim_pt)
>                  return true;
>
>          /* Or, we zap COWed pages only if the caller wants to */

This looks hacky - when we have a "details" object, its ->even_cows
member is supposed to indicate whether COW pages should be zapped. So
please instead set .even_cows=true in madvise_dontneed_single_vma().
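
Something like this (untested, just to sketch what I mean):

static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end)
{
        /*
         * Untested sketch: zap COW pages explicitly via ->even_cows
         * instead of special-casing ->reclaim_pt in should_zap_cows().
         */
        struct zap_details details = {
                .reclaim_pt = true,
                .even_cows = true,
        };

        zap_page_range_single(vma, start, end - start, &details);
        return 0;
}

Then the should_zap_cows() hunk above can be dropped.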
> @@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>                                     details, rss);
>  }
>
> +static inline int count_pte_none(pte_t *pte, int nr)
> +{
> +        int none_nr = 0;
> +
> +        /*
> +         * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
> +         * re-installed, so we need to check pte_none() one by one.
> +         * Otherwise, checking a single PTE in a batch is sufficient.
> +         */

Please also add a comment noting that count_pte_none() relies on this
invariant on top of do_zap_pte_range() or somewhere like that.

> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> +        for (;;) {
> +                if (pte_none(ptep_get(pte)))
> +                        none_nr++;
> +                if (--nr == 0)
> +                        break;
> +                pte++;
> +        }
> +#else
> +        if (pte_none(ptep_get(pte)))
> +                none_nr = nr;
> +#endif
> +        return none_nr;
> +}
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                                  struct vm_area_struct *vma, pmd_t *pmd,
>                                  unsigned long addr, unsigned long end,
> @@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>          spinlock_t *ptl;
>          pte_t *start_pte;
>          pte_t *pte;
> +        pmd_t pmdval;
> +        bool can_reclaim_pt = false;
> +        bool direct_reclaim = false;
> +        unsigned long start = addr;
> +        int none_nr = 0;
>          int nr;
>
> +        if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
> +                can_reclaim_pt = true;
> +
>  retry:
>          tlb_change_page_size(tlb, PAGE_SIZE);
>          init_rss_vec(rss);
> @@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>
>                  nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
>                                        &force_flush, &force_break);
> +                none_nr += count_pte_none(pte, nr);
>                  if (unlikely(force_break)) {
>                          addr += nr * PAGE_SIZE;
>                          break;
>                  }
>          } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> +        if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
> +                direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
> +
>          add_mm_rss_vec(mm, rss);
>          arch_leave_lazy_mmu_mode();
>
> @@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                  goto retry;
>          }
>
> +        if (can_reclaim_pt) {
> +                if (direct_reclaim)
> +                        free_pte(mm, start, tlb, pmdval);
> +                else
> +                        try_to_free_pte(mm, pmd, start, tlb);
> +        }
> +
>          return addr;
>  }
>
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> new file mode 100644
> index 0000000000000..fc055da40b615
> --- /dev/null
> +++ b/mm/pt_reclaim.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/hugetlb.h>
> +#include <asm-generic/tlb.h>
> +#include <asm/pgalloc.h>
> +
> +#include "internal.h"
> +
> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
> +{
> +        spinlock_t *pml = pmd_lockptr(mm, pmd);
> +
> +        if (!spin_trylock(pml))
> +                return false;
> +
> +        *pmdval = pmdp_get_lockless(pmd);
> +        pmd_clear(pmd);
> +        spin_unlock(pml);
> +
> +        return true;
> +}
> +
> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
> +              pmd_t pmdval)
> +{
> +        pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
> +        mm_dec_nr_ptes(mm);
> +}
> +
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +                     struct mmu_gather *tlb)
> +{
> +        pmd_t pmdval;
> +        spinlock_t *pml, *ptl;
> +        pte_t *start_pte, *pte;
> +        int i;
> +

If you swap the following two operations around (first pmd_lock(),
then pte_offset_map_rw_nolock(), then take the PTE lock):

> +        start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +        if (!start_pte)
> +                return;
> +
> +        pml = pmd_lock(mm, pmd);
> +        if (ptl != pml)
> +                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +

Then I think this check can go away:

> +        if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> +                goto out_ptl;
> +
> +        /* Check if it is empty PTE page */
> +        for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +                if (!pte_none(ptep_get(pte)))
> +                        goto out_ptl;
> +        }
> +        pte_unmap(start_pte);
> +
> +        pmd_clear(pmd);
> +
> +        if (ptl != pml)
> +                spin_unlock(ptl);
> +        spin_unlock(pml);
> +
> +        free_pte(mm, addr, tlb, pmdval);
> +
> +        return;
> +out_ptl:
> +        pte_unmap_unlock(start_pte, ptl);
> +        if (pml != ptl)
> +                spin_unlock(pml);
> +}
> --
> 2.20.1
>
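
In other words, roughly like this (untested sketch of the reordering,
with everything else kept as-is; I haven't checked whether
pte_offset_map_rw_nolock() failure needs more handling here):

void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
                     struct mmu_gather *tlb)
{
        pmd_t pmdval;
        spinlock_t *pml, *ptl;
        pte_t *start_pte, *pte;
        int i;

        /* Take the pmd lock first, so the PMD entry is stable below. */
        pml = pmd_lock(mm, pmd);
        start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
        if (!start_pte) {
                spin_unlock(pml);
                return;
        }
        if (ptl != pml)
                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

        /*
         * No pmd_same() recheck: the pmd lock has been held since
         * pte_offset_map_rw_nolock() read pmdval, so the entry cannot
         * have changed in between.
         */

        /* Check if it is an empty PTE page. */
        for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
                if (!pte_none(ptep_get(pte)))
                        goto out_ptl;
        }
        pte_unmap(start_pte);

        pmd_clear(pmd);

        if (ptl != pml)
                spin_unlock(ptl);
        spin_unlock(pml);

        free_pte(mm, addr, tlb, pmdval);
        return;

out_ptl:
        pte_unmap_unlock(start_pte, ptl);
        if (pml != ptl)
                spin_unlock(pml);
}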