From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5d371247-853d-49ca-a28c-689a709905d4@bytedance.com>
Date: Fri, 8 Nov 2024 15:13:19 +0800
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Jann Horn
Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de,
 muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
 zokeefe@google.com, rientjes@google.com, peterx@redhat.com,
 catalin.marinas@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 x86@kernel.org
Subject: Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hi Jann,

On 2024/11/8 07:35, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng wrote:
>> As a first step, this commit aims to synchronously free the empty PTE
>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>> cases other than madvise(MADV_DONTNEED).
>>
>> Once an empty PTE is detected, we first try to hold the pmd lock within
>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>> whether the PTE page is empty and free it (slow path).
>
> How does this interact with move_pages_pte()? I am looking at your
> series applied on top of next-20241106, and it looks to me like
> move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
> PMD entry can't change. You can clearly see this assumption at the
> WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I

In move_pages_pte(), the following conditions may indeed be triggered:

	/* Sanity checks before the operation */
	if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
	    WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) || WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
		err = -EINVAL;
		goto out;
	}

But maybe we can just remove the WARN_ON_ONCE(), because...

> think for example move_present_pte() can end up moving a present PTE
> into a page table that has already been scheduled for RCU freeing.

...this situation cannot happen. Before performing the move operation,
the pte_same() check is performed after taking the pte lock, which
ensures that the PTE page is stable:

CPU 0                                CPU 1

zap_pte_range
                                     orig_src_pte = ptep_get(src_pte);
pmd_lock
pte_lock
check if all PTEs are pte_none()
--> clear pmd entry
unlock pte
unlock pmd
                                     src_pte_lock
                                     pte_same(orig_src_pte, ptep_get(src_pte))
                                     --> return false and will skip the move op

>
>> Signed-off-by: Qi Zheng
>> ---
>>  include/linux/mm.h |  1 +
>>  mm/Kconfig         | 15 ++++++++++
>>  mm/Makefile        |  1 +
>>  mm/internal.h      | 23 ++++++++++++++++
>>  mm/madvise.c       |  4 ++-
>>  mm/memory.c        | 45 +++++++++++++++++++++++++++++-
>>  mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>>  7 files changed, 155 insertions(+), 2 deletions(-)
>>  create mode 100644 mm/pt_reclaim.c
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 3e4bb43035953..ce3936590fe72 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
>>  struct zap_details {
>>  	struct folio *single_folio;	/* Locked folio to be unmapped */
>>  	bool even_cows;			/* Zap COWed private pages too? */
>> +	bool reclaim_pt;		/* Need reclaim page tables? */
>>  	zap_flags_t zap_flags;		/* Extra flags for zapping */
>>  };
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 84000b0168086..681909e0a9fa3 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>>  	  The architecture has hardware support for userspace shadow call
>>  	  stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>
>> +config ARCH_SUPPORTS_PT_RECLAIM
>> +	def_bool n
>> +
>> +config PT_RECLAIM
>> +	bool "reclaim empty user page table pages"
>> +	default y
>> +	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> +	select MMU_GATHER_RCU_TABLE_FREE
>> +	help
>> +	  Try to reclaim empty user page table pages in paths other that munmap
>
> nit: s/other that/other than/

Will fix.

>
>> +	  and exit_mmap path.
>> +
>> +	  Note: now only empty user PTE page table pages will be reclaimed.
> [...]
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 0ceae57da7dad..ee88652761d45 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>>  					unsigned long start, unsigned long end)
>>  {
>> -	zap_page_range_single(vma, start, end - start, NULL);
>> +	struct zap_details details = {.reclaim_pt = true,};
>> +
>> +	zap_page_range_single(vma, start, end - start, &details);
>>  	return 0;
>>  }
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 002aa4f454fa0..c4a8c18fbcfd7 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>  static inline bool should_zap_cows(struct zap_details *details)
>>  {
>>  	/* By default, zap all pages */
>> -	if (!details)
>> +	if (!details || details->reclaim_pt)
>>  		return true;
>>
>>  	/* Or, we zap COWed pages only if the caller wants to */
>
> This looks hacky - when we have a "details" object, its ->even_cows
> member is supposed to indicate whether COW pages should be zapped. So
> please instead set .even_cows=true in madvise_dontneed_single_vma().

But details->reclaim_pt should still be set as well, right? We need
.reclaim_pt to indicate whether the empty PTE page should be reclaimed.

>
>> @@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>  					   details, rss);
>>  }
>>
>> +static inline int count_pte_none(pte_t *pte, int nr)
>> +{
>> +	int none_nr = 0;
>> +
>> +	/*
>> +	 * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
>> +	 * re-installed, so we need to check pte_none() one by one.
>> +	 * Otherwise, checking a single PTE in a batch is sufficient.
>> +	 */
>
> Please also add a comment noting that count_pte_none() relies on this
> invariant on top of do_zap_pte_range() or somewhere like that.

OK, will add a comment above do_zap_pte_range() to explain this special
case.
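Maybe something along these lines (just a rough sketch, wording to be
refined):

	/*
	 * do_zap_pte_range() zaps PTEs in batches. count_pte_none() relies
	 * on the invariant that all PTEs in a batch end up in the same
	 * state after zapping, so checking the first entry of a batch is
	 * sufficient -- except when CONFIG_PTE_MARKER_UFFD_WP is enabled,
	 * in which case uffd-wp markers may have been re-installed and
	 * each entry must be checked individually.
	 */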
>
>> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
>> +	for (;;) {
>> +		if (pte_none(ptep_get(pte)))
>> +			none_nr++;
>> +		if (--nr == 0)
>> +			break;
>> +		pte++;
>> +	}
>> +#else
>> +	if (pte_none(ptep_get(pte)))
>> +		none_nr = nr;
>> +#endif
>> +	return none_nr;
>> +}
>> +
>>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>  				struct vm_area_struct *vma, pmd_t *pmd,
>>  				unsigned long addr, unsigned long end,
>> @@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>  	spinlock_t *ptl;
>>  	pte_t *start_pte;
>>  	pte_t *pte;
>> +	pmd_t pmdval;
>> +	bool can_reclaim_pt = false;
>> +	bool direct_reclaim = false;
>> +	unsigned long start = addr;
>> +	int none_nr = 0;
>>  	int nr;
>>
>> +	if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
>> +		can_reclaim_pt = true;
>> +
>>  retry:
>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>  	init_rss_vec(rss);
>> @@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>
>>  		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
>>  				      &force_flush, &force_break);
>> +		none_nr += count_pte_none(pte, nr);
>>  		if (unlikely(force_break)) {
>>  			addr += nr * PAGE_SIZE;
>>  			break;
>>  		}
>>  	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> +	if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
>> +		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>> +
>>  	add_mm_rss_vec(mm, rss);
>>  	arch_leave_lazy_mmu_mode();
>>
>> @@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>  		goto retry;
>>  	}
>>
>> +	if (can_reclaim_pt) {
>> +		if (direct_reclaim)
>> +			free_pte(mm, start, tlb, pmdval);
>> +		else
>> +			try_to_free_pte(mm, pmd, start, tlb);
>> +	}
>> +
>>  	return addr;
>>  }
>>
>> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
>> new file mode 100644
>> index 0000000000000..fc055da40b615
>> --- /dev/null
>> +++ b/mm/pt_reclaim.c
>> @@ -0,0 +1,68 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/hugetlb.h>
>> +#include <asm-generic/tlb.h>
>> +#include <asm/pgalloc.h>
>> +
>> +#include "internal.h"
>> +
>> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
>> +{
>> +	spinlock_t *pml = pmd_lockptr(mm, pmd);
>> +
>> +	if (!spin_trylock(pml))
>> +		return false;
>> +
>> +	*pmdval = pmdp_get_lockless(pmd);
>> +	pmd_clear(pmd);
>> +	spin_unlock(pml);
>> +
>> +	return true;
>> +}
>> +
>> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
>> +	      pmd_t pmdval)
>> +{
>> +	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
>> +	mm_dec_nr_ptes(mm);
>> +}
>> +
>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>> +		     struct mmu_gather *tlb)
>> +{
>> +	pmd_t pmdval;
>> +	spinlock_t *pml, *ptl;
>> +	pte_t *start_pte, *pte;
>> +	int i;
>> +
>
> If you swap the following two operations around (first pmd_lock(),
> then pte_offset_map_rw_nolock(), then take the PTE lock):

Indeed, will change to that.
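Something like this, I suppose (just a sketch of the reordered locking,
not the final code; out_pml is a new label that would only unlock pml):

	pml = pmd_lock(mm, pmd);
	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!start_pte)
		goto out_pml;
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

With the pmd lock held before mapping the PTE page, the pmd entry stays
stable, so the pmd_same() recheck below should indeed no longer be
needed.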

Thanks,
Qi

>
>> +	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>> +	if (!start_pte)
>> +		return;
>> +
>> +	pml = pmd_lock(mm, pmd);
>> +	if (ptl != pml)
>> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +
>
> Then I think this check can go away:
>
>> +	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>> +		goto out_ptl;
>> +
>> +	/* Check if it is empty PTE page */
>> +	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>> +		if (!pte_none(ptep_get(pte)))
>> +			goto out_ptl;
>> +	}
>> +	pte_unmap(start_pte);
>> +
>> +	pmd_clear(pmd);
>> +
>> +	if (ptl != pml)
>> +		spin_unlock(ptl);
>> +	spin_unlock(pml);
>> +
>> +	free_pte(mm, addr, tlb, pmdval);
>> +
>> +	return;
>> +out_ptl:
>> +	pte_unmap_unlock(start_pte, ptl);
>> +	if (pml != ptl)
>> +		spin_unlock(pml);
>> +}
>> --
>> 2.20.1
>>