From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E7EC3D64074 for ; Fri, 8 Nov 2024 18:04:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7B7226B009F; Fri, 8 Nov 2024 13:04:56 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 766596B00B3; Fri, 8 Nov 2024 13:04:56 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5E0D16B00B5; Fri, 8 Nov 2024 13:04:56 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 399B06B009F for ; Fri, 8 Nov 2024 13:04:56 -0500 (EST) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id EC92F80F01 for ; Fri, 8 Nov 2024 18:04:55 +0000 (UTC) X-FDA: 82763702658.23.D614911 Received: from mail-lf1-f51.google.com (mail-lf1-f51.google.com [209.85.167.51]) by imf28.hostedemail.com (Postfix) with ESMTP id 5E579C0008 for ; Fri, 8 Nov 2024 18:04:16 +0000 (UTC) Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=YYPTH3AU; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf28.hostedemail.com: domain of jannh@google.com designates 209.85.167.51 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731089033; a=rsa-sha256; cv=none; b=z9nDvX5fwX4Bb28tgAgg4IW4wws/4iUgjBtfH1n96xalbY468oDzrGyomPSAiBCxjxMLa1 +a6UQZP8AEpDE3DgRCZAOkTYh0UDZ7iCHq0KqmBCrcP81sTCDsFpBPQ6iPWH5axztIzwTo yYww3926PxpGJy0V2rpxBzlCBXCLRH0= ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=YYPTH3AU; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf28.hostedemail.com: domain of jannh@google.com designates 209.85.167.51 as permitted sender) smtp.mailfrom=jannh@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731089033; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=tY6MX6G2iuh1hwT0r0xv8GFRuWlJgXq9lNuDxr3taP8=; b=rW7haypv8zNXXEI7bv18ugxNrYiXoOxVPCTVMNrXS3DzwZlbwTm/7PrZ1neBpqlNX8veDc /KIhQZlGUCEfPPVXzmTqnNmKersxHM48bkbH4zVzLDwATIqJqj9TdZ8eyKxp0Ckqh4AzYI WmvY5tTI/xbiQWK9MfHizoccSY0QCsM= Received: by mail-lf1-f51.google.com with SMTP id 2adb3069b0e04-539e19d8c41so900e87.0 for ; Fri, 08 Nov 2024 10:04:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1731089092; x=1731693892; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=tY6MX6G2iuh1hwT0r0xv8GFRuWlJgXq9lNuDxr3taP8=; b=YYPTH3AU5lSFwVi5WbTYmY2AiY796Wp+7wViLUU5gJObTwPAXiYLJiIJG3ROIom19A jKzXUCjjte8aLrLcsr5CZTDYsM1Vr0LsPA++xr5o/gAGoXrJFN1gvvjx8pF46RvHxHJ+ kWabPfufGB1jREUjsBKpctv86iP8+jlnimesbFIZimrDnKa5Ey4LGBfN1JhdRmfITmcp 5oPMfZjeM50PIFbVunQ1GTBWGpA0iqKwD4Dyt/F+zvqiDToulh7jL1i7kkS2uhfVX6Y5 /UQTI3rwXhhg3/iYC6rnOvbXNqULNH2vRIjoI5ed909exAhMu/Oe1h3KJvHsS2vYP0aA 24fA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731089092; x=1731693892; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tY6MX6G2iuh1hwT0r0xv8GFRuWlJgXq9lNuDxr3taP8=; b=ItGWcMLUuLbvm4Z5F4oYmgMNhJBuOT0lqrUoXQMh3/YPyFI2oK8J6/hzmAXt415QEt ZAYelkmG/fl9B+zXGyes6Z09LFmT37arlOphvY7yQDwzu9nDftfhMqtKkohJCmgDO/ji Tayf4vzUuan5at7o2CNzvg36Wsjw/P6qyNKDe03dvgqcS0NjFb/K6Z9t4i59ol+k4Q9m WFmuebYFKE03A/Nmcrop0BtwvO+WDtxkO71TvcVBEgESTR/kCz1JUvTgQpR3qPWvN14R 2hdlJR9bnXrNN/VNKBJV9nqktB4HaNF7zU2z2JWpXApJdN50Zm4mCrPWhRwwBbUMvGtR KY+g== X-Forwarded-Encrypted: i=1; AJvYcCU/anICMrVnG2XOGFiEgMQs91IBG69lC7pf2NnVUb/tPl2Q3kEgZ/tT0ICE0UAoqjL8I5oQRdPMbg==@kvack.org X-Gm-Message-State: AOJu0Yw7OYZ7dJL4a+wfCOhwB8ebBnYHAOQoDJT+Yc7s6fNrlpVqryRD bGsSiUQu4vsPM22dkFR+SGlCel+pD7PZWf4B69pASRFV9imB0Ph6Yan/nvf9UUaucle0OK4w4v9 osX4Nx8/gFTnwnDvMDXQ+Xgc7h+M7m71BwyAa X-Gm-Gg: ASbGnctU07fM1fSVHIeEv7L1HlqmXNN6jM6XFWwE3GgGViLJZFpKum0y8IpCTmAyr2K hSCdXdU/7SVLnci75JcKpceJZKYwFwH1ebU65lUu6C8XtbMLbs4wBkoE0T2g= X-Google-Smtp-Source: AGHT+IGj2IqDXz+/Aet3CYfAPxn/D2mSXM68OjaFUwS87DjQkIufp1pkYuA8Lybz4NCssvKIyXdVwN+eL7wErfuawL8= X-Received: by 2002:ac2:4d9a:0:b0:53c:75e8:a5c9 with SMTP id 2adb3069b0e04-53d8122b836mr892006e87.4.1731089091827; Fri, 08 Nov 2024 10:04:51 -0800 (PST) MIME-Version: 1.0 References: <5d371247-853d-49ca-a28c-689a709905d4@bytedance.com> In-Reply-To: <5d371247-853d-49ca-a28c-689a709905d4@bytedance.com> From: Jann Horn Date: Fri, 8 Nov 2024 19:04:15 +0100 Message-ID: Subject: Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) To: Qi Zheng Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org, zokeefe@google.com, rientjes@google.com, peterx@redhat.com, catalin.marinas@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Queue-Id: 5E579C0008 X-Rspamd-Server: rspam11 X-Stat-Signature: tdoupnarg1s8puxoz3zgj3qpcanfjcfd X-HE-Tag: 1731089056-146884 X-HE-Meta: U2FsdGVkX1/YqHYdOkxpT3/y0Nwauq9fcugpTJNDUXgPwtcu0bMOsaxvhSQW27aRFg+JRlEArfLP6kpCXuhdPVXNZZL32EWGCwRCE5PSJpYqs31dkPsD+VdxJj64UpwTeSGLx3W5jqje4dwsjdUO3kOithkwOkqODFh2FoCBgUObq5rmWkEWNChml6vY+xLg0KMdegEOycVm8NtKL2xJm314yloVBTD3tcikavtoeaBvTidZweNyREpyL3aOhxUrpOqddO01+f2DgLdW5QIUiUw07a5+ciMJCuzho1hUHJwaQyBa0ifVqIErONvjMxYKF+PbvPD3JGaM729GIRk3DfdL3qui4QlzlW93kfJo5jJe+wXmFv2o+435krQE56MAEYnhKHflV1Qf0zPjBdnZST/5+uqr0hBuSuenoBU93bwQWpJfFVpjNJ+0Ymv580oP37aJnVfzLhXChS6diQS7RGc6NYeBDmvXM5+mguI/2Gub9P8wuN08sq1FdEwLGegoxGIySFm/dG5D9e9kBHWJUlz270/Rl2OghQD9XPaM9uQ0hbGQgy/EW1WKMgZ/lAcPzDmM/+c6GtGVjg/mXTW92K7L4TMBsC/IGU5OEFn7/y8LROGatmZsSLGRNeIR2aO4X1/TWAfLqHD1wQzmw1Dw1CgABAzymD4mioETuF2OtzLBOLRpufsJFRRLLRWcBmHkbQIXHpKPsMchqZ6fhOjw4+7c5iAisrCOgNPrKqFxRipWLe5SuWYj60uE8EijLxD/JO7tks9aV8/vIsK09kLdy8x3QqfqJ73OHcLrMLdnAi2b0HVfYfiNavJKwMo0eYtpVwLPsPhdrov4Y38Iw5ONuObmsdW8gbOOIDUf8fpooM8KiYwQrIVYqW2r28RYnT/ceFgXQyC8vf1+35t5jSGzUuPL4zMNkHqQthIsdDebjK0DB8vnACfoeOkR1AT38MT9zRzn/sASAtia3apIG+4 L1NpgYWi HxvsUbLuDcpzUf63TSX484DcL8Vp6lZ61D1EXgToRL6+34w+Fc9AmPvjgGo4r7woygTiZzvz1whO/O3Hv9xIlqOSIRBPDlnoctRhAlb7UhiIRQGhV6BEBA4hdXb0KQdQJaDkO8RbQycQuRGu5Gy5A+O7bQDc6j96MVwY6LNRLMlyJaBe8KZ964gIzDBgi8gi5yj6JPiFq/lX8BlHyhaYhKKu0jvC8S33xCDSjlNRXyDVa29Zuz+x/DSG2KUM4+6ZJVZH92rP7/tWszY0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Nov 8, 2024 at 8:13=E2=80=AFAM Qi Zheng wrote: > On 2024/11/8 07:35, Jann Horn wrote: > > On Thu, Oct 31, 2024 at 9:14=E2=80=AFAM Qi Zheng wrote: > >> As a first step, this commit aims to synchronously free the empty PTE > >> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PT= E > >> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclu= de > >> cases other than madvise(MADV_DONTNEED). > >> > >> Once an empty PTE is detected, we first try to hold the pmd lock withi= n > >> the pte lock. If successful, we clear the pmd entry directly (fast pat= h). > >> Otherwise, we wait until the pte lock is released, then re-hold the pm= d > >> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-de= tect > >> whether the PTE page is empty and free it (slow path). > > > > How does this interact with move_pages_pte()? I am looking at your > > series applied on top of next-20241106, and it looks to me like > > move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the > > PMD entry can't change. You can clearly see this assumption at the > > WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I > > In move_pages_pte(), the following conditions may indeed be triggered: > > /* Sanity checks before the operation */ > if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*sr= c_pmd)) || > WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) || > WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) { > err =3D -EINVAL; > goto out; > } > > But maybe we can just remove the WARN_ON_ONCE(), because... > > > think for example move_present_pte() can end up moving a present PTE > > into a page table that has already been scheduled for RCU freeing. > > ...this situation is impossible to happen. Before performing move > operation, the pte_same() check will be performed after holding the > pte lock, which can ensure that the PTE page is stable: > > CPU 0 CPU 1 > > zap_pte_range > > orig_src_pte =3D ptep_get(src_pte); > > pmd_lock > pte_lock > check if all PTEs are pte_none() > --> clear pmd entry > unlock pte > unlock pmd > > src_pte_lock > pte_same(orig_src_pte, ptep_get(src_pte)) > --> return false and will skip the move op Yes, that works for the source PTE. But what about the destination? Operations on the destination PTE in move_pages_pte() are, when moving a normal present source PTE pointing to an order-0 page, and assuming that the optimistic folio_trylock(src_folio) and anon_vma_trylock_write(src_anon_vma) succeed: dst_pte =3D pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval, &dst_ptl) [check that dst_pte is non-NULL] some racy WARN_ON_ONCE() checks spin_lock(dst_ptl); orig_dst_pte =3D ptep_get(dst_pte); spin_unlock(dst_ptl); [bail if orig_dst_pte isn't none] double_pt_lock(dst_ptl, src_ptl) [check pte_same(ptep_get(dst_pte), orig_dst_pte)] and then we're past the point of no return. Note that there is a pte_same() check against orig_dst_pte, but pte_none(orig_dst_pte) is intentionally pte_none(), so the pte_same() check does not guarantee that the destination page table is still linked in. > >> diff --git a/mm/memory.c b/mm/memory.c > >> index 002aa4f454fa0..c4a8c18fbcfd7 100644 > >> --- a/mm/memory.c > >> +++ b/mm/memory.c > >> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, = struct vm_area_struct *src_vma) > >> static inline bool should_zap_cows(struct zap_details *details) > >> { > >> /* By default, zap all pages */ > >> - if (!details) > >> + if (!details || details->reclaim_pt) > >> return true; > >> > >> /* Or, we zap COWed pages only if the caller wants to */ > > > > This looks hacky - when we have a "details" object, its ->even_cows > > member is supposed to indicate whether COW pages should be zapped. So > > please instead set .even_cows=3Dtrue in madvise_dontneed_single_vma(). > > But the details->reclaim_pt should continue to be set, right? Because > we need to use .reclaim_pt to indicate whether the empty PTE page > should be reclaimed. Yeah, you should set both.