From: Jann Horn <jannh@google.com>
Date: Thu, 7 Nov 2024 18:57:16 +0100
Subject: Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
To: Qi Zheng
Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org, zokeefe@google.com, rientjes@google.com, peterx@redhat.com, catalin.marinas@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org
In-Reply-To: <8c27c1f8-9573-4777-8397-929a15e67f60@bytedance.com>
References: <4c3f4aa29f38c013c4529a43bce846a3edd31523.1730360798.git.zhengqi.arch@bytedance.com> <8c27c1f8-9573-4777-8397-929a15e67f60@bytedance.com>

On Thu, Nov 7, 2024 at 8:54 AM Qi Zheng wrote:
> On 2024/11/7 05:48, Jann Horn wrote:
> > On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng wrote:
> >> In retract_page_tables(), we may modify the pmd entry after acquiring the
> >> pml and ptl, so we should also check whether the pmd entry is stable.
> >
> > Why does taking the PMD lock not guarantee that the PMD entry is stable?
>
> Because the pmd entry may have changed before taking the pmd lock, so we
> need to recheck it after taking the pmd or pte lock.

You mean it could have changed from the value we obtained from
find_pmd_or_thp_or_none(mm, addr, &pmd)? I don't think that matters,
though.

> >> Use pte_offset_map_rw_nolock() + pmd_same() to do this; then we can
> >> also remove the call to pte_lockptr().
> >>
> >> Signed-off-by: Qi Zheng
> >> ---
> >>  mm/khugepaged.c | 17 ++++++++++++++++-
> >>  1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 6f8d46d107b4b..6d76dde64f5fb 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >>                 spinlock_t *pml;
> >>                 spinlock_t *ptl;
> >>                 bool skipped_uffd = false;
> >> +               pte_t *pte;
> >>
> >>                 /*
> >>                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> >> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >>                                         addr, addr + HPAGE_PMD_SIZE);
> >>                 mmu_notifier_invalidate_range_start(&range);
> >>
> >> +               pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
> >> +               if (!pte) {
> >> +                       mmu_notifier_invalidate_range_end(&range);
> >> +                       continue;
> >> +               }
> >> +
> >>                 pml = pmd_lock(mm, pmd);
> >
> > I don't understand why you're mapping the page table before locking
> > the PMD. Doesn't that just mean we need more error checking
> > afterwards?
>
> The main purpose is to obtain the pmdval. If we don't use
> pte_offset_map_rw_nolock(), we need to recheck the pmd entry before
> calling pte_lockptr(), like this:
>
>     pmdval = pmdp_get_lockless(pmd);
>     pmd_lock
>     recheck pmdval
>     pte_lockptr(mm, pmd)
>
> Otherwise, the system may crash. Consider the following race:
>
> CPU 0                              CPU 1
>
> zap_pte_range
> --> clear pmd entry
>     free pte page (by RCU)
>
>                                    retract_page_tables
>                                    --> pmd_lock
>                                        pte_lockptr(mm, pmd) <-- BOOM!!
>
> So maybe calling pte_offset_map_rw_nolock() is more convenient.
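
I agree that the PMD entry has to be revalidated once we hold the
lock; the underlying idiom is just "lockless snapshot, take lock,
recheck". For illustration only (not kernel code!), here is a minimal
userspace sketch of that revalidation idiom -- the struct and function
names here are made up, a pthread mutex stands in for pmd_lock/ptl,
and the RCU-deferred freeing is deliberately left out:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct table {
        pthread_mutex_t lock;
        _Atomic(int *) entry;           /* stands in for the pmd entry */
};

static bool retract(struct table *t)
{
        /* 1. Lockless snapshot, analogous to pmdp_get_lockless(). */
        int *snap = atomic_load(&t->entry);

        if (!snap)
                return false;

        pthread_mutex_lock(&t->lock);
        /*
         * 2. Revalidate under the lock, analogous to pmd_same() or a
         * check_pmd_state()-style helper: a concurrent "zap" may have
         * cleared the entry and freed what the snapshot points to.
         */
        if (atomic_load(&t->entry) != snap) {
                pthread_mutex_unlock(&t->lock);
                return false;           /* changed under us: bail out */
        }
        /* 3. Only now is it safe to operate on the entry. */
        atomic_store(&t->entry, NULL);
        pthread_mutex_unlock(&t->lock);
        return true;
}

int main(void)
{
        struct table t = { .lock = PTHREAD_MUTEX_INITIALIZER };
        int val = 42;

        atomic_store(&t.entry, &val);
        printf("retract: %s\n", retract(&t) ? "succeeded" : "raced");
        return 0;
}

The kernel-side equivalent of step 2 is the recheck under pmd_lock()
whose shape we're debating; putting it in a helper keeps the ordering
obvious, which is what the refactoring below tries to do.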
How about refactoring find_pmd_or_thp_or_none() like this, by moving
the checks of the PMD entry value into a separate helper:

-static int find_pmd_or_thp_or_none(struct mm_struct *mm,
-                                  unsigned long address,
-                                  pmd_t **pmd)
+static int check_pmd_state(pmd_t *pmd)
 {
-       pmd_t pmde;
+       pmd_t pmde = pmdp_get_lockless(*pmd);

-       *pmd = mm_find_pmd(mm, address);
-       if (!*pmd)
-               return SCAN_PMD_NULL;
-
-       pmde = pmdp_get_lockless(*pmd);
        if (pmd_none(pmde))
                return SCAN_PMD_NONE;
        if (!pmd_present(pmde))
                return SCAN_PMD_NULL;
        if (pmd_trans_huge(pmde))
                return SCAN_PMD_MAPPED;
        if (pmd_devmap(pmde))
                return SCAN_PMD_NULL;
        if (pmd_bad(pmde))
                return SCAN_PMD_NULL;
        return SCAN_SUCCEED;
 }
+
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+                                  unsigned long address,
+                                  pmd_t **pmd)
+{
+       *pmd = mm_find_pmd(mm, address);
+       if (!*pmd)
+               return SCAN_PMD_NULL;
+
+       return check_pmd_state(*pmd);
+}

And simplifying retract_page_tables() a little bit like this:

        i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
                struct mmu_notifier_range range;
                struct mm_struct *mm;
                unsigned long addr;
                pmd_t *pmd, pgt_pmd;
                spinlock_t *pml;
                spinlock_t *ptl;
-               bool skipped_uffd = false;
+               bool success = false;

                /*
                 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
                 * got written to. These VMAs are likely not worth removing
                 * page tables from, as PMD-mapping is likely to be split
                 * later.
                 */
                if (READ_ONCE(vma->anon_vma))
                        continue;

                addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
@@ -1763,34 +1767,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
                /*
                 * Huge page lock is still held, so normally the page table
                 * must remain empty; and we have already skipped anon_vma
                 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
                 * held, it is still possible for a racing userfaultfd_ioctl()
                 * to have inserted ptes or markers.  Now that we hold ptlock,
                 * repeating the anon_vma check protects from one category,
                 * and repeating the userfaultfd_wp() check from another.
                 */
-               if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
-                       skipped_uffd = true;
-               } else {
+               if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
                        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
                        pmdp_get_lockless_sync();
+                       success = true;
                }

                if (ptl != pml)
                        spin_unlock(ptl);
+drop_pml:
                spin_unlock(pml);

                mmu_notifier_invalidate_range_end(&range);

-               if (!skipped_uffd) {
+               if (success) {
                        mm_dec_nr_ptes(mm);
                        page_table_check_pte_clear_range(mm, addr, pgt_pmd);
                        pte_free_defer(mm, pmd_pgtable(pgt_pmd));
                }
        }
        i_mmap_unlock_read(mapping);

And then, instead of your patch, I think you can just do this?

@@ -1754,20 +1754,22 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
                 */
                if (userfaultfd_wp(vma))
                        continue;

                /* PTEs were notified when unmapped; but now for the PMD? */
                mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
                                        addr, addr + HPAGE_PMD_SIZE);
                mmu_notifier_invalidate_range_start(&range);

                pml = pmd_lock(mm, pmd);
+               if (check_pmd_state(pmd) != SCAN_SUCCEED)
+                       goto drop_pml;
                ptl = pte_lockptr(mm, pmd);
                if (ptl != pml)
                        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

                /*
                 * Huge page lock is still held, so normally the page table
                 * must remain empty; and we have already skipped anon_vma
                 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
                 * held, it is still possible for a racing userfaultfd_ioctl()
                 * to have inserted ptes or markers.  Now that we hold ptlock,