From: James Houghton
Date: Thu, 25 Jul 2024 11:29:49 -0700
Subject: Re: [PATCH v3 8/8] mm/mprotect: fix dax pud handlings
To: Peter Xu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Jiang,
        Rik van Riel, Dave Hansen, Michael Ellerman,
        linuxppc-dev@lists.ozlabs.org, Matthew Wilcox, Rick P Edgecombe,
        Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
        Christophe Leroy, Huang Ying, "Kirill A. Shutemov",
        "Aneesh Kumar K.V", Dan Williams, Thomas Gleixner, Hugh Dickins,
        x86@kernel.org, Nicholas Piggin, Vlastimil Babka, Ingo Molnar
In-Reply-To: <20240715192142.3241557-9-peterx@redhat.com>
References: <20240715192142.3241557-1-peterx@redhat.com> <20240715192142.3241557-9-peterx@redhat.com>
V" , Dan Williams , Thomas Gleixner , Hugh Dickins , x86@kernel.org, Nicholas Piggin , Vlastimil Babka , Ingo Molnar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 623C7100025 X-Stat-Signature: bxkbzgfe97s43eh3ic9akpn6z5o8rq7a X-Rspam-User: X-HE-Tag: 1721932227-295125 X-HE-Meta: U2FsdGVkX18harDM8xJRJ4CKUIswoFLWlw8AVAItaxxT6eEuM9yKJa6TYewTkwx7kQ9FNJmsMP6qMiS727LFVZvB9gtTG4wMYagVzm403da0XAWV4XngSccXU7q/phVjEB089L671xk4O9TiJjBzW6CKdVl3103YvDNIBkzBku37Uh82qZxlZCGhK8pqTIDcxzXhOVtyBGocvNibgaK+F8atKDiCTcVGTDRJ2zTa9aUroFFucayEoJT7zyF4LOBS996wqfE8kJdDWjhieVfZ37SuCHivStqDvTpUTu3dbbOU7wAn7HM5yAELloUUvvdn4kHZwZzSH8+4BkeqVjX/pmzalJu+BqgcASIBB81Hi0e7xgWcv1tthEOlUfThoZOB/38khKuwbIH+KlAR0PMjL2Aq+FH32wmhWVCRNaxkRKftc1DlZtuT1qcIC88pA5rI75hz9zehNU0rvgDCMkS2rzjj1CpsJk2Hyv4AEV46TD4X6BykThyVk9qgWILPX/pv0fZfabO3Dupkvqu4FKZ/ij5qoC/GyrGMlgnhBTOEOCURRemqnOjntge1/G7VE8gWF5cL+s2Q2W18raX8FW3PgaJNnz0c42Y54i/wZAcwy+MBmA1QjSP//ooFhQqfJfK3CgIL0V9v5pBCtP6GslgIfVdd/meuAcpY/R/a4XRY3BGg07aoAhTtnok5dscRWC9gTtqUgVdKmwcp+Yywbx0Eb4lSmb7QzUWC1lHwOXCXvqdrAbevY0yJ94gOZfVNtf2foYvKKXTLaRInXjdGunej7BWiG7+eh7HUbagnUPVj2R9mnSO6hMuxedfWD9vIigvQTFEQhmTByfi+Alt0jIIDi6DXtLlm6y7xpvj9kXq9x+njMrTxjsnPef4YGVTlZhXSR7gVmHFir8PGY904M2DAQPpDk+wc5nArXw08buF/4ce2G9pjGvcNX6vHMy+TUo1i0jokDFQOkbAFgeUZ5ZJ BF4/deK1 wjvQQX3IbltXLrd1rpmyVOnVpaF8TV4uDd7qC4eH6TOGNl5rke08nSLXckW0/UqqBuYeTJBlVCCqcZ+O/hQfl0Qhf9d8FbT9xWWjX5xjx4qTJ5/+dSq3MOLfOs/pXqZpxFxgOzAk8/j273js9/sY112CQtYv40Th+MRWa1CqHibI8xmHBzZyBKnFYcWc+S/aWrZ6o7Xj/bEolqCM5KBTi3PvoN4awosIVpxHWWY/cmsjRlMbmKWXXD0J8JLGxZdnBzhlkyQvtx7G9ajzt0hUWbaYLobHGrvmXTAGt91kj+y533ALiOwGrUJuwEV/fRGmMf2l6YFCsGegFUOnAtfD2wFO3kba8U8eXK4gZ+pQRSQHoXBA+8AjoqytSMqByqibRhprhoIgRx4bOZrodI6PinPt79SuiTyv/uAmFewqsKcPf0jp6wwYR8ILDTxvJzWSCWrEQ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Jul 15, 2024 at 12:22=E2=80=AFPM Peter Xu wrote= : > > This is only relevant to the two archs that support PUD dax, aka, x86_64 > and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not > count in this case. > > DAX have had PUD mappings for years, but change protection path never > worked. When the path is triggered in any form (a simple test program > would be: call mprotect() on a 1G dev_dax mapping), the kernel will repor= t > "bad pud". This patch should fix that. > > The new change_huge_pud() tries to keep everything simple. For example, = it > doesn't optimize write bit as that will need even more PUD helpers. It's > not too bad anyway to have one more write fault in the worst case once fo= r > 1G range; may be a bigger thing for each PAGE_SIZE, though. Neither does > it support userfault-wp bits, as there isn't such PUD mappings that is > supported; file mappings always need a split there. > > The same to TLB shootdown: the pmd path (which was for x86 only) has the > trick of using _ad() version of pmdp_invalidate*() which can avoid one > redundant TLB, but let's also leave that for later. Again, the larger th= e > mapping, the smaller of such effect. > > Another thing worth mention is this path needs to be careful on handling > "retry" event for change_huge_pud() (where it can return 0): it isn't lik= e > change_huge_pmd(), as the pmd version is safe with all conditions handled > in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock(). 
> In short, change_pte_range() is simply smarter than change_pmd_range()
> now after the shmem thp collapse rework. For that reason,
> change_pud_range() will need proper retry if it races with something
> else when a huge PUD changed from under us.
>
> Cc: Dan Williams
> Cc: Matthew Wilcox
> Cc: Dave Jiang
> Cc: Hugh Dickins
> Cc: Kirill A. Shutemov
> Cc: Vlastimil Babka
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: Borislav Petkov
> Cc: Dave Hansen
> Cc: Michael Ellerman
> Cc: Aneesh Kumar K.V
> Cc: Oscar Salvador
> Cc: x86@kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
> Signed-off-by: Peter Xu
> ---
>  include/linux/huge_mm.h | 24 +++++++++++++++++++
>  mm/huge_memory.c        | 52 +++++++++++++++++++++++++++++++++++++++++
>  mm/mprotect.c           | 34 ++++++++++++++++++++++-----
>  3 files changed, 104 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index cff002be83eb..6e742680590a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -336,6 +336,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>                 unsigned long address);
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +                   pud_t *pudp, unsigned long addr, pgprot_t newprot,
> +                   unsigned long cp_flags);
> +#else
> +static inline int
> +change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +               pud_t *pudp, unsigned long addr, pgprot_t newprot,
> +               unsigned long cp_flags) { return 0; }
> +#endif
> +
>  #define split_huge_pud(__vma, __pud, __address)                        \
>         do {                                                            \
>                 pud_t *____pud = (__pud);                               \
> @@ -579,6 +590,19 @@ static inline int next_order(unsigned long *orders, int prev)
>  {
>         return 0;
>  }
> +
> +static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> +                                   unsigned long address)
> +{
> +}
> +
> +static inline int change_huge_pud(struct mmu_gather *tlb,
> +                                 struct vm_area_struct *vma, pud_t *pudp,
> +                                 unsigned long addr, pgprot_t newprot,
> +                                 unsigned long cp_flags)
> +{
> +       return 0;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>  static inline int split_folio_to_list_to_order(struct folio *folio,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c10247bef08a..9a00c5955c0c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2112,6 +2112,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>         return ret;
>  }
>
> +/*
> + * Returns:
> + *
> + * - 0: if pud leaf changed from under us
> + * - 1: if pud can be skipped
> + * - HPAGE_PUD_NR: if pud was successfully processed
> + */
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +                   pud_t *pudp, unsigned long addr, pgprot_t newprot,
> +                   unsigned long cp_flags)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pud_t oldpud, entry;
> +       spinlock_t *ptl;
> +
> +       tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
> +
> +       /* NUMA balancing doesn't apply to dax */
> +       if (cp_flags & MM_CP_PROT_NUMA)
> +               return 1;
> +
> +       /*
> +        * Huge entries on userfault-wp only works with anonymous, while we
> +        * don't have anonymous PUDs yet.
> +        */
> +       if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
> +               return 1;
> +
> +       ptl = __pud_trans_huge_lock(pudp, vma);
> +       if (!ptl)
> +               return 0;
> +
> +       /*
> +        * Can't clear PUD or it can race with concurrent zapping. See
> +        * change_huge_pmd().
> +        */
> +       oldpud = pudp_invalidate(vma, addr, pudp);
> +       entry = pud_modify(oldpud, newprot);
> +       set_pud_at(mm, addr, pudp, entry);
> +       tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
> +
> +       spin_unlock(ptl);
> +       return HPAGE_PUD_NR;
> +}
> +#endif
> +
>  #ifdef CONFIG_USERFAULTFD
>  /*
>   * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
> @@ -2342,6 +2389,11 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>         spin_unlock(ptl);
>         mmu_notifier_invalidate_range_end(&range);
>  }
> +#else
> +void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> +                     unsigned long address)
> +{
> +}
>  #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
>  static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 2a81060b603d..694f13b83864 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -425,31 +425,53 @@ static inline long change_pud_range(struct mmu_gather *tlb,
>                 unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>  {
>         struct mmu_notifier_range range;
> -       pud_t *pud;
> +       pud_t *pudp, pud;
>         unsigned long next;
>         long pages = 0, ret;
>
>         range.start = 0;
>
> -       pud = pud_offset(p4d, addr);
> +       pudp = pud_offset(p4d, addr);
>         do {
> +again:
>                 next = pud_addr_end(addr, end);
> -               ret = change_prepare(vma, pud, pmd, addr, cp_flags);
> +               ret = change_prepare(vma, pudp, pmd, addr, cp_flags);
>                 if (ret) {
>                         pages = ret;
>                         break;
>                 }
> -               if (pud_none_or_clear_bad(pud))
> +
> +               pud = READ_ONCE(*pudp);
> +               if (pud_none(pud))
>                         continue;
> +
>                 if (!range.start) {
>                         mmu_notifier_range_init(&range,
>                                                 MMU_NOTIFY_PROTECTION_VMA, 0,
>                                                 vma->vm_mm, addr, end);
>                         mmu_notifier_invalidate_range_start(&range);
>                 }
> -               pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
> +
> +               if (pud_leaf(pud)) {
> +                       if ((next - addr != PUD_SIZE) ||
> +                           pgtable_split_needed(vma, cp_flags)) {
> +                               __split_huge_pud(vma, pudp, addr);
> +                               goto again;

IIUC, most of the time we're just going to end up clearing the PUD in
this case: __split_huge_pud() will just clear it, then we goto again and
`continue` to the next pudp. Is that ok? (I think it's ok as long as we
never map an anonymous page with a PUD, and uffd-wp is not usable with
non-hugetlb PUD mappings of user memory (which I think is only DAX?). So
it seems ok today...?)

Also, does the comment in pgtable_split_needed() need updating?

Somewhat related question: change_huge_pmd() is very careful not to clear
the PMD before writing the new value. Yet change_pmd_range(), when it
calls into __split_huge_pmd(), will totally clear the PMD and then
populate the PTEs underneath (in some cases at least), seemingly
reintroducing the MADV_DONTNEED concern. But your PUD version, because it
never re-populates the PUD (or the PMDs/PTEs underneath), does not have
this issue. WDYT?
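
To make sure I'm describing the same thing, here is roughly the ordering
difference I have in mind (heavily simplified pseudocode of the two
orderings, not the actual kernel code; locking, dirty/access bits and TLB
details omitted):

        /*
         * change_huge_pmd()-style update: the entry is replaced via
         * pmdp_invalidate(), so it is never transiently pmd_none(), and a
         * racing zap (e.g. MADV_DONTNEED) always sees something mapped.
         */
        oldpmd = pmdp_invalidate(vma, addr, pmdp);  /* still a leaf, just invalid */
        entry = pmd_modify(oldpmd, newprot);
        set_pmd_at(mm, addr, pmdp, entry);

        /*
         * Split-style update (the repopulating cases of __split_huge_pmd(),
         * as I read them): the PMD is cleared first and the lower-level
         * table is installed afterwards, so there is a pmd_none() window,
         * which is the window I'm asking about.
         */
        oldpmd = pmdp_huge_clear_flush(vma, addr, pmdp);  /* pmd_none() window */
        /* ... fill in the pre-deposited page table from oldpmd ... */
        pmd_populate(mm, pmdp, pgtable);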
Thanks for this series!

> +                       } else {
> +                               ret = change_huge_pud(tlb, vma, pudp,
> +                                                     addr, newprot, cp_flags);
> +                               if (ret == 0)
> +                                       goto again;
> +                               /* huge pud was handled */
> +                               if (ret == HPAGE_PUD_NR)
> +                                       pages += HPAGE_PUD_NR;
> +                               continue;
> +                       }
> +               }
> +
> +               pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot,
>                                           cp_flags);
> -       } while (pud++, addr = next, addr != end);
> +       } while (pudp++, addr = next, addr != end);
>
>         if (range.start)
>                 mmu_notifier_invalidate_range_end(&range);
> --
> 2.45.0
>
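
Going back to the reproducer mentioned at the top of the commit message
("call mprotect() on a 1G dev_dax mapping"), in case it is useful for
anyone else testing this, a minimal userspace sketch along those lines
could look like the below. This is untested and makes a couple of
assumptions: a dev_dax device at /dev/dax0.0 (hypothetical path) that is
at least 1G in size and configured with 1G alignment, so that the mapping
really ends up backed by a PUD leaf.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1UL << 30;  /* one PUD-sized (1G) mapping */
        int fd = open("/dev/dax0.0", O_RDWR);  /* assumed device path */

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* dev_dax requires a shared mapping; the driver picks an aligned address */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Fault the mapping in so a huge PUD entry is actually installed... */
        *(volatile char *)p = 1;

        /* ...then change protection; this is the path that used to report "bad pud". */
        if (mprotect(p, len, PROT_READ))
                perror("mprotect");

        munmap(p, len);
        close(fd);
        return 0;
}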