Message-ID: <0eb66533-445f-45d5-8f68-0281e4ed017d@linux.alibaba.com>
Date: Fri, 2 Aug 2024 11:56:16 +0800
From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: Re: [PATCH v4] mm/hugetlb: fix hugetlb vs. core-mm PT locking
To: David Hildenbrand, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Peter Xu, stable@vger.kernel.org, Oscar Salvador,
 Muchun Song
In-Reply-To: <20240801204748.99107-1-david@redhat.com>
References: <20240801204748.99107-1-david@redhat.com>

On 2024/8/2 04:47, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table locking
> for hugetlb folios that are not mapped using a single PMD/PUD.
>
> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with a 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
>
> However, hugetlb code that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
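
(A side note for readers following along: the snippet below is my own
illustration of the mismatch described above, not code from the patch.
With USE_SPLIT_PTE_PTLOCKS, the two paths end up on different spinlocks
for the same PTE table:)

#include <linux/mm.h>

/*
 * Reviewer illustration only, not from the patch: the pre-fix lock
 * disagreement for a hugetlb folio smaller than PMD_SIZE (e.g., a
 * 64 KiB cont-PTE folio on arm64), assuming USE_SPLIT_PTE_PTLOCKS.
 */
static void sketch_lock_mismatch(struct mm_struct *mm, pmd_t *pmd)
{
        /* GUP side: pte_offset_map_lock() takes the split PTE PT lock. */
        spinlock_t *gup_ptl = pte_lockptr(mm, pmd);

        /*
         * hugetlb side: huge_pte_lockptr() fell back to the per-MM lock
         * for any size other than PMD_SIZE.
         */
        spinlock_t *hugetlb_ptl = &mm->page_table_lock;

        /* With split PTE locks these differ: no mutual exclusion. */
        VM_WARN_ON(gup_ptl != hugetlb_ptl);
}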
>
> This issue can be reproduced [1], for example triggering:
>
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: [...]
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522]  try_grab_folio+0x11c/0x188
> [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527]  follow_page_mask+0x1a0/0x2b8
> [ 3106.058118]  __get_user_pages+0xf0/0x348
> [ 3106.060647]  faultin_page_range+0xb0/0x360
> [ 3106.063651]  do_madvise+0x340/0x598
>
> Let's make huge_pte_lockptr() effectively use the same PT locks as any
> core-mm page table walker would. Add ptep_lockptr() to obtain the PTE
> page table lock using a pte pointer -- unfortunately we cannot convert
> pte_lockptr() because virt_to_page() doesn't work with kmap'ed page
> tables we can have with CONFIG_HIGHPTE.
>
> Handle CONFIG_PGTABLE_LEVELS correctly by checking in reverse order,
> such that e.g., CONFIG_PGTABLE_LEVELS==2 with
> PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE works as expected. Document
> why that works.
>
> There is one ugly case: powerpc 8xx, where we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to take
> the PMD table lock, core-mm would grab the PTE table lock of one of the
> two PTE page tables. In such corner cases, we have to make sure that both
> locks match, which is (fortunately!) currently guaranteed for 8xx as it
> does not support SMP and consequently doesn't use split PT locks.
>
> [1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Acked-by: Peter Xu
> Cc: stable@vger.kernel.org
> Cc: Peter Xu
> Cc: Oscar Salvador
> Cc: Muchun Song
> Cc: Baolin Wang
> Signed-off-by: David Hildenbrand

I tried your reproducer on my ARM64 machine, and this patch fixes the
problem. Although I know nothing about HIGHPTE, the other parts look
good to me. So feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> ---
>
> @James, I dropped your RB.
>
> Retested on arm64 and x86-64. Cross-compiled on a bunch of others.
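
(Also, for anyone else reading along: the CONFIG_PGTABLE_LEVELS==2
collapse mentioned in the description took me a moment, so here is my
own sketch of it -- mine, not the patch's:)

#include <linux/mm.h>

/*
 * Reviewer sketch, not from the patch: with CONFIG_PGTABLE_LEVELS=2,
 * PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE, so a PMD-sized hugetlb
 * folio now takes the "size >= PUD_SIZE" branch and uses pud_lockptr(),
 * while core-mm uses pmd_lockptr(). Such configurations disable split
 * PMD locks, so both resolve to the per-MM lock and still match.
 */
static void sketch_level_collapse(struct mm_struct *mm, pud_t *pud,
                                  pmd_t *pmd)
{
        spinlock_t *hugetlb_ptl = pud_lockptr(mm, pud);
        spinlock_t *core_mm_ptl = pmd_lockptr(mm, pmd);

        /* Both are &mm->page_table_lock here. */
        VM_WARN_ON(hugetlb_ptl != core_mm_ptl);
}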
>
> v3 -> v4:
>  * Replace PTE pointer alignment by BUILD_BUG_ON()
>  * Simplify lock lookup by looking up in reverse
>  * Adjust comment and patch description
>
> v2 -> v3:
>  * Handle CONFIG_PGTABLE_LEVELS oddities as well as possible. It's a mess.
>    Remove the size >= P4D_SIZE check and simply default to the
>    &mm->page_table_lock.
>  * Align the PTE pointer to the start of the page table to handle PTE page
>    tables bigger than a single page (unclear if this could currently
>    trigger).
>  * Extend patch description
>
> v1 -> v2:
>  * Extend patch description
>  * Drop "mm: let pte_lockptr() consume a pte_t pointer"
>  * Introduce ptep_lockptr() in this patch
>
> ---
>  include/linux/hugetlb.h | 33 ++++++++++++++++++++++++++++++---
>  include/linux/mm.h      | 11 +++++++++++
>  2 files changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8e462205400d..ac3ea8596f93 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -938,10 +938,37 @@ static inline bool htlb_allow_alloc_fallback(int reason)
>  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>  					struct mm_struct *mm, pte_t *pte)
>  {
> -	if (huge_page_size(h) == PMD_SIZE)
> +	const unsigned long size = huge_page_size(h);
> +
> +	VM_WARN_ON(size == PAGE_SIZE);
> +
> +	/*
> +	 * hugetlb must use the exact same PT locks as core-mm page table
> +	 * walkers would. When modifying a PTE table, hugetlb must take the
> +	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
> +	 * PT lock etc.
> +	 *
> +	 * The expectation is that any hugetlb folio smaller than a PMD is
> +	 * always mapped into a single PTE table and that any hugetlb folio
> +	 * smaller than a PUD (but at least as big as a PMD) is always mapped
> +	 * into a single PMD table.
> +	 *
> +	 * If that does not hold for an architecture, then that architecture
> +	 * must disable split PT locks such that all *_lockptr() functions
> +	 * will give us the same result: the per-MM PT lock.
> +	 *
> +	 * Note that with e.g., CONFIG_PGTABLE_LEVELS=2 where
> +	 * PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE, we'd use pud_lockptr()
> +	 * and core-mm would use pmd_lockptr(). However, in such configurations
> +	 * split PMD locks are disabled -- they don't make sense on a single
> +	 * PGDIR page table -- and the end result is the same.
> +	 */
> +	if (size >= PUD_SIZE)
> +		return pud_lockptr(mm, (pud_t *) pte);
> +	else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
>  		return pmd_lockptr(mm, (pmd_t *) pte);
> -	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> -	return &mm->page_table_lock;
> +	/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
> +	return ptep_lockptr(mm, pte);
>  }
>
>  #ifndef hugepages_supported
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a890a1731c14..bd219ac9c026 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2869,6 +2869,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>  }
>
> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
> +	BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
> +	return ptlock_ptr(virt_to_ptdesc(pte));
> +}
> +
>  static inline bool ptlock_init(struct ptdesc *ptdesc)
>  {
>  	/*
> @@ -2893,6 +2900,10 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  {
>  	return &mm->page_table_lock;
>  }
> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	return &mm->page_table_lock;
> +}
>  static inline void ptlock_cache_init(void) {}
>  static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
>  static inline void ptlock_free(struct ptdesc *ptdesc) {}
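
(To close the loop on my testing above, this is how I read the fixed
lookup for the arm64 64 KiB case -- again my own sketch, not part of
the patch:)

#include <linux/hugetlb.h>
#include <linux/mm.h>

/*
 * Reviewer sketch, not part of the patch: for a 64 KiB hugetlb folio
 * on arm64 (4 KiB base pages, !CONFIG_HIGHPTE), huge_pte_lockptr()
 * now resolves via ptep_lockptr() to ptlock_ptr(virt_to_ptdesc(pte)).
 * The BUILD_BUG_ON()s guarantee the PTE table is not kmap'ed and fits
 * in one page, so the pte pointer lands in the very page whose ptdesc
 * carries the split PT lock -- the same lock pte_offset_map_lock()
 * takes on the GUP side.
 */
static void sketch_locks_agree(struct hstate *h, struct mm_struct *mm,
                               pmd_t *pmd, pte_t *ptep)
{
        VM_WARN_ON(huge_pte_lockptr(h, mm, ptep) != pte_lockptr(mm, pmd));
}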