From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador, stable@vger.kernel.org
Subject: Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
Date: Fri, 26 Jul 2024 10:33:14 +0800
Message-ID: <0067dfe6-b9a6-4e98-9eef-7219299bfe58@linux.alibaba.com>
In-Reply-To: <20240725183955.2268884-3-david@redhat.com>
References: <20240725183955.2268884-1-david@redhat.com> <20240725183955.2268884-3-david@redhat.com>
On 2024/7/26 02:39, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for a future with
> less hugetlb-specific page table walking code in the codebase. Turns out
> that we missed one page table locking detail: page table locking for
> hugetlb folios that are not mapped using a single PMD/PUD.
>
> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with a 4 KiB base page size). GUP, as it walks
> the page tables, will perform a pte_offset_map_lock() to grab the PTE
> table lock.
>
> However, hugetlb code that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> Let's make huge_pte_lockptr() effectively use the same PT locks as any
> core-mm page table walker would.

Thanks for raising the issue again. I remember fixing this issue 2 years
ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
CONT-PTE/PMD size hugetlb page"), but it seems to have been broken again.

> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to
> take the PMD table lock, core-mm would grab the PTE table lock of one
> of the two PTE page tables. In such corner cases, we have to make sure
> that both locks match, which is (fortunately!) currently guaranteed for
> 8xx as it does not support SMP.
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Cc:
> Signed-off-by: David Hildenbrand
> ---
>  include/linux/hugetlb.h | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c9bf68c239a01..da800e56fe590 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -944,10 +944,29 @@ static inline bool htlb_allow_alloc_fallback(int reason)
>  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>  					   struct mm_struct *mm, pte_t *pte)
>  {
> -	if (huge_page_size(h) == PMD_SIZE)
> +	VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
> +	VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
> +
> +	/*
> +	 * hugetlb must use the exact same PT locks as core-mm page table
> +	 * walkers would. When modifying a PTE table, hugetlb must take the
> +	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
> +	 * PT lock etc.
> +	 *
> +	 * The expectation is that any hugetlb folio smaller than a PMD is
> +	 * always mapped into a single PTE table and that any hugetlb folio
> +	 * smaller than a PUD (but at least as big as a PMD) is always mapped
> +	 * into a single PMD table.

arm64 also supports cont-PMD size hugetlb, which is 32 MiB with a 4 KiB
base page size. This means the PT locks for 32 MiB hugetlb may race
again, as we currently only hold one PMD lock for the several PMD
entries of a cont-PMD size hugetlb.

> +	 *
> +	 * If that does not hold for an architecture, then that architecture
> +	 * must disable split PT locks such that all *_lockptr() functions
> +	 * will give us the same result: the per-MM PT lock.
> +	 */
> +	if (huge_page_size(h) < PMD_SIZE)
> +		return pte_lockptr(mm, pte);
> +	else if (huge_page_size(h) < PUD_SIZE)
>  		return pmd_lockptr(mm, (pmd_t *) pte);

IIUC, as I said above, this change doesn't fix the inconsistent locking
for cont-PMD size hugetlb in GUP, and it will also break the locking rule
for unmapping/migrating a cont-PMD size hugetlb, which previously used
mm->page_table_lock.

> -	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> -	return &mm->page_table_lock;
> +	return pud_lockptr(mm, (pud_t *) pte);
>  }
>
>  #ifndef hugepages_supported