From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BAF8FC531DC for ; Wed, 14 Aug 2024 17:25:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4250A6B0088; Wed, 14 Aug 2024 13:25:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3D4686B008C; Wed, 14 Aug 2024 13:25:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2C36B6B0092; Wed, 14 Aug 2024 13:25:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 0A9226B0088 for ; Wed, 14 Aug 2024 13:25:09 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id A6BFD1A1036 for ; Wed, 14 Aug 2024 17:25:08 +0000 (UTC) X-FDA: 82451526696.24.E0AB68A Received: from pegase2.c-s.fr (pegase2.c-s.fr [93.17.235.10]) by imf21.hostedemail.com (Postfix) with ESMTP id 769F71C0018 for ; Wed, 14 Aug 2024 17:25:05 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=csgroup.eu; spf=pass (imf21.hostedemail.com: domain of christophe.leroy@csgroup.eu designates 93.17.235.10 as permitted sender) smtp.mailfrom=christophe.leroy@csgroup.eu ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1723656252; a=rsa-sha256; cv=none; b=BzxFNit3NyvzdslYkR1X28zqDKhjEbohUJkgKbDVwHwjbOt97w81DByklRRlWkpYlIq2nu JVJnVmolhUvRqFpbrIFGm+oiJbT7Yb18LjewfwOq2BnWXgDa4XmkX7nUfNaoQYSQsiVMf8 GPSDIyaTeTSpVGp35QJVZEiGQyyh4Bg= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=csgroup.eu; spf=pass (imf21.hostedemail.com: domain of christophe.leroy@csgroup.eu designates 93.17.235.10 as permitted sender) smtp.mailfrom=christophe.leroy@csgroup.eu ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1723656252; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=gAmp0/k5peMyonr+6/Tme1Ua687ayK6pg44uiRpkocA=; b=LBb54C36HHUZExHokMSa7KBvlN4dohtR7uwU1G8bAib76hk2b8erG0lbbE85u1a2IxVWtF s46EiB53mZPNop6/asKkvjafiq2EC9j6x3ROU59WXsLt/e6/YWmnBqPxlma7RDFnhMFhnd tzv3Aigt1OP4JCqGtMeLTCKfOBG13N4= Received: from localhost (mailhub3.si.c-s.fr [172.26.127.67]) by localhost (Postfix) with ESMTP id 4WkZrb1fnlz9sRy; Wed, 14 Aug 2024 19:25:03 +0200 (CEST) X-Virus-Scanned: amavisd-new at c-s.fr Received: from pegase2.c-s.fr ([172.26.127.65]) by localhost (pegase2.c-s.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id jxUP_fGYFpiT; Wed, 14 Aug 2024 19:25:03 +0200 (CEST) Received: from messagerie.si.c-s.fr (messagerie.si.c-s.fr [192.168.25.192]) by pegase2.c-s.fr (Postfix) with ESMTP id 4WkZrb0QPfz9sRs; Wed, 14 Aug 2024 19:25:03 +0200 (CEST) Received: from localhost (localhost [127.0.0.1]) by messagerie.si.c-s.fr (Postfix) with ESMTP id F20568B775; Wed, 14 Aug 2024 19:25:02 +0200 (CEST) X-Virus-Scanned: amavisd-new at c-s.fr Received: from messagerie.si.c-s.fr ([127.0.0.1]) by localhost (messagerie.si.c-s.fr [127.0.0.1]) (amavisd-new, port 10023) with ESMTP id TDpg8Vpdl9oq; Wed, 14 Aug 2024 19:25:02 +0200 (CEST) Received: from [192.168.232.91] (unknown [192.168.232.91]) by messagerie.si.c-s.fr (Postfix) with ESMTP id 4CC9B8B764; Wed, 14 Aug 2024 19:25:02 +0200 (CEST) Message-ID: Date: Wed, 14 Aug 2024 19:25:02 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v3] mm/hugetlb: fix hugetlb vs. core-mm PT locking To: David Hildenbrand , Peter Xu Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, James Houghton , stable@vger.kernel.org, Oscar Salvador , Muchun Song , Baolin Wang , Michael Ellerman , Nicholas Piggin References: <20240731122103.382509-1-david@redhat.com> <2b0131cf-d066-44ba-96d9-a611448cbaf9@redhat.com> Content-Language: fr-FR From: Christophe Leroy In-Reply-To: <2b0131cf-d066-44ba-96d9-a611448cbaf9@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 769F71C0018 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: 99ja3pqgfuc5439wxtmg1t7ooerq6ekp X-HE-Tag: 1723656305-27379 X-HE-Meta: U2FsdGVkX19+jsNkLPGULSXjScrS61ZHP7IA1RBrJkM+jOY3swMR6/sJWnRAbpXLKktAFsSwiyb+p4sOgQfNkesDynh0/cOj8ogrxXCt4QYkBSW5Lc6cnnNhfShHvGbJ5IP0rGa9/QAZPeLTQyMBdQk2At138WUmK7uOs1TSmlL07SQD9Eza5q24CI9Tm1WwvvNaWfmEOriKjyATRWpW0coqYRgVn7+Co3bsrMF0ltvDZtYp0h1ElTeENFtpps5VzHNEgAAHwFWBCv/kNgzn3AMaFcMgDtqiNX3GcB46IkSdxykGHxYl2r6mjuLcGttIb9+xtQMceP0gt5FbxDIDX53sg3btCswxfgKq9xyacll7KKi+Yl4Uy5Q2kDouyDlSl3CGJI/jTmnz6KKEYHxZYYE1i7k/PCDwKG4mIEhzDeJeNhfamLNTe4h3wA4MmuY5qLRpleLyH7NuY2yLvgOVjhOmNo1Z/B+z/+TZyz5Fd9jwwkR+9nNKffV0Q3u1HxLa9uMLkkoSdz6C4vhgd0l0StYCbH/elah7soxfqGavF9zA7uNkB8ri0++i9wysW4T/Tok9+l38hYzUXLW9pSMz1vdvoXpiBu+MiME28SFuIoPv1VIY+7l1t8C0+jM9O+IJBPzC2pz2uBr1uoLKmAoeY3nZ+fy9Fw7gZQ7TnT2Zss1/isVAN5PIKyJ6hb48XuLk8FpLLDPG+vy/DNZILnJV2O27VQiS982M1nBZ6BJSoN39DF6BVZIBToayD66yaUlPPu58UT5Nqx/anXsc0hfHIb+EU6OV/lIzg86oqFvwRIGkn0PA/NhFS38hgLsVInDcxhKz/PnBYcTfUEDjwT4zs4VXekN6/R3mLHfoTmXDU8uj3g6j8WzHIJvaAPp/ePJN8GzfHwMJ/1yUAM9xztJCkPpER5ZxnhPc0msA5ewRM+Y9Jx/+olSBnE4Tdx6JHKPYuGf/X7isS7P0INkk0MD DZ1agQdP 2oIxV5VBmPnLf2HK0B4jJgYIdsGiFOrvyXfJ5CuRTXdCcZJ5A2Zgz2spo1w5fiZpy8iKOfxDk9jNAxVkNxyMZ9ILKGFcEg0TV74XnJ20A+0rLS2PhpotXNLeLMtxVDkBVJXn6fMnn60f4lj4TZ2D8bVWQDFDi6QaYnQYSi4eiXguQrP5U2brMIp/PnjN+9PM0REtrEOIA5Bz646itERXeGlr29zL7bhNYeA+/aHsSKcu4LqyaTwD6zqYsVD1+0U9EpSym+9IOzKE1iWH6CpqzLnmOQ5nz9WYvmEZiaw9+nX/BSHg4uXBboIFIhyvEa16w5a6N3EdvQrw5jau9ytiU5o29D/xEdTP9dvF+ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Le 31/07/2024 à 18:33, David Hildenbrand a écrit : > On 31.07.24 16:54, Peter Xu wrote: >> On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote: >>> We recently made GUP's common page table walking code to also walk >>> hugetlb >>> VMAs without most hugetlb special-casing, preparing for the future of >>> having less hugetlb-specific page table walking code in the codebase. >>> Turns out that we missed one page table locking detail: page table >>> locking >>> for hugetlb folios that are not mapped using a single PMD/PUD. >>> >>> Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB >>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the >>> page tables, will perform a pte_offset_map_lock() to grab the PTE table >>> lock. >>> >>> However, hugetlb that concurrently modifies these page tables would >>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the >>> locks would differ. Something similar can happen right now with hugetlb >>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS. >>> >>> This issue can be reproduced [1], for example triggering: >>> >>> [ 3105.936100] ------------[ cut here ]------------ >>> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 >>> try_grab_folio+0x11c/0x188 >>> [ 3105.944634] Modules linked in: [...] >>> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted >>> 6.10.0-64.eln141.aarch64 #1 >>> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS >>> edk2-20240524-4.fc40 05/24/2024 >>> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS >>> BTYPE=--) >>> [ 3105.991108] pc : try_grab_folio+0x11c/0x188 >>> [ 3105.994013] lr : follow_page_pte+0xd8/0x430 >>> [ 3105.996986] sp : ffff80008eafb8f0 >>> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: >>> 00f80001207cff43 >>> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: >>> ffff80008eafba48 >>> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: >>> ffff7a546c1aa978 >>> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: >>> 0000000000000001 >>> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: >>> 0000000000000000 >>> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: >>> 0000000000000000 >>> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : >>> ffffb854771b12f0 >>> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : >>> 0008000000000080 >>> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : >>> ffffffe8d481f000 >>> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : >>> 0000000000000000 >>> [ 3106.047957] Call trace: >>> [ 3106.049522]  try_grab_folio+0x11c/0x188 >>> [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0 >>> [ 3106.055527]  follow_page_mask+0x1a0/0x2b8 >>> [ 3106.058118]  __get_user_pages+0xf0/0x348 >>> [ 3106.060647]  faultin_page_range+0xb0/0x360 >>> [ 3106.063651]  do_madvise+0x340/0x598 >>> >>> Let's make huge_pte_lockptr() effectively use the same PT locks as any >>> core-mm page table walker would. Add ptep_lockptr() to obtain the PTE >>> page table lock using a pte pointer -- unfortunately we cannot convert >>> pte_lockptr() because virt_to_page() doesn't work with kmap'ed page >>> tables we can have with CONFIG_HIGHPTE. >>> >>> Take care of PTE tables possibly spanning multiple pages, and take >>> care of >>> CONFIG_PGTABLE_LEVELS complexity when e.g., PMD_SIZE == PUD_SIZE. For >>> example, with CONFIG_PGTABLE_LEVELS == 2, core-mm would detect >>> with hugepagesize==PMD_SIZE pmd_leaf() and use the pmd_lockptr(), which >>> would end up just mapping to the per-MM PT lock. >>> >>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb >>> folio being mapped using two PTE page tables.  While hugetlb wants to >>> take >>> the PMD table lock, core-mm would grab the PTE table lock of one of both >>> PTE page tables.  In such corner cases, we have to make sure that both >>> locks match, which is (fortunately!) currently guaranteed for 8xx as it >>> does not support SMP and consequently doesn't use split PT locks. >>> >>> [1] >>> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2F1bbfcc7f-f222-45a5-ac44-c5a1381c596d%40redhat.com%2F&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7Cf91a0b3cdcab46c7bd6108dcb17e9454%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C638580404425532305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=%2FQ4QFqbyThojISHACwzxCdtYbgwc4JsMIP%2Bmx4PneOk%3D&reserved=0 >>> >>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic >>> follow_page_mask code") >>> Reviewed-by: James Houghton >>> Cc: >>> Cc: Peter Xu >>> Cc: Oscar Salvador >>> Cc: Muchun Song >>> Cc: Baolin Wang >>> Signed-off-by: David Hildenbrand >> >> Nitpick: I wonder whether some of the lines can be simplified if we write >> it downwards from PUD, like, >> >> huge_pte_lockptr() >> { >>          if (size >= PUD_SIZE) >>             return pud_lockptr(...); >>          if (size >= PMD_SIZE) >>             return pmd_lockptr(...); >>          /* Sub-PMD only applies to !CONFIG_HIGHPTE, see >> pte_alloc_huge() */ >>          WARN_ON(IS_ENABLED(CONFIG_HIGHPTE)); >>          return ptep_lockptr(...); >> } > > Let me think about it. For PUD_SIZE == PMD_SIZE instead of like core-mm > calling pmd_lockptr we'd call pud_lockptr(). I guess it is only when including asm-generic/pgtable-nopmd.h Otherwise you should have more than one entry in the PMD table so PMD_SIZE would always be smaller than PUD_SIZE, wouldn't it ? So maybe some simplification could be done, like having pud_lock() a nop in that case ? Christophe