From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, James Houghton,
 stable@vger.kernel.org, Oscar Salvador, Muchun Song, Baolin Wang
Date: Wed, 31 Jul 2024 10:54:24 -0400
Subject: Re: [PATCH v3] mm/hugetlb: fix hugetlb vs. core-mm PT locking
In-Reply-To: <20240731122103.382509-1-david@redhat.com>
References: <20240731122103.382509-1-david@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table
> locking for hugetlb folios that are not mapped using a single PMD/PUD.
> 
> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
> page tables, will perform a pte_offset_map_lock() to grab the PTE table
> lock.
> 
> However, hugetlb code that concurrently modifies these page tables would
> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
> locks would differ. Something similar can happen right now with hugetlb
> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
> 
> This issue can be reproduced [1], for example triggering:
> 
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: [...]
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522]  try_grab_folio+0x11c/0x188
> [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527]  follow_page_mask+0x1a0/0x2b8
> [ 3106.058118]  __get_user_pages+0xf0/0x348
> [ 3106.060647]  faultin_page_range+0xb0/0x360
> [ 3106.063651]  do_madvise+0x340/0x598
> 
> Let's make huge_pte_lockptr() effectively use the same PT locks as any
> core-mm page table walker would. Add ptep_lockptr() to obtain the PTE
> page table lock using a pte pointer -- unfortunately we cannot convert
> pte_lockptr() because virt_to_page() doesn't work with kmap'ed page
> tables we can have with CONFIG_HIGHPTE.
> 
> Take care of PTE tables possibly spanning multiple pages, and take care
> of CONFIG_PGTABLE_LEVELS complexity when, e.g., PMD_SIZE == PUD_SIZE.
> For example, with CONFIG_PGTABLE_LEVELS == 2, core-mm would detect
> pmd_leaf() with hugepagesize == PMD_SIZE and use the pmd_lockptr(),
> which would end up just mapping to the per-MM PT lock.
> 
> There is one ugly case: powerpc 8xx, where we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to
> take the PMD table lock, core-mm would grab the PTE table lock of one
> of the two PTE page tables. In such corner cases, we have to make sure
> that both locks match, which is (fortunately!) currently guaranteed for
> 8xx as it does not support SMP and consequently doesn't use split PT
> locks.
> 
> [1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/
> 
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Reviewed-by: James Houghton
> Cc: 
> Cc: Peter Xu
> Cc: Oscar Salvador
> Cc: Muchun Song
> Cc: Baolin Wang
> Signed-off-by: David Hildenbrand

Nitpick: I wonder whether some of the lines can be simplified if we write
it downwards from PUD, like:

	huge_pte_lockptr()
	{
		if (size >= PUD_SIZE)
			return pud_lockptr(...);
		if (size >= PMD_SIZE)
			return pmd_lockptr(...);
		/* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
		WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
		return ptep_lockptr(...);
	}

The ">=" checks should avoid the need to check against the pgtable
levels, IIUC.

The other nitpick is that I didn't yet find any arch that uses a
non-zero-order page for PTE page tables. I would give dropping the mask
a shot and see what explodes (which I don't expect, per my read..), but
I understand we already saw some explosions due to other things, so I
think it's fine if this hugetlb path (which we're removing anyway) does
a bit more math, if you think that's easier for you.

Acked-by: Peter Xu

Thanks,

-- 
Peter Xu