From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 6 Jan 2026 19:26:24 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "David Hildenbrand (Red Hat)"
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-mm@kvack.org, Will Deacon, "Aneesh Kumar K.V", Andrew Morton,
 Nick Piggin, Peter Zijlstra, Arnd Bergmann, Muchun Song, Oscar Salvador,
 "Liam R. Howlett", Vlastimil Babka, Jann Horn, Pedro Falcato,
 Rik van Riel, Harry Yoo, Laurence Oberman, Prakash Sangappa, Nadav Amit,
 stable@vger.kernel.org
Subject: Re: [PATCH RESEND v3 4/4] mm/hugetlb: fix excessive IPI broadcasts
 when unsharing PMD tables using mmu_gather
Message-ID: <4e3e2b83-c024-4e16-9913-89f4bc302444@lucifer.local>
References: <20251223214037.580860-1-david@kernel.org>
 <20251223214037.580860-5-david@kernel.org>
In-Reply-To: <20251223214037.580860-5-david@kernel.org>
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
MIME-Version: 1.0

On Tue, Dec 23, 2025 at 10:40:37PM +0100, David Hildenbrand (Red Hat) wrote:
> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
> huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
> where we perform so many IPI broadcasts when unsharing hugetlb PMD page
> tables that it severely regresses some workloads.
>
> In particular, when we fork()+exit(), or when we munmap() a large
> area backed by many shared PMD tables, we perform one IPI broadcast per
> unshared PMD table.
>
> There are two optimizations to be had:
>
> (1) When we process (unshare) multiple such PMD tables, such as during
> exit(), it is sufficient to send a single IPI broadcast (as long as
> we respect locking rules) instead of one per PMD table.
>
> Locking prevents any of these PMD tables from getting reused before
> we drop the lock.
>
> (2) When we are not the last sharer (> 2 users including us), there is
> no need to send the IPI broadcast. The shared PMD tables cannot
> become exclusive (fully unshared) before an IPI is broadcast
> by the last sharer.
>
> Concurrent GUP-fast could walk into a PMD table just before we
> unshared it. It could then succeed in grabbing a page from the
> shared page table even after munmap() etc. succeeded (and suppressed
> an IPI). But there is no difference compared to GUP-fast just
> sleeping for a while after grabbing the page and re-enabling IRQs.
>
> Most importantly, GUP-fast will never walk into page tables that are
> no longer shared, because the last sharer will issue an IPI
> broadcast.
>
> (if ever required, checking whether the PUD changed in GUP-fast
> after grabbing the page, like we do in the PTE case, could handle
> this)
>
> So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
> infrastructure so we can implement these optimizations and demystify the
> code at least a bit. Extend the mmu_gather infrastructure to be able to
> deal with our special hugetlb PMD table sharing implementation.
>
> To make initialization of the mmu_gather easier when working on a single
> VMA (in particular, when dealing with hugetlb), provide
> tlb_gather_mmu_vma().
>
> We'll consolidate the handling for (full) unsharing of PMD tables in
> tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
> in "struct mmu_gather" whether we had (full) unsharing of PMD tables.
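
FWIW, for anyone else reviewing along: the shape of the change, as I read
it, is the following (my own sketch using names from this patch, not
literal tree code; for_each_pmd_table_to_unshare() is made up and stands
in for the various huge_pmd_unshare() call sites):

	/* Before: one IPI broadcast per unshared PMD table. */
	for_each_pmd_table_to_unshare(...) {
		pud_clear(pud);
		tlb_remove_table_sync_one();	/* IPI broadcast, every time */
		ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
	}

	/* After: batch under the i_mmap lock, sync once at the end. */
	for_each_pmd_table_to_unshare(...) {
		pud_clear(pud);
		tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
	}
	/*
	 * One TLB flush plus at most one IPI, and the IPI only if some
	 * table became exclusively owned.
	 */
	huge_pmd_unshare_flush(tlb, vma);
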
>
> Because locking is very special (concurrent unsharing+reuse must be
> prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
> require an explicit earlier call to tlb_flush_unshared_tables().
>
> From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
> that the expected lock protecting us from concurrent unsharing+reuse is
> still held.
>
> Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
> tlb_flush_unshared_tables() was properly called earlier.
>
> Document it all properly.
>
> Notes about tlb_remove_table_sync_one() interaction with unsharing:
>
> There are two fairly tricky things:
>
> (1) tlb_remove_table_sync_one() is a NOP on architectures without
> CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
> Here, the assumption is that the previous TLB flush would send an
> IPI to all relevant CPUs. Careful: some architectures like x86 only
> send IPIs to all relevant CPUs when tlb->freed_tables is set.
>
> The relevant architectures should be selecting
> MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
> kernels and it might have been problematic before this patch.
>
> Also, the arch flushing behavior (independent of IPIs) is different
> when tlb->freed_tables is set. Do we have to enlighten them to also
> take care of tlb->unshared_tables? So far we didn't care, so
> hopefully we are fine. Of course, we could be setting
> tlb->freed_tables as well, but that might then unnecessarily flush
> too much, because the semantics of tlb->freed_tables are a bit
> fuzzy.
>
> This patch changes nothing in this regard.
>
> (2) tlb_remove_table_sync_one() is not a NOP on architectures with
> CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
>
> Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
> we still issue IPIs during TLB flushes and don't actually need the
> second tlb_remove_table_sync_one().
>
> This optimization can be implemented on top of this patch, by checking
> e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as
> described in (1), it really must honor tlb->freed_tables then to
> send IPIs to all relevant CPUs.
>
> Notes on TLB flushing changes:
>
> (1) Flushing for non-shared PMD tables
>
> We're converting from flush_hugetlb_tlb_range() to
> tlb_remove_huge_tlb_entry(). Given that we properly initialize the
> MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to
> __unmap_hugepage_range(), that should be fine.
>
> (2) Flushing for shared PMD tables
>
> We're converting from various things (flush_hugetlb_tlb_range(),
> tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range().
>
> tlb_flush_pmd_range() achieves the same thing that
> tlb_remove_huge_tlb_entry() would achieve in these scenarios.
> Note that tlb_remove_huge_tlb_entry() also calls
> __tlb_remove_tlb_entry(), however that is only implemented on
> powerpc, which does not support PMD table sharing.
>
> Similar to (1), tlb_gather_mmu_vma() should make sure that TLB
> flushing keeps on working as expected.
>
> Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
> concern, as we are holding the i_mmap_lock the whole time, preventing
> concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
> separately as a cleanup later.
>
> There are plenty more cleanups to be had, but they have to wait until
> this is fixed.
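
On note (1): restating it for myself in code form, the generic plumbing is
roughly the following (paraphrased from memory of mm/mmu_gather.c and
asm-generic/tlb.h, so double-check before quoting me on it):

	#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	static void tlb_remove_table_smp_sync(void *arg)
	{
		/* Simply deliver the interrupt */
	}

	void tlb_remove_table_sync_one(void)
	{
		/*
		 * IPI every CPU; since GUP-fast runs with IRQs disabled,
		 * returning from here means every concurrent lockless
		 * walker has finished or re-enabled IRQs.
		 */
		smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	}
	#else
	/*
	 * NOP: the preceding TLB flush is assumed to have already IPI'ed
	 * all relevant CPUs - which, per your note, is only true on x86
	 * when tlb->freed_tables is set.
	 */
	static inline void tlb_remove_table_sync_one(void) { }
	#endif
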
>
> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
> Reported-by: "Uschakow, Stanislav"
> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
> Tested-by: Laurence Oberman
> Cc:
> Signed-off-by: David Hildenbrand (Red Hat)

OK with some local testing, ample use of git range-diff, this LGTM,
hopefully no horrifying weird arch strangeness in some other corner
lurking to bite us :P

So:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/asm-generic/tlb.h |  77 +++++++++++++++++++++++-
>  include/linux/hugetlb.h   |  15 +++--
>  include/linux/mm_types.h  |   1 +
>  mm/hugetlb.c              | 123 ++++++++++++++++++++++----------------
>  mm/mmu_gather.c           |  33 ++++++++++
>  mm/rmap.c                 |  25 +++++---
>  6 files changed, 208 insertions(+), 66 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 1fff717cae510..4d679d2a206b4 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -46,7 +46,8 @@
>   *
>   * The mmu_gather API consists of:
>   *
> - *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_finish_mmu()
> + *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_gather_mmu_vma() /
> + *    tlb_finish_mmu()
>   *
>   *    start and finish a mmu_gather
>   *
> @@ -364,6 +365,20 @@ struct mmu_gather {
>  	unsigned int vma_huge : 1;
>  	unsigned int vma_pfn : 1;
>
> +	/*
> +	 * Did we unshare (unmap) any shared page tables? For now only
> +	 * used for hugetlb PMD table sharing.
> +	 */
> +	unsigned int unshared_tables : 1;
> +
> +	/*
> +	 * Did we unshare any page tables such that they are now exclusive
> +	 * and could get reused+modified by the new owner? When setting this
> +	 * flag, "unshared_tables" will be set as well. For now only used
> +	 * for hugetlb PMD table sharing.
> +	 */
> +	unsigned int fully_unshared_tables : 1;
> +
>  	unsigned int batch_count;
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
> @@ -400,6 +415,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>  	tlb->cleared_pmds = 0;
>  	tlb->cleared_puds = 0;
>  	tlb->cleared_p4ds = 0;
> +	tlb->unshared_tables = 0;
>  	/*
>  	 * Do not reset mmu_gather::vma_* fields here, we do not
>  	 * call into tlb_start_vma() again to set them if there is an
> @@ -484,7 +500,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  	 * these bits.
>  	 */
>  	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> -	      tlb->cleared_puds || tlb->cleared_p4ds))
> +	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
>  		return;
>
>  	tlb_flush(tlb);
> @@ -773,6 +789,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
>  }
>  #endif
>
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
> +		unsigned long addr)
> +{
> +	/*
> +	 * The caller must make sure that concurrent unsharing + exclusive
> +	 * reuse is impossible until tlb_flush_unshared_tables() was called.
> +	 */
> +	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
> +	ptdesc_pmd_pts_dec(pt);
> +
> +	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
> +	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
> +
> +	/*
> +	 * If the page table is now exclusively owned, we fully unshared
> +	 * a page table.
> +	 */
> +	if (!ptdesc_pmd_is_shared(pt))
> +		tlb->fully_unshared_tables = true;
> +	tlb->unshared_tables = true;
> +}
> +
> +static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
> +{
> +	/*
> +	 * As soon as the caller drops locks to allow for reuse of
> +	 * previously-shared tables, these tables could get modified and
> +	 * even reused outside of hugetlb context, so we have to make sure that
> +	 * any page table walkers (incl. TLB, GUP-fast) are aware of that
> +	 * change.
> +	 *
> +	 * Even if we are not fully unsharing a PMD table, we must
> +	 * flush the TLB for the unsharer now.
> +	 */
> +	if (tlb->unshared_tables)
> +		tlb_flush_mmu_tlbonly(tlb);
> +
> +	/*
> +	 * Similarly, we must make sure that concurrent GUP-fast will not
> +	 * walk previously-shared page tables that are getting modified+reused
> +	 * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
> +	 *
> +	 * We only perform this when we are the last sharer of a page table,
> +	 * as the IPI will reach all CPUs running any GUP-fast.
> +	 *
> +	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
> +	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
> +	 * required IPIs already for us.
> +	 */
> +	if (tlb->fully_unshared_tables) {
> +		tlb_remove_table_sync_one();
> +		tlb->fully_unshared_tables = false;
> +	}
> +}
> +#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
> +
>  #endif /* CONFIG_MMU */
>
>  #endif /* _ASM_GENERIC__TLB_H */
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 03c8725efa289..e51b8ef0cebd9 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		unsigned long addr, unsigned long sz);
>  unsigned long hugetlb_mask_last_page(struct hstate *h);
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep);
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep);
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  		unsigned long *start, unsigned long *end);
>
> @@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
>  	return NULL;
>  }
>
> -static inline int huge_pmd_unshare(struct mm_struct *mm,
> -		struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep)
> +static inline int huge_pmd_unshare(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
>
> +static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma)
> +{
> +}
> +
>  static inline void adjust_range_if_pmd_sharing_possible(
>  		struct vm_area_struct *vma,
>  		unsigned long *start, unsigned long *end)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 42af2292951d4..d1053b2c1f800 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1522,6 +1522,7 @@ static inline unsigned int mm_cid_size(void)
>  struct mmu_gather;
>  extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
>  extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
> +void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma);
>  extern void tlb_finish_mmu(struct mmu_gather *tlb);
>
>  struct vm_fault;
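
The tlb_flush_unshared_tables() comments above lean on the standard
GUP-fast trick, which is maybe worth spelling out since the whole fix
hinges on it: gup_fast() does its lockless walk with IRQs disabled, so an
IPI broadcast cannot complete until every concurrent walker is done.
Schematically (my paraphrase, not a quote from mm/gup.c; the internal
walker's name varies across versions, so treat this as pseudocode):

	local_irq_save(flags);
	/*
	 * Lockless walk: may still descend into a PMD table that is
	 * concurrently being unshared, but cannot outlive an IPI broadcast.
	 */
	gup_fast_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);

So once the last sharer's tlb_remove_table_sync_one() returns, no lockless
walker can still be inside the previously-shared table, and it is safe for
the new exclusive owner to modify or repurpose it.
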
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c77cdef12a32..2609b6d58f99e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5096,7 +5096,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	unsigned long last_addr_mask;
>  	pte_t *src_pte, *dst_pte;
>  	struct mmu_notifier_range range;
> -	bool shared_pmd = false;
> +	struct mmu_gather tlb;
>
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
>  				old_end);
> @@ -5106,6 +5106,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	 * range.
>  	 */
>  	flush_cache_range(vma, range.start, range.end);
> +	tlb_gather_mmu_vma(&tlb, vma);
>
>  	mmu_notifier_invalidate_range_start(&range);
>  	last_addr_mask = hugetlb_mask_last_page(h);
> @@ -5122,8 +5123,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
>  			continue;
>
> -		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
> -			shared_pmd = true;
> +		if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
>  			old_addr |= last_addr_mask;
>  			new_addr |= last_addr_mask;
>  			continue;
> @@ -5134,15 +5134,16 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  			break;
>
>  		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
> +		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
>  	}
>
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, old_end - len, old_end);
> +	tlb_flush_mmu_tlbonly(&tlb);
> +	huge_pmd_unshare_flush(&tlb, vma);
> +
>  	mmu_notifier_invalidate_range_end(&range);
>  	i_mmap_unlock_write(mapping);
>  	hugetlb_vma_unlock_write(vma);
> +	tlb_finish_mmu(&tlb);
>
>  	return len + old_addr - old_end;
>  }
> @@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	unsigned long sz = huge_page_size(h);
>  	bool adjust_reservation;
>  	unsigned long last_addr_mask;
> -	bool force_flush = false;
>
>  	WARN_ON(!is_vm_hugetlb_page(vma));
>  	BUG_ON(start & ~huge_page_mask(h));
> @@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		}
>
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> +		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
>  			spin_unlock(ptl);
> -			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
> -			force_flush = true;
>  			address |= last_addr_mask;
>  			continue;
>  		}
> @@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	}
>  	tlb_end_vma(tlb, vma);
>
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (force_flush)
> -		tlb_flush_mmu_tlbonly(tlb);
> +	huge_pmd_unshare_flush(tlb, vma);
>  }
>
>  void __hugetlb_zap_begin(struct vm_area_struct *vma,
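
Since __unmap_hugepage_range() is the path from the regression report, a
quick sanity trace of what now happens on exit()/munmap() of a VMA backed
by N shared PMD tables (schematic call flow, not literal code):

	tlb_gather_mmu(&tlb, mm);
	__unmap_hugepage_range(&tlb, vma, start, end, ...);
		/*
		 * N x huge_pmd_unshare(): refcounts drop, ranges gathered,
		 * no IPIs and no flushes inside the loop anymore.
		 */
		huge_pmd_unshare_flush(&tlb, vma);
		/*
		 * One TLB flush, plus at most one IPI broadcast, the
		 * latter only if we were the last sharer of some table.
		 */
	tlb_finish_mmu(&tlb);

versus N IPI broadcasts before the patch, which squares with the numbers
in the original report and Laurence's Tested-by.
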
> @@ -6409,11 +6400,11 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  	pte_t pte;
>  	struct hstate *h = hstate_vma(vma);
>  	long pages = 0, psize = huge_page_size(h);
> -	bool shared_pmd = false;
>  	struct mmu_notifier_range range;
>  	unsigned long last_addr_mask;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	struct mmu_gather tlb;
>
>  	/*
>  	 * In the case of shared PMDs, the area to flush could be beyond
> @@ -6426,6 +6417,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, range.start, range.end);
> +	tlb_gather_mmu_vma(&tlb, vma);
>
>  	mmu_notifier_invalidate_range_start(&range);
>  	hugetlb_vma_lock_write(vma);
> @@ -6452,7 +6444,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			}
>  		}
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> +		if (huge_pmd_unshare(&tlb, vma, address, ptep)) {
>  			/*
>  			 * When uffd-wp is enabled on the vma, unshare
>  			 * shouldn't happen at all. Warn about it if it
> @@ -6461,7 +6453,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
>  			pages++;
>  			spin_unlock(ptl);
> -			shared_pmd = true;
>  			address |= last_addr_mask;
>  			continue;
>  		}
> @@ -6522,22 +6513,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			pte = huge_pte_clear_uffd_wp(pte);
>  			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
>  			pages++;
> +			tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
>  		}
>
> next:
>  		spin_unlock(ptl);
>  		cond_resched();
>  	}
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, start, end);
> +
> +	tlb_flush_mmu_tlbonly(&tlb);
> +	huge_pmd_unshare_flush(&tlb, vma);
>  	/*
>  	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
>  	 * downgrading page table protection not changing it to point to a new
> @@ -6548,6 +6533,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
>  	hugetlb_vma_unlock_write(vma);
>  	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);
>
>  	return pages > 0 ? (pages << h->order) : pages;
>  }
> @@ -6904,18 +6890,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return pte;
>  }
>
> -/*
> - * unmap huge page backed by shared pte.
> +/**
> + * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + * @addr: the address we are trying to unshare.
> + * @ptep: pointer into the (pmd) page table.
> + *
> + * Called with the page table lock held, the i_mmap_rwsem held in write mode
> + * and the hugetlb vma lock held in write mode.
>   *
> - * Called with page table lock held.
> + * Note: The caller must call huge_pmd_unshare_flush() before dropping the
> + * i_mmap_rwsem.
>   *
> - * returns: 1 successfully unmapped a shared pte page
> - *	    0 the underlying pte page is not shared, or it is the last user
> + * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
> + *	    was not a shared PMD table.
>   */
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep)
>  {
>  	unsigned long sz = huge_page_size(hstate_vma(vma));
> +	struct mm_struct *mm = vma->vm_mm;
>  	pgd_t *pgd = pgd_offset(mm, addr);
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	pud_t *pud = pud_offset(p4d, addr);
> @@ -6927,18 +6922,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>  	hugetlb_vma_assert_locked(vma);
>  	pud_clear(pud);
> -	/*
> -	 * Once our caller drops the rmap lock, some other process might be
> -	 * using this page table as a normal, non-hugetlb page table.
> -	 * Wait for pending gup_fast() in other threads to finish before letting
> -	 * that happen.
> -	 */
> -	tlb_remove_table_sync_one();
> -	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
> +
> +	tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
> +
>  	mm_dec_nr_pmds(mm);
>  	return 1;
>  }
>
> +/*
> + * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + *
> + * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
> + * unsharing with concurrent page table walkers.
> + *
> + * This function must be called after a sequence of huge_pmd_unshare()
> + * calls while still holding the i_mmap_rwsem.
> + */
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +	/*
> +	 * We must synchronize page table unsharing such that nobody will
> +	 * try reusing a previously-shared page table while it might still
> +	 * be in use by previous sharers (TLB, GUP-fast).
> +	 */
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +
> +	tlb_flush_unshared_tables(tlb);
> +}
> +
>  #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
>
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -6947,12 +6960,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return NULL;
>  }
>
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
>
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +}
> +
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  		unsigned long *start, unsigned long *end)
>  {
> @@ -7219,6 +7236,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	unsigned long sz = huge_page_size(h);
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
>  	unsigned long address;
>  	spinlock_t *ptl;
>  	pte_t *ptep;
> @@ -7230,6 +7248,8 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  		return;
>
>  	flush_cache_range(vma, start, end);
> +	tlb_gather_mmu_vma(&tlb, vma);
> +
>  	/*
>  	 * No need to call adjust_range_if_pmd_sharing_possible(), because
>  	 * we have already done the PUD_SIZE alignment.
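
hugetlb_unshare_pmds() below ends up being the clearest illustration of
the locking contract, so writing out the expected caller shape for my own
benefit (a trimmed paraphrase of that function, not literal tree code):

	struct mmu_gather tlb;

	tlb_gather_mmu_vma(&tlb, vma);
	/* i_mmap_rwsem (write) + hugetlb vma lock held from here on */
	for (address = start; address < end; address += PUD_SIZE) {
		ptl = huge_pte_lock(h, mm, ptep);
		huge_pmd_unshare(&tlb, vma, address, ptep); /* only gathers */
		spin_unlock(ptl);
	}
	huge_pmd_unshare_flush(&tlb, vma); /* BEFORE dropping i_mmap_rwsem */
	/* ... drop locks ... */
	tlb_finish_mmu(&tlb);              /* warns if the flush was missed */

The VM_WARN_ON_ONCE(tlb->fully_unshared_tables) in tlb_finish_mmu() then
catches any caller that forgets the flush-before-unlock step, which is a
nice safety net given how subtle this locking is.
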
> @@ -7248,10 +7268,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, > if (!ptep) > continue; > ptl = huge_pte_lock(h, mm, ptep); > - huge_pmd_unshare(mm, vma, address, ptep); > + huge_pmd_unshare(&tlb, vma, address, ptep); > spin_unlock(ptl); > } > - flush_hugetlb_tlb_range(vma, start, end); > + huge_pmd_unshare_flush(&tlb, vma); > if (take_locks) { > i_mmap_unlock_write(vma->vm_file->f_mapping); > hugetlb_vma_unlock_write(vma); > @@ -7261,6 +7281,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, > * Documentation/mm/mmu_notifier.rst. > */ > mmu_notifier_invalidate_range_end(&range); > + tlb_finish_mmu(&tlb); > } > > /* > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 247e3f9db6c7a..cd32c2dbf501b 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > > #include > > @@ -426,6 +427,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, > #endif > tlb->vma_pfn = 0; > > + tlb->fully_unshared_tables = 0; > __tlb_reset_range(tlb); > inc_tlb_flush_pending(tlb->mm); > } > @@ -459,6 +461,31 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm) > __tlb_gather_mmu(tlb, mm, true); > } > > +/** > + * tlb_gather_mmu - initialize an mmu_gather structure for operating on a single > + * VMA > + * @tlb: the mmu_gather structure to initialize > + * @vma: the vm_area_struct > + * > + * Called to initialize an (on-stack) mmu_gather structure for operating on > + * a single VMA. In contrast to tlb_gather_mmu(), calling this function will > + * not require another call to tlb_start_vma(). In contrast to tlb_start_vma(), > + * this function will *not* call flush_cache_range(). > + * > + * For hugetlb VMAs, this function will also initialize the mmu_gather > + * page_size accordingly, not requiring a separate call to > + * tlb_change_page_size(). > + * > + */ > +void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > +{ > + tlb_gather_mmu(tlb, vma->vm_mm); > + tlb_update_vma_flags(tlb, vma); > + if (is_vm_hugetlb_page(vma)) > + /* All entries have the same size. */ > + tlb_change_page_size(tlb, huge_page_size(hstate_vma(vma))); > +} > + > /** > * tlb_finish_mmu - finish an mmu_gather structure > * @tlb: the mmu_gather structure to finish > @@ -468,6 +495,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm) > */ > void tlb_finish_mmu(struct mmu_gather *tlb) > { > + /* > + * We expect an earlier huge_pmd_unshare_flush() call to sort this out, > + * due to complicated locking requirements with page table unsharing. > + */ > + VM_WARN_ON_ONCE(tlb->fully_unshared_tables); > + > /* > * If there are parallel threads are doing PTE changes on same range > * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB > diff --git a/mm/rmap.c b/mm/rmap.c > index 748f48727a162..7b9879ef442d9 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -76,7 +76,7 @@ > #include > #include > > -#include > +#include > > #define CREATE_TRACE_POINTS > #include > @@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > * if unsuccessful. 
>  			 */
>  			if (!anon) {
> +				struct mmu_gather tlb;
> +
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  				if (!hugetlb_vma_trylock_write(vma))
>  					goto walk_abort;
> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +
> +				tlb_gather_mmu_vma(&tlb, vma);
> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>  					hugetlb_vma_unlock_write(vma);
> -					flush_tlb_range(vma,
> -						range.start, range.end);
> +					huge_pmd_unshare_flush(&tlb, vma);
> +					tlb_finish_mmu(&tlb);
>  					/*
>  					 * The PMD table was unmapped,
>  					 * consequently unmapping the folio.
> @@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  					goto walk_done;
>  				}
>  				hugetlb_vma_unlock_write(vma);
> +				tlb_finish_mmu(&tlb);
>  			}
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  			if (pte_dirty(pteval))
> @@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * fail if unsuccessful.
>  			 */
>  			if (!anon) {
> +				struct mmu_gather tlb;
> +
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  				if (!hugetlb_vma_trylock_write(vma)) {
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
>  				}
> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -					hugetlb_vma_unlock_write(vma);
> -					flush_tlb_range(vma,
> -						range.start, range.end);
>
> +				tlb_gather_mmu_vma(&tlb, vma);
> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					huge_pmd_unshare_flush(&tlb, vma);
> +					tlb_finish_mmu(&tlb);
>  					/*
>  					 * The PMD table was unmapped,
>  					 * consequently unmapping the folio.
> @@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  					break;
>  				}
>  				hugetlb_vma_unlock_write(vma);
> +				tlb_finish_mmu(&tlb);
>  			}
>  			/* Nuke the hugetlb page table entry */
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> --
> 2.52.0
>
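
P.S. one detail I double-checked while reviewing, in case it saves someone
else the lookup: tlb_unshare_pmd_ptdesc() relies on tlb_flush_pmd_range()
to make the eventual flush cover the whole PUD region. That helper (in
asm-generic/tlb.h; quoting from memory, so verify against your tree) boils
down to:

	static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
			unsigned long address, unsigned long size)
	{
		/* Widen the gathered range and record a PMD-level clear. */
		__tlb_adjust_range(tlb, address, size);
		tlb->cleared_pmds = 1;
	}

So with addr & PUD_MASK / PUD_SIZE passed in, tlb_flush_mmu_tlbonly() will
flush the full range covered by the unshared table even on architectures
doing ranged flushes, and the new tlb->unshared_tables bit ensures the
flush is not skipped when nothing else was gathered.
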