Date: Tue, 16 Dec 2025 10:47:21 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "David Hildenbrand (Red Hat)"
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org, Will Deacon, "Aneesh Kumar K.V", Andrew Morton,
	Nick Piggin, Peter Zijlstra, Arnd Bergmann, Muchun Song,
	Oscar Salvador, "Liam R. Howlett", Vlastimil Babka, Jann Horn,
	Pedro Falcato, Rik van Riel, Harry Yoo, Laurence Oberman,
	Prakash Sangappa, Nadav Amit, stable@vger.kernel.org
Subject: Re: [PATCH v2 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
In-Reply-To: <20251212071019.471146-5-david@kernel.org>
References: <20251212071019.471146-1-david@kernel.org>
 <20251212071019.471146-5-david@kernel.org>

On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
> huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
> where we perform so many IPI broadcasts when unsharing hugetlb PMD page
> tables that it severely regresses some workloads.
>
> In particular, when we fork()+exit(), or when we munmap() a large
> area backed by many shared PMD tables, we perform one IPI broadcast per
> unshared PMD table.
>
> There are two optimizations to be had:
>
> (1) When we process (unshare) multiple such PMD tables, such as during
>     exit(), it is sufficient to send a single IPI broadcast (as long as
>     we respect locking rules) instead of one per PMD table.
>
>     Locking prevents that any of these PMD tables could get reused before
>     we drop the lock.
>
> (2) When we are not the last sharer (> 2 users including us), there is
>     no need to send the IPI broadcast. The shared PMD tables cannot
>     become exclusive (fully unshared) before an IPI will be broadcasted
>     by the last sharer.
>
>     Concurrent GUP-fast could walk into a PMD table just before we
>     unshared it. It could then succeed in grabbing a page from the
>     shared page table even after munmap() etc succeeded (and suppressed
>     an IPI). But there is no difference compared to GUP-fast just
>     sleeping for a while after grabbing the page and re-enabling IRQs.
>
>     Most importantly, GUP-fast will never walk into page tables that are
>     no-longer shared, because the last sharer will issue an IPI
>     broadcast.
>
>     (if ever required, checking whether the PUD changed in GUP-fast
>     after grabbing the page like we do in the PTE case could handle
>     this)
>
> So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
> infrastructure so we can implement these optimizations and demystify the
> code at least a bit. Extend the mmu_gather infrastructure to be able to
> deal with our special hugetlb PMD table sharing implementation.
>
> We'll consolidate the handling for (full) unsharing of PMD tables in
> tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
> in "struct mmu_gather" whether we had (full) unsharing of PMD tables.
>
> Because locking is very special (concurrent unsharing+reuse must be
> prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
> require an explicit earlier call to tlb_flush_unshared_tables().
>
> From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
> that the expected lock protecting us from concurrent unsharing+reuse is
> still held.
>
> Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
> tlb_flush_unshared_tables() was properly called earlier.
>
> Document it all properly.
>
> Notes about tlb_remove_table_sync_one() interaction with unsharing:
>
> There are two fairly tricky things:
>
> (1) tlb_remove_table_sync_one() is a NOP on architectures without
>     CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
>     Here, the assumption is that the previous TLB flush would send an
>     IPI to all relevant CPUs. Careful: some architectures like x86 only
>     send IPIs to all relevant CPUs when tlb->freed_tables is set.
>
>     The relevant architectures should be selecting
>     MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
>     kernels and it might have been problematic before this patch.
>
>     Also, the arch flushing behavior (independent of IPIs) is different
>     when tlb->freed_tables is set. Do we have to enlighten them to also
>     take care of tlb->unshared_tables? So far we didn't care, so
>     hopefully we are fine. Of course, we could be setting
>     tlb->freed_tables as well, but that might then unnecessarily flush
>     too much, because the semantics of tlb->freed_tables are a bit
>     fuzzy.
>
>     This patch changes nothing in this regard.
>
> (2) tlb_remove_table_sync_one() is not a NOP on architectures with
>     CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
>
>     Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
>     we still issue IPIs during TLB flushes and don't actually need the
>     second tlb_remove_table_sync_one().
>
>     This optimization can be implemented on top of this, by checking e.g., in
>     tlb_remove_table_sync_one() whether we really need IPIs. But as
>     described in (1), it really must honor tlb->freed_tables then to
>     send IPIs to all relevant CPUs.
>
> Further note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
> concern, as we are holding the i_mmap_lock the whole time, preventing
> concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
> separately as a cleanup later.
>
> There are plenty more cleanups to be had, but they have to wait until
> this is fixed.
>
> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
> Reported-by: "Uschakow, Stanislav"
> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
> Tested-by: Laurence Oberman
> Cc:
> Signed-off-by: David Hildenbrand (Red Hat)

After discussion on v1 4/4, and running a git range-diff between the two, this
LGTM, so:

Reviewed-by: Lorenzo Stoakes
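
As an aside for anybody else following the thread (no change requested): the
calling convention that falls out of this patch, as I read it, is roughly the
sketch below. Illustrative only - which locks each call site already holds and
how it takes them differs, see the hunks that follow:

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);

	/* i_mmap_rwsem held for write; hugetlb VMA lock + PTL taken per entry. */
	huge_pmd_unshare(&tlb, vma, addr, ptep);	/* possibly many times */
	...

	/*
	 * Must run before i_mmap_rwsem is dropped: one TLB flush for everything
	 * that was unshared, plus the tlb_remove_table_sync_one() IPI only if we
	 * ended up being the last sharer of some table.
	 */
	huge_pmd_unshare_flush(&tlb, vma);

	/* ... drop the locks ... */
	tlb_finish_mmu(&tlb);
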
> ---
>  include/asm-generic/tlb.h |  74 ++++++++++++++++++++++-
>  include/linux/hugetlb.h   |  19 +++---
>  mm/hugetlb.c              | 121 ++++++++++++++++++++++----------------
>  mm/mmu_gather.c           |   7 +++
>  mm/mprotect.c             |   2 +-
>  mm/rmap.c                 |  25 +++++---
>  6 files changed, 179 insertions(+), 69 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 1fff717cae510..706416babb3d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -364,6 +364,20 @@ struct mmu_gather {
>  	unsigned int		vma_huge : 1;
>  	unsigned int		vma_pfn  : 1;
>
> +	/*
> +	 * Did we unshare (unmap) any shared page tables? For now only
> +	 * used for hugetlb PMD table sharing.
> +	 */
> +	unsigned int		unshared_tables : 1;
> +
> +	/*
> +	 * Did we unshare any page tables such that they are now exclusive
> +	 * and could get reused+modified by the new owner? When setting this
> +	 * flag, "unshared_tables" will be set as well. For now only used
> +	 * for hugetlb PMD table sharing.
> +	 */
> +	unsigned int		fully_unshared_tables : 1;
> +
>  	unsigned int		batch_count;
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
> @@ -400,6 +414,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>  	tlb->cleared_pmds = 0;
>  	tlb->cleared_puds = 0;
>  	tlb->cleared_p4ds = 0;
> +	tlb->unshared_tables = 0;
>  	/*
>  	 * Do not reset mmu_gather::vma_* fields here, we do not
>  	 * call into tlb_start_vma() again to set them if there is an
> @@ -484,7 +499,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  	 * these bits.
>  	 */
>  	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> -	      tlb->cleared_puds || tlb->cleared_p4ds))
> +	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
>  		return;
>
>  	tlb_flush(tlb);
> @@ -773,6 +788,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
>  }
>  #endif
>
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
> +		unsigned long addr)
> +{
> +	/*
> +	 * The caller must make sure that concurrent unsharing + exclusive
> +	 * reuse is impossible until tlb_flush_unshared_tables() was called.
> +	 */
> +	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
> +	ptdesc_pmd_pts_dec(pt);
> +
> +	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
> +	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
> +
> +	/*
> +	 * If the page table is now exclusively owned, we fully unshared
> +	 * a page table.
> +	 */
> +	if (!ptdesc_pmd_is_shared(pt))
> +		tlb->fully_unshared_tables = true;
> +	tlb->unshared_tables = true;
> +}
> +
> +static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
> +{
> +	/*
> +	 * As soon as the caller drops locks to allow for reuse of
> +	 * previously-shared tables, these tables could get modified and
> +	 * even reused outside of hugetlb context, so we have to make sure that
> +	 * any page table walkers (incl. TLB, GUP-fast) are aware of that
> +	 * change.
> +	 *
> +	 * Even if we are not fully unsharing a PMD table, we must
> +	 * flush the TLB for the unsharer now.
> +	 */
> +	if (tlb->unshared_tables)
> +		tlb_flush_mmu_tlbonly(tlb);
> +
> +	/*
> +	 * Similarly, we must make sure that concurrent GUP-fast will not
> +	 * walk previously-shared page tables that are getting modified+reused
> +	 * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
> +	 *
> +	 * We only perform this when we are the last sharer of a page table,
> +	 * as the IPI will reach all CPUs: any GUP-fast.
> +	 *
> +	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
> +	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
> +	 * required IPIs already for us.
> +	 */
> +	if (tlb->fully_unshared_tables) {
> +		tlb_remove_table_sync_one();
> +		tlb->fully_unshared_tables = false;
> +	}
> +}
> +#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
> +
>  #endif /* CONFIG_MMU */
>
>  #endif /* _ASM_GENERIC__TLB_H */
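
(Recapping the above for the archive rather than raising anything new: the two
bits are cumulative. Every unshare sets unshared_tables, which only forces the
TLB flush in tlb_flush_unshared_tables(); the transition to an exclusively
owned table additionally sets fully_unshared_tables, which is what triggers
the tlb_remove_table_sync_one() IPI. Loosely paraphrasing
tlb_unshare_pmd_ptdesc(), not quoting it:

	ptdesc_pmd_pts_dec(pt);
	...
	if (!ptdesc_pmd_is_shared(pt))		/* we were the last sharer */
		tlb->fully_unshared_tables = true;	/* IPI sync needed as well */
	tlb->unshared_tables = true;		/* unsharer's TLB must always be flushed */

So with, say, four processes tearing down their mappings of the same PMD table
one after another, only the last one pays for the IPI broadcast.)
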
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 03c8725efa289..63b248c6bfd47 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  unsigned long hugetlb_mask_last_page(struct hstate *h);
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep);
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep);
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  			unsigned long *start, unsigned long *end);
>
> @@ -271,7 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
>  int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
>  void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
>  void hugetlb_vma_lock_release(struct kref *kref);
> -long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end, pgprot_t newprot,
>  		unsigned long cp_flags);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
> @@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
>  	return NULL;
>  }
>
> -static inline int huge_pmd_unshare(struct mm_struct *mm,
> -				struct vm_area_struct *vma,
> -				unsigned long addr, pte_t *ptep)
> +static inline int huge_pmd_unshare(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
>
> +static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma)
> +{
> +}
> +
>  static inline void adjust_range_if_pmd_sharing_possible(
>  			struct vm_area_struct *vma,
>  			unsigned long *start, unsigned long *end)
> @@ -432,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio,
>  {
>  }
>
> -static inline long hugetlb_change_protection(
> +static inline long hugetlb_change_protection(struct mmu_gather *tlb,
>  			struct vm_area_struct *vma, unsigned long address,
>  			unsigned long end, pgprot_t newprot,
>  			unsigned long cp_flags)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c77cdef12a32..7fef0b94b5d1e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5096,8 +5096,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	unsigned long last_addr_mask;
>  	pte_t *src_pte, *dst_pte;
>  	struct mmu_notifier_range range;
> -	bool shared_pmd = false;
> +	struct mmu_gather tlb;
>
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
>  				old_end);
>  	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> @@ -5122,12 +5123,12 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
>  			continue;
>
> -		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
> -			shared_pmd = true;
> +		if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
>  			old_addr |= last_addr_mask;
>  			new_addr |= last_addr_mask;
>  			continue;
>  		}
> +		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
>
>  		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
>  		if (!dst_pte)
> @@ -5136,13 +5137,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
>  	}
>
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, old_end - len, old_end);
> +	tlb_flush_mmu_tlbonly(&tlb);
> +	huge_pmd_unshare_flush(&tlb, vma);
> +
>  	mmu_notifier_invalidate_range_end(&range);
>  	i_mmap_unlock_write(mapping);
>  	hugetlb_vma_unlock_write(vma);
> +	tlb_finish_mmu(&tlb);
>
>  	return len + old_addr - old_end;
>  }
> @@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	unsigned long sz = huge_page_size(h);
>  	bool adjust_reservation;
>  	unsigned long last_addr_mask;
> -	bool force_flush = false;
>
>  	WARN_ON(!is_vm_hugetlb_page(vma));
>  	BUG_ON(start & ~huge_page_mask(h));
> @@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		}
>
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> +		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
>  			spin_unlock(ptl);
> -			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
> -			force_flush = true;
>  			address |= last_addr_mask;
>  			continue;
>  		}
> @@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	}
>  	tlb_end_vma(tlb, vma);
>
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (force_flush)
> -		tlb_flush_mmu_tlbonly(tlb);
> +	huge_pmd_unshare_flush(tlb, vma);
>  }
>
>  void __hugetlb_zap_begin(struct vm_area_struct *vma,
> @@ -6399,7 +6390,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  }
>  #endif /* CONFIG_USERFAULTFD */
>
> -long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)
>  {
> @@ -6409,7 +6400,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  	pte_t pte;
>  	struct hstate *h = hstate_vma(vma);
>  	long pages = 0, psize = huge_page_size(h);
> -	bool shared_pmd = false;
>  	struct mmu_notifier_range range;
>  	unsigned long last_addr_mask;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> @@ -6452,7 +6442,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			}
>  		}
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> +		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
>  			/*
>  			 * When uffd-wp is enabled on the vma, unshare
>  			 * shouldn't happen at all.  Warn about it if it
> @@ -6461,7 +6451,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
>  			pages++;
>  			spin_unlock(ptl);
> -			shared_pmd = true;
>  			address |= last_addr_mask;
>  			continue;
>  		}
> @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  				pte = huge_pte_clear_uffd_wp(pte);
>  			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
>  			pages++;
> +			tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
>  		}
>
>  next:
>  		spin_unlock(ptl);
>  		cond_resched();
>  	}
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, start, end);
> +
> +	tlb_flush_mmu_tlbonly(tlb);
> +	huge_pmd_unshare_flush(tlb, vma);
>  	/*
>  	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
>  	 * downgrading page table protection not changing it to point to a new
> @@ -6904,18 +6887,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return pte;
>  }
>
> -/*
> - * unmap huge page backed by shared pte.
> +/**
> + * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + * @addr: the address we are trying to unshare.
> + * @ptep: pointer into the (pmd) page table.
> + *
> + * Called with the page table lock held, the i_mmap_rwsem held in write mode
> + * and the hugetlb vma lock held in write mode.
>   *
> - * Called with page table lock held.
> + * Note: The caller must call huge_pmd_unshare_flush() before dropping the
> + * i_mmap_rwsem.
>   *
> - * returns: 1 successfully unmapped a shared pte page
> - *	    0 the underlying pte page is not shared, or it is the last user
> + * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
> + * was not a shared PMD table.
>   */
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep)
>  {
>  	unsigned long sz = huge_page_size(hstate_vma(vma));
> +	struct mm_struct *mm = vma->vm_mm;
>  	pgd_t *pgd = pgd_offset(mm, addr);
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	pud_t *pud = pud_offset(p4d, addr);
> @@ -6927,18 +6919,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>  	hugetlb_vma_assert_locked(vma);
>  	pud_clear(pud);
> -	/*
> -	 * Once our caller drops the rmap lock, some other process might be
> -	 * using this page table as a normal, non-hugetlb page table.
> -	 * Wait for pending gup_fast() in other threads to finish before letting
> -	 * that happen.
> -	 */
> -	tlb_remove_table_sync_one();
> -	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
> +
> +	tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
> +
>  	mm_dec_nr_pmds(mm);
>  	return 1;
>  }
>
> +/*
> + * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + *
> + * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
> + * unsharing with concurrent page table walkers.
> + *
> + * This function must be called after a sequence of huge_pmd_unshare()
> + * calls while still holding the i_mmap_rwsem.
> + */
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +	/*
> +	 * We must synchronize page table unsharing such that nobody will
> +	 * try reusing a previously-shared page table while it might still
> +	 * be in use by previous sharers (TLB, GUP_fast).
> +	 */
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +
> +	tlb_flush_unshared_tables(tlb);
> +}
> +
>  #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
>
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -6947,12 +6957,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return NULL;
>  }
>
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
>
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +}
> +
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  			unsigned long *start, unsigned long *end)
>  {
> @@ -7219,6 +7233,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	unsigned long sz = huge_page_size(h);
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
>  	unsigned long address;
>  	spinlock_t *ptl;
>  	pte_t *ptep;
> @@ -7229,6 +7244,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	if (start >= end)
>  		return;
>
> +	tlb_gather_mmu(&tlb, mm);
>  	flush_cache_range(vma, start, end);
>  	/*
>  	 * No need to call adjust_range_if_pmd_sharing_possible(), because
> @@ -7248,10 +7264,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  		if (!ptep)
>  			continue;
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		huge_pmd_unshare(mm, vma, address, ptep);
> +		huge_pmd_unshare(&tlb, vma, address, ptep);
>  		spin_unlock(ptl);
>  	}
> -	flush_hugetlb_tlb_range(vma, start, end);
> +	huge_pmd_unshare_flush(&tlb, vma);
>  	if (take_locks) {
>  		i_mmap_unlock_write(vma->vm_file->f_mapping);
>  		hugetlb_vma_unlock_write(vma);
> @@ -7261,6 +7277,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	 * Documentation/mm/mmu_notifier.rst.
>  	 */
>  	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);
>  }
>
>  /*
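
Just to spell out the effect on this path for the archive: hugetlb_unshare_pmds()
(and likewise the zap and mremap paths earlier in the file) now funnels
everything through a single mmu_gather, so a range covered by, say, hundreds of
shared PMD tables goes from one IPI broadcast per table down to one TLB flush
plus at most one IPI for the whole walk, issued by huge_pmd_unshare_flush()
while the i_mmap_rwsem is still held. Condensed from the hunks above, not
literal code:

	tlb_gather_mmu(&tlb, mm);
	for (address = start; address < end; address += PUD_SIZE) {
		...
		huge_pmd_unshare(&tlb, vma, address, ptep);	/* no per-table IPI */
		...
	}
	huge_pmd_unshare_flush(&tlb, vma);	/* one flush + at most one IPI */
	/* drop i_mmap_rwsem / hugetlb VMA lock */
	tlb_finish_mmu(&tlb);
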
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 247e3f9db6c7a..030a162a263ba 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -426,6 +426,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>  #endif
>  	tlb->vma_pfn = 0;
>
> +	tlb->fully_unshared_tables = 0;
>  	__tlb_reset_range(tlb);
>  	inc_tlb_flush_pending(tlb->mm);
>  }
> @@ -468,6 +469,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
>   */
>  void tlb_finish_mmu(struct mmu_gather *tlb)
>  {
> +	/*
> +	 * We expect an earlier huge_pmd_unshare_flush() call to sort this out,
> +	 * due to complicated locking requirements with page table unsharing.
> +	 */
> +	VM_WARN_ON_ONCE(tlb->fully_unshared_tables);
> +
>  	/*
>  	 * If there are parallel threads are doing PTE changes on same range
>  	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 283889e4f1cec..5c330e817129e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -652,7 +652,7 @@ long change_protection(struct mmu_gather *tlb,
>  #endif
>
>  	if (is_vm_hugetlb_page(vma))
> -		pages = hugetlb_change_protection(vma, start, end, newprot,
> +		pages = hugetlb_change_protection(tlb, vma, start, end, newprot,
>  				cp_flags);
>  	else
>  		pages = change_protection_range(tlb, vma, start, end, newprot,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 748f48727a162..d6799afe11147 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -76,7 +76,7 @@
>  #include
>  #include
>
> -#include
> +#include
>
>  #define CREATE_TRACE_POINTS
>  #include
> @@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * if unsuccessful.
>  			 */
>  			if (!anon) {
> +				struct mmu_gather tlb;
> +
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  				if (!hugetlb_vma_trylock_write(vma))
>  					goto walk_abort;
> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +
> +				tlb_gather_mmu(&tlb, mm);
> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>  					hugetlb_vma_unlock_write(vma);
> -					flush_tlb_range(vma,
> -							range.start, range.end);
> +					huge_pmd_unshare_flush(&tlb, vma);
> +					tlb_finish_mmu(&tlb);
>  					/*
>  					 * The PMD table was unmapped,
>  					 * consequently unmapping the folio.
> @@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  					goto walk_done;
>  				}
>  				hugetlb_vma_unlock_write(vma);
> +				tlb_finish_mmu(&tlb);
>  			}
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  			if (pte_dirty(pteval))
> @@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * fail if unsuccessful.
>  			 */
>  			if (!anon) {
> +				struct mmu_gather tlb;
> +
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  				if (!hugetlb_vma_trylock_write(vma)) {
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
>  				}
> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -					hugetlb_vma_unlock_write(vma);
> -					flush_tlb_range(vma,
> -							range.start, range.end);
>
> +				tlb_gather_mmu(&tlb, mm);
> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					huge_pmd_unshare_flush(&tlb, vma);
> +					tlb_finish_mmu(&tlb);
>  					/*
>  					 * The PMD table was unmapped,
>  					 * consequently unmapping the folio.
> @@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  					break;
>  				}
>  				hugetlb_vma_unlock_write(vma);
> +				tlb_finish_mmu(&tlb);
>  			}
>  			/* Nuke the hugetlb page table entry */
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> --
> 2.52.0
>
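
One last thought, mostly for completeness: the VM_WARN_ON_ONCE() in
tlb_finish_mmu() should make any future misuse easy to spot. A hypothetical
caller that skipped the explicit flush, e.g.:

	tlb_gather_mmu(&tlb, mm);
	huge_pmd_unshare(&tlb, vma, addr, ptep);
	i_mmap_unlock_write(vma->vm_file->f_mapping);	/* too early - table may get reused */
	tlb_finish_mmu(&tlb);	/* warns if a table became exclusive above */

would trip the warning exactly in the case where relying on the deferred flush
in tlb_finish_mmu() would be unsafe.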