From: Song Liu
To: Peter Zijlstra
CC: Song Liu, bpf, lkml, Linux-MM, "linux-modules@vger.kernel.org",
    Luis Chamberlain, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, Masami Hiramatsu, "naveen.n.rao@linux.ibm.com",
    "davem@davemloft.net", "anil.s.keshavamurthy@intel.com",
    "keescook@chromium.org", "hch@infradead.org", "dave@stgolabs.net",
    "daniel@iogearbox.net", Kernel Team, "x86@kernel.org",
    "dave.hansen@linux.intel.com", "rick.p.edgecombe@intel.com",
    "akpm@linux-foundation.org"
Subject: Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
Date: Wed, 13 Jul 2022 15:48:35 +0000
Message-ID: <7C927986-3665-4BD6-A339-D3FE4A71E3D4@fb.com>
References: <20220713071846.3286727-1-song@kernel.org> <20220713071846.3286727-2-song@kernel.org>

> On Jul 13, 2022, at 3:20 AM, Peter Zijlstra wrote:
>
> On Wed, Jul 13, 2022 at 12:18:44AM -0700, Song Liu wrote:
>> Dynamically allocated kernel texts, such as module texts, bpf programs,
>> and ftrace trampolines, are used in more and more scenarios. Currently,
>> these users allocate memory with module_alloc, fill the memory with text,
>> and then use set_memory_[ro|x] to protect the memory.
>>
>> This approach has two issues:
>> 1) each of these users occupies one or more RO+X pages, and thus one or
>> more entries in the page table and the iTLB;
>> 2) frequent allocate/free of RO+X pages causes fragmentation of kernel
>> direct map [1].
>>
>> BPF prog pack [2] addresses this from the BPF side. Now, make the same
>> logic available to other users of dynamic kernel text.
>>
>> The new API is like:
>>
>> void *vmalloc_exec(size_t size);
>> void vfree_exec(void *addr, size_t size);
>>
>> vmalloc_exec has different handling for small and big allocations
>> (> PMD_SIZE * num_possible_nodes). Bigger allocations have dedicated
>> vmalloc allocation; while small allocations share a vmalloc_exec_pack
>> with other allocations.
>>
>> Once allocated, the vmalloc_exec_pack is filled with invalid instructions
>
> *sigh*, again, INT3 is a *VALID* instruction.

I am fully aware "invalid" or "illegal" is not accurate, but I am not
sure what to use. Shall we call them "safe" instructions?

>
>> and protected with RO+X. Some text_poke feature is required to make
>> changes to the vmalloc_exec_pack. Therefore, vmalloc_exec requires changes
>> from the arch (to provide text_poke family APIs), and the user (to use
>> text poke APIs to make any changes to the memory).
>
> I hate the naming; this isn't just vmalloc, this is a whole different
> allocator built on top of things.
>
> I'm also not convinced this is the right way to go about doing this;
> much of the design here is because of how the module range is mixing
> text and data and working around that.

Hmm... I am not sure mixed data/text is the only problem here.

>
> So how about instead we separate them? Then much of the problem goes
> away, you don't need to track these 2M chunks at all.

If we manage the memory in < 2MiB granularity, either 4kB or smaller,
we still need some way to track which parts are being used, no? I mean
the bitmap.

>
> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the highest (rightmost).
>
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
>
> This gets you bottom of module range is RO+X only, top is shattered
> between different !X types.
>
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
>
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
>
> Then make sure any extension of the X range is 2M aligned.
>
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot and nobody is tracking chunks.
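
To check that I read the steps above correctly, here is a rough
user-space model of the layout (not kernel code). Everything in it is
an assumption made for illustration: struct mod_range, alloc_text()
and alloc_data() are made-up names, and data is rounded to 4kB pages
only to keep the arithmetic simple. Text grows up from the bottom, data
grows down from the top, and the X boundary only ever moves in 2M steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4K 4096UL
#define SZ_2M (2UL << 20)
#define ALIGN_UP(x, a) (((x) + ((uintptr_t)(a) - 1)) & ~((uintptr_t)(a) - 1))

/* Hypothetical model of the module range; not an existing kernel struct. */
struct mod_range {
	uintptr_t start;     /* bottom of the range, 2M aligned */
	uintptr_t end;       /* top of the range (exclusive) */
	uintptr_t text_next; /* next free text byte, grows up */
	uintptr_t x_limit;   /* 2M-aligned end of the X region */
	uintptr_t data_low;  /* lowest data address handed out, grows down */
};

static void mod_range_init(struct mod_range *r, uintptr_t start, uintptr_t end)
{
	assert(start % SZ_2M == 0 && start < end);
	r->start = start;
	r->end = end;
	r->text_next = start;
	r->x_limit = start;
	r->data_low = end;
}

/* Bottom-up text allocation; returns 0 on failure. */
static uintptr_t alloc_text(struct mod_range *r, size_t size)
{
	uintptr_t addr = r->text_next;
	uintptr_t new_limit;

	if (size == 0 || size > r->data_low - addr)
		return 0;
	/* Any extension of the X range is 2M aligned. */
	new_limit = ALIGN_UP(addr + size, SZ_2M);
	if (new_limit > r->x_limit) {
		if (new_limit > r->data_low)
			return 0; /* would cross into the !X side */
		r->x_limit = new_limit;
	}
	r->text_next = addr + size;
	return addr;
}

/* Top-down data allocation in 4kB pages; returns 0 on failure. */
static uintptr_t alloc_data(struct mod_range *r, size_t size)
{
	size = ALIGN_UP(size, SZ_4K);
	if (size == 0 || size > r->data_low - r->x_limit)
		return 0; /* would cross into the X side */
	r->data_low -= size;
	return r->data_low;
}
```

In this model the only state is the two cursors and the X limit, but
nothing is ever freed. Reusing holes inside the X region is the part
where I think some usage tracking is still needed.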
>
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.

If we have the text/data separation, can we just put text after _etext?
Right now, we allocate huge pages for _stext to round_down(_etext, 2MB),
and 4kB pages for round_down(_etext, 2MB) to round_up(_etext, 4kB). To
make this more efficient, we can allocate huge pages for _stext to
round_up(_etext, 2MB), and use _etext to round_up(_etext, 2MB) as the
first pool of memory for module_alloc_text(). Once we have used all the
memory there, we allocate more huge pages after round_up(_etext, 2MB).

I am not sure how to make this work, but I guess this is similar to
the idea you are describing here? However, we will need some bitmap
to track the usage of these memory pools, right?

Thanks,
Song
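
P.S. To make the rounding and the bitmap question concrete, below is a
rough user-space sketch (again not kernel code). struct text_pool and
its helpers are hypothetical names, the pool starts at
round_up(_etext, 4kB) rather than at _etext itself, and the 4kB
granularity is an assumption; a finer granularity would only change the
bit size. It computes the slack between _etext and round_up(_etext, 2MB)
and tracks it with one bit per page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SZ_4K 4096UL
#define SZ_2M (2UL << 20)
#define ROUND_DOWN(x, a) ((x) & ~((uintptr_t)(a) - 1))
#define ROUND_UP(x, a)   ROUND_DOWN((x) + ((uintptr_t)(a) - 1), a)

/* Hypothetical pool covering round_up(etext, 4kB) .. round_up(etext, 2MB). */
struct text_pool {
	uintptr_t base;                  /* first page of the pool */
	size_t nr_pages;                 /* number of 4kB pages in the pool */
	uint8_t used[SZ_2M / SZ_4K / 8]; /* one bit per page, 512 bits */
};

static void text_pool_init(struct text_pool *p, uintptr_t etext)
{
	p->base = ROUND_UP(etext, SZ_4K);
	p->nr_pages = (ROUND_UP(etext, SZ_2M) - p->base) / SZ_4K;
	memset(p->used, 0, sizeof(p->used));
}

static bool pool_test(const struct text_pool *p, size_t i)
{
	return (p->used[i / 8] >> (i % 8)) & 1;
}

static void pool_assign(struct text_pool *p, size_t i, bool v)
{
	if (v)
		p->used[i / 8] |= (uint8_t)(1u << (i % 8));
	else
		p->used[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* First-fit search for a run of free pages; returns 0 on failure. */
static uintptr_t text_pool_alloc(struct text_pool *p, size_t size)
{
	size_t need = (size + SZ_4K - 1) / SZ_4K;
	size_t run = 0;

	if (need == 0)
		return 0;
	for (size_t i = 0; i < p->nr_pages; i++) {
		run = pool_test(p, i) ? 0 : run + 1;
		if (run == need) {
			size_t first = i + 1 - need;

			for (size_t j = first; j <= i; j++)
				pool_assign(p, j, true);
			return p->base + first * SZ_4K;
		}
	}
	return 0;
}

/* Clear the bits covering [addr, addr + size). */
static void text_pool_free(struct text_pool *p, uintptr_t addr, size_t size)
{
	size_t first = (addr - p->base) / SZ_4K;
	size_t n = (size + SZ_4K - 1) / SZ_4K;

	assert(addr >= p->base && first + n <= p->nr_pages);
	for (size_t j = first; j < first + n; j++)
		pool_assign(p, j, false);
}
```

For example, with _etext at 0x1234567 the pool starts at 0x1235000 and
ends at 0x1400000, which is 459 pages. The free path is where the
bitmap earns its keep: without it there is no way to hand a released
page to the next caller.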