From: Song Liu
To: "Edgecombe, Rick P"
CC: "rppt@kernel.org", "mcgrof@kernel.org",
 "linux-kernel@vger.kernel.org", "bpf@vger.kernel.org",
 "hch@infradead.org", "ast@kernel.org", "daniel@iogearbox.net",
 "Torvalds, Linus", "linux-mm@kvack.org", "song@kernel.org",
 Kernel Team, "pmladek@suse.com", "akpm@linux-foundation.org",
 "hpa@zytor.com", "dborkman@redhat.com", "edumazet@google.com",
 "bp@alien8.de", "mbenes@suse.cz", "imbrenda@linux.ibm.com"
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
Date: Tue, 19 Apr 2022 05:36:45 +0000
References: <20220415164413.2727220-1-song@kernel.org>
 <4AD023F9-FBCE-4C7C-A049-9292491408AA@fb.com>
 <88eafc9220d134d72db9eb381114432e71903022.camel@intel.com>
In-Reply-To: <88eafc9220d134d72db9eb381114432e71903022.camel@intel.com>

Hi Mike, Luis, and Rick,

Thanks for sharing your work and findings in the space.
I didn't realize we were looking at the same set of problems.

> On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P wrote:
>
> On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
>>> There are use-cases that require 4K pages with non-default
>>> permissions in the direct map, and the pages should not
>>> necessarily be executable. There were several suggestions to
>>> implement caches of 4K pages backed by 2M pages.
>>
>> Even if we just focus on the executable side of the story... there
>> may be users who can share this too.
>>
>> I've gone down memory lane now at least down to year 2005 in kprobes
>> to see why the heck module_alloc() was used. At first glance there
>> are some old comments about being within the 2 GiB text kernel
>> range... But some old tribal knowledge is still lost. The real hints
>> come from kprobe work since commit 9ec4b1f356b3 ("[PATCH] kprobes:
>> fix single-step out of line - take2"), so that the "For the
>> %rip-relative displacement fixups to be doable"... but this got me
>> wondering, would other users who *do* want similar functionality
>> benefit from a cache. If the space is limited then using a cache
>> makes sense. Especially if architectures tend to require hacks for
>> some of this to all work.
>
> Yea, that was my understanding. X86 modules have to be linked within
> 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> to be within 2GB of the kernel text.
>
>
> I think of two types of caches we could have: caches of unmapped pages
> on the direct map and caches of virtual memory mappings. Caches of
> pages on the direct map reduce breakage of the large pages (a
> somewhat x86-specific problem). Caches of virtual memory mappings
> reduce shootdowns, and are also required to share huge pages.
> I'll plug my old RFC, where I tried to work towards enabling both:
>
> https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
>
> Since then Mike has taken the direct map cache piece a lot further.

This is really interesting work. Once this lands, we won't need
the bpf_prog_pack work at all (I think). OTOH, this looks like a
long-term project, as some of the work in bpf_prog_pack took quite
some time to discuss/debate, and that was just a subset of the
whole thing.

I really like the concept of two types of caches. But there are some
details I cannot figure out about them:

1. Is "caches of unmapped pages on direct map" (cache #1)
   sufficient to fix all direct map fragmentation? IIUC, pages in
   the cache may still be used by other allocations (under some
   memory pressure). If the system runs for long enough, there
   may be a lot of direct map fragmentation. Is this right?
2. If we have "cache of virtual memory mappings" (cache #2), do we
   still need cache #1? I know cache #2 alone may waste some
   memory, but I still think 2MB is within the noise for modern
   systems.
3. If we do need both caches, what would be the right APIs?

Thanks,
Song

> Yea, probably a lot of JITs are way smaller than a page, but there is
> also hopefully some performance benefit of reduced ITLB pressure and
> TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
> its own cache of a page for putting very small trampolines.
>
>>
>> Then, since it seems the vmalloc area was not initialized,
>> wouldn't that break the old JIT spray fixes, refer to commit
>> 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying
>> attacks")?
>
> Hmm, yea it might be a way to get around the ebpf jit rlimit. The
> allocator could just text_poke() invalid instructions on "free" of the
> jit.
>
>>
>> Is that sort of work not needed anymore?
>> If in doubt I at least made the old proof of concept JIT spray
>> stuff compile on recent kernels [0], but I haven't tried out your
>> patches yet. If this is not needed anymore, why not?
>
> IIRC this got addressed in two ways, randomizing of the jit offset
> inside the vmalloc allocation, and "constant blinding", such that the
> specific attack of inserting unaligned instructions as immediate
> instruction data did not work. Neither of those mitigations seems
> unworkable with a large page caching allocator.
>
>>
>> The collection of tribal knowledge around these sorts of things
>> would be good to not lose and if we can share, even better.
>
> Totally agree here. I think the abstraction I was exploring in that RFC
> could remove some of the special permission memory tribal knowledge
> that is lurking in the cross-arch module.c. I wonder if you have any
> thoughts on something like that? The normal modules proved the hardest.
>
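
For illustration only, two ideas from this thread can be sketched in
userspace C: handing out 4K chunks from a single 2M region, and
overwriting a chunk with trapping bytes when it is freed. This is not
kernel code and not the bpf_prog_pack implementation. The names
struct pack, pack_init(), pack_alloc() and pack_free() are
hypothetical, aligned_alloc() stands in for a huge-page-backed
mapping, and memset() with 0xCC (the x86 int3 opcode) stands in for
text_poke() writing invalid instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PACK_SIZE   (2u * 1024 * 1024)        /* one 2M region */
#define CHUNK_SIZE  4096u                     /* 4K allocation unit */
#define NCHUNKS     (PACK_SIZE / CHUNK_SIZE)  /* 512 chunks per region */
#define POISON_BYTE 0xCC                      /* assumption: x86 int3 as filler */

/* Hypothetical toy structure: one 2M region plus a per-chunk "in use" map. */
struct pack {
	uint8_t *base;
	uint8_t used[NCHUNKS];
};

/* Allocate the 2M-aligned backing region and poison all of it up front,
 * so no chunk ever exposes uninitialized bytes. Returns 0 on success. */
static int pack_init(struct pack *p)
{
	p->base = aligned_alloc(PACK_SIZE, PACK_SIZE);
	if (!p->base)
		return -1;
	memset(p->base, POISON_BYTE, PACK_SIZE);
	memset(p->used, 0, sizeof(p->used));
	return 0;
}

/* First-fit: return the lowest free 4K chunk, or NULL if the region is full. */
static void *pack_alloc(struct pack *p)
{
	for (size_t i = 0; i < NCHUNKS; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			return p->base + (size_t)i * CHUNK_SIZE;
		}
	}
	return NULL;
}

/* Re-poison the chunk before marking it free, so stale "code" cannot be
 * reached through a later allocation of the same chunk. */
static void pack_free(struct pack *p, void *ptr)
{
	size_t off = (size_t)((uint8_t *)ptr - p->base);

	assert(off < PACK_SIZE && off % CHUNK_SIZE == 0);
	assert(p->used[off / CHUNK_SIZE]);
	memset(ptr, POISON_BYTE, CHUNK_SIZE);
	p->used[off / CHUNK_SIZE] = 0;
}
```

A caller would run pack_init() once, then pair each pack_alloc() with
a pack_free(). Poisoning on both init and free is the design choice
that matters here: a chunk that is free never holds leftover bytes
from a previous user, which is the property the JIT spray question
above is concerned with.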