Date: Mon, 26 Jun 2023 13:31:18 +0100
From: Mark Rutland
To: Mike Rapoport
Cc: Andy Lutomirski, Kees Cook, Linux Kernel Mailing List, Andrew Morton,
 Catalin Marinas, Christophe Leroy, "David S. Miller", Dinh Nguyen,
 Heiko Carstens, Helge Deller, Huacai Chen, Kent Overstreet,
 Luis Chamberlain, Michael Ellerman, Nadav Amit, "Naveen N. Rao",
 Palmer Dabbelt, Puranjay Mohan, Rick P Edgecombe,
 "Russell King (Oracle)", Song Liu, Steven Rostedt, Thomas Bogendoerfer,
 Thomas Gleixner, Will Deacon, bpf@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
 linux-mm@kvack.org, linux-modules@vger.kernel.org,
 linux-parisc@vger.kernel.org, linux-riscv@lists.infradead.org,
 linux-s390@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
 netdev@vger.kernel.org, sparclinux@vger.kernel.org,
 the arch/x86 maintainers
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
References: <20230616085038.4121892-1-rppt@kernel.org>
 <20230616085038.4121892-3-rppt@kernel.org>
 <20230618080027.GA52412@kernel.org>
 <20230625161417.GK52412@kernel.org>
In-Reply-To: <20230625161417.GK52412@kernel.org>

On Sun, Jun 25, 2023 at 07:14:17PM +0300, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> >
> > On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> > From: "Mike Rapoport (IBM)"
> > >> >
> > >> > module_alloc() is used everywhere as a means to allocate memory for code.
> > >> >
> > >> > Besides being semantically wrong, this unnecessarily ties all subsystems
> > >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> > and puts the burden of code allocation on the modules code.
> > >> >
> > >> > Several architectures override module_alloc() because of various
> > >> > constraints where the executable memory can be located and this causes
> > >> > additional obstacles for improvements of code allocation.
> > >> >
> > >> > Start splitting code allocation from modules by introducing
> > >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > >> >
> > >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >
> > >> > The intended semantics for new allocation APIs:
> > >> >
> > >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >   code that is restricted by relative addressing.
> > >> >
> > >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >   when there are no restrictions for the code placement. For
> > >> >   architectures that require that any code is within a certain distance
> > >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > >> >   execmem_text_alloc().
> > >> >
> > >>
> > >> Is there anything in this series to help users do the appropriate
> > >> synchronization when they actually populate the allocated memory with
> > >> code? See here, for example:
> > >
> > > This series only factors out the executable allocations from modules and
> > > puts them in a central place.
> > > Anything else would go on top after this lands.
> >
> > Hmm.
> >
> > On the one hand, there's nothing wrong with factoring out common code.
> > On the other hand, this is probably the right time to at least start
> > thinking about synchronization, at least to the extent that it might make
> > us want to change this API. (I'm not at all saying that this series
> > should require changes -- I'm just saying that this is a good time to
> > think about how this should work.)
> >
> > The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > look like the one thing in the Linux ecosystem that actually
> > intelligently and efficiently maps new text into an address space:
> > mmap().
> >
> > On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > and jump to it with minimal synchronization (just the standard implicit
> > ordering in the kernel that populates the pages before setting up the
> > PTEs and whatever user synchronization is needed to avoid jumping into
> > the mapping before mmap() finishes). It works across CPUs, and the only
> > possible way userspace can screw it up (for a read-only mapping of
> > read-only text, anyway) is to jump to the mapping too early, in which
> > case userspace gets a page fault. Incoherence is impossible, and no one
> > needs to "serialize" (in the SDM sense).
> >
> > I think the same sequence (from userspace's perspective) works on other
> > architectures, too, although I think more cache management is needed on
> > the kernel's end. As far as I know, no Linux SMP architecture needs an
> > IPI to map executable text into usermode, but I could easily be wrong.
> > (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > remember the details.)
> >
> > Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > rather fraught, and I bet many things do it wrong when userspace is
> > multithreaded. (But not in production because it's mostly not used in
> > production.)
> >
> > But jit_text_alloc() can't do this, because the order of operations
> > doesn't match.
> > With jit_text_alloc(), the executable mapping shows up
> > before the text is populated, so there is no atomic change from not-there
> > to populated-and-executable. Which means that there is an opportunity
> > for CPUs, speculatively or otherwise, to start filling various caches
> > with intermediate states of the text, which means that various
> > architectures (even x86!) may need serialization.
> >
> > For eBPF- and module-like use cases, where JITting/code gen is quite
> > coarse-grained, perhaps something vaguely like:
> >
> > jit_text_alloc() -> returns a handle and an executable virtual address,
> >                     but does *not* map it there
> > jit_text_write() -> write to that handle
> > jit_text_map() -> map it and synchronize if needed (no sync needed on
> >                   x86, I think)
> >
> > could be more efficient and/or safer.
> >
> > (Modules could use this too. Getting alternatives right might take some
> > fiddling, because off the top of my head, this doesn't match how it works
> > now.)
> >
> > To make alternatives easier, this could work, maybe (haven't fully
> > thought it through):
> >
> > jit_text_alloc()
> > jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> >
> > write the text and apply alternatives
> >
> > jit_text_finalize() -> change from RW to RX *and synchronize*
> >
> > jit_text_finalize() would either need to wait for RCU (possibly extra
> > heavy weight RCU to get "serialization") or send an IPI.
>
> This is essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
>
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make the code allocator able to divide that
> page among several callers.
>
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
>
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g. int3 on
> x86)

Does that work on x86?

That is in no way guaranteed for other architectures; on arm64 you need
explicit cache maintenance (with I-cache maintenance at the VA to be executed
from) followed by context-synchronization-events (e.g. via ISB instructions, or
IPIs).

Mark.
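
As a rough illustration of the ordering discussed above, here is a minimal
userspace sketch in C. It is not kernel code and it is not any of the proposed
jit_text_*() interfaces. It writes a tiny function into RW memory, flips the
mapping to RX with mprotect(), and calls the compiler builtin
__builtin___clear_cache() before jumping to the new text. On arm64 that builtin
typically performs the cache maintenance by VA followed by an ISB on the
calling CPU; on x86 it typically compiles to nothing. The name
run_generated_code() and the hard-coded "return 42" instruction bytes are
assumptions made up for this example, and the sketch is single-threaded, so it
says nothing about synchronizing other CPUs.

```c
#define _DEFAULT_SOURCE          /* MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative machine code for "return 42", selected per architecture. */
#if defined(__x86_64__)
static const unsigned char code[] = {
	0xb8, 0x2a, 0x00, 0x00, 0x00,	/* mov eax, 42 */
	0xc3				/* ret */
};
#elif defined(__aarch64__)
static const unsigned char code[] = {
	0x40, 0x05, 0x80, 0x52,		/* mov w0, #42 */
	0xc0, 0x03, 0x5f, 0xd6		/* ret */
};
#else
#error "this sketch only covers x86-64 and arm64"
#endif

typedef int (*gen_fn)(void);

/* Hypothetical helper: returns the generated function's result, -1 on error. */
int run_generated_code(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	gen_fn fn;
	void *mem;
	int ret;

	assert(sizeof(code) <= len);

	/* 1. Allocate the memory writable and *not* executable. */
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	/* 2. Populate the text while nothing can execute it. */
	memcpy(mem, code, sizeof(code));

	/* 3. Switch the mapping from RW to RX. */
	if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
		munmap(mem, len);
		return -1;
	}

	/*
	 * 4. Make the new instructions visible to instruction fetch on this
	 *    CPU, at the address they will be executed from.
	 */
	__builtin___clear_cache((char *)mem, (char *)mem + sizeof(code));

	/* 5. Only now jump to the text (copy avoids an object-to-function cast). */
	memcpy(&fn, &mem, sizeof(fn));
	ret = fn();

	munmap(mem, len);
	return ret;
}
```

Calling run_generated_code() is expected to return 42 on x86-64 and arm64
Linux; other architectures are rejected at compile time.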