Date: Sun, 25 Jun 2023 20:42:57 +0300
From: Mike Rapoport
To: Andy Lutomirski
Cc: Mark Rutland, Kees Cook, Linux Kernel Mailing List, Andrew Morton,
    Catalin Marinas, Christophe Leroy, "David S. Miller", Dinh Nguyen,
    Heiko Carstens, Helge Deller, Huacai Chen, Kent Overstreet,
    Luis Chamberlain, Michael Ellerman, Nadav Amit, "Naveen N. Rao",
    Palmer Dabbelt, Puranjay Mohan, Rick P Edgecombe,
    "Russell King (Oracle)", Song Liu, Steven Rostedt,
    Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
    bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-mips@vger.kernel.org, linux-mm@kvack.org,
    linux-modules@vger.kernel.org, linux-parisc@vger.kernel.org,
    linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
    linux-trace-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
    loongarch@lists.linux.dev, netdev@vger.kernel.org,
    sparclinux@vger.kernel.org, the arch/x86 maintainers
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Message-ID: <20230625174257.GL52412@kernel.org>
References: <20230616085038.4121892-1-rppt@kernel.org>
    <20230616085038.4121892-3-rppt@kernel.org>
    <20230618080027.GA52412@kernel.org>
    <20230625161417.GK52412@kernel.org>
    <90161ac9-3ca0-4c72-b1c4-ab1293e55445@app.fastmail.com>
In-Reply-To: <90161ac9-3ca0-4c72-b1c4-ab1293e55445@app.fastmail.com>

On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
>
>
> On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> >>
> >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> >> > From: "Mike Rapoport (IBM)"
> >> >> >
> >> >> > module_alloc() is used everywhere as a means to allocate memory for code.
> >> >> >
> >> >> > Besides being semantically wrong, this unnecessarily ties all subsystems
> >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> >> > and puts the burden of code allocation to the modules code.
> >> >> >
> >> >> > Several architectures override module_alloc() because of various
> >> >> > constraints where the executable memory can be located and this causes
> >> >> > additional obstacles for improvements of code allocation.
> >> >> >
> >> >> > Start splitting code allocation from modules by introducing
> >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >> >
> >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >> >
> >> >> > The intended semantics for new allocation APIs:
> >> >> >
> >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >> >   code that is restricted by relative addressing.
> >> >> >
> >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >> >   when there are no restrictions for the code placement. For
> >> >> >   architectures that require that any code is within certain distance
> >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >> >   execmem_text_alloc().
> >> >> >
> >> >>
> >> >> Is there anything in this series to help users do the appropriate
> >> >> synchronization when they actually populate the allocated memory with
> >> >> code?
> >> >> See here, for example:
> >> >
> >> > This series only factors out the executable allocations from modules and
> >> > puts them in a central place.
> >> > Anything else would go on top after this lands.
> >>
> >> Hmm.
> >>
> >> On the one hand, there's nothing wrong with factoring out common code. On
> >> the other hand, this is probably the right time to at least start
> >> thinking about synchronization, at least to the extent that it might make
> >> us want to change this API. (I'm not at all saying that this series
> >> should require changes -- I'm just saying that this is a good time to
> >> think about how this should work.)
> >>
> >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> >> look like the one thing in the Linux ecosystem that actually
> >> intelligently and efficiently maps new text into an address space:
> >> mmap().
> >>
> >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> >> and jump to it with minimal synchronization (just the standard implicit
> >> ordering in the kernel that populates the pages before setting up the
> >> PTEs and whatever user synchronization is needed to avoid jumping into
> >> the mapping before mmap() finishes). It works across CPUs, and the only
> >> possible way userspace can screw it up (for a read-only mapping of
> >> read-only text, anyway) is to jump to the mapping too early, in which
> >> case userspace gets a page fault. Incoherence is impossible, and no one
> >> needs to "serialize" (in the SDM sense).
> >>
> >> I think the same sequence (from userspace's perspective) works on other
> >> architectures, too, although I think more cache management is needed on
> >> the kernel's end. As far as I know, no Linux SMP architecture needs an
> >> IPI to map executable text into usermode, but I could easily be wrong.
> >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> >> remember the details.)
> >>
> >> (Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> >> rather fraught, and I bet many things do it wrong when userspace is
> >> multithreaded. But not in production because it's mostly not used in
> >> production.)
> >>
> >> But jit_text_alloc() can't do this, because the order of operations
> >> doesn't match. With jit_text_alloc(), the executable mapping shows up
> >> before the text is populated, so there is no atomic change from not-there
> >> to populated-and-executable. Which means that there is an opportunity
> >> for CPUs, speculatively or otherwise, to start filling various caches
> >> with intermediate states of the text, which means that various
> >> architectures (even x86!) may need serialization.
> >>
> >> For eBPF- and module- like use cases, where JITting/code gen is quite
> >> coarse-grained, perhaps something vaguely like:
> >>
> >> jit_text_alloc() -> returns a handle and an executable virtual address,
> >>     but does *not* map it there
> >> jit_text_write() -> write to that handle
> >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> >>     x86, I think)
> >>
> >> could be more efficient and/or safer.
> >>
> >> (Modules could use this too. Getting alternatives right might take some
> >> fiddling, because off the top of my head, this doesn't match how it works
> >> now.)
> >>
> >> To make alternatives easier, this could work, maybe (haven't fully
> >> thought it through):
> >>
> >> jit_text_alloc()
> >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> >>
> >> write the text and apply alternatives
> >>
> >> jit_text_finalize() -> change from RW to RX *and synchronize*
> >>
> >> jit_text_finalize() would either need to wait for RCU (possibly extra
> >> heavy weight RCU to get "serialization") or send an IPI.
> >
> > This is essentially how modules work now. The memory is allocated RW, written
> > and updated with alternatives and then made ROX in the end with set_memory
> > APIs.
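
For illustration only, here is a rough userspace analogue of the sequence
described above (allocate RW, write, then flip to read+execute). It assumes
Linux/POSIX mmap() and mprotect(); the sketch_* helper names are hypothetical
and are not part of this series or of the kernel's set_memory APIs. It also
does nothing about the cross-CPU synchronization raised earlier in the thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate len bytes mapped read-write, not executable. */
static void *sketch_alloc_rw(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: copy code in while the mapping is still writable. */
static void sketch_write(void *dst, const void *code, size_t len)
{
	assert(dst != NULL && code != NULL);
	memcpy(dst, code, len);
}

/*
 * Hypothetical helper: flip the mapping from RW to RX in one step.
 * Returns 0 on success and -1 on failure (for example when a security
 * policy forbids making anonymous memory executable).
 */
static int sketch_finalize_rx(void *p, size_t len)
{
	return mprotect(p, len, PROT_READ | PROT_EXEC);
}

/* Hypothetical helper: release the mapping. */
static void sketch_free(void *p, size_t len)
{
	munmap(p, len);
}
```

A caller would use sketch_alloc_rw(), then sketch_write(), then
sketch_finalize_rx(), and must not jump to the region before the last step
returns 0.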
> >
> > The issue with not having the memory mapped X when it's written is that we
> > cannot use large pages to map it. One of the goals is to have executable
> > memory mapped with large pages and make the code allocator able to divide
> > that page among several callers.
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g. int3 on
> > x86)
>
> Is this actually valid? In between int3 and real code, there’s a
> potential torn read of real code mixed up with 0xcc.

You mean while doing text poking?

> > To improve the performance of this process, we can write to a !X copy and
> > then text_poke it to the actual address in one go. This will require some
> > changes to get the alternatives right.
> >
> > --
> > Sincerely yours,
> > Mike.

-- 
Sincerely yours,
Mike.
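
To make the large-page cache idea a bit more concrete, here is a toy
userspace model, offered as a sketch only. Everything in it is an assumption
made for illustration: the toy_* names, the bump allocator, the 16-byte
alignment, a malloc()ed buffer standing in for a large page mapped ROX, and
memcpy() standing in for text poking. It does not model the torn-read
question asked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CACHE_SIZE (2u * 1024 * 1024) /* stands in for one 2 MiB large page */
#define TOY_TRAP_BYTE  0xcc               /* int3 opcode on x86 */

struct toy_text_cache {
	unsigned char *base; /* backing buffer; the real thing would be mapped ROX */
	size_t used;         /* bump-allocator offset into the buffer */
};

/* "Invalidate" a fresh cache by filling it with breakpoint bytes. */
static int toy_cache_init(struct toy_text_cache *c)
{
	c->base = malloc(TOY_CACHE_SIZE);
	if (!c->base)
		return -1;
	memset(c->base, TOY_TRAP_BYTE, TOY_CACHE_SIZE);
	c->used = 0;
	return 0;
}

/* Carve len bytes (rounded up to 16) out of the shared cache for one caller. */
static unsigned char *toy_text_alloc(struct toy_text_cache *c, size_t len)
{
	size_t aligned = (len + 15) & ~(size_t)15;
	unsigned char *p;

	if (len == 0 || aligned < len || aligned > TOY_CACHE_SIZE - c->used)
		return NULL;
	p = c->base + c->used;
	c->used += aligned;
	return p;
}

/* Stage the code in a private copy first, then publish it in one go. */
static int toy_text_update(unsigned char *dst, const void *code, size_t len)
{
	unsigned char *staging = malloc(len);

	if (!staging)
		return -1;
	memcpy(staging, code, len); /* fixups would be applied to this copy */
	memcpy(dst, staging, len);  /* stands in for text poking */
	free(staging);
	return 0;
}

static void toy_cache_destroy(struct toy_text_cache *c)
{
	free(c->base);
	c->base = NULL;
	c->used = 0;
}
```

In this model, bytes that no caller has written yet still read back as 0xcc,
which is the property the breakpoint fill is meant to provide.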