From: "Andy Lutomirski"
To: "Mike Rapoport"
Cc: "Mark Rutland", "Kees Cook", "Linux Kernel Mailing List",
 "Andrew Morton", "Catalin Marinas", "Christophe Leroy",
 "David S. Miller", "Dinh Nguyen", "Heiko Carstens", "Helge Deller",
 "Huacai Chen", "Kent Overstreet", "Luis Chamberlain",
 "Michael Ellerman", "Nadav Amit", "Naveen N. Rao", "Palmer Dabbelt",
 "Puranjay Mohan", "Rick P Edgecombe", "Russell King (Oracle)",
 "Song Liu", "Steven Rostedt", "Thomas Bogendoerfer", "Thomas Gleixner",
 "Will Deacon", bpf@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
 linux-mm@kvack.org, linux-modules@vger.kernel.org,
 linux-parisc@vger.kernel.org, linux-riscv@lists.infradead.org,
 linux-s390@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
 netdev@vger.kernel.org, sparclinux@vger.kernel.org,
 "the arch/x86 maintainers"
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Date: Sun, 25 Jun 2023 09:59:34 -0700
Message-Id: <90161ac9-3ca0-4c72-b1c4-ab1293e55445@app.fastmail.com>
In-Reply-To: <20230625161417.GK52412@kernel.org>
References: <20230616085038.4121892-1-rppt@kernel.org>
 <20230616085038.4121892-3-rppt@kernel.org>
 <20230618080027.GA52412@kernel.org>
 <20230625161417.GK52412@kernel.org>

On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
>>
>> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
>> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> >> > From: "Mike Rapoport (IBM)"
>> >> >
>> >> > module_alloc() is used everywhere as a means to allocate memory for code.
>> >> >
>> >> > Besides being semantically wrong, this unnecessarily ties all subsystems
>> >> > that need to allocate code, such as ftrace, kprobes and BPF, to modules
>> >> > and puts the burden of code allocation on the modules code.
>> >> >
>> >> > Several architectures override module_alloc() because of various
>> >> > constraints where the executable memory can be located and this causes
>> >> > additional obstacles for improvements of code allocation.
>> >> >
>> >> > Start splitting code allocation from modules by introducing
>> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >> >
>> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> >> > module_alloc() and execmem_free() and jit_free() are replacements of
>> >> > module_memfree() to allow updating all call sites to use the new APIs.
>> >> >
>> >> > The intended semantics for new allocation APIs:
>> >> >
>> >> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >> >   close to the kernel image, like loadable kernel modules and generated
>> >> >   code that is restricted by relative addressing.
>> >> >
>> >> > * jit_text_alloc() should be used to allocate memory for generated code
>> >> >   when there are no restrictions for the code placement. For
>> >> >   architectures that require that any code is within a certain distance
>> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >> >   execmem_text_alloc().
>> >> >
>> >>
>> >> Is there anything in this series to help users do the appropriate
>> >> synchronization when they actually populate the allocated memory with
>> >> code?  See here, for example:
>> >
>> > This series only factors out the executable allocations from modules and
>> > puts them in a central place.
>> > Anything else would go on top after this lands.
>>
>> Hmm.
>>
>> On the one hand, there's nothing wrong with factoring out common code. On
>> the other hand, this is probably the right time to at least start
>> thinking about synchronization, at least to the extent that it might make
>> us want to change this API.  (I'm not at all saying that this series
>> should require changes -- I'm just saying that this is a good time to
>> think about how this should work.)
>>
>> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
>> look like the one thing in the Linux ecosystem that actually
>> intelligently and efficiently maps new text into an address space:
>> mmap().
>>
>> On x86, you can mmap() an existing file full of executable code with
>> PROT_EXEC and jump to it with minimal synchronization (just the standard
>> implicit ordering in the kernel that populates the pages before setting
>> up the PTEs and whatever user synchronization is needed to avoid jumping
>> into the mapping before mmap() finishes).  It works across CPUs, and the
>> only possible way userspace can screw it up (for a read-only mapping of
>> read-only text, anyway) is to jump to the mapping too early, in which
>> case userspace gets a page fault.  Incoherence is impossible, and no one
>> needs to "serialize" (in the SDM sense).
>>
>> I think the same sequence (from userspace's perspective) works on other
>> architectures, too, although I think more cache management is needed on
>> the kernel's end.  As far as I know, no Linux SMP architecture needs an
>> IPI to map executable text into usermode, but I could easily be wrong.
>> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
>> remember the details.)
>>
>> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
>> rather fraught, and I bet many things do it wrong when userspace is
>> multithreaded.  (But not in production, because it's mostly not used in
>> production.)
>>
>> But jit_text_alloc() can't do this, because the order of operations
>> doesn't match.  With jit_text_alloc(), the executable mapping shows up
>> before the text is populated, so there is no atomic change from not-there
>> to populated-and-executable.  Which means that there is an opportunity
>> for CPUs, speculatively or otherwise, to start filling various caches
>> with intermediate states of the text, which means that various
>> architectures (even x86!) may need serialization.
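
The ordering point above can be illustrated with a rough userspace analogy
in C11. This is a sketch only, not kernel code: struct text_region,
publish_populated() and lookup() are hypothetical names made up for
illustration. It models just the data-visibility half of the story: the
region is filled first and published last with a release store, so a reader
sees either nothing (the analogue of a page fault) or the complete contents.
It says nothing about instruction-fetch coherence, which is the part that
needs architecture-specific care.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define REGION_SIZE 64

/* Hypothetical stand-in for a chunk of executable memory. */
struct text_region {
	unsigned char text[REGION_SIZE];
	size_t len;
};

/* NULL plays the role of "no PTE yet": a reader that sees it would fault. */
static _Atomic(struct text_region *) published_region;

/* mmap()-like order: populate first, make visible last. */
static void publish_populated(struct text_region *r,
			      const unsigned char *code, size_t len)
{
	assert(len <= REGION_SIZE);
	memcpy(r->text, code, len);
	r->len = len;
	/* Release: the stores above are ordered before the pointer store. */
	atomic_store_explicit(&published_region, r, memory_order_release);
}

/* Reader side: returns NULL ("fault") or a fully populated region. */
static const struct text_region *lookup(void)
{
	return atomic_load_explicit(&published_region, memory_order_acquire);
}
```

An allocator that hands out an already-visible region and fills it afterwards
has no equivalent of that single release store, which is the gap described
above.
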
>>
>> For eBPF- and module- like use cases, where JITting/code gen is quite
>> coarse-grained, perhaps something vaguely like:
>>
>> jit_text_alloc() -> returns a handle and an executable virtual address,
>>                     but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map()   -> map it and synchronize if needed (no sync needed on
>>                     x86, I think)
>>
>> could be more efficient and/or safer.
>>
>> (Modules could use this too.  Getting alternatives right might take some
>> fiddling, because off the top of my head, this doesn't match how it works
>> now.)
>>
>> To make alternatives easier, this could work, maybe (haven't fully
>> thought it through):
>>
>> jit_text_alloc()
>> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
>>
>> write the text and apply alternatives
>>
>> jit_text_finalize() -> change from RW to RX *and synchronize*
>>
>> jit_text_finalize() would either need to wait for RCU (possibly extra
>> heavy weight RCU to get "serialization") or send an IPI.
>
> This is essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
>
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make the code allocator able to divide
> that page among several callers.
>
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
>
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g. int3 on
> x86)

Is this actually valid?  In between int3 and real code, there’s a
potential torn read of real code mixed up with 0xcc.
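
To spell out the concern, here is a minimal single-threaded model in C. It
is a sketch only, not kernel code, and fill_int3(), write_prefix() and
count_int3() are hypothetical helpers made up for illustration. The region
starts out as all 0xcc, and a non-atomic writer is observed after it has
stored only the first k bytes of the new code. Whatever another CPU fetches
at that moment is a mix of both.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INT3 0xcc /* x86 breakpoint opcode used as the "invalid" filler */

/* "Invalidate" a region by filling it with breakpoint bytes. */
static void fill_int3(unsigned char *region, size_t len)
{
	memset(region, INT3, len);
}

/*
 * Model a writer that is observed after storing only the first k bytes
 * of the new code; the rest of the region still holds the filler.
 */
static void write_prefix(unsigned char *region, const unsigned char *code,
			 size_t len, size_t k)
{
	assert(k <= len);
	memcpy(region, code, k);
}

/* Count how many bytes of the region are equal to the filler value. */
static size_t count_int3(const unsigned char *region, size_t len)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++)
		n += (region[i] == INT3);
	return n;
}
```

For example, with the 5-byte encoding of "mov $0x2a,%eax" (b8 2a 00 00 00)
and k = 2, the region reads b8 2a cc cc cc, which decodes as a mov of
0xcccccc2a into %eax: valid-looking code that nobody wrote, rather than
either an int3 or the intended instruction.
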
>
> To improve the performance of this process, we can write to a !X copy and
> then text_poke it to the actual address in one go. This will require some
> changes to get the alternatives right.
>
> --
> Sincerely yours,
> Mike.