Date: Thu, 1 Dec 2022 22:23:14 +0200
From: Mike Rapoport
To: Thomas Gleixner
Cc: Song Liu, bpf@vger.kernel.org, linux-mm@kvack.org,
 peterz@infradead.org, akpm@linux-foundation.org, x86@kernel.org,
 hch@lst.de, rick.p.edgecombe@intel.com, aaron.lu@intel.com,
 mcgrof@kernel.org
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
References: <87v8mvsd8d.ffs@tglx>
In-Reply-To: <87v8mvsd8d.ffs@tglx>

On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
> Song!
>
> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner wrote:
> >> You are not making anything easier. You are violating the basic
> >> engineering principle of "Fix the root cause, not the symptom".
> >>
> >
> > I am not sure what is the root cause and the symptom here.
>
> The symptom is iTLB pressure. The root cause is the way how module
> memory is allocated, which in turn causes the fragmentation into
> 4k PTEs. That's the same problem for anything which uses module_alloc()
> to get space for text allocated, e.g. kprobes, tracing....

There's also dTLB pressure caused by the fragmentation of the direct map.
The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
but setting RO in the vmalloc address space also updates the direct map
alias and this causes splits of large pages.

It's not clear which of the two gives the larger performance improvement:
avoiding splits of large pages in the direct map or reducing iTLB pressure
by backing text memory with 2M pages.

If the major improvement comes from keeping the direct map intact, it
might be possible to mix data and text in the same 2M page.

> A module consists of:
>
>     - text sections
>     - data sections
>
> Except for PPC32, which has the module data in vmalloc space, all others
> allocate text and data sections in one lump.
>
> This en-bloc allocation is one reason for the 4k splits:
>
>     - text is RX
>     - data is RW or RO
>
> Truly vmalloc'ed module data is not an option for 64bit architectures
> which use PC relative addressing as vmalloc does not guarantee that the
> data ends up within the limited displacement range (s32 on x86_64)
>
> This made me look at your allocator again:
>
> > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> > +#define EXEC_MEM_START MODULES_VADDR
> > +#define EXEC_MEM_END MODULES_END
> > +#else
> > +#define EXEC_MEM_START VMALLOC_START
> > +#define EXEC_MEM_END VMALLOC_END
> > +#endif
>
> The #else part is completely broken on x86/64 and any other
> architecture, which has PC relative restricted displacement.
>
> Even if modules are disabled in Kconfig the only safe place to allocate
> executable kernel text from (on these architectures) is the modules
> address space. The ISA restrictions do not go magically away when
> modules are disabled.
>
> In the early version of the SKX retbleed mitigation work I had
>
>   https://lore.kernel.org/all/20220716230953.442937066@linutronix.de
>
> exactly to handle this correctly for the !MODULE case. It went nowhere
> as we did not need the trampolines in the final version.
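
To make the displacement restriction quoted above concrete, here is a
minimal userspace sketch, not kernel code. The helper names
rel32_reachable() and rel32_encode() are hypothetical and exist only for
this illustration; the only assumption is that the displacement is a
signed 32-bit value measured from the address of the next instruction, as
for x86-64 rel32 branches and RIP-relative operands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): can `target` be reached from an
 * instruction whose successor sits at `next_ip` using a signed 32-bit
 * PC-relative displacement?
 */
static bool rel32_reachable(uint64_t next_ip, uint64_t target)
{
	/* Forward reach: at most INT32_MAX bytes. */
	if (target >= next_ip)
		return target - next_ip <= (uint64_t)INT32_MAX;

	/* Backward reach: at most 2^31 bytes (displacement INT32_MIN). */
	return next_ip - target <= (uint64_t)INT32_MAX + 1;
}

/*
 * Hypothetical helper: compute the displacement that would be encoded.
 * The caller must have placed `target` within range.
 */
static int32_t rel32_encode(uint64_t next_ip, uint64_t target)
{
	assert(rel32_reachable(next_ip, target));

	if (target >= next_ip)
		return (int32_t)(target - next_ip);

	/* The distance is at most 2^31 here, so the negation fits in s32. */
	return (int32_t)(-(int64_t)(next_ip - target));
}
```

With purely illustrative addresses, text at 0xffffffff81000000 can reach
0xffffffffa0000000 (a distance of 0x1f000000), while an address such as
0xffffc90000000000 is far outside the +/-2G window. That is the reason a
VMALLOC_START..VMALLOC_END fallback cannot hold text or data that has to
be referenced PC-relative from kernel text.
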
>
> This is why Peter suggested to 'split' the module address range into a
> top down and bottom up part:
>
>   https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>
> That obviously separates text and data, but keeps everything within the
> defined working range.
>
> It immediately solves the text problem for _all_ module_alloc() users
> and still leaves the data split into 4k pages due to RO/RW sections.
>
> But after staring at it for a while I think this top down and bottom up
> dance is too much effort for not much gain. The module address space is
> sized generously, so the straight forward solution is to split that
> space into two blocks and use them to allocate text and data separately.
>
> The rest of Peter's suggestions how to migrate there still apply.
>
> The init sections of a module are obviously separate as they are freed
> after the module is initialized, but they are not really special either.
> Today they leave holes in the address range. With the new scheme these
> holes will be in the memory backed large mapping, but I don't see a real
> issue with that, especially as those holes at least in text can be
> reused for small allocations (kprobes, trace, bpf).
>
> As a logical next step we make that three blocks and allocate text,
> data and rodata separately, which will preserve the large mappings for
> text and data. rodata still needs to be split because we need a space to
> accommodate ro_after_init data.
>
> Alternatively, instead of splitting the module address space, the
> allocation mechanism can keep track of the types (text, data, rodata)
> and manage large mapping blocks per type. There are pros and cons for
> both approaches, so that needs some thought.
>
> But at the end we want an allocation mechanism which:
>
>   - preserves large mappings
>   - handles a distinct address range
>   - is mapping type aware
>
> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
> go. See?

There is also

  - handles kaslr

and at least for arm and powerpc we'd also need

  - handles architecture specific range restrictions and fallbacks

> Thanks,
>
>         tglx

-- 
Sincerely yours,
Mike.