Date: Tue, 19 Apr 2022 21:42:17 +0300
From: Mike Rapoport <rppt@kernel.org>
To: Song Liu
Cc: "Edgecombe, Rick P", "mcgrof@kernel.org", "linux-kernel@vger.kernel.org",
	"bpf@vger.kernel.org", "hch@infradead.org", "ast@kernel.org",
	"daniel@iogearbox.net", "Torvalds, Linus", "linux-mm@kvack.org",
	"song@kernel.org", Kernel Team, "pmladek@suse.com",
	"akpm@linux-foundation.org", "hpa@zytor.com", "dborkman@redhat.com",
	"edumazet@google.com", "bp@alien8.de", "mbenes@suse.cz",
	"imbrenda@linux.ibm.com"
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
References: <20220415164413.2727220-1-song@kernel.org>
	<4AD023F9-FBCE-4C7C-A049-9292491408AA@fb.com>
	<88eafc9220d134d72db9eb381114432e71903022.camel@intel.com>

Hi,

On Tue, Apr 19, 2022 at 05:36:45AM +0000, Song Liu wrote:
> Hi Mike, Luis, and Rick,
>
> Thanks for sharing your work and findings in this space. I didn't
> realize we were looking at the same set of problems.
>
> > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P wrote:
> >
> > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
> >>> There are use-cases that require 4K pages with non-default
> >>> permissions in the direct map, and the pages don't necessarily
> >>> have to be executable. There were several suggestions to implement
> >>> caches of 4K pages backed by 2M pages.
> >>
> >> Even if we just focus on the executable side of the story... there
> >> may be users who can share this too.
> >>
> >> I've gone down memory lane now at least down to year 2005 in kprobes
> >> to see why the heck module_alloc() was used. At first glance there
> >> are some old comments about being within the 2 GiB text kernel
> >> range... But some old tribal knowledge is still lost. The real hints
> >> come from kprobe work since commit 9ec4b1f356b3 ("[PATCH] kprobes:
> >> fix single-step out of line - take2"), so that "the %rip-relative
> >> displacement fixups" would "be doable"... but this got me wondering,
> >> would other users who *do* want similar functionality benefit from a
> >> cache? If the space is limited then using a cache makes sense,
> >> especially if architectures tend to require hacks for some of this
> >> to all work.
> >
> > Yea, that was my understanding. X86 modules have to be linked within
> > 2GB of the kernel text, and the eBPF x86 JIT also generates code that
> > expects to be within 2GB of the kernel text.
> >
> > I think of two types of caches we could have: caches of unmapped
> > pages on the direct map and caches of virtual memory mappings.
> > Caches of pages on the direct map reduce breakage of the large pages
> > (and address a somewhat x86-specific problem). Caches of virtual
> > memory mappings reduce shootdowns, and are also required to share
> > huge pages.
> > I'll plug my old RFC, where I tried to work towards enabling both:
> >
> > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
> >
> > Since then Mike has taken the direct map cache piece a lot further.
>
> This is really interesting work. With this landed, we won't need the
> bpf_prog_pack work at all (I think). OTOH, this looks like a long-term
> project, as some of the work on bpf_prog_pack took quite some time to
> discuss/debate, and that was just a subset of the whole thing.

I'd say that bpf_prog_pack was a cure for the symptoms, while this
project tries to address the more general problem. But you are right,
it'll take some time and won't land in 5.19.

> I really like the two-types-of-cache concept. But there are some
> details I cannot figure out about them:

After some discussion we decided to try moving the caching of large
pages into the page allocator and see if the second cache would be
needed at all. But I got distracted after posting the RFC, and that
work hasn't made real progress since then.

> 1. Is "caches of unmapped pages on the direct map" (cache #1)
>    sufficient to fix all direct map fragmentation? IIUC, pages in
>    the cache may still be used by other allocations (under some
>    memory pressure). If the system runs for long enough, there may
>    still be a lot of direct map fragmentation. Is this right?

If the system runs long enough, it may run out of high-order free pages
regardless of how the caches are implemented. Then we either fail the
allocation because it is impossible to refill the cache with large
pages, or we fall back to 4k pages and fragment the direct map.

I don't see how we can avoid direct map fragmentation entirely and
still be able to allocate memory for users of the set_memory APIs.

> 2. If we have a "cache of virtual memory mappings" (cache #2), do we
>    still need cache #1? I know cache #2 alone may waste some memory,
>    but I still think 2MB is within the noise for modern systems.
I presume that by cache #1 you mean the cache in the page allocator. In
that case cache #2 is probably not needed at all, because the cache at
the page allocator level will be used by vmalloc() and friends to
provide what Rick called "permissioned allocations".

> Thanks,
> Song

-- 
Sincerely yours,
Mike.