Date: Thu, 6 Oct 2022 16:44:16 -0700
From: Luis Chamberlain
To: Song Liu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, x86@kernel.org, peterz@infradead.org, hch@lst.de, kernel-team@fb.com, rick.p.edgecombe@intel.com, dave.hansen@intel.com
Subject: Re: [RFC 4/5] vmalloc_exec: share a huge page with kernel text
References: <20220818224218.2399791-1-song@kernel.org> <20220818224218.2399791-5-song@kernel.org>
In-Reply-To: <20220818224218.2399791-5-song@kernel.org>

On Thu, Aug 18, 2022 at 03:42:17PM -0700, Song Liu wrote:
> On x86 kernel, we allocate 2MB pages for kernel text up to
> round_down(_etext, 2MB). Therefore, some of the kernel text is still
> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> round_up(_etext, 2MB), and use the rest of the page for modules and
> BPF programs.
>
> Here is an example:
>
> [root@eth50-1 ~]# grep _etext /proc/kallsyms
> ffffffff82202a08 T _etext
>
> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms | tail -n 3
> ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup [bpf]
> ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new [bpf]
> ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch [bpf]
>
> [root@eth50-1 ~]# grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
> 0xffffffff82200000-0xffffffff82400000 2M ro PSE x pmd
>
> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> ffffffff822ba910 t xfs_flush_inodes_worker [xfs]
> ffffffff822bc580 t xfs_flush_inodes [xfs]
>
> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, xfs
> module, and bpf programs.

This is pretty rad. I'm not sure how you were able to squeeze xfs and
*more* into one 2 MiB huge page, though: at least on Debian
5.17.0-1-amd64, xfs is 3.6847 MiB. How big is your XFS module?

I don't grok mm stuff, but I'd like to understand why we gain the
ability to reuse the same 2 MiB page with this patch; from the code I
really can't tell. Any pointers?

But I'm still concerned about the freeing case in terms of
fragmentation of contiguous memory when free huge pages are available.

  Luis

> ---
>  arch/x86/mm/init_64.c |  3 ++-
>  mm/vmalloc.c          | 27 +++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 39c5246964a9..d27d0af5beb5 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1367,12 +1367,13 @@ int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
>
>  int kernel_set_to_readonly;
>
> +#define PMD_ALIGN(x) (((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
>  void mark_rodata_ro(void)
>  {
>  	unsigned long start = PFN_ALIGN(_text);
>  	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>  	unsigned long end = (unsigned long)__end_rodata_hpage_align;
> -	unsigned long text_end = PFN_ALIGN(_etext);
> +	unsigned long text_end = PMD_ALIGN(_etext);
>  	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>  	unsigned long all_end;
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 472287e71bf1..5f3b5df9313f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
>  static const bool vmap_allow_huge = false;
>  #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
>
> +#define PMD_ALIGN(x) (((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
> +
> +static struct vm_struct text_tail_vm;
> +static struct vmap_area text_tail_va;
> +
>  bool is_vmalloc_addr(const void *x)
>  {
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
> @@ -634,6 +639,8 @@ int is_vmalloc_or_module_addr(const void *x)
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>  	if (addr >= MODULES_VADDR && addr < MODULES_END)
>  		return 1;
> +	if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
> +		return 1;
>  #endif
>  	return is_vmalloc_addr(x);
>  }
> @@ -2371,6 +2378,25 @@ static void vmap_init_free_space(void)
>  	}
>  }
>
> +static void register_text_tail_vm(void)
> +{
> +	unsigned long start = PFN_ALIGN(_etext);
> +	unsigned long end = PMD_ALIGN(_etext);
> +	struct vmap_area *va;
> +
> +	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> +	if (WARN_ON_ONCE(!va))
> +		return;
> +	text_tail_vm.addr = (void *)start;
> +	text_tail_vm.size = end - start;
> +	text_tail_vm.flags = VM_KERNEL_EXEC;
> +	text_tail_va.va_start = start;
> +	text_tail_va.va_end = end;
> +	text_tail_va.vm = &text_tail_vm;
> +	memcpy(va, &text_tail_va, sizeof(*va));
> +	insert_vmap_area(va, &free_text_area_root, &free_text_area_list);
> +}
> +
>  void __init vmalloc_init(void)
>  {
>  	struct vmap_area *va;
> @@ -2381,6 +2407,7 @@ void __init vmalloc_init(void)
>  	 * Create the cache for vmap_area objects.
>  	 */
>  	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> +	register_text_tail_vm();
>
>  	for_each_possible_cpu(i) {
>  		struct vmap_block_queue *vbq;
> --
> 2.30.2
>
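
For anyone following the arithmetic in the quoted patch, here is a minimal
user-space sketch of how the "text tail" range falls out of the two
alignments applied to _etext. It only mirrors the rounding done by
PFN_ALIGN(_etext) and PMD_ALIGN(_etext) in register_text_tail_vm(). The
sketch_* names and the hard-coded 4 KiB / 2 MiB sizes are assumptions made
for illustration; they are not the kernel's definitions.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed sizes: 4 KiB base pages and 2 MiB PMD-mapped huge pages. */
#define SKETCH_PAGE_SIZE UINT64_C(0x1000)
#define SKETCH_PMD_SIZE  UINT64_C(0x200000)

/* Round addr up to a multiple of size; size must be a power of two. */
static uint64_t sketch_align_up(uint64_t addr, uint64_t size)
{
	return (addr + size - 1) & ~(size - 1);
}

/*
 * Bytes between the 4 KiB-aligned end of kernel text and the end of the
 * 2 MiB page that contains it. This is the range the patch hands to
 * modules and BPF programs.
 */
static uint64_t sketch_text_tail_size(uint64_t etext)
{
	uint64_t start = sketch_align_up(etext, SKETCH_PAGE_SIZE); /* mirrors PFN_ALIGN(_etext) */
	uint64_t end = sketch_align_up(etext, SKETCH_PMD_SIZE);    /* mirrors PMD_ALIGN(_etext) */

	return end - start;
}

/* Print the tail range for a given _etext address (hypothetical helper). */
static void sketch_print_tail(uint64_t etext)
{
	uint64_t start = sketch_align_up(etext, SKETCH_PAGE_SIZE);
	uint64_t size = sketch_text_tail_size(etext);

	printf("tail: 0x%" PRIx64 "-0x%" PRIx64 " (%" PRIu64 " KiB)\n",
	       start, start + size, size / 1024);
}
```

Calling sketch_print_tail(UINT64_C(0xffffffff82202a08)) from a main()
function, with the _etext value from the quoted kallsyms output, gives a
tail of 0xffffffff82203000-0xffffffff82400000, i.e. 0x1fd000 bytes
(2036 KiB).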