From: Song Liu <songliubraving@fb.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Song Liu <song@kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"hch@lst.de" <hch@lst.de>, Kernel Team <Kernel-team@fb.com>,
	"rick.p.edgecombe@intel.com" <rick.p.edgecombe@intel.com>,
	"dave.hansen@intel.com" <dave.hansen@intel.com>
Subject: Re: [RFC 4/5] vmalloc_exec: share a huge page with kernel text
Date: Fri, 7 Oct 2022 06:53:49 +0000
Message-ID: <EC88843A-CA78-428C-A125-A3CDA01D657F@fb.com>
In-Reply-To: <Yz9oUDY6nj4V9z/O@bombadil.infradead.org>



> On Oct 6, 2022, at 4:44 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Thu, Aug 18, 2022 at 03:42:17PM -0700, Song Liu wrote:
>> On x86 kernel, we allocate 2MB pages for kernel text up to
>> round_down(_etext, 2MB). Therefore, some of the kernel text is still
>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>> round_up(_etext, 2MB), and use the rest of the page for modules and
>> BPF programs.
>> 
>> Here is an example:
>> 
>> [root@eth50-1 ~]# grep _etext /proc/kallsyms
>> ffffffff82202a08 T _etext
>> 
>> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
>> ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
>> ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
>> ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>> 
>> [root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
>> 0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd
>> 
>> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
>> ffffffff822bc580 t xfs_flush_inodes     [xfs]
>> 
>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, xfs
>> module, and bpf programs.
> 
> This is pretty rad. I'm not sure how you were able to squeeze xfs and
> *more* into one 2 MiB huge page though; at least on debian 5.17.0-1-amd64,
> xfs is 3.6847 MiB. How big is your XFS module?

In my build, xfs.ko is 50MB before strip, and 3.1MB after strip. But the
text section is only about 1.3MB, so it fits in one 2MB page.
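
For what it's worth, a quick way to check a module's text section size is
something like the following (a sketch; the path to xfs.ko depends on your
build tree):

    objdump -h xfs.ko | awk '$2 == ".text" { print $2, $3 }'

This prints the .text section size in hex. It is the .text size, not the
full .ko file size, that has to fit in the shared 2MB page.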

> 
> I don't grok mm stuff, but I'd like to understand why we gain the ability
> to re-use the same 2 MiB page with this patch; from the code I really
> can't tell. Any pointers?

I don't quite follow the question here. In this case, we allocate one more
2MB page, so that some static kernel text can use it, and share it with
dynamic kernel text. Does this answer your question?
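
To make the arithmetic concrete, here is a small userspace sketch (not
kernel code; PMD_ALIGN is copied from the patch, the PFN_ALIGN stand-in
assumes 4kB pages, and the _etext value is taken from the kallsyms output
above):

    #include <stdio.h>

    #define PMD_SIZE	(1UL << 21)	/* 2MB */
    #define PMD_MASK	(~(PMD_SIZE - 1))
    #define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
    #define PFN_ALIGN(x)	(((unsigned long)(x) + 0xfffUL) & ~0xfffUL)

    int main(void)
    {
    	/* _etext from the kallsyms output above */
    	unsigned long etext = 0xffffffff82202a08UL;

    	/* Static text ends at PFN_ALIGN(_etext); the rest of the 2MB
    	 * page, up to PMD_ALIGN(_etext), is handed to vmalloc_exec
    	 * for modules and BPF programs. */
    	printf("tail: 0x%lx-0x%lx (%lu kB for dynamic text)\n",
    	       PFN_ALIGN(etext), PMD_ALIGN(etext),
    	       (PMD_ALIGN(etext) - PFN_ALIGN(etext)) / 1024);
    	return 0;
    }

This prints 0xffffffff82203000-0xffffffff82400000, i.e. about 2036 kB of
the 2MB page left over for dynamic kernel text, which matches the page
table dump above.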

I am working on a newer version of this and plan to resend once it is
stable for BPF programs. For modules, I think we will need more
discussion about the interface with arch code.

Thanks,
Song

> 
> But I'm still concerned about the freeing case in terms of
> fragmentation for contiguous memory, when free huge pages are available.
> 
>  Luis
> 
>> ---
>> arch/x86/mm/init_64.c |  3 ++-
>> mm/vmalloc.c          | 27 +++++++++++++++++++++++++++
>> 2 files changed, 29 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 39c5246964a9..d27d0af5beb5 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1367,12 +1367,13 @@ int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
>> 
>> int kernel_set_to_readonly;
>> 
>> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
>> void mark_rodata_ro(void)
>> {
>> 	unsigned long start = PFN_ALIGN(_text);
>> 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>> 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
>> -	unsigned long text_end = PFN_ALIGN(_etext);
>> +	unsigned long text_end = PMD_ALIGN(_etext);
>> 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>> 	unsigned long all_end;
>> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 472287e71bf1..5f3b5df9313f 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
>> static const bool vmap_allow_huge = false;
>> #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
>> 
>> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
>> +
>> +static struct vm_struct text_tail_vm;
>> +static struct vmap_area text_tail_va;
>> +
>> bool is_vmalloc_addr(const void *x)
>> {
>> 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>> @@ -634,6 +639,8 @@ int is_vmalloc_or_module_addr(const void *x)
>> 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>> 	if (addr >= MODULES_VADDR && addr < MODULES_END)
>> 		return 1;
>> +	if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
>> +		return 1;
>> #endif
>> 	return is_vmalloc_addr(x);
>> }
>> @@ -2371,6 +2378,25 @@ static void vmap_init_free_space(void)
>> 	}
>> }
>> 
>> +static void register_text_tail_vm(void)
>> +{
>> +	unsigned long start = PFN_ALIGN(_etext);
>> +	unsigned long end = PMD_ALIGN(_etext);
>> +	struct vmap_area *va;
>> +
>> +	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
>> +	if (WARN_ON_ONCE(!va))
>> +		return;
>> +	text_tail_vm.addr = (void *)start;
>> +	text_tail_vm.size = end - start;
>> +	text_tail_vm.flags = VM_KERNEL_EXEC;
>> +	text_tail_va.va_start = start;
>> +	text_tail_va.va_end = end;
>> +	text_tail_va.vm = &text_tail_vm;
>> +	memcpy(va, &text_tail_va, sizeof(*va));
>> +	insert_vmap_area(va, &free_text_area_root, &free_text_area_list);
>> +}
>> +
>> void __init vmalloc_init(void)
>> {
>> 	struct vmap_area *va;
>> @@ -2381,6 +2407,7 @@ void __init vmalloc_init(void)
>> 	 * Create the cache for vmap_area objects.
>> 	 */
>> 	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
>> +	register_text_tail_vm();
>> 
>> 	for_each_possible_cpu(i) {
>> 		struct vmap_block_queue *vbq;
>> -- 
>> 2.30.2
>> 



Thread overview: 20+ messages
2022-08-18 22:42 [RFC 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-08-18 22:42 ` [RFC 1/5] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
2022-10-06 23:15   ` Luis Chamberlain
2022-10-07  6:39     ` Song Liu
2022-08-18 22:42 ` [RFC 2/5] bpf: use vmalloc_exec Song Liu
2022-08-18 22:42 ` [RFC 3/5] modules, x86: use vmalloc_exec for module core Song Liu
2022-10-06 23:38   ` Luis Chamberlain
2022-10-07  6:46     ` Song Liu
2022-08-18 22:42 ` [RFC 4/5] vmalloc_exec: share a huge page with kernel text Song Liu
2022-10-06 23:44   ` Luis Chamberlain
2022-10-07  6:53     ` Song Liu [this message]
2022-08-18 22:42 ` [RFC 5/5] vmalloc: vfree_exec: free unused vm_struct Song Liu
2022-08-22 15:46 ` [RFC 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-08-22 16:34   ` Peter Zijlstra
2022-08-22 16:56     ` Song Liu
2022-08-23  5:42       ` Peter Zijlstra
2022-08-23  6:39         ` Christophe Leroy
2022-08-23  6:57           ` Song Liu
2022-08-23  6:55         ` Song Liu
2022-08-24 17:06       ` Song Liu
