linux-mm.kvack.org archive mirror
From: Song Liu <songliubraving@fb.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Song Liu <song@kernel.org>, Vlastimil Babka <vbabka@suse.cz>,
	Mel Gorman <mgorman@suse.de>, Linux-MM <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"hch@lst.de" <hch@lst.de>, Kernel Team <Kernel-team@fb.com>,
	"rick.p.edgecombe@intel.com" <rick.p.edgecombe@intel.com>,
	"dave.hansen@intel.com" <dave.hansen@intel.com>
Subject: Re: [RFC 1/5] vmalloc: introduce vmalloc_exec and vfree_exec
Date: Fri, 7 Oct 2022 06:39:29 +0000	[thread overview]
Message-ID: <E652B995-9F44-4740-93A9-2B288635CE90@fb.com> (raw)
In-Reply-To: <Yz9hhTRzMr0ZEnA/@bombadil.infradead.org>



> On Oct 6, 2022, at 4:15 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Thu, Aug 18, 2022 at 03:42:14PM -0700, Song Liu wrote:
>> --- a/mm/nommu.c
>> +++ b/mm/nommu.c
>> @@ -372,6 +372,13 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
>> }
>> EXPORT_SYMBOL(vm_map_pages_zero);
>> 
>> +void *vmalloc_exec(unsigned long size, unsigned long align)
>> +{
>> +	return NULL;
>> +}
> 
> Well that's not so nice for no-mmu systems. Shouldn't we have a
> fallback?

This is still an early version, so I am not quite sure whether we
need a fallback for no-MMU systems.
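
For what it's worth, a fallback on no-MMU could be nearly trivial, since
there are no page permissions to manage there. A userspace sketch of the
contract (aligned_alloc() standing in for the allocator; the function name
and the rounding are mine, not from the patch):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical no-MMU fallback: without an MMU every mapping is
 * already "executable", so a plain aligned allocation satisfies
 * the vmalloc_exec() contract on its own. */
static void *vmalloc_exec_nommu(unsigned long size, unsigned long align)
{
	/* C11 aligned_alloc() wants size to be a multiple of align. */
	unsigned long rounded = (size + align - 1) / align * align;

	return aligned_alloc(align, rounded);
}
```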

> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index effd1ff6a4b4..472287e71bf1 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -1583,9 +1592,17 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>> 	va->va_end = addr + size;
>> 	va->vm = NULL;
>> 
>> -	spin_lock(&vmap_area_lock);
>> -	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
>> -	spin_unlock(&vmap_area_lock);
>> +	if (vm_flags & VM_KERNEL_EXEC) {
>> +		spin_lock(&free_text_area_lock);
>> +		insert_vmap_area(va, &free_text_area_root, &free_text_area_list);
>> +		/* update subtree_max_size now as we need this soon */
>> +		augment_tree_propagate_from(va);
> 
> Sorry, it is not clear to me why its needed only for exec, can you elaborate a
> bit more?

This version was wrong; we should use insert_vmap_area_augment() here.
I have already changed this in the latest version.

> 
>> +		spin_unlock(&free_text_area_lock);
>> +	} else {
>> +		spin_lock(&vmap_area_lock);
>> +		insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
>> +		spin_unlock(&vmap_area_lock);
>> +	}
>> 
>> 	BUG_ON(!IS_ALIGNED(va->va_start, align));
>> 	BUG_ON(va->va_start < vstart);
> 
> <-- snip -->
> 
>> @@ -3265,6 +3282,97 @@ void *vmalloc(unsigned long size)
>> }
>> EXPORT_SYMBOL(vmalloc);
>> 
>> +void *vmalloc_exec(unsigned long size, unsigned long align)
>> +{
>> +	struct vmap_area *va, *tmp;
>> +	unsigned long addr;
>> +	enum fit_type type;
>> +	int ret;
>> +
>> +	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
>> +	if (unlikely(!va))
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +again:
>> +	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
>> +	tmp = find_vmap_lowest_match(free_text_area_root.rb_node,
>> +				     size, align, 1, false);
>> +
>> +	if (!tmp) {
>> +		unsigned long alloc_size;
>> +		void *ptr;
>> +
>> +		spin_unlock(&free_text_area_lock);
>> +
>> +		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
>> +		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, MODULES_VADDR,
>> +					   MODULES_END, GFP_KERNEL, PAGE_KERNEL,
>> +					   VM_KERNEL_EXEC | VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
>> +					   NUMA_NO_NODE, __builtin_return_address(0));
> 
> We can review the guard stuff on the other thread with Peter.
> 
>> +		if (unlikely(!ptr)) {
>> +			ret = -ENOMEM;
>> +			goto err_out;
>> +		}
>> +		memset(ptr, 0, alloc_size);
>> +		set_memory_ro((unsigned long)ptr, alloc_size >> PAGE_SHIFT);
>> +		set_memory_x((unsigned long)ptr, alloc_size >> PAGE_SHIFT);
> 
> I *really* like that this is now not something users have to muck with thanks!

Well, this pushed some other complexity to the user side, for example,
all the text_poke hacks in 3/5.

> 
>> +
>> +		goto again;
>> +	}
>> +
>> +	addr = roundup(tmp->va_start, align);
>> +	type = classify_va_fit_type(tmp, addr, size);
>> +	if (WARN_ON_ONCE(type == NOTHING_FIT)) {
>> +		addr = -ENOMEM;
>> +		goto err_out;
>> +	}
>> +
>> +	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
>> +				    tmp, addr, size, type);
>> +	if (ret) {
>> +		addr = ret;
>> +		goto err_out;
>> +	}
>> +	spin_unlock(&free_text_area_lock);
>> +
>> +	va->va_start = addr;
>> +	va->va_end = addr + size;
>> +	va->vm = tmp->vm;
>> +
>> +	spin_lock(&vmap_area_lock);
>> +	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
>> +	spin_unlock(&vmap_area_lock);
>> +
>> +	return (void *)addr;
>> +
>> +err_out:
>> +	spin_unlock(&free_text_area_lock);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +void vfree_exec(const void *addr)
>> +{
>> +	struct vmap_area *va;
>> +
>> +	might_sleep();
>> +
>> +	spin_lock(&vmap_area_lock);
>> +	va = __find_vmap_area((unsigned long)addr, vmap_area_root.rb_node);
>> +	if (WARN_ON_ONCE(!va)) {
>> +		spin_unlock(&vmap_area_lock);
>> +		return;
>> +	}
>> +
>> +	unlink_va(va, &vmap_area_root);
> 
> Curious why we don't memset to 0 before merge_or_add_vmap_area_augment()?
> I realize other code doesn't seem to do it, though.

We should do the memset here, but we will need a text_poke-based version
of it, since the region is mapped read-only by then.
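
A userspace sketch of that intent (a plain memset standing in for the
text_poke-based clear; the helper name is mine):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scrub a text region before it is merged back
 * into the shared free pool, so stale instructions never leak into
 * the next allocation.  In the kernel this store would have to go
 * through text_poke(), because the mapping itself is RO+X. */
static void scrub_before_free(void *addr, size_t size)
{
	memset(addr, 0, size);
}
```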

> 
>> +	spin_unlock(&vmap_area_lock);
>> +
>> +	spin_lock(&free_text_area_lock);
>> +	merge_or_add_vmap_area_augment(va,
>> +		&free_text_area_root, &free_text_area_list);
> 
> I have concern that we can be using precious physically contigous memory
> from huge pages to then end up in a situation where we create our own
> pool and allow things to be non-contigous afterwards.
> 
> I'm starting to suspect that if the allocation is > PAGE_SIZE we just
> give it back generally. Otherwise wouldn't the fragmentation cause us
> to eventually just eat up most huge pages available? Probably not for
> eBPF but if we use this on a system with tons of module insertions /
> deletions this seems like it could happen?

Currently, bpf_prog_pack doesn't let allocations larger than PMD_SIZE
share with smaller allocations, which I guess is similar to the idea
here. I am not sure what the proper threshold for modules is; we can
discuss this later.
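
To make the threshold concrete, this is roughly the policy bpf_prog_pack
applies, as a userspace sketch (PMD_SIZE hard-coded to the x86_64 value
of 2 MiB; the helper name is mine):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's PMD_SIZE on x86_64 (one 2 MiB huge page). */
#define PMD_SIZE (2UL * 1024 * 1024)

/* Kernel-style roundup(): round x up to the next multiple of y. */
#define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

/* Mirrors the bpf_prog_pack policy: only allocations that fit inside
 * one PMD-sized page share a pack; anything larger gets its own
 * mapping, which bounds the fragmentation concern raised above. */
static bool fits_in_pack(size_t size)
{
	return size <= PMD_SIZE;
}
```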

Thanks,
Song

> 
>  Luis
> 
>> +	spin_unlock(&free_text_area_lock);
>> +	/* TODO: when the whole vm_struct is not in use, free it */
>> +}
>> +
>> /**
>>  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
>>  * @size:      allocation size
>> @@ -3851,7 +3959,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>> 			/* It is a BUG(), but trigger recovery instead. */
>> 			goto recovery;
>> 
>> -		ret = adjust_va_to_fit_type(va, start, size, type);
>> +		ret = adjust_va_to_fit_type(&free_vmap_area_root, &free_vmap_area_list,
>> +					    va, start, size, type);
>> 		if (unlikely(ret))
>> 			goto recovery;
>> 
>> -- 
>> 2.30.2
>> 



Thread overview: 20+ messages
2022-08-18 22:42 [RFC 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-08-18 22:42 ` [RFC 1/5] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
2022-10-06 23:15   ` Luis Chamberlain
2022-10-07  6:39     ` Song Liu [this message]
2022-08-18 22:42 ` [RFC 2/5] bpf: use vmalloc_exec Song Liu
2022-08-18 22:42 ` [RFC 3/5] modules, x86: use vmalloc_exec for module core Song Liu
2022-10-06 23:38   ` Luis Chamberlain
2022-10-07  6:46     ` Song Liu
2022-08-18 22:42 ` [RFC 4/5] vmalloc_exec: share a huge page with kernel text Song Liu
2022-10-06 23:44   ` Luis Chamberlain
2022-10-07  6:53     ` Song Liu
2022-08-18 22:42 ` [RFC 5/5] vmalloc: vfree_exec: free unused vm_struct Song Liu
2022-08-22 15:46 ` [RFC 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-08-22 16:34   ` Peter Zijlstra
2022-08-22 16:56     ` Song Liu
2022-08-23  5:42       ` Peter Zijlstra
2022-08-23  6:39         ` Christophe Leroy
2022-08-23  6:57           ` Song Liu
2022-08-23  6:55         ` Song Liu
2022-08-24 17:06       ` Song Liu
