From: Uladzislau Rezki <urezki@gmail.com>
To: Song Liu <song@kernel.org>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org,
akpm@linux-foundation.org, x86@kernel.org, peterz@infradead.org,
hch@lst.de, rick.p.edgecombe@intel.com, dave.hansen@intel.com,
urezki@gmail.com, mcgrof@kernel.org, kernel-team@fb.com
Subject: Re: [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
Date: Tue, 1 Nov 2022 12:54:47 +0100
Message-ID: <Y2EJB34M3NPKBY3v@pc636>
In-Reply-To: <20221031215834.1615596-2-song@kernel.org>
On Mon, Oct 31, 2022 at 02:58:30PM -0700, Song Liu wrote:
> vmalloc_exec is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].
>
> A new tree of vmap_areas, free_text_area_*, is introduced in addition
> to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates memory from
> free_text_area_*. When there isn't enough space left in free_text_area_*,
> new PMD_SIZE page(s) are allocated from free_vmap_area_* and added to
> free_text_area_*. To be more precise, the vmap_area is first added to
> the vmap_area_* tree and then moved to free_text_area_*. This extra move
> simplifies the logic of vmalloc_exec.
>
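[ Illustration, not part of the patch: with the growth rule described
  above, a miss in free_text_area_* rounds the request up to whole
  PMD-sized chunks, one per online node. A minimal sketch, assuming
  x86-64 where PMD_SIZE is 2 MiB:

	/* Illustrative only: compute the pool growth for a given request */
	static unsigned long text_pool_grow_size(unsigned long size)
	{
		/* grow in whole PMD_SIZE chunks, one chunk per online node */
		return roundup(size, PMD_SIZE * num_online_nodes());
	}

  e.g. a 5 KiB request with two online nodes grows the pool by
  roundup(5 KiB, 4 MiB) == 4 MiB. ]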
> vmap_areas in the free_text_area_* tree are backed with memory, but we
> need subtree_max_size for tree operations, and it shares a union with
> the vm pointer in struct vmap_area. Therefore, the vm_structs for these
> vmap_areas are stored in a separate list, all_text_vm.
>
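[ Illustration only: because all_text_vm is a plain singly linked chain,
  a lookup has to walk it linearly. A hypothetical helper, assuming only
  the fields the patch uses (vm->next, vm->addr, vm->size):

	/* Hypothetical, not in the patch: find the vm_struct backing addr. */
	static struct vm_struct *find_text_vm(unsigned long addr)
	{
		struct vm_struct *vm;

		for (vm = all_text_vm; vm; vm = vm->next)
			if (addr >= (unsigned long)vm->addr &&
			    addr < (unsigned long)vm->addr + vm->size)
				return vm;
		return NULL;
	}
]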
> The new tree allows separate handling of allocations smaller than
> PAGE_SIZE, as the current vmalloc code mostly assumes PAGE_SIZE aligned
> allocations. This version of vmalloc_exec can handle BPF programs, which
> use 64-byte aligned allocations, and modules, which use PAGE_SIZE
> aligned allocations.
>
> Memory allocated by vmalloc_exec() is set to RO+X before it is returned
> to the caller, so the caller cannot write to the memory directly.
> Instead, the caller is required to use vcopy_exec() to update it. For
> the safety and security of X memory, vcopy_exec() checks that the range
> being updated always falls within memory allocated by a single
> vmalloc_exec() call. vcopy_exec() uses a text_poke-like mechanism and
> requires arch support. Specifically, the arch needs to implement
> arch_vcopy_exec().
>
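[ A hypothetical caller, to make the contract concrete. The 64-byte
  alignment and the assumption that vcopy_exec() returns the updated
  destination on success are illustrative, not taken from the patch:

	/* Sketch: allocate RO+X memory, then fill it via vcopy_exec(). */
	static void *load_dynamic_text(void *image, size_t len)
	{
		void *buf, *ret;

		buf = vmalloc_exec(len, 64);	/* e.g. BPF's 64-byte alignment */
		if (!buf)
			return NULL;

		/* buf is already RO+X; writes must go through vcopy_exec() */
		ret = vcopy_exec(buf, image, len);
		if (IS_ERR(ret)) {
			vfree_exec(buf);
			return NULL;
		}
		return ret;
	}
]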
> In vfree_exec(), the memory is first erased with arch_invalidate_exec().
> Then, the memory is added to free_text_area_*. If this free creates a
> big enough contiguous free region (> PMD_SIZE), vfree_exec() will try
> to free the backing vm_struct.
>
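[ The patch only declares the two arch hooks; a sketch of what generic
  weak defaults could look like for archs without support (an assumption
  on my side, not something the patch provides):

	/* Hypothetical weak defaults; a supporting arch overrides both. */
	void * __weak arch_vcopy_exec(void *dst, void *src, size_t len)
	{
		return ERR_PTR(-EOPNOTSUPP);
	}

	int __weak arch_invalidate_exec(void *ptr, size_t len)
	{
		return -EOPNOTSUPP;
	}
]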
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>
> Signed-off-by: Song Liu <song@kernel.org>
> ---
> include/linux/vmalloc.h | 5 +
> mm/nommu.c | 12 ++
> mm/vmalloc.c | 318 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 335 insertions(+)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 096d48aa3437..9b2042313c12 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
> void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
> int node, const void *caller) __alloc_size(1);
> void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> +void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
> +void *vcopy_exec(void *dst, void *src, size_t len);
> +void vfree_exec(void *addr);
> +void *arch_vcopy_exec(void *dst, void *src, size_t len);
> +int arch_invalidate_exec(void *ptr, size_t len);
>
> extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
> extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 214c70e1d059..8a1317247ef0 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> }
> EXPORT_SYMBOL(vm_map_pages_zero);
>
> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> + return NULL;
> +}
> +
> +void *vcopy_exec(void *dst, void *src, size_t len)
> +{
> + return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +void vfree_exec(void *addr) { }
> +
> /*
> * sys_brk() for the most part doesn't need the global kernel
> * lock, except when an application is doing something nasty
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ccaa461998f3..6f4c73e67191 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc);
> static const bool vmap_allow_huge = false;
> #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
>
> +#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
> +#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
> +
> bool is_vmalloc_addr(const void *x)
> {
> unsigned long addr = (unsigned long)kasan_reset_tag(x);
> @@ -769,6 +772,38 @@ static LIST_HEAD(free_vmap_area_list);
> */
> static struct rb_root free_vmap_area_root = RB_ROOT;
>
> +/*
> + * free_text_area for vmalloc_exec()
> + */
> +static DEFINE_SPINLOCK(free_text_area_lock);
> +/*
> + * This linked list is used in pair with free_text_area_root.
> + * It gives O(1) access to prev/next to perform fast coalescing.
> + */
> +static LIST_HEAD(free_text_area_list);
> +
> +/*
> + * This augmented red-black tree represents the free text space.
> + * All vmap_area objects in this tree are sorted by va->va_start
> + * address. It is used for allocation and for merging when a vmap
> + * object is released.
> + *
> + * Each vmap_area node contains the maximum available free block
> + * of its sub-tree, right or left. Therefore it is possible to
> + * find the lowest match of a free area.
> + *
> + * vmap_areas in this tree are backed by RO+X memory, but they do
> + * not have a valid vm pointer (because we need subtree_max_size).
> + * The vm_structs for these vmap_areas are stored in all_text_vm.
> + */
> +static struct rb_root free_text_area_root = RB_ROOT;
> +
> +/*
> + * List of vm_struct for free_text_area_root. This list is rarely
> + * accessed, so the O(N) complexity is not likely a real issue.
> + */
> +struct vm_struct *all_text_vm;
> +
> /*
> * Preload a CPU with one object for "no edge" split case. The
> * aim is to get rid of allocations from the atomic context, thus
> @@ -3313,6 +3348,289 @@ void *vmalloc(unsigned long size)
> }
> EXPORT_SYMBOL(vmalloc);
>
> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define VMALLOC_EXEC_START MODULES_VADDR
> +#define VMALLOC_EXEC_END MODULES_END
> +#else
> +#define VMALLOC_EXEC_START VMALLOC_START
> +#define VMALLOC_EXEC_END VMALLOC_END
> +#endif
> +
> +static void move_vmap_to_free_text_tree(void *addr)
> +{
> + struct vmap_area *va;
> +
> + /* remove from vmap_area_root */
> + spin_lock(&vmap_area_lock);
> + va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> + if (WARN_ON_ONCE(!va)) {
> + spin_unlock(&vmap_area_lock);
> + return;
> + }
> + unlink_va(va, &vmap_area_root);
> + spin_unlock(&vmap_area_lock);
> +
> + /* make the memory RO+X */
> + memset(addr, 0, va->va_end - va->va_start);
> + set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
> + set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
> +
> + /* add to all_text_vm */
> + va->vm->next = all_text_vm;
> + all_text_vm = va->vm;
> +
> + /* add to free_text_area_root */
> + spin_lock(&free_text_area_lock);
> + merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
> + spin_unlock(&free_text_area_lock);
> +}
> +
> +/**
> + * vmalloc_exec - allocate virtually contiguous RO+X memory
> + * @size: allocation size
> + * @align: desired alignment
> + *
> + * This is used to allocate dynamic kernel text, such as module text, BPF
> + * programs, etc. Callers need to use vcopy_exec() to update the memory
> + * allocated by vmalloc_exec().
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> + struct vmap_area *va, *tmp;
> + unsigned long addr;
> + enum fit_type type;
> + int ret;
> +
> + va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> + if (unlikely(!va))
> + return NULL;
> +
> +again:
> + preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> + tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> + if (!tmp) {
> + unsigned long alloc_size;
> + void *ptr;
> +
> + spin_unlock(&free_text_area_lock);
> +
> + /*
> + * Not enough contiguous space in free_text_area_root; try
> + * to allocate more memory. The memory is first added to
> + * vmap_area_root, and then moved to free_text_area_root.
> + */
> + alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> + ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
> + VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
> + VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> + NUMA_NO_NODE, __builtin_return_address(0));
> + if (unlikely(!ptr))
> + goto err_out;
> +
> + move_vmap_to_free_text_tree(ptr);
> + goto again;
>
It is yet another allocator built on top of vmalloc. So there are 4 then.
Could you please avoid doing it? I do not see it as something that is
reasonable.
--
Uladzislau Rezki
Thread overview:
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
2022-11-01 11:54 ` Uladzislau Rezki [this message]
2022-11-01 15:06 ` Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 5/5] x86: use register_text_tail_vm Song Liu