From: Uladzislau Rezki
Date: Tue, 1 Nov 2022 12:54:47 +0100
To: Song Liu
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org,
    x86@kernel.org, peterz@infradead.org, hch@lst.de,
    rick.p.edgecombe@intel.com, dave.hansen@intel.com, urezki@gmail.com,
    mcgrof@kernel.org, kernel-team@fb.com
Subject: Re: [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
References: <20221031215834.1615596-1-song@kernel.org> <20221031215834.1615596-2-song@kernel.org>
In-Reply-To: <20221031215834.1615596-2-song@kernel.org>

On Mon, Oct 31, 2022 at 02:58:30PM -0700, Song Liu wrote:
> vmalloc_exec is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].
>
> A new tree of vmap_area, the free_text_area_* tree, is introduced in
> addition to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates
> pages from free_text_area_*. When there isn't enough space left in
> free_text_area_*, new PMD_SIZE page(s) are allocated from
> free_vmap_area_* and added to free_text_area_*. To be more accurate,
> the vmap_area is first added to the vmap_area_* tree and then moved to
> free_text_area_*. This extra move simplifies the logic of vmalloc_exec.
>
> vmap_area in the free_text_area_* tree are backed with memory, but we
> need subtree_max_size for tree operations. Therefore, the vm_struct for
> these vmap_area are stored in a separate list, all_text_vm.
>
> The new tree allows separate handling of < PAGE_SIZE allocations, as
> current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
> version of vmalloc_exec can handle BPF programs, which use 64-byte
> aligned allocations, and modules, which use PAGE_SIZE aligned
> allocations.
>
> Memory allocated by vmalloc_exec() is set to RO+X before returning to
> the caller. Therefore, the caller cannot write directly to the memory.
> Instead, the caller is required to use vcopy_exec() to update the
> memory. For the safety and security of X memory, vcopy_exec() checks
> that the data being updated is always within memory allocated by a
> single vmalloc_exec() call. vcopy_exec() uses a text_poke-like
> mechanism and requires arch support. Specifically, the arch needs to
> implement arch_vcopy_exec().
>
> In vfree_exec(), the memory is first erased with arch_invalidate_exec().
> Then, the memory is added to free_text_area_*. If this free creates a
> big enough contiguous free space (> PMD_SIZE), vfree_exec() will try to
> free the backing vm_struct.
>
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>
> Signed-off-by: Song Liu
> ---
>  include/linux/vmalloc.h |   5 +
>  mm/nommu.c              |  12 ++
>  mm/vmalloc.c            | 318 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 335 insertions(+)
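[ For illustration, a minimal caller-side sketch of the flow described
above, using only the vmalloc_exec()/vcopy_exec()/vfree_exec()
signatures declared in this patch; the helper name and error handling
here are hypothetical, not part of the series: ]

#include <linux/err.h>
#include <linux/vmalloc.h>

/* Hypothetical helper: stage 'len' bytes of instructions into RO+X memory. */
static void *install_text(void *insns, size_t len)
{
	void *dst, *ret;

	/* 64-byte alignment, as used for BPF programs per the text above. */
	dst = vmalloc_exec(len, 64);
	if (!dst)
		return NULL;

	/* The region is already RO+X, so it can only be written via vcopy_exec(). */
	ret = vcopy_exec(dst, insns, len);
	if (IS_ERR(ret)) {
		vfree_exec(dst);
		return NULL;
	}

	return ret;
}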
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 096d48aa3437..9b2042313c12 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
>  		int node, const void *caller) __alloc_size(1);
>  void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> +void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
> +void *vcopy_exec(void *dst, void *src, size_t len);
> +void vfree_exec(void *addr);
> +void *arch_vcopy_exec(void *dst, void *src, size_t len);
> +int arch_invalidate_exec(void *ptr, size_t len);
>
>  extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
>  extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 214c70e1d059..8a1317247ef0 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
>  }
>  EXPORT_SYMBOL(vm_map_pages_zero);
>
> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> +	return NULL;
> +}
> +
> +void *vcopy_exec(void *dst, void *src, size_t len)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +void vfree_exec(const void *addr) { }
> +
>  /*
>   * sys_brk() for the most part doesn't need the global kernel
>   * lock, except when an application is doing something nasty
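[ As context for the arch_vcopy_exec()/arch_invalidate_exec() hooks
declared above: on x86 these could plausibly wrap the existing
text_poke_copy()/text_poke_set() helpers. A sketch only, under that
assumption; the actual arch patches in this series may differ: ]

#include <linux/err.h>
#include <asm/text-patching.h>	/* x86: text_poke_copy(), text_poke_set() */

void *arch_vcopy_exec(void *dst, void *src, size_t len)
{
	/* text_poke_copy() returns NULL if it rejects the target range. */
	if (!text_poke_copy(dst, src, len))
		return ERR_PTR(-EINVAL);
	return dst;
}

int arch_invalidate_exec(void *ptr, size_t len)
{
	/* Fill freed text with 0xcc (INT3) so stale code traps if executed. */
	if (!text_poke_set(ptr, 0xcc, len))
		return -EINVAL;
	return 0;
}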
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ccaa461998f3..6f4c73e67191 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc);
>  static const bool vmap_allow_huge = false;
>  #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
>
> +#define PMD_ALIGN(addr)      ALIGN(addr, PMD_SIZE)
> +#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
> +
>  bool is_vmalloc_addr(const void *x)
>  {
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
> @@ -769,6 +772,38 @@ static LIST_HEAD(free_vmap_area_list);
>   */
>  static struct rb_root free_vmap_area_root = RB_ROOT;
>
> +/*
> + * free_text_area for vmalloc_exec()
> + */
> +static DEFINE_SPINLOCK(free_text_area_lock);
> +/*
> + * This linked list is used in pair with free_text_area_root.
> + * It gives O(1) access to prev/next to perform fast coalescing.
> + */
> +static LIST_HEAD(free_text_area_list);
> +
> +/*
> + * This augmented red-black tree represents the free text space.
> + * All vmap_area objects in this tree are sorted by va->va_start
> + * address. It is used for allocation and merging when a vmap
> + * object is released.
> + *
> + * Each vmap_area node contains a maximum available free block
> + * of its sub-tree, right or left. Therefore it is possible to
> + * find a lowest match of free area.
> + *
> + * vmap_area in this tree are backed by RO+X memory, but they do
> + * not have a valid vm pointer (because we need subtree_max_size).
> + * The vm for these vmap_area are stored in all_text_vm.
> + */
> +static struct rb_root free_text_area_root = RB_ROOT;
> +
> +/*
> + * List of vm_struct for free_text_area_root. This list is rarely
> + * accessed, so the O(N) complexity is not likely a real issue.
> + */
> +struct vm_struct *all_text_vm;
> +
>  /*
>   * Preload a CPU with one object for "no edge" split case. The
>   * aim is to get rid of allocations from the atomic context, thus
> @@ -3313,6 +3348,289 @@ void *vmalloc(unsigned long size)
>  }
>  EXPORT_SYMBOL(vmalloc);
>
> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define VMALLOC_EXEC_START MODULES_VADDR
> +#define VMALLOC_EXEC_END MODULES_END
> +#else
> +#define VMALLOC_EXEC_START VMALLOC_START
> +#define VMALLOC_EXEC_END VMALLOC_END
> +#endif
> +
> +static void move_vmap_to_free_text_tree(void *addr)
> +{
> +	struct vmap_area *va;
> +
> +	/* remove from vmap_area_root */
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> +	if (WARN_ON_ONCE(!va)) {
> +		spin_unlock(&vmap_area_lock);
> +		return;
> +	}
> +	unlink_va(va, &vmap_area_root);
> +	spin_unlock(&vmap_area_lock);
> +
> +	/* make the memory RO+X */
> +	memset(addr, 0, va->va_end - va->va_start);
> +	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
> +	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
> +
> +	/* add to all_text_vm */
> +	va->vm->next = all_text_vm;
> +	all_text_vm = va->vm;
> +
> +	/* add to free_text_area_root */
> +	spin_lock(&free_text_area_lock);
> +	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
> +	spin_unlock(&free_text_area_lock);
> +}
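[ A note on the all_text_vm bookkeeping above: since vmap_area in the
free tree cannot carry a vm pointer, finding the vm_struct that backs a
given address later implies a linear walk of this list, along the lines
of the hypothetical helper below (not part of the patch; the comment in
the patch argues the O(N) cost is acceptable because the list is rarely
accessed): ]

#include <linux/vmalloc.h>

/* Hypothetical helper: find the vm_struct in all_text_vm backing 'addr'. */
static struct vm_struct *find_text_vm(unsigned long addr)
{
	struct vm_struct *vm;

	/* Linear scan; a caller would hold the appropriate lock. */
	for (vm = all_text_vm; vm; vm = vm->next) {
		unsigned long start = (unsigned long)vm->addr;

		if (addr >= start && addr < start + vm->size)
			return vm;
	}
	return NULL;
}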
> +
> +/**
> + * vmalloc_exec - allocate virtually contiguous RO+X memory
> + * @size: allocation size
> + *
> + * This is used to allocate dynamic kernel text, such as module text, BPF
> + * programs, etc. Users need to use text_poke to update the memory allocated
> + * by vmalloc_exec.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> +	struct vmap_area *va, *tmp;
> +	unsigned long addr;
> +	enum fit_type type;
> +	int ret;
> +
> +	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> +	if (unlikely(!va))
> +		return NULL;
> +
> +again:
> +	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> +	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> +	if (!tmp) {
> +		unsigned long alloc_size;
> +		void *ptr;
> +
> +		spin_unlock(&free_text_area_lock);
> +
> +		/*
> +		 * Not enough contiguous space in free_text_area_root, try to
> +		 * allocate more memory. The memory is first added to
> +		 * vmap_area_root, and then moved to free_text_area_root.
> +		 */
> +		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> +		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
> +					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
> +					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> +					   NUMA_NO_NODE, __builtin_return_address(0));
> +		if (unlikely(!ptr))
> +			goto err_out;
> +
> +		move_vmap_to_free_text_tree(ptr);
> +		goto again;

[...]

It is yet another allocator built on top of vmalloc, so there are four
of them now. Could you please avoid doing this? I do not find it
reasonable.

--
Uladzislau Rezki