From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Date: Tue, 13 Feb 2024 15:14:39 -0800
Subject: Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, daniel@iogearbox.net, andrii@kernel.org, memxor@gmail.com, eddyz87@gmail.com, tj@kernel.org, brho@google.com, hannes@cmpxchg.org, lstoakes@gmail.com, akpm@linux-foundation.org, urezki@gmail.com, hch@infradead.org, linux-mm@kvack.org, kernel-team@fb.com
In-Reply-To: <20240209040608.98927-6-alexei.starovoitov@gmail.com>
References: <20240209040608.98927-1-alexei.starovoitov@gmail.com> <20240209040608.98927-6-alexei.starovoitov@gmail.com>
On Thu, Feb 8, 2024 at 8:06 PM Alexei Starovoitov wrote:
>
> From: Alexei Starovoitov
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
> region, like memcached or any key/value storage. The bpf program implements an
> in-kernel accelerator. An XDP prog can search for a key in bpf_arena and return
> a value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
> rb-trees, sparse arrays), while user space consumes them.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
> User space may mmap it, but the bpf program will not convert pointers
> to the user base at run-time, to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of the
> bpf program is more important than ease of sharing with user space. This is
> use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
> tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
> 32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
> bpf_arena_cast_user is translated by the JIT to conditionally add the upper 32
> bits of user vm_start (if the pointer is not NULL) to arena pointers before
> they are stored into memory. This way, user space sees them as valid 64-bit
> pointers.
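If I follow the explanation, the net effect for a pointer stored into arena
memory is equivalent to something like the sketch below (my reading of the
clear_lo32_bits() pseudo-code further down; the function and parameter names
here are illustrative, not from the patch):

static inline unsigned long long arena_cast_user(unsigned long long ptr,
                                                 unsigned long long user_vm_start)
{
        if (!ptr)
                return 0; /* NULL stays NULL */
        /* keep the low 32 bits (the offset within the 4GB arena) and
         * replace the upper 32 bits with those of user_vm_start
         */
        return (user_vm_start & ~0xffffffffULL) | (unsigned int)ptr;
}

This works because the arena is not allowed to cross a 32-bit boundary, so
all valid user pointers within it share user_vm_start's upper 32 bits.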
>
> The diff https://github.com/llvm/llvm-project/pull/79902 taught the LLVM BPF
> backend to generate the bpf_cast_kern() instruction before a dereference of an
> arena pointer and the bpf_cast_user() instruction when an arena pointer is
> formed. In a typical bpf program there will be very few bpf_cast_user() calls.
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address spaces. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on a PTR_TO_ARENA register
> have to be in the 32-bit domain. The verifier will mark loads/stores through
> PTR_TO_ARENA with PROBE_MEM32. The JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set,
> then the verifier converts cast_user to mov32. Otherwise, the JIT will emit
> native code equivalent to:
> rX = (u32)rY;
> if (rY)
>     rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such a conversion, the pointer becomes a valid user pointer within the
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list
> built by a bpf program can be walked natively by user space.
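And the kernel-side counterpart of the same math, as I read the PROBE_MEM32
description (again just a sketch of the semantics, not the actual JIT output;
names are illustrative):

static inline void *arena_cast_kern(unsigned long long ptr,
                                    unsigned long long kern_vm_start)
{
        /* only the low 32 bits of the arena pointer matter; they are an
         * offset into the 4GB kernel vm_area starting at kern_vm_start
         */
        return (void *)(kern_vm_start + (unsigned int)ptr);
}

which also makes it clear why the verifier can keep all PTR_TO_ARENA
arithmetic in the 32-bit domain.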
>
> Signed-off-by: Alexei Starovoitov
> ---
>  include/linux/bpf.h            |   5 +-
>  include/linux/bpf_types.h     |   1 +
>  include/uapi/linux/bpf.h       |   7 +
>  kernel/bpf/Makefile            |   3 +
>  kernel/bpf/arena.c             | 557 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c              |  11 +
>  kernel/bpf/syscall.c           |   3 +
>  kernel/bpf/verifier.c          |   1 +
>  tools/include/uapi/linux/bpf.h |   7 +
>  9 files changed, 593 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8b0dcb66eb33..de557c6c42e0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -37,6 +37,7 @@ struct perf_event;
>  struct bpf_prog;
>  struct bpf_prog_aux;
>  struct bpf_map;
> +struct bpf_arena;
>  struct sock;
>  struct seq_file;
>  struct btf;
> @@ -534,8 +535,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>                         struct bpf_spin_lock *spin_lock);
>  void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
>                       struct bpf_spin_lock *spin_lock);
> -
> -
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
>  int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
>
>  struct bpf_offload_dev;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 94baced5a1ad..9f2a6b83b49e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
>
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d96708380e52..f6648851eae6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -983,6 +983,7 @@ enum bpf_map_type {
>         BPF_MAP_TYPE_BLOOM_FILTER,
>         BPF_MAP_TYPE_USER_RINGBUF,
>         BPF_MAP_TYPE_CGRP_STORAGE,
> +       BPF_MAP_TYPE_ARENA,
>         __MAX_BPF_MAP_TYPE
>  };
>
> @@ -1370,6 +1371,12 @@ enum {
>
>  /* BPF token FD is passed in a corresponding command's token_fd field */
>         BPF_F_TOKEN_FD          = (1U << 16),
> +
> +/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
> +       BPF_F_SEGV_ON_FAULT     = (1U << 17),
> +
> +/* Do not translate kernel bpf_arena pointers to user pointers */
> +       BPF_F_NO_USER_CONV      = (1U << 18),
>  };
>
>  /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 4ce95acfcaa7..368c5d86b5b7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> +ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
> +obj-$(CONFIG_BPF_SYSCALL) += arena.o
> +endif
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> new file mode 100644
> index 000000000000..5c1014471740
> --- /dev/null
> +++ b/kernel/bpf/arena.c
> @@ -0,0 +1,557 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/btf_ids.h>
> +#include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * bpf_arena is a sparsely populated shared memory region between bpf program and
> + * user space process.
> + *
> + * For example on x86-64 the values could be:
> + * user_vm_start 7f7d26200000     // picked by mmap()
> + * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
> + * For user space all pointers within the arena are normal 8-byte addresses.
> + * In this example 7f7d26200000 is the address of the first page (pgoff=0).
> + * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
> + * (u32)7f7d26200000 -> 26200000
> + * hence
> + * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
> + * kernel memory region.
> + *
> + * BPF JITs generate the following code to access arena:
> + *   mov eax, eax  // eax has lower 32-bit of user pointer
> + *   mov word ptr [rax + r12 + off], bx
> + * where r12 == kern_vm_start and off is s16.
> + * Hence allocate 4Gb + GUARD_SZ/2 on each side.
> + *
> + * Initially kernel vm_area and user vma are not populated.
> + * User space can fault-in any address which will insert the page
> + * into kernel and user vma.
> + * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
> + * which will insert it into kernel vm_area.
> + * The later fault-in from user space will populate that page into user vma.
> + */
> +
> +/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
> +#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
> +#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)

I feel like we need another named constant for those 4GB limits here,
something like:

#define MAX_ARENA_SZ (1ull << 32)
#define KERN_VM_SZ (MAX_ARENA_SZ + GUARD_SZ)

see below why

> +
> +struct bpf_arena {
> +       struct bpf_map map;
> +       u64 user_vm_start;
> +       u64 user_vm_end;
> +       struct vm_struct *kern_vm;
> +       struct maple_tree mt;
> +       struct list_head vma_list;
> +       struct mutex lock;
> +};
> +

[...]
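Naming nit aside, I worked through the GUARD_SZ sizing above to convince
myself: off is an s16, so a single LDX/STX can reach at most +/-32KB around
the 32-bit address, i.e. 64KB total, half on each side of the 4GB region.
A quick standalone check (nothing here is from the patch, just arithmetic):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t guard_sz = 1ull << (sizeof(int16_t) * 8);  /* 65536 */
        uint64_t kern_vm_sz = (1ull << 32) + guard_sz;      /* 4GB + 64KB */

        printf("GUARD_SZ   = %llu\n", (unsigned long long)guard_sz);
        printf("KERN_VM_SZ = %llu\n", (unsigned long long)kern_vm_sz);
        /* the pgoff=0 example from the header comment */
        printf("%llx\n", 0xffffc90001e69000ull + 0x26200000ull); /* ffffc90028069000 */
        return 0;
}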
> +static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> +{
> +       struct vm_struct *kern_vm;
> +       int numa_node = bpf_map_attr_numa_node(attr);
> +       struct bpf_arena *arena;
> +       u64 vm_range;
> +       int err = -ENOMEM;
> +
> +       if (attr->key_size || attr->value_size || attr->max_entries == 0 ||
> +           /* BPF_F_MMAPABLE must be set */
> +           !(attr->map_flags & BPF_F_MMAPABLE) ||
> +           /* No unsupported flags present */
> +           (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
> +               return ERR_PTR(-EINVAL);
> +
> +       if (attr->map_extra & ~PAGE_MASK)
> +               /* If non-zero the map_extra is an expected user VMA start address */
> +               return ERR_PTR(-EINVAL);
> +
> +       vm_range = (u64)attr->max_entries * PAGE_SIZE;
> +       if (vm_range > (1ull << 32))

here we can then use MAX_ARENA_SZ

> +               return ERR_PTR(-E2BIG);
> +
> +       if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
> +               /* user vma must not cross 32-bit boundary */
> +               return ERR_PTR(-ERANGE);
> +
> +       kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
> +       if (!kern_vm)
> +               return ERR_PTR(-ENOMEM);
> +
> +       arena = bpf_map_area_alloc(sizeof(*arena), numa_node);
> +       if (!arena)
> +               goto err;
> +
> +       arena->kern_vm = kern_vm;
> +       arena->user_vm_start = attr->map_extra;
> +       if (arena->user_vm_start)
> +               arena->user_vm_end = arena->user_vm_start + vm_range;
> +
> +       INIT_LIST_HEAD(&arena->vma_list);
> +       bpf_map_init_from_attr(&arena->map, attr);
> +       mt_init_flags(&arena->mt, MT_FLAGS_ALLOC_RANGE);
> +       mutex_init(&arena->lock);
> +
> +       return &arena->map;
> +err:
> +       free_vm_area(kern_vm);
> +       return ERR_PTR(err);
> +}
> +
> +static int for_each_pte(pte_t *ptep, unsigned long addr, void *data)
> +{
> +       struct page *page;
> +       pte_t pte;
> +
> +       pte = ptep_get(ptep);
> +       if (!pte_present(pte))
> +               return 0;
> +       page = pte_page(pte);
> +       /*
> +        * We do not update pte here:
> +        * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
> +        * 2. TLB flushing is batched or deferred. Even if we clear pte,
> +        * the TLB entries can stick around and continue to permit access to
> +        * the freed page. So it all relies on 1.
> +        */
> +       __free_page(page);
> +       return 0;
> +}
> +
> +static void arena_map_free(struct bpf_map *map)
> +{
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +       /*
> +        * Check that user vma-s are not around when bpf map is freed.
> +        * mmap() holds vm_file which holds bpf_map refcnt.
> +        * munmap() must have happened on vma followed by arena_vm_close()
> +        * which would clear arena->vma_list.
> +        */
> +       if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
> +               return;
> +
> +       /*
> +        * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
> +        * It unmaps everything from vmalloc area and clears pgtables.
> +        * Call apply_to_existing_page_range() first to find populated ptes and
> +        * free those pages.
> +        */
> +       apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +                                    KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);

I'm still reading the rest (so it might become obvious), but this
KERN_VM_SZ - GUARD_SZ / 2 is a bit surprising. I understand that
kern_vm_start is shifted by GUARD_SZ/2, but is the intent here to
actually go beyond the maximum 4GB by GUARD_SZ/2, or was the intent to
unmap 4GB (MAX_ARENA_SZ)?

> +       free_vm_area(arena->kern_vm);
> +       mtree_destroy(&arena->mt);
> +       bpf_map_area_free(arena);
> +}
> +

[...]
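For context while reviewing arena_map_alloc(), this is the user-space
creation path I'd expect these checks to permit; a rough sketch using
libbpf's bpf_map_create() (the values are illustrative, not from the patch):

#include <bpf/bpf.h>
#include <linux/bpf.h>

int create_arena(unsigned int nr_pages)
{
        LIBBPF_OPTS(bpf_map_create_opts, opts,
                .map_flags = BPF_F_MMAPABLE,    /* mandatory per the check above */
                .map_extra = 0,                 /* 0 = let mmap() pick user_vm_start */
        );

        /* key_size and value_size must be 0; max_entries is the arena
         * size in pages and must stay within the 4GB limit
         */
        return bpf_map_create(BPF_MAP_TYPE_ARENA, "arena", 0, 0, nr_pages, &opts);
}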
> +static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr,
> +                                            unsigned long len, unsigned long pgoff,
> +                                            unsigned long flags)
> +{
> +       struct bpf_map *map = filp->private_data;
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +       long ret;
> +
> +       if (pgoff)
> +               return -EINVAL;
> +       if (len > (1ull << 32))

MAX_ARENA_SZ?

> +               return -E2BIG;
> +
> +       /* if user_vm_start was specified at arena creation time */
> +       if (arena->user_vm_start) {
> +               if (len > arena->user_vm_end - arena->user_vm_start)
> +                       return -E2BIG;
> +               if (len != arena->user_vm_end - arena->user_vm_start)
> +                       return -EINVAL;
> +               if (addr != arena->user_vm_start)
> +                       return -EINVAL;
> +       }
> +
> +       ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
> +       if (IS_ERR_VALUE(ret))
> +               return 0;

Can you leave a comment explaining why we are swallowing errors here, if
this is intentional?

> +       if ((ret >> 32) == ((ret + len - 1) >> 32))
> +               return ret;
> +       if (WARN_ON_ONCE(arena->user_vm_start))
> +               /* checks at map creation time should prevent this */
> +               return -EFAULT;
> +       return round_up(ret, 1ull << 32);

this is still probably MAX_ARENA_SZ, no?

> +}
> +
> +static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +       guard(mutex)(&arena->lock);
> +       if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
> +               /*
> +                * If map_extra was not specified at arena creation time then
> +                * 1st user process can do mmap(NULL, ...) to pick user_vm_start
> +                * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
> +                * or
> +                * specify addr in map_extra and
> +                * use the same addr later with mmap(addr, MAP_FIXED..);
> +                */
> +               return -EBUSY;
> +
> +       if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
> +               /* all user processes must have the same size of mmap-ed region */
> +               return -EBUSY;
> +
> +       /* Earlier checks should prevent this */
> +       if (WARN_ON_ONCE(vma->vm_end - vma->vm_start > (1ull << 32) || vma->vm_pgoff))

MAX_ARENA_SZ?

> +               return -EFAULT;
> +
> +       if (remember_vma(arena, vma))
> +               return -ENOMEM;
> +
> +       arena->user_vm_start = vma->vm_start;
> +       arena->user_vm_end = vma->vm_end;
> +       /*
> +        * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> +        * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> +        * potential change of user_vm_start.
> +        */
> +       vm_flags_set(vma, VM_DONTEXPAND);
> +       vma->vm_ops = &arena_vm_ops;
> +       return 0;
> +}
> +

[...]
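To close the loop on the mmap contract above, the user-space side I have in
mind is roughly this (a sketch, error handling elided; per the comments in
arena_map_mmap(), the first process may pass NULL while later processes must
pass the established address with MAP_FIXED and the same length):

#include <sys/mman.h>

void *map_arena(int map_fd, size_t len)
{
        /* len must equal max_entries * page_size from map creation;
         * bpf_map_mmap() requires MAP_SHARED
         */
        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
}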