From: Donald Hunter
To: Alexei Starovoitov
Cc: bpf@vger.kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org, memxor@gmail.com, eddyz87@gmail.com, tj@kernel.org, brho@google.com, hannes@cmpxchg.org, linux-mm@kvack.org, kernel-team@fb.com
Subject: Re: [PATCH bpf-next 00/16] bpf: Introduce BPF arena.
In-Reply-To: <20240206220441.38311-1-alexei.starovoitov@gmail.com> (Alexei Starovoitov's message of "Tue, 6 Feb 2024 14:04:25 -0800")
Date: Wed, 07 Feb 2024 12:34:20 +0000
References: <20240206220441.38311-1-alexei.starovoitov@gmail.com>

Alexei Starovoitov writes:

> From: Alexei Starovoitov
>
> bpf programs have multiple options to communicate with user space:
> - Various ring buffers (perf, ftrace, bpf): The data is streamed
>   unidirectionally from bpf to user space.
> - Hash map: The bpf program populates elements, and user space consumes them
>   via bpf syscall.
> - mmap()-ed array map: Libbpf creates an array map that is directly accessed by
>   the bpf program and mmap-ed to user space. It's the fastest way. Its
>   disadvantage is that memory for the whole array is reserved at the start.
>
> These patches introduce bpf_arena, which is a sparse shared memory region
> between the bpf program and user space.

This will need to be documented, probably in a new file at
Documentation/bpf/map_arena.rst since it's cosplaying as a BPF map.

Why is it a map, when it doesn't have map semantics as evidenced by the
-EOPNOTSUPP map accessors? Is it the only way you can reuse the kernel /
userspace plumbing?

> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage.
>    The bpf program implements an in-kernel accelerator. XDP prog can search
>    for a key in bpf_arena and return a value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space occasionally consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view. It is
>    not shared with user space.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The memory is not
> shared with user space. This is use case 3. In such a case, the
> BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the

I can see _what_ this flag does but it's not clear what the consequences
of this flag are. Perhaps it would be better named BPF_F_NO_USER_ACCESS?

> rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will
> improve bpf prog performance.
> Otherwise, bpf_arena_cast_user is translated by
> JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is
> not NULL) to arena pointers before they are stored into memory. This way, user
> space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
>   rX = (u32)rY;
>   if (rX)
>     rX |= arena->user_vm_start & ~(u64)~0U;
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations.
> For example, a linked list built by a bpf program can be walked natively by
> user space. The last two patches demonstrate how algorithms in the C language
> can be compiled as a bpf program and as native code.
>
> Followup patches are planned:
> . selftests in asm
> . support arena variables in global data. Example:
>   void __arena * ptr; // works
>   int __arena var; // supported by llvm, but not by kernel and libbpf yet
> . support bpf_spin_lock in arena
>   bpf programs running on different CPUs can synchronize access to the arena via
>   existing bpf_spin_lock mechanisms (spin_locks in bpf_array or in bpf hash map).
>   It will be more convenient to allow spin_locks inside the arena too.
>
> Patch set overview:
> - patch 1,2: minor verifier enhancements to enable bpf_arena kfuncs
> - patch 3: export vmap_pages_range() to be used outside of mm directory
> - patch 4: main patch that introduces bpf_arena map type. See commit log
> - patch 6: probe_mem32 support in x86 JIT
> - patch 7: bpf_cast_user support in x86 JIT
> - patch 8: main verifier patch to support bpf_arena
> - patch 9: __arg_arena to tag arena pointers in bpf global functions
> - patch 11: libbpf support for arena
> - patch 12: __ulong() macro to pass 64-bit constants in BTF
> - patch 13: export PAGE_SIZE constant into vmlinux BTF to be used from bpf programs
> - patch 14: bpf_arena_cast instruction as inline asm for setups with old LLVM
> - patch 15,16: testcases in C
>
> Alexei Starovoitov (16):
>   bpf: Allow kfuncs return 'void *'
>   bpf: Recognize '__map' suffix in kfunc arguments
>   mm: Expose vmap_pages_range() to the rest of the kernel.
>   bpf: Introduce bpf_arena.
>   bpf: Disasm support for cast_kern/user instructions.
>   bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
>   bpf: Add x86-64 JIT support for bpf_cast_user instruction.
>   bpf: Recognize cast_kern/user instructions in the verifier.
>   bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
>   libbpf: Add __arg_arena to bpf_helpers.h
>   libbpf: Add support for bpf_arena.
>   libbpf: Allow specifying 64-bit integers in map BTF.
>   bpf: Tell bpf programs kernel's PAGE_SIZE
>   bpf: Add helper macro bpf_arena_cast()
>   selftests/bpf: Add bpf_arena_list test.
>   selftests/bpf: Add bpf_arena_htab test.
>
>  arch/x86/net/bpf_jit_comp.c                   | 222 +++++++-
>  include/linux/bpf.h                           |   8 +-
>  include/linux/bpf_types.h                     |   1 +
>  include/linux/bpf_verifier.h                  |   1 +
>  include/linux/filter.h                        |   4 +
>  include/linux/vmalloc.h                       |   2 +
>  include/uapi/linux/bpf.h                      |  12 +
>  kernel/bpf/Makefile                           |   3 +
>  kernel/bpf/arena.c                            | 518 ++++++++++++++++++
>  kernel/bpf/btf.c                              |  19 +-
>  kernel/bpf/core.c                             |  23 +-
>  kernel/bpf/disasm.c                           |  11 +
>  kernel/bpf/log.c                              |   3 +
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 127 ++++-
>  mm/vmalloc.c                                  |   4 +-
>  tools/include/uapi/linux/bpf.h                |  12 +
>  tools/lib/bpf/bpf_helpers.h                   |   2 +
>  tools/lib/bpf/libbpf.c                        |  62 ++-
>  tools/lib/bpf/libbpf_probes.c                 |   6 +
>  tools/testing/selftests/bpf/DENYLIST.aarch64  |   1 +
>  tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
>  tools/testing/selftests/bpf/bpf_arena_alloc.h |  58 ++
>  .../testing/selftests/bpf/bpf_arena_common.h  |  70 +++
>  tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++
>  tools/testing/selftests/bpf/bpf_arena_list.h  |  95 ++++
>  .../testing/selftests/bpf/bpf_experimental.h  |  41 ++
>  .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++
>  .../selftests/bpf/prog_tests/arena_list.c     |  65 +++
>  .../testing/selftests/bpf/progs/arena_htab.c  |  48 ++
>  .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
>  .../testing/selftests/bpf/progs/arena_list.c  |  75 +++
>  32 files changed, 1669 insertions(+), 21 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c
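
The cast_user conversion and the PROBE_MEM32 address arithmetic described in the
cover letter can be sketched in plain user-space C. This is only an illustration
of the bit manipulation as the cover letter words it, not the JIT or verifier
code. The names struct arena_sketch, cast_user(), cast_user_noconv(),
kern_addr() and arena_sketch_selfcheck() are hypothetical, and the addresses
used in the self-check are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two base addresses the cover letter mentions. */
struct arena_sketch {
	uint64_t user_vm_start;	/* start of the user-space mapping */
	uint64_t kern_vm_start;	/* start of the kernel vm_area */
};

/*
 * Default cast_user behaviour: keep the low 32 bits and, unless the
 * pointer is NULL, OR in the upper 32 bits of user_vm_start.
 */
static inline uint64_t cast_user(const struct arena_sketch *a, uint64_t ry)
{
	uint64_t rx = (uint32_t)ry;

	if (rx)
		rx |= a->user_vm_start & ~(uint64_t)~0U;
	return rx;
}

/* With BPF_F_NO_USER_CONV the cast is described as a plain 32-bit move. */
static inline uint64_t cast_user_noconv(uint64_t ry)
{
	return (uint32_t)ry;
}

/* PROBE_MEM32 style access: kern_vm_start + 32bit_addr. */
static inline uint64_t kern_addr(const struct arena_sketch *a, uint64_t ptr)
{
	return a->kern_vm_start + (uint32_t)ptr;
}

/* Small self-check with invented addresses. */
static inline void arena_sketch_selfcheck(void)
{
	struct arena_sketch a = {
		.user_vm_start = 0x7f0000010000ULL,
		.kern_vm_start = 0xffffc90000000000ULL,
	};

	assert(cast_user(&a, 0) == 0);	/* NULL stays NULL */
	assert(cast_user(&a, 0x12340) == 0x7f0000012340ULL);
	assert(cast_user_noconv(0x7f0000012340ULL) == 0x12340);
	assert(kern_addr(&a, 0x7f0000012340ULL) == 0xffffc90000012340ULL);
}
```

In this sketch the upper 32 bits of the source register are always discarded, so
the same 32-bit offset yields a user address via cast_user() and a kernel
address via kern_addr().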