linux-mm.kvack.org archive mirror
* [PATCH bpf-next 00/16] bpf: Introduce BPF arena.
@ 2024-02-06 22:04 Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
                   ` (16 more replies)
  0 siblings, 17 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf programs have multiple options to communicate with user space:
- Various ring buffers (perf, ftrace, bpf): The data is streamed
  unidirectionally from bpf to user space.
- Hash map: The bpf program populates elements, and user space consumes them
  via bpf syscall.
- mmap()-ed array map: Libbpf creates an array map that is directly accessed by
  the bpf program and mmap-ed to user space. It's the fastest way. Its
  disadvantage is that memory for the whole array is reserved at the start.

These patches introduce bpf_arena, which is a sparse shared memory region
between the bpf program and user space.

Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
   region, like memcached or any key/value storage. The bpf program implements an
   in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
   value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
   rb-trees, sparse arrays), while user space occasionally consumes it. 
3. bpf_arena is a "heap" of memory from the bpf program's point of view. It is
   not shared with user space.

Initially, the kernel vm_area and user vma are not populated. User space can
fault in pages within the range. While servicing a page fault, bpf_arena logic
will insert a new page into the kernel and user vmas. The bpf program can
allocate pages from that region via bpf_arena_alloc_pages(). This kernel
function will insert pages into the kernel vm_area. The subsequent fault-in
from user space will populate that page into the user vma. The
BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
from user space. In such a case, if a page is not allocated by the bpf program
and not present in the kernel vm_area, the user process will segfault. This is
useful for use cases 2 and 3 above.
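
For illustration, a minimal sketch of how a bpf program could declare such an
arena map (the libbpf conventions used here, e.g. that max_entries counts
arena pages, come from patch 11 and should be treated as assumptions):

struct {
	__uint(type, BPF_MAP_TYPE_ARENA);
	__uint(map_flags, BPF_F_MMAPABLE | BPF_F_SEGV_ON_FAULT);
	__uint(max_entries, 100); /* assumed: number of pages backing the arena */
} arena SEC(".maps");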

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
and removes the range from the kernel vm_area and from user process vmas.
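
A rough sketch of how a bpf program could call them (kfunc signatures as in
patch 4; the __arena annotation and the NUMA_NO_NODE constant are assumptions
here):

	void __arena *page;

	/* addr == NULL lets the maple tree pick a free range of 2 pages on any node */
	page = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
	if (!page)
		return 0;
	/* ... build data structures inside these pages ... */
	bpf_arena_free_pages(&arena, page, 2);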

bpf_arena can be used as a bpf program "heap" of up to 4GB. The memory is not
shared with user space. This is use case 3. In such a case, the
BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the
rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will
improve bpf prog performance. Otherwise, bpf_arena_cast_user is translated by
JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is
not NULL) to arena pointers before they are stored into memory. This way, user
space sees them as valid 64-bit pointers.

The LLVM diff https://github.com/llvm/llvm-project/pull/79902 taught the LLVM BPF
backend to generate the bpf_cast_kern() instruction before a dereference of an
arena pointer and the bpf_cast_user() instruction when an arena pointer is formed.
In a typical bpf program there will be very few bpf_cast_user() instructions.

From LLVM's point of view, arena pointers are tagged as
__attribute__((address_space(1))). Hence, clang provides helpful diagnostics
when pointers cross address space. Libbpf and the kernel support only
address_space == 1. All other address space identifiers are reserved.
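
In C sources this is typically wrapped in a macro; a sketch (the __arena name
matches the followup plans below, the rest is illustrative):

#define __arena __attribute__((address_space(1)))

void __arena *p;	/* pointer into the arena, address_space(1) */
void *q;

/* q = p;  <- clang diagnoses this implicit conversion across address spaces */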

rX = bpf_cast_kern(rY, addr_space) tells the verifier that
rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
to be in the 32-bit domain. The verifier will mark load/store through
PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
copy_from_kernel_nofault() except that no address checks are necessary. The
address is guaranteed to be in the 4GB range. If the page is not present, the
destination register is zeroed on read, and the operation is ignored on write.
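
As a mental model only (this is not code from the series), a PROBE_MEM32 load
of 'rX = *(u32 *)(rY + off)' behaves roughly like:

static u32 probe_mem32_load(u64 kern_vm_start, u32 lo32_addr, s16 off, bool page_present)
{
	/* page_present stands in for "a page is vmapped at kaddr in the kernel vm_area" */
	u32 *kaddr = (u32 *)(kern_vm_start + lo32_addr + off);

	if (!page_present)
		return 0;	/* a faulting load zeroes the destination register */
	return *kaddr;		/* the store variants become a nop on fault */
}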

rX = bpf_cast_user(rY, addr_space) tells the verifier that
rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
equivalent to:
rX = (u32)rY;
if (rX)
  rX |= arena->user_vm_start & ~(u64)~0U;

After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list built
by a bpf program can be walked natively by user space. The last two patches
demonstrate how algorithms in the C language can be compiled as a bpf program
and as native code.
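
A hedged sketch of that user space side (assuming the process mmap()s the arena
map fd directly; struct elem and head_off are hypothetical and must match the
layout the bpf program used):

#include <stdio.h>
#include <sys/mman.h>

struct elem { long value; struct elem *next; };	/* same layout as in the bpf prog */

static void walk_arena_list(int arena_map_fd, size_t arena_sz, size_t head_off)
{
	char *base = mmap(NULL, arena_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
			  arena_map_fd, 0);
	struct elem *e;

	if (base == MAP_FAILED)
		return;
	/* pointers stored by the bpf prog went through bpf_cast_user(),
	 * so they are valid user space pointers and can be followed directly
	 */
	for (e = *(struct elem **)(base + head_off); e; e = e->next)
		printf("%ld\n", e->value);
}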

Followup patches are planned:
. selftests in asm
. support arena variables in global data. Example:
  void __arena * ptr; // works
  int __arena var; // supported by llvm, but not by kernel and libbpf yet
. support bpf_spin_lock in arena
  bpf programs running on different CPUs can synchronize access to the arena via
  existing bpf_spin_lock mechanisms (spin_locks in bpf_array or in bpf hash map).
  It will be more convenient to allow spin_locks inside the arena too.

Patch set overview:
- patch 1,2: minor verifier enhancements to enable bpf_arena kfuncs
- patch 3: export vmap_pages_range() to be used outside of the mm directory
- patch 4: main patch that introduces bpf_arena map type. See commit log
- patch 6: probe_mem32 support in x86 JIT
- patch 7: bpf_cast_user support in x86 JIT
- patch 8: main verifier patch to support bpf_arena
- patch 9: __arg_arena to tag arena pointers in bpf global functions
- patch 11: libbpf support for arena
- patch 12: __ulong() macro to pass 64-bit constants in BTF
- patch 13: export PAGE_SIZE constant into vmlinux BTF to be used from bpf programs
- patch 14: bpf_arena_cast instruction as inline asm for setups with old LLVM
- patch 15,16: testcases in C

Alexei Starovoitov (16):
  bpf: Allow kfuncs return 'void *'
  bpf: Recognize '__map' suffix in kfunc arguments
  mm: Expose vmap_pages_range() to the rest of the kernel.
  bpf: Introduce bpf_arena.
  bpf: Disasm support for cast_kern/user instructions.
  bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  bpf: Recognize cast_kern/user instructions in the verifier.
  bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  libbpf: Add __arg_arena to bpf_helpers.h
  libbpf: Add support for bpf_arena.
  libbpf: Allow specifying 64-bit integers in map BTF.
  bpf: Tell bpf programs kernel's PAGE_SIZE
  bpf: Add helper macro bpf_arena_cast()
  selftests/bpf: Add bpf_arena_list test.
  selftests/bpf: Add bpf_arena_htab test.

 arch/x86/net/bpf_jit_comp.c                   | 222 +++++++-
 include/linux/bpf.h                           |   8 +-
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |   1 +
 include/linux/filter.h                        |   4 +
 include/linux/vmalloc.h                       |   2 +
 include/uapi/linux/bpf.h                      |  12 +
 kernel/bpf/Makefile                           |   3 +
 kernel/bpf/arena.c                            | 518 ++++++++++++++++++
 kernel/bpf/btf.c                              |  19 +-
 kernel/bpf/core.c                             |  23 +-
 kernel/bpf/disasm.c                           |  11 +
 kernel/bpf/log.c                              |   3 +
 kernel/bpf/syscall.c                          |   3 +
 kernel/bpf/verifier.c                         | 127 ++++-
 mm/vmalloc.c                                  |   4 +-
 tools/include/uapi/linux/bpf.h                |  12 +
 tools/lib/bpf/bpf_helpers.h                   |   2 +
 tools/lib/bpf/libbpf.c                        |  62 ++-
 tools/lib/bpf/libbpf_probes.c                 |   6 +
 tools/testing/selftests/bpf/DENYLIST.aarch64  |   1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 tools/testing/selftests/bpf/bpf_arena_alloc.h |  58 ++
 .../testing/selftests/bpf/bpf_arena_common.h  |  70 +++
 tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++
 tools/testing/selftests/bpf/bpf_arena_list.h  |  95 ++++
 .../testing/selftests/bpf/bpf_experimental.h  |  41 ++
 .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++
 .../selftests/bpf/prog_tests/arena_list.c     |  65 +++
 .../testing/selftests/bpf/progs/arena_htab.c  |  48 ++
 .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
 .../testing/selftests/bpf/progs/arena_list.c  |  75 +++
 32 files changed, 1669 insertions(+), 21 deletions(-)
 create mode 100644 kernel/bpf/arena.c
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c

-- 
2.34.1




* [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *'
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-08 19:40   ` Andrii Nakryiko
  2024-02-09 16:06   ` David Vernet
  2024-02-06 22:04 ` [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Recognize return of 'void *' from kfunc as returning unknown scalar.
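
For example, the arena page allocator added later in this series is such a kfunc
(declaration as in patch 4); after the call the verifier treats r0 as an unknown
scalar:

__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
					int node_id, u64 flags);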

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ddaf09db1175..d9c2dbb3939f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 					meta.func_name);
 				return -EFAULT;
 			}
+		} else if (btf_type_is_void(ptr_type)) {
+			/* kfunc returning 'void *' is equivalent to returning scalar */
+			mark_reg_unknown(env, regs, BPF_REG_0);
 		} else if (!__btf_type_is_struct(ptr_type)) {
 			if (!meta.r0_size) {
 				__u32 sz;
-- 
2.34.1




* [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-09 16:57   ` David Vernet
  2024-02-06 22:04 ` [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Recognize 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
It allows kfuncs to have a 'void *' argument for maps, since bpf progs
will call them as:
struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
	...
} arena SEC(".maps");

bpf_kfunc_with_map(... &arena ...);

Underneath, libbpf loads CONST_PTR_TO_MAP into the register via a ld_imm64 insn.
If the kfunc was defined with 'struct bpf_map *', it would pass the verifier,
but the bpf prog would need to use '(void *)&arena', which is not clean.
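
For example, the arena kfunc added in patch 4 of this series uses the suffix
like this (copied from that patch):

__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
					int node_id, u64 flags)
{
	struct bpf_map *map = p__map;	/* '__map' suffix -> verifier passes CONST_PTR_TO_MAP here */
	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);

	if (map->map_type != BPF_MAP_TYPE_ARENA || !arena->user_vm_start || flags)
		return NULL;

	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
}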

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d9c2dbb3939f..db569ce89fb1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
 	return __kfunc_param_match_suffix(btf, arg, "__ign");
 }
 
+static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
+{
+	return __kfunc_param_match_suffix(btf, arg, "__map");
+}
+
 static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
 {
 	return __kfunc_param_match_suffix(btf, arg, "__alloc");
@@ -11064,7 +11069,7 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 		return KF_ARG_PTR_TO_CONST_STR;
 
 	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
-		if (!btf_type_is_struct(ref_t)) {
+		if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
 			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
 				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
 			return -EINVAL;
@@ -11660,6 +11665,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		if (kf_arg_type < 0)
 			return kf_arg_type;
 
+		if (is_kfunc_arg_map(btf, &args[i])) {
+			/* If argument has '__map' suffix expect 'struct bpf_map *' */
+			ref_id = *reg2btf_ids[CONST_PTR_TO_MAP];
+			ref_t = btf_type_by_id(btf_vmlinux, ref_id);
+			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
+		}
+
 		switch (kf_arg_type) {
 		case KF_ARG_PTR_TO_NULL:
 			continue;
-- 
2.34.1




* [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-07 21:07   ` Lorenzo Stoakes
  2024-02-06 22:04 ` [PATCH bpf-next 04/16] bpf: Introduce bpf_arena Alexei Starovoitov
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The next commit will introduce bpf_arena, which is a sparsely populated shared
memory region between a bpf program and a user space process.
It will function similarly to vmalloc()/vm_map_ram(), as sketched after this list:
- get_vm_area()
- alloc_pages()
- vmap_pages_range()
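
A condensed sketch of that sequence as the next commit uses it (error handling
elided; KERN_VM_SZ and pgoff stand in for bpf_arena's own bookkeeping):

	struct vm_struct *kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	unsigned long kaddr = (unsigned long)kern_vm->addr + pgoff * PAGE_SIZE;

	/* map the freshly allocated page into the kernel vm_area */
	vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);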

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/vmalloc.h | 2 ++
 mm/vmalloc.c            | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..bafb87c69e3d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)
 
 #ifdef CONFIG_MMU
 void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..eae93d575d1b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,8 +625,8 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-static int vmap_pages_range(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages, unsigned int page_shift)
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
 	int err;
 
-- 
2.34.1




* [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-07 18:40   ` Barret Rhoden
  2024-02-06 22:04 ` [PATCH bpf-next 05/16] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.

Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
   region, like memcached or any key/value storage. The bpf program implements an
   in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
   value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
   rb-trees, sparse arrays), while user space occasionally consumes it.
3. bpf_arena is a "heap" of memory from the bpf program's point of view. It is
   not shared with user space.

Initially, the kernel vm_area and user vma are not populated. User space can
fault in pages within the range. While servicing a page fault, bpf_arena logic
will insert a new page into the kernel and user vmas. The bpf program can
allocate pages from that region via bpf_arena_alloc_pages(). This kernel
function will insert pages into the kernel vm_area. The subsequent fault-in
from user space will populate that page into the user vma. The
BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
from user space. In such a case, if a page is not allocated by the bpf program
and not present in the kernel vm_area, the user process will segfault. This is
useful for use cases 2 and 3 above.

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
and removes the range from the kernel vm_area and from user process vmas.

bpf_arena can be used as a bpf program "heap" of up to 4GB. The memory is not
shared with user space. This is use case 3. In such a case, the
BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the
rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will
improve bpf prog performance. Otherwise, bpf_arena_cast_user is translated by
JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is
not NULL) to arena pointers before they are stored into memory. This way, user
space sees them as valid 64-bit pointers.

The LLVM diff https://github.com/llvm/llvm-project/pull/79902 taught the LLVM BPF
backend to generate the bpf_cast_kern() instruction before a dereference of an
arena pointer and the bpf_cast_user() instruction when an arena pointer is formed.
In a typical bpf program there will be very few bpf_cast_user() instructions.

From LLVM's point of view, arena pointers are tagged as
__attribute__((address_space(1))). Hence, clang provides helpful diagnostics
when pointers cross address space. Libbpf and the kernel support only
address_space == 1. All other address space identifiers are reserved.

rX = bpf_cast_kern(rY, addr_space) tells the verifier that
rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
to be in the 32-bit domain. The verifier will mark load/store through
PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
copy_from_kernel_nofault() except that no address checks are necessary. The
address is guaranteed to be in the 4GB range. If the page is not present, the
destination register is zeroed on read, and the operation is ignored on write.

rX = bpf_cast_user(rY, addr_space) tells the verifier that
rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
equivalent to:
rX = (u32)rY;
if (rX)
  rX |= arena->user_vm_start & ~(u64)~0U;

After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list built
by a bpf program can be walked natively by user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h            |   5 +-
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   7 +
 kernel/bpf/Makefile            |   3 +
 kernel/bpf/arena.c             | 518 +++++++++++++++++++++++++++++++++
 kernel/bpf/core.c              |  11 +
 kernel/bpf/syscall.c           |   3 +
 kernel/bpf/verifier.c          |   1 +
 tools/include/uapi/linux/bpf.h |   7 +
 9 files changed, 554 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/arena.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1ebbee1d648e..42f22bc881f0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -37,6 +37,7 @@ struct perf_event;
 struct bpf_prog;
 struct bpf_prog_aux;
 struct bpf_map;
+struct bpf_arena;
 struct sock;
 struct seq_file;
 struct btf;
@@ -531,8 +532,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
 			struct bpf_spin_lock *spin_lock);
 void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
 		      struct bpf_spin_lock *spin_lock);
-
-
+u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
+u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 94baced5a1ad..9f2a6b83b49e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d96708380e52..f6648851eae6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -983,6 +983,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	BPF_MAP_TYPE_ARENA,
 	__MAX_BPF_MAP_TYPE
 };
 
@@ -1370,6 +1371,12 @@ enum {
 
 /* BPF token FD is passed in a corresponding command's token_fd field */
 	BPF_F_TOKEN_FD          = (1U << 16),
+
+/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
+	BPF_F_SEGV_ON_FAULT	= (1U << 17),
+
+/* Do not translate kernel bpf_arena pointers to user pointers */
+	BPF_F_NO_USER_CONV	= (1U << 18),
 };
 
 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 4ce95acfcaa7..368c5d86b5b7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
+ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
+obj-$(CONFIG_BPF_SYSCALL) += arena.o
+endif
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
new file mode 100644
index 000000000000..9db720321700
--- /dev/null
+++ b/kernel/bpf/arena.c
@@ -0,0 +1,518 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/err.h>
+#include <linux/btf_ids.h>
+#include <linux/vmalloc.h>
+#include <linux/pagemap.h>
+
+/*
+ * bpf_arena is a sparsely populated shared memory region between bpf program and
+ * user space process.
+ *
+ * For example on x86-64 the values could be:
+ * user_vm_start 7f7d26200000     // picked by mmap()
+ * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
+ * For user space all pointers within the arena are normal 8-byte addresses.
+ * In this example 7f7d26200000 is the address of the first page (pgoff=0).
+ * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
+ * (u32)7f7d26200000 -> 26200000
+ * hence
+ * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
+ * kernel memory region.
+ *
+ * BPF JITs generate the following code to access arena:
+ *   mov eax, eax  // eax has lower 32-bit of user pointer
+ *   mov word ptr [rax + r12 + off], bx
+ * where r12 == kern_vm_start and off is s16.
+ * Hence allocate 4Gb + GUARD_SZ/2 on each side.
+ *
+ * Initially kernel vm_area and user vma are not populated.
+ * User space can fault-in any address which will insert the page
+ * into kernel and user vma.
+ * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
+ * which will insert it into kernel vm_area.
+ * The later fault-in from user space will populate that page into user vma.
+ */
+
+/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
+#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
+#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)
+
+struct bpf_arena {
+	struct bpf_map map;
+	u64 user_vm_start;
+	u64 user_vm_end;
+	struct vm_struct *kern_vm;
+	struct maple_tree mt;
+	struct list_head vma_list;
+	struct mutex lock;
+};
+
+u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
+{
+	return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
+}
+
+u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
+{
+	return arena ? arena->user_vm_start : 0;
+}
+
+static long arena_map_peek_elem(struct bpf_map *map, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static long arena_map_pop_elem(struct bpf_map *map, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static long arena_map_delete_elem(struct bpf_map *map, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -EOPNOTSUPP;
+}
+
+static long compute_pgoff(struct bpf_arena *arena, long uaddr)
+{
+	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
+}
+
+#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
+
+/*
+ * Reserve a "zero page", so that bpf prog and user space never see
+ * a pointer to arena with lower 32 bits being zero.
+ * bpf_cast_user() promotes it to full 64-bit NULL.
+ */
+static int reserve_zero_page(struct bpf_arena *arena)
+{
+	long pgoff = compute_pgoff(arena, 0);
+
+	return mtree_insert(&arena->mt, pgoff, MT_ENTRY, GFP_KERNEL);
+}
+
+static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
+{
+	struct vm_struct *kern_vm;
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct bpf_arena *arena;
+	int err = -ENOMEM;
+
+	if (attr->key_size != 8 || attr->value_size != 8 ||
+	    /* BPF_F_MMAPABLE must be set */
+	    !(attr->map_flags & BPF_F_MMAPABLE) ||
+	    /* No unsupported flags present */
+	    (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
+		return ERR_PTR(-EINVAL);
+
+	if (attr->map_extra & ~PAGE_MASK)
+		/* If non-zero the map_extra is an expected user VMA start address */
+		return ERR_PTR(-EINVAL);
+
+	kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
+	if (!kern_vm)
+		return ERR_PTR(-ENOMEM);
+
+	arena = bpf_map_area_alloc(sizeof(*arena), numa_node);
+	if (!arena)
+		goto err;
+
+	INIT_LIST_HEAD(&arena->vma_list);
+	arena->kern_vm = kern_vm;
+	arena->user_vm_start = attr->map_extra;
+	bpf_map_init_from_attr(&arena->map, attr);
+	mt_init_flags(&arena->mt, MT_FLAGS_ALLOC_RANGE);
+	mutex_init(&arena->lock);
+	if (arena->user_vm_start) {
+		err = reserve_zero_page(arena);
+		if (err) {
+			bpf_map_area_free(arena);
+			goto err;
+		}
+	}
+
+	return &arena->map;
+err:
+	free_vm_area(kern_vm);
+	return ERR_PTR(err);
+}
+
+static int for_each_pte(pte_t *ptep, unsigned long addr, void *data)
+{
+	struct page *page;
+	pte_t pte;
+
+	pte = ptep_get(ptep);
+	if (!pte_present(pte))
+		return 0;
+	page = pte_page(pte);
+	/*
+	 * We do not update pte here:
+	 * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
+	 * 2. TLB flushing is batched or deferred. Even if we clear pte,
+	 * the TLB entries can stick around and continue to permit access to
+	 * the freed page. So it all relies on 1.
+	 */
+	__free_page(page);
+	return 0;
+}
+
+static void arena_map_free(struct bpf_map *map)
+{
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	/*
+	 * Check that user vma-s are not around when bpf map is freed.
+	 * mmap() holds vm_file which holds bpf_map refcnt.
+	 * munmap() must have happened on vma followed by arena_vm_close()
+	 * which would clear arena->vma_list.
+	 */
+	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
+		return;
+
+	/*
+	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
+	 * It unmaps everything from vmalloc area and clears pgtables.
+	 * Call apply_to_existing_page_range() first to find populated ptes and
+	 * free those pages.
+	 */
+	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				     KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);
+	free_vm_area(arena->kern_vm);
+	mtree_destroy(&arena->mt);
+	bpf_map_area_free(arena);
+}
+
+static void *arena_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static long arena_map_update_elem(struct bpf_map *map, void *key,
+				  void *value, u64 flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static int arena_map_check_btf(const struct bpf_map *map, const struct btf *btf,
+			       const struct btf_type *key_type, const struct btf_type *value_type)
+{
+	return 0;
+}
+
+static u64 arena_map_mem_usage(const struct bpf_map *map)
+{
+	return 0;
+}
+
+struct vma_list {
+	struct vm_area_struct *vma;
+	struct list_head head;
+};
+
+static int remember_vma(struct bpf_arena *arena, struct vm_area_struct *vma)
+{
+	struct vma_list *vml;
+
+	vml = kmalloc(sizeof(*vml), GFP_KERNEL);
+	if (!vml)
+		return -ENOMEM;
+	vma->vm_private_data = vml;
+	vml->vma = vma;
+	list_add(&vml->head, &arena->vma_list);
+	return 0;
+}
+
+static void arena_vm_close(struct vm_area_struct *vma)
+{
+	struct vma_list *vml;
+
+	vml = vma->vm_private_data;
+	list_del(&vml->head);
+	vma->vm_private_data = NULL;
+	kfree(vml);
+}
+
+static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
+{
+	struct bpf_map *map = vmf->vma->vm_file->private_data;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+	struct page *page;
+	long kbase, kaddr;
+	int ret;
+
+	kbase = bpf_arena_get_kern_vm_start(arena);
+	kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
+
+	guard(mutex)(&arena->lock);
+	page = vmalloc_to_page((void *)kaddr);
+	if (page)
+		/* already have a page vmap-ed */
+		goto out;
+
+	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
+		/* User space requested to segfault when page is not allocated by bpf prog */
+		return VM_FAULT_SIGSEGV;
+
+	ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
+	if (ret == -EEXIST)
+		return VM_FAULT_RETRY;
+	if (ret)
+		return VM_FAULT_SIGSEGV;
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page) {
+		mtree_erase(&arena->mt, vmf->pgoff);
+		return VM_FAULT_SIGSEGV;
+	}
+
+	ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);
+	if (ret) {
+		mtree_erase(&arena->mt, vmf->pgoff);
+		__free_page(page);
+		return VM_FAULT_SIGSEGV;
+	}
+out:
+	page_ref_add(page, 1);
+	vmf->page = page;
+	return 0;
+}
+
+static const struct vm_operations_struct arena_vm_ops = {
+	.close		= arena_vm_close,
+	.fault          = arena_vm_fault,
+};
+
+static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+	int err;
+
+	if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
+		/*
+		 * 1st user process can do mmap(NULL, ...) to pick user_vm_start
+		 * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
+		 *   or
+		 * specify addr in map_extra at map creation time and
+		 * use the same addr later with mmap(addr, MAP_FIXED..);
+		 */
+		return -EBUSY;
+
+	if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
+		/* all user processes must have the same size of mmap-ed region */
+		return -EBUSY;
+
+	if (vma->vm_end - vma->vm_start > 1ull << 32)
+		/* Must not be bigger than 4Gb */
+		return -E2BIG;
+
+	if (remember_vma(arena, vma))
+		return -ENOMEM;
+
+	if (!arena->user_vm_start) {
+		arena->user_vm_start = vma->vm_start;
+		err = reserve_zero_page(arena);
+		if (err)
+			return err;
+	}
+	arena->user_vm_end = vma->vm_end;
+	/*
+	 * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
+	 * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
+	 * potential change of user_vm_start.
+	 */
+	vm_flags_set(vma, VM_DONTEXPAND);
+	vma->vm_ops = &arena_vm_ops;
+	return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_arena_map_btf_ids, struct, bpf_arena)
+const struct bpf_map_ops arena_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = arena_map_alloc,
+	.map_free = arena_map_free,
+	.map_mmap = arena_map_mmap,
+	.map_get_next_key = arena_map_get_next_key,
+	.map_push_elem = arena_map_push_elem,
+	.map_peek_elem = arena_map_peek_elem,
+	.map_pop_elem = arena_map_pop_elem,
+	.map_lookup_elem = arena_map_lookup_elem,
+	.map_update_elem = arena_map_update_elem,
+	.map_delete_elem = arena_map_delete_elem,
+	.map_check_btf = arena_map_check_btf,
+	.map_mem_usage = arena_map_mem_usage,
+	.map_btf_id = &bpf_arena_map_btf_ids[0],
+};
+
+static u64 clear_lo32(u64 val)
+{
+	return val & ~(u64)~0U;
+}
+
+/*
+ * Allocate pages and vmap them into kernel vmalloc area.
+ * Later the pages will be mmaped into user space vma.
+ */
+static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
+{
+	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+	long pgoff = 0, kaddr, nr_pages = 0;
+	struct page **pages;
+	int ret, i;
+
+	if (page_cnt >= page_cnt_max)
+		return 0;
+
+	if (uaddr) {
+		if (uaddr & ~PAGE_MASK)
+			return 0;
+		pgoff = compute_pgoff(arena, uaddr);
+		if (pgoff + page_cnt > page_cnt_max)
+			/* requested address will be outside of user VMA */
+			return 0;
+	}
+
+	/* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
+	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return 0;
+
+	guard(mutex)(&arena->lock);
+
+	if (uaddr)
+		ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt,
+					 MT_ENTRY, GFP_KERNEL);
+	else
+		ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
+					page_cnt, 0, page_cnt_max, GFP_KERNEL);
+	if (ret)
+		goto out_free_pages;
+
+	nr_pages = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_ZERO, node_id, page_cnt, pages);
+	if (nr_pages != page_cnt)
+		goto out;
+
+	kaddr = kern_vm_start + (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
+	ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE * page_cnt, PAGE_KERNEL,
+			       pages, PAGE_SHIFT);
+	if (ret)
+		goto out;
+	kvfree(pages);
+	return clear_lo32(arena->user_vm_start) + (u32)(kaddr - kern_vm_start);
+out:
+	mtree_erase(&arena->mt, pgoff);
+out_free_pages:
+	if (pages)
+		for (i = 0; i < nr_pages; i++)
+			__free_page(pages[i]);
+	kvfree(pages);
+	return 0;
+}
+
+/*
+ * If page is present in vmalloc area, unmap it from vmalloc area,
+ * unmap it from all user space vma-s,
+ * and free it.
+ */
+static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+{
+	struct vma_list *vml;
+
+	list_for_each_entry(vml, &arena->vma_list, head)
+		zap_page_range_single(vml->vma, uaddr,
+				      PAGE_SIZE * page_cnt, NULL);
+}
+
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+{
+	u64 full_uaddr, uaddr_end;
+	long kaddr, pgoff, i;
+	struct page *page;
+
+	/* only aligned lower 32-bit are relevant */
+	uaddr = (u32)uaddr;
+	uaddr &= PAGE_MASK;
+	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
+	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
+	if (full_uaddr >= uaddr_end)
+		return;
+
+	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
+
+	guard(mutex)(&arena->lock);
+
+	pgoff = compute_pgoff(arena, uaddr);
+	/* clear range */
+	mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt, NULL, GFP_KERNEL);
+
+	if (page_cnt > 1)
+		/* bulk zap if multiple pages being freed */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
+		page = vmalloc_to_page((void *)kaddr);
+		if (!page)
+			continue;
+		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
+			zap_pages(arena, full_uaddr, 1);
+		vunmap_range(kaddr, kaddr + PAGE_SIZE);
+		__free_page(page);
+	}
+}
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
+					int node_id, u64 flags)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || !arena->user_vm_start || flags)
+		return NULL;
+
+	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
+}
+
+__bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || !arena->user_vm_start)
+		return;
+	arena_free_pages(arena, (long)ptr__ign, page_cnt);
+}
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(arena_kfuncs)
+BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE)
+BTF_KFUNCS_END(arena_kfuncs)
+
+static const struct btf_kfunc_id_set common_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &arena_kfuncs,
+};
+
+static int __init kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &common_kfunc_set);
+}
+late_initcall(kfunc_init);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 71c459a51d9e..2539d9bfe369 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2970,6 +2970,17 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
 {
 }
 
+/* for configs without MMU or 32-bit */
+__weak const struct bpf_map_ops arena_map_ops;
+__weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
+{
+	return 0;
+}
+__weak u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
+{
+	return 0;
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 static int __init bpf_global_ma_init(void)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b2750b79ac80..ac0e4a8bb852 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -164,6 +164,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	if (bpf_map_is_offloaded(map)) {
 		return bpf_map_offload_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
+		   map->map_type == BPF_MAP_TYPE_ARENA ||
 		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		return map->ops->map_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH ||
@@ -1160,6 +1161,7 @@ static int map_create(union bpf_attr *attr)
 	}
 
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
+	    attr->map_type != BPF_MAP_TYPE_ARENA &&
 	    attr->map_extra != 0)
 		return -EINVAL;
 
@@ -1249,6 +1251,7 @@ static int map_create(union bpf_attr *attr)
 	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 	case BPF_MAP_TYPE_STRUCT_OPS:
 	case BPF_MAP_TYPE_CPUMAP:
+	case BPF_MAP_TYPE_ARENA:
 		if (!bpf_token_capable(token, CAP_BPF))
 			goto put_token;
 		break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index db569ce89fb1..3c77a3ab1192 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18047,6 +18047,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		case BPF_MAP_TYPE_SK_STORAGE:
 		case BPF_MAP_TYPE_TASK_STORAGE:
 		case BPF_MAP_TYPE_CGRP_STORAGE:
+		case BPF_MAP_TYPE_ARENA:
 			break;
 		default:
 			verbose(env,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d96708380e52..f6648851eae6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -983,6 +983,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	BPF_MAP_TYPE_ARENA,
 	__MAX_BPF_MAP_TYPE
 };
 
@@ -1370,6 +1371,12 @@ enum {
 
 /* BPF token FD is passed in a corresponding command's token_fd field */
 	BPF_F_TOKEN_FD          = (1U << 16),
+
+/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
+	BPF_F_SEGV_ON_FAULT	= (1U << 17),
+
+/* Do not translate kernel bpf_arena pointers to user pointers */
+	BPF_F_NO_USER_CONV	= (1U << 18),
 };
 
 /* Flags for BPF_PROG_QUERY. */
-- 
2.34.1




* [PATCH bpf-next 05/16] bpf: Disasm support for cast_kern/user instructions.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 04/16] bpf: Introduce bpf_arena Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 06/16] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

LLVM generates rX = bpf_cast_kern/_user(rY, address_space) instructions
when pointers in a non-zero address space are used by the bpf program.
Teach the disassembler to print them in human-readable form.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h       |  5 +++++
 kernel/bpf/disasm.c            | 11 +++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++++
 3 files changed, 21 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f6648851eae6..3de1581379d4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1313,6 +1313,11 @@ enum {
  */
 #define BPF_PSEUDO_KFUNC_CALL	2
 
+enum bpf_arena_cast_kinds {
+	BPF_ARENA_CAST_KERN = 1,
+	BPF_ARENA_CAST_USER = 2,
+};
+
 /* flags for BPF_MAP_UPDATE_ELEM command */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index 49940c26a227..37d9b37b34f7 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c
@@ -166,6 +166,12 @@ static bool is_movsx(const struct bpf_insn *insn)
 	       (insn->off == 8 || insn->off == 16 || insn->off == 32);
 }
 
+static bool is_arena_cast(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) &&
+		(insn->off == BPF_ARENA_CAST_KERN || insn->off == BPF_ARENA_CAST_USER);
+}
+
 void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 		    const struct bpf_insn *insn,
 		    bool allow_ptr_leaks)
@@ -184,6 +190,11 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 				insn->code, class == BPF_ALU ? 'w' : 'r',
 				insn->dst_reg, class == BPF_ALU ? 'w' : 'r',
 				insn->dst_reg);
+		} else if (is_arena_cast(insn)) {
+			verbose(cbs->private_data, "(%02x) r%d = cast_%s(r%d, %d)\n",
+				insn->code, insn->dst_reg,
+				insn->off == BPF_ARENA_CAST_KERN ? "kern" : "user",
+				insn->src_reg, insn->imm);
 		} else if (BPF_SRC(insn->code) == BPF_X) {
 			verbose(cbs->private_data, "(%02x) %c%d %s %s%c%d\n",
 				insn->code, class == BPF_ALU ? 'w' : 'r',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f6648851eae6..3de1581379d4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1313,6 +1313,11 @@ enum {
  */
 #define BPF_PSEUDO_KFUNC_CALL	2
 
+enum bpf_arena_cast_kinds {
+	BPF_ARENA_CAST_KERN = 1,
+	BPF_ARENA_CAST_USER = 2,
+};
+
 /* flags for BPF_MAP_UPDATE_ELEM command */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
-- 
2.34.1




* [PATCH bpf-next 06/16] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 05/16] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 07/16] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
They are similar to PROBE_MEM instructions with the following differences
(a sketch of the emitted code follows the list):
- PROBE_MEM has to check that the address is in the kernel range with an
  'src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE' check
- PROBE_MEM doesn't support store
- PROBE_MEM32 relies on the verifier to clear the upper 32 bits in the register
- PROBE_MEM32 adds the 64-bit kern_vm_start address (which is stored in %r12 in
  the prologue). Due to the bpf_arena construction, such a %r12 + %reg + off16
  access is guaranteed to be within the arena virtual range, so no address check
  is needed at run-time.
- PROBE_MEM32 allows STX and ST. If they fault, the store is a nop.
  When LDX faults, the destination register is zeroed.
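
A sketch of the emitted x86 code for such an access (register choice is
illustrative; it mirrors the example in the bpf_arena comment from patch 4):

  movabs r12, arena_kern_vm_start     ; emitted once in the prologue when the prog uses an arena
  ...
  mov eax, eax                        ; verifier-converted cast_kern keeps the lower 32 bits
  mov word ptr [rax + r12 + off], bx  ; PROBE_MEM32 store; on fault the extable turns it into a nop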

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 183 +++++++++++++++++++++++++++++++++++-
 include/linux/bpf.h         |   1 +
 include/linux/filter.h      |   3 +
 3 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e1390d1e331b..883b7f604b9a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -113,6 +113,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
 /* Pick a register outside of BPF range for JIT internal work */
 #define AUX_REG (MAX_BPF_JIT_REG + 1)
 #define X86_REG_R9 (MAX_BPF_JIT_REG + 2)
+#define X86_REG_R12 (MAX_BPF_JIT_REG + 3)
 
 /*
  * The following table maps BPF registers to x86-64 registers.
@@ -139,6 +140,7 @@ static const int reg2hex[] = {
 	[BPF_REG_AX] = 2, /* R10 temp register */
 	[AUX_REG] = 3,    /* R11 temp register */
 	[X86_REG_R9] = 1, /* R9 register, 6th function argument */
+	[X86_REG_R12] = 4, /* R12 callee saved */
 };
 
 static const int reg2pt_regs[] = {
@@ -167,6 +169,7 @@ static bool is_ereg(u32 reg)
 			     BIT(BPF_REG_8) |
 			     BIT(BPF_REG_9) |
 			     BIT(X86_REG_R9) |
+			     BIT(X86_REG_R12) |
 			     BIT(BPF_REG_AX));
 }
 
@@ -205,6 +208,17 @@ static u8 add_2mod(u8 byte, u32 r1, u32 r2)
 	return byte;
 }
 
+static u8 add_3mod(u8 byte, u32 r1, u32 r2, u32 index)
+{
+	if (is_ereg(r1))
+		byte |= 1;
+	if (is_ereg(index))
+		byte |= 2;
+	if (is_ereg(r2))
+		byte |= 4;
+	return byte;
+}
+
 /* Encode 'dst_reg' register into x86-64 opcode 'byte' */
 static u8 add_1reg(u8 byte, u32 dst_reg)
 {
@@ -887,6 +901,18 @@ static void emit_insn_suffix(u8 **pprog, u32 ptr_reg, u32 val_reg, int off)
 	*pprog = prog;
 }
 
+static void emit_insn_suffix_SIB(u8 **pprog, u32 ptr_reg, u32 val_reg, u32 index_reg, int off)
+{
+	u8 *prog = *pprog;
+
+	if (is_imm8(off)) {
+		EMIT3(add_2reg(0x44, BPF_REG_0, val_reg), add_2reg(0, ptr_reg, index_reg) /* SIB */, off);
+	} else {
+		EMIT2_off32(add_2reg(0x84, BPF_REG_0, val_reg), add_2reg(0, ptr_reg, index_reg) /* SIB */, off);
+	}
+	*pprog = prog;
+}
+
 /*
  * Emit a REX byte if it will be necessary to address these registers
  */
@@ -968,6 +994,37 @@ static void emit_ldsx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 	*pprog = prog;
 }
 
+static void emit_ldx_index(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, u32 index_reg, int off)
+{
+	u8 *prog = *pprog;
+
+	switch (size) {
+	case BPF_B:
+		/* movzx rax, byte ptr [rax + r12 + off] */
+		EMIT3(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x0F, 0xB6);
+		break;
+	case BPF_H:
+		/* movzx rax, word ptr [rax + r12 + off] */
+		EMIT3(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x0F, 0xB7);
+		break;
+	case BPF_W:
+		/* mov eax, dword ptr [rax + r12 + off] */
+		EMIT2(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x8B);
+		break;
+	case BPF_DW:
+		/* mov rax, qword ptr [rax + r12 + off] */
+		EMIT2(add_3mod(0x48, src_reg, dst_reg, index_reg), 0x8B);
+		break;
+	}
+	emit_insn_suffix_SIB(&prog, src_reg, dst_reg, index_reg, off);
+	*pprog = prog;
+}
+
+static void emit_ldx_r12(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
+{
+	emit_ldx_index(pprog, size, dst_reg, src_reg, X86_REG_R12, off);
+}
+
 /* STX: *(u8*)(dst_reg + off) = src_reg */
 static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 {
@@ -1002,6 +1059,71 @@ static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 	*pprog = prog;
 }
 
+/* STX: *(u8*)(dst_reg + index_reg + off) = src_reg */
+static void emit_stx_index(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, u32 index_reg, int off)
+{
+	u8 *prog = *pprog;
+
+	switch (size) {
+	case BPF_B:
+		/* mov byte ptr [rax + r12 + off], al */
+		EMIT2(add_3mod(0x40, dst_reg, src_reg, index_reg), 0x88);
+		break;
+	case BPF_H:
+		/* mov word ptr [rax + r12 + off], ax */
+		EMIT3(0x66, add_3mod(0x40, dst_reg, src_reg, index_reg), 0x89);
+		break;
+	case BPF_W:
+		/* mov dword ptr [rax + r12 + off], eax */
+		EMIT2(add_3mod(0x40, dst_reg, src_reg, index_reg), 0x89);
+		break;
+	case BPF_DW:
+		/* mov qword ptr [rax + r12 + off], rax */
+		EMIT2(add_3mod(0x48, dst_reg, src_reg, index_reg), 0x89);
+		break;
+	}
+	emit_insn_suffix_SIB(&prog, dst_reg, src_reg, index_reg, off);
+	*pprog = prog;
+}
+
+static void emit_stx_r12(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
+{
+	emit_stx_index(pprog, size, dst_reg, src_reg, X86_REG_R12, off);
+}
+
+/* ST: *(u8*)(dst_reg + index_reg + off) = imm32 */
+static void emit_st_index(u8 **pprog, u32 size, u32 dst_reg, u32 index_reg, int off, int imm)
+{
+	u8 *prog = *pprog;
+
+	switch (size) {
+	case BPF_B:
+		/* mov byte ptr [rax + r12 + off], imm8 */
+		EMIT2(add_3mod(0x40, dst_reg, 0, index_reg), 0xC6);
+		break;
+	case BPF_H:
+		/* mov word ptr [rax + r12 + off], imm16 */
+		EMIT3(0x66, add_3mod(0x40, dst_reg, 0, index_reg), 0xC7);
+		break;
+	case BPF_W:
+		/* mov dword ptr [rax + r12 + off], imm32 */
+		EMIT2(add_3mod(0x40, dst_reg, 0, index_reg), 0xC7);
+		break;
+	case BPF_DW:
+		/* mov qword ptr [rax + r12 + off], imm32 */
+		EMIT2(add_3mod(0x48, dst_reg, 0, index_reg), 0xC7);
+		break;
+	}
+	emit_insn_suffix_SIB(&prog, dst_reg, 0, index_reg, off);
+	EMIT(imm, bpf_size_to_x86_bytes(size));
+	*pprog = prog;
+}
+
+static void emit_st_r12(u8 **pprog, u32 size, u32 dst_reg, int off, int imm)
+{
+	emit_st_index(pprog, size, dst_reg, X86_REG_R12, off, imm);
+}
+
 static int emit_atomic(u8 **pprog, u8 atomic_op,
 		       u32 dst_reg, u32 src_reg, s16 off, u8 bpf_size)
 {
@@ -1043,12 +1165,15 @@ static int emit_atomic(u8 **pprog, u8 atomic_op,
 	return 0;
 }
 
+#define DONT_CLEAR 1
+
 bool ex_handler_bpf(const struct exception_table_entry *x, struct pt_regs *regs)
 {
 	u32 reg = x->fixup >> 8;
 
 	/* jump over faulting load and clear dest register */
-	*(unsigned long *)((void *)regs + reg) = 0;
+	if (reg != DONT_CLEAR)
+		*(unsigned long *)((void *)regs + reg) = 0;
 	regs->ip += x->fixup & 0xff;
 	return true;
 }
@@ -1147,11 +1272,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+	u64 arena_vm_start;
 	int i, excnt = 0;
 	int ilen, proglen = 0;
 	u8 *prog = temp;
 	int err;
 
+	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
+
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
 
@@ -1172,8 +1300,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		push_r12(&prog);
 		push_callee_regs(&prog, all_callee_regs_used);
 	} else {
+		if (arena_vm_start)
+			push_r12(&prog);
 		push_callee_regs(&prog, callee_regs_used);
 	}
+	if (arena_vm_start)
+		emit_mov_imm64(&prog, X86_REG_R12,
+			       arena_vm_start >> 32, (u32) arena_vm_start);
 
 	ilen = prog - temp;
 	if (rw_image)
@@ -1564,6 +1697,52 @@ st:			if (is_imm8(insn->off))
 			emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
 			break;
 
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
+			start_of_ldx = prog;
+			emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
+			goto populate_extable;
+
+			/* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
+			start_of_ldx = prog;
+			if (BPF_CLASS(insn->code) == BPF_LDX)
+				emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
+			else
+				emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
+populate_extable:
+			{
+				struct exception_table_entry *ex;
+				u8 *_insn = image + proglen + (start_of_ldx - temp);
+				s64 delta;
+
+				if (!bpf_prog->aux->extable)
+					break;
+
+				ex = &bpf_prog->aux->extable[excnt++];
+
+				delta = _insn - (u8 *)&ex->insn;
+				/* switch ex to rw buffer for writes */
+				ex = (void *)rw_image + ((void *)ex - (void *)image);
+
+				ex->insn = delta;
+
+				ex->data = EX_TYPE_BPF;
+
+				ex->fixup = (prog - start_of_ldx) |
+					((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 8);
+			}
+			break;
+
 			/* LDX: dst_reg = *(u8*)(src_reg + off) */
 		case BPF_LDX | BPF_MEM | BPF_B:
 		case BPF_LDX | BPF_PROBE_MEM | BPF_B:
@@ -2036,6 +2215,8 @@ st:			if (is_imm8(insn->off))
 				pop_r12(&prog);
 			} else {
 				pop_callee_regs(&prog, callee_regs_used);
+				if (arena_vm_start)
+					pop_r12(&prog);
 			}
 			EMIT1(0xC9);         /* leave */
 			emit_return(&prog, image + addrs[i - 1] + (prog - temp));
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 42f22bc881f0..a0d737bb86d1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1460,6 +1460,7 @@ struct bpf_prog_aux {
 	bool xdp_has_frags;
 	bool exception_cb;
 	bool exception_boundary;
+	struct bpf_arena *arena;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
 	/* function name for valid attach_btf_id */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fee070b9826e..cd76d43412d0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -72,6 +72,9 @@ struct ctl_table_header;
 /* unused opcode to mark special ldsx instruction. Same as BPF_IND */
 #define BPF_PROBE_MEMSX	0x40
 
+/* unused opcode to mark special load instruction. Same as BPF_MSH */
+#define BPF_PROBE_MEM32	0xa0
+
 /* unused opcode to mark call to interpreter with arguments */
 #define BPF_CALL_ARGS	0xe0
 
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 07/16] bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 06/16] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 08/16] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

LLVM generates bpf_cast_kern and bpf_cast_user instructions while translating
pointers with __attribute__((address_space(1))).

rX = cast_kern(rY) is processed by the verifier and converted to a
normal 32-bit move: wX = wY

bpf_cast_user has to be converted by the JIT.

rX = cast_user(rY) is

aux_reg = upper_32_bits of arena->user_vm_start
aux_reg <<= 32
wX = wY // clear upper 32 bits of dst register
if (wX) // if not zero add upper bits of user_vm_start
  rX |= aux_reg

JIT can do it more efficiently:

mov dst_reg32, src_reg32  // 32-bit move
shl dst_reg, 32
or dst_reg, user_vm_start
rol dst_reg, 32
xor r11, r11
test dst_reg32, dst_reg32 // check if lower 32 bits are zero
cmove r11, dst_reg	  // if so, set dst_reg to zero
			  // Intel swapped src/dst register encoding in CMOVcc
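
In plain C, what that sequence computes is roughly the following (illustrative
sketch only, not part of this patch; 'user_vm_start' is the address user space
got back from mmap()-ing the arena):

  static unsigned long long cast_user_model(unsigned long long user_vm_start,
                                            unsigned long long src)
  {
          unsigned int lo32 = (unsigned int)src;  /* wX = wY clears the upper 32 bits */

          if (!lo32)                              /* NULL stays NULL */
                  return 0;
          /* lower 32 bits from src, upper 32 bits from user_vm_start */
          return (user_vm_start & 0xffffffff00000000ULL) | lo32;
  }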

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 41 ++++++++++++++++++++++++++++++++++++-
 include/linux/filter.h      |  1 +
 kernel/bpf/core.c           |  5 +++++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 883b7f604b9a..a042ed57af7b 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1272,13 +1272,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
-	u64 arena_vm_start;
+	u64 arena_vm_start, user_vm_start;
 	int i, excnt = 0;
 	int ilen, proglen = 0;
 	u8 *prog = temp;
 	int err;
 
 	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
+	user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
 
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
@@ -1346,6 +1347,39 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 			break;
 
 		case BPF_ALU64 | BPF_MOV | BPF_X:
+			if (insn->off == BPF_ARENA_CAST_USER) {
+				if (dst_reg != src_reg)
+					/* 32-bit mov */
+					emit_mov_reg(&prog, false, dst_reg, src_reg);
+				/* shl dst_reg, 32 */
+				maybe_emit_1mod(&prog, dst_reg, true);
+				EMIT3(0xC1, add_1reg(0xE0, dst_reg), 32);
+
+				/* or dst_reg, user_vm_start */
+				maybe_emit_1mod(&prog, dst_reg, true);
+				if (is_axreg(dst_reg))
+					EMIT1_off32(0x0D,  user_vm_start >> 32);
+				else
+					EMIT2_off32(0x81, add_1reg(0xC8, dst_reg),  user_vm_start >> 32);
+
+				/* rol dst_reg, 32 */
+				maybe_emit_1mod(&prog, dst_reg, true);
+				EMIT3(0xC1, add_1reg(0xC0, dst_reg), 32);
+
+				/* xor r11, r11 */
+				EMIT3(0x4D, 0x31, 0xDB);
+
+				/* test dst_reg32, dst_reg32; check if lower 32-bit are zero */
+				maybe_emit_mod(&prog, dst_reg, dst_reg, false);
+				EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
+
+				/* cmove r11, dst_reg; if so, set dst_reg to zero */
+				/* WARNING: Intel swapped src/dst register encoding in CMOVcc !!! */
+				maybe_emit_mod(&prog, AUX_REG, dst_reg, true);
+				EMIT3(0x0F, 0x44, add_2reg(0xC0, AUX_REG, dst_reg));
+				break;
+			}
+			fallthrough;
 		case BPF_ALU | BPF_MOV | BPF_X:
 			if (insn->off == 0)
 				emit_mov_reg(&prog,
@@ -3424,6 +3458,11 @@ void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
 	}
 }
 
+bool bpf_jit_supports_arena(void)
+{
+	return true;
+}
+
 bool bpf_jit_supports_ptr_xchg(void)
 {
 	return true;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cd76d43412d0..78ea63002531 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -959,6 +959,7 @@ bool bpf_jit_supports_kfunc_call(void);
 bool bpf_jit_supports_far_kfunc_call(void);
 bool bpf_jit_supports_exceptions(void);
 bool bpf_jit_supports_ptr_xchg(void);
+bool bpf_jit_supports_arena(void);
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
 bool bpf_helper_changes_pkt_data(void *func);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2539d9bfe369..2829077f0461 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2926,6 +2926,11 @@ bool __weak bpf_jit_supports_far_kfunc_call(void)
 	return false;
 }
 
+bool __weak bpf_jit_supports_arena(void)
+{
+	return false;
+}
+
 /* Return TRUE if the JIT backend satisfies the following two conditions:
  * 1) JIT backend supports atomic_xchg() on pointer-sized words.
  * 2) Under the specific arch, the implementation of xchg() is the same
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 08/16] bpf: Recognize cast_kern/user instructions in the verifier.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 07/16] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 09/16] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

rX = bpf_cast_kern(rY, addr_space) tells the verifier that rX->type = PTR_TO_ARENA.
Any further operations on a PTR_TO_ARENA register have to stay in the 32-bit domain.

The verifier will mark loads/stores through PTR_TO_ARENA with PROBE_MEM32.
The JIT will generate them as kern_vm_start + 32bit_addr memory accesses.

rX = bpf_cast_user(rY, addr_space) tells the verifier that rX->type = unknown scalar.
If arena->map_flags has BPF_F_NO_USER_CONV set, then cast_user is converted to a 32-bit mov as well.
Otherwise the JIT will convert it to:
  rX = (u32)rY;
  if (rX)
     rX |= arena->user_vm_start & ~(u64)~0U;
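
For reference, ~(u64)~0U above is simply the mask of the upper 32 bits; a tiny
stand-alone C check (illustration only, not part of this patch):

  #include <assert.h>
  #include <stdint.h>

  int main(void)
  {
          /* ~0U is 0xffffffff; widening it to u64 and inverting yields the
           * upper-32-bit mask that carries user_vm_start's high half.
           */
          assert(~(uint64_t)~0U == 0xffffffff00000000ULL);
          return 0;
  }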

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h          |  1 +
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/log.c             |  3 ++
 kernel/bpf/verifier.c        | 94 +++++++++++++++++++++++++++++++++---
 4 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a0d737bb86d1..82f7727e434a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -886,6 +886,7 @@ enum bpf_reg_type {
 	 * an explicit null check is required for this struct.
 	 */
 	PTR_TO_MEM,		 /* reg points to valid memory region */
+	PTR_TO_ARENA,
 	PTR_TO_BUF,		 /* reg points to a read/write buffer */
 	PTR_TO_FUNC,		 /* reg points to a bpf program function */
 	CONST_PTR_TO_DYNPTR,	 /* reg points to a const struct bpf_dynptr */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 84365e6dd85d..43c95e3e2a3c 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -547,6 +547,7 @@ struct bpf_insn_aux_data {
 	u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
 	bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */
 	bool zext_dst; /* this insn zero extends dst reg */
+	bool needs_zext; /* alu op needs to clear upper bits */
 	bool storage_get_func_atomic; /* bpf_*_storage_get() with atomic memory alloc */
 	bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
 	bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
index 594a234f122b..677076c760ff 100644
--- a/kernel/bpf/log.c
+++ b/kernel/bpf/log.c
@@ -416,6 +416,7 @@ const char *reg_type_str(struct bpf_verifier_env *env, enum bpf_reg_type type)
 		[PTR_TO_XDP_SOCK]	= "xdp_sock",
 		[PTR_TO_BTF_ID]		= "ptr_",
 		[PTR_TO_MEM]		= "mem",
+		[PTR_TO_ARENA]		= "arena",
 		[PTR_TO_BUF]		= "buf",
 		[PTR_TO_FUNC]		= "func",
 		[PTR_TO_MAP_KEY]	= "map_key",
@@ -651,6 +652,8 @@ static void print_reg_state(struct bpf_verifier_env *env,
 	}
 
 	verbose(env, "%s", reg_type_str(env, t));
+	if (t == PTR_TO_ARENA)
+		return;
 	if (t == PTR_TO_STACK) {
 		if (state->frameno != reg->frameno)
 			verbose(env, "[%d]", reg->frameno);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c77a3ab1192..6bd5a0f30f72 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4370,6 +4370,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_MEM:
 	case PTR_TO_FUNC:
 	case PTR_TO_MAP_KEY:
+	case PTR_TO_ARENA:
 		return true;
 	default:
 		return false;
@@ -5805,6 +5806,8 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_XDP_SOCK:
 		pointer_desc = "xdp_sock ";
 		break;
+	case PTR_TO_ARENA:
+		return 0;
 	default:
 		break;
 	}
@@ -6906,6 +6909,9 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 
 		if (!err && value_regno >= 0 && (rdonly_mem || t == BPF_READ))
 			mark_reg_unknown(env, regs, value_regno);
+	} else if (reg->type == PTR_TO_ARENA) {
+		if (t == BPF_READ && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
 	} else {
 		verbose(env, "R%d invalid mem access '%s'\n", regno,
 			reg_type_str(env, reg->type));
@@ -8377,6 +8383,7 @@ static int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	case PTR_TO_MEM | MEM_RINGBUF:
 	case PTR_TO_BUF:
 	case PTR_TO_BUF | MEM_RDONLY:
+	case PTR_TO_ARENA:
 	case SCALAR_VALUE:
 		return 0;
 	/* All the rest must be rejected, except PTR_TO_BTF_ID which allows
@@ -13837,6 +13844,21 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 
 	dst_reg = &regs[insn->dst_reg];
 	src_reg = NULL;
+
+	if (dst_reg->type == PTR_TO_ARENA) {
+		struct bpf_insn_aux_data *aux = cur_aux(env);
+
+		if (BPF_CLASS(insn->code) == BPF_ALU64)
+			/*
+			 * 32-bit operations zero upper bits automatically.
+			 * 64-bit operations need to be converted to 32.
+			 */
+			aux->needs_zext = true;
+
+		/* Any arithmetic operations are allowed on arena pointers */
+		return 0;
+	}
+
 	if (dst_reg->type != SCALAR_VALUE)
 		ptr_reg = dst_reg;
 	else
@@ -13954,16 +13976,17 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	} else if (opcode == BPF_MOV) {
 
 		if (BPF_SRC(insn->code) == BPF_X) {
-			if (insn->imm != 0) {
-				verbose(env, "BPF_MOV uses reserved fields\n");
-				return -EINVAL;
-			}
-
 			if (BPF_CLASS(insn->code) == BPF_ALU) {
-				if (insn->off != 0 && insn->off != 8 && insn->off != 16) {
+				if ((insn->off != 0 && insn->off != 8 && insn->off != 16) ||
+				    insn->imm) {
 					verbose(env, "BPF_MOV uses reserved fields\n");
 					return -EINVAL;
 				}
+			} else if (insn->off == BPF_ARENA_CAST_KERN || insn->off == BPF_ARENA_CAST_USER) {
+				if (!insn->imm) {
+					verbose(env, "cast_kern/user insn must have non zero imm32\n");
+					return -EINVAL;
+				}
 			} else {
 				if (insn->off != 0 && insn->off != 8 && insn->off != 16 &&
 				    insn->off != 32) {
@@ -13993,7 +14016,12 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 			struct bpf_reg_state *dst_reg = regs + insn->dst_reg;
 
 			if (BPF_CLASS(insn->code) == BPF_ALU64) {
-				if (insn->off == 0) {
+				if (insn->imm) {
+					/* off == BPF_ARENA_CAST_KERN || off == BPF_ARENA_CAST_USER */
+					mark_reg_unknown(env, regs, insn->dst_reg);
+					if (insn->off == BPF_ARENA_CAST_KERN)
+						dst_reg->type = PTR_TO_ARENA;
+				} else if (insn->off == 0) {
 					/* case: R1 = R2
 					 * copy register state to dest reg
 					 */
@@ -14059,6 +14087,9 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 						dst_reg->subreg_def = env->insn_idx + 1;
 						coerce_subreg_to_size_sx(dst_reg, insn->off >> 3);
 					}
+				} else if (src_reg->type == PTR_TO_ARENA) {
+					mark_reg_unknown(env, regs, insn->dst_reg);
+					dst_reg->type = PTR_TO_ARENA;
 				} else {
 					mark_reg_unknown(env, regs,
 							 insn->dst_reg);
@@ -16519,6 +16550,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
 		 * the same stack frame, since fp-8 in foo != fp-8 in bar
 		 */
 		return regs_exact(rold, rcur, idmap) && rold->frameno == rcur->frameno;
+	case PTR_TO_ARENA:
+		return true;
 	default:
 		return regs_exact(rold, rcur, idmap);
 	}
@@ -18235,6 +18268,27 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
 				fdput(f);
 				return -EBUSY;
 			}
+			if (map->map_type == BPF_MAP_TYPE_ARENA) {
+				if (env->prog->aux->arena) {
+					verbose(env, "Only one arena per program\n");
+					fdput(f);
+					return -EBUSY;
+				}
+				if (!env->allow_ptr_leaks || !env->bpf_capable) {
+					verbose(env, "CAP_BPF and CAP_PERFMON are required to use arena\n");
+					fdput(f);
+					return -EPERM;
+				}
+				if (!env->prog->jit_requested) {
+					verbose(env, "JIT is required to use arena\n");
+					return -EOPNOTSUPP;
+				}
+				if (!bpf_jit_supports_arena()) {
+					verbose(env, "JIT doesn't support arena\n");
+					return -EOPNOTSUPP;
+				}
+				env->prog->aux->arena = (void *)map;
+			}
 
 			fdput(f);
 next_insn:
@@ -18799,6 +18853,18 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			   insn->code == (BPF_ST | BPF_MEM | BPF_W) ||
 			   insn->code == (BPF_ST | BPF_MEM | BPF_DW)) {
 			type = BPF_WRITE;
+		} else if (insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) && insn->imm) {
+			if (insn->off == BPF_ARENA_CAST_KERN ||
+			    (((struct bpf_map *)env->prog->aux->arena)->map_flags & BPF_F_NO_USER_CONV)) {
+				/* convert to 32-bit mov that clears upper 32-bit */
+				insn->code = BPF_ALU | BPF_MOV | BPF_X;
+				/* clear off, so it's a normal 'wX = wY' from JIT pov */
+				insn->off = 0;
+			} /* else insn->off == BPF_ARENA_CAST_USER should be handled by JIT */
+			continue;
+		} else if (env->insn_aux_data[i + delta].needs_zext) {
+			/* Convert BPF_CLASS(insn->code) == BPF_ALU64 to 32-bit ALU */
+			insn->code = BPF_ALU | BPF_OP(insn->code) | BPF_SRC(insn->code);
 		} else {
 			continue;
 		}
@@ -18856,6 +18922,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 				env->prog->aux->num_exentries++;
 			}
 			continue;
+		case PTR_TO_ARENA:
+			if (BPF_MODE(insn->code) == BPF_MEMSX) {
+				verbose(env, "sign extending loads from arena are not supported yet\n");
+				return -EOPNOTSUPP;
+			}
+			insn->code = BPF_CLASS(insn->code) | BPF_PROBE_MEM32 | BPF_SIZE(insn->code);
+			env->prog->aux->num_exentries++;
+			continue;
 		default:
 			continue;
 		}
@@ -19041,13 +19115,19 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->aux->nr_linfo = prog->aux->nr_linfo;
 		func[i]->aux->jited_linfo = prog->aux->jited_linfo;
 		func[i]->aux->linfo_idx = env->subprog_info[i].linfo_idx;
+		func[i]->aux->arena = prog->aux->arena;
 		num_exentries = 0;
 		insn = func[i]->insnsi;
 		for (j = 0; j < func[i]->len; j++, insn++) {
 			if (BPF_CLASS(insn->code) == BPF_LDX &&
 			    (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
+			     BPF_MODE(insn->code) == BPF_PROBE_MEM32 ||
 			     BPF_MODE(insn->code) == BPF_PROBE_MEMSX))
 				num_exentries++;
+			if ((BPF_CLASS(insn->code) == BPF_STX ||
+			     BPF_CLASS(insn->code) == BPF_ST) &&
+			     BPF_MODE(insn->code) == BPF_PROBE_MEM32)
+				num_exentries++;
 		}
 		func[i]->aux->num_exentries = num_exentries;
 		func[i]->aux->tail_call_reachable = env->subprog_info[i].tail_call_reachable;
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 09/16] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 08/16] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 10/16] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

In global bpf functions, recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.

Note, when the verifier sees:

__weak void foo(struct bar *p)

it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct containing only scalar members.
Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".
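
For example (a sketch; 'struct foo' and 'foo_get' are made-up names,
__arg_arena is the bpf_helpers.h wrapper added in the next patch and __arena
is the address_space(1) attribute used by the selftests later in the series):

  struct foo {
          int val;
  };

  __weak int foo_get(struct foo __arena *p __arg_arena)
  {
          /* 'p' is verified as PTR_TO_ARENA instead of PTR_TO_MEM */
          return p->val;
  }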

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/btf.c      | 19 +++++++++++++++----
 kernel/bpf/verifier.c | 15 +++++++++++++++
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 82f7727e434a..401c0031090d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -715,6 +715,7 @@ enum bpf_arg_type {
 	 * on eBPF program stack
 	 */
 	ARG_PTR_TO_MEM,		/* pointer to valid memory (stack, packet, map value) */
+	ARG_PTR_TO_ARENA,
 
 	ARG_CONST_SIZE,		/* number of bytes accessed from memory */
 	ARG_CONST_SIZE_OR_ZERO,	/* number of bytes accessed from memory or 0 */
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index f7725cb6e564..6d2effb65943 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -7053,10 +7053,11 @@ static int btf_get_ptr_to_btf_id(struct bpf_verifier_log *log, int arg_idx,
 }
 
 enum btf_arg_tag {
-	ARG_TAG_CTX = 0x1,
-	ARG_TAG_NONNULL = 0x2,
-	ARG_TAG_TRUSTED = 0x4,
-	ARG_TAG_NULLABLE = 0x8,
+	ARG_TAG_CTX	 = BIT_ULL(0),
+	ARG_TAG_NONNULL  = BIT_ULL(1),
+	ARG_TAG_TRUSTED  = BIT_ULL(2),
+	ARG_TAG_NULLABLE = BIT_ULL(3),
+	ARG_TAG_ARENA	 = BIT_ULL(4),
 };
 
 /* Process BTF of a function to produce high-level expectation of function
@@ -7168,6 +7169,8 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog)
 				tags |= ARG_TAG_NONNULL;
 			} else if (strcmp(tag, "nullable") == 0) {
 				tags |= ARG_TAG_NULLABLE;
+			} else if (strcmp(tag, "arena") == 0) {
+				tags |= ARG_TAG_ARENA;
 			} else {
 				bpf_log(log, "arg#%d has unsupported set of tags\n", i);
 				return -EOPNOTSUPP;
@@ -7222,6 +7225,14 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog)
 			sub->args[i].btf_id = kern_type_id;
 			continue;
 		}
+		if (tags & ARG_TAG_ARENA) {
+			if (tags & ~ARG_TAG_ARENA) {
+				bpf_log(log, "arg#%d arena cannot be combined with any other tags\n", i);
+				return -EINVAL;
+			}
+			sub->args[i].arg_type = ARG_PTR_TO_ARENA;
+			continue;
+		}
 		if (is_global) { /* generic user data pointer */
 			u32 mem_size;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6bd5a0f30f72..07b8eec2f006 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9348,6 +9348,18 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
 				bpf_log(log, "arg#%d is expected to be non-NULL\n", i);
 				return -EINVAL;
 			}
+		} else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
+			/*
+			 * Can pass any value and the kernel won't crash, but
+			 * only PTR_TO_ARENA or SCALAR make sense. Everything
+			 * else is a bug in the bpf program. Point it out to
+			 * the user at the verification time instead of
+			 * run-time debug nightmare.
+			 */
+			if (reg->type != PTR_TO_ARENA && reg->type != SCALAR_VALUE) {
+				bpf_log(log, "R%d is not a pointer to arena or scalar.\n", regno);
+				return -EINVAL;
+			}
 		} else if (arg->arg_type == (ARG_PTR_TO_DYNPTR | MEM_RDONLY)) {
 			ret = process_dynptr_func(env, regno, -1, arg->arg_type, 0);
 			if (ret)
@@ -20321,6 +20333,9 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
 				reg->btf = bpf_get_btf_vmlinux(); /* can't fail at this point */
 				reg->btf_id = arg->btf_id;
 				reg->id = ++env->id_gen;
+			} else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
+				/* caller can pass either PTR_TO_ARENA or SCALAR */
+				mark_reg_unknown(env, regs, i);
 			} else {
 				WARN_ONCE(1, "BUG: unhandled arg#%d type %d\n",
 					  i - BPF_REG_1, arg->arg_type);
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 10/16] libbpf: Add __arg_arena to bpf_helpers.h
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (8 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 09/16] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena Alexei Starovoitov
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add __arg_arena to bpf_helpers.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/bpf_helpers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 79eaa581be98..9c777c21da28 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -192,6 +192,7 @@ enum libbpf_tristate {
 #define __arg_nonnull __attribute((btf_decl_tag("arg:nonnull")))
 #define __arg_nullable __attribute((btf_decl_tag("arg:nullable")))
 #define __arg_trusted __attribute((btf_decl_tag("arg:trusted")))
+#define __arg_arena __attribute((btf_decl_tag("arg:arena")))
 
 #ifndef ___bpf_concat
 #define ___bpf_concat(a, b) a ## b
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (9 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 10/16] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-08  1:15   ` Andrii Nakryiko
  2024-02-06 22:04 ` [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

mmap() bpf_arena right after creation, since the kernel needs to
remember the address returned from mmap. This is user_vm_start.
LLVM will generate bpf_arena_cast_user() instructions where
necessary and the JIT will add the upper 32 bits of user_vm_start
to such pointers.

Use the traditional map->value_size * map->max_entries to calculate the mmap size,
though it's not the best fit.

Also don't set BTF at bpf_arena creation time, since the arena map doesn't support it.
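
From the user-space side the mapping can then be consumed like this (sketch
only; the 'arena_prog' skeleton and 'arena' map names are hypothetical, the
bpf_map__initial_value() pattern mirrors the selftests later in this series):

  #include <bpf/libbpf.h>
  #include "arena_prog.skel.h"    /* hypothetical skeleton */

  static void *arena_base(struct arena_prog *skel, size_t *sz)
  {
          /* libbpf already mmap()-ed the arena at map creation time;
           * this simply returns that address (user_vm_start) and size.
           */
          return bpf_map__initial_value(skel->maps.arena, sz);
  }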

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/libbpf.c        | 18 ++++++++++++++++++
 tools/lib/bpf/libbpf_probes.c |  6 ++++++
 2 files changed, 24 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 01f407591a92..c5ce5946dc6d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_BLOOM_FILTER]		= "bloom_filter",
 	[BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
 	[BPF_MAP_TYPE_CGRP_STORAGE]		= "cgrp_storage",
+	[BPF_MAP_TYPE_ARENA]			= "arena",
 };
 
 static const char * const prog_type_name[] = {
@@ -4852,6 +4853,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_QUEUE:
 	case BPF_MAP_TYPE_STACK:
+	case BPF_MAP_TYPE_ARENA:
 		create_attr.btf_fd = 0;
 		create_attr.btf_key_type_id = 0;
 		create_attr.btf_value_type_id = 0;
@@ -4908,6 +4910,22 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 	if (map->fd == map_fd)
 		return 0;
 
+	if (def->type == BPF_MAP_TYPE_ARENA) {
+		size_t mmap_sz;
+
+		mmap_sz = bpf_map_mmap_sz(def->value_size, def->max_entries);
+		map->mmaped = mmap((void *)map->map_extra, mmap_sz, PROT_READ | PROT_WRITE,
+				   map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
+				   map_fd, 0);
+		if (map->mmaped == MAP_FAILED) {
+			err = -errno;
+			map->mmaped = NULL;
+			pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
+				bpf_map__name(map), err);
+			return err;
+		}
+	}
+
 	/* Keep placeholder FD value but now point it to the BPF map object.
 	 * This way everything that relied on this map's FD (e.g., relocated
 	 * ldimm64 instructions) will stay valid and won't need adjustments.
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index ee9b1dbea9eb..cbc7f4c09060 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -338,6 +338,12 @@ static int probe_map_create(enum bpf_map_type map_type)
 		key_size = 0;
 		max_entries = 1;
 		break;
+	case BPF_MAP_TYPE_ARENA:
+		key_size	= sizeof(__u64);
+		value_size	= sizeof(__u64);
+		opts.map_extra	= 0; /* can mmap() at any address */
+		opts.map_flags	= BPF_F_MMAPABLE;
+		break;
 	case BPF_MAP_TYPE_HASH:
 	case BPF_MAP_TYPE_ARRAY:
 	case BPF_MAP_TYPE_PROG_ARRAY:
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (10 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-08  1:16   ` Andrii Nakryiko
  2024-02-06 22:04 ` [PATCH bpf-next 13/16] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The __uint() macro that is used to specify map attributes like:
  __uint(type, BPF_MAP_TYPE_ARRAY);
  __uint(map_flags, BPF_F_MMAPABLE);
is limited to 32 bits, since BTF_KIND_ARRAY has a u32 "number of elements" field.

Introduce a __ulong() macro that allows specifying values bigger than 32 bits.
In the map definition, "map_extra" is the only u64 field.
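
Usage mirrors the arena map definition in the selftests later in this series, e.g.:

  struct {
          __uint(type, BPF_MAP_TYPE_ARENA);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 1u << 24);   /* fits in 32 bits -> __uint */
          __ulong(map_extra, 2ull << 44);  /* 64-bit value -> __ulong */
          __type(key, __u64);
          __type(value, __u64);
  } arena SEC(".maps");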

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/bpf_helpers.h |  1 +
 tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
 2 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 9c777c21da28..fb909fc6866d 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -13,6 +13,7 @@
 #define __uint(name, val) int (*name)[val]
 #define __type(name, val) typeof(val) *name
 #define __array(name, val) typeof(val) *name[]
+#define __ulong(name, val) enum name##__enum { name##__value = val } name
 
 /*
  * Helper macro to place programs, maps, license in
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index c5ce5946dc6d..a8c89b2315cd 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2229,6 +2229,39 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf,
 	return true;
 }
 
+static bool get_map_field_long(const char *map_name, const struct btf *btf,
+			       const struct btf_member *m, __u64 *res)
+{
+	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
+	const char *name = btf__name_by_offset(btf, m->name_off);
+
+	if (btf_is_ptr(t))
+		return false;
+
+	if (!btf_is_enum(t) && !btf_is_enum64(t)) {
+		pr_warn("map '%s': attr '%s': expected enum or enum64, got %s.\n",
+			map_name, name, btf_kind_str(t));
+		return false;
+	}
+
+	if (btf_vlen(t) != 1) {
+		pr_warn("map '%s': attr '%s': invalid __ulong\n",
+			map_name, name);
+		return false;
+	}
+
+	if (btf_is_enum(t)) {
+		const struct btf_enum *e = btf_enum(t);
+
+		*res = e->val;
+	} else {
+		const struct btf_enum64 *e = btf_enum64(t);
+
+		*res = btf_enum64_value(e);
+	}
+	return true;
+}
+
 static int pathname_concat(char *buf, size_t buf_sz, const char *path, const char *name)
 {
 	int len;
@@ -2462,10 +2495,15 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
 			map_def->pinning = val;
 			map_def->parts |= MAP_DEF_PINNING;
 		} else if (strcmp(name, "map_extra") == 0) {
-			__u32 map_extra;
+			__u64 map_extra;
 
-			if (!get_map_field_int(map_name, btf, m, &map_extra))
-				return -EINVAL;
+			if (!get_map_field_long(map_name, btf, m, &map_extra)) {
+				__u32 map_extra_u32;
+
+				if (!get_map_field_int(map_name, btf, m, &map_extra_u32))
+					return -EINVAL;
+				map_extra = map_extra_u32;
+			}
 			map_def->map_extra = map_extra;
 			map_def->parts |= MAP_DEF_MAP_EXTRA;
 		} else {
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 13/16] bpf: Tell bpf programs kernel's PAGE_SIZE
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (11 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 14/16] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

vmlinux BTF includes all kernel enums.
Add __PAGE_SIZE = PAGE_SIZE enum, so that bpf programs
that include vmlinux.h can easily access it.
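
For example, the arena selftest headers later in this series pick it up as:

  #ifndef PAGE_SIZE
  #define PAGE_SIZE __PAGE_SIZE   /* __PAGE_SIZE comes from vmlinux.h */
  #endif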

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2829077f0461..3aa3f56a4310 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -88,13 +88,18 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns
 	return NULL;
 }
 
+/* tell bpf programs that include vmlinux.h kernel's PAGE_SIZE */
+enum page_size_enum {
+	__PAGE_SIZE = PAGE_SIZE
+};
+
 struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags)
 {
 	gfp_t gfp_flags = bpf_memcg_flags(GFP_KERNEL | __GFP_ZERO | gfp_extra_flags);
 	struct bpf_prog_aux *aux;
 	struct bpf_prog *fp;
 
-	size = round_up(size, PAGE_SIZE);
+	size = round_up(size, __PAGE_SIZE);
 	fp = __vmalloc(size, gfp_flags);
 	if (fp == NULL)
 		return NULL;
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 14/16] bpf: Add helper macro bpf_arena_cast()
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (12 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 13/16] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-06 22:04 ` [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce the helper macro bpf_arena_cast() that emits an
rX = rX
instruction with off = BPF_ARENA_CAST_KERN or off = BPF_ARENA_CAST_USER
and encodes the address_space into imm32.

It's useful with older LLVM versions that don't emit this insn automatically.
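
The selftest headers later in this series wrap it like so when LLVM's
__BPF_FEATURE_ARENA_CAST is not available:

  #define cast_kern(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_KERN, 1)
  #define cast_user(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_USER, 1)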

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/bpf_experimental.h  | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 0d749006d107..e73b7d48439f 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -331,6 +331,47 @@ l_true:												\
 	asm volatile("%[reg]=%[reg]"::[reg]"r"((short)var))
 #endif
 
+/* emit instruction: rX=rX .off = mode .imm32 = address_space */
+#ifndef bpf_arena_cast
+#define bpf_arena_cast(var, mode, addr_space)	\
+	({					\
+	typeof(var) __var = var;		\
+	asm volatile(".byte 0xBF;		\
+		     .ifc %[reg], r0;		\
+		     .byte 0x00;		\
+		     .endif;			\
+		     .ifc %[reg], r1;		\
+		     .byte 0x11;		\
+		     .endif;			\
+		     .ifc %[reg], r2;		\
+		     .byte 0x22;		\
+		     .endif;			\
+		     .ifc %[reg], r3;		\
+		     .byte 0x33;		\
+		     .endif;			\
+		     .ifc %[reg], r4;		\
+		     .byte 0x44;		\
+		     .endif;			\
+		     .ifc %[reg], r5;		\
+		     .byte 0x55;		\
+		     .endif;			\
+		     .ifc %[reg], r6;		\
+		     .byte 0x66;		\
+		     .endif;			\
+		     .ifc %[reg], r7;		\
+		     .byte 0x77;		\
+		     .endif;			\
+		     .ifc %[reg], r8;		\
+		     .byte 0x88;		\
+		     .endif;			\
+		     .ifc %[reg], r9;		\
+		     .byte 0x99;		\
+		     .endif;			\
+		     .short %[off]; .long %[as]"	\
+		     :: [reg]"r"(__var), [off]"i"(mode), [as]"i"(addr_space)); __var; \
+	})
+#endif
+
 /* Description
  *	Assert that a conditional expression is true.
  * Returns
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (13 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 14/16] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-07 17:04   ` Eduard Zingerman
  2024-02-06 22:04 ` [PATCH bpf-next 16/16] selftests/bpf: Add bpf_arena_htab test Alexei Starovoitov
  2024-02-07 12:34 ` [PATCH bpf-next 00/16] bpf: Introduce BPF arena Donald Hunter
  16 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_arena_common.h - common helpers and macros
bpf_arena_alloc.h - implements a page_frag allocator as a bpf program.
bpf_arena_list.h - a doubly linked list implemented as a bpf program.

These headers are compiled both as a bpf program and as native C code.
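
At a glance, the API exercised by the test boils down to the following sketch
(condensed from progs/arena_list.c below; it assumes the 'arena' map, the
global 'list_head' and 'struct elem' defined there, and the actual test uses
bpf_for() instead of a plain loop):

  static int arena_list_example(int cnt)
  {
          struct elem __arena *n;
          int i, sum = 0;

          for (i = 0; i < cnt; i++) {
                  n = bpf_alloc(sizeof(*n));     /* page_frag allocation */
                  if (!n)
                          break;
                  n->value = i;
                  list_add_head(&n->node, list_head);
          }

          list_for_each_entry(n, list_head, node)
                  sum += n->value;
          return sum;
  }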

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/DENYLIST.aarch64  |  1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 tools/testing/selftests/bpf/bpf_arena_alloc.h | 58 +++++++++++
 .../testing/selftests/bpf/bpf_arena_common.h  | 70 ++++++++++++++
 tools/testing/selftests/bpf/bpf_arena_list.h  | 95 +++++++++++++++++++
 .../selftests/bpf/prog_tests/arena_list.c     | 65 +++++++++++++
 .../testing/selftests/bpf/progs/arena_list.c  | 75 +++++++++++++++
 7 files changed, 365 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
index 5c2cc7e8c5d0..7759cff95b6f 100644
--- a/tools/testing/selftests/bpf/DENYLIST.aarch64
+++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
@@ -11,3 +11,4 @@ fill_link_info/kprobe_multi_link_info            # bpf_program__attach_kprobe_mu
 fill_link_info/kretprobe_multi_link_info         # bpf_program__attach_kprobe_multi_opts unexpected error: -95
 fill_link_info/kprobe_multi_invalid_ubuff        # bpf_program__attach_kprobe_multi_opts unexpected error: -95
 missed/kprobe_recursion                          # missed_kprobe_recursion__attach unexpected error: -95 (errno 95)
+arena						 # JIT does not support arena
diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 1a63996c0304..11f7b612f967 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -3,3 +3,4 @@
 exceptions				 # JIT does not support calling kfunc bpf_throw				       (exceptions)
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
 stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
+arena					 # JIT does not support arena
diff --git a/tools/testing/selftests/bpf/bpf_arena_alloc.h b/tools/testing/selftests/bpf/bpf_arena_alloc.h
new file mode 100644
index 000000000000..0f4cb399b4c7
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_alloc.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+#include "bpf_arena_common.h"
+
+#ifndef __round_mask
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+#endif
+#ifndef round_up
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+#endif
+
+void __arena *cur_page;
+int cur_offset;
+
+/* Simple page_frag allocator */
+static inline void __arena* bpf_alloc(unsigned int size)
+{
+	__u64 __arena *obj_cnt;
+	void __arena *page = cur_page;
+	int offset;
+
+	size = round_up(size, 8);
+	if (size >= PAGE_SIZE - 8)
+		return NULL;
+	if (!page) {
+refill:
+		page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+		if (!page)
+			return NULL;
+		cast_kern(page);
+		cur_page = page;
+		cur_offset = PAGE_SIZE - 8;
+		obj_cnt = page + PAGE_SIZE - 8;
+		*obj_cnt = 0;
+	} else {
+		cast_kern(page);
+		obj_cnt = page + PAGE_SIZE - 8;
+	}
+
+	offset = cur_offset - size;
+	if (offset < 0)
+		goto refill;
+
+	(*obj_cnt)++;
+	cur_offset = offset;
+	return page + offset;
+}
+
+static inline void bpf_free(void __arena *addr)
+{
+	__u64 __arena *obj_cnt;
+
+	addr = (void __arena *)(((long)addr) & ~(PAGE_SIZE - 1));
+	obj_cnt = addr + PAGE_SIZE - 8;
+	if (--(*obj_cnt) == 0)
+		bpf_arena_free_pages(&arena, addr, 1);
+}
diff --git a/tools/testing/selftests/bpf/bpf_arena_common.h b/tools/testing/selftests/bpf/bpf_arena_common.h
new file mode 100644
index 000000000000..07849d502f40
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_common.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+
+#ifndef WRITE_ONCE
+#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *) &(x)) = (val))
+#endif
+
+#ifndef NUMA_NO_NODE
+#define	NUMA_NO_NODE	(-1)
+#endif
+
+#ifndef arena_container_of
+#define arena_container_of(ptr, type, member)			\
+	({							\
+		void __arena *__mptr = (void __arena *)(ptr);	\
+		((type *)(__mptr - offsetof(type, member)));	\
+	})
+#endif
+
+#ifdef __BPF__ /* when compiled as bpf program */
+
+#ifndef PAGE_SIZE
+#define PAGE_SIZE __PAGE_SIZE
+/*
+ * for older kernels try sizeof(struct genradix_node)
+ * or flexible:
+ * static inline long __bpf_page_size(void) {
+ *   return bpf_core_enum_value(enum page_size_enum___l, __PAGE_SIZE___l) ?: sizeof(struct genradix_node);
+ * }
+ * but generated code is not great.
+ */
+#endif
+
+#if defined(__BPF_FEATURE_ARENA_CAST) && !defined(BPF_ARENA_FORCE_ASM)
+#define __arena __attribute__((address_space(1)))
+#define cast_kern(ptr) /* nop for bpf prog. emitted by LLVM */
+#define cast_user(ptr) /* nop for bpf prog. emitted by LLVM */
+#else
+#define __arena
+#define cast_kern(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_KERN, 1)
+#define cast_user(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_USER, 1)
+#endif
+
+void __arena* bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt,
+				    int node_id, __u64 flags) __ksym __weak;
+void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak;
+
+#else /* when compiled as user space code */
+
+#define __arena
+#define __arg_arena
+#define cast_kern(ptr) /* nop for user space */
+#define cast_user(ptr) /* nop for user space */
+__weak char arena[1];
+
+#ifndef offsetof
+#define offsetof(type, member)  ((unsigned long)&((type *)0)->member)
+#endif
+
+static inline void __arena* bpf_arena_alloc_pages(void *map, void *addr, __u32 page_cnt,
+						  int node_id, __u64 flags)
+{
+	return NULL;
+}
+static inline void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt)
+{
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/bpf_arena_list.h b/tools/testing/selftests/bpf/bpf_arena_list.h
new file mode 100644
index 000000000000..9f34142b0f65
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_list.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+#include "bpf_arena_common.h"
+
+struct arena_list_node;
+
+typedef struct arena_list_node __arena arena_list_node_t;
+
+struct arena_list_node {
+	arena_list_node_t *next;
+	arena_list_node_t * __arena *pprev;
+};
+
+struct arena_list_head {
+	struct arena_list_node __arena *first;
+};
+typedef struct arena_list_head __arena arena_list_head_t;
+
+#define list_entry(ptr, type, member) arena_container_of(ptr, type, member)
+
+#define list_entry_safe(ptr, type, member) \
+	({ typeof(*ptr) * ___ptr = (ptr); \
+	 ___ptr ? ({ cast_kern(___ptr); list_entry(___ptr, type, member); }) : NULL; \
+	 })
+
+#ifndef __BPF__
+static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) {	return NULL; }
+static inline void bpf_iter_num_destroy(struct bpf_iter_num *) {}
+static inline bool bpf_iter_num_next(struct bpf_iter_num *) { return true; }
+#endif
+
+/* Safely walk link list of up to 1M elements. Deletion of elements is allowed. */
+#define list_for_each_entry(pos, head, member)						\
+	for (struct bpf_iter_num ___it __attribute__((aligned(8),			\
+						      cleanup(bpf_iter_num_destroy))),	\
+			* ___tmp = (			\
+				bpf_iter_num_new(&___it, 0, (1000000)),			\
+				pos = list_entry_safe((head)->first,			\
+						      typeof(*(pos)), member),		\
+				(void)bpf_iter_num_destroy, (void *)0);			\
+	     bpf_iter_num_next(&___it) && pos &&				\
+		({ ___tmp = (void *)pos->member.next; 1; });			\
+	     pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))
+
+static inline void list_add_head(arena_list_node_t *n, arena_list_head_t *h)
+{
+	arena_list_node_t *first = h->first, * __arena *tmp;
+
+	cast_user(first);
+	cast_kern(n);
+	WRITE_ONCE(n->next, first);
+	cast_kern(first);
+	if (first) {
+		tmp = &n->next;
+		cast_user(tmp);
+		WRITE_ONCE(first->pprev, tmp);
+	}
+	cast_user(n);
+	WRITE_ONCE(h->first, n);
+
+	tmp = &h->first;
+	cast_user(tmp);
+	cast_kern(n);
+	WRITE_ONCE(n->pprev, tmp);
+}
+
+static inline void __list_del(arena_list_node_t *n)
+{
+	arena_list_node_t *next = n->next, *tmp;
+	arena_list_node_t * __arena *pprev = n->pprev;
+
+	cast_user(next);
+	cast_kern(pprev);
+	tmp = *pprev;
+	cast_kern(tmp);
+	WRITE_ONCE(tmp, next);
+	if (next) {
+		cast_user(pprev);
+		cast_kern(next);
+		WRITE_ONCE(next->pprev, pprev);
+	}
+}
+
+#define POISON_POINTER_DELTA 0
+
+#define LIST_POISON1  ((void __arena *) 0x100 + POISON_POINTER_DELTA)
+#define LIST_POISON2  ((void __arena *) 0x122 + POISON_POINTER_DELTA)
+
+static inline void list_del(arena_list_node_t *n)
+{
+	__list_del(n);
+	n->next = LIST_POISON1;
+	n->pprev = LIST_POISON2;
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
new file mode 100644
index 000000000000..ca3ce8abefc4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <sys/mman.h>
+#include <network_helpers.h>
+
+#define PAGE_SIZE 4096
+
+#include "bpf_arena_list.h"
+#include "arena_list.skel.h"
+
+struct elem {
+	struct arena_list_node node;
+	__u64 value;
+};
+
+static int list_sum(struct arena_list_head *head)
+{
+	struct elem __arena *n;
+	int sum = 0;
+
+	list_for_each_entry(n, head, node)
+		sum += n->value;
+	return sum;
+}
+
+static void test_arena_list_add_del(int cnt)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct arena_list *skel;
+	int expected_sum = (u64)cnt * (cnt - 1) / 2;
+	int ret, sum;
+
+	skel = arena_list__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+		return;
+
+	skel->bss->cnt = cnt;
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
+	ASSERT_OK(ret, "ret_add");
+	ASSERT_OK(opts.retval, "retval");
+	if (skel->bss->skip) {
+		printf("%s:SKIP:compiler doesn't support arena_cast\n", __func__);
+		test__skip();
+		goto out;
+	}
+	sum = list_sum(skel->bss->list_head);
+	ASSERT_EQ(sum, expected_sum, "sum of list elems");
+
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_del), &opts);
+	ASSERT_OK(ret, "ret_del");
+	sum = list_sum(skel->bss->list_head);
+	ASSERT_EQ(sum, 0, "sum of list elems after del");
+	ASSERT_EQ(skel->bss->list_sum, expected_sum, "sum of list elems computed by prog");
+out:
+	arena_list__destroy(skel);
+}
+
+void test_arena_list(void)
+{
+	if (test__start_subtest("arena_list_1"))
+		test_arena_list_add_del(1);
+	if (test__start_subtest("arena_list_1000"))
+		test_arena_list_add_del(1000);
+}
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
new file mode 100644
index 000000000000..1acdec9dadde
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARENA);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__uint(max_entries, 1u << 24); /* max_entries * value_size == size of mmap() region */
+	__ulong(map_extra, 2ull << 44); /* start of mmap() region */
+	__type(key, __u64);
+	__type(value, __u64);
+} arena SEC(".maps");
+
+#include "bpf_arena_alloc.h"
+#include "bpf_arena_list.h"
+
+struct elem {
+	struct arena_list_node node;
+	__u64 value;
+};
+
+struct arena_list_head __arena *list_head;
+int list_sum;
+int cnt;
+bool skip = false;
+
+SEC("syscall")
+int arena_list_add(void *ctx)
+{
+#ifdef __BPF_FEATURE_ARENA_CAST
+	__u64 i;
+
+	list_head = bpf_alloc(sizeof(*list_head));
+
+	bpf_for(i, 0, cnt) {
+		struct elem __arena *n = bpf_alloc(sizeof(*n));
+
+		n->value = i;
+		list_add_head(&n->node, list_head);
+	}
+#else
+	skip = true;
+#endif
+	return 0;
+}
+
+SEC("syscall")
+int arena_list_del(void *ctx)
+{
+#ifdef __BPF_FEATURE_ARENA_CAST
+	struct elem __arena *n;
+	int sum = 0;
+
+	list_for_each_entry(n, list_head, node) {
+		sum += n->value;
+		list_del(&n->node);
+		bpf_free(n);
+	}
+	list_sum = sum;
+
+	/* triple free will not crash the kernel */
+	bpf_free(list_head);
+	bpf_free(list_head);
+	bpf_free(list_head);
+#else
+	skip = true;
+#endif
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH bpf-next 16/16] selftests/bpf: Add bpf_arena_htab test.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (14 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
@ 2024-02-06 22:04 ` Alexei Starovoitov
  2024-02-07 12:34 ` [PATCH bpf-next 00/16] bpf: Introduce BPF arena Donald Hunter
  16 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-06 22:04 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, martin.lau, memxor, eddyz87, tj, brho, hannes,
	linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_arena_htab.h - a hash table implemented as a bpf program
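
The resulting API, as used by the test below, boils down to this sketch
(it assumes the 'arena' map plus the bpf_alloc()/cast_kern() helpers from the
selftest headers added earlier in the series):

  static int htab_example(void)
  {
          htab_t *htab = bpf_alloc(sizeof(*htab));

          if (!htab)
                  return -1;
          cast_kern(htab);
          htab_init(htab);
          htab_update_elem(htab, 1, 2);       /* key 1 -> value 2 */
          return htab_lookup_elem(htab, 1);   /* returns 2 */
  }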

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++++++++++++++++
 .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++++++++++++++
 .../testing/selftests/bpf/progs/arena_htab.c  |  48 +++++++++
 .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
 4 files changed, 241 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c

diff --git a/tools/testing/selftests/bpf/bpf_arena_htab.h b/tools/testing/selftests/bpf/bpf_arena_htab.h
new file mode 100644
index 000000000000..acc01a876668
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_htab.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+#include <errno.h>
+#include "bpf_arena_alloc.h"
+#include "bpf_arena_list.h"
+
+struct htab_bucket {
+	struct arena_list_head head;
+};
+typedef struct htab_bucket __arena htab_bucket_t;
+
+struct htab {
+	htab_bucket_t *buckets;
+	int n_buckets;
+};
+typedef struct htab __arena htab_t;
+
+static inline htab_bucket_t *__select_bucket(htab_t *htab, __u32 hash)
+{
+	htab_bucket_t *b = htab->buckets;
+
+	cast_kern(b);
+	return &b[hash & (htab->n_buckets - 1)];
+}
+
+static inline arena_list_head_t *select_bucket(htab_t *htab, __u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
+struct hashtab_elem {
+	int hash;
+	int key;
+	int value;
+	struct arena_list_node hash_node;
+};
+typedef struct hashtab_elem __arena hashtab_elem_t;
+
+static hashtab_elem_t *lookup_elem_raw(arena_list_head_t *head, __u32 hash, int key)
+{
+	hashtab_elem_t *l;
+
+	list_for_each_entry(l, head, hash_node)
+		if (l->hash == hash && l->key == key)
+			return l;
+
+	return NULL;
+}
+
+static int htab_hash(int key)
+{
+	return key;
+}
+
+__weak int htab_lookup_elem(htab_t *htab __arg_arena, int key)
+{
+	hashtab_elem_t *l_old;
+	arena_list_head_t *head;
+
+	cast_kern(htab);
+	head = select_bucket(htab, key);
+	l_old = lookup_elem_raw(head, htab_hash(key), key);
+	if (l_old)
+		return l_old->value;
+	return 0;
+}
+
+__weak int htab_update_elem(htab_t *htab __arg_arena, int key, int value)
+{
+	hashtab_elem_t *l_new = NULL, *l_old;
+	arena_list_head_t *head;
+
+	cast_kern(htab);
+	head = select_bucket(htab, key);
+	l_old = lookup_elem_raw(head, htab_hash(key), key);
+
+	l_new = bpf_alloc(sizeof(*l_new));
+	if (!l_new)
+		return -ENOMEM;
+	l_new->key = key;
+	l_new->hash = htab_hash(key);
+	l_new->value = value;
+
+	list_add_head(&l_new->hash_node, head);
+	if (l_old) {
+		list_del(&l_old->hash_node);
+		bpf_free(l_old);
+	}
+	return 0;
+}
+
+void htab_init(htab_t *htab)
+{
+	void __arena *buckets = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+
+	cast_user(buckets);
+	htab->buckets = buckets;
+	htab->n_buckets = 2 * PAGE_SIZE / sizeof(struct htab_bucket);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/arena_htab.c b/tools/testing/selftests/bpf/prog_tests/arena_htab.c
new file mode 100644
index 000000000000..0766702de846
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/arena_htab.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <sys/mman.h>
+#include <network_helpers.h>
+
+#include "arena_htab_asm.skel.h"
+#include "arena_htab.skel.h"
+
+#define PAGE_SIZE 4096
+
+#include "bpf_arena_htab.h"
+
+static void test_arena_htab_common(struct htab *htab)
+{
+	int i;
+
+	printf("htab %p buckets %p n_buckets %d\n", htab, htab->buckets, htab->n_buckets);
+	ASSERT_OK_PTR(htab->buckets, "htab->buckets shouldn't be NULL");
+	for (i = 0; htab->buckets && i < 16; i += 4) {
+		/*
+		 * Walk htab buckets and link lists since all pointers are correct,
+		 * though they were written by bpf program.
+		 */
+		int val = htab_lookup_elem(htab, i);
+
+		ASSERT_EQ(i, val, "key == value");
+	}
+}
+
+static void test_arena_htab_llvm(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct arena_htab *skel;
+	struct htab *htab;
+	size_t arena_sz;
+	void *area;
+	int ret;
+
+	skel = arena_htab__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_htab__open_and_load"))
+		return;
+
+	area = bpf_map__initial_value(skel->maps.arena, &arena_sz);
+	/* fault-in a page with pgoff == 0 as sanity check */
+	*(volatile int *)area = 0x55aa;
+
+	/* bpf prog will allocate more pages */
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_htab_llvm), &opts);
+	ASSERT_OK(ret, "ret");
+	ASSERT_OK(opts.retval, "retval");
+	if (skel->bss->skip) {
+		printf("%s:SKIP:compiler doesn't support arena_cast\n", __func__);
+		test__skip();
+		goto out;
+	}
+	htab = skel->bss->htab_for_user;
+	test_arena_htab_common(htab);
+out:
+	arena_htab__destroy(skel);
+}
+
+static void test_arena_htab_asm(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct arena_htab_asm *skel;
+	struct htab *htab;
+	int ret;
+
+	skel = arena_htab_asm__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_htab_asm__open_and_load"))
+		return;
+
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_htab_asm), &opts);
+	ASSERT_OK(ret, "ret");
+	ASSERT_OK(opts.retval, "retval");
+	htab = skel->bss->htab_for_user;
+	test_arena_htab_common(htab);
+	arena_htab_asm__destroy(skel);
+}
+
+void test_arena_htab(void)
+{
+	if (test__start_subtest("arena_htab_llvm"))
+		test_arena_htab_llvm();
+	if (test__start_subtest("arena_htab_asm"))
+		test_arena_htab_asm();
+}
diff --git a/tools/testing/selftests/bpf/progs/arena_htab.c b/tools/testing/selftests/bpf/progs/arena_htab.c
new file mode 100644
index 000000000000..51a9eeb3df5a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_htab.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARENA);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__uint(max_entries, 1u << 20); /* max_entries * value_size == size of mmap() region */
+	__type(key, __u64);
+	__type(value, __u64);
+} arena SEC(".maps");
+
+#include "bpf_arena_htab.h"
+
+void __arena *htab_for_user;
+bool skip = false;
+
+SEC("syscall")
+int arena_htab_llvm(void *ctx)
+{
+#if defined(__BPF_FEATURE_ARENA_CAST) || defined(BPF_ARENA_FORCE_ASM)
+	struct htab __arena *htab;
+	__u64 i;
+
+	htab = bpf_alloc(sizeof(*htab));
+	cast_kern(htab);
+	htab_init(htab);
+
+	/* first run. No old elems in the table */
+	bpf_for(i, 0, 1000)
+		htab_update_elem(htab, i, i);
+
+	/* should replace all elems with new ones */
+	bpf_for(i, 0, 1000)
+		htab_update_elem(htab, i, i);
+	cast_user(htab);
+	htab_for_user = htab;
+#else
+	skip = true;
+#endif
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/arena_htab_asm.c b/tools/testing/selftests/bpf/progs/arena_htab_asm.c
new file mode 100644
index 000000000000..6cd70ea12f0d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_htab_asm.c
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#define BPF_ARENA_FORCE_ASM
+#define arena_htab_llvm arena_htab_asm
+#include "arena_htab.c"
-- 
2.34.1



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 00/16] bpf: Introduce BPF arena.
  2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (15 preceding siblings ...)
  2024-02-06 22:04 ` [PATCH bpf-next 16/16] selftests/bpf: Add bpf_arena_htab test Alexei Starovoitov
@ 2024-02-07 12:34 ` Donald Hunter
  2024-02-07 13:33   ` Barret Rhoden
  2024-02-07 20:12   ` Alexei Starovoitov
  16 siblings, 2 replies; 56+ messages in thread
From: Donald Hunter @ 2024-02-07 12:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> From: Alexei Starovoitov <ast@kernel.org>
>
> bpf programs have multiple options to communicate with user space:
> - Various ring buffers (perf, ftrace, bpf): The data is streamed
>   unidirectionally from bpf to user space.
> - Hash map: The bpf program populates elements, and user space consumes them
>   via bpf syscall.
> - mmap()-ed array map: Libbpf creates an array map that is directly accessed by
>   the bpf program and mmap-ed to user space. It's the fastest way. Its
>   disadvantage is that memory for the whole array is reserved at the start.
>
> These patches introduce bpf_arena, which is a sparse shared memory region
> between the bpf program and user space.

This will need to be documented, probably in a new file at
Documentation/bpf/map_arena.rst since it's cosplaying as a BPF map.

Why is it a map, when it doesn't have map semantics as evidenced by the
-EOPNOTSUPP map accessors? Is it the only way you can reuse the kernel /
userspace plumbing?

> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage. The bpf program implements an
>    in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>    value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space occasionally consumes it. 
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view. It is
>    not shared with user space.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The memory is not
> shared with user space. This is use case 3. In such a case, the
> BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the

I can see _what_ this flag does but it's not clear what the consequences
of this flag are. Perhaps it would be better named BPF_F_NO_USER_ACCESS?

> rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will
> improve bpf prog performance. Otherwise, bpf_arena_cast_user is translated by
> JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is
> not NULL) to arena pointers before they are stored into memory. This way, user
> space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
> rX = (u32)rY;
> if (rX)
>   rX |= arena->user_vm_start & ~(u64)~0U;
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space. The last two patches
> demonstrate how algorithms in the C language can be compiled as a bpf program
> and as native code.
>
> Followup patches are planned:
> . selftests in asm
> . support arena variables in global data. Example:
>   void __arena * ptr; // works
>   int __arena var; // supported by llvm, but not by kernel and libbpf yet
> . support bpf_spin_lock in arena
>   bpf programs running on different CPUs can synchronize access to the arena via
>   existing bpf_spin_lock mechanisms (spin_locks in bpf_array or in bpf hash map).
>   It will be more convenient to allow spin_locks inside the arena too.
>
> Patch set overview:
> - patch 1,2: minor verifier enhancements to enable bpf_arena kfuncs
> - patch 3: export vmap_pages_range() to be used out side of mm directory
> - patch 4: main patch that introduces bpf_arena map type. See commit log
> - patch 6: probe_mem32 support in x86 JIT
> - patch 7: bpf_cast_user support in x86 JIT
> - patch 8: main verifier patch to support bpf_arena
> - patch 9: __arg_arena to tag arena pointers in bpf globla functions
> - patch 11: libbpf support for arena
> - patch 12: __ulong() macro to pass 64-bit constants in BTF
> - patch 13: export PAGE_SIZE constant into vmlinux BTF to be used from bpf programs
> - patch 14: bpf_arena_cast instruction as inline asm for setups with old LLVM
> - patch 15,16: testcases in C
>
> Alexei Starovoitov (16):
>   bpf: Allow kfuncs return 'void *'
>   bpf: Recognize '__map' suffix in kfunc arguments
>   mm: Expose vmap_pages_range() to the rest of the kernel.
>   bpf: Introduce bpf_arena.
>   bpf: Disasm support for cast_kern/user instructions.
>   bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
>   bpf: Add x86-64 JIT support for bpf_cast_user instruction.
>   bpf: Recognize cast_kern/user instructions in the verifier.
>   bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
>   libbpf: Add __arg_arena to bpf_helpers.h
>   libbpf: Add support for bpf_arena.
>   libbpf: Allow specifying 64-bit integers in map BTF.
>   bpf: Tell bpf programs kernel's PAGE_SIZE
>   bpf: Add helper macro bpf_arena_cast()
>   selftests/bpf: Add bpf_arena_list test.
>   selftests/bpf: Add bpf_arena_htab test.
>
>  arch/x86/net/bpf_jit_comp.c                   | 222 +++++++-
>  include/linux/bpf.h                           |   8 +-
>  include/linux/bpf_types.h                     |   1 +
>  include/linux/bpf_verifier.h                  |   1 +
>  include/linux/filter.h                        |   4 +
>  include/linux/vmalloc.h                       |   2 +
>  include/uapi/linux/bpf.h                      |  12 +
>  kernel/bpf/Makefile                           |   3 +
>  kernel/bpf/arena.c                            | 518 ++++++++++++++++++
>  kernel/bpf/btf.c                              |  19 +-
>  kernel/bpf/core.c                             |  23 +-
>  kernel/bpf/disasm.c                           |  11 +
>  kernel/bpf/log.c                              |   3 +
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 127 ++++-
>  mm/vmalloc.c                                  |   4 +-
>  tools/include/uapi/linux/bpf.h                |  12 +
>  tools/lib/bpf/bpf_helpers.h                   |   2 +
>  tools/lib/bpf/libbpf.c                        |  62 ++-
>  tools/lib/bpf/libbpf_probes.c                 |   6 +
>  tools/testing/selftests/bpf/DENYLIST.aarch64  |   1 +
>  tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
>  tools/testing/selftests/bpf/bpf_arena_alloc.h |  58 ++
>  .../testing/selftests/bpf/bpf_arena_common.h  |  70 +++
>  tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++
>  tools/testing/selftests/bpf/bpf_arena_list.h  |  95 ++++
>  .../testing/selftests/bpf/bpf_experimental.h  |  41 ++
>  .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++
>  .../selftests/bpf/prog_tests/arena_list.c     |  65 +++
>  .../testing/selftests/bpf/progs/arena_htab.c  |  48 ++
>  .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
>  .../testing/selftests/bpf/progs/arena_list.c  |  75 +++
>  32 files changed, 1669 insertions(+), 21 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 00/16] bpf: Introduce BPF arena.
  2024-02-07 12:34 ` [PATCH bpf-next 00/16] bpf: Introduce BPF arena Donald Hunter
@ 2024-02-07 13:33   ` Barret Rhoden
  2024-02-07 20:16     ` Alexei Starovoitov
  2024-02-07 20:12   ` Alexei Starovoitov
  1 sibling, 1 reply; 56+ messages in thread
From: Barret Rhoden @ 2024-02-07 13:33 UTC (permalink / raw)
  To: Donald Hunter
  Cc: Alexei Starovoitov, bpf, daniel, andrii, martin.lau, memxor,
	eddyz87, tj, hannes, linux-mm, kernel-team

On 2/7/24 07:34, Donald Hunter wrote:
>> Use cases:
>> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>>     region, like memcached or any key/value storage. The bpf program implements an
>>     in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>>     value without going to user space.
>> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>>     rb-trees, sparse arrays), while user space occasionally consumes it.
>> 3. bpf_arena is a "heap" of memory from the bpf program's point of view. It is
>>     not shared with user space.
>>
>> Initially, the kernel vm_area and user vma are not populated. User space can
>> fault in pages within the range. While servicing a page fault, bpf_arena logic
>> will insert a new page into the kernel and user vmas. The bpf program can
>> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
>> function will insert pages into the kernel vm_area. The subsequent fault-in
>> from user space will populate that page into the user vma. The
>> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
>> from user space. In such a case, if a page is not allocated by the bpf program
>> and not present in the kernel vm_area, the user process will segfault. This is
>> useful for use cases 2 and 3 above.
>>
>> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
>> either at a specific address within the arena or allocates a range with the
>> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
>> and removes the range from the kernel vm_area and from user process vmas.
>>
>> bpf_arena can be used as a bpf program "heap" of up to 4GB. The memory is not
>> shared with user space. This is use case 3. In such a case, the
>> BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the
 >
> I can see_what_  this flag does but it's not clear what the consequences
> of this flag are. Perhaps it would be better named BPF_F_NO_USER_ACCESS?

i can see a use for NO_USER_CONV while still allowing user access: 
userspace could mmap the region, but only look at scalars within it. 
this is similar to what i do today with array maps in my BPF schedulers, 
and it's a little different from Case 3.

if i knew userspace wasn't going to follow pointers, NO_USER_CONV would 
both be a speedup and make it so i don't have to worry about mmapping to 
the same virtual address in every process that shares the arena map. 
though this latter feature isn't in the code.  right now you have to 
have it mmapped at the same user_va in all address spaces.  that's not a 
huge deal for me either way.
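
fwiw, the user side of that scalar-only mode stays tiny.  a minimal 
sketch, reusing the libbpf plumbing the selftest above uses (assuming a 
skeleton named 'skel'; the "counter at offset 64" is a made-up 
convention between the prog and user space):

/* libbpf mmap-s the arena at load time; just read scalars by offset */
size_t arena_sz;
char *base = bpf_map__initial_value(skel->maps.arena, &arena_sz);

if (base)
	printf("counter %llu\n",
	       (unsigned long long)*(volatile unsigned long long *)(base + 64));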

barret






^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test.
  2024-02-06 22:04 ` [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
@ 2024-02-07 17:04   ` Eduard Zingerman
  2024-02-08  2:59     ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Eduard Zingerman @ 2024-02-07 17:04 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, martin.lau, memxor, tj, brho, hannes, linux-mm,
	kernel-team

On Tue, 2024-02-06 at 14:04 -0800, Alexei Starovoitov wrote:
[...]

> diff --git a/tools/testing/selftests/bpf/bpf_arena_list.h b/tools/testing/selftests/bpf/bpf_arena_list.h
> new file mode 100644
> index 000000000000..9f34142b0f65
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bpf_arena_list.h

[...]

> +#ifndef __BPF__
> +static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) {	return NULL; }
> +static inline void bpf_iter_num_destroy(struct bpf_iter_num *) {}
> +static inline bool bpf_iter_num_next(struct bpf_iter_num *) { return true; }
> +#endif

Note: when compiling using current clang 'main' (make test_progs) this reports the following errors:

In file included from tools/testing/selftests/bpf/prog_tests/arena_list.c:9:
./bpf_arena_list.h:28:59: error: omitting the parameter name in a function
                                 definition is a C23 extension [-Werror,-Wc23-extensions]
   28 | static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) { return NULL; }
   ...

So I had to give parameter names for the above functions.
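
I.e. something along these lines keeps both old and new clang happy
(just adding names, no functional change):

#ifndef __BPF__
/* unused parameters are fine; they just need names pre-C23 */
static inline void *bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) { return NULL; }
static inline void bpf_iter_num_destroy(struct bpf_iter_num *it) {}
static inline bool bpf_iter_num_next(struct bpf_iter_num *it) { return true; }
#endif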



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-06 22:04 ` [PATCH bpf-next 04/16] bpf: Introduce bpf_arena Alexei Starovoitov
@ 2024-02-07 18:40   ` Barret Rhoden
  2024-02-07 20:55     ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Barret Rhoden @ 2024-02-07 18:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, hannes,
	linux-mm, kernel-team

On 2/6/24 17:04, Alexei Starovoitov wrote:
> +
> +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> +{
> +	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> +}
> +
> +#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
> +
> +/*
> + * Reserve a "zero page", so that bpf prog and user space never see
> + * a pointer to arena with lower 32 bits being zero.
> + * bpf_cast_user() promotes it to full 64-bit NULL.
> + */
> +static int reserve_zero_page(struct bpf_arena *arena)
> +{
> +	long pgoff = compute_pgoff(arena, 0);
> +
> +	return mtree_insert(&arena->mt, pgoff, MT_ENTRY, GFP_KERNEL);
> +}
> +

this is pretty tricky, and i think i didn't understand it at first.

you're punching a hole in the arena, such that BPF won't allocate it via 
arena_alloc_pages().  thus BPF won't 'produce' an object with a pointer 
ending in 0x00000000.

depending on where userspace mmaps the arena, that hole may or may not 
be the first page in the arena.  if userspace mmaps it to a 4GB aligned 
virtual address, it'll be page 0.  but it could be at some arbitrary 
offset within the 4GB arena.

that arbitrariness makes it harder for a BPF program to do its own 
allocations within the arena.  i'm planning on carving up the 4GB arena 
for my own purposes, managed by BPF, with the expectation that i'll be 
able to allocate any 'virtual address' within the arena.  but there's a 
magic page that won't be usable.

i can certainly live with this.  just mmap userspace to a 4GB aligned 
address + PGSIZE, so that the last page in the arena is page 0.  but 
it's a little weird.

though i think we'll have more serious issues if anyone accidentally 
tries to use that zero page.  BPF would get an EEXIST if they try to 
allocate it directly, but then page fault and die if they touched it, 
since there's no page.  i can live with that, if we force it to be the 
last page in the arena.

however, i think you need to add something to the fault handler (below) 
in case userspace touches that page:

[snip]
> +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> +{
> +	struct bpf_map *map = vmf->vma->vm_file->private_data;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +	struct page *page;
> +	long kbase, kaddr;
> +	int ret;
> +
> +	kbase = bpf_arena_get_kern_vm_start(arena);
> +	kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> +
> +	guard(mutex)(&arena->lock);
> +	page = vmalloc_to_page((void *)kaddr);
> +	if (page)
> +		/* already have a page vmap-ed */
> +		goto out;
> +
> +	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> +		/* User space requested to segfault when page is not allocated by bpf prog */
> +		return VM_FAULT_SIGSEGV;
> +
> +	ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> +	if (ret == -EEXIST)
> +		return VM_FAULT_RETRY;

say this was the zero page.  vmalloc_to_page() failed, so we tried to 
insert.  we get EEXIST, since the slot is reserved.  we retry, since we 
were expecting the case where "no page, yet slot reserved" meant that 
BPF was in the middle of filling this page.

though i think you can fix this by just treating this as a SIGSEGV 
instead of RETRY.  when i made the original suggestion of making this a 
retry (in an email off list), that was before you had the arena mutex. 
now that you have the mutex, you shouldn't have the scenario where two 
threads are concurrently trying to fill a page.  i.e. mtree_insert + 
page_alloc + vmap are all atomic w.r.t. the mutex.
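
i.e. roughly this, if you go the SIGSEGV route (just a sketch of the 
suggestion, folding -EEXIST into the generic failure path):

	ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
	if (ret)
		/* covers -EEXIST too: with the mutex held there is no
		 * in-flight fill to wait for, so fail the fault
		 * instead of retrying.
		 */
		return VM_FAULT_SIGSEGV;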


> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		mtree_erase(&arena->mt, vmf->pgoff);
> +		return VM_FAULT_SIGSEGV;
> +	}
> +
> +	ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);
> +	if (ret) {
> +		mtree_erase(&arena->mt, vmf->pgoff);
> +		__free_page(page);
> +		return VM_FAULT_SIGSEGV;
> +	}
> +out:
> +	page_ref_add(page, 1);
> +	vmf->page = page;
> +	return 0;
> +}

[snip]

> +static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +	int err;
> +
> +	if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
> +		/*
> +		 * 1st user process can do mmap(NULL, ...) to pick user_vm_start
> +		 * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
> +		 *   or
> +		 * specify addr in map_extra at map creation time and
> +		 * use the same addr later with mmap(addr, MAP_FIXED..);
> +		 */
> +		return -EBUSY;
> +
> +	if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
> +		/* all user processes must have the same size of mmap-ed region */
> +		return -EBUSY;
> +
> +	if (vma->vm_end - vma->vm_start > 1ull << 32)
> +		/* Must not be bigger than 4Gb */
> +		return -E2BIG;
> +
> +	if (remember_vma(arena, vma))
> +		return -ENOMEM;
> +
> +	if (!arena->user_vm_start) {
> +		arena->user_vm_start = vma->vm_start;
> +		err = reserve_zero_page(arena);
> +		if (err)
> +			return err;
> +	}
> +	arena->user_vm_end = vma->vm_end;
> +	/*
> +	 * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> +	 * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> +	 * potential change of user_vm_start.
> +	 */
> +	vm_flags_set(vma, VM_DONTEXPAND);
> +	vma->vm_ops = &arena_vm_ops;
> +	return 0;
> +}

i think this whole function needs to be protected by the mutex, or at 
least all the stuff related to user_vm_{start,end}.  if you have two 
threads mmapping the region for the first time, you'll race on the 
values of user_vm_*.


[snip]

> +/*
> + * Allocate pages and vmap them into kernel vmalloc area.
> + * Later the pages will be mmaped into user space vma.
> + */
> +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)

instead of uaddr, can you change this to take an address relative to the 
arena ("arena virtual address"?)?  the caller of this is in BPF, and 
they don't easily know the user virtual address.  maybe even just pgoff 
directly.

additionally, you won't need to call compute_pgoff().  as it is now, i'm 
not sure what would happen if BPF did an arena_alloc with a uaddr and 
user_vm_start wasn't set yet.  actually, i guess it'd just be 0, so 
uaddr would act like an arena virtual address, up until the moment where 
someone mmaps, then it'd suddenly change to be a user virtual address.

either way, making uaddr an arena-relative addr would make all that moot.


> +{
> +	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;

any time you compute_pgoff() or look at user_vm_{start,end}, maybe 
either hold the mutex, or only do it from mmap faults (where we know 
user_vm_start is already set).  o/w there might be subtle races where 
some other thread is mmapping the arena for the first time.


> +	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> +	long pgoff = 0, kaddr, nr_pages = 0;
> +	struct page **pages;
> +	int ret, i;
> +
> +	if (page_cnt >= page_cnt_max)
> +		return 0;
> +
> +	if (uaddr) {
> +		if (uaddr & ~PAGE_MASK)
> +			return 0;
> +		pgoff = compute_pgoff(arena, uaddr);
> +		if (pgoff + page_cnt > page_cnt_max)
> +			/* requested address will be outside of user VMA */
> +			return 0;
> +	}
> +
> +	/* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
> +	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return 0;
> +
> +	guard(mutex)(&arena->lock);
> +
> +	if (uaddr)
> +		ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt,
> +					 MT_ENTRY, GFP_KERNEL);
> +	else
> +		ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
> +					page_cnt, 0, page_cnt_max, GFP_KERNEL);
> +	if (ret)
> +		goto out_free_pages;
> +
> +	nr_pages = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_ZERO, node_id, page_cnt, pages);
> +	if (nr_pages != page_cnt)
> +		goto out;
> +
> +	kaddr = kern_vm_start + (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);

adding user_vm_start here is pretty subtle.

so far i've been thinking that the mtree is the "address space" of the 
arena, in units of pages instead of bytes.  and pgoff is an address 
within the arena.  so mtree slot 0 is the 0th page of the 4GB region. 
and that "arena address space" is mapped at a kernel virtual address and 
a user virtual address (the same for all processes).

to convert user addresses (uaddr et al.) to arena addresses, we subtract 
user_vm_start, which makes sense.  that's what compute_pgoff() does.

i was expecting kaddr = kern_vm_start + pgoff * PGSIZE, essentially 
converting from arena address space to kernel virtual address.

instead, by adding user_vm_start and casting to u32, you're converting 
or shifting arena addresses to *another* arena address (user address, 
truncated to 4GB to keep it in the arena), and that is the one that the 
kernel will use.

is that correct?

my one concern is that there's some subtle wrap-around going on, and due 
to the shifting, kaddr can be very close to the end of the arena and 
page_cnt can be big enough to go outside the 4GB range.  we'd want it to 
wrap around.  e.g.

user_start_va = 0x1,fffff000
user_end_va =   0x2,fffff000
page_cnt_max = 0x100000 or whatever.  full 4GB range.

say we want to alloc at pgoff=0 (uaddr 0x1,fffff000), page_cnt = X.  you 
can get this pgoff either by doing mtree_insert_range or 
mtree_alloc_range on an arena with no allocations.

kaddr = kern_vm_start + 0xfffff000

the size of the vm area is 4GB + guard stuff, and we're right up against 
the end of it.

if page_cnt > the guard size, vmap_pages_range() would be called on 
something outside the vm area we reserved, which seems bad.  and even if 
it wasn't, what we want is for later page maps to start at the beginning 
of kern_vm_start.

the fix might be to just map a page at a time - maybe a loop.  or 
detect when we're close to the edge and break it into two vmaps.  i feel 
like the loop would be easier to understand, but maybe less efficient.

> +	ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE * page_cnt, PAGE_KERNEL,
> +			       pages, PAGE_SHIFT);
> +	if (ret)
> +		goto out;
> +	kvfree(pages);
> +	return clear_lo32(arena->user_vm_start) + (u32)(kaddr - kern_vm_start);
> +out:
> +	mtree_erase(&arena->mt, pgoff);
> +out_free_pages:
> +	if (pages)
> +		for (i = 0; i < nr_pages; i++)
> +			__free_page(pages[i]);
> +	kvfree(pages);
> +	return 0;
> +}

thanks,
barret



> +
> +/*
> + * If page is present in vmalloc area, unmap it from vmalloc area,
> + * unmap it from all user space vma-s,
> + * and free it.
> + */
> +static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +	struct vma_list *vml;
> +
> +	list_for_each_entry(vml, &arena->vma_list, head)
> +		zap_page_range_single(vml->vma, uaddr,
> +				      PAGE_SIZE * page_cnt, NULL);
> +}
> +
> +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +	u64 full_uaddr, uaddr_end;
> +	long kaddr, pgoff, i;
> +	struct page *page;
> +
> +	/* only aligned lower 32-bit are relevant */
> +	uaddr = (u32)uaddr;
> +	uaddr &= PAGE_MASK;
> +	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
> +	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
> +	if (full_uaddr >= uaddr_end)
> +		return;
> +
> +	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
> +
> +	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> +
> +	guard(mutex)(&arena->lock);
> +
> +	pgoff = compute_pgoff(arena, uaddr);
> +	/* clear range */
> +	mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt, NULL, GFP_KERNEL);
> +
> +	if (page_cnt > 1)
> +		/* bulk zap if multiple pages being freed */
> +		zap_pages(arena, full_uaddr, page_cnt);
> +
> +	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> +		page = vmalloc_to_page((void *)kaddr);
> +		if (!page)
> +			continue;
> +		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
> +			zap_pages(arena, full_uaddr, 1);
> +		vunmap_range(kaddr, kaddr + PAGE_SIZE);
> +		__free_page(page);
> +	}
> +}
> +
> +__bpf_kfunc_start_defs();
> +
> +__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
> +					int node_id, u64 flags)
> +{
> +	struct bpf_map *map = p__map;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	if (map->map_type != BPF_MAP_TYPE_ARENA || !arena->user_vm_start || flags)
> +		return NULL;
> +
> +	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
> +}
> +
> +__bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
> +{
> +	struct bpf_map *map = p__map;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	if (map->map_type != BPF_MAP_TYPE_ARENA || !arena->user_vm_start)
> +		return;
> +	arena_free_pages(arena, (long)ptr__ign, page_cnt);
> +}




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 00/16] bpf: Introduce BPF arena.
  2024-02-07 12:34 ` [PATCH bpf-next 00/16] bpf: Introduce BPF arena Donald Hunter
  2024-02-07 13:33   ` Barret Rhoden
@ 2024-02-07 20:12   ` Alexei Starovoitov
  1 sibling, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-07 20:12 UTC (permalink / raw)
  To: Donald Hunter
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 4:34 AM Donald Hunter <donald.hunter@gmail.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > bpf programs have multiple options to communicate with user space:
> > - Various ring buffers (perf, ftrace, bpf): The data is streamed
> >   unidirectionally from bpf to user space.
> > - Hash map: The bpf program populates elements, and user space consumes them
> >   via bpf syscall.
> > - mmap()-ed array map: Libbpf creates an array map that is directly accessed by
> >   the bpf program and mmap-ed to user space. It's the fastest way. Its
> >   disadvantage is that memory for the whole array is reserved at the start.
> >
> > These patches introduce bpf_arena, which is a sparse shared memory region
> > between the bpf program and user space.
>
> This will need to be documented, probably in a new file at
> Documentation/bpf/map_arena.rst

of course. Once interfaces stop changing.

> since it's cosplaying as a BPF map.

cosplaying? It's a first-class bpf map.

> Why is it a map, when it doesn't have map semantics as evidenced by the
> -EOPNOTSUPP map accessors?

array map doesn't support delete.
bloom filter map doesn't support lookup/update/delete.
queue/stack map doesn't support lookup/update/delete.
ringbuf map doesn't support lookup/update/delete.

ringbuf map can be mmap-ed.
array map can be mmap-ed.
bloom filter cannot be mmaped, but that can easily be added
if there is a use case.
In some ways the arena is a superset of the array and bloom filter maps.
A bpf prog can trivially implement a bloom filter inside the arena.
32-bit bounded pointers are what makes the arena so powerful.
It might be one of the last maps that we will add,
since almost any algorithm can be implemented in the arena.

> Is it the only way you can reuse the kernel /
> userspace plumbing?

What do you mean?

> > shared with user space. This is use case 3. In such a case, the
> > BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the
>
> I can see _what_ this flag does but it's not clear what the consequences
> of this flag are. Perhaps it would be better named BPF_F_NO_USER_ACCESS?

no_user_access doesn't make sense.
Even when the prog doesn't convert pointers to nice user pointers,
the whole arena is still mmap-able and accessible from user space.
One can operate on it with offsets instead of pointers.

Pls trim your replies.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 00/16] bpf: Introduce BPF arena.
  2024-02-07 13:33   ` Barret Rhoden
@ 2024-02-07 20:16     ` Alexei Starovoitov
  0 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-07 20:16 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Donald Hunter, bpf, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 5:33 AM Barret Rhoden <brho@google.com> wrote:
>
>
> if i knew userspace wasn't going to follow pointers, NO_USER_CONV would
> both be a speedup and make it so i don't have to worry about mmapping to
> the same virtual address in every process that shares the arena map.
> though this latter feature isn't in the code.  right now you have to
> have it mmapped at the same user_va in all address spaces.  that's not a
> huge deal for me either way.

Not quite. With:

struct {
   __uint(type, BPF_MAP_TYPE_ARENA);
...
   __ulong(map_extra, 2ull << 44); /* start of mmap() region */
...
} arena SEC(".maps");

the future user_vm_start will be specified at map creation time,
so the arena doesn't have to be mmap-ed to be usable.
But having the address known early helps potential future mmap-s,
e.g. to examine the state of the arena with bpftool.
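
Then every process that wants the same view can do roughly
(arena_map_fd/arena_sz are whatever the caller has at hand):

	void *base = mmap((void *)(2ull << 44), arena_sz,
			  PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_FIXED, arena_map_fd, 0);

since arena_map_mmap() insists on the same vm_start for everyone.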


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-07 18:40   ` Barret Rhoden
@ 2024-02-07 20:55     ` Alexei Starovoitov
  2024-02-07 21:11       ` Barret Rhoden
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-07 20:55 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
	linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 10:40 AM Barret Rhoden <brho@google.com> wrote:
>
> On 2/6/24 17:04, Alexei Starovoitov wrote:
> > +
> > +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> > +{
> > +     return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> > +}
> > +
> > +#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
> > +
> > +/*
> > + * Reserve a "zero page", so that bpf prog and user space never see
> > + * a pointer to arena with lower 32 bits being zero.
> > + * bpf_cast_user() promotes it to full 64-bit NULL.
> > + */
> > +static int reserve_zero_page(struct bpf_arena *arena)
> > +{
> > +     long pgoff = compute_pgoff(arena, 0);
> > +
> > +     return mtree_insert(&arena->mt, pgoff, MT_ENTRY, GFP_KERNEL);
> > +}
> > +
>
> this is pretty tricky, and i think i didn't understand it at first.
>
> you're punching a hole in the arena, such that BPF won't allocate it via
> arena_alloc_pages().  thus BPF won't 'produce' an object with a pointer
> ending in 0x00000000.
>
> depending on where userspace mmaps the arena, that hole may or may not
> be the first page in the array.  if userspace mmaps it to a 4GB aligned
> virtual address, it'll be page 0.  but it could be at some arbitrary
> offset within the 4GB arena.
>
> that arbitrariness makes it harder for a BPF program to do its own
> allocations within the arena.  i'm planning on carving up the 4GB arena
> for my own purposes, managed by BPF, with the expectation that i'll be
> able to allocate any 'virtual address' within the arena.  but there's a
> magic page that won't be usable.
>
> i can certainly live with this.  just mmap userspace to a 4GB aligned
> address + PGSIZE, so that the last page in the arena is page 0.  but
> it's a little weird.

Agree. I came to the same conclusion while adding global variables to the arena.
From the compiler's point of view all such global vars start at offset zero
and there is no way to just "move them up by a page".
For example in C code it will look like:
int __arena var1;
int __arena var2;

&var1 == user_vm_start
&var2 == user_vm_start + 4

If __ulong(map_extra,...) or mmap(addr, MAP_FIXED) was used
and the address was 4GB aligned, the lower 32 bits of the &var1 address will be zero,
and there is not much we can do about it.
We can tell LLVM to emit an extra 8 bytes of padding in the arena section,
but it will be useless padding if the arena is not aligned to 4GB.

Anyway, in the v2 I will remove this reserve_zero_page() logic.
It's causing more harm than good.

>
> though i think we'll have more serious issues if anyone accidentally
> tries to use that zero page.  BPF would get an EEXIST if they try to
> allocate it directly, but then page fault and die if they touched it,
> since there's no page.  i can live with that, if we force it to be the
> last page in the arena.
>
> however, i think you need to add something to the fault handler (below)
> in case userspace touches that page:
>
> [snip]
> > +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> > +{
> > +     struct bpf_map *map = vmf->vma->vm_file->private_data;
> > +     struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> > +     struct page *page;
> > +     long kbase, kaddr;
> > +     int ret;
> > +
> > +     kbase = bpf_arena_get_kern_vm_start(arena);
> > +     kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> > +
> > +     guard(mutex)(&arena->lock);
> > +     page = vmalloc_to_page((void *)kaddr);
> > +     if (page)
> > +             /* already have a page vmap-ed */
> > +             goto out;
> > +
> > +     if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> > +             /* User space requested to segfault when page is not allocated by bpf prog */
> > +             return VM_FAULT_SIGSEGV;
> > +
> > +     ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> > +     if (ret == -EEXIST)
> > +             return VM_FAULT_RETRY;
>
> say this was the zero page.  vmalloc_to_page() failed, so we tried to
> insert.  we get EEXIST, since the slot is reserved.  we retry, since we
> were expecting the case where "no page, yet slot reserved" meant that
> BPF was in the middle of filling this page.

Yes. Great catch! I hit that too while playing with global vars.

>
> though i think you can fix this by just treating this as a SIGSEGV
> instead of RETRY.

Agree.

> when i made the original suggestion of making this a
> retry (in an email off list), that was before you had the arena mutex.
> now that you have the mutex, you shouldn't have the scenario where two
> threads are concurrently trying to fill a page.  i.e. mtree_insert +
> page_alloc + vmap are all atomic w.r.t. the mutex.

yes. mutex part makes sense.

> > +
> > +     if (!arena->user_vm_start) {
> > +             arena->user_vm_start = vma->vm_start;
> > +             err = reserve_zero_page(arena);
> > +             if (err)
> > +                     return err;
> > +     }
> > +     arena->user_vm_end = vma->vm_end;
> > +     /*
> > +      * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> > +      * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> > +      * potential change of user_vm_start.
> > +      */
> > +     vm_flags_set(vma, VM_DONTEXPAND);
> > +     vma->vm_ops = &arena_vm_ops;
> > +     return 0;
> > +}
>
> i think this whole function needs to be protected by the mutex, or at
> least all the stuff related to user_vm_{start,end}.  if you have two
> threads mmapping the region for the first time, you'll race on the
> values of user_vm_*.

yes. will add a mutex guard.

>
> [snip]
>
> > +/*
> > + * Allocate pages and vmap them into kernel vmalloc area.
> > + * Later the pages will be mmaped into user space vma.
> > + */
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
>
> instead of uaddr, can you change this to take an address relative to the
> arena ("arena virtual address"?)?  the caller of this is in BPF, and
> they don't easily know the user virtual address.  maybe even just pgoff
> directly.

I thought about it, but it doesn't quite make sense.
A bpf prog only sees user addresses.
All loads/stores return them. If it bpf_printk-s an address, it will be
a user address.
bpf_arena_alloc_pages() also returns a user address.

Kernel addresses are not seen by bpf prog at all.
kern_vm_base is completely hidden.
Only at JIT time, it's added to pointers.
So passing uaddr to arena_alloc_pages() matches mmap style.

uaddr = bpf_arena_alloc_pages(... uaddr ...)
uaddr = mmap(uaddr, ...MAP_FIXED)

Passing pgoff would be weird.
Also note that there is no extra flag for bpf_arena_alloc_pages().
A uaddr with all 64 bits zero is not a valid addr to use.

> additionally, you won't need to call compute_pgoff().  as it is now, i'm
> not sure what would happen if BPF did an arena_alloc with a uaddr and
> user_vm_start wasn't set yet.

That's impossible. A bpf prog won't load unless the arena map was created
with a fixed map_extra == user_vm_start, or was created
with map_extra == 0 and then mmap-ed.
Only then can a bpf prog that uses that arena be loaded.

>
> > +{
> > +     long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> any time you compute_pgoff() or look at user_vm_{start,end}, maybe
> either hold the mutex, or only do it from mmap faults (where we know
> user_vm_start is already set).  o/w there might be subtle races where
> some other thread is mmapping the arena for the first time.

That's unnecessary, since user_vm_start is fixed for the lifetime of
the bpf prog.
But you spotted a bug: I need to set user_vm_end in arena_map_alloc()
when map_extra is specified.
Otherwise, between arena creation and mmap, the bpf prog will see a different
page_cnt_max. user_vm_start will be the same, of course.

> > +
> > +     kaddr = kern_vm_start + (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
>
> adding user_vm_start here is pretty subtle.
>
> so far i've been thinking that the mtree is the "address space" of the
> arena, in units of pages instead of bytes.  and pgoff is an address
> within the arena.  so mtree slot 0 is the 0th page of the 4GB region.
> and that "arena address space" is mapped at a kernel virtual address and
> a user virtual address (the same for all processes).
>
> to convert user addresses (uaddr et al.) to arena addresses, we subtract
> user_vm_start, which makes sense.  that's what compute_pgoff() does.
>
> i was expecting kaddr = kern_vm_start + pgoff * PGSIZE, essentially
> converting from arena address space to kernel virtual address.
>
> instead, by adding user_vm_start and casting to u32, you're converting
> or shifting arena addresses to *another* arena address (user address,
> truncated to 4GB to keep it in the arena), and that is the one that the
> kernel will use.
>
> is that correct?

Pretty much. The kernel and user sides have to see the lower 32 bits exactly the same.
From the user pov allocation starts at pgoff=0, which is the first page
after user_vm_start. That is normal mmap behavior and the in-kernel
vma convention.
Hence arena_alloc_pages() does the same: pgoff=0 is the first
page from the user vma's pov.
The math to compute the kernel address gets complicated because two
bases need to be added. The kernel and user bases are different
64-bit bases and may not be aligned to 4GB.
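
In other words, for a given pgoff the two views of the same page are
roughly:

	/* both sides agree on the lower 32 bits; the upper 32 differ */
	u32 lo32  = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
	kaddr = kern_vm_start + lo32;                    /* kernel view */
	uaddr = clear_lo32(arena->user_vm_start) + lo32; /* user/bpf view */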

> my one concern is that there's some subtle wrap-around going on, and due
> to the shifting, kaddr can be very close to the end of the arena and
> page_cnt can be big enough to go outside the 4GB range.  we'd want it to

page_cnt cannot go outside of the 4GB range.
page_cnt is the number of pages in arena->user_vm_end - arena->user_vm_start,
and during mmap we check that it's <= 4GB.

> wrap around.  e.g.
>
> user_start_va = 0x1,fffff000
> user_end_va =   0x2,fffff000
> page_cnt_max = 0x100000 or whatever.  full 4GB range.
>
> say we want to alloc at pgoff=0 (uaddr 0x1,fffff000), page_cnt = X.  you
> can get this pgoff either by doing mtree_insert_range or
> mtree_alloc_range on an arena with no allocations.
>
> kaddr = kern_vm_start + 0xfffff000
>
> the size of the vm area is 4GB + guard stuff, and we're right up against
> the end of it.
>
> if page_cnt > the guard size, vmap_pages_range() would be called on
> something outside the vm area we reserved, which seems bad.  and even if
> it wasn't, what we want is for later page maps to start at the beginning
> of kern_vm_start.
>
> the fix might be to just only map a page at a time - maybe a loop.  or
> detect when we're close to the edge and break it into two vmaps.  i feel
> like the loop would be easier to understand, but maybe less efficient.

Oops. You're correct. Great catch.
In earlier versions I had it as a loop, but then I decided that doing
all the mapping ops one page at a time is not efficient.
Oh well. Will fix.
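
Something like this (untested sketch; the error path would also need to
vunmap the chunks mapped so far):

	long chunk, i = 0;

	/* chunk the vmap so a range never runs past kern_vm_start + 4GB */
	while (i < page_cnt) {
		u32 lo32 = (u32)(arena->user_vm_start + (pgoff + i) * PAGE_SIZE);
		long kaddr = kern_vm_start + lo32;

		chunk = min_t(long, page_cnt - i,
			      ((1ull << 32) - lo32) >> PAGE_SHIFT);
		ret = vmap_pages_range(kaddr, kaddr + chunk * PAGE_SIZE,
				       PAGE_KERNEL, &pages[i], PAGE_SHIFT);
		if (ret)
			goto out;
		i += chunk;
	}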

Thanks a lot for the review!


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-06 22:04 ` [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
@ 2024-02-07 21:07   ` Lorenzo Stoakes
  2024-02-07 22:56     ` Alexei Starovoitov
                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2024-02-07 21:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig

I don't know what conventions you bpf guys follow, but it's common courtesy
in the rest of the kernel to do a get_maintainers.pl check and figure out
who the maintainers/reviewers of the part of the kernel you change are,
and include them in your CC list.

I've done this for you.

On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> The next commit will introduce bpf_arena which is a sparsely populated shared
> memory region between bpf program and user space process.
> It will function similar to vmalloc()/vm_map_ram():
> - get_vm_area()
> - alloc_pages()
> - vmap_pages_range()

This tells me absolutely nothing about why it is justified to expose this
internal interface. You need to put more explanation here along the lines
of 'we had no other means of achieving what we needed from vmalloc because
X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.

I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
for instance. We good to expose that, not only for you but for any other
core kernel users?

>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/vmalloc.h | 2 ++
>  mm/vmalloc.c            | 4 ++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..bafb87c69e3d 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)
>
>  #ifdef CONFIG_MMU
>  void vunmap_range(unsigned long addr, unsigned long end);
> +int vmap_pages_range(unsigned long addr, unsigned long end,
> +		     pgprot_t prot, struct page **pages, unsigned int page_shift);
>  static inline void set_vm_flush_reset_perms(void *addr)
>  {
>  	struct vm_struct *vm = find_vm_area(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d12a17fc0c17..eae93d575d1b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -625,8 +625,8 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>   * RETURNS:
>   * 0 on success, -errno on failure.
>   */
> -static int vmap_pages_range(unsigned long addr, unsigned long end,
> -		pgprot_t prot, struct page **pages, unsigned int page_shift)
> +int vmap_pages_range(unsigned long addr, unsigned long end,
> +		     pgprot_t prot, struct page **pages, unsigned int page_shift)
>  {
>  	int err;
>
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-07 20:55     ` Alexei Starovoitov
@ 2024-02-07 21:11       ` Barret Rhoden
  2024-02-08  6:26         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Barret Rhoden @ 2024-02-07 21:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
	linux-mm, Kernel Team

On 2/7/24 15:55, Alexei Starovoitov wrote:
>> instead of uaddr, can you change this to take an address relative to the
>> arena ("arena virtual address"?)?  the caller of this is in BPF, and
>> they don't easily know the user virtual address.  maybe even just pgoff
>> directly.
> I thought about it, but it doesn't quite make sense.
> bpf prog only sees user addresses.
> All load/store returns them. If it bpf_printk-s an address it will be
> user address.
> bpf_arena_alloc_pages() also returns a user address.

Yeah, makes sense to keep them all in the same address space.

> 
> Kernel addresses are not seen by bpf prog at all.
> kern_vm_base is completely hidden.
> Only at JIT time, it's added to pointers.
> So passing uaddr to arena_alloc_pages() matches mmap style.
> 
> uaddr = bpf_arena_alloc_pages(... uaddr ...)
> uaddr = mmap(uaddr, ...MAP_FIXED)
> 
> Passing pgoff would be weird.
> Also note that there is no extra flag for bpf_arena_alloc_pages().
> uaddr == full 64-bit of zeros is not a valid addr to use.

The problem I had with uaddr was that when I'm writing a BPF program, I 
don't know which address to use for a given page, e.g. the beginning of 
the arena.  I needed some way to tell me the user address "base" of the 
arena.  Though now that I can specify the user_vm_start through the 
map_extra, I think I'm ok.

Specifically, say I want to break up my arena into two 2GB chunks, one 
for each numa node, and I want to bump-allocate from each chunk.  When I 
want to allocate the first page from either segment, I'll need to know 
what user address is offset 0 or offset 2GB.

Since I know the user_vm_start at compile time, I can just hardcode that 
to convert from "arena address" (e.g. pgoff) to the user address space.
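
e.g. something like this on the BPF side (a sketch; ARENA_BASE just 
mirrors the __ulong(map_extra, ...) value, and PAGE_SIZE comes from 
patch 13):

#define ARENA_BASE (2ull << 44)	/* == map_extra, known at build time */

/* arena page offset -> the pointer the prog and user space both see */
static inline void __arena *arena_pg(__u64 pgoff)
{
	return (void __arena *)(ARENA_BASE + pgoff * PAGE_SIZE);
}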

thanks,

barret




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-07 21:07   ` Lorenzo Stoakes
@ 2024-02-07 22:56     ` Alexei Starovoitov
  2024-02-08  5:44     ` Johannes Weiner
  2024-02-14  8:31     ` Christoph Hellwig
  2 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-07 22:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig

On Wed, Feb 7, 2024 at 1:10 PM Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> I don't know what conventions you bpf guys follow, but it's common courtesy
> in the rest of the kernel to do a get_maintainers.pl check and figure out
> who the maintainers/reviews of a part of the kernel you change are,
> and include them in your mailing list.

linux-mm and Johannes were cc-ed.

> On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > The next commit will introduce bpf_arena which is a sparsely populated shared
> > memory region between bpf program and user space process.
> > It will function similar to vmalloc()/vm_map_ram():
> > - get_vm_area()
> > - alloc_pages()
> > - vmap_pages_range()
>
> This tells me absolutely nothing about why it is justified to expose this
> internal interface. You need to put more explanation here along the lines
> of 'we had no other means of achieving what we needed from vmalloc because
> X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.

Full motivation and details are in the cover letter and in the next commit, as
the commit log of this patch says.
Everyone subscribed to linux-mm has all the patches in their mailboxes.

The commit log also mentions that the next patch does pretty much
what vm_map_ram() does.
What further details are you looking for?

What 'risk of breaking' are you talking about?

> I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> for instance. We good to expose that, not only for you but for any other
> core kernel users?

I could have overlooked something. What specific checks do you have in mind?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-06 22:04 ` [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena Alexei Starovoitov
@ 2024-02-08  1:15   ` Andrii Nakryiko
  2024-02-08  1:38     ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-08  1:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> mmap() bpf_arena right after creation, since the kernel needs to
> remember the address returned from mmap. This is user_vm_start.
> LLVM will generate bpf_arena_cast_user() instructions where
> necessary and JIT will add upper 32-bit of user_vm_start
> to such pointers.
>
> Use traditional map->value_size * map->max_entries to calculate mmap sz,
> though it's not the best fit.

We should probably make bpf_map_mmap_sz() aware of the specific map type
and do different calculations based on that. It makes sense to have
round_up(PAGE_SIZE) for the BPF arena map, and use just value_size or
max_entries to specify the size (fixing the other to be zero).
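
E.g. at the call site below, roughly (just the idea, treating
max_entries as the page count for arena):

	if (def->type == BPF_MAP_TYPE_ARENA)
		mmap_sz = (size_t)def->max_entries * sysconf(_SC_PAGE_SIZE);
	else
		mmap_sz = bpf_map_mmap_sz(def->value_size, def->max_entries);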

>
> Also don't set BTF at bpf_arena creation time, since it doesn't support it.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/lib/bpf/libbpf.c        | 18 ++++++++++++++++++
>  tools/lib/bpf/libbpf_probes.c |  6 ++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 01f407591a92..c5ce5946dc6d 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
>         [BPF_MAP_TYPE_BLOOM_FILTER]             = "bloom_filter",
>         [BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
>         [BPF_MAP_TYPE_CGRP_STORAGE]             = "cgrp_storage",
> +       [BPF_MAP_TYPE_ARENA]                    = "arena",
>  };
>
>  static const char * const prog_type_name[] = {
> @@ -4852,6 +4853,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>         case BPF_MAP_TYPE_SOCKHASH:
>         case BPF_MAP_TYPE_QUEUE:
>         case BPF_MAP_TYPE_STACK:
> +       case BPF_MAP_TYPE_ARENA:
>                 create_attr.btf_fd = 0;
>                 create_attr.btf_key_type_id = 0;
>                 create_attr.btf_value_type_id = 0;
> @@ -4908,6 +4910,22 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>         if (map->fd == map_fd)
>                 return 0;
>
> +       if (def->type == BPF_MAP_TYPE_ARENA) {
> +               size_t mmap_sz;
> +
> +               mmap_sz = bpf_map_mmap_sz(def->value_size, def->max_entries);
> +               map->mmaped = mmap((void *)map->map_extra, mmap_sz, PROT_READ | PROT_WRITE,
> +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> +                                  map_fd, 0);
> +               if (map->mmaped == MAP_FAILED) {
> +                       err = -errno;
> +                       map->mmaped = NULL;
> +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> +                               bpf_map__name(map), err);
> +                       return err;

leaking map_fd here, you need to close(map_fd) before erroring out


> +               }
> +       }
> +
>         /* Keep placeholder FD value but now point it to the BPF map object.
>          * This way everything that relied on this map's FD (e.g., relocated
>          * ldimm64 instructions) will stay valid and won't need adjustments.
> diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
> index ee9b1dbea9eb..cbc7f4c09060 100644
> --- a/tools/lib/bpf/libbpf_probes.c
> +++ b/tools/lib/bpf/libbpf_probes.c
> @@ -338,6 +338,12 @@ static int probe_map_create(enum bpf_map_type map_type)
>                 key_size = 0;
>                 max_entries = 1;
>                 break;
> +       case BPF_MAP_TYPE_ARENA:
> +               key_size        = sizeof(__u64);
> +               value_size      = sizeof(__u64);
> +               opts.map_extra  = 0; /* can mmap() at any address */
> +               opts.map_flags  = BPF_F_MMAPABLE;
> +               break;
>         case BPF_MAP_TYPE_HASH:
>         case BPF_MAP_TYPE_ARRAY:
>         case BPF_MAP_TYPE_PROG_ARRAY:
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-06 22:04 ` [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
@ 2024-02-08  1:16   ` Andrii Nakryiko
  2024-02-08  1:58     ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-08  1:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> __uint() macro that is used to specify map attributes like:
>   __uint(type, BPF_MAP_TYPE_ARRAY);
>   __uint(map_flags, BPF_F_MMAPABLE);
> is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
>
> Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> In map definition "map_extra" is the only u64 field.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/lib/bpf/bpf_helpers.h |  1 +
>  tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
>  2 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index 9c777c21da28..fb909fc6866d 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -13,6 +13,7 @@
>  #define __uint(name, val) int (*name)[val]
>  #define __type(name, val) typeof(val) *name
>  #define __array(name, val) typeof(val) *name[]
> +#define __ulong(name, val) enum name##__enum { name##__value = val } name

Can you try using __ulong() twice in the same file? Enum type and
value names have global visibility, so I suspect a second use with the
same field name would cause a compilation error.
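
For illustration, a stripped-down pair of map definitions that I'd expect to
trip over this (map names are made up; the second __ulong() redefines both
map_extra__enum and map_extra__value):

struct {
	__uint(type, BPF_MAP_TYPE_ARENA);
	__uint(map_flags, BPF_F_MMAPABLE);
	__ulong(map_extra, 1ull << 44); /* enum map_extra__enum { map_extra__value = ... } map_extra; */
} arena1 SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARENA);
	__uint(map_flags, BPF_F_MMAPABLE);
	__ulong(map_extra, 1ull << 45); /* error: redefinition of 'map_extra__enum' */
} arena2 SEC(".maps");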

>
>  /*
>   * Helper macro to place programs, maps, license in
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index c5ce5946dc6d..a8c89b2315cd 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -2229,6 +2229,39 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf,
>         return true;
>  }
>
> +static bool get_map_field_long(const char *map_name, const struct btf *btf,
> +                              const struct btf_member *m, __u64 *res)
> +{
> +       const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> +       const char *name = btf__name_by_offset(btf, m->name_off);
> +
> +       if (btf_is_ptr(t))
> +               return false;
> +
> +       if (!btf_is_enum(t) && !btf_is_enum64(t)) {
> +               pr_warn("map '%s': attr '%s': expected enum or enum64, got %s.\n",
> +                       map_name, name, btf_kind_str(t));
> +               return false;
> +       }
> +
> +       if (btf_vlen(t) != 1) {
> +               pr_warn("map '%s': attr '%s': invalid __ulong\n",
> +                       map_name, name);
> +               return false;
> +       }
> +
> +       if (btf_is_enum(t)) {
> +               const struct btf_enum *e = btf_enum(t);
> +
> +               *res = e->val;
> +       } else {
> +               const struct btf_enum64 *e = btf_enum64(t);
> +
> +               *res = btf_enum64_value(e);
> +       }
> +       return true;
> +}
> +
>  static int pathname_concat(char *buf, size_t buf_sz, const char *path, const char *name)
>  {
>         int len;
> @@ -2462,10 +2495,15 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
>                         map_def->pinning = val;
>                         map_def->parts |= MAP_DEF_PINNING;
>                 } else if (strcmp(name, "map_extra") == 0) {
> -                       __u32 map_extra;
> +                       __u64 map_extra;
>
> -                       if (!get_map_field_int(map_name, btf, m, &map_extra))
> -                               return -EINVAL;
> +                       if (!get_map_field_long(map_name, btf, m, &map_extra)) {
> +                               __u32 map_extra_u32;
> +
> +                               if (!get_map_field_int(map_name, btf, m, &map_extra_u32))
> +                                       return -EINVAL;
> +                               map_extra = map_extra_u32;
> +                       }
>                         map_def->map_extra = map_extra;
>                         map_def->parts |= MAP_DEF_MAP_EXTRA;
>                 } else {
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-08  1:15   ` Andrii Nakryiko
@ 2024-02-08  1:38     ` Alexei Starovoitov
  2024-02-08 18:29       ` Andrii Nakryiko
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08  1:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 5:15 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > mmap() bpf_arena right after creation, since the kernel needs to
> > remember the address returned from mmap. This is user_vm_start.
> > LLVM will generate bpf_arena_cast_user() instructions where
> > necessary and JIT will add upper 32-bit of user_vm_start
> > to such pointers.
> >
> > Use traditional map->value_size * map->max_entries to calculate mmap sz,
> > though it's not the best fit.
>
> We should probably make bpf_map_mmap_sz() aware of specific map type
> and do different calculations based on that. It makes sense to have
> round_up(PAGE_SIZE) for BPF map arena, and use just just value_size or
> max_entries to specify the size (fixing the other to be zero).

I went with value_size == key_size == 8 in order to be able to extend
it in the future and allow map_lookup/update/delete to do something
useful. E.g. lookup/delete could behave just like arena_alloc/free_pages.

Are you proposing to force key/value_size to zero?
That was my first attempt.
key_size can be zero, but the syscall side of lookup/update expects
a non-zero value_size for all maps regardless of type.
We can modify bpf/syscall.c, of course, but it feels like arena would be
too different of a map if the generic map handling code needed
to be specialized.

Then, since value_size is > 0, what sizes make sense?
When it's 8, it can be an indirection to anything:
key/value would be user pointers to other structs that
would be meaningful for an arena.
Right now it costs nothing to force both to 8 and pick any logic
we like when we decide what lookup/update should do.

But then, with value_size == 8, making max_entries
mean the size of the arena in bytes or pages... starts to look odd
and different from all other maps.

We could go with max_entries == 0 and value_size meaning the size of
the arena in bytes, but that would prevent us from defining lookup/update
in the future, which doesn't feel right.

Considering all this, I went with the map->value_size * map->max_entries choice,
though it's not pretty.

> > @@ -4908,6 +4910,22 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> >         if (map->fd == map_fd)
> >                 return 0;
> >
> > +       if (def->type == BPF_MAP_TYPE_ARENA) {
> > +               size_t mmap_sz;
> > +
> > +               mmap_sz = bpf_map_mmap_sz(def->value_size, def->max_entries);
> > +               map->mmaped = mmap((void *)map->map_extra, mmap_sz, PROT_READ | PROT_WRITE,
> > +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> > +                                  map_fd, 0);
> > +               if (map->mmaped == MAP_FAILED) {
> > +                       err = -errno;
> > +                       map->mmaped = NULL;
> > +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> > +                               bpf_map__name(map), err);
> > +                       return err;
>
> leaking map_fd here, you need to close(map_fd) before erroring out

ahh. good catch.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-08  1:16   ` Andrii Nakryiko
@ 2024-02-08  1:58     ` Alexei Starovoitov
  2024-02-08 18:16       ` Andrii Nakryiko
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08  1:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 5:17 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > __uint() macro that is used to specify map attributes like:
> >   __uint(type, BPF_MAP_TYPE_ARRAY);
> >   __uint(map_flags, BPF_F_MMAPABLE);
> > is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
> >
> > Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> > In map definition "map_extra" is the only u64 field.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  tools/lib/bpf/bpf_helpers.h |  1 +
> >  tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
> >  2 files changed, 42 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> > index 9c777c21da28..fb909fc6866d 100644
> > --- a/tools/lib/bpf/bpf_helpers.h
> > +++ b/tools/lib/bpf/bpf_helpers.h
> > @@ -13,6 +13,7 @@
> >  #define __uint(name, val) int (*name)[val]
> >  #define __type(name, val) typeof(val) *name
> >  #define __array(name, val) typeof(val) *name[]
> > +#define __ulong(name, val) enum name##__enum { name##__value = val } name
>
> Can you try using __ulong() twice in the same file? enum type and
> value names have global visibility, so I suspect second use with the
> same field name would cause compilation error

Good point, will change it to:

#define __ulong(name, val) enum { __PASTE(__unique_value,__COUNTER__) = val } name
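
i.e. a single use expands to an anonymous enum with a unique enumerator
(the counter value below is only illustrative):

/* __ulong(map_extra, 1ull << 44) expands to roughly: */
enum { __unique_value0 = 1ull << 44 } map_extra;
/* a second use in the same file picks up __unique_value1, so nothing collides */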


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test.
  2024-02-07 17:04   ` Eduard Zingerman
@ 2024-02-08  2:59     ` Alexei Starovoitov
  2024-02-08 11:10       ` Jose E. Marchesi
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08  2:59 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 9:04 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-02-06 at 14:04 -0800, Alexei Starovoitov wrote:
> [...]
>
> > diff --git a/tools/testing/selftests/bpf/bpf_arena_list.h b/tools/testing/selftests/bpf/bpf_arena_list.h
> > new file mode 100644
> > index 000000000000..9f34142b0f65
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/bpf_arena_list.h
>
> [...]
>
> > +#ifndef __BPF__
> > +static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) {      return NULL; }
> > +static inline void bpf_iter_num_destroy(struct bpf_iter_num *) {}
> > +static inline bool bpf_iter_num_next(struct bpf_iter_num *) { return true; }
> > +#endif
>
> Note: when compiling using current clang 'main' (make test_progs) this reports the following errors:
>
> In file included from tools/testing/selftests/bpf/prog_tests/arena_list.c:9:
> ./bpf_arena_list.h:28:59: error: omitting the parameter name in a function
>                                  definition is a C23 extension [-Werror,-Wc23-extensions]
>    28 | static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) { return NULL; }
>    ...
>
> So I had to give parameter names for the above functions.

Thanks. Fixed. Too bad gcc 12 didn't catch it.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-07 21:07   ` Lorenzo Stoakes
  2024-02-07 22:56     ` Alexei Starovoitov
@ 2024-02-08  5:44     ` Johannes Weiner
  2024-02-08 23:55       ` Alexei Starovoitov
  2024-02-09  6:36       ` Lorenzo Stoakes
  2024-02-14  8:31     ` Christoph Hellwig
  2 siblings, 2 replies; 56+ messages in thread
From: Johannes Weiner @ 2024-02-08  5:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Alexei Starovoitov, bpf, daniel, andrii, martin.lau, memxor,
	eddyz87, tj, brho, linux-mm, kernel-team, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig

On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > The next commit will introduce bpf_arena which is a sparsely populated shared
> > memory region between bpf program and user space process.
> > It will function similar to vmalloc()/vm_map_ram():
> > - get_vm_area()
> > - alloc_pages()
> > - vmap_pages_range()
> 
> This tells me absolutely nothing about why it is justified to expose this
> internal interface. You need to put more explanation here along the lines
> of 'we had no other means of achieving what we needed from vmalloc because
> X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.

How about this:

---

BPF would like to use the vmap API to implement a lazily-populated
memory space which can be shared by multiple userspace threads.

The vmap API is generally public and has functions to request and
release areas of kernel address space, as well as functions to map
various types of backing memory into that space.

For example, there is the public ioremap_page_range(), which is used
to map device memory into addressable kernel space.

The new BPF code needs the functionality of vmap_pages_range() in
order to incrementally map privately managed arrays of pages into its
vmap area. Indeed this function used to be public, but became private
when usecases other than vmalloc happened to disappear.

Make it public again for the new external user.

---

> I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> for instance. We good to expose that, not only for you but for any other
> core kernel users?

Those are applicable only to the higher-level vmap/vmalloc usecases:
controlling the implied call to get_vm_area; managing the area with
vfree(). They're not relevant for mapping privately-managed pages into
an existing vm area. It's the same pattern and layer of abstraction as
ioremap_page_range(), which doesn't have any of those checks either.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-07 21:11       ` Barret Rhoden
@ 2024-02-08  6:26         ` Alexei Starovoitov
  2024-02-08 21:58           ` Barret Rhoden
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08  6:26 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
	linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 1:12 PM Barret Rhoden <brho@google.com> wrote:
>
> On 2/7/24 15:55, Alexei Starovoitov wrote:
> >> instead of uaddr, can you change this to take an address relative to the
> >> arena ("arena virtual address"?)?  the caller of this is in BPF, and
> >> they don't easily know the user virtual address.  maybe even just pgoff
> >> directly.
> > I thought about it, but it doesn't quite make sense.
> > bpf prog only sees user addresses.
> > All load/store returns them. If it bpf_printk-s an address it will be
> > user address.
> > bpf_arena_alloc_pages() also returns a user address.
>
> Yeah, makes sense to keep them all in the same address space.
>
> >
> > Kernel addresses are not seen by bpf prog at all.
> > kern_vm_base is completely hidden.
> > Only at JIT time, it's added to pointers.
> > So passing uaddr to arena_alloc_pages() matches mmap style.
> >
> > uaddr = bpf_arena_alloc_pages(... uaddr ...)
> > uaddr = mmap(uaddr, ...MAP_FIXED)
> >
> > Passing pgoff would be weird.
> > Also note that there is no extra flag for bpf_arena_alloc_pages().
> > uaddr == full 64-bit of zeros is not a valid addr to use.
>
> The problem I had with uaddr was that when I'm writing a BPF program, I
> don't know which address to use for a given page, e.g. the beginning of
> the arena.  I needed some way to tell me the user address "base" of the
> arena.  Though now that I can specify the user_vm_start through the
> map_extra, I think I'm ok.
>
> Specifically, say I want to break up my arena into two, 2GB chunks, one
> for each numa node, and I want to bump-allocate from each chunk.  When I
> want to allocate the first page from either segment, I'll need to know
> what user address is offset 0 or offset 2GB.

bump allocate... you mean like page_frag alloc does?
I've implemented one on top of arena:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/tools/testing/selftests/bpf/bpf_arena_alloc.h?h=arena&id=36d78b0f1c14c959d907d68cd7d54439b9213d0c

Also, I believe I addressed all the issues with the missing mutex and the
wrap-around, and pushed to:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=e1cb522fee661e7346e8be567eade9cf607eaf11
Please take a look.

Including the wrap around test in the last commit:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=01653c393a4167ccca23dc5a69aa9cf34a46eabd

Will wait a bit before sending v2.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test.
  2024-02-08  2:59     ` Alexei Starovoitov
@ 2024-02-08 11:10       ` Jose E. Marchesi
  0 siblings, 0 replies; 56+ messages in thread
From: Jose E. Marchesi @ 2024-02-08 11:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo,
	Barret Rhoden, Johannes Weiner, linux-mm, Kernel Team


> On Wed, Feb 7, 2024 at 9:04 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>
>> On Tue, 2024-02-06 at 14:04 -0800, Alexei Starovoitov wrote:
>> [...]
>>
>> > diff --git a/tools/testing/selftests/bpf/bpf_arena_list.h b/tools/testing/selftests/bpf/bpf_arena_list.h
>> > new file mode 100644
>> > index 000000000000..9f34142b0f65
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/bpf/bpf_arena_list.h
>>
>> [...]
>>
>> > +#ifndef __BPF__
>> > +static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) {      return NULL; }
>> > +static inline void bpf_iter_num_destroy(struct bpf_iter_num *) {}
>> > +static inline bool bpf_iter_num_next(struct bpf_iter_num *) { return true; }
>> > +#endif
>>
>> Note: when compiling using current clang 'main' (make test_progs) this reports the following errors:
>>
>> In file included from tools/testing/selftests/bpf/prog_tests/arena_list.c:9:
>> ./bpf_arena_list.h:28:59: error: omitting the parameter name in a function
>>                                  definition is a C23 extension [-Werror,-Wc23-extensions]
>>    28 | static inline void *bpf_iter_num_new(struct bpf_iter_num *, int, int) { return NULL; }
>>    ...
>>
>> So I had to give parameter names for the above functions.
>
> Thanks. Fixed. Too bad gcc 12 didn't catch it.

I'm opening a GCC bugzilla for this.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-08  1:58     ` Alexei Starovoitov
@ 2024-02-08 18:16       ` Andrii Nakryiko
  0 siblings, 0 replies; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-08 18:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 5:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 5:17 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > __uint() macro that is used to specify map attributes like:
> > >   __uint(type, BPF_MAP_TYPE_ARRAY);
> > >   __uint(map_flags, BPF_F_MMAPABLE);
> > > is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
> > >
> > > Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> > > In map definition "map_extra" is the only u64 field.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  tools/lib/bpf/bpf_helpers.h |  1 +
> > >  tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 42 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> > > index 9c777c21da28..fb909fc6866d 100644
> > > --- a/tools/lib/bpf/bpf_helpers.h
> > > +++ b/tools/lib/bpf/bpf_helpers.h
> > > @@ -13,6 +13,7 @@
> > >  #define __uint(name, val) int (*name)[val]
> > >  #define __type(name, val) typeof(val) *name
> > >  #define __array(name, val) typeof(val) *name[]
> > > +#define __ulong(name, val) enum name##__enum { name##__value = val } name
> >
> > Can you try using __ulong() twice in the same file? enum type and
> > value names have global visibility, so I suspect second use with the
> > same field name would cause compilation error
>
> Good point will change it to:
>
> #define __ulong(name, val) enum { __PASTE(__unique_value,__COUNTER__)
> = val } name

Yep, that should work. We can still have name collisions across
multiple files, but it doesn't matter when linking two .bpf.o files.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-08  1:38     ` Alexei Starovoitov
@ 2024-02-08 18:29       ` Andrii Nakryiko
  2024-02-08 18:45         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-08 18:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Wed, Feb 7, 2024 at 5:38 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 5:15 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > mmap() bpf_arena right after creation, since the kernel needs to
> > > remember the address returned from mmap. This is user_vm_start.
> > > LLVM will generate bpf_arena_cast_user() instructions where
> > > necessary and JIT will add upper 32-bit of user_vm_start
> > > to such pointers.
> > >
> > > Use traditional map->value_size * map->max_entries to calculate mmap sz,
> > > though it's not the best fit.
> >
> > We should probably make bpf_map_mmap_sz() aware of specific map type
> > and do different calculations based on that. It makes sense to have
> > round_up(PAGE_SIZE) for BPF map arena, and use just just value_size or
> > max_entries to specify the size (fixing the other to be zero).
>
> I went with value_size == key_size == 8 in order to be able to extend
> it in the future and allow map_lookup/update/delete to do something
> useful. Ex: lookup/delete can behave just like arena_alloc/free_pages.
>
> Are you proposing to force key/value_size to zero ?

Yeah, I was thinking either (value_size=<size-in-bytes> and
max_entries=0) or (value_size=0 and max_entries=<size-in-bytes>). The
latter is what we do for BPF ringbuf, for example.

What you are saying about lookup/update already seems different from
any "normal" map anyway, so I'm not sure that's a good enough reason
to have a hard-coded 8 for value size. And it seems like in practice,
instead of doing lookup/update through the syscall, the more natural way
of working with the arena is going to be mmap() anyway, so I'm not even
sure we need to implement the syscall side of lookup/update.

But just as an extra aside, what you have in mind for lookup/update
for the arena map can be generalized into "partial lookup/update" for
any map where it makes sense. I.e., instead of expecting the user to
read/update the entire value size, we can allow them to provide a
subrange to read/update (i.e., offset+size combo to specify subrange
within full map value range). This will work for the arena, but also
for most other maps (if not all) that currently support LOOKUP/UPDATE.

but specifically for bpf_map_mmap_sz(), regardless of what we decide
we should still change it to be something like:

switch (map_type) {
case BPF_MAP_TYPE_ARRAY:
    return <whatever we are doing right now>;
case BPF_MAP_TYPE_ARENA:
    /* round up to page size */
    return round_up(<whatever based on value_size and/or max_entries>, page_size);
default:
    return 0; /* not supported */
}

We can also add yet another case for RINGBUF, where it's (2 * max_entries).
The general point is that the mmapable size rules differ by
map type, so we'd best express that explicitly in this helper.
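
To make that a bit more concrete, one possible shape (a sketch only: the
signature change, the ARRAY math, and the arena/ringbuf cases below are my
reading of the discussion, not settled code):

static size_t bpf_map_mmap_sz(enum bpf_map_type type, __u32 value_sz, __u32 max_entries)
{
	const size_t page_sz = sysconf(_SC_PAGE_SIZE);

	switch (type) {
	case BPF_MAP_TYPE_ARRAY:
		/* keep today's array math */
		return roundup((size_t)roundup(value_sz, 8) * max_entries, page_sz);
	case BPF_MAP_TYPE_ARENA:
		/* assumed: arena size is value_size * max_entries, page-rounded */
		return roundup((size_t)value_sz * max_entries, page_sz);
	case BPF_MAP_TYPE_RINGBUF:
		/* per the note above: data area is 2 * max_entries */
		return 2 * (size_t)max_entries;
	default:
		return 0; /* not mmapable */
	}
}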


> That was my first attempt.
> key_size can be zero, but syscall side of lookup/update expects
> a non-zero value_size for all maps regardless of type.
> We can modify bpf/syscall.c, of course, but it feels arena would be
> too different of a map if generic map handling code would need
> to be specialized.
>
> Then since value_size is > 0 then what sizes make sense?
> When it's 8 it can be an indirection to anything.
> key/value would be user pointers to other structs that
> would be meaningful for an arena.
> Right now it costs nothing to force both to 8 and pick any logic
> when we decide what lookup/update should do.
>
> But then when value_size == 8 than making max_entries to
> mean the size of arena in bytes or pages.. starting to look odd
> and different from all other maps.
>
> We could go with max_entries==0 and value_size to mean the size of
> arena in bytes, but it will prevent us from defining lookup/update
> in the future, which doesn't feel right.
>
> Considering all this I went with map->value_size * map->max_entries choice.
> Though it's not pretty.
>
> > > @@ -4908,6 +4910,22 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> > >         if (map->fd == map_fd)
> > >                 return 0;
> > >
> > > +       if (def->type == BPF_MAP_TYPE_ARENA) {
> > > +               size_t mmap_sz;
> > > +
> > > +               mmap_sz = bpf_map_mmap_sz(def->value_size, def->max_entries);
> > > +               map->mmaped = mmap((void *)map->map_extra, mmap_sz, PROT_READ | PROT_WRITE,
> > > +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> > > +                                  map_fd, 0);
> > > +               if (map->mmaped == MAP_FAILED) {
> > > +                       err = -errno;
> > > +                       map->mmaped = NULL;
> > > +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> > > +                               bpf_map__name(map), err);
> > > +                       return err;
> >
> > leaking map_fd here, you need to close(map_fd) before erroring out
>
> ahh. good catch.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-08 18:29       ` Andrii Nakryiko
@ 2024-02-08 18:45         ` Alexei Starovoitov
  2024-02-08 18:54           ` Andrii Nakryiko
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08 18:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Thu, Feb 8, 2024 at 10:29 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 5:38 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Feb 7, 2024 at 5:15 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > mmap() bpf_arena right after creation, since the kernel needs to
> > > > remember the address returned from mmap. This is user_vm_start.
> > > > LLVM will generate bpf_arena_cast_user() instructions where
> > > > necessary and JIT will add upper 32-bit of user_vm_start
> > > > to such pointers.
> > > >
> > > > Use traditional map->value_size * map->max_entries to calculate mmap sz,
> > > > though it's not the best fit.
> > >
> > > We should probably make bpf_map_mmap_sz() aware of specific map type
> > > and do different calculations based on that. It makes sense to have
> > > round_up(PAGE_SIZE) for BPF map arena, and use just just value_size or
> > > max_entries to specify the size (fixing the other to be zero).
> >
> > I went with value_size == key_size == 8 in order to be able to extend
> > it in the future and allow map_lookup/update/delete to do something
> > useful. Ex: lookup/delete can behave just like arena_alloc/free_pages.
> >
> > Are you proposing to force key/value_size to zero ?
>
> Yeah, I was thinking either (value_size=<size-in-bytes> and
> max_entries=0) or (value_size=0 and max_entries=<size-in-bytes>). The
> latter is what we do for BPF ringbuf, for example.

Ouch. Since map_update_elem() does:
        value_size = bpf_map_value_size(map);
        value = kvmemdup_bpfptr(uvalue, value_size);
...
static inline void *kvmemdup_bpfptr(bpfptr_t src, size_t len)
{
        void *p = kvmalloc(len, GFP_USER | __GFP_NOWARN);

        if (!p)
                return ERR_PTR(-ENOMEM);
        if (copy_from_bpfptr(p, src, len)) {
...
        if (unlikely(!size))
                return ZERO_SIZE_PTR;

and it's probably crashing the kernel.

Looks like we have fixes to do anyway :(


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-08 18:45         ` Alexei Starovoitov
@ 2024-02-08 18:54           ` Andrii Nakryiko
  2024-02-08 18:59             ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-08 18:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Thu, Feb 8, 2024 at 10:45 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 10:29 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Feb 7, 2024 at 5:38 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Feb 7, 2024 at 5:15 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 6, 2024 at 2:05 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > >
> > > > > mmap() bpf_arena right after creation, since the kernel needs to
> > > > > remember the address returned from mmap. This is user_vm_start.
> > > > > LLVM will generate bpf_arena_cast_user() instructions where
> > > > > necessary and JIT will add upper 32-bit of user_vm_start
> > > > > to such pointers.
> > > > >
> > > > > Use traditional map->value_size * map->max_entries to calculate mmap sz,
> > > > > though it's not the best fit.
> > > >
> > > > We should probably make bpf_map_mmap_sz() aware of specific map type
> > > > and do different calculations based on that. It makes sense to have
> > > > round_up(PAGE_SIZE) for BPF map arena, and use just just value_size or
> > > > max_entries to specify the size (fixing the other to be zero).
> > >
> > > I went with value_size == key_size == 8 in order to be able to extend
> > > it in the future and allow map_lookup/update/delete to do something
> > > useful. Ex: lookup/delete can behave just like arena_alloc/free_pages.
> > >
> > > Are you proposing to force key/value_size to zero ?
> >
> > Yeah, I was thinking either (value_size=<size-in-bytes> and
> > max_entries=0) or (value_size=0 and max_entries=<size-in-bytes>). The
> > latter is what we do for BPF ringbuf, for example.
>
> Ouch. since map_update_elem() does:
>         value_size = bpf_map_value_size(map);
>         value = kvmemdup_bpfptr(uvalue, value_size);
> ...
> static inline void *kvmemdup_bpfptr(bpfptr_t src, size_t len)
> {
>         void *p = kvmalloc(len, GFP_USER | __GFP_NOWARN);
>
>         if (!p)
>                 return ERR_PTR(-ENOMEM);
>         if (copy_from_bpfptr(p, src, len)) {
> ...
>         if (unlikely(!size))
>                 return ZERO_SIZE_PTR;
>
> and it's probably crashing the kernel.

You mean when doing this from a SYSCALL program?

>
> Looks like we have fixes to do anyway :(

Yeah, it's kind of weird to first read the key/value "memory" and then
get -ENOTSUP for maps that don't support lookup/update. We should
error out sooner.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena.
  2024-02-08 18:54           ` Andrii Nakryiko
@ 2024-02-08 18:59             ` Alexei Starovoitov
  0 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08 18:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Thu, Feb 8, 2024 at 10:55 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> You mean when doing this from SYSCALL program?

Both. The regular syscall too.


> >
> > Looks like we have fixes to do anyway :(
>
> Yeah, it's kind of weird to first read key/value "memory", and then
> getting -ENOTSUP for maps that don't support lookup/update. We should
> error out sooner.

It's all over the place.
Probably better to hack bpf_map_value_size() to never return 0.
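
Something along these lines, maybe (a sketch only: the existing per-map-type
sizing is elided and the non-zero floor is a placeholder):

static u32 bpf_map_value_size(const struct bpf_map *map)
{
	u32 size;

	/* existing per-map-type sizing (percpu maps, fd maps, ...) elided */
	size = map->value_size;

	return size ?: 8; /* never let the syscall path kvmalloc() a zero length */
}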


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *'
  2024-02-06 22:04 ` [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
@ 2024-02-08 19:40   ` Andrii Nakryiko
  2024-02-09  0:09     ` Alexei Starovoitov
  2024-02-09 16:06   ` David Vernet
  1 sibling, 1 reply; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-08 19:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

On Tue, Feb 6, 2024 at 2:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Recognize return of 'void *' from kfunc as returning unknown scalar.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/verifier.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ddaf09db1175..d9c2dbb3939f 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>                                         meta.func_name);
>                                 return -EFAULT;
>                         }
> +               } else if (btf_type_is_void(ptr_type)) {
> +                       /* kfunc returning 'void *' is equivalent to returning scalar */
> +                       mark_reg_unknown(env, regs, BPF_REG_0);

Acked-by: Andrii Nakryiko <andrii@kernel.org>

I think we should do a similar extension when passing `void *` into
global funcs. It's best to treat it as a SCALAR instead of rejecting it
because we can't calculate the size. Currently, users in practice just
have to define it as `uintptr_t` and then cast (or create static
wrappers doing the casting). Anyway, my point is that it makes sense
to treat `void *` as a non-pointer.

>                 } else if (!__btf_type_is_struct(ptr_type)) {
>                         if (!meta.r0_size) {
>                                 __u32 sz;
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-08  6:26         ` Alexei Starovoitov
@ 2024-02-08 21:58           ` Barret Rhoden
  2024-02-08 23:36             ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Barret Rhoden @ 2024-02-08 21:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
	linux-mm, Kernel Team

On 2/8/24 01:26, Alexei Starovoitov wrote:
> Also I believe I addressed all issues with missing mutex and wrap around,
> and pushed to:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=e1cb522fee661e7346e8be567eade9cf607eaf11
> Please take a look.

LGTM, thanks.

minor things:

> +static void arena_vm_close(struct vm_area_struct *vma)
> +{
> +	struct vma_list *vml;
> +
> +	vml = vma->vm_private_data;
> +	list_del(&vml->head);
> +	vma->vm_private_data = NULL;
> +	kfree(vml);
> +}

I think this also needs to be protected by the arena mutex. Otherwise two
VMAs that close at the same time can corrupt the arena vma_list, or a
VMA can close while you're zapping.

remember_vma() already has the mutex held, since it's called from mmap.
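
e.g. something like this (a sketch; the 'lock' mutex name, the embedded 'map'
member, and getting the map from vma->vm_file->private_data are assumptions
on my part):

static void arena_vm_close(struct vm_area_struct *vma)
{
	struct bpf_map *map = vma->vm_file->private_data;
	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
	struct vma_list *vml = vma->vm_private_data;

	mutex_lock(&arena->lock);	/* serialize against other closes and zapping */
	list_del(&vml->head);
	vma->vm_private_data = NULL;
	kfree(vml);
	mutex_unlock(&arena->lock);
}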

> +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> +{
> +	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;

This function and arena_free_pages() both use user_vm_start/end
before grabbing the mutex, so the mutex needs to be grabbed very early.

Alternatively, you could make it so that the user must set
user_vm_start via map_extra, so you don't have to worry about these
changing after the arena is created.

thanks,

barret





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-08 21:58           ` Barret Rhoden
@ 2024-02-08 23:36             ` Alexei Starovoitov
  2024-02-08 23:50               ` Barret Rhoden
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08 23:36 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
	linux-mm, Kernel Team

On Thu, Feb 8, 2024 at 1:58 PM Barret Rhoden <brho@google.com> wrote:
>
> On 2/8/24 01:26, Alexei Starovoitov wrote:
> > Also I believe I addressed all issues with missing mutex and wrap around,
> > and pushed to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=e1cb522fee661e7346e8be567eade9cf607eaf11
> > Please take a look.
>
> LGTM, thanks.
>
> minor things:
>
> > +static void arena_vm_close(struct vm_area_struct *vma)
> > +{
> > +     struct vma_list *vml;
> > +
> > +     vml = vma->vm_private_data;
> > +     list_del(&vml->head);
> > +     vma->vm_private_data = NULL;
> > +     kfree(vml);
> > +}
>
> i think this also needs protected by the arena mutex.  otherwise two
> VMAs that close at the same time can corrupt the arena vma_list.  or a
> VMA that closes while you're zapping.

Excellent catch.

> remember_vma() already has the mutex held, since it's called from mmap.
>
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> > +{
> > +     long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> this function and arena_free_pages() are both using user_vm_start/end
> before grabbing the mutex.  so need to grab the mutex very early.
>
> alternatively, you could make it so that the user must set the
> user_vm_start via map_extra, so you don't have to worry about these
> changing after the arena is created.

Looks like I lost the diff hunk where the verifier checks that the
arena has user_vm_start set before loading the prog,
and for some reason I forgot to remove the
if (!arena->user_vm_start) return..
in bpf_arena_alloc/free_pages().
I'll remove the latter and add the verifier enforcement back.
The intent was to never call arena_alloc/free_pages() when the arena is
not fully formed.
Once that's fixed there will be no race in arena_alloc_pages():
user_vm_start/end are fixed before the program is loaded.

One more thing.
The vmap_pages_range_wrap32() fix that you saw in that commit is not
enough.
Turns out that [%r12 + src_reg + off] in JIT asm doesn't
fully conform to "kernel bounds all access into 32-bit".
That "+ off" part is added _after_ src_reg is bounded to 32-bit.
Remember, that was the reason we added guard pages before and after
kernel 4Gb vm area.
It's working as intended, but for this wrap32 case we need to
map one page into the normal kernel vma _and_ into the guard page.
Consider your example:
user_start_va = 0x1,fffff000
user_end_va =   0x2,fffff000

the pgoff = 0 is uaddr 0x1,fffff000.
It's kaddr = kern_vm_start + 0xfffff000
and kaddr + PAGE_SIZE is kern_vm_start + 0.

When bpf prog access an arena pointer it can do:
dst_reg = *(u64 *)(src_reg + 0)
and
dst_reg = *(u64 *)(src_reg + 4096)

The first LDX is fine, but the 2nd will fault
when src_reg is 0xfffff000.
From the user space pov it's a virtually contiguous address range.
For bpf prog it's also contiguous when src_reg is 32-bit bounded,
but "+ 4096" breaks that.
The 2nd load becomes:
kern_vm_start + 0xfffff000 + 4096
and it faults.
Theoretically a solution is to do:
kern_vm_start + (u32)(0xfffff000 + 4096)
in JIT, but that is too expensive.

Hence I went with an arena-side fix (ignore the lack of error checking):
static int vunmap_guard_pages(u64 kern_vm_start, u64 start, u64 end)
{
        end = (u32)end;
        if (start < S16_MAX) {
                u64 end1 = min(end, S16_MAX + 1);

                vunmap_range(kern_vm_start + (1ull << 32) + start,
                             kern_vm_start + (1ull << 32) + end1);
        }

        if (end >= U32_MAX - S16_MAX + 1) {
                u64 start2 = max(start, U32_MAX - S16_MAX + 1);

                vunmap_range(kern_vm_start - (1ull << 32) + start2,
                             kern_vm_start - (1ull << 32) + end);
        }
        return 0;
}
static int vmap_pages_range_wrap32(u64 kern_vm_start, u64 uaddr, u64 page_cnt,
                                   struct page **pages)
{
        u64 start = kern_vm_start + uaddr;
        u64 end = start + page_cnt * PAGE_SIZE;
        u64 part1_page_cnt, start2, end2;
        int ret;

        if (page_cnt == 1 || !((uaddr + page_cnt * PAGE_SIZE) >> 32)) {
                /* uaddr doesn't overflow in 32-bit */
                ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
                if (ret)
                        return ret;
                vmap_guard_pages(kern_vm_start, uaddr, uaddr + page_cnt * PAGE_SIZE, pages);
                return 0;
        }

        part1_page_cnt = ((1ull << 32) - (u32)uaddr) >> PAGE_SHIFT;
        end = start + part1_page_cnt * PAGE_SIZE;
        ret = vmap_pages_range(start, end,
                               PAGE_KERNEL, pages, PAGE_SHIFT);
        if (ret)
                return ret;

        vmap_guard_pages(kern_vm_start, uaddr, uaddr + part1_page_cnt * PAGE_SIZE, pages);

        start2 = kern_vm_start;
        end2 = start2 + (page_cnt - part1_page_cnt) * PAGE_SIZE;
        ret = vmap_pages_range(start2, end2,
                               PAGE_KERNEL, &pages[part1_page_cnt], PAGE_SHIFT);
        if (ret) {
                vunmap_range(start, end);
                return ret;
        }

        vmap_guard_pages(kern_vm_start, 0, (page_cnt - part1_page_cnt) * PAGE_SIZE,
                         pages + part1_page_cnt);
        return 0;
}

It's working, but it's too complicated.
Instead of a single vmap_pages_range()
we might need to do up to 4 calls and map certain pages into
two places so that both 64-bit virtual addresses:
kern_vm_start + 0xfffff000 + 4096
and
kern_vm_start + (u32)(0xfffff000 + 4096)
point to the same page.

I'm inclined to tackle the wrap32 issue differently and simply
disallow [user_vm_start, user_vm_end] combinations
where the lower 32 bits can wrap.

In other words it would mean that mmap() of len=4GB will be
aligned to 4GB,
while mmap() of len=1M will be offset in such a way
that both addr and addr+1M have the same upper 32 bits.
(It's not the same as 1M aligned.)
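
i.e. the constraint could be checked with something like (a sketch, not from
the patch):

/* true iff [user_vm_start, user_vm_end) stays within one 4GB-aligned window,
 * so the low 32 bits never wrap; a full 4GB arena then has to be 4GB-aligned
 */
static bool arena_range_ok(u64 user_vm_start, u64 user_vm_end)
{
	return (user_vm_start >> 32) == ((user_vm_end - 1) >> 32);
}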

With that I will remove vmap_pages_range_wrap32() and
do a single normal vmap_pages_range() without extra tricks.

wdyt?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
  2024-02-08 23:36             ` Alexei Starovoitov
@ 2024-02-08 23:50               ` Barret Rhoden
  0 siblings, 0 replies; 56+ messages in thread
From: Barret Rhoden @ 2024-02-08 23:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
	linux-mm, Kernel Team

On 2/8/24 18:36, Alexei Starovoitov wrote:
> I'm inclined to tackle wrap32 issue differently and simply
> disallow [user_vm_start, user_vm_end] combination
> where lower 32-bit can wrap.
> 
> In other words it would mean that mmap() of len=4Gb will be
> aligned to 4Gb,
> while mmap() of len=1M will be offsetted in such a way
> that both addr and add+1M have the same upper 32-bit.
> (It's not the same as 1M aligned).
> 
> With that I will remove vmap_pages_range_wrap32() and
> do single normal vmap_pages_range() without extra tricks.
> 
> wdyt?

SGTM.

Knowing that you can't wrap the lower 32 bits removes a lot of headaches,
and the restriction of aligning a 4GB mapping to a 4GB boundary is pretty
sane. TBH, putting it anywhere else is just asking for heartache. =)

barret




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-08  5:44     ` Johannes Weiner
@ 2024-02-08 23:55       ` Alexei Starovoitov
  2024-02-09  6:36       ` Lorenzo Stoakes
  1 sibling, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-08 23:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Lorenzo Stoakes, bpf, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Barret Rhoden, linux-mm, Kernel Team, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig

On Wed, Feb 7, 2024 at 9:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > The next commit will introduce bpf_arena which is a sparsely populated shared
> > > memory region between bpf program and user space process.
> > > It will function similar to vmalloc()/vm_map_ram():
> > > - get_vm_area()
> > > - alloc_pages()
> > > - vmap_pages_range()
> >
> > This tells me absolutely nothing about why it is justified to expose this
> > internal interface. You need to put more explanation here along the lines
> > of 'we had no other means of achieving what we needed from vmalloc because
> > X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.
>
> How about this:
>
> ---
>
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
>
> The vmap API is generally public and has functions to request and
> release areas of kernel address space, as well as functions to map
> various types of backing memory into that space.
>
> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.
>
> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.
>
> Make it public again for the new external user.

Thank you Johannes!
You've said it better than I ever could.
I'll replace my cryptic commit log with the above in v2.

>
> ---
>
> > I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> > for instance. We good to expose that, not only for you but for any other
> > core kernel users?
>
> Those are applicable only to the higher-level vmap/vmalloc usecases:
> controlling the implied call to get_vm_area; managing the area with
> vfree(). They're not relevant for mapping privately-managed pages into
> an existing vm area. It's the same pattern and layer of abstraction as
> ioremap_pages_range(), which doesn't have any of those checks either.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *'
  2024-02-08 19:40   ` Andrii Nakryiko
@ 2024-02-09  0:09     ` Alexei Starovoitov
  2024-02-09 19:09       ` Andrii Nakryiko
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  0:09 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Thu, Feb 8, 2024 at 11:40 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 2:04 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Recognize return of 'void *' from kfunc as returning unknown scalar.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  kernel/bpf/verifier.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index ddaf09db1175..d9c2dbb3939f 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >                                         meta.func_name);
> >                                 return -EFAULT;
> >                         }
> > +               } else if (btf_type_is_void(ptr_type)) {
> > +                       /* kfunc returning 'void *' is equivalent to returning scalar */
> > +                       mark_reg_unknown(env, regs, BPF_REG_0);
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> I think we should do a similar extension when passing `void *` into
> global funcs. It's best to treat it as SCALAR instead of rejecting it
> because we can't calculate the size. Currently users in practice just
> have to define it as `uintptr_t` and then cast (or create static
> wrappers doing the casting). Anyways, my point is that it makes sense
> to treat `void *` as non-pointer.

Makes sense. Will add it to my todo list.

On that note, I've been thinking about how to get rid of the __arg_arena
that I'm adding in this series.

How about the following algorithm?
do_check_main() sees that a scalar or ptr_to_arena is passed
into a global subprog whose BTF says 'struct foo *',
which today would require ptr_to_mem.
Instead of rejecting the prog, the verifier would override
(only once and in one direction)
that arg of that global func from ptr_to_mem to scalar
and proceed as usual.
do_check_common() of that global subprog will pick up the scalar
for that arg, since args are cached,
and verification will proceed successfully without a special __arg_arena.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-08  5:44     ` Johannes Weiner
  2024-02-08 23:55       ` Alexei Starovoitov
@ 2024-02-09  6:36       ` Lorenzo Stoakes
  1 sibling, 0 replies; 56+ messages in thread
From: Lorenzo Stoakes @ 2024-02-09  6:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Alexei Starovoitov, bpf, daniel, andrii, martin.lau, memxor,
	eddyz87, tj, brho, linux-mm, kernel-team, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig

On Thu, Feb 08, 2024 at 06:44:35AM +0100, Johannes Weiner wrote:
> On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > The next commit will introduce bpf_arena which is a sparsely populated shared
> > > memory region between bpf program and user space process.
> > > It will function similar to vmalloc()/vm_map_ram():
> > > - get_vm_area()
> > > - alloc_pages()
> > > - vmap_pages_range()
> >
> > This tells me absolutely nothing about why it is justified to expose this
> > internal interface. You need to put more explanation here along the lines
> > of 'we had no other means of achieving what we needed from vmalloc because
> > X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.
>
> How about this:
>
> ---
>
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
>
> The vmap API is generally public and has functions to request and
> release areas of kernel address space, as well as functions to map
> various types of backing memory into that space.
>
> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.
>
> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.
>
> Make it public again for the new external user.

Thanks, yes, this is much better!

>
> ---
>
> > I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> > for instance. We good to expose that, not only for you but for any other
> > core kernel users?
>
> Those are applicable only to the higher-level vmap/vmalloc usecases:
> controlling the implied call to get_vm_area; managing the area with
> vfree(). They're not relevant for mapping privately-managed pages into
> an existing vm area. It's the same pattern and layer of abstraction as
> ioremap_pages_range(), which doesn't have any of those checks either.

OK, that makes more sense re: the comparison to ioremap_page_range(). My concern
arises from a couple of things - firstly, to avoid exposing an interface
that might be misinterpreted as acting as if it were a standard vmap() when
it instead skips a lot of checks (e.g. count > totalram_pages()).

Secondly, my concern is that this side-steps the metadata that tracks use of
the vmap range, doesn't it? So there is nothing stopping something from coming
along and remapping some other vmalloc memory into that range later, right?

It feels like exposing page table code that sits outside of the whole
vmalloc mechanism to other users.

On the other hand... since we already expose ioremap_page_range() and that
has the exact same issue, I guess it's moot anyway?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *'
  2024-02-06 22:04 ` [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
  2024-02-08 19:40   ` Andrii Nakryiko
@ 2024-02-09 16:06   ` David Vernet
  1 sibling, 0 replies; 56+ messages in thread
From: David Vernet @ 2024-02-09 16:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

On Tue, Feb 06, 2024 at 02:04:26PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Recognize return of 'void *' from kfunc as returning unknown scalar.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: David Vernet <void@manifault.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-06 22:04 ` [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
@ 2024-02-09 16:57   ` David Vernet
  2024-02-09 17:46     ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: David Vernet @ 2024-02-09 16:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

[-- Attachment #1: Type: text/plain, Size: 5594 bytes --]

On Tue, Feb 06, 2024 at 02:04:27PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Recognize 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
> It allows kfunc to have 'void *' argument for maps, since bpf progs
> will call them as:
> struct {
>         __uint(type, BPF_MAP_TYPE_ARENA);
> 	...
> } arena SEC(".maps");
> 
> bpf_kfunc_with_map(... &arena ...);
> 
> Underneath libbpf will load CONST_PTR_TO_MAP into the register via ld_imm64 insn.
> If kfunc was defined with 'struct bpf_map *' it would pass
> the verifier, but bpf prog would need to use '(void *)&arena'.
> Which is not clean.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/verifier.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d9c2dbb3939f..db569ce89fb1 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
>  	return __kfunc_param_match_suffix(btf, arg, "__ign");
>  }
>  
> +static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
> +{
> +	return __kfunc_param_match_suffix(btf, arg, "__map");
> +}
> +
>  static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
>  {
>  	return __kfunc_param_match_suffix(btf, arg, "__alloc");
> @@ -11064,7 +11069,7 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
>  		return KF_ARG_PTR_TO_CONST_STR;
>  
>  	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> -		if (!btf_type_is_struct(ref_t)) {
> +		if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
>  			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
>  				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
>  			return -EINVAL;
> @@ -11660,6 +11665,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  		if (kf_arg_type < 0)
>  			return kf_arg_type;
>  
> +		if (is_kfunc_arg_map(btf, &args[i])) {
> +			/* If argument has '__map' suffix expect 'struct bpf_map *' */
> +			ref_id = *reg2btf_ids[CONST_PTR_TO_MAP];
> +			ref_t = btf_type_by_id(btf_vmlinux, ref_id);
> +			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
> +		}

This is fine, but given that this should only apply to KF_ARG_PTR_TO_BTF_ID,
this seems a bit cleaner, wdyt?

index ddaf09db1175..998da8b302ac 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
        return __kfunc_param_match_suffix(btf, arg, "__ign");
 }

+static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
+{
+       return __kfunc_param_match_suffix(btf, arg, "__map");
+}
+
 static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
 {
        return __kfunc_param_match_suffix(btf, arg, "__alloc");
@@ -10910,6 +10915,7 @@ enum kfunc_ptr_arg_type {
        KF_ARG_PTR_TO_RB_NODE,
        KF_ARG_PTR_TO_NULL,
        KF_ARG_PTR_TO_CONST_STR,
+       KF_ARG_PTR_TO_MAP,      /* pointer to a struct bpf_map */
 };

 enum special_kfunc_type {
@@ -11064,12 +11070,12 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
                return KF_ARG_PTR_TO_CONST_STR;

        if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
-               if (!btf_type_is_struct(ref_t)) {
+               if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
                        verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
                                meta->func_name, argno, btf_type_str(ref_t), ref_tname);
                        return -EINVAL;
                }
-               return KF_ARG_PTR_TO_BTF_ID;
+               return is_kfunc_arg_map(meta->btf, &args[argno]) ? KF_ARG_PTR_TO_MAP : KF_ARG_PTR_TO_BTF_ID;
        }

        if (is_kfunc_arg_callback(env, meta->btf, &args[argno]))
@@ -11663,6 +11669,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
                switch (kf_arg_type) {
                case KF_ARG_PTR_TO_NULL:
                        continue;
+               case KF_ARG_PTR_TO_MAP:
                case KF_ARG_PTR_TO_ALLOC_BTF_ID:
                case KF_ARG_PTR_TO_BTF_ID:
                        if (!is_kfunc_trusted_args(meta) && !is_kfunc_rcu(meta))
@@ -11879,6 +11886,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
                        if (ret < 0)
                                return ret;
                        break;
+               case KF_ARG_PTR_TO_MAP:
+                       /* If argument has '__map' suffix expect 'struct bpf_map *' */
+                       ref_id = *reg2btf_ids[CONST_PTR_TO_MAP];
+                       ref_t = btf_type_by_id(btf_vmlinux, ref_id);
+                       ref_tname = btf_name_by_offset(btf, ref_t->name_off);
+
+                       fallthrough;
                case KF_ARG_PTR_TO_BTF_ID:
                        /* Only base_type is checked, further checks are done here */
                        if ((base_type(reg->type) != PTR_TO_BTF_ID ||


> +
>  		switch (kf_arg_type) {
>  		case KF_ARG_PTR_TO_NULL:
>  			continue;
> -- 
> 2.34.1
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-09 16:57   ` David Vernet
@ 2024-02-09 17:46     ` Alexei Starovoitov
  2024-02-09 18:11       ` David Vernet
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-09 17:46 UTC (permalink / raw)
  To: David Vernet
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

On Fri, Feb 09, 2024 at 10:57:45AM -0600, David Vernet wrote:
> On Tue, Feb 06, 2024 at 02:04:27PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Recognize 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
> > It allows kfunc to have 'void *' argument for maps, since bpf progs
> > will call them as:
> > struct {
> >         __uint(type, BPF_MAP_TYPE_ARENA);
> > 	...
> > } arena SEC(".maps");
> > 
> > bpf_kfunc_with_map(... &arena ...);
> > 
> > Underneath libbpf will load CONST_PTR_TO_MAP into the register via ld_imm64 insn.
> > If kfunc was defined with 'struct bpf_map *' it would pass
> > the verifier, but bpf prog would need to use '(void *)&arena'.
> > Which is not clean.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  kernel/bpf/verifier.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index d9c2dbb3939f..db569ce89fb1 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
> >  	return __kfunc_param_match_suffix(btf, arg, "__ign");
> >  }
> >  
> > +static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
> > +{
> > +	return __kfunc_param_match_suffix(btf, arg, "__map");
> > +}
> > +
> >  static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
> >  {
> >  	return __kfunc_param_match_suffix(btf, arg, "__alloc");
> > @@ -11064,7 +11069,7 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> >  		return KF_ARG_PTR_TO_CONST_STR;
> >  
> >  	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> > -		if (!btf_type_is_struct(ref_t)) {
> > +		if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
> >  			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
> >  				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
> >  			return -EINVAL;
> > @@ -11660,6 +11665,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> >  		if (kf_arg_type < 0)
> >  			return kf_arg_type;
> >  
> > +		if (is_kfunc_arg_map(btf, &args[i])) {
> > +			/* If argument has '__map' suffix expect 'struct bpf_map *' */
> > +			ref_id = *reg2btf_ids[CONST_PTR_TO_MAP];
> > +			ref_t = btf_type_by_id(btf_vmlinux, ref_id);
> > +			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
> > +		}
> 
> This is fine, but given that this should only apply to KF_ARG_PTR_TO_BTF_ID,
> this seems a bit cleaner, wdyt?
> 
> index ddaf09db1175..998da8b302ac 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
>         return __kfunc_param_match_suffix(btf, arg, "__ign");
>  }
> 
> +static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
> +{
> +       return __kfunc_param_match_suffix(btf, arg, "__map");
> +}
> +
>  static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
>  {
>         return __kfunc_param_match_suffix(btf, arg, "__alloc");
> @@ -10910,6 +10915,7 @@ enum kfunc_ptr_arg_type {
>         KF_ARG_PTR_TO_RB_NODE,
>         KF_ARG_PTR_TO_NULL,
>         KF_ARG_PTR_TO_CONST_STR,
> +       KF_ARG_PTR_TO_MAP,      /* pointer to a struct bpf_map */
>  };
> 
>  enum special_kfunc_type {
> @@ -11064,12 +11070,12 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
>                 return KF_ARG_PTR_TO_CONST_STR;
> 
>         if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> -               if (!btf_type_is_struct(ref_t)) {
> +               if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
>                         verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
>                                 meta->func_name, argno, btf_type_str(ref_t), ref_tname);
>                         return -EINVAL;
>                 }
> -               return KF_ARG_PTR_TO_BTF_ID;
> +               return is_kfunc_arg_map(meta->btf, &args[argno]) ? KF_ARG_PTR_TO_MAP : KF_ARG_PTR_TO_BTF_ID;

Makes sense, but then should I add the following on top:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e970d9fd7f32..b524dc168023 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11088,13 +11088,16 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
        if (is_kfunc_arg_const_str(meta->btf, &args[argno]))
                return KF_ARG_PTR_TO_CONST_STR;

+       if (is_kfunc_arg_map(meta->btf, &args[argno]))
+               return KF_ARG_PTR_TO_MAP;
+
        if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
-               if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
+               if (!btf_type_is_struct(ref_t)) {
                        verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
                                meta->func_name, argno, btf_type_str(ref_t), ref_tname);
                        return -EINVAL;
                }
-               return is_kfunc_arg_map(meta->btf, &args[argno]) ? KF_ARG_PTR_TO_MAP : KF_ARG_PTR_TO_BTF_ID;
+               return KF_ARG_PTR_TO_BTF_ID;
        }

?



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-09 17:46     ` Alexei Starovoitov
@ 2024-02-09 18:11       ` David Vernet
  2024-02-09 18:59         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: David Vernet @ 2024-02-09 18:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, martin.lau, memxor, eddyz87, tj, brho,
	hannes, linux-mm, kernel-team

[-- Attachment #1: Type: text/plain, Size: 7051 bytes --]

On Fri, Feb 09, 2024 at 09:46:57AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 09, 2024 at 10:57:45AM -0600, David Vernet wrote:
> > On Tue, Feb 06, 2024 at 02:04:27PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > > 
> > > Recognize 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
> > > It allows kfunc to have 'void *' argument for maps, since bpf progs
> > > will call them as:
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_ARENA);
> > > 	...
> > > } arena SEC(".maps");
> > > 
> > > bpf_kfunc_with_map(... &arena ...);
> > > 
> > > Underneath libbpf will load CONST_PTR_TO_MAP into the register via ld_imm64 insn.
> > > If kfunc was defined with 'struct bpf_map *' it would pass
> > > the verifier, but bpf prog would need to use '(void *)&arena'.
> > > Which is not clean.
> > > 
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  kernel/bpf/verifier.c | 14 +++++++++++++-
> > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index d9c2dbb3939f..db569ce89fb1 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
> > >  	return __kfunc_param_match_suffix(btf, arg, "__ign");
> > >  }
> > >  
> > > +static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
> > > +{
> > > +	return __kfunc_param_match_suffix(btf, arg, "__map");
> > > +}
> > > +
> > >  static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
> > >  {
> > >  	return __kfunc_param_match_suffix(btf, arg, "__alloc");
> > > @@ -11064,7 +11069,7 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> > >  		return KF_ARG_PTR_TO_CONST_STR;
> > >  
> > >  	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> > > -		if (!btf_type_is_struct(ref_t)) {
> > > +		if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
> > >  			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
> > >  				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
> > >  			return -EINVAL;
> > > @@ -11660,6 +11665,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > >  		if (kf_arg_type < 0)
> > >  			return kf_arg_type;
> > >  
> > > +		if (is_kfunc_arg_map(btf, &args[i])) {
> > > +			/* If argument has '__map' suffix expect 'struct bpf_map *' */
> > > +			ref_id = *reg2btf_ids[CONST_PTR_TO_MAP];
> > > +			ref_t = btf_type_by_id(btf_vmlinux, ref_id);
> > > +			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
> > > +		}
> > 
> > This is fine, but given that this should only apply to KF_ARG_PTR_TO_BTF_ID,
> > this seems a bit cleaner, wdyt?
> > 
> > index ddaf09db1175..998da8b302ac 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
> >         return __kfunc_param_match_suffix(btf, arg, "__ign");
> >  }
> > 
> > +static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
> > +{
> > +       return __kfunc_param_match_suffix(btf, arg, "__map");
> > +}
> > +
> >  static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
> >  {
> >         return __kfunc_param_match_suffix(btf, arg, "__alloc");
> > @@ -10910,6 +10915,7 @@ enum kfunc_ptr_arg_type {
> >         KF_ARG_PTR_TO_RB_NODE,
> >         KF_ARG_PTR_TO_NULL,
> >         KF_ARG_PTR_TO_CONST_STR,
> > +       KF_ARG_PTR_TO_MAP,      /* pointer to a struct bpf_map */
> >  };
> > 
> >  enum special_kfunc_type {
> > @@ -11064,12 +11070,12 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> >                 return KF_ARG_PTR_TO_CONST_STR;
> > 
> >         if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> > -               if (!btf_type_is_struct(ref_t)) {
> > +               if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
> >                         verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
> >                                 meta->func_name, argno, btf_type_str(ref_t), ref_tname);
> >                         return -EINVAL;
> >                 }
> > -               return KF_ARG_PTR_TO_BTF_ID;
> > +               return is_kfunc_arg_map(meta->btf, &args[argno]) ? KF_ARG_PTR_TO_MAP : KF_ARG_PTR_TO_BTF_ID;
> 
> Makes sense, but then should I add the following on top:
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e970d9fd7f32..b524dc168023 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -11088,13 +11088,16 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
>         if (is_kfunc_arg_const_str(meta->btf, &args[argno]))
>                 return KF_ARG_PTR_TO_CONST_STR;
> 
> +       if (is_kfunc_arg_map(meta->btf, &args[argno]))
> +               return KF_ARG_PTR_TO_MAP;
> +

Yeah, it's probably cleaner to pull it out of that block, which is
already a bit of a mess.

Only thing is that it doesn't make sense to invoke is_kfunc_arg_map() on
something that doesn't have base_type(reg->type) == CONST_PTR_TO_MAP
right? We sort of had that covered in the below block because of the
reg2btf_ids[base_type(reg->type)] check, but even then it was kind of
sketchy because we could have base_type(reg->type) == PTR_TO_BTF_ID or
some other base_type with a nonzero btf ID and still treat it as a
KF_ARG_PTR_TO_MAP depending on how the kfunc was named. So maybe
something like this would be yet another improvement on top of both
proposals that would avoid any weird edge cases or confusion on the part
of the kfunc author?

+ if (is_kfunc_arg_map(meta->btf, &args[argno])) {
+         if (base_type(reg->type) != CONST_PTR_TO_MAP) {
+                 verbose(env, "kernel function %s map arg#%d %s reg was not type %s\n",
+                         meta->func_name, argno, ref_name, reg_type_str(env, CONST_PTR_TO_MAP));
+                 return -EINVAL;
+         }
+         return KF_ARG_PTR_TO_MAP;
+ }
+

>         if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
> -               if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
> +               if (!btf_type_is_struct(ref_t)) {
>                         verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
>                                 meta->func_name, argno, btf_type_str(ref_t), ref_tname);
>                         return -EINVAL;
>                 }
> -               return is_kfunc_arg_map(meta->btf, &args[argno]) ? KF_ARG_PTR_TO_MAP : KF_ARG_PTR_TO_BTF_ID;
> +               return KF_ARG_PTR_TO_BTF_ID;
>         }
> 
> ?
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-09 18:11       ` David Vernet
@ 2024-02-09 18:59         ` Alexei Starovoitov
  2024-02-09 19:18           ` David Vernet
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-09 18:59 UTC (permalink / raw)
  To: David Vernet
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 10:11 AM David Vernet <void@manifault.com> wrote:
> >
> > Makes sense, but then should I add the following on top:
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index e970d9fd7f32..b524dc168023 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -11088,13 +11088,16 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> >         if (is_kfunc_arg_const_str(meta->btf, &args[argno]))
> >                 return KF_ARG_PTR_TO_CONST_STR;
> >
> > +       if (is_kfunc_arg_map(meta->btf, &args[argno]))
> > +               return KF_ARG_PTR_TO_MAP;
> > +
>
> Yeah, it's probably cleaner to pull it out of that block, which is
> already a bit of a mess.
>
> Only thing is that it doesn't make sense to invoke is_kfunc_arg_map() on
> something that doesn't have base_type(reg->type) == CONST_PTR_TO_MAP
> right? We sort of had that covered in the below block because of the
> reg2btf_ids[base_type(reg->type)] check, but even then it was kind of
> sketchy because we could have base_type(reg->type) == PTR_TO_BTF_ID or
> some other base_type with a nonzero btf ID and still treat it as a
> KF_ARG_PTR_TO_MAP depending on how the kfunc was named. So maybe
> something like this would be yet another improvement on top of both
> proposals that would avoid any weird edge cases or confusion on the part
> of the kfunc author?
>
> + if (is_kfunc_arg_map(meta->btf, &args[argno])) {
> +         if (base_type(reg->type) != CONST_PTR_TO_MAP) {
> +                 verbose(env, "kernel function %s map arg#%d %s reg was not type %s\n",
> +                         meta->func_name, argno, ref_name, reg_type_str(env, CONST_PTR_TO_MAP));
> +                 return -EINVAL;
> +         }

This would be an unnecessary restriction.
We should allow this to work:

+SEC("iter.s/bpf_map")
+__success __log_level(2)
+int iter_maps(struct bpf_iter__bpf_map *ctx)
+{
+       struct bpf_map *map = ctx->map;
+
+       if (!map)
+               return 0;
+       bpf_arena_alloc_pages(map, NULL, map->max_entries, NUMA_NO_NODE, 0);
+       return 0;
+}

verifier log:
0: R1=ctx() R10=fp0
; struct bpf_map *map = ctx->map;
0: (79) r1 = *(u64 *)(r1 +8)          ; R1_w=trusted_ptr_or_null_bpf_map(id=1)
; if (map == (void *)0)
1: (15) if r1 == 0x0 goto pc+5        ; R1_w=trusted_ptr_bpf_map()
; bpf_arena_alloc_pages(map, NULL, map->max_entries, NUMA_NO_NODE, 0);
2: (61) r3 = *(u32 *)(r1 +36)         ; R1_w=trusted_ptr_bpf_map()
R3_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
; bpf_arena_alloc_pages(map, NULL, map->max_entries, NUMA_NO_NODE, 0);
3: (b7) r2 = 0                        ; R2_w=0
4: (b4) w4 = -1                       ; R4_w=0xffffffff
5: (b7) r5 = 0                        ; R5_w=0
6: (85) call bpf_arena_alloc_pages#42141      ; R0=scalar()

the following two tests fail as expected:

1.
int iter_maps(struct bpf_iter__bpf_map *ctx)
{
  struct seq_file *seq = ctx->meta->seq;
  struct bpf_map *map = ctx->map;

  bpf_arena_alloc_pages((void *)seq, NULL, map->max_entries, NUMA_NO_NODE, 0);

kernel function bpf_arena_alloc_pages args#0 expected pointer to
STRUCT bpf_map but R1 has a pointer to STRUCT seq_file

2.
  bpf_arena_alloc_pages(map->inner_map_meta, NULL, map->max_entries,
NUMA_NO_NODE, 0);

(79) r1 = *(u64 *)(r1 +8)          ; R1_w=untrusted_ptr_bpf_map()
R1 must be referenced or trusted


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *'
  2024-02-09  0:09     ` Alexei Starovoitov
@ 2024-02-09 19:09       ` Andrii Nakryiko
  2024-02-10  2:32         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Andrii Nakryiko @ 2024-02-09 19:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Thu, Feb 8, 2024 at 4:09 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 11:40 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Feb 6, 2024 at 2:04 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Recognize return of 'void *' from kfunc as returning unknown scalar.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  kernel/bpf/verifier.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index ddaf09db1175..d9c2dbb3939f 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > >                                         meta.func_name);
> > >                                 return -EFAULT;
> > >                         }
> > > +               } else if (btf_type_is_void(ptr_type)) {
> > > +                       /* kfunc returning 'void *' is equivalent to returning scalar */
> > > +                       mark_reg_unknown(env, regs, BPF_REG_0);
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > I think we should do a similar extension when passing `void *` into
> > global funcs. It's best to treat it as SCALAR instead of rejecting it
> > because we can't calculate the size. Currently users in practice just
> > have to define it as `uintptr_t` and then cast (or create static
> > wrappers doing the casting). Anyways, my point is that it makes sense
> > to treat `void *` as non-pointer.
>
> Makes sense. Will add it to my todo list.
>
> On that note I've been thinking how to get rid of __arg_arena
> that I'm adding in this series.
>
> How about the following algorithm?
> do_check_main() sees that scalar or ptr_to_arena is passed
> into global subprog that has BTF 'struct foo *'
> and today would require ptr_to_mem.
> Instead of rejecting the prog the verifier would override
> (only once and in one direction)
> that arg of that global func from ptr_to_mem into scalar.
> And will proceed as usual.
> do_check_common() of that global subprog will pick up scalar
> for that arg, since args are cached.
> And verification will proceed successfully without special __arg_arena
> .

Can we pass PTR_TO_MEM (e.g., map value pointer) to something that is
expecting PTR_TO_ARENA? Because there are a few problems with the above
algorithm, I think.

First, this check won't be just in do_check_main(), the same global
function can be called from another function.

And second, what if you have the first few calls that pass PTR_TO_MEM.
Verifier sees that, allows it, assumes global func will take
PTR_TO_MEM. Then we get to a call that passes PTR_TO_ARENA or scalar,
we change the argument expectation to be __arg_arena-like and
subsequent checks will assume arena stuff. But the first few calls
already assumed correctness based on PTR_TO_MEM.


In short, it seems like this introduces more subtlety and
potentially unexpected interactions. I don't really see explicit
__arg_arena as a bad thing; I find that explicit annotations for
"special things" help in practice because they draw attention to the
special behavior. They also allow people to ask/google more specific
questions.
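
(A minimal sketch of the ordering problem above, assuming a hypothetical
un-annotated global subprog; the arena map definition and the
bpf_arena_alloc_pages() prototype are guessed from the selftests in this
series and may not match the final code. This is meant to illustrate the
inference issue, not to load as-is.)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} array SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARENA); /* illustrative definition */
	__uint(max_entries, 1);
} arena SEC(".maps");

void *bpf_arena_alloc_pages(void *map, void *addr, __u32 page_cnt,
			    int node_id, __u64 flags) __ksym;

/* global subprog with an un-annotated pointer argument */
__noinline int touch(void *p)
{
	return p != NULL;
}

SEC("syscall")
int ordering_problem(void *ctx)
{
	__u32 key = 0;
	__u64 *val = bpf_map_lookup_elem(&array, &key);

	if (val)
		touch(val); /* first call site: checked as a mem pointer */

	/* second call site passes an arena/scalar value; inferring the arg
	 * type from call sites would have to flip it here, after the body
	 * was already verified against the first call site's assumption */
	touch(bpf_arena_alloc_pages(&arena, NULL, 1, -1 /* NUMA_NO_NODE */, 0));
	return 0;
}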


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-09 18:59         ` Alexei Starovoitov
@ 2024-02-09 19:18           ` David Vernet
  0 siblings, 0 replies; 56+ messages in thread
From: David Vernet @ 2024-02-09 19:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

On Fri, Feb 09, 2024 at 10:59:57AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 9, 2024 at 10:11 AM David Vernet <void@manifault.com> wrote:
> > >
> > > Makes sense, but then should I add the following on top:
> > >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index e970d9fd7f32..b524dc168023 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -11088,13 +11088,16 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> > >         if (is_kfunc_arg_const_str(meta->btf, &args[argno]))
> > >                 return KF_ARG_PTR_TO_CONST_STR;
> > >
> > > +       if (is_kfunc_arg_map(meta->btf, &args[argno]))
> > > +               return KF_ARG_PTR_TO_MAP;
> > > +
> >
> > Yeah, it's probably cleaner to pull it out of that block, which is
> > already a bit of a mess.
> >
> > Only thing is that it doesn't make sense to invoke is_kfunc_arg_map() on
> > something that doesn't have base_type(reg->type) == CONST_PTR_TO_MAP
> > right? We sort of had that covered in the below block because of the
> > reg2btf_ids[base_type(reg->type)] check, but even then it was kind of
> > sketchy because we could have base_type(reg->type) == PTR_TO_BTF_ID or
> > some other base_type with a nonzero btf ID and still treat it as a
> > KF_ARG_PTR_TO_MAP depending on how the kfunc was named. So maybe
> > something like this would be yet another improvement on top of both
> > proposals that would avoid any weird edge cases or confusion on the part
> > of the kfunc author?
> >
> > + if (is_kfunc_arg_map(meta->btf, &args[argno])) {
> > +         if (base_type(reg->type) != CONST_PTR_TO_MAP) {
> > +                 verbose(env, "kernel function %s map arg#%d %s reg was not type %s\n",
> > +                         meta->func_name, argno, ref_name, reg_type_str(env, CONST_PTR_TO_MAP));
> > +                 return -EINVAL;
> > +         }
> 
> This would be an unnecessary restriction.
> We should allow this to work:
> 
> +SEC("iter.s/bpf_map")
> +__success __log_level(2)
> +int iter_maps(struct bpf_iter__bpf_map *ctx)
> +{
> +       struct bpf_map *map = ctx->map;
> +
> +       if (!map)
> +               return 0;
> +       bpf_arena_alloc_pages(map, NULL, map->max_entries, NUMA_NO_NODE, 0);
> +       return 0;
> +}

Ah, I see, so this would be a PTR_TO_BTF_ID then. Fair enough, we can
leave that restriction off and rely on the check in
process_kf_arg_ptr_to_btf_id().
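
(For readers skimming the thread, roughly what the '__map' suffix buys on
the BPF program side; bpf_kfunc_with_map() is the placeholder name from
the commit message and the trimmed arena map definition is illustrative,
not the final API.)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_ARENA); /* assumes the new map type */
	__uint(max_entries, 1);
} arena SEC(".maps");

/* kfunc declared with a '__map'-suffixed 'void *' argument */
extern int bpf_kfunc_with_map(void *p__map, int val) __ksym;

SEC("syscall")
int pass_map_to_kfunc(void *ctx)
{
	/* &arena is accepted directly; no '(void *)&arena' cast needed */
	return bpf_kfunc_with_map(&arena, 1);
}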

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *'
  2024-02-09 19:09       ` Andrii Nakryiko
@ 2024-02-10  2:32         ` Alexei Starovoitov
  0 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2024-02-10  2:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 11:09 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 4:09 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Feb 8, 2024 at 11:40 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 2:04 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Recognize return of 'void *' from kfunc as returning unknown scalar.
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > ---
> > > >  kernel/bpf/verifier.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index ddaf09db1175..d9c2dbb3939f 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > >                                         meta.func_name);
> > > >                                 return -EFAULT;
> > > >                         }
> > > > +               } else if (btf_type_is_void(ptr_type)) {
> > > > +                       /* kfunc returning 'void *' is equivalent to returning scalar */
> > > > +                       mark_reg_unknown(env, regs, BPF_REG_0);
> > >
> > > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> > >
> > > I think we should do a similar extension when passing `void *` into
> > > global funcs. It's best to treat it as SCALAR instead of rejecting it
> > > because we can't calculate the size. Currently users in practice just
> > > have to define it as `uintptr_t` and then cast (or create static
> > > wrappers doing the casting). Anyways, my point is that it makes sense
> > > to treat `void *` as non-pointer.
> >
> > Makes sense. Will add it to my todo list.
> >
> > On that note I've been thinking how to get rid of __arg_arena
> > that I'm adding in this series.
> >
> > How about the following algorithm?
> > do_check_main() sees that scalar or ptr_to_arena is passed
> > into global subprog that has BTF 'struct foo *'
> > and today would require ptr_to_mem.
> > Instead of rejecting the prog the verifier would override
> > (only once and in one direction)
> > that arg of that global func from ptr_to_mem into scalar.
> > And will proceed as usual.
> > do_check_common() of that global subprog will pick up scalar
> > for that arg, since args are cached.
> > And verification will proceed successfully without special __arg_arena
> > .
>
> Can we pass PTR_TO_MEM (e.g., map value pointer) to something that is
> expecting PTR_TO_ARENA? Because there are a few problems with the above
> algorithm, I think.

Patch 10 allows only ptr_to_arena and scalar to be passed in,
but passing ptr_to_mem is safe too. It won't crash the kernel,
but it won't do what the user might expect.
Hence it's disabled.

> First, this check won't be just in do_check_main(), the same global
> function can be called from another function.

that shouldn't matter.

> And second, what if you have the first few calls that pass PTR_TO_MEM.
> Verifier sees that, allows it, assumes global func will take
> PTR_TO_MEM. Then we get to a call that passes PTR_TO_ARENA or scalar,
> we change the argument expectation to be __arg_arena-like and
> subsequent checks will assume arena stuff. But the first few calls
> already assumed correctness based on PTR_TO_MEM.

I think that would be an issue only if we called that global func
with ptr_to_mem, then went and processed the body of it
with ptr_to_mem, and later discovered another call site that
passes a scalar.
Such a bug can be accounted for.

> In short, it seems like this introduces more subtlety and
> potentially unexpected interactions. I don't really see explicit
> __arg_arena as a bad thing; I find that explicit annotations for
> "special things" help in practice because they draw attention to the
> special behavior. They also allow people to ask/google more specific
> questions.

In general I agree that __arg is a useful indication.
I've been writing arena-enabled bpf progs and found this __arg_arena
quite annoying to add when converting a static func to a global func.
I know that globals are verified differently, but it feels like
we can do better with arena pointers.

Anyway, I'll proceed with the existing __arg_arena approach and
put this "smart" detection of arena pointers on the back burner for now.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-07 21:07   ` Lorenzo Stoakes
  2024-02-07 22:56     ` Alexei Starovoitov
  2024-02-08  5:44     ` Johannes Weiner
@ 2024-02-14  8:31     ` Christoph Hellwig
  2 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2024-02-14  8:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Alexei Starovoitov, bpf, daniel, andrii, martin.lau, memxor,
	eddyz87, tj, brho, hannes, linux-mm, kernel-team, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig

On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> > memory region between bpf program and user space process.
> > It will function similar to vmalloc()/vm_map_ram():
> > - get_vm_area()
> > - alloc_pages()
> > - vmap_pages_range()
> 
> This tells me absolutely nothing about why it is justified to expose this
> internal interface. You need to put more explanation here along the lines
> of 'we had no other means of achieving what we needed from vmalloc because
> X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.
> 
> I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> for instance. We good to expose that, not only for you but for any other
> core kernel users?

And as someone who has reviewed this same thing before:

hard NAK.  We need to keep vmalloc internals internal and not start
poking holes into the abstractions after we've got them roughly into
shape.



^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2024-02-14  8:31 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-06 22:04 [PATCH bpf-next 00/16] bpf: Introduce BPF arena Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 01/16] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
2024-02-08 19:40   ` Andrii Nakryiko
2024-02-09  0:09     ` Alexei Starovoitov
2024-02-09 19:09       ` Andrii Nakryiko
2024-02-10  2:32         ` Alexei Starovoitov
2024-02-09 16:06   ` David Vernet
2024-02-06 22:04 ` [PATCH bpf-next 02/16] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
2024-02-09 16:57   ` David Vernet
2024-02-09 17:46     ` Alexei Starovoitov
2024-02-09 18:11       ` David Vernet
2024-02-09 18:59         ` Alexei Starovoitov
2024-02-09 19:18           ` David Vernet
2024-02-06 22:04 ` [PATCH bpf-next 03/16] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
2024-02-07 21:07   ` Lorenzo Stoakes
2024-02-07 22:56     ` Alexei Starovoitov
2024-02-08  5:44     ` Johannes Weiner
2024-02-08 23:55       ` Alexei Starovoitov
2024-02-09  6:36       ` Lorenzo Stoakes
2024-02-14  8:31     ` Christoph Hellwig
2024-02-06 22:04 ` [PATCH bpf-next 04/16] bpf: Introduce bpf_arena Alexei Starovoitov
2024-02-07 18:40   ` Barret Rhoden
2024-02-07 20:55     ` Alexei Starovoitov
2024-02-07 21:11       ` Barret Rhoden
2024-02-08  6:26         ` Alexei Starovoitov
2024-02-08 21:58           ` Barret Rhoden
2024-02-08 23:36             ` Alexei Starovoitov
2024-02-08 23:50               ` Barret Rhoden
2024-02-06 22:04 ` [PATCH bpf-next 05/16] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 06/16] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 07/16] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 08/16] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 09/16] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 10/16] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 11/16] libbpf: Add support for bpf_arena Alexei Starovoitov
2024-02-08  1:15   ` Andrii Nakryiko
2024-02-08  1:38     ` Alexei Starovoitov
2024-02-08 18:29       ` Andrii Nakryiko
2024-02-08 18:45         ` Alexei Starovoitov
2024-02-08 18:54           ` Andrii Nakryiko
2024-02-08 18:59             ` Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 12/16] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
2024-02-08  1:16   ` Andrii Nakryiko
2024-02-08  1:58     ` Alexei Starovoitov
2024-02-08 18:16       ` Andrii Nakryiko
2024-02-06 22:04 ` [PATCH bpf-next 13/16] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 14/16] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
2024-02-06 22:04 ` [PATCH bpf-next 15/16] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
2024-02-07 17:04   ` Eduard Zingerman
2024-02-08  2:59     ` Alexei Starovoitov
2024-02-08 11:10       ` Jose E. Marchesi
2024-02-06 22:04 ` [PATCH bpf-next 16/16] selftests/bpf: Add bpf_arena_htab test Alexei Starovoitov
2024-02-07 12:34 ` [PATCH bpf-next 00/16] bpf: Introduce BPF arena Donald Hunter
2024-02-07 13:33   ` Barret Rhoden
2024-02-07 20:16     ` Alexei Starovoitov
2024-02-07 20:12   ` Alexei Starovoitov
