From: Andrii Nakryiko
Date: Mon, 11 Mar 2024 15:01:23 -0700
Subject: Re: [PATCH v3 bpf-next 01/14] bpf: Introduce bpf_arena.
To: Alexei Starovoitov
Cc: bpf@vger.kernel.org, daniel@iogearbox.net, andrii@kernel.org, torvalds@linux-foundation.org, brho@google.com, hannes@cmpxchg.org, akpm@linux-foundation.org, urezki@gmail.com, hch@infradead.org, linux-mm@kvack.org, kernel-team@fb.com
In-Reply-To: <20240308010812.89848-2-alexei.starovoitov@gmail.com>
References: <20240308010812.89848-1-alexei.starovoitov@gmail.com> <20240308010812.89848-2-alexei.starovoitov@gmail.com>

On Thu, Mar 7, 2024 at 5:08 PM Alexei Starovoitov wrote:
>
> From: Alexei Starovoitov
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
>    anonymous region, like memcached or any key/value storage. The bpf
>    program implements an in-kernel accelerator. XDP prog can search for
>    a key in bpf_arena and return a value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash
>    tables, rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>    The user space may mmap it, but bpf program will not convert pointers
>    to user base at run-time to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space
> can fault in pages within the range. While servicing a page fault,
> bpf_arena logic will insert a new page into the kernel and user vmas. The
> bpf program can allocate pages from that region via
> bpf_arena_alloc_pages(). This kernel function will insert pages into the
> kernel vm_area. The subsequent fault-in from user space will populate that
> page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
> can be used to prevent fault-in from user space. In such a case, if a page
> is not allocated by the bpf program and not present in the kernel vm_area,
> the user process will segfault. This is useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
> pages and removes the range from the kernel vm_area and from user process
> vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
> bpf program is more important than ease of sharing with user space. This is
> use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
> It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
> instruction as a 32-bit move wX = wY, which will improve bpf prog
> performance. Otherwise, bpf_arena_cast_user is translated by JIT to
> conditionally add the upper 32 bits of user vm_start (if the pointer is not
> NULL) to arena pointers before they are stored into memory. This way, user
> space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/84410 enables LLVM BPF
> backend generate the bpf_addr_space_cast() instruction to cast pointers
> between address_space(1) which is reserved for bpf_arena pointers and
> default address space zero. All arena pointers in a bpf program written in
> C language are tagged as __attribute__((address_space(1))). Hence, clang
> provides helpful diagnostics when pointers cross address space. Libbpf and
> the kernel support only address_space == 1. All other address space
> identifiers are reserved.
>
> rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
> verifier that rX->type = PTR_TO_ARENA. Any further operations on
> PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
> mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
> them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
> to copy_from_kernel_nofault() except that no address checks are necessary.
> The address is guaranteed to be in the 4GB range. If the page is not
> present, the destination register is zeroed on read, and the operation is
> ignored on write.
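As a side note for folks reading along without the LLVM context, this is
roughly what the C side of this looks like (just a sketch; the __arena
macro is a local shorthand I'm using here, not something this patch
defines):

#define __arena __attribute__((address_space(1)))

int __arena *counter;   /* a pointer into the arena */

static void bump(void)
{
        /* accesses through an __arena pointer become PROBE_MEM32
         * loads/stores once the verifier has seen the compiler-inserted
         * bpf_addr_space_cast()
         */
        if (counter)
                (*counter)++;
}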
>
> rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
> unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
> verifier converts such cast instructions to mov32. Otherwise, JIT will emit
> native code equivalent to:
> rX = (u32)rY;
> if (rY)
>   rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list
> built by a bpf program can be walked natively by user space.
>
> Reviewed-by: Barret Rhoden
> Signed-off-by: Alexei Starovoitov
> ---
>  include/linux/bpf.h            |   7 +-
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |  10 +
>  kernel/bpf/Makefile            |   3 +
>  kernel/bpf/arena.c             | 558 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c              |  11 +
>  kernel/bpf/syscall.c           |  36 +++
>  kernel/bpf/verifier.c          |   1 +
>  tools/include/uapi/linux/bpf.h |  10 +
>  9 files changed, 635 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>

[...]

>
>  struct bpf_offload_dev;
> @@ -2215,6 +2216,8 @@ int generic_map_delete_batch(struct bpf_map *map,
>  struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
>  struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
>
> +int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,

nit: you use more meaningful node_id in arena_alloc_pages(), here "nid"
was a big mystery when looking at just the function definition

> +                       unsigned long nr_pages, struct page **page_array);
>  #ifdef CONFIG_MEMCG_KMEM
>  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
>                            int node);

[...]

> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> +{
> +       return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
> +}
> +
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> +{
> +       return arena ? arena->user_vm_start : 0;
> +}
> +

is it anticipated that these helpers can be called with NULL? I might
see this later in the patch set, but if not, these NULL checks would
be best removed to not create wrong expectations.

> +static long arena_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_pop_elem(struct bpf_map *map, void *value)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_delete_elem(struct bpf_map *map, void *value)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +       return -EOPNOTSUPP;
> +}
> +

This is a separate topic, but I'll just mention it here. It was always
confusing to me why we don't just treat all these callbacks as
optional and return -EOPNOTSUPP in generic map code. Unless I miss
something subtle, we should do a round of clean ups and remove dozens
of unnecessary single line callbacks like these throughout the entire
BPF kernel code.
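Roughly what I mean, as a sketch (the wrapper name below is made up, it
is not an existing function; the real cleanup would live in the generic
map syscall/helper paths):

static long map_push_elem_checked(struct bpf_map *map, void *value, u64 flags)
{
        /* a missing callback simply means the operation is not supported
         * by this map type, so no per-map stubs would be needed
         */
        if (!map->ops->map_push_elem)
                return -EOPNOTSUPP;
        return map->ops->map_push_elem(map, value, flags);
}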
> +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> +{
> +       return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> +}
> +

[...]

> +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> +{
> +       struct bpf_map *map = vmf->vma->vm_file->private_data;
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +       struct page *page;
> +       long kbase, kaddr;
> +       int ret;
> +
> +       kbase = bpf_arena_get_kern_vm_start(arena);
> +       kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> +
> +       guard(mutex)(&arena->lock);
> +       page = vmalloc_to_page((void *)kaddr);
> +       if (page)
> +               /* already have a page vmap-ed */
> +               goto out;
> +
> +       if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> +               /* User space requested to segfault when page is not allocated by bpf prog */
> +               return VM_FAULT_SIGSEGV;
> +
> +       ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> +       if (ret)
> +               return VM_FAULT_SIGSEGV;
> +
> +       /* Account into memcg of the process that created bpf_arena */
> +       ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);

any specific reason to not take into account map->numa_node here?

> +       if (ret) {
> +               mtree_erase(&arena->mt, vmf->pgoff);
> +               return VM_FAULT_SIGSEGV;
> +       }
> +
> +       ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
> +       if (ret) {
> +               mtree_erase(&arena->mt, vmf->pgoff);
> +               __free_page(page);
> +               return VM_FAULT_SIGSEGV;
> +       }
> +out:
> +       page_ref_add(page, 1);
> +       vmf->page = page;
> +       return 0;
> +}
> +

[...]

> +/*
> + * Allocate pages and vmap them into kernel vmalloc area.
> + * Later the pages will be mmaped into user space vma.
> + */
> +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> +{
> +       /* user_vm_end/start are fixed before bpf prog runs */
> +       long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> +       u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> +       struct page **pages;
> +       long pgoff = 0;
> +       u32 uaddr32;
> +       int ret, i;
> +
> +       if (page_cnt > page_cnt_max)
> +               return 0;
> +
> +       if (uaddr) {
> +               if (uaddr & ~PAGE_MASK)
> +                       return 0;
> +               pgoff = compute_pgoff(arena, uaddr);
> +               if (pgoff + page_cnt > page_cnt_max)

As I mentioned offline, is this guaranteed to not overflow? It's not
obvious because, at least according to all the types (longs), uaddr can
be arbitrary, so pgoff can be quite large, etc. Might be worthwhile
rewriting as `pgoff > page_cnt_max - page_cnt` or something, just to
make it clear in code it has no chance of overflowing.

> +                       /* requested address will be outside of user VMA */
> +                       return 0;
> +       }
> +
> +       /* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
> +       pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> +       if (!pages)
> +               return 0;
> +
> +       guard(mutex)(&arena->lock);
> +
> +       if (uaddr)
> +               ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt - 1,
> +                                        MT_ENTRY, GFP_KERNEL);
> +       else
> +               ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
> +                                       page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);

mtree_alloc_range() is lacking documentation, unfortunately, so it's
not clear to me whether max should be just `page_cnt_max - 1` as you
have or `page_cnt_max - page_cnt`. There is a "Test a single entry" in
lib/test_maple_tree.c where min == max and size == 4096 which is
expected to work, so I have a feeling that the correct max should be
up to the maximum possible beginning of range, but I might be
mistaken. Can you please double check?
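Just to spell the two readings out concretely (not asserting which one
matches the actual maple tree behavior, this is exactly the thing to
double check):

        /* reading 1: 'max' bounds the last index of the allocated range,
         * in which case the current call is fine
         */
        ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
                                page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);

        /* reading 2: 'max' bounds the start of the range, in which case
         * the limit would have to leave room for page_cnt entries
         */
        ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
                                page_cnt, 0, page_cnt_max - page_cnt, GFP_KERNEL);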
> +       if (ret)
> +               goto out_free_pages;
> +
> +       ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
> +                                 node_id, page_cnt, pages);
> +       if (ret)
> +               goto out;
> +
> +       uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> +       /* Earlier checks make sure that uaddr32 + page_cnt * PAGE_SIZE will not overflow 32-bit */

we checked that `uaddr32 + page_cnt * PAGE_SIZE - 1` won't overflow,
full page_cnt * PAGE_SIZE can actually overflow, so the comment is a bit
imprecise. But it's not really clear why it matters here, tbh.
kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE can actually update
upper 32-bits of the kernel-side memory address, is that a problem?

> +       ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
> +                               kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
> +       if (ret) {
> +               for (i = 0; i < page_cnt; i++)
> +                       __free_page(pages[i]);
> +               goto out;
> +       }
> +       kvfree(pages);
> +       return clear_lo32(arena->user_vm_start) + uaddr32;
> +out:
> +       mtree_erase(&arena->mt, pgoff);
> +out_free_pages:
> +       kvfree(pages);
> +       return 0;
> +}
> +
> +/*
> + * If page is present in vmalloc area, unmap it from vmalloc area,
> + * unmap it from all user space vma-s,
> + * and free it.
> + */
> +static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +       struct vma_list *vml;
> +
> +       list_for_each_entry(vml, &arena->vma_list, head)
> +               zap_page_range_single(vml->vma, uaddr,
> +                                     PAGE_SIZE * page_cnt, NULL);
> +}
> +
> +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +       u64 full_uaddr, uaddr_end;
> +       long kaddr, pgoff, i;
> +       struct page *page;
> +
> +       /* only aligned lower 32-bit are relevant */
> +       uaddr = (u32)uaddr;
> +       uaddr &= PAGE_MASK;
> +       full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
> +       uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
> +       if (full_uaddr >= uaddr_end)
> +               return;
> +
> +       page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
> +
> +       guard(mutex)(&arena->lock);
> +
> +       pgoff = compute_pgoff(arena, uaddr);
> +       /* clear range */
> +       mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt - 1, NULL, GFP_KERNEL);
> +
> +       if (page_cnt > 1)
> +               /* bulk zap if multiple pages being freed */
> +               zap_pages(arena, full_uaddr, page_cnt);
> +
> +       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> +       for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> +               page = vmalloc_to_page((void *)kaddr);
> +               if (!page)
> +                       continue;
> +               if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
> +                       zap_pages(arena, full_uaddr, 1);

The way you split these zap_pages for page_cnt == 1 and page_cnt > 1
is quite confusing. Why can't you just unconditionally zap_pages()
regardless of page_cnt before this loop? And why for page_cnt == 1 we
have the `page_mapped(page)` check, but it's ok to not check this for
the page_cnt > 1 case? This asymmetric handling is confusing and
suggests something more is going on here. Or am I overthinking it?
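Something like this is what I was imagining, i.e. zap the whole range
once up front (rough, untested sketch, only to illustrate the shape):

        zap_pages(arena, full_uaddr, page_cnt);

        kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
        for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE) {
                page = vmalloc_to_page((void *)kaddr);
                if (!page)
                        continue;
                vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
                __free_page(page);
        }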
> +               vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
> +               __free_page(page);

can something else in the kernel somehow get a refcnt on this page?
I.e., is it ok to unconditionally free the page here instead of some
sort of put_page()?

> +       }
> +}
> +
> +__bpf_kfunc_start_defs();
> +

[...]