From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Barret Rhoden <brho@google.com>
Cc: bpf <bpf@vger.kernel.org>, Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@kernel.org>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Eddy Z <eddyz87@gmail.com>, Tejun Heo <tj@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-mm <linux-mm@kvack.org>, Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
Date: Thu, 8 Feb 2024 15:36:53 -0800
Message-ID: <CAADnVQJsbZeJCmyQbL-CAX7b4KgBtw_carPihOV_tG7nna=W4Q@mail.gmail.com>
In-Reply-To: <b1fe20c8-cd97-4ffc-8043-7fe42bf18c77@google.com>

On Thu, Feb 8, 2024 at 1:58 PM Barret Rhoden <brho@google.com> wrote:
>
> On 2/8/24 01:26, Alexei Starovoitov wrote:
> > Also I believe I addressed all issues with missing mutex and wrap around,
> > and pushed to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=e1cb522fee661e7346e8be567eade9cf607eaf11
> > Please take a look.
>
> LGTM, thanks.
>
> minor things:
>
> > +static void arena_vm_close(struct vm_area_struct *vma)
> > +{
> > +     struct vma_list *vml;
> > +
> > +     vml = vma->vm_private_data;
> > +     list_del(&vml->head);
> > +     vma->vm_private_data = NULL;
> > +     kfree(vml);
> > +}
>
> I think this also needs to be protected by the arena mutex.  Otherwise
> two VMAs that close at the same time can corrupt the arena vma_list,
> or a VMA could close while you're zapping.

Excellent catch.
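Something like this should do it, assuming the arena is reachable via
vma->vm_file->private_data as elsewhere in the patch (a sketch; field
names approximate):

static void arena_vm_close(struct vm_area_struct *vma)
{
        struct bpf_map *map = vma->vm_file->private_data;
        struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
        struct vma_list *vml = vma->vm_private_data;

        mutex_lock(&arena->lock);       /* serialize with remember_vma() and zapping */
        list_del(&vml->head);
        vma->vm_private_data = NULL;
        mutex_unlock(&arena->lock);
        kfree(vml);
}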

> remember_vma() already has the mutex held, since it's called from mmap.
>
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> > +{
> > +     long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> this function and arena_free_pages() both use user_vm_start/end
> before grabbing the mutex, so they need to grab the mutex very early.
>
> Alternatively, you could make it so that the user must set
> user_vm_start via map_extra, so you don't have to worry about these
> changing after the arena is created.

Looks like I lost the diff hunk where the verifier checks that
the arena has user_vm_start set before loading the prog.
And for some reason I forgot to remove the
if (!arena->user_vm_start) return..
in bpf_arena_alloc/free_page().
I'll remove the latter and add the verifier enforcement back.
The intent was to never call arena_alloc/free_pages() when the arena
is not fully formed.
Once that's fixed there will be no race in arena_alloc_pages():
user_vm_start/end are fixed before the program is loaded.
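The enforcement is basically a one-liner at prog load time; a sketch
(the helper name and exact placement in the verifier are illustrative):

/* when the prog references an arena map */
if (map->map_type == BPF_MAP_TYPE_ARENA &&
    !bpf_arena_get_user_vm_start(map)) {  /* illustrative helper name */
        verbose(env, "arena's user address must be set via map_extra\n");
        return -EINVAL;
}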

One more thing.
The vmap_pages_range_wrap32() fix that you saw in that commit is
not enough.
It turns out that [%r12 + src_reg + off] in the JITed asm doesn't
fully conform to "the kernel bounds all access into 32-bit".
The "+ off" part is added _after_ src_reg is bounded to 32-bit.
Remember, that was the reason we added guard pages before and after
the kernel's 4GB VM area.
It's working as intended, but for this wrap32 case we need to
map one page into the normal kernel VMA _and_ into the guard page.
Consider your example:
user_start_va = 0x1,fffff000
user_end_va =   0x2,fffff000

pgoff = 0 corresponds to uaddr 0x1,fffff000,
i.e. kaddr = kern_vm_start + 0xfffff000,
and kaddr + PAGE_SIZE wraps to kern_vm_start + 0.

When a bpf prog accesses an arena pointer it can do:
dst_reg = *(u64 *)(src_reg + 0)
and
dst_reg = *(u64 *)(src_reg + 4096)

The first LDX is fine, but the second will fault
when src_reg is 0xfffff000.
From the user space pov it's a virtually contiguous address range.
For the bpf prog it's also contiguous as long as src_reg is 32-bit
bounded, but "+ 4096" breaks that.
The second load becomes:
kern_vm_start + 0xfffff000 + 4096
and it faults.
Theoretically, a solution is to compute:
kern_vm_start + (u32)(0xfffff000 + 4096)
in the JIT, but that is too expensive.
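To see it concretely in plain C (just illustrating the arithmetic;
the kern_vm_start value is made up):

u64 kern_vm_start = 0xffffc90000000000ull; /* any 4GB-aligned base */
u64 src = 0xfffff000ull;                   /* 32-bit bounded src_reg */

u64 bad  = kern_vm_start + src + 4096;        /* == kern_vm_start + 0x100000000, past the 4GB area */
u64 good = kern_vm_start + (u32)(src + 4096); /* == kern_vm_start + 0, wraps back as intended */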

Hence I went with an arena-side fix (ignore the lack of error checking):
static int vunmap_guard_pages(u64 kern_vm_start, u64 start, u64 end)
{
        end = (u32)end;
        /* low offsets are duplicated above the 4GB boundary,
         * since "+ off" can reach up to S16_MAX past it
         */
        if (start < S16_MAX) {
                u64 end1 = min(end, S16_MAX + 1);

                vunmap_range(kern_vm_start + (1ull << 32) + start,
                             kern_vm_start + (1ull << 32) + end1);
        }

        /* high offsets are duplicated just below kern_vm_start,
         * since negative "+ off" can reach down to S16_MIN
         */
        if (end >= U32_MAX - S16_MAX + 1) {
                u64 start2 = max(start, U32_MAX - S16_MAX + 1);

                vunmap_range(kern_vm_start - (1ull << 32) + start2,
                             kern_vm_start - (1ull << 32) + end);
        }
        return 0;
}
static int vmap_pages_range_wrap32(u64 kern_vm_start, u64 uaddr, u64 page_cnt,
                                   struct page **pages)
{
        u64 start = kern_vm_start + uaddr;
        u64 end = start + page_cnt * PAGE_SIZE;
        u64 part1_page_cnt, start2, end2;
        int ret;

        if (page_cnt == 1 || !((uaddr + page_cnt * PAGE_SIZE) >> 32)) {
                /* uaddr doesn't overflow in 32-bit */
                ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
                if (ret)
                        return ret;
                vmap_guard_pages(kern_vm_start, uaddr,
                                 uaddr + page_cnt * PAGE_SIZE, pages);
                return 0;
        }

        /* the range wraps in 32-bit: map the part up to the 4GB boundary... */
        part1_page_cnt = ((1ull << 32) - (u32)uaddr) >> PAGE_SHIFT;
        end = start + part1_page_cnt * PAGE_SIZE;
        ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
        if (ret)
                return ret;

        vmap_guard_pages(kern_vm_start, uaddr,
                         uaddr + part1_page_cnt * PAGE_SIZE, pages);

        /* ...and the rest starting over from kern_vm_start */
        start2 = kern_vm_start;
        end2 = start2 + (page_cnt - part1_page_cnt) * PAGE_SIZE;
        ret = vmap_pages_range(start2, end2, PAGE_KERNEL,
                               &pages[part1_page_cnt], PAGE_SHIFT);
        if (ret) {
                vunmap_range(start, end);
                return ret;
        }

        vmap_guard_pages(kern_vm_start, 0,
                         (page_cnt - part1_page_cnt) * PAGE_SIZE,
                         pages + part1_page_cnt);
        return 0;
}
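vmap_guard_pages() used above is the mapping counterpart of
vunmap_guard_pages(); roughly something like this (same caveat about
error checking; the page-array indexing is approximate):

static int vmap_guard_pages(u64 kern_vm_start, u64 start, u64 end,
                            struct page **pages)
{
        int ret = 0;

        end = (u32)end;
        if (start < S16_MAX) {
                u64 end1 = min(end, S16_MAX + 1);

                /* duplicate the first pages above the 4GB boundary */
                ret = vmap_pages_range(kern_vm_start + (1ull << 32) + start,
                                       kern_vm_start + (1ull << 32) + end1,
                                       PAGE_KERNEL, pages, PAGE_SHIFT);
                if (ret)
                        return ret;
        }

        if (end >= U32_MAX - S16_MAX + 1) {
                u64 start2 = max(start, U32_MAX - S16_MAX + 1);

                /* duplicate the last pages just below kern_vm_start */
                ret = vmap_pages_range(kern_vm_start - (1ull << 32) + start2,
                                       kern_vm_start - (1ull << 32) + end,
                                       PAGE_KERNEL,
                                       &pages[(start2 - start) >> PAGE_SHIFT],
                                       PAGE_SHIFT);
        }
        return ret;
}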

It's working, but too complicated.
Instead of a single vmap_pages_range() we may need up to 4 calls,
and we have to map certain pages into two places so that both 64-bit
virtual addresses:
kern_vm_start + 0xfffff000 + 4096
and
kern_vm_start + (u32)(0xfffff000 + 4096)
point to the same page.

I'm inclined to tackle the wrap32 issue differently and simply
disallow [user_vm_start, user_vm_end] combinations where the lower
32 bits can wrap.

In other words, mmap() of len=4GB will be aligned to 4GB,
while mmap() of len=1MB will be offset in such a way
that both addr and addr+1MB have the same upper 32 bits
(which is not the same as 1MB aligned).

With that I will remove vmap_pages_range_wrap32() and
do a single normal vmap_pages_range() without extra tricks.
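Concretely, the arena's get_unmapped_area callback could over-allocate
and round; a rough sketch of the idea (not tested):

/* ask for 2x the length so there is room to shift the result
 * until [addr, addr + len) keeps the same upper 32 bits
 */
ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
if (IS_ERR_VALUE(ret))
        return ret;
if ((ret >> 32) == ((ret + len - 1) >> 32))
        return ret;     /* lower 32-bit doesn't wrap, use as-is */
/* the 4GB boundary falls inside [ret, ret + len), so the rounded
 * range still fits within the 2x area we reserved
 */
return round_up(ret, 1ull << 32);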

wdyt?

