From: Alexei Starovoitov
Date: Wed, 7 Feb 2024 12:55:14 -0800
Subject: Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
To: Barret Rhoden
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner, linux-mm, Kernel Team
On Wed, Feb 7, 2024 at 10:40 AM Barret Rhoden wrote:
>
> On 2/6/24 17:04, Alexei Starovoitov wrote:
> > +
> > +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> > +{
> > +        return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> > +}
> > +
> > +#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
> > +
> > +/*
> > + * Reserve a "zero page", so that bpf prog and user space never see
> > + * a pointer to arena with lower 32 bits being zero.
> > + * bpf_cast_user() promotes it to full 64-bit NULL.
> > + */
> > +static int reserve_zero_page(struct bpf_arena *arena)
> > +{
> > +        long pgoff = compute_pgoff(arena, 0);
> > +
> > +        return mtree_insert(&arena->mt, pgoff, MT_ENTRY, GFP_KERNEL);
> > +}
> > +
>
> this is pretty tricky, and i think i didn't understand it at first.
>
> you're punching a hole in the arena, such that BPF won't allocate it via
> arena_alloc_pages(). thus BPF won't 'produce' an object with a pointer
> ending in 0x00000000.
>
> depending on where userspace mmaps the arena, that hole may or may not
> be the first page in the array. if userspace mmaps it to a 4GB aligned
> virtual address, it'll be page 0. but it could be at some arbitrary
> offset within the 4GB arena.
>
> that arbitrariness makes it harder for a BPF program to do its own
> allocations within the arena. i'm planning on carving up the 4GB arena
> for my own purposes, managed by BPF, with the expectation that i'll be
> able to allocate any 'virtual address' within the arena. but there's a
> magic page that won't be usable.
>
> i can certainly live with this. just mmap userspace to a 4GB aligned
> address + PGSIZE, so that the last page in the arena is page 0. but
> it's a little weird.

Agree. I made the same conclusion while adding global variables to the arena.
From the compiler point of view all such global vars start at offset zero
and there is no way to just "move them up by a page".
For example in C code it will look like:

int __arena var1;
int __arena var2;
&var1 == user_vm_start
&var2 == user_vm_start + 4

If __ulong(map_extra, ...) was used, or mmap(addr, MAP_FIXED) was 4Gb aligned,
the lower 32 bits of &var1's address will be zero and there is not much
we can do about it.
We can tell LLVM to emit extra 8 byte padding in the arena section,
but it will be a useless pad if the arena is not aligned to 4Gb.

Anyway, in the v2 I will remove this reserve_zero_page() logic.
It's causing more harm than good.

> though i think we'll have more serious issues if anyone accidentally
> tries to use that zero page. BPF would get an EEXIST if they try to
> allocate it directly, but then page fault and die if they touched it,
> since there's no page. i can live with that, if we force it to be the
> last page in the arena.
>
> however, i think you need to add something to the fault handler (below)
> in case userspace touches that page:
>
> [snip]
>
> > +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> > +{
> > +        struct bpf_map *map = vmf->vma->vm_file->private_data;
> > +        struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> > +        struct page *page;
> > +        long kbase, kaddr;
> > +        int ret;
> > +
> > +        kbase = bpf_arena_get_kern_vm_start(arena);
> > +        kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> > +
> > +        guard(mutex)(&arena->lock);
> > +        page = vmalloc_to_page((void *)kaddr);
> > +        if (page)
> > +                /* already have a page vmap-ed */
> > +                goto out;
> > +
> > +        if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> > +                /* User space requested to segfault when page is not allocated by bpf prog */
> > +                return VM_FAULT_SIGSEGV;
> > +
> > +        ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> > +        if (ret == -EEXIST)
> > +                return VM_FAULT_RETRY;
>
> say this was the zero page.
> vmalloc_to_page() failed, so we tried to
> insert. we get EEXIST, since the slot is reserved. we retry, since we
> were expecting the case where "no page, yet slot reserved" meant that
> BPF was in the middle of filling this page.

Yes. Great catch!
I hit that too while playing with global vars.

> though i think you can fix this by just treating this as a SIGSEGV
> instead of RETRY.

Agree.

> when i made the original suggestion of making this a
> retry (in an email off list), that was before you had the arena mutex.
> now that you have the mutex, you shouldn't have the scenario where two
> threads are concurrently trying to fill a page. i.e. mtree_insert +
> page_alloc + vmap are all atomic w.r.t. the mutex.

yes. mutex part makes sense.

> > +
> > +        if (!arena->user_vm_start) {
> > +                arena->user_vm_start = vma->vm_start;
> > +                err = reserve_zero_page(arena);
> > +                if (err)
> > +                        return err;
> > +        }
> > +        arena->user_vm_end = vma->vm_end;
> > +        /*
> > +         * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> > +         * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> > +         * potential change of user_vm_start.
> > +         */
> > +        vm_flags_set(vma, VM_DONTEXPAND);
> > +        vma->vm_ops = &arena_vm_ops;
> > +        return 0;
> > +}
>
> i think this whole function needs to be protected by the mutex, or at
> least all the stuff related to user_vm_{start,end}. if you have two
> threads mmapping the region for the first time, you'll race on the
> values of user_vm_*.

yes. will add a mutex guard.

> [snip]
>
> > +/*
> > + * Allocate pages and vmap them into kernel vmalloc area.
> > + * Later the pages will be mmaped into user space vma.
> > + */
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
>
> instead of uaddr, can you change this to take an address relative to the
> arena ("arena virtual address"?)? the caller of this is in BPF, and
> they don't easily know the user virtual address. maybe even just pgoff
> directly.
I thought about it, but it doesn't quite make sense.
bpf prog only sees user addresses. All loads/stores return them.
If it bpf_printk-s an address, it will be a user address.
bpf_arena_alloc_pages() also returns a user address.
Kernel addresses are not seen by bpf prog at all.
kern_vm_base is completely hidden. Only at JIT time is it added to pointers.
So passing uaddr to arena_alloc_pages() matches mmap style:

uaddr = bpf_arena_alloc_pages(... uaddr ...)
uaddr = mmap(uaddr, ... MAP_FIXED)

Passing pgoff would be weird.
Also note that there is no extra flag for bpf_arena_alloc_pages().
uaddr == full 64 bits of zeros is not a valid addr to use.

> additionally, you won't need to call compute_pgoff(). as it is now, i'm
> not sure what would happen if BPF did an arena_alloc with a uaddr and
> user_vm_start wasn't set yet.

That's impossible.
bpf prog won't load unless the arena map was created
either with fixed map_extra == user_vm_start,
or with map_extra == 0 and then mmaped.
Only then can a bpf prog that uses that arena be loaded.

> > +{
> > +        long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> any time you compute_pgoff() or look at user_vm_{start,end}, maybe
> either hold the mutex, or only do it from mmap faults (where we know
> user_vm_start is already set). o/w there might be subtle races where
> some other thread is mmapping the arena for the first time.

That's unnecessary, since user_vm_start is fixed for the lifetime of the bpf prog.
But you spotted a bug. I need to set user_vm_end in arena_map_alloc()
when map_extra is specified. Otherwise between arena create and mmap
the bpf prog will see a different page_cnt_max.
user_vm_start will be the same, of course.

> > +
> > +        kaddr = kern_vm_start + (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
>
> adding user_vm_start here is pretty subtle.
>
> so far i've been thinking that the mtree is the "address space" of the
and pgoff is an address > within the arena. so mtree slot 0 is the 0th page of the 4GB region. > and that "arena address space" is mapped at a kernel virtual address and > a user virtual address (the same for all processes). > > to convert user addresses (uaddr et al.) to arena addresses, we subtract > user_vm_start, which makes sense. that's what compute_pgoff() does. > > i was expecting kaddr =3D kern_vm_start + pgoff * PGSIZE, essentially > converting from arena address space to kernel virtual address. > > instead, by adding user_vm_start and casting to u32, you're converting > or shifting arena addresses to *another* arena address (user address, > truncated to 4GB to keep it in the arena), and that is the one that the > kernel will use. > > is that correct? Pretty much. kern and user have to see lower 32-bit exactly the same. >From user pov allocation starts at pgoff=3D0 which is the first page after user_vm_start. Which is normal mmap behavior and in-kernel vma convention. Hence in arena_alloc_pages() does the same pgoff=3D0 is the first page from user vma pov. The math to compute kernel address gets complicated because two bases need to be added. Both kernel and user bases are different 64-bit bases and may not be aligned to 4G. > my one concern is that there's some subtle wrap-around going on, and due > to the shifting, kaddr can be very close to the end of the arena and > page_cnt can be big enough to go outside the 4GB range. we'd want it to page_cnt cannot go outside of 4gb range. page_cnt is a number of pages in arena->user_vm_end - arena->user_vm_start and during mmap we check that it's <=3D 4gb. > wrap around. e.g. > > user_start_va =3D 0x1,fffff000 > user_end_va =3D 0x2,fffff000 > page_cnt_max =3D 0x100000 or whatever. full 4GB range. > > say we want to alloc at pgoff=3D0 (uaddr 0x1,fffff000), page_cnt =3D X. = you > can get this pgoff either by doing mtree_insert_range or > mtree_alloc_range on an arena with no allocations. 
>
> kaddr = kern_vm_start + 0xfffff000
>
> the size of the vm area is 4GB + guard stuff, and we're right up against
> the end of it.
>
> if page_cnt > the guard size, vmap_pages_range() would be called on
> something outside the vm area we reserved, which seems bad. and even if
> it wasn't, what we want is for later page maps to start at the beginning
> of kern_vm_start.
>
> the fix might be to just map a page at a time - maybe a loop. or
> detect when we're close to the edge and break it into two vmaps. i feel
> like the loop would be easier to understand, but maybe less efficient.

Oops. You're correct. Great catch.
In earlier versions I had it as a loop,
but then I decided that doing all mapping ops page at a time
is not efficient. Oh well. Will fix.

Thanks a lot for the review!