From: Alexei Starovoitov
Date: Thu, 8 Feb 2024 15:36:53 -0800
Subject: Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.
To: Barret Rhoden
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
 Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Johannes Weiner,
 linux-mm, Kernel Team
References: <20240206220441.38311-1-alexei.starovoitov@gmail.com>
 <20240206220441.38311-5-alexei.starovoitov@gmail.com>
On Thu, Feb 8, 2024 at 1:58 PM Barret Rhoden wrote:
>
> On 2/8/24 01:26, Alexei Starovoitov wrote:
> > Also I believe I addressed all issues with the missing mutex and
> > wraparound, and pushed to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=e1cb522fee661e7346e8be567eade9cf607eaf11
> > Please take a look.
>
> LGTM, thanks.
>
> minor things:
>
> > +static void arena_vm_close(struct vm_area_struct *vma)
> > +{
> > +       struct vma_list *vml;
> > +
> > +       vml = vma->vm_private_data;
> > +       list_del(&vml->head);
> > +       vma->vm_private_data = NULL;
> > +       kfree(vml);
> > +}
>
> I think this also needs to be protected by the arena mutex. Otherwise
> two VMAs that close at the same time can corrupt the arena vma_list,
> or a VMA can close while you're zapping.

Excellent catch.

> remember_vma() already has the mutex held, since it's called from mmap.
>
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> > +{
> > +       long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> This function and arena_free_pages() are both using user_vm_start/end
> before grabbing the mutex, so they need to grab the mutex very early.
>
> Alternatively, you could make it so that the user must set
> user_vm_start via map_extra, so you don't have to worry about these
> changing after the arena is created.

Looks like I lost the diff hunk where the verifier checks that the arena
has user_vm_start set before loading the prog, and for some reason I
forgot to remove the
  if (!arena->user_vm_start) return ...
checks in bpf_arena_alloc/free_page().
I'll remove the latter and add the verifier enforcement back.
The intent was to never call arena_alloc/free_pages when the arena is
not fully formed. Once that's fixed there will be no race in
arena_alloc_pages(): user_vm_start/end cannot change after the program
is loaded.

One more thing.
The vmap_pages_range_wrap32() fix that you saw in that commit is not
enough. It turns out that [%r12 + src_reg + off] in JIT asm doesn't
fully conform to "the kernel bounds all access into 32-bit": the "+ off"
part is added _after_ src_reg is bounded to 32-bit.
Remember, that was the reason we added guard pages before and after the
kernel 4Gb vm area. They work as intended, but for this wrap32 case we
need to map one page into the normal kernel vma _and_ into the guard page.

Consider your example:
user_start_va = 0x1,fffff000
user_end_va   = 0x2,fffff000
pgoff = 0 corresponds to uaddr 0x1,fffff000. Its kaddr is
kern_vm_start + 0xfffff000, and kaddr + PAGE_SIZE wraps to
kern_vm_start + 0.

When a bpf prog accesses an arena pointer it can do:
  dst_reg = *(u64 *)(src_reg + 0)
and
  dst_reg = *(u64 *)(src_reg + 4096)
The first LDX is fine, but the 2nd will fault when src_reg is fffff000.
From user space's pov it's a virtually contiguous address range.
For the bpf prog it's also contiguous as long as src_reg is 32-bit
bounded, but "+ 4096" breaks that. The 2nd load becomes:
  kern_vm_start + 0xfffff000 + 4096
and it faults.
Theoretically a solution is to do:
  kern_vm_start + (u32)(0xfffff000 + 4096)
in the JIT, but that is too expensive.
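To make the arithmetic concrete, here is a tiny standalone demo of the
two computations (plain userspace C, not kernel code; the kern_vm_start
value is made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t kern_vm_start = 0xffffc90000000000ull; /* hypothetical base of the 4Gb area */
	const uint32_t src_reg = 0xfffff000u; /* arena offset, already bounded to 32-bit */
	const uint32_t off = 4096;            /* insn offset, added after the bounding */

	/* What the JIT effectively computes today: lands one page past
	 * the 4Gb area, i.e. in the guard region, and faults. */
	uint64_t faulting = kern_vm_start + (uint64_t)src_reg + off;

	/* What contiguity from the prog's pov would require: wrap the
	 * low 32 bits before adding the base. */
	uint64_t wrapped = kern_vm_start + (uint32_t)(src_reg + off);

	printf("faulting: 0x%llx\n", (unsigned long long)faulting); /* base + 0x100000000 */
	printf("wrapped:  0x%llx\n", (unsigned long long)wrapped);  /* base + 0x0 */
	return 0;
}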
Hence I went with the arena fix (ignore the lack of error checking):

static int vunmap_guard_pages(u64 kern_vm_start, u64 start, u64 end)
{
	/* Unmap the aliases of [start, end) that live in the guard areas
	 * right above and right below the 4Gb vm area. Only the first and
	 * last S16_MAX bytes of the 4Gb range are reachable through a
	 * signed 16-bit insn offset, so only those slices are aliased.
	 * vmap_guard_pages() (not shown in this mail) is the counterpart
	 * that establishes the alias mappings. */
	end = (u32)end;
	if (start < S16_MAX) {
		u64 end1 = min(end, S16_MAX + 1);

		vunmap_range(kern_vm_start + (1ull << 32) + start,
			     kern_vm_start + (1ull << 32) + end1);
	}
	if (end >= U32_MAX - S16_MAX + 1) {
		u64 start2 = max(start, U32_MAX - S16_MAX + 1);

		vunmap_range(kern_vm_start - (1ull << 32) + start2,
			     kern_vm_start - (1ull << 32) + end);
	}
	return 0;
}

static int vmap_pages_range_wrap32(u64 kern_vm_start, u64 uaddr, u64 page_cnt,
				   struct page **pages)
{
	u64 start = kern_vm_start + uaddr;
	u64 end = start + page_cnt * PAGE_SIZE;
	u64 part1_page_cnt, start2, end2;
	int ret;

	if (page_cnt == 1 || !((uaddr + page_cnt * PAGE_SIZE) >> 32)) {
		/* uaddr doesn't overflow in 32-bit: one contiguous mapping */
		ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
		if (ret)
			return ret;
		vmap_guard_pages(kern_vm_start, uaddr,
				 uaddr + page_cnt * PAGE_SIZE, pages);
		return 0;
	}

	/* the range wraps in 32-bit: map the part up to the 4Gb boundary... */
	part1_page_cnt = ((1ull << 32) - (u32)uaddr) >> PAGE_SHIFT;
	end = start + part1_page_cnt * PAGE_SIZE;
	ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
	if (ret)
		return ret;
	vmap_guard_pages(kern_vm_start, uaddr,
			 uaddr + part1_page_cnt * PAGE_SIZE, pages);

	/* ...and the remainder starting from uaddr 0 */
	start2 = kern_vm_start;
	end2 = start2 + (page_cnt - part1_page_cnt) * PAGE_SIZE;
	ret = vmap_pages_range(start2, end2, PAGE_KERNEL,
			       &pages[part1_page_cnt], PAGE_SHIFT);
	if (ret) {
		vunmap_range(start, end);
		return ret;
	}
	vmap_guard_pages(kern_vm_start, 0,
			 (page_cnt - part1_page_cnt) * PAGE_SIZE,
			 pages + part1_page_cnt);
	return 0;
}

It's working, but too complicated. Instead of a single vmap_pages_range()
we might need to do up to 4 calls and map certain pages into two places,
so that both 64-bit virtual addresses:
  kern_vm_start + 0xfffff000 + 4096
and
  kern_vm_start + (u32)(0xfffff000 + 4096)
point to the same page.

I'm inclined to tackle the wrap32 issue differently and simply disallow
[user_vm_start, user_vm_end] combinations where the lower 32 bits can
wrap. In other words, mmap() of len=4Gb will be aligned to 4Gb, while
mmap() of len=1M will be offset in such a way that both addr and addr+1M
have the same upper 32 bits (which is not the same as 1M aligned).
With that I will remove vmap_pages_range_wrap32() and do a single normal
vmap_pages_range() without extra tricks.
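Concretely, the no-wrap constraint would amount to an invariant like
this (an illustrative sketch, not code from the patch; the name is made
up):

static bool arena_range_no_wrap32(u64 addr, u64 len)
{
	/* Every byte of [addr, addr + len) must share the same upper
	 * 32 bits, so the lower 32 bits never wrap. For len = 4Gb this
	 * forces 4Gb alignment; for smaller len it only forces the
	 * range to not straddle a 4Gb boundary. */
	return len && (addr >> 32) == ((addr + len - 1) >> 32);
}

wdyt?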