From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Thu, 22 Feb 2024 15:25:25 -0800
Subject: Re: [PATCH bpf-next] mm: Introduce vm_area_[un]map_pages().
To: Christoph Hellwig
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds, Barret Rhoden,
 Johannes Weiner, Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
 linux-mm, Kernel Team

On Wed, Feb 21, 2024 at 11:05 AM Alexei Starovoitov wrote:
>
> On Tue, Feb 20, 2024 at 9:52 PM Christoph Hellwig wrote:
> >
> > On Tue, Feb 20, 2024 at 11:26:13AM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov
> > >
> > > The vmap() API is used to map a set of pages into contiguous kernel
> > > virtual space.
> > >
> > > BPF would like to extend the vmap API to implement a lazily-populated
> > > contiguous kernel virtual space whose size and start address are
> > > fixed early.
> > >
> > > The vmap API has functions to request and release areas of kernel
> > > address space: get_vm_area() and free_vm_area().
> >
> > As said before, I really hate growing more get_vm_area() and
> > free_vm_area() users outside the core vmalloc code. We have a few of
> > those, mostly due to ioremap (which is being consolidated) and
> > executable code allocation (for which there have been various attempts
> > at consolidation, and hopefully one finally succeeds). So let's take a
> > step back and think about how we can do that without it.
>
> There are also the xen grant tables that grab a range with
> get_vm_area() but manage it on their own. That's not an ioremap case.
> It looks to me like the vmalloc address range already contains
> different kinds of areas: vmalloc, vmap, ioremap, xen.
>
> Maybe we can do:
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 7d112cc5f2a3..633c7b643daa 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -28,6 +28,7 @@ struct iov_iter;	/* in uio.h */
>  #define VM_MAP_PUT_PAGES	0x00000200	/* put pages and free array in vfree */
>  #define VM_ALLOW_HUGE_VMAP	0x00000400	/* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
> +#define VM_BPF		0x00000800	/* bpf_arena pages */
>
> +static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
> +{
> +	return get_vm_area(size, VM_BPF);
> +}
>
> and enforce that flag in vm_area_[un]map_pages()?
>
> vmallocinfo can display it or skip it.
> Things like find_vm_area() can do something different with such an
> area (if that was the concern).
>
> > For the dynamically growing part do you need a special allocator or
> > can we just go straight to the page allocator and implement this
> > in common code?
>
> It's a bit of a special allocator that uses a maple tree to manage
> ranges within the 4G region and
> alloc_pages_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT)
> to grab pages, with an extra dance:
>
>   memcg = bpf_map_get_memcg(map);
>   old_memcg = set_active_memcg(memcg);
>
> to make sure memcg accounting is done the common way for all bpf maps.
>
> The tricky bpf-specific part is the computation of pgoff, since the
> arena is a shared memory region between user space and bpf progs:
> the lower 32 bits of the pointer have to be the same for user space
> and bpf.
>
> Not much has changed in the patch since the earlier thread.
> Either find it in your email or here:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=364c9b5d233d775728ec2bf3b4168fa6909e58d1
>
> Are you suggesting an api like:
>
>   struct vm_struct *area = get_sparse_vm_area(size);
>   vm_area_alloc_pages(struct vm_struct *area, ulong addr,
>                       int page_cnt, int numa_id);
>
> where vm_area_alloc_pages() allocates the pages and
> vmap_pages_range()s them, while all the code lives in mm/vmalloc.c?
>
> I can give it a shot.
>
> The ugly part is that bpf_map_get_memcg() would need to be passed in
> somehow.
>
> Another bpf-specific bit is the guard pages before and after the 4G
> range; such a vm_area_alloc_pages() would need to skip them.
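Spelled out, that memcg dance from the quoted mail looks roughly like
the following (an untested fragment; "map" stands for the arena's
bpf_map and "numa_id" for the requested node):

	struct mem_cgroup *memcg, *old_memcg;
	struct page *page;

	/* charge the page to the memcg of the bpf map that owns the arena */
	memcg = bpf_map_get_memcg(map);
	old_memcg = set_active_memcg(memcg);
	/* __GFP_ACCOUNT routes the charge to the active memcg */
	page = alloc_pages_node(numa_id,
				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT, 0);
	set_active_memcg(old_memcg);
	mem_cgroup_put(memcg);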
I've looked at this approach more. The somewhat generic-ish api for
mm/vmalloc.c may look like:

  struct vm_sparse_struct *area;

  area = get_sparse_vm_area(vm_area_size, guard_size, pgoff_offset,
                            max_pages, memcg, ...);

vm_area_size is what get_vm_area() will reserve out of the kernel
vmalloc region. For the bpf_arena case it will be 4gb+64k.

guard_size is the size of the guard area: 64k for bpf_arena.

pgoff_offset is the offset where page allocation has to start after
the guard area. For any normal vma, pgoff==0 is the first page after
vma->vm_start. bpf_arena is a sparse region shared between bpf and
user space, and it needs to keep the lower 32 bits of the address
that user space received from mmap(), so that the first allocated
page with pgoff=0 is the first page of the _user_ vma->vm_start.
Hence, for the kernel vmalloc range, the page allocator needs that
pgoff_offset.

max_pages is easy: it's the max number of pages that this
sparse_vm_area is allowed to allocate, and it's also driven by user
space. When user space does
mmap(NULL, bpf_arena_size, ..., bpf_arena_map_fd)
it gets an address; that address determines pgoff_offset, and
arena_size determines max_pages. arena_size can be 1 page or 1000
pages, always less than 4Gb, but vm_area_size will be 4gb+64k
regardless.

  vm_area_alloc_pages(struct vm_sparse_struct *area, ulong addr,
                      int page_cnt, int numa_id);

is semantically similar to user's mmap(). If addr == 0, the kernel
finds a free range after pgoff_offset, allocates page_cnt pages from
there, and vmaps them into the kernel's vm_sparse_struct area. If
addr is specified, it has to be >= pgoff_offset, with
page_cnt <= max_pages. All pages are accounted to the memcg specified
at vm_sparse_struct creation time, and a maple tree tracks all the
range allocations within the vm_sparse_struct.

So far it looks like the bigger half of kernel/bpf/arena.c would
migrate to mm/vmalloc.c and would still be very bpf-specific, so I
don't particularly like this direction. It feels like a burden for
both mm and bpf folks.

btw, LWN just posted a nice article describing the motivation:
https://lwn.net/Articles/961941/

So far, doing:

+#define VM_BPF		0x00000800	/* bpf_arena pages */

or VM_SPARSE?

+static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
+{
+	return get_vm_area(size, VM_BPF);
+}

and enforcing that flag where appropriate in mm/vmalloc.c is the
easiest for everyone. We probably should also add

#define VM_XEN		0x00001000

and use it in the xen use cases to differentiate the vmalloc vs vmap
vs ioremap vs bpf vs xen users.

Please share your opinions.
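P.S. To make "enforcing that flag where appropriate" concrete, here is
an untested sketch of what the check in the new helper could look like
(the signature follows this patch; the range check and the call into
the internal vmap_pages_range() are illustrative, not final):

	int vm_area_map_pages(struct vm_struct *area, unsigned long start,
			      unsigned long end, struct page **pages)
	{
		/* only areas created with the new flag may be mapped into */
		if (WARN_ON_ONCE(!(area->flags & VM_BPF)))
			return -EINVAL;
		/* stay within the reserved area */
		if (start < (unsigned long)area->addr ||
		    end > (unsigned long)area->addr + get_vm_area_size(area))
			return -ERANGE;
		return vmap_pages_range(start, end, PAGE_KERNEL, pages,
					PAGE_SHIFT);
	}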