From: Pasha Tatashin
Date: Wed, 6 Mar 2024 17:56:37 -0500
Subject: Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
References: <20240305030516.41519-1-alexei.starovoitov@gmail.com> <20240305030516.41519-3-alexei.starovoitov@gmail.com>
To: Alexei Starovoitov
Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Linus Torvalds, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mike Rapoport, Boris Ostrovsky, sstabellini@kernel.org, Juergen Gross, linux-mm, xen-devel@lists.xenproject.org, Kernel Team

On Wed, Mar 6, 2024 at 5:13 PM Alexei Starovoitov wrote:
>
> On Wed, Mar 6, 2024 at 1:46 PM Pasha Tatashin wrote:
> >
> > > > This interface and in general VM_SPARSE would be useful for
> > > > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > > > would be a problem. We would need to be able to handle
> > > > vm_area_map_pages() from interrupt disabled context therefore no
> > > > sleeping. The caller would need to guarantee that the page tables are
> > > > pre-allocated before the mapping.
> > >
> > > Sounds like we'd need to differentiate two kinds of sparse regions.
> > > One that is really sparse where page tables are not populated (bpf use case)
> > > and another where only the pte level might be empty.
> > > Only the latter one will be usable for such auto-grow stacks.
> > >
> > > Months back I played with this idea:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> > > that
> > > "Make vmap_pages_range() allocate page tables down to the last (PTE) level."
> > > Essentially pass NULL instead of 'pages' into vmap_pages_range()
> > > and it will populate all levels except the last.
> >
> > Yes, this is what is needed, however, it can be a little simpler with
> > kernel stacks:
> > given that the first page in the vm_area is mapped when stack is first
> > allocated, and that the VA range is aligned to 16K, we actually are
> > guaranteed to have all page table levels down to pte pre-allocated
> > during that initial mapping. Therefore, we do not need to worry about
> > allocating them later during PFs.
>
> Ahh. Found:
> stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, ...
>
> > > Then the page fault handler can service a fault in auto-growing stack
> > > area if it has a page stashed in some per-cpu free list.
> > > I suspect this is something you might need for
> > > "16k stack that is populated on fault",
> > > plus a free list of 3 pages per-cpu,
> > > and set_pte_at() in pf handler.
> >
> > Yes, what you described is exactly what I am working on: using 3-pages
> > per-cpu to handle kstack page faults. The only thing that is missing
> > is that I would like to have the ability to call a non-sleeping
> > version of vm_area_map_pages().
>
> vm_area_map_pages() cannot be non-sleepable, since the [start, end)
> range will dictate whether mid level allocs and locks are needed.
>
> Instead in alloc_thread_stack_node() you'd need a flavor
> of get_vm_area() that can align the range to THREAD_ALIGN.
> Then immediately call _sleepable_ vm_area_map_pages() to populate
> the first page and later set_pte_at() the other pages on demand
> from the fault handler.

We still need to get to the PTE level to use set_pte_at(). So, either
store the PTE pointer in task_struct for faster PF handling, or add
another non-sleeping vmap function that will do something like this:

vm_area_set_page_at(addr, page)
{
	/* Walk the already-populated kernel page tables; nothing is allocated. */
	pgd = pgd_offset_k(addr);
	p4d = p4d_offset(pgd, addr);
	pud = pud_offset(p4d, addr);
	pmd = pmd_offset(pud, addr);
	pte = pte_offset_kernel(pmd, addr);

	set_pte_at(&init_mm, addr, pte, mk_pte(page...));
}

Pasha
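
The alignment argument in this thread (a 16K stack whose VA range is
aligned to 16K has all page table levels down to the PTE populated once
its first page is mapped) can be checked with plain arithmetic. What
follows is a minimal userspace sketch, not kernel code. It assumes 4 KiB
pages and a PTE table that spans 2 MiB (the x86-64 layout; neither value
is stated in the thread), and the helper names pte_table_of() and
stack_shares_pte_table() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the thread: 4 KiB pages, and one PTE
 * table covering 2 MiB of virtual address space (512 entries). */
#define PAGE_SHIFT  12
#define PAGE_SIZE   (1ULL << PAGE_SHIFT)
#define PMD_SHIFT   21
#define STACK_SIZE  (4 * PAGE_SIZE)   /* the 16K stack discussed above */

/* Hypothetical helper: index of the PTE table that covers 'addr'. */
static uint64_t pte_table_of(uint64_t addr)
{
	return addr >> PMD_SHIFT;
}

/* Returns 1 if every page of a STACK_SIZE stack starting at 'base'
 * lives in the same PTE table as the first page, 0 otherwise. */
static int stack_shares_pte_table(uint64_t base)
{
	uint64_t off;

	assert(base % PAGE_SIZE == 0);   /* stacks start on a page boundary */

	for (off = 0; off < STACK_SIZE; off += PAGE_SIZE)
		if (pte_table_of(base + off) != pte_table_of(base))
			return 0;
	return 1;
}
```

Under these assumptions a 16K-aligned base can never straddle a 2 MiB
boundary, because 2 MiB is a multiple of 16K, so mapping the first page
populates the one PTE table that the remaining pages need. A base that
is only page-aligned can straddle the boundary, and its later pages
would then need a second PTE table that the initial mapping did not
allocate.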