From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 10 Dec 2024 14:42:09 -0800
Subject: Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt@linux.dev, Michal Hocko, Matthew Wilcox, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team
References: <20241210023936.46871-1-alexei.starovoitov@gmail.com> <20241210023936.46871-2-alexei.starovoitov@gmail.com>
List-ID: linux-mm@kvack.org
On Tue, Dec 10, 2024 at 10:39 AM Vlastimil Babka wrote:
>
> On 12/10/24 03:39, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov
> >
> > Tracing BPF programs execute from tracepoints and kprobes where running
> > context is unknown, but they need to request additional memory.
> > The prior workarounds were using pre-allocated memory and BPF-specific
> > freelists to satisfy such allocation requests. Instead, introduce a
> > __GFP_TRYLOCK flag that makes the page allocator accessible from any context.
> > It relies on the percpu free list of pages that rmqueue_pcplist() should be
> > able to pop a page from. If that fails (due to IRQ re-entrancy or the list
> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill the percpu freelist as normal.
> > A BPF program may execute with IRQs disabled, and zone->lock is sleeping in RT,
> > so trylock is the only option.
> > In theory we can introduce a percpu reentrance counter and increment it
> > every time spin_lock_irqsave(&zone->lock, flags) is used,
> > but we cannot rely on it. Even if this cpu is not in the page_alloc path,
> > spin_lock_irqsave() is not safe, since the BPF prog might be called
> > from a tracepoint where preemption is disabled. So trylock only.
> >
> > Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
> > The support comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Signed-off-by: Alexei Starovoitov
>
> I think there might be more non-try spin_locks reachable from page allocations:
>
> - in reserve_highatomic_pageblock() which I think is reachable unless this
>   is limited to order-0

Good point. I missed this bit:

	if (order > 0)
		alloc_flags |= ALLOC_HIGHATOMIC;

In the bpf use case it will be called with order == 0 only, but it's
better to foolproof it. I will switch to:

	__GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT

> - try_to_accept_memory_one()

When I studied the code it looked to me that there should be no
unaccepted_pages.
I think you're saying that there could be unaccepted memory from a
previous allocation, and the trylock attempt just got unlucky enough to
reach that path? What do you think of the following:

-		cond_accept_memory(zone, order);
+		cond_accept_memory(zone, order, alloc_flags);

 		/*
 		 * Detect whether the number of free pages is below high

@@ -7024,7 +7024,8 @@ static inline bool has_unaccepted_memory(void)
 	return static_branch_unlikely(&zones_with_unaccepted_pages);
 }

-static bool cond_accept_memory(struct zone *zone, unsigned int order)
+static bool cond_accept_memory(struct zone *zone, unsigned int order,
+			       unsigned int alloc_flags)
 {
 	long to_accept;
 	bool ret = false;

@@ -7032,6 +7033,9 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order)
 	if (!has_unaccepted_memory())
 		return false;

+	if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+		return false;
+

Or is there a better approach? Reading from current->flags the way
Matthew proposed?

> - as part of post_alloc_hook() in set_page_owner(), stack depot might do
>   raw_spin_lock_irqsave(), is that one ok?

Well, I looked at the stack depot and was tempted to add trylock
handling there, but it looked to be a bit dodgy in general and I figured
it should be done separately from this set. Like:

	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
		page = alloc_pages(gfp_nested_mask(alloc_flags),

followed by:

	if (in_nmi()) {
		/* We can never allocate in NMI context. */
		WARN_ON_ONCE(can_alloc);

That WARN is too late. If we were in_nmi and called alloc_pages, the
kernel might be misbehaving already.

> hope I didn't miss anything else, especially in those other debugging hooks
> (KASAN etc)

I looked through them and could be missing something, of course.
The kasan usage in the alloc_page path seems fine.
But for slab I found the kasan_quarantine logic, which needs special
treatment. Other slab debugging bits pose issues too.
The rough idea is to do kmalloc_nolock() / kfree_nolock() that don't
call into any pre/post hooks (including slab_free_hook and
slab_pre_alloc_hook). kmalloc_nolock() will pretty much call
__slab_alloc_node() directly and do only the basic kasan poison stuff
that needs no locks. I will be going over all the paths again, of course.

Thanks for the reviews so far!