From: Alexei Starovoitov
Date: Wed, 15 Jan 2025 15:00:08 -0800
Subject: Re: [PATCH bpf-next v5 1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
To: Vlastimil Babka
Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
In-Reply-To: <9fb94763-69b2-45bd-bc54-aef82037a68c@suse.cz>
References: <20250115021746.34691-1-alexei.starovoitov@gmail.com>
	<20250115021746.34691-2-alexei.starovoitov@gmail.com>
	<9fb94763-69b2-45bd-bc54-aef82037a68c@suse.cz>
On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> > rmqueue_pcplist() will spin_trylock to grab the page from percpu
> > free list. If it fails (due to re-entrancy or list being empty)
> > then rmqueue_bulk()/rmqueue_buddy() will attempt to
> > spin_trylock zone->lock and grab the page from there.
> > spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> > Bailout early in such case.
> >
> > The support for gfpflags_allow_spinning() mode for free_page and memcg
> > comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Acked-by: Michal Hocko
> > Signed-off-by: Alexei Starovoitov
>
> Acked-by: Vlastimil Babka
>
> Some nits below:
>
> > ---
> >  include/linux/gfp.h | 22 ++++++++++
> >  mm/internal.h       |  1 +
> >  mm/page_alloc.c     | 98 +++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 118 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index b0fe9f62d15b..b41bb6e01781 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> >  	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
> >  }
> >
> > +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> > +{
> > +	/*
> > +	 * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
> > +	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> > +	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
> > +	 * try_alloc_pages() is the only API that doesn't specify either flag.
> > +	 *
> > +	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> > +	 * those are guaranteed to never block on a sleeping lock.
> > +	 * Here we are enforcing that the allaaction doesn't ever spin
>
> allocation oops.
>
> > +	 * on any locks (i.e. only trylocks). There is no highlevel
> > +	 * GFP_$FOO flag for this use in try_alloc_pages() as the
> > +	 * regular page allocator doesn't fully support this
> > +	 * allocation mode.
> > +	 */
> > +	return !(gfp_flags & __GFP_RECLAIM);
> > +}
>
> This function seems unused, I guess the following patches will use it.
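Correct, the free_page and memcg patches later in the series start
using it. The callee side pattern looks roughly like this (just a
sketch with a made-up lock and helper, not code from this series):

	static DEFINE_SPINLOCK(obj_lock);	/* made up for the example */

	static struct obj *obj_pop(gfp_t gfp_flags)
	{
		unsigned long flags;
		struct obj *obj;

		if (!gfpflags_allow_spinning(gfp_flags)) {
			/* unknown context: trylock only, never spin */
			if (!spin_trylock_irqsave(&obj_lock, flags))
				return NULL;
		} else {
			spin_lock_irqsave(&obj_lock, flags);
		}
		obj = __obj_pop_locked();	/* also made up */
		spin_unlock_irqrestore(&obj_lock, flags);
		return obj;
	}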
> > @@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >
> >  	might_alloc(gfp_mask);
> >
> > -	if (should_fail_alloc_page(gfp_mask, order))
> > +	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
> > +	    should_fail_alloc_page(gfp_mask, order))
>
> Is it because should_fail_alloc_page() might take some lock or whatnot?
> Maybe comment?

This is mainly because of all kinds of printk-s that the fail* logic
does. We've seen way too many syzbot reports caused by printk from an
unsupported context. This part of the comment covers it:

+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.

It's also because of get_random_u32() inside the fail* logic, which
grabs a spin_lock. We've seen syzbot reports about get_random()
deadlocking. Fixing printk and fail*/get_random is necessary too, but
that's orthogonal work to these patches. Here I'm bypassing this path
in prepare_alloc_pages() to avoid dealing with more syzbot reports
than are already there. With try_alloc_pages() called out of bpf
attached to a kprobe inside the printk core logic it would be too easy
for syzbot. And then people will yell at bpf for causing problems
whereas the root cause is printk being broken. We see this finger
pointing all too often.

> >  		return false;
> >
> >  	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> > @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
> >  }
> >
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> > +
> > +/**
> > + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> > + * @nid - node to allocate from
> > + * @order - allocation order size
> > + *
> > + * Allocates pages of a given order from the given node. This is safe to
> > + * call from any context (from atomic, NMI, and also reentrant
> > + * allocator -> tracepoint -> try_alloc_pages_noprof).
> > + * Allocation is best effort and to be expected to fail easily so nobody should
> > + * rely on the success. Failures are not reported via warn_alloc().
> > + *
> > + * Return: allocated page or NULL on failure.
> > + */
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > +	/*
> > +	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> > +	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> > +	 * is not safe in arbitrary context.
> > +	 *
> > +	 * These two are the conditions for gfpflags_allow_spinning() being true.
> > +	 *
> > +	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> > +	 * to warn. Also warn would trigger printk() which is unsafe from
> > +	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
> > +	 * since the running context is unknown.
> > +	 *
> > +	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> > +	 * is safe in any context. Also zeroing the page is mandatory for
> > +	 * BPF use cases.
> > +	 *
> > +	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
> > +	 * specify it here to highlight that try_alloc_pages()
> > +	 * doesn't want to deplete reserves.
> > +	 */
> > +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> > +	unsigned int alloc_flags = ALLOC_TRYLOCK;
> > +	struct alloc_context ac = { };
> > +	struct page *page;
> > +
> > +	/*
> > +	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> > +	 * If spin_trylock() is called from hard IRQ the current task may be
> > +	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> > +	 * task as the owner of another rt_spin_lock which will confuse PI
> > +	 * logic, so return immediately if called form hard IRQ or NMI.
> > +	 *
> > +	 * Note, irqs_disabled() case is ok. This function can be called
> > +	 * from raw_spin_lock_irqsave region.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +		return NULL;
> > +	if (!pcp_allowed_order(order))
> > +		return NULL;
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
> > +	if (has_unaccepted_memory())
> > +		return NULL;
> > +#endif
> > +	/* Bailout, since _deferred_grow_zone() needs to take a lock */
> > +	if (deferred_pages_enabled())
> > +		return NULL;
>
> Is it fine for BPF that things will fail to allocate until all memory is
> deferred-initialized and accepted? I guess it's easy to teach those places
> later to evaluate if they can take the lock.

Exactly. If it becomes an issue it can be addressed. I didn't want to
complicate the code just-because.
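To spell out the "best effort" contract: callers have to treat NULL as
the normal case, including during early boot while memory is still
being deferred-initialized or accepted. The BPF side will look roughly
like this (hypothetical caller; the matching free-side support comes
in the next patches):

	struct page *page;

	/* any context: tracepoint, kprobe, hard IRQ, NMI */
	page = try_alloc_pages(NUMA_NO_NODE, 0);
	if (!page) {
		/*
		 * Expected to fail easily; fall back to preallocated
		 * memory or drop the request, never warn.
		 */
		return -ENOMEM;
	}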
> > +
> > +	if (nid == NUMA_NO_NODE)
> > +		nid = numa_node_id();
> > +
> > +	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > +			    &alloc_gfp, &alloc_flags);
> > +
> > +	/*
> > +	 * Best effort allocation from percpu free list.
> > +	 * If it's empty attempt to spin_trylock zone->lock.
> > +	 */
> > +	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
>
> What about set_page_owner() from post_alloc_hook() and its stackdepot
> saving? I guess not an issue until try_alloc_pages() gets used later, so
> just a mental note that it has to be resolved before. Or is it actually safe?

set_page_owner() should be fine. save_stack() has an in_page_owner
recursion protection mechanism.

stack_depot_save_flags() may be problematic if there is another path
to it. I guess I can do:

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 245d5b416699..61772bc4b811 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -630,7 +630,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 		prealloc = page_address(page);
 	}

-	if (in_nmi()) {
+	if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
 		/* We can never allocate in NMI context. */
 		WARN_ON_ONCE(can_alloc);
 		/* Best effort; bail if we fail to take the lock. */
 		if (!raw_spin_trylock_irqsave(&pool_lock, flags))
 			goto exit;

as part of this patch, but I'm not convinced it's necessary.
stack_depot* is effectively noinstr. kprobe-bpf cannot be placed in
there and afaict it doesn't call any tracepoints. So in_nmi() is the
only way to reenter and that's already covered.
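PS: for anyone skimming the thread, the zone->lock side mentioned in
the commit message (not quoted above) boils down to a trylock-first
scheme in rmqueue_buddy() and friends, roughly (sketch, not the exact
hunk from the patch):

	if (!spin_trylock_irqsave(&zone->lock, flags)) {
		/* only ALLOC_TRYLOCK users bail out on contention */
		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
			return NULL;
		spin_lock_irqsave(&zone->lock, flags);
	}

i.e. regular GFP users still spin on zone->lock exactly as before.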