Date: Tue, 11 Mar 2025 19:04:47 +0100
From: Mateusz Guzik
To: Alexei Starovoitov
Cc: Andrew Morton, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
 Peter Zijlstra, Vlastimil Babka, Sebastian Sewior, Steven Rostedt,
 Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
 Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team
Subject: Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages()
 for opportunistic page allocation
References: <20250222024427.30294-1-alexei.starovoitov@gmail.com>
 <20250222024427.30294-3-alexei.starovoitov@gmail.com>
 <20250310190427.32ce3ba9adb3771198fe2a5c@linux-foundation.org>

On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton wrote:
> >
> > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov wrote:
> >
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory. The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> >
> > The "prior workarounds" sound entirely appropriate. Because the
> > performance and maintainability of Linux's page allocator is about
> > 1,000,040 times more important than relieving BPF of having to carry a
> > "workaround".
>
> Please explain where performance and maintainability is affected?
>

I have some related questions below. Note I'm a bystander, not claiming
to have any (N)ACK power.

A small bit before that:

	if (!spin_trylock_irqsave(&zone->lock, flags)) {
		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
			return NULL;
		spin_lock_irqsave(&zone->lock, flags);
	}

This is going to perform worse when contested due to an extra access to
the lock. I presume it was done this way to avoid suffering another
branch, with the assumption the trylock is normally going to succeed.

It so happens I'm looking at parallel exec on the side and, while
currently there are bigger fish to fry, contention on zone->lock is very
much a factor. The majority of it comes from RCU freeing (in
free_pcppages_bulk()), but I also see several rmqueue calls below. As
they trylock, they are going to make it more expensive for
free_pcppages_bulk() to even get the lock.

So this *is* contested, but at the moment it is largely overshadowed by
bigger problems (which someone(tm) hopefully will sort out sooner rather
than later).

So should this land, I expect someone is going to hoist the trylock at
some point in the future. If it was my patch I would just do it now, but
I understand this may result in new people showing up and complaining.
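To make the point about the extra access concrete, here is a userspace
sketch of the two shapes. It is not kernel code: the toy spinlock, the
ALLOC_TRYLOCK value and all function names are made up for illustration,
and the call counter is only a crude proxy for trips to the lock word.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define ALLOC_TRYLOCK 0x1u	/* stand-in flag; not the kernel's actual value */

/* Toy spinlock standing in for zone->lock (hypothetical, userspace only). */
struct toy_spinlock {
	atomic_flag held;
	atomic_uint api_calls;	/* counts calls into the lock API */
};

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT, 0 }

/* One attempt; returns true if the lock was taken. */
static bool toy_trylock(struct toy_spinlock *l)
{
	atomic_fetch_add(&l->api_calls, 1);
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Blocking acquire: spins until the current holder releases. */
static void toy_lock(struct toy_spinlock *l)
{
	atomic_fetch_add(&l->api_calls, 1);
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;
}

static void toy_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Shape quoted above: trylock first, look at the flag only on failure. */
static bool lock_trylock_first(struct toy_spinlock *l, unsigned int alloc_flags)
{
	if (!toy_trylock(l)) {
		if (alloc_flags & ALLOC_TRYLOCK)
			return false;
		toy_lock(l);	/* second trip to a lock we already know is busy */
	}
	return true;
}

/* Hoisted shape: branch on the flag first, then go to the lock once. */
static bool lock_flag_first(struct toy_spinlock *l, unsigned int alloc_flags)
{
	if (alloc_flags & ALLOC_TRYLOCK)
		return toy_trylock(l);
	toy_lock(l);
	return true;
}
```

The two variants differ only when the lock is held by someone else and
ALLOC_TRYLOCK is not set: the first calls into the lock twice, the second
once. The price of the second shape is the extra branch on every
acquisition, which is presumably what the patch was trying to avoid.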
> As far as motivation, if I recall correctly, you were present in
> the room when Vlastimil presented the next steps for SLUB at
> LSFMM back in May of last year.
> A link to memory refresher is in the commit log:
> https://lwn.net/Articles/974138/
>
> Back then he talked about a bunch of reasons including better
> maintainability of the kernel overall, but what stood out to me
> as the main reason to use SLUB for bpf, objpool, mempool,
> and networking needs is prevention of memory waste.
> All these wrappers of slub pin memory that should be shared.
> bpf, objpool, mempools should be good citizens of the kernel
> instead of stealing the memory. That's the core job of the
> kernel. To share resources. Memory is one such resource.
>

I suspect the worry is that the added behavior may complicate things
down the road (or even prevent some optimizations) -- there is another
context to worry about.

I think it would help to outline why these are doing any memory
allocation from something like NMI to begin with. Perhaps you could have
carved out a very small piece of memory as a reserve just for that? It
would be refilled as needed from process context.

A general remark is that support for an arbitrary running context in
core primitives artificially limits what can be done to optimize them
for their most common users. IMO the sheaves patchset is a little bit of
an admission (also see below). It may be the get pages routine will get
there.

If non-task memory allocs got kicked to the curb, or at least got
heavily limited, then a small allocator just for that purpose would do
the trick and the two variants would likely be simpler than one thing
which supports everyone. This patchset is a step in the opposite
direction, but it may be there is a good reason.

To my understanding ebpf can be used to run "real" code to do something
or "merely" collect data. I presume the former case is already running
from a context where it can allocate memory no problem.

For the latter I presume ebpf has conflicting goals:
1. not disrupt the workload under observation (cpu and ram) -- to that
   end small memory usage limits are in place. Otherwise a carelessly
   written aggregation can OOM the box (e.g., say someone wants to know
   which files get opened the most and aggregates on names, while a
   malicious user opens and unlinks autogenerated names, endlessly
   growing the list if you let it)
2. actually log stuff even if resources are scarce. To that end I would
   expect that a small area is pre-allocated and periodically drained

Which for me further puts a question mark on general alloc from the NMI
context.

All that said, the cover letter:
> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.

does not justify why ebpf performs allocations in a manner which
warrants any of this, which I suspect is what Andrew asked about.

I never put any effort into ebpf -- it may be all the above makes
excellent sense. But then you need to make a case to the people
maintaining the code.
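
For what it's worth, the kind of small reserve I mentioned above could
look roughly like the sketch below. This is userspace C under loud
assumptions: the struct, the function names and both sizes are
hypothetical, malloc() stands in for whatever allocator process context
would use, and concurrency is ignored entirely (a real version callable
from NMI would need a lockless structure).

```c
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_SLOTS    4	/* hypothetical number of reserved objects */
#define RESERVE_OBJ_SIZE 256	/* hypothetical object size in bytes */

/* Fixed-size stash of pre-allocated objects. */
struct reserve {
	void *slot[RESERVE_SLOTS];
	size_t nr;		/* number of filled slots */
};

/*
 * Meant for the restricted context: never allocates and never blocks.
 * Returns NULL when the reserve is empty, so the caller must cope with
 * failure.
 */
static void *reserve_take(struct reserve *r)
{
	if (r->nr == 0)
		return NULL;
	return r->slot[--r->nr];
}

/*
 * Meant for a context where allocating is fine (think process context).
 * Tops the reserve back up and returns how many objects were added.
 */
static size_t reserve_refill(struct reserve *r)
{
	size_t added = 0;

	while (r->nr < RESERVE_SLOTS) {
		void *obj = malloc(RESERVE_OBJ_SIZE);

		if (!obj)
			break;	/* give up quietly, try again on the next refill */
		r->slot[r->nr++] = obj;
		added++;
	}
	return added;
}
```

The restricted side only ever pops from the stash and has to tolerate
NULL; all actual allocation happens in reserve_refill(), which also caps
how much memory the reserve can pin.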