From: Kumar Kartikeya Dwivedi
Date: Sun, 28 Aug 2022 00:53:48 +0200
Subject: Re: [PATCH v4 bpf-next 00/15] bpf: BPF specific memory allocator.
To: Andrii Nakryiko
Cc: Alexei Starovoitov, davem@davemloft.net, daniel@iogearbox.net, andrii@kernel.org, tj@kernel.org, delyank@fb.com, linux-mm@kvack.org, bpf@vger.kernel.org, kernel-team@fb.com

On Sat, 27
Aug 2022 at 18:57, Andrii Nakryiko wrote:
>
> On Thu, Aug 25, 2022 at 7:44 PM Alexei Starovoitov wrote:
> >
> > From: Alexei Starovoitov
> >
> > Introduce any context BPF specific memory allocator.
> >
> > Tracing BPF programs can attach to kprobe and fentry. Hence they
> > run in unknown context where calling plain kmalloc() might not be safe.
> > Front-end kmalloc() with a per-cpu cache of free elements.
> > Refill this cache asynchronously from irq_work.
> >
> > Major achievements enabled by bpf_mem_alloc:
> > - Dynamically allocated hash maps used to be 10 times slower than
> >   fully preallocated ones. With bpf_mem_alloc and subsequent
> >   optimizations, the speed of dynamic maps is equal to full prealloc.
> > - Tracing bpf programs can use dynamically allocated hash maps,
> >   potentially saving lots of memory. A typical hash map is sparsely
> >   populated.
> > - Sleepable bpf programs can use dynamically allocated hash maps.
> >
> > v3->v4:
> > - fix build issue due to missing local.h on 32-bit arch
> > - add Kumar's ack
> > - proposal for next steps from Delyan:
> >   https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com/
> >
> > v2->v3:
> > - Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
> > - Allowed sleepable bpf progs to use dynamically allocated maps. Patches 13 and 14.
> > - Added a sysctl to force bpf_mem_alloc in the hash map even if pre-alloc is
> >   requested, to reduce memory consumption. Patch 15.
> > - Fix: zero-fill percpu allocation
> > - Single rcu_barrier at the end instead of one per cpu during
> >   bpf_mem_alloc destruction
> >
> > v2 thread:
> > https://lore.kernel.org/bpf/20220817210419.95560-1-alexei.starovoitov@gmail.com/
> >
> > v1->v2:
> > - Moved the unsafe direct call_rcu() from the hash map into a safe place
> >   inside bpf_mem_alloc. Patches 7 and 9.
> > - Optimized atomic_inc/dec in the hash map with percpu_counter. Patch 6.
> > - Tuned watermarks per allocation size. Patch 8.
> > - Adopted this approach for per-cpu allocation. Patch 10.
> > - Fully converted the hash map to bpf_mem_alloc. Patch 11.
> > - Removed the tracing prog restriction on map types. Combination of all
> >   patches and final patch 12.
> >
> > v1 thread:
> > https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/
> >
> > LWN article:
> > https://lwn.net/Articles/899274/
> >
> > Future work:
> > - expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
> > - convert lru map to bpf_mem_alloc
> >
> > Alexei Starovoitov (15):
> >   bpf: Introduce any context BPF specific memory allocator.
> >   bpf: Convert hash map to bpf_mem_alloc.
> >   selftests/bpf: Improve test coverage of test_maps
> >   samples/bpf: Reduce syscall overhead in map_perf_test.
> >   bpf: Relax the requirement to use preallocated hash maps in tracing
> >     progs.
> >   bpf: Optimize element count in non-preallocated hash map.
> >   bpf: Optimize call_rcu in non-preallocated hash map.
> >   bpf: Adjust low/high watermarks in bpf_mem_cache
> >   bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
> >   bpf: Add percpu allocation support to bpf_mem_alloc.
> >   bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
> >   bpf: Remove tracing program restriction on map types
> >   bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
> >   bpf: Remove prealloc-only restriction for sleepable bpf programs.
> >   bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
> >
> >  include/linux/bpf_mem_alloc.h             |  26 +
> >  include/linux/filter.h                    |   2 +
> >  kernel/bpf/Makefile                       |   2 +-
> >  kernel/bpf/core.c                         |   2 +
> >  kernel/bpf/hashtab.c                      | 132 +++--
> >  kernel/bpf/memalloc.c                     | 602 ++++++++++++++++++++++
> >  kernel/bpf/syscall.c                      |  14 +-
> >  kernel/bpf/verifier.c                     |  52 --
> >  samples/bpf/map_perf_test_kern.c          |  44 +-
> >  samples/bpf/map_perf_test_user.c          |   2 +-
> >  tools/testing/selftests/bpf/progs/timer.c |  11 -
> >  tools/testing/selftests/bpf/test_maps.c   |  38 +-
> >  12 files changed, 796 insertions(+), 131 deletions(-)
> >  create mode 100644 include/linux/bpf_mem_alloc.h
> >  create mode 100644 kernel/bpf/memalloc.c
> >
> > --
> > 2.30.2
> >
>
> It's great to lift all those NMI restrictions on the non-prealloc hashmap!
> This should also open up new maps (like qp-trie) that can't be
> pre-sized to the NMI world as well.
>
> But just to clarify: in NMI mode we can exhaust the memory in the caches
> (and thus, if we do a lot of allocation in a single BPF program execution,
> we can fail some operations). That's unavoidable. But it's not 100% clear
> what the behavior is in IRQ mode, and separately from that in the "usual",
> less restrictive mode. Is my understanding correct that we shouldn't
> run out of memory (assuming there is memory available, of course)
> because replenishing of the caches will interrupt BPF program execution?

When I was reviewing the code, what I understood was as follows:

There are two ways the refill work is queued. On non-RT kernels, it is
queued for execution in hardirq context (raised_list); on RT kernels, it
is instead executed by per-CPU pinned irq_work kthreads (from lazy_list).
We cannot set the IRQ_WORK_HARD_IRQ flag to force RT to execute the work
in hardirq context, because bpf_mem_refill may take sleepable non-raw
spinlocks when calling into kmalloc, which is disallowed there.

So, to summarize the behavior:

In NMI context:
- for both RT and non-RT, once we deplete the cache we get -ENOMEM.
In IRQ context:
- for RT, the cache is refilled asynchronously by waking up the irq_work
  kthread, so you may still get -ENOMEM (it also depends on whether the
  bpf prog runs in hardirq or threaded irq context, since hardirq context
  is non-preemptible, which delays the refill from the irq_work kthread).
- for non-RT, we are already inside the interrupt handler, hence you will
  get -ENOMEM. Interrupt handlers keep interrupts disabled, so IPI
  execution is delayed until the handler returns.

In softirq and task context:
- for RT, the cache is refilled asynchronously by waking up the irq_work
  kthread, so you may still get -ENOMEM.
- for non-RT, an IPI is sent to the local cpu, which executes the work
  synchronously, so the cache is refilled by interrupting the program.
  Even when softirq executes on the exit path of an interrupt, interrupts
  are enabled at that point, so the refill still happens synchronously by
  raising the local IPI.

For the last case (say, task context), the problem of kmalloc reentrancy
comes to mind again: e.g., if we are tracing in the guts of kmalloc and
send a local IPI which eventually calls kmalloc again, that may deadlock.
But remember that such cases are already possible without BPF; interrupts
which allocate may come in at any time, so the kmalloc code itself keeps
IRQs disabled in those places, hence we are fine from the BPF side as
well.

Please let me know of any inaccuracies in the above description.

> Or am I wrong, and can we still run out of memory if we don't have
> enough pre-cached memory? I think it would be good to clearly state
> such things (unless I missed them somewhere in the patches). I'm trying
> to understand whether, in the non-restrictive mode, we can still fail to
> allocate a bunch of hashmap elements in a loop just because of the
> design of bpf_mem_alloc?
>
> But it looks great otherwise. For the series:
>
> Acked-by: Andrii Nakryiko