From: Hou Tao <houtao@huaweicloud.com>
Subject: [LSF/MM/BPF TOPIC] Make bpf memory allocator more robust
To: lsf-pc@lists.linux-foundation.org
Cc: bpf, linux-mm@kvack.org, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, David Vernet, houtao1@huawei.com
Date: Sat, 25 Feb 2023 14:23:29 +0800
Message-ID: <2d29f66f-fcb1-ec76-c74f-d12495a9516f@huaweicloud.com>
bpf memory allocator was introduced in v6.1 by Alexei Starovoitov [0]. Its main purpose is to provide an any-context allocator for bpf programs, which can be attached anywhere (e.g., __kmalloc()) and run in any context (e.g., NMI context through perf_event_overflow()). Before that, only the pre-allocated hash map was usable in these contexts, but it wastes memory because a hash table is typically sparse. Besides the memory saving, the allocator also significantly improves the performance of the dynamically allocated hash map and makes the hash map usable by sleepable programs.

As more use cases of the bpf memory allocator emerge, some problems have surfaced that need to be discussed and fixed.

The first problem is the immediate reuse of elements in the bpf memory allocator. Immediate reuse exists to prevent OOM in the typical usage scenario of the bpf hash map, but it introduces a use-after-free problem [1] for the dynamically allocated hash table (the same problem has existed for the pre-allocated hash table since it was introduced). For a hash table the reuse may be acceptable, but it makes introducing new use cases more difficult. For example, in bpf-qp-trie [2] the internal nodes of the qp-trie are managed by the bpf memory allocator; if an internal node used during lookup is freed and reused, the lookup procedure may panic or return an incorrect result.
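The hazard above can be modeled in a few lines of userspace C (a sketch of the idea only, not kernel code; the `elem`/`demo_*` names are hypothetical): an allocator that puts freed elements straight back on its freelist hands the same memory to the next caller, so a reader that cached a pointer before the free observes the new owner's data.

```c
#include <assert.h>
#include <stddef.h>

/* Toy element with an embedded freelist link, mimicking a cache that
 * immediately reuses freed objects with no grace period for readers. */
struct elem {
	struct elem *next;
	int key;
};

static struct elem pool[2];
static struct elem *free_list;

static void demo_init(void)
{
	pool[0].next = &pool[1];
	pool[1].next = NULL;
	free_list = &pool[0];
}

static struct elem *demo_alloc(void)
{
	struct elem *e = free_list;

	if (e)
		free_list = e->next;
	return e;
}

static void demo_free(struct elem *e)
{
	/* Immediate reuse: the element goes straight back on the
	 * freelist, so the very next allocation returns it again. */
	e->next = free_list;
	free_list = e;
}
```

A lookup that stashed a pointer before the free now reads whatever the element's new owner wrote, which is exactly why a lockless qp-trie lookup over such elements can go wrong.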
I have implemented a qp-trie demo in which two version numbers are added to each internal node to ensure its validity, one saved in the parent node and the other in the node itself, but I am not sure about its correctness. bpf_cpumask, introduced recently [3], is another example: bpf_cpumask_kptr_get() checks the validity of a bpf_cpumask by checking whether its usage count is zero, but it is unclear how it handles the case where the cpumask is freed and then reused by another bpf_cpumask_create() call.

Alexei proposed BPF_MA_REUSE_AFTER_GP [4] to solve the reuse problem. With BPF_MA_REUSE_AFTER_GP, freed objects are reused only after one RCU grace period and are returned to the slab system only after both one RCU grace period and one RCU-tasks-trace grace period. So bpf programs which care about the reuse problem can use bpf_rcu_read_{lock,unlock}() to access these freed objects safely, and for programs which don't care, any use-after-free is harmless because the objects have not yet been returned to the slab subsystem. I was worried about the possibility of OOM with BPF_MA_REUSE_AFTER_GP, so I proposed BPF_MA_FREE_AFTER_GP [5] to directly return freed objects to the slab system after one RCU grace period and enforce that accesses to these objects are protected by bpf_rcu_read_{lock,unlock}(). But if BPF_MA_FREE_AFTER_GP is used by local storage, it may break existing sleepable programs. Currently I am working on BPF_MA_REUSE_AFTER_GP with Martin; hope to post an RFC soon.

Another problem is potential OOM. The bpf memory allocator is best suited to the pattern alloc, free, alloc, free on the same CPU. That is also the typical pattern for a hash table, but for other use cases the allocator doesn't handle the workload well and may incur OOM. One such use case is batched allocation and batched freeing on the same CPU.
According to [6], for a small hash table the peak memory for this use case can grow to 860MB or more. Another problematic use case is allocation and freeing on different CPUs [6]. Memory usage can easily explode in this case, because there is no reuse and the freed objects can only be returned to the slab system after one RCU-tasks-trace grace period.

I think the potential OOM problem can be attacked in two ways. One is returning freed objects to the slab system in a timely fashion. Some work has been done (e.g., skipping an unnecessary call_rcu when freeing [7]), but I think it is not enough. For example, bpf_global_ma is never destroyed the way the bpf ma in hashtab is, so freed objects may linger in a per-CPU free_by_rcu list and will never be freed if there are no further free operations on that CPU. Also there is no way to limit the memory usage of bpf_global_ma because its usage is accounted under the root memcg, so a shrinker may also be needed to give some memory back to the slab system. Another example is CPU hot-plug: because the bpf memory allocator is a per-CPU allocator, when a CPU goes offline all its freed elements need to be returned to the slab system, and when the CPU comes back online we may need to pre-fill for it.

The other approach is to reuse freed objects whenever possible. One fix [6] has already been made for the batched allocation and freeing case, but for allocation and freeing on different CPUs it seems we may need to share freed objects among multiple CPUs, and do so cheaply.

Not sure whether the issues above are important enough for a session, but I think a discussion on the mailing list would be helpful as well.
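The "return freed objects timely" direction can be sketched as a free cache with a high watermark (a userspace model with made-up names and an arbitrary threshold, not the actual bpf ma code): frees below the watermark are cached for cheap reuse, while the surplus from a batched free goes straight back to the base allocator instead of piling up.

```c
#include <assert.h>
#include <stdlib.h>

#define FREE_HIGH_WATERMARK 4   /* arbitrary threshold for the sketch */

struct free_cache {
	void *objs[FREE_HIGH_WATERMARK];
	int cnt;       /* objects cached for reuse */
	int returned;  /* objects given back to the base allocator */
};

/* Cache the object if below the watermark, otherwise return it to the
 * base allocator right away so batched frees cannot pile up memory. */
static void cache_free(struct free_cache *c, void *obj)
{
	if (c->cnt < FREE_HIGH_WATERMARK) {
		c->objs[c->cnt++] = obj;
	} else {
		free(obj);
		c->returned++;
	}
}

static void *cache_alloc(struct free_cache *c, size_t size)
{
	if (c->cnt > 0)
		return c->objs[--c->cnt];   /* reuse a cached object */
	return malloc(size);
}
```

A real version would also need a periodic or shrinker-driven flush of the cached objects, since (as noted above for free_by_rcu) a cache that is only trimmed on the free path keeps its contents forever once frees stop.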
0: https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@gmail.com/
1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
2: https://lore.kernel.org/bpf/20220924133620.4147153-1-houtao@huaweicloud.com/
3: https://lore.kernel.org/bpf/20230125143816.721952-1-void@manifault.com/
4: https://lore.kernel.org/bpf/CAADnVQKecUqGF-gLFS5Wiz7_E-cHOkp7NPCUK0woHUmJG6hEuA@mail.gmail.com/
5: https://lore.kernel.org/bpf/2a58c4a8-781f-6d84-e72a-f8b7117762b4@huaweicloud.com/
6: https://lore.kernel.org/bpf/20221209010947.3130477-1-houtao@huaweicloud.com/
7: https://lore.kernel.org/bpf/20221014113946.965131-3-houtao@huaweicloud.com/