From: Hou Tao <houtao@huaweicloud.com>
Subject: [LSF/MM/BPF TOPIC] Make bpf memory allocator more robust
To: lsf-pc@lists.linux-foundation.org
Cc: bpf, linux-mm@kvack.org, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, David Vernet, houtao1@huawei.com
Date: Sat, 25 Feb 2023 14:23:29 +0800
Message-ID: <2d29f66f-fcb1-ec76-c74f-d12495a9516f@huaweicloud.com>
bpf memory allocator was introduced in v6.1 by Alexei Starovoitov [0]. Its main purpose is to provide an any-context allocator for bpf programs, which can be attached anywhere (e.g., __kmalloc()) and run in any context (e.g., NMI context through perf_event_overflow()). Before that, only the pre-allocated hash map was usable in these contexts, but it wastes memory because a hash table is typically sparse. Besides the memory saving, the allocator also significantly improves the performance of the dynamically allocated hash map and makes the hash map usable by sleepable programs.

As more use cases of the bpf memory allocator emerge, some problems have surfaced that need to be discussed and fixed.

The first problem is the immediate reuse of elements in the bpf memory allocator. Immediate reuse exists to prevent OOM in the typical usage scenario of the bpf hash map, but it introduces a use-after-free problem [1] for the dynamically allocated hash table (the same problem has existed for the pre-allocated hash table since it was introduced). For a hash table the reuse may be acceptable, but it makes introducing new use cases more difficult. For example, in bpf-qp-trie [2] the internal nodes of the qp-trie are managed by the bpf memory allocator; if an internal node used during lookup is freed and reused, the lookup procedure may panic or return an incorrect result.
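The hazard above can be modeled in a few lines of userspace C (a sketch of the idea only, not kernel code; the `elem`/`demo_*` names are hypothetical): an allocator that puts freed elements straight back on its freelist hands the same memory to the next caller, so a reader that cached a pointer before the free observes the new owner's data.

```c
#include <assert.h>
#include <stddef.h>

/* Toy element with an embedded freelist link, mimicking a cache that
 * immediately reuses freed objects with no grace period for readers. */
struct elem {
	struct elem *next;
	int key;
};

static struct elem pool[2];
static struct elem *free_list;

static void demo_init(void)
{
	pool[0].next = &pool[1];
	pool[1].next = NULL;
	free_list = &pool[0];
}

static struct elem *demo_alloc(void)
{
	struct elem *e = free_list;

	if (e)
		free_list = e->next;
	return e;
}

static void demo_free(struct elem *e)
{
	/* Immediate reuse: the element goes straight back on the
	 * freelist, so the very next allocation returns it again. */
	e->next = free_list;
	free_list = e;
}
```

A lookup that stashed a pointer before the free now reads whatever the element's new owner wrote, which is exactly why a lockless qp-trie lookup over such elements can go wrong.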
I have implemented a qp-trie demo in which two version numbers are added to each internal node to ensure its validity, one saved in the parent node and the other in the node itself, but I am not sure about its correctness. bpf_cpumask, introduced recently [3], is another example: bpf_cpumask_kptr_get() checks the validity of a bpf_cpumask by checking whether its usage count is zero, but it is unclear how it handles the case where the cpumask is freed and then reused by another bpf_cpumask_create() call.

Alexei proposed BPF_MA_REUSE_AFTER_GP [4] to solve the reuse problem. With BPF_MA_REUSE_AFTER_GP, freed objects are reused only after one RCU grace period and are returned to the slab system only after both one RCU grace period and one RCU-tasks-trace grace period. So bpf programs which care about the reuse problem can use bpf_rcu_read_{lock,unlock}() to access these freed objects safely, and for programs which don't care, any use-after-free is harmless because the objects have not yet been returned to the slab subsystem. I was worried about the possibility of OOM with BPF_MA_REUSE_AFTER_GP, so I proposed BPF_MA_FREE_AFTER_GP [5] to directly return freed objects to the slab system after one RCU grace period and enforce that accesses to these objects are protected by bpf_rcu_read_{lock,unlock}(). But if BPF_MA_FREE_AFTER_GP is used by local storage, it may break existing sleepable programs. Currently I am working on BPF_MA_REUSE_AFTER_GP with Martin; hope to post an RFC soon.

Another problem is potential OOM. The bpf memory allocator is best suited to the pattern alloc, free, alloc, free on the same CPU. That is also the typical pattern for a hash table, but for other use cases the allocator doesn't handle the workload well and may incur OOM. One such use case is batched allocation and batched freeing on the same CPU.
According to [6], for a small hash table the peak memory for this use case can grow to 860MB or more. Another problematic use case is allocation and freeing on different CPUs [6]. Memory usage can easily explode in this case, because there is no reuse and the freed objects can only be returned to the slab system after one RCU-tasks-trace grace period.

I think the potential OOM problem can be attacked in two ways. One is returning freed objects to the slab system in a timely fashion. Some work has been done (e.g., skipping an unnecessary call_rcu when freeing [7]), but I think it is not enough. For example, bpf_global_ma is never destroyed the way the bpf ma in hashtab is, so freed objects may linger in a per-CPU free_by_rcu list and will never be freed if there are no further free operations on that CPU. Also there is no way to limit the memory usage of bpf_global_ma because its usage is accounted under the root memcg, so a shrinker may also be needed to give some memory back to the slab system. Another example is CPU hot-plug: because the bpf memory allocator is a per-CPU allocator, when a CPU goes offline all its freed elements need to be returned to the slab system, and when the CPU comes back online we may need to pre-fill for it.

The other approach is to reuse freed objects whenever possible. One fix [6] has already been made for the batched allocation and freeing case, but for allocation and freeing on different CPUs it seems we may need to share freed objects among multiple CPUs, and do so cheaply.

Not sure whether the issues above are important enough for a session, but I think a discussion on the mailing list would be helpful as well.
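The "return freed objects timely" direction can be sketched as a free cache with a high watermark (a userspace model with made-up names and an arbitrary threshold, not the actual bpf ma code): frees below the watermark are cached for cheap reuse, while the surplus from a batched free goes straight back to the base allocator instead of piling up.

```c
#include <assert.h>
#include <stdlib.h>

#define FREE_HIGH_WATERMARK 4   /* arbitrary threshold for the sketch */

struct free_cache {
	void *objs[FREE_HIGH_WATERMARK];
	int cnt;       /* objects cached for reuse */
	int returned;  /* objects given back to the base allocator */
};

/* Cache the object if below the watermark, otherwise return it to the
 * base allocator right away so batched frees cannot pile up memory. */
static void cache_free(struct free_cache *c, void *obj)
{
	if (c->cnt < FREE_HIGH_WATERMARK) {
		c->objs[c->cnt++] = obj;
	} else {
		free(obj);
		c->returned++;
	}
}

static void *cache_alloc(struct free_cache *c, size_t size)
{
	if (c->cnt > 0)
		return c->objs[--c->cnt];   /* reuse a cached object */
	return malloc(size);
}
```

A real version would also need a periodic or shrinker-driven flush of the cached objects, since (as noted above for free_by_rcu) a cache that is only trimmed on the free path keeps its contents forever once frees stop.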
0: https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@gmail.com/
1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
2: https://lore.kernel.org/bpf/20220924133620.4147153-1-houtao@huaweicloud.com/
3: https://lore.kernel.org/bpf/20230125143816.721952-1-void@manifault.com/
4: https://lore.kernel.org/bpf/CAADnVQKecUqGF-gLFS5Wiz7_E-cHOkp7NPCUK0woHUmJG6hEuA@mail.gmail.com/
5: https://lore.kernel.org/bpf/2a58c4a8-781f-6d84-e72a-f8b7117762b4@huaweicloud.com/
6: https://lore.kernel.org/bpf/20221209010947.3130477-1-houtao@huaweicloud.com/
7: https://lore.kernel.org/bpf/20221014113946.965131-3-houtao@huaweicloud.com/