From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 648AEC43334
	for <linux-mm@archiver.kernel.org>; Wed,  6 Jul 2022 22:54:15 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id BC7B56B0072; Wed,  6 Jul 2022 18:54:14 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id B78156B0073; Wed,  6 Jul 2022 18:54:14 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id A68186B0074; Wed,  6 Jul 2022 18:54:14 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id 975656B0072
	for <linux-mm@kvack.org>; Wed,  6 Jul 2022 18:54:14 -0400 (EDT)
Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay08.hostedemail.com (Postfix) with ESMTP id 623AF2058D
	for <linux-mm@kvack.org>; Wed,  6 Jul 2022 22:54:14 +0000 (UTC)
X-FDA: 79658180028.16.80B0097
Received: from out1.migadu.com (out1.migadu.com [91.121.223.63])
	by imf26.hostedemail.com (Postfix) with ESMTP id 853A514003D
	for <linux-mm@kvack.org>; Wed,  6 Jul 2022 22:54:13 +0000 (UTC)
Date: Wed, 6 Jul 2022 15:54:04 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1657148051;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=Lz2FbgkzXpsYglBWracLt+nXVup0eMNxn/M0c2hWXBo=;
	b=aEVXNbJOOFMBLc3x5punvY+VCqkzi+opHiitMRm6g4iBfNxaTdjuNUudt+q4NGH9uipUU+
	nGJmNdf467i9mwGGK+vL1CdgMDbC/mVLYVsbph7sHtvKHeO0Vgr0OrGYupod2dtxxMfphA
	DUoV1J8NR4Qv+seprN+YcvMAQtMNGMg=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Quentin Monnet <quentin@isovalent.com>, Hao Luo <haoluo@google.com>,
	bpf <bpf@vger.kernel.org>, linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Make non-preallocated allocation
 low priority
Message-ID: <YsYSjAUGUkERUZ4q@castle>
References: <20220706155848.4939-1-laoar.shao@gmail.com>
 <20220706155848.4939-2-laoar.shao@gmail.com>
 <CAADnVQJEK+Puyz8b4eUV3H7Z+OtrvHd4MU42OsPiBodMQxEw-g@mail.gmail.com>
 <YsXd2Tah+irhth9t@castle>
 <CAADnVQ+c_2Q6GxH3E0iD0RkOy2H2-UhuYL4V3v2BTQ6sZNxQAA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAADnVQ+c_2Q6GxH3E0iD0RkOy2H2-UhuYL4V3v2BTQ6sZNxQAA@mail.gmail.com>
X-Migadu-Flow: FLOW_OUT
X-Migadu-Auth-User: linux.dev
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1657148053;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=Lz2FbgkzXpsYglBWracLt+nXVup0eMNxn/M0c2hWXBo=;
	b=kvN5lonh9b87KP8263ShEqD7wRch9yR4G73ysCaoq7uCc50OzXlfj5wFNspLh8PGZwSHAp
	YL8UszvEdVnFbb0AKQR+fWZ8eMn9mgjHOOoBazpMTrXcDgIWDZ3gHJ8sZPqdioue52tnw7
	OCL+xQbLJktVyteHrFYuOUpQyXFr9BM=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1657148053; a=rsa-sha256;
	cv=none;
	b=yTXGOejAHsZGqAm1f2TlRE3vaps/QzMX2cxWKA0gdNoQo94OOH5EBX3SWXjyV/yGJsc5hk
	dnGVpARxhNcuP3gGGUWquzNnznkRpPAieQBoKivTTXUZ0BhEbt8dAGl0mTkruiyJFgd0+J
	Y/WsgjBHwf0CuVDagMR69uwVrzjxPFg=
ARC-Authentication-Results: i=1;
	imf26.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=aEVXNbJO;
	dmarc=pass (policy=none) header.from=linux.dev;
	spf=pass (imf26.hostedemail.com: domain of roman.gushchin@linux.dev designates 91.121.223.63 as permitted sender) smtp.mailfrom=roman.gushchin@linux.dev
X-Rspam-User: 
Authentication-Results: imf26.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=aEVXNbJO;
	dmarc=pass (policy=none) header.from=linux.dev;
	spf=pass (imf26.hostedemail.com: domain of roman.gushchin@linux.dev designates 91.121.223.63 as permitted sender) smtp.mailfrom=roman.gushchin@linux.dev
X-Rspamd-Server: rspam06
X-Rspamd-Queue-Id: 853A514003D
X-Stat-Signature: d3a97rzfq9fpt91cttq77nbejsth4kpm
X-HE-Tag: 1657148053-843077
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed, Jul 06, 2022 at 03:11:46PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 6, 2022 at 12:09 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Jul 06, 2022 at 09:47:32AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 6, 2022 at 8:59 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > GFP_ATOMIC doesn't cooperate well with memcg pressure so far, especially
> > > > if we allocate too much GFP_ATOMIC memory. For example, when we set the
> > > > memcg limit to limit a non-preallocated bpf memory, the GFP_ATOMIC can
> > > > easily break the memcg limit by force charge. So it is very dangerous to
> > > > use GFP_ATOMIC in non-preallocated case. One way to make it safe is to
> > > > remove __GFP_HIGH from GFP_ATOMIC, IOW, use (__GFP_ATOMIC |
> > > > __GFP_KSWAPD_RECLAIM) instead, then it will be limited if we allocate
> > > > too much memory.
> > > >
> > > > We introduced BPF_F_NO_PREALLOC is because full map pre-allocation is
> > > > too memory expensive for some cases. That means removing __GFP_HIGH
> > > > doesn't break the rule of BPF_F_NO_PREALLOC, but has the same goal with
> > > > it-avoiding issues caused by too much memory. So let's remove it.
> > > >
> > > > The force charge of GFP_ATOMIC was introduced in
> > > > commit 869712fd3de5 ("mm: memcontrol: fix network errors from failing
> > > > __GFP_ATOMIC charges") by checking __GFP_ATOMIC, then got improved in
> > > > commit 1461e8c2b6af ("memcg: unify force charging conditions") by
> > > > checking __GFP_HIGH (that is no problem because both __GFP_HIGH and
> > > > __GFP_ATOMIC are set in GFP_AOMIC). So, if we want to fix it in memcg,
> > > > we have to carefully verify all the callsites. Now that we can fix it in
> > > > BPF, we'd better not modify the memcg code.
> > > >
> > > > This fix can also apply to other run-time allocations, for example, the
> > > > allocation in lpm trie, local storage and devmap. So let fix it
> > > > consistently over the bpf code
> > > >
> > > > __GFP_KSWAPD_RECLAIM doesn't cooperate well with memcg pressure neither
> > > > currently. But the memcg code can be improved to make
> > > > __GFP_KSWAPD_RECLAIM work well under memcg pressure if desired.
> > >
> > > Could you elaborate ?
> > >
> > > > It also fixes a typo in the comment.
> > > >
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
> > >
> > > Roman, do you agree with this change ?
> >
> > Yes, removing __GFP_HIGH makes sense to me. I can imagine we might want
> > it for *some* bpf allocations, but applying it unconditionally looks wrong.
> 
> Yeah. It's a difficult trade-off to make without having the data
> to decide whether removing __GFP_HIGH can cause issues or not,

Yeah, the change looks reasonable, but it's hard to say without giving
it a good testing in (something close to) a production environment.

> but do you agree that __GFP_HIGH doesn't cooperate well with memcg ?
> If so it's a bug on memcg side, right?

No. Historically we allowed high-prio allocations to exceed the memcg limit
because otherwise there were too many stability and performance issues.
It's not a memcg bug, it's a way to avoid exposing ENOMEM handling bugs all over
the kernel code. Without memory cgroups GFP_ATOMIC allocations rarely fail
and a lot of code paths in the kernel are not really ready for it (at least
it was the case several years ago, maybe things are better now).

But it was usually thought in the context of small(ish) allocations which do not
change the global memory usage picture. Subsequent "normal" allocations are
triggering reclaim/OOM, so from a user's POV the limit works as expected.

But with the ownership model and size of bpf maps it's a different story:
if a bpf map belongs to an abandoned cgroup, it might consume a lot of memory
and there will be no "normal" allocations. So cgroup memory limit will be never
applied. It's a valid issue, I agree with Yafang here.

> but we should probably
> apply this band-aid on bpf side to fix the bleeding.
> Later we can add a knob to allow __GFP_HIGH usage on demand from
> bpf prog.

Yes, it sounds like a good idea. I have to think what's the best approach
here, it's not obvious for me.

Thanks!