From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9BAF3CCA47B
	for <linux-mm@archiver.kernel.org>; Fri,  8 Jul 2022 20:14:32 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 207886B0072; Fri,  8 Jul 2022 16:14:32 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 1B7EE6B0073; Fri,  8 Jul 2022 16:14:32 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 07FC06B0074; Fri,  8 Jul 2022 16:14:32 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id E74D96B0072
	for <linux-mm@kvack.org>; Fri,  8 Jul 2022 16:14:31 -0400 (EDT)
Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay11.hostedemail.com (Postfix) with ESMTP id A5AED8038C
	for <linux-mm@kvack.org>; Fri,  8 Jul 2022 20:14:31 +0000 (UTC)
X-FDA: 79665035142.01.46C1E26
Received: from mail-wr1-f44.google.com (mail-wr1-f44.google.com [209.85.221.44])
	by imf18.hostedemail.com (Postfix) with ESMTP id E9D881C0031
	for <linux-mm@kvack.org>; Fri,  8 Jul 2022 20:14:30 +0000 (UTC)
Received: by mail-wr1-f44.google.com with SMTP id bk26so16841580wrb.11
        for <linux-mm@kvack.org>; Fri, 08 Jul 2022 13:14:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=0lrp7KZxWmFlFtxCCrcoVbLJIOY0LO9e4z7L5GiyJ+4=;
        b=SDt/23BtJu5iHF14giz3QTuBSW5MgTJOTSH4eGLDV/iG3jJ4w78m+XvEBZ6zSiM9lR
         vCZY7waOxz4n2vyNPy16BwchTvXeRs0pYaM5ncAIWTVvenOJ6ZrjamDfeDFLGoQdcH3r
         6Y7KDbRyEZz2fXLLWU15YAfsARGGE6aTBwt5sv7F4DXR5Im1Sqy2BUXYKNQ7j7a6T8XE
         uvv+jKUxC/gwQczya3fWpK7oVMsf/3YuqEDoHMh2TK26M3L9fS/Du4n46dpptt0UunKL
         VmGQNiicgmuRLbHKyL4feEk+DynH81CCZlPN4ik2ne3zyKv8bvxPVEOe+BRa2O92nxaB
         2d6w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=0lrp7KZxWmFlFtxCCrcoVbLJIOY0LO9e4z7L5GiyJ+4=;
        b=iLfwKLeBU2zu1dz6mmnr9ltTHBuOYOlSELEGa2lMt6n/tHFOVxKaFH0e5Qvsp4GL6b
         IhmDPIE+0ld1+asILy10hdK6va7fnlsOMfTAYqkCcy8X/CI1EBZiS6VCPhJ0FE7yKYP3
         cgbcY6B0AbJML1w0NZHRklgjLFjwwOUXoeKuQ0SsQ1XF0PYhhMxOno9wgWJh4RHmctiV
         bUz2qbjwsivl9ueAmk2AifBB/S1CD6OdJdYIIFtdWJGngilbwxcwVruWiEd5JmOvXRWf
         dcTqSST0QyzOak362OR1iBOnDaYzjTLxq5NTuA6N7T1qQZfPU7bCy/BKNCC+/AU5q4UT
         9fmw==
X-Gm-Message-State: AJIora9PKs1IMNMYOSVga8lpdjRYfORwkH82D8d093Jeu3xZo24kyrOE
	n/7csbX5LVJ1WS081gFmqL/plKagPB+/MAbg/1R7mQ==
X-Google-Smtp-Source: AGRyM1uPg2Vv0NJVF9Fa/lGyzMhnhz9eclYnaJtA7FDjUVK7yhE6fda1M2nB53t5K2BfkZgKoQLi0oI+EGi6f6wc8oo=
X-Received: by 2002:a05:6000:a1e:b0:21b:8c8d:3cb5 with SMTP id
 co30-20020a0560000a1e00b0021b8c8d3cb5mr5000863wrb.372.1657311269476; Fri, 08
 Jul 2022 13:14:29 -0700 (PDT)
MIME-Version: 1.0
References: <20220623003230.37497-1-alexei.starovoitov@gmail.com>
 <YrlWLLDdvDlH0C6J@infradead.org> <YsNOzwNztBsBcv7Q@casper.infradead.org>
 <20220706175034.y4hw5gfbswxya36z@MacBook-Pro-3.local> <YsXMmBf9Xsp61I0m@casper.infradead.org>
 <20220706180525.ozkxnbifgd4vzxym@MacBook-Pro-3.local.dhcp.thefacebook.com>
 <Ysg0GyvqUe0od2NN@dhcp22.suse.cz> <20220708174858.6gl2ag3asmoimpoe@macbook-pro-3.dhcp.thefacebook.com>
In-Reply-To: <20220708174858.6gl2ag3asmoimpoe@macbook-pro-3.dhcp.thefacebook.com>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 8 Jul 2022 13:13:53 -0700
Message-ID: <CAJD7tkZ5mh87uO7jZg3hySe1sjFfHsE4xSSg_2SmzpPmwVcMDQ@mail.gmail.com>
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>, Matthew Wilcox <willy@infradead.org>, 
	Christoph Hellwig <hch@infradead.org>, davem@davemloft.net, 
	Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Tejun Heo <tj@kernel.org>, 
	Martin KaFai Lau <kafai@fb.com>, bpf <bpf@vger.kernel.org>, Kernel Team <kernel-team@fb.com>, 
	Linux-MM <linux-mm@kvack.org>, Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>, 
	David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, 
	Andrew Morton <akpm@linux-foundation.org>, Vlastimil Babka <vbabka@suse.cz>
Content-Type: text/plain; charset="UTF-8"
ARC-Authentication-Results: i=1;
	imf18.hostedemail.com;
	dkim=pass header.d=google.com header.s=20210112 header.b="SDt/23Bt";
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.221.44 as permitted sender) smtp.mailfrom=yosryahmed@google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1657311271;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=0lrp7KZxWmFlFtxCCrcoVbLJIOY0LO9e4z7L5GiyJ+4=;
	b=Yv/Nt1fc/PMjpRCGYCsbS3iXbqag8+PJrnn3LYswlYreKuOc9f/vjw1J72r6+yqDgOYS1y
	oikjMFDoPuzJcI6ls0qJ9mwNGwvrMNKBUE31YrCrF/fLRsgekF6fvGSOEi2qZq0B3gy/uT
	EXa6SBOHPLzGmnvbfyK7m5xb9Jggq1w=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1657311271; a=rsa-sha256;
	cv=none;
	b=Z57TP1aMtD+of1NPgHagBlD5K/nmiDCxuFWX1+++buWqjvUN1wy6q2DzI2xCK5SpU69Uxq
	1dLFpkR92EmtRyvOBURLOvkoHOj64Mk20PkS0tq5VoibXzCC7qVcjwHogUQ7PPHSLnp4HZ
	TvHRL8sL8aV4+lxnmDBfe8jY+IcfiiU=
X-Stat-Signature: rxbfchmcpf1akew916fdqcx8rpuaspz1
X-Rspamd-Server: rspam04
X-Rspamd-Queue-Id: E9D881C0031
Authentication-Results: imf18.hostedemail.com;
	dkim=pass header.d=google.com header.s=20210112 header.b="SDt/23Bt";
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.221.44 as permitted sender) smtp.mailfrom=yosryahmed@google.com
X-Rspam-User: 
X-HE-Tag: 1657311270-414098
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Jul 8, 2022 at 10:49 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
> > On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> > > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> > [...]
> > > > For example, I assume that a BPF program
> > > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > > Right?
> > >
> > > No. It's constrained by memcg limits only. It can allocate gigabytes.
> >
> > I have very briefly had a look at the core allocator parts (please note
> > that my understanding of BPF is really close to zero so I might be
> > missing a lot of implicit stuff). So by constrained by memcg you mean
> > __GFP_ACCOUNT done from the allocation context (irq_work). The complete
> > gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT
> > which means this allocation is not allowed to sleep and GFP_ATOMIC
> > implies __GFP_HIGH to say that access to memory reserves is allowed.
> > Memcg charging code interprets this that the hard limit can be breached
> > under assumption that these are rare and will be compensated in some
> > way. The bulk allocator implemented here, however, doesn't reflect that
> > and continues allocating as it sees a success so the breach of the limit
> > is only bound by the number of objects to be allocated. If those can be
> > really large then this is a clear problem and __GFP_HIGH usage is not
> > really appropriate.
>
> That was a copy paste from the networking stack. See kmalloc_reserve().
> Not sure whether it's a bug there or not.
> In a separate thread we've agreed to convert all of bpf allocations
> to GFP_NOWAIT. For this patch set I've already fixed it in my branch.
>
> > Also, I do not see any tracking of the overall memory sitting in these
> > pools and I think this would be really appropriate. As there doesn't
> > seem to be any reclaim mechanism implemented this can hide quite some
> > unreachable memory.
> >
> > Finally it is not really clear to what kind of entity is the life time
> > of these caches bound to. Let's say the system goes OOM, is any process
> > responsible for it and a clean up would be done if it gets killed?
>
> We've been asking these questions for years and have been trying to
> come up with a solution.
> bpf progs are not analogous to user space processes.
> There are bpf progs that function completely without user space component.
> bpf progs are pretty close to be full featured kernel modules with
> the difference that bpf progs are safe, portable and users have
> full visibility into them (source code, line info, type info, etc)
> They are not binary blobs unlike kernel modules.
> But from OOM perspective they're pretty much like .ko-s.
> Which kernel module would you force unload when system is OOMing ?
> Force unloading ko-s will likely crash the system.
> Force unloading bpf progs maybe equally bad. The system won't crash,
> but it may be a sorrow state. The bpf could have been doing security
> enforcement or network firewall or providing key insights to critical
> user space components like systemd or health check daemon.
> We've been discussing ideas on how to rank and auto cleanup
> the system state when progs have to be unloaded. Some sort of
> destructor mechanism. Fingers crossed we will have it eventually.
> bpf infra keeps track of everything, of course.
> Technically we can detach, unpin and unload everything and all memory
> will be returned back to the system.
> Anyhow not a new problem. Orthogonal to this patch set.
> bpf progs have been doing memory allocation from day one. 8 years ago.
> This patch set is trying to make it 100% safe.
> Currently it's 99% safe.
>

I think part of Michal's concern here is about memory sitting in the
allocator's caches that is not yet backing any bpf allocation. I haven't
looked at the patches, so I can't say for sure, but if the amount of
cached memory in the bpf allocator is significant, then maybe it's worth
reclaiming it on memory pressure? Just thinking out loud.
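
To make the idea a bit more concrete, here is a rough userspace sketch
in plain C. It is not kernel code and it is not based on the patches.
It only models the count/scan split that kernel shrinkers use: one
callback reports how many idle objects could be freed, the other frees
up to a requested number of them. All names here (obj_cache,
cache_init, cache_refill, cache_count, cache_scan) are made up for
illustration, and malloc()/free() stand in for whatever the real
allocator would call.

```c
/* Userspace model only: NOT kernel code, NOT the API from the patch set. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One cached object that is currently not handed out to any user. */
struct free_obj {
	struct free_obj *next;
};

/* Hypothetical cache of prefilled objects of a single size. */
struct obj_cache {
	struct free_obj *free_list;	/* idle objects only */
	size_t nr_free;			/* length of free_list */
	size_t obj_size;		/* bytes per object */
};

static void cache_init(struct obj_cache *c, size_t obj_size)
{
	/* Each object must be able to hold the free-list link. */
	if (obj_size < sizeof(struct free_obj))
		obj_size = sizeof(struct free_obj);
	c->free_list = NULL;
	c->nr_free = 0;
	c->obj_size = obj_size;
}

/* Prefill up to n objects; returns how many were actually added. */
static size_t cache_refill(struct obj_cache *c, size_t n)
{
	size_t added = 0;

	while (added < n) {
		struct free_obj *o = malloc(c->obj_size);

		if (!o)
			break;	/* stop on the first failure */
		o->next = c->free_list;
		c->free_list = o;
		c->nr_free++;
		added++;
	}
	return added;
}

/* "count" side: how many objects could be reclaimed right now. */
static size_t cache_count(const struct obj_cache *c)
{
	return c->nr_free;
}

/* "scan" side: free up to nr_to_scan idle objects, return number freed. */
static size_t cache_scan(struct obj_cache *c, size_t nr_to_scan)
{
	size_t freed = 0;

	while (freed < nr_to_scan && c->free_list) {
		struct free_obj *o = c->free_list;

		c->free_list = o->next;
		free(o);
		assert(c->nr_free > 0);
		c->nr_free--;
		freed++;
	}
	return freed;
}
```

In this model, something reacting to memory pressure would call
cache_count() to see how much idle memory there is and then
cache_scan() with however many objects it wants back. Only objects on
the free list are touched, so anything already in use by a bpf
allocation stays where it is. Whether this is worth doing depends on
how much memory actually sits in those caches, which I don't know.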