linux-mm.kvack.org archive mirror
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	davem@davemloft.net, daniel@iogearbox.net, andrii@kernel.org,
	tj@kernel.org, kafai@fb.com, bpf@vger.kernel.org,
	kernel-team@fb.com, linux-mm@kvack.org,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
Date: Fri, 8 Jul 2022 10:48:58 -0700
Message-ID: <20220708174858.6gl2ag3asmoimpoe@macbook-pro-3.dhcp.thefacebook.com>
In-Reply-To: <Ysg0GyvqUe0od2NN@dhcp22.suse.cz>

On Fri, Jul 08, 2022 at 03:41:47PM +0200, Michal Hocko wrote:
> On Wed 06-07-22 11:05:25, Alexei Starovoitov wrote:
> > On Wed, Jul 06, 2022 at 06:55:36PM +0100, Matthew Wilcox wrote:
> [...]
> > > For example, I assume that a BPF program
> > > has a fairly tight limit on how much memory it can cause to be allocated.
> > > Right?
> > 
> > No. It's constrained by memcg limits only. It can allocate gigabytes.
>  
> I have very briefly had a look at the core allocator parts (please note
> that my understanding of BPF is really close to zero so I might be
> missing a lot of implicit stuff). So by constrained by memcg you mean
> __GFP_ACCOUNT done from the allocation context (irq_work). The complete
> gfp mask is GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_ACCOUNT
> which means this allocation is not allowed to sleep and GFP_ATOMIC
> implies __GFP_HIGH to say that access to memory reserves is allowed.
> Memcg charging code interprets this that the hard limit can be breached
> under assumption that these are rare and will be compensated in some
> way. The bulk allocator implemented here, however, doesn't reflect that
> and continues allocating as it sees a success so the breach of the limit
> is only bound by the number of objects to be allocated. If those can be
> really large then this is a clear problem and __GFP_HIGH usage is not
> really appropriate.
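
To make the concern above concrete, here is a minimal userspace sketch of
a bulk refill loop that keeps going as long as each charge succeeds. It is
a toy model, not the kernel's memcg code and not the allocator from this
patch set: struct toy_memcg, toy_charge_forced(), toy_bulk_refill() and
toy_overshoot() are hypothetical names, and the "charge always succeeds"
behaviour is only an assumption standing in for the forced charge
described in the quoted text.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a memcg counter; not the kernel's data structure. */
struct toy_memcg {
	size_t usage;	/* bytes currently charged */
	size_t limit;	/* hard limit in bytes */
};

/*
 * Assumption: models a charge that is forced through even when the
 * limit is already reached, so it never reports failure.
 */
static int toy_charge_forced(struct toy_memcg *cg, size_t bytes)
{
	cg->usage += bytes;
	return 0;
}

/*
 * Bulk refill that continues while each allocation "succeeds".
 * Because the charge never fails, only 'count' bounds the loop.
 */
static size_t toy_bulk_refill(struct toy_memcg *cg, size_t obj_size,
			      size_t count)
{
	size_t done = 0;

	assert(cg != NULL);
	while (done < count && toy_charge_forced(cg, obj_size) == 0)
		done++;
	return done;
}

/* Bytes charged above the hard limit, or 0 if within the limit. */
static size_t toy_overshoot(const struct toy_memcg *cg)
{
	return cg->usage > cg->limit ? cg->usage - cg->limit : 0;
}
```

Starting at the limit (usage == limit == 4096) and refilling 64 objects of
256 bytes, all 64 charges go through and toy_overshoot() reports
64 * 256 = 16384 bytes, i.e. the breach scales with the number of objects
requested and nothing else.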

That was a copy-paste from the networking stack. See kmalloc_reserve().
Not sure whether it's a bug there or not.
In a separate thread we've agreed to convert all of bpf allocations
to GFP_NOWAIT. For this patch set I've already fixed it in my branch.
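
For readers less familiar with the flags, the difference between the two
masks can be sketched in plain C. This is an illustration only: the bit
values and the toy_* names are made up, the real definitions live in the
kernel's gfp headers and vary by version (kernels from around this time
also OR an extra __GFP_ATOMIC bit into GFP_ATOMIC, omitted here), and
keeping the other three flags unchanged after the conversion is an
assumption, not something stated in this thread.

```c
#include <stdbool.h>

/* Illustrative bit values only; NOT the kernel's actual values. */
typedef unsigned int toy_gfp_t;

#define TOY_GFP_HIGH		0x01u	/* may dip into memory reserves */
#define TOY_GFP_KSWAPD_RECLAIM	0x02u	/* may wake kswapd */
#define TOY_GFP_DIRECT_RECLAIM	0x04u	/* may sleep in direct reclaim */
#define TOY_GFP_NOMEMALLOC	0x08u
#define TOY_GFP_NOWARN		0x10u
#define TOY_GFP_ACCOUNT		0x20u	/* charge to memcg */

/* GFP_ATOMIC implies __GFP_HIGH; GFP_NOWAIT does not. */
#define TOY_GFP_ATOMIC	(TOY_GFP_HIGH | TOY_GFP_KSWAPD_RECLAIM)
#define TOY_GFP_NOWAIT	(TOY_GFP_KSWAPD_RECLAIM)

/* Mask quoted in the review above. */
static toy_gfp_t toy_mask_before(void)
{
	return TOY_GFP_ATOMIC | TOY_GFP_NOMEMALLOC | TOY_GFP_NOWARN |
	       TOY_GFP_ACCOUNT;
}

/* Assumed mask after converting to GFP_NOWAIT (other flags kept). */
static toy_gfp_t toy_mask_after(void)
{
	return TOY_GFP_NOWAIT | TOY_GFP_NOMEMALLOC | TOY_GFP_NOWARN |
	       TOY_GFP_ACCOUNT;
}

static bool toy_may_use_reserves(toy_gfp_t gfp)
{
	return (gfp & TOY_GFP_HIGH) != 0;
}

static bool toy_may_sleep(toy_gfp_t gfp)
{
	return (gfp & TOY_GFP_DIRECT_RECLAIM) != 0;
}
```

Neither mask may sleep, so both remain usable from irq_work; the only
behavioural difference modelled here is that the GFP_NOWAIT variant no
longer asks for access to memory reserves.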

> Also, I do not see any tracking of the overall memory sitting in these
> pools and I think this would be really appropriate. As there doesn't
> seem to be any reclaim mechanism implemented this can hide quite some
> unreachable memory.
> 
> Finally it is not really clear to what kind of entity is the life time
> of these caches bound to. Let's say the system goes OOM, is any process
> responsible for it and a clean up would be done if it gets killed?

We've been asking these questions for years and have been trying to
come up with a solution.
bpf progs are not analogous to user space processes.
There are bpf progs that function completely without a user space component.
bpf progs are pretty close to being full-featured kernel modules, with
the difference that bpf progs are safe, portable and users have
full visibility into them (source code, line info, type info, etc).
They are not binary blobs, unlike kernel modules.
But from the OOM perspective they're pretty much like .ko-s.
Which kernel module would you force unload when the system is OOMing?
Force unloading ko-s will likely crash the system.
Force unloading bpf progs may be equally bad. The system won't crash,
but it may be left in a sorry state. The bpf prog could have been doing
security enforcement or network firewalling, or providing key insights to
critical user space components like systemd or a health check daemon.
We've been discussing ideas on how to rank and auto cleanup
the system state when progs have to be unloaded. Some sort of
destructor mechanism. Fingers crossed we will have it eventually.
bpf infra keeps track of everything, of course.
Technically we can detach, unpin and unload everything and all memory
will be returned to the system.
Anyhow, this is not a new problem. It is orthogonal to this patch set.
bpf progs have been doing memory allocation from day one, 8 years ago.
This patch set is trying to make it 100% safe.
Currently it's 99% safe.

