From: Yafang Shao <laoar.shao@gmail.com>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>, Martin Lau <kafai@fb.com>,
Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
john fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Quentin Monnet <quentin@isovalent.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Shakeel Butt <shakeelb@google.com>,
songmuchun@bytedance.com,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
penberg@kernel.org, David Rientjes <rientjes@google.com>,
iamjoonsoo.kim@lge.com, Vlastimil Babka <vbabka@suse.cz>,
Linux MM <linux-mm@kvack.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
Date: Sun, 26 Jun 2022 14:38:00 +0800
Message-ID: <CALOAHbBuA=Nn=-kxXcETr1MvEnaJ1BqNgAHQ5a9Z9vJcR=dbXw@mail.gmail.com>
In-Reply-To: <YrfTR2nfU1ze/sTS@castle>
On Sun, Jun 26, 2022 at 11:32 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote:
> > On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > > After switching to memcg-based bpf memory accounting, bpf memory is
> > > > > charged to the loader's memcg by default, which causes unexpected
> > > > > issues for us. For instance, the loader's container may be restarted
> > > > > after pinning progs and maps, but the bpf memcg is left behind, pinned
> > > > > on the system. Once the loader's new-generation container starts, the
> > > > > leftover pages won't be charged to it. That inconsistency complicates
> > > > > memory resource management for this container.
> > > > >
> > > > > In the past few days, I have proposed two patchsets[1][2] to try to
> > > > > resolve this issue, but both of them require changes to user code,
> > > > > which is a pain for us. This patchset relieves that pain by triggering
> > > > > the recharge in libbpf. It also addresses Roman's critical comments.
> > > > >
> > > > > The key point that lets us avoid changing user code is that there's a
> > > > > reuse path in libbpf. Once the bpf container is restarted, it will try
> > > > > to re-run the required bpf programs; if the bpf programs are the same
> > > > > as the already pinned ones, it will reuse them.
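> > > > >
> > > > > For context, a minimal user-side example of that reuse path (the pin
> > > > > path below is made up):
> > > > >
> > > > >         LIBBPF_OPTS(bpf_object_open_opts, opts,
> > > > >                     .pin_root_path = "/sys/fs/bpf/myapp");
> > > > >         struct bpf_object *obj;
> > > > >
> > > > >         obj = bpf_object__open_file("prog.bpf.o", &opts);
> > > > >         /* On load, maps already pinned under pin_root_path are
> > > > >          * reused (via bpf_map__reuse_fd()) instead of being
> > > > >          * recreated; that is where the recharge can be hooked in
> > > > >          * without changing user code. */
> > > > >         bpf_object__load(obj);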
> > > > >
> > > > > To make sure we either recharge all of them successfully or don't
> > > > > recharge any of them, the recharge process is divided into three steps:
> > > > > - Pre charge to the new generation
> > > > >   To make sure that once we uncharge from the old generation, we can
> > > > >   always charge to the new generation successfully. If we can't pre
> > > > >   charge to the new generation, we won't allow it to be uncharged from
> > > > >   the old generation.
> > > > > - Uncharge from the old generation
> > > > >   After pre charging to the new generation, we can uncharge from the
> > > > >   old generation.
> > > > > - Post charge to the new generation
> > > > >   Finally we can set the pages' memcg_data to the new generation.
> > > > > In the pre charge step we may succeed in charging some addresses but
> > > > > fail on a new one; then we should uncharge the already charged
> > > > > addresses, so an additional recharge-err step is introduced (see the
> > > > > sketch below).
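> > > > >
> > > > > A rough sketch of that flow (the helpers and the addrs array are
> > > > > hypothetical, for illustration only):
> > > > >
> > > > >         /* Sketch: move all allocations of a map from the old memcg
> > > > >          * to the new one, all-or-nothing. */
> > > > >         static int recharge_map(struct bpf_map *map,
> > > > >                                 struct mem_cgroup *new)
> > > > >         {
> > > > >                 int i, err;
> > > > >
> > > > >                 /* Step 1: pre charge to the new generation. */
> > > > >                 for (i = 0; i < map->nr_addrs; i++) {
> > > > >                         err = recharge_precharge(map->addrs[i], new);
> > > > >                         if (err)
> > > > >                                 goto err; /* the recharge-err step */
> > > > >                 }
> > > > >                 /* Step 2: uncharge from the old generation. */
> > > > >                 for (i = 0; i < map->nr_addrs; i++)
> > > > >                         recharge_uncharge(map->addrs[i]);
> > > > >                 /* Step 3: post charge: set the pages' memcg_data
> > > > >                  * to the new generation. */
> > > > >                 for (i = 0; i < map->nr_addrs; i++)
> > > > >                         recharge_commit(map->addrs[i], new);
> > > > >                 return 0;
> > > > > err:
> > > > >                 while (--i >= 0)
> > > > >                         recharge_cancel(map->addrs[i], new);
> > > > >                 return err;
> > > > >         }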
> > > > >
> > > > > This patchset implements recharging for the bpf hash map, which is the
> > > > > map type mostly used by our bpf services. The other map types haven't
> > > > > been implemented yet, and neither have bpf progs.
> > > >
> > > > Without going into the implementation details, the overall approach looks
> > > > ok to me. But it adds complexity and code to several different subsystems,
> > > > and I'm 100% sure it's not worth it if we're talking about partial support
> > > > for a single map type. Are you committed to implementing the recharging
> > > > for all/most map types and progs, and to supporting this code in the future?
> > > >
> > >
> > > I'm planning to support it for all map types and progs. Regarding the
> > > progs, it seems that we have to introduce a new UAPI for the user to
> > > do the recharge, because there's no similar reuse path in libbpf.
> > >
> > > Our company is a heavy bpf user. We have many bpf programs running in
> > > our production environment, including networking bpf,
> > > tracing/profiling bpf, and some other bpf programs which are not
> > > supported in the upstream kernel; for example, we're even trying the
> > > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > > your patchset). Most of the networking bpf, e.g. gateway-bpf,
> > > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> > >
> > > It is a trend that bpf will be introduced into more and more
> > > subsystems, so it is no surprise that a bpf patchset will involve
> > > many subsystems.
> > >
> > > That means I will be continuously active in these areas for the
> > > foreseeable future, several years at least.
> >
> > Ok, I'm glad to hear this. I highly recommend covering more map types
> > and use cases in the next iterations of the patchset.
> >
> > >
> > > [1]. https://lwn.net/Articles/869433/
> > >
> > > > I'm still feeling you trying to solve a userspace problem in the kernel.
> > >
> > > Your feeling can be reduced to a simple question: is a process running
> > > in a container allowed to pin a bpf program? The answer to this simple
> > > question can help us understand whether it is a user bug or a kernel
> > > bug.
> > >
> > > I think you will agree with me that there's definitely no reason to
> > > refuse to let a containerized process pin a bpf program. And then we
> > > find that pinned bpf programs don't cooperate well with memcg. One
> > > kernel feature can't work together with another kernel feature, and
> > > there isn't even an interface for the user to adjust it. The user must
> > > either not pin the bpf program or disable kmemcg. Isn't that a kernel
> > > bug?
> > >
> > > You may wonder why these two features can't cooperate. I will explain
> > > it in detail; it is a long story.
> > >
> > > It should begin with why we introduced bpf pinning. We pin a program
> > > because sometimes the lifecycle of the user application differs from
> > > that of the bpf program, or there's no user agent at all. To keep it
> > > simple, I will take the no-user-agent case (the agent exits after
> > > pinning the bpf program) as an example.
> > >
> > > Now consider what happens if the agent which pins the bpf program has
> > > a memcg. Whether or not the agent destroys the memcg when it exits,
> > > the memcg will not disappear, because it is pinned by the bpf program.
> > > To keep it simple, let's assume the memcg isn't being destroyed, IOW,
> > > it is online.
> > >
> > > Such an online memcg is not populated, yet it keeps being remote
> > > charged (if the map is a non-preallocated bpf map); it looks like a
> > > ghost. Now let's look into the details of what happens to this ghost
> > > memcg.
> > >
> > > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > > Firstly, the memcg will be force charged[2], and I think I don't need
> > > to explain the reason to you.
> > > Even worse, the force charge happens silently, without any memcg
> > > event, because the allocation takes this path:
> > >
> > >         if (!gfpflags_allow_blocking(gfp_mask))
> > >                 goto nomem;
> > >
> > > and then all memcg events are skipped. So at the very least we should
> > > introduce a force-charge event:
> > >
> > >  force:
> > > +       memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> > >         page_counter_charge(&memcg->memory, nr_pages);
> >
> > This is actually a good point, let me try to fix it.
> >
> > >
> > > And then we should allow alloc_htab_elem() to fail:
> > >
> > >         l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > > -                                    GFP_ATOMIC | __GFP_NOWARN,
> > > +                                    __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM |
> > > +                                    __GFP_NOWARN,
> > >                                      htab->map.numa_node);
> >
> > It's not a memcg thing, it was done this way previously. Probably Alexei
> > has a better explanation. Personally, I'm totally fine with removing
> > __GFP_NOWARN, but maybe I just don't know something.
> > > And then we'd better introduce an improvement for memcg:
> > >
> > > +       /*
> > > +        * Should wake up async memcg reclaim first, in case there
> > > +        * will be no direct memcg reclaim for a long time. We could
> > > +        * either introduce async memcg reclaim or modify kswapd to
> > > +        * reclaim a specific memcg.
> > > +        */
> > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > +               wake_up_async_memcg_reclaim();
> > >         if (!gfpflags_allow_blocking(gfp_mask))
> > >                 goto nomem;
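> > >
> > > For illustration, such a wakeup could kick a per-memcg work item; a
> > > sketch only, since neither memcg->reclaim_work nor the helper above
> > > exists today:
> > >
> > >         static void memcg_async_reclaim_work(struct work_struct *work)
> > >         {
> > >                 struct mem_cgroup *memcg =
> > >                         container_of(work, struct mem_cgroup,
> > >                                      reclaim_work);
> > >
> > >                 /* Reclaim from this memcg in the background so a later
> > >                  * atomic charge is less likely to hit the limit and be
> > >                  * force charged. */
> > >                 try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX,
> > >                                              GFP_KERNEL, true);
> > >         }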
> >
> > Hm, I see. It might be an issue if there is no global memory pressure,
> > right? Let me think about what I can do here too.
> >
> > >
> > > And .....
> > >
> > > It's really bad luck that there are so many issues in memcg, but it
> > > may also be that I don't have a deep enough understanding of memcg ...
> > >
> > > I have to clarify that these issues are not caused by memcg-based bpf
> > > accounting, but exposed by it.
> > >
> > > [ Time for lunch here, so I have to stop. ]
> >
> > Thank you for writing this text, it was interesting to follow your
> > thinking. And thank you for bringing up the problems above.
> >
> > Let me be clear: I'm not opposing the idea of recharging, I'm only
> > against introducing hacks for bpf-specific issues which can't be nicely
> > generalized to other use cases and subsystems. That's the only reason
> > why I'm a bit defensive here.
> >
> > In general, as of now, memory cgroups do not provide the ability to
> > recharge accounted objects (with some exceptions from the v1 legacy).
> > This applies both to user and kernel memory. I agree that bpf maps are
> > in some sense unique, as they are potentially large kernel objects with
> > a lot of control from userspace. Is this a good reason to extend the
> > memory cgroup API with a recharging ability? Maybe, but if yes, let's
> > do it well.
> >
> > The big question is how to do it? Memcg accounting is done in a way
> > that requires few changes to the kernel code, right? You just add
> > __GFP_ACCOUNT to the gfp flags and that's it: you get a pointer to an
> > already accounted object. The same applies to uncharging. It works
> > transparently.
> >
> > Recharging is different: a caller should have some sort of ownership
> > over the object (to make sure we are not racing against the reclaim
> > and/or another recharging). And the rules are different for each type
> > of object. It's the caller's duty to make sure all parts of a complex
> > object are properly recharged and nothing is left behind. There is also
> > the reparenting mechanism, which can race against the recharging. So
> > it's not an easy problem. If an object is large, we probably don't want
> > to recharge it at once, otherwise the temporary doubling of the
> > accounted memory (thanks to the precharge-uncharge-commit approach)
> > risks introducing spurious OOMs on memory-limited systems.
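> >
> > For illustration, a chunked variant would bound the overshoot to one
> > batch at a time, at the cost of the all-or-nothing semantics
> > (precharge(), uncharge() and commit_page() are hypothetical helpers):
> >
> >         #define RECHARGE_BATCH 64
> >
> >         static int recharge_chunked(struct page **pages, int nr,
> >                                     struct mem_cgroup *old,
> >                                     struct mem_cgroup *new)
> >         {
> >                 int i, j, err;
> >
> >                 for (i = 0; i < nr; i += RECHARGE_BATCH) {
> >                         int n = min(nr - i, RECHARGE_BATCH);
> >
> >                         /* At most one extra batch is charged at any
> >                          * point in time. */
> >                         err = precharge(new, n);
> >                         if (err)
> >                                 return err; /* earlier batches stay moved */
> >                         uncharge(old, n);
> >                         for (j = 0; j < n; j++)
> >                                 commit_page(pages[i + j], new);
> >                 }
> >                 return 0;
> >         }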
> >
> > So yeah, if it doesn't sound too scary to you, I'm happy to help with
> > this. But it's a lot of work to do it properly; that's why I'm thinking
> > that maybe it's better to work around it in userspace, as Alexei
> > suggested.
>
> And as Alexei mentioned, there are some changes coming in how memory
> allocations will usually be performed by the bpf code, which might make
> the whole problem and/or your solution obsolete. Please make sure it's
> still relevant before sending the next version.
>
I have taken a look at Alexei's patchset.
For non-preallocated bpf memory we still have to use map->memcg to do
the accounting, so this issue will still exist.
It also reminds us that when we reuse a map we must make sure
map->memcg is the expected one.
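
For example, the reuse path could first check whether the map is still
charged to a stale memcg before recharging; a minimal sketch (the helper
is made up, but get_mem_cgroup_from_mm() is what bpf_map_save_memcg()
uses today, and map->memcg exists under CONFIG_MEMCG_KMEM):

        /* Sketch: recharge only when the map's owner memcg differs from
         * the current task's memcg. */
        static bool bpf_map_memcg_is_stale(struct bpf_map *map)
        {
                struct mem_cgroup *cur = get_mem_cgroup_from_mm(current->mm);
                bool stale = map->memcg != cur;

                mem_cgroup_put(cur);
                return stale;
        }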
--
Regards
Yafang