From: Song Liu <songliubraving@fb.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Masami Hiramatsu <mhiramat@kernel.org>,
"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
"David S. Miller" <davem@davemloft.net>,
Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>,
Kees Cook <keescook@chromium.org>, Song Liu <song@kernel.org>,
bpf <bpf@vger.kernel.org>, Christoph Hellwig <hch@infradead.org>,
Davidlohr Bueso <dave@stgolabs.net>,
lkml <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <Kernel-team@fb.com>,
"x86@kernel.org" <x86@kernel.org>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"rick.p.edgecombe@intel.com" <rick.p.edgecombe@intel.com>,
"linux-modules@vger.kernel.org" <linux-modules@vger.kernel.org>
Subject: Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
Date: Tue, 12 Jul 2022 23:12:22 +0000 [thread overview]
Message-ID: <6CB56563-29E2-4CE0-BF7B-360979E42429@fb.com> (raw)
In-Reply-To: <Ys3FvYnASr2v9iPc@bombadil.infradead.org>
> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>
>>> I believe you are mentioning requiring text_poke() because the way
>>> eBPF code uses the module_alloc() is different. Correct me if I'm
>>> wrong, but from what I gather is you use the text_poke_copy() as the data
>>> is already RO+X, contrary module_alloc() use cases. You do this since your
>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>> module_alloc() and before you can use this memory. This is a different type
>>> of allocator. And, again please correct me if I'm wrong but now you want to
>>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
>>> impact of TLB misses.
>>
>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
>> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
>> and ftrace only uses a fraction of a 4kB page. Most BPF programs and
>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>> much value on top of current module_alloc().
>
> Thanks for the clarification.
>
>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>>
>>> Can kprobes, ftrace and modules use it too? It would be nice
>>> so to not have to deal with the loose semantics on the user to
>>> have to use set_vm_flush_reset_perms() on ro+x later, but
>>> I think this can be addressed separately on a case by case basis.
>>
>> I am pretty confident that kprobe and ftrace can share huge pages with
>> BPF programs.
>
> Then wonderful, we know where to go in terms of a new API then as it
> can be shared in the future for sure and there are gains.
>
>> I haven't looked into all the details with modules, but
>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also
>> possible.
>
> Sure.
>
>> Once this is done, a regular system (without huge BPF program or huge
>> modules) will just use 1x 2MB page for text from module, ftrace, kprobe,
>> and bpf programs.
>
> That would be nice, if possible, however modules will require likely its
> own thing, on my system I see about 57 MiB used on coresize alone.
>
> lsmod | grep -v Module | cut -f1 -d ' ' | \
> xargs sudo modinfo | grep filename | \
> grep -o '/.*' | xargs stat -c "%s - %n" | \
> awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> 60001272
>
> And so perhaps we need such a pool size to be configurable.
>
>>> But a vmalloc_ro_exec() with a respective free can remove the
>>> requirement to do set_vm_flush_reset_perms().
>>
>> Removing the requirement to set_vm_flush_reset_perms() is the other
>> reason to go directly to vmalloc_ro_exec().
>
> Yes fantastic.
>
>> My current version looks like this:
>>
>> void *vmalloc_exec(unsigned long size);
>> void vfree_exec(void *ptr, unsigned int size);
>>
>> ro is eliminated as there is no rw version of the API.
>
> Alright.
>
> I am not sure if 2 MiB will suffice given what I mentioned above, and
> what to do to ensure this grows at a reasonable pace. Then, at least for
> usage for all architectures since not all will support text_poke() we
> will want to consider a way to make it easy to users to use non huge
> page fallbacks, but that would be up to those users, so we can wait for
> that.
We are not limited to 2MiB total. The logic is like:
1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of
these pages are in use.
3. For objects smaller than 2MiB, we will try to fit it in one of these
pages.
3. a) If there isn't a page with big enough continuous free space, we
will allocate a new 2MiB page.
(For system with n NUMA nodes, multiple 2MiB above by n).
So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages.
Thanks,
Song
next prev parent reply other threads:[~2022-07-12 23:12 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-07 22:35 Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
2022-07-07 22:59 ` [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Luis Chamberlain
2022-07-07 23:52 ` Song Liu
2022-07-08 0:53 ` Luis Chamberlain
2022-07-08 1:36 ` Song Liu
2022-07-08 15:58 ` Luis Chamberlain
2022-07-08 19:58 ` Song Liu
2022-07-08 22:24 ` Luis Chamberlain
2022-07-09 1:14 ` Song Liu
2022-07-12 4:18 ` Luis Chamberlain
2022-07-12 4:24 ` Luis Chamberlain
2022-07-12 5:49 ` Song Liu
2022-07-12 19:04 ` Luis Chamberlain
2022-07-12 23:12 ` Song Liu [this message]
2022-07-12 23:42 ` Luis Chamberlain
2022-07-13 1:00 ` Song Liu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6CB56563-29E2-4CE0-BF7B-360979E42429@fb.com \
--to=songliubraving@fb.com \
--cc=Kernel-team@fb.com \
--cc=anil.s.keshavamurthy@intel.com \
--cc=bp@alien8.de \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=dave.hansen@linux.intel.com \
--cc=dave@stgolabs.net \
--cc=davem@davemloft.net \
--cc=hch@infradead.org \
--cc=keescook@chromium.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-modules@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=mhiramat@kernel.org \
--cc=mingo@redhat.com \
--cc=naveen.n.rao@linux.ibm.com \
--cc=peterz@infradead.org \
--cc=rick.p.edgecombe@intel.com \
--cc=rostedt@goodmis.org \
--cc=song@kernel.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox