From: Song Liu <songliubraving@fb.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Luis Chamberlain <mcgrof@kernel.org>, Song Liu <song@kernel.org>,
bpf <bpf@vger.kernel.org>,
Linux Memory Management List <linux-mm@kvack.org>,
open list <linux-kernel@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <Kernel-team@fb.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
Date: Sat, 16 Apr 2022 22:26:08 +0000 [thread overview]
Message-ID: <B995F7EB-2019-4290-9C09-AE19C5BA3A70@fb.com> (raw)
In-Reply-To: <CAHk-=wiMCndbBvGSmRVvsuHFWC6BArv-OEG2Lcasih=B=7bFNQ@mail.gmail.com>
Hi Linus,
Thanks a lot for your kind reply.
> On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote:
>>
>> Based on this analysis, I think we should either
>> 1) ship the whole set with 5.18; or
>> 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19.
>
> Honestly, I think the proper thing to do is
>
> - apply #1, because yes, that "use huge pages" should be an opt-in.
>
> - but just disable hugepages for now.
Hmm.. This sure is an option...
>
> I think those games with set_memory_nx() and friends just show how
> rough this all is right now.
>
> In fact, I personally think that the whole bpf 'prog_pack' stuff
> should probably be disabled. It looks incredible broken to me right
> now.
>
> Unless I mis-read it, it does a "module_alloc()" to allocate the vmap
> area, and then just marks it executable without having even
> initialized the pages. Am I missing something? So now we have random
> kernel memory that is marked executable.
>
> Sure, it's also marked RO, but who cares? It's random data that is now
> executable.
While this is a serious issue (my fault), it is relatively easy to fix.
We can fill the whole space with invalid instructions when the page
is allocated and when a BPF program is freed. Both these operations are
on the slow paths, so the overhead is minimal.
>
> Maybe I am missing something, but I really don't think this is ready
> for prime-time. We should effectively disable it all, and have people
> think through it a lot more.
This has been discussed on lwn.net: https://lwn.net/Articles/883454/.
AFAICT, the biggest concern is whether reserving minimal 2MB for BPF
programs is a good trade-off for memory usage. This is again my fault
not to state the motivation clearly: the primary gain comes from less
page table fragmentation and thus better iTLB efficiency.
Other folks (in recent thread on this topic and offline in other
discussions) also showed strong interests in using similar technical
for text of kernel modules. So I would really like to learn your
opinion on this. There are many details we can optimize, but I guess
the general mechanism has to be something like:
- allocate a huge page, make it safe, and set it as executable;
- as users (BPF, kernel module, etc.) request memory for text, give
a chunk of the huge page to the user.
- use some mechanism to update the chunk of memory safely.
As you correctly noted, I missed the "make it safe" step (which
should be easy to fix). But the rest of the flow is more or less the
best solution to me.
Did I totally miss some better option here? Or is this really not a
viable option (IOW, bpf programs and kernel modules have to use
regular pages)?
Thanks again,
Song
next prev parent reply other threads:[~2022-04-16 22:26 UTC|newest]
Thread overview: 61+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-15 16:44 Song Liu
2022-04-15 16:44 ` [PATCH v4 bpf 1/4] vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP Song Liu
2022-04-15 17:43 ` Rik van Riel
2022-04-15 16:44 ` [PATCH v4 bpf 2/4] page_alloc: use vmalloc_huge for large system hash Song Liu
2022-04-15 17:43 ` Rik van Riel
2022-04-25 7:07 ` Geert Uytterhoeven
2022-04-25 8:17 ` Linus Torvalds
2022-04-25 8:24 ` Geert Uytterhoeven
2022-04-15 16:44 ` [PATCH v4 bpf 3/4] module: introduce module_alloc_huge Song Liu
2022-04-15 18:06 ` Rik van Riel
2022-06-16 16:10 ` Dave Hansen
2022-04-15 16:44 ` [PATCH v4 bpf 4/4] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
2022-04-15 19:05 ` [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP Luis Chamberlain
2022-04-16 1:34 ` Song Liu
2022-04-16 1:42 ` Luis Chamberlain
2022-04-16 1:43 ` Luis Chamberlain
2022-04-16 5:08 ` Christoph Hellwig
2022-04-16 19:55 ` Song Liu
2022-04-16 20:30 ` Linus Torvalds
2022-04-16 22:26 ` Song Liu [this message]
2022-04-18 10:06 ` Mike Rapoport
2022-04-19 0:44 ` Luis Chamberlain
2022-04-19 1:56 ` Edgecombe, Rick P
2022-04-19 5:36 ` Song Liu
2022-04-19 18:42 ` Mike Rapoport
2022-04-19 19:20 ` Linus Torvalds
2022-04-20 2:03 ` Alexei Starovoitov
2022-04-20 2:18 ` Linus Torvalds
2022-04-20 14:42 ` Song Liu
2022-04-20 18:28 ` Luis Chamberlain
2022-04-21 7:29 ` Song Liu
2022-04-21 3:25 ` Nicholas Piggin
2022-04-21 5:48 ` Linus Torvalds
2022-04-21 6:02 ` Linus Torvalds
2022-04-21 9:07 ` Nicholas Piggin
2022-04-21 8:57 ` Nicholas Piggin
2022-04-21 15:44 ` Linus Torvalds
2022-04-21 23:30 ` Nicholas Piggin
2022-04-22 0:49 ` Linus Torvalds
2022-04-22 1:51 ` Nicholas Piggin
2022-04-22 2:31 ` Linus Torvalds
2022-04-22 2:57 ` Nicholas Piggin
2022-04-21 15:47 ` Edgecombe, Rick P
2022-04-21 16:15 ` Linus Torvalds
2022-04-22 0:12 ` Nicholas Piggin
2022-04-22 2:29 ` Edgecombe, Rick P
2022-04-22 2:47 ` Linus Torvalds
2022-04-22 16:54 ` Edgecombe, Rick P
2022-04-22 3:08 ` Nicholas Piggin
2022-04-22 4:31 ` Nicholas Piggin
2022-04-22 17:10 ` Edgecombe, Rick P
2022-04-22 20:22 ` Edgecombe, Rick P
2022-04-22 3:33 ` Nicholas Piggin
2022-04-21 9:47 ` Nicholas Piggin
2022-04-19 21:24 ` Luis Chamberlain
2022-04-19 23:58 ` Edgecombe, Rick P
2022-04-20 7:58 ` Petr Mladek
2022-04-19 18:20 ` Mike Rapoport
2022-04-24 17:43 ` Linus Torvalds
2022-04-25 6:48 ` Song Liu
2022-04-21 3:19 ` Nicholas Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=B995F7EB-2019-4290-9C09-AE19C5BA3A70@fb.com \
--to=songliubraving@fb.com \
--cc=Kernel-team@fb.com \
--cc=akpm@linux-foundation.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=hch@infradead.org \
--cc=imbrenda@linux.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mcgrof@kernel.org \
--cc=rick.p.edgecombe@intel.com \
--cc=song@kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox