Date: Thu, 3 Nov 2022 11:59:48 -0700
From: Luis Chamberlain
To: Mike Rapoport
Cc: Song Liu, bpf@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, x86@kernel.org, peterz@infradead.org,
	hch@lst.de, rick.p.edgecombe@intel.com, dave.hansen@intel.com,
	zhengjun.xing@linux.intel.com, kbusch@kernel.org, p.raghav@samsung.com,
	dave@stgolabs.net, vbabka@suse.cz, mgorman@suse.de, willy@infradead.org,
	torvalds@linux-foundation.org, a.manzanares@samsung.com
Subject: Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
References: <20221031222541.1773452-1-song@kernel.org> <20221031222541.1773452-2-song@kernel.org>

On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> Hi Luis,
> 
> Thanks for looping me in.
> 
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > > proposal by Peter in [1].
> > 
> > This is all great but we need to clarify *why* we would go through the
> > trouble. So if folks are not too excited about this series, that's
> > probably why. IMHO it lacks substance for rationale, **and** implies a
> > few gains without any *clear* performance metrics. I have 0 experience
> > with mm so I'd like others' feedback on this -- I'm just trying to
> > decipher the rationale from prior "bpf prog pack" efforts.
> > 
> > I'm sensing that the cables in messaging are a bit crossed here, that we
> > need to provide a better full picture of the rationale, and that because
> > this is completely missed the work is being undersold. If my assessment
> > is accurate though, the bpf prog pack strategy of sharing huge pages may
> > prove useful long term for other things than just modules / ftrace /
> > kprobes.
> > 
> > I was surprised to see that this entire patch series, upgraded from RFC
> > to proper PATCH form, now completely fails to mention any of the original
> > motivations behind the "BPF prog pack", which you are making a truly
> > heroic effort to generalize, as the problem is hard. Let me try to help
> > with that. The rationale for the old BPF prog pack is documented as
> > follows:
> > 
> >  * Most BPF programs are pretty small. Allocating a hole page for each
> >  * program is sometime a waste. Many small bpf program also adds pressure
> >  * to instruction TLB. To solve this issue, we introduce a BPF program pack
> >  * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> >  * to host BPF programs.
> > 
> > Previously you have also stated in earlier versions of this patch set:
> > 
> >   "Most BPF programs are small, but they consume a page each. For
> >    systems with busy traffic and many BPF programs, this could also add
> >    significant pressure to instruction TLB. High iTLB pressure usually
> >    causes slow down for the whole system, which includes visible
> >    performance degradation for production workloads."
> > 
> > So it is implied here that one of the benefits is to help reduce iTLB
> > misses. But that's it. We have no visible numbers to look at, and for
> > what... Reducing iTLB misses doesn't always correlate directly with
> > improving things, but if the code change is small enough it obviously
> > makes sense to apply. If the change is a bit more intrusive, as in this
> > patch series, a bit more rationale should be provided.
> > 
> > Other than the "performance aspects" of your patchset, the *main* reason
> > I am engaged and like it is that it reduces the nasty mess of semantics
> > around dealing with special permissions on pages, which we see in
> > modules and a few other places that today completely open code it. That
> > proves error prone and I'm glad to see efforts to generalize that
> > nastiness. So please ensure this is added as part of the documented
> > rationale. Even if the iTLB miss ratio improvement is not astronomical,
> > I believe the gains in sanity from improving the semantics of special
> > pages and sharing code make it well worthwhile. The iTLB miss ratio
> > improvement is just a small cherry on top.
> > 
> > Going back to performance aspects, when Linus poked for more details
> > about this you elaborated further:
> > 
> >   "we have seen direct map fragmentation causing visible
> >    performance drop for our major services. This is the shadow
> >    production benchmark, so it is not possible to run it out of
> >    our data centers. Tracing showed that BPF program was the top
> >    trigger of these direct map splits."
> > 
> > And the only other metric we have is:
> > 
> >   "For our web service production benchmark, bpf_prog_pack on 4kB pages
> >    gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> > 
> > These metrics are completely arbitrary and opaque to us. We need
> > something tangible and reproducible, and I have been suggesting that
> > from early on...
> > 
> > I'm under the impression that the real missed, undocumented, major
> > value-add here is that the old "BPF prog pack" strategy helps to reduce
> > the direct map fragmentation caused by heavy use of eBPF JIT programs,
> > and this in turn helps your overall system performance (regardless of
> > what it is you do). As I see it, the eBPF prog pack is just one strategy
> > to try to mitigate memory fragmentation of the direct map caused by eBPF
> > JIT programs, so the "slow down" your team has observed should be due to
> > the eventual fragmentation of the direct map *while* eBPF programs get
> > heavily used.
> 
> I believe that while the eBPF prog pack is helpful in mitigation of the
> direct map fragmentation caused by the eBPF JIT programs, the same
> strategy of allocating a large page, splitting its PMD entry and then
> reusing the memory for smaller allocations can be (and should be)
> generalized to other use cases that require non-default permissions in
> the page table. Most prominent use cases are those that allocate memory
> for code, but the same approach is relevant for other cases, like
> secretmem or page table protection with PKS.
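
Just to be sure I follow the strategy you describe, is it conceptually
something along these lines? A very rough sketch only -- the names below
are made up, error handling, freeing and multiple chunks are omitted, and
this is not the interface from Song's series:

	#include <linux/mm.h>
	#include <linux/moduleloader.h>
	#include <linux/mutex.h>
	#include <linux/set_memory.h>
	#include <linux/vmalloc.h>

	/* one shared RO+X chunk, carved up for many small allocations */
	static void *rox_pool;
	static unsigned long rox_used;
	static DEFINE_MUTEX(rox_lock);

	static void *rox_alloc(size_t size)
	{
		void *p;

		mutex_lock(&rox_lock);
		if (!rox_pool) {
			rox_pool = module_alloc(PMD_SIZE);
			/*
			 * The permission change (and the direct map split it
			 * causes) happens once per 2M chunk, not once per
			 * small allocation.
			 */
			set_vm_flush_reset_perms(rox_pool);
			set_memory_ro((unsigned long)rox_pool,
				      PMD_SIZE / PAGE_SIZE);
			set_memory_x((unsigned long)rox_pool,
				      PMD_SIZE / PAGE_SIZE);
		}
		/* caller fills p via text_poke()-style copying */
		p = rox_pool + rox_used;
		rox_used += round_up(size, 64);
		mutex_unlock(&rox_lock);
		return p;
	}

That is, the splits get amortized over all users of the chunk, and the
same scheme could then back modules / kprobes / ftrace, or data with
special permissions, not just BPF.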

> A while ago I've suggested to handle such caching of large pages at the
> page allocator level, but when we discussed it at LSF/MM/BPF, prevailing
> opinion was that added value does not justify changes to the page
> allocator and it was suggested to handle such caching elsewhere.

I saw that in the LWN coverage.

> I had to put this project on a backburner for $VARIOUS_REASONS, but I
> still think that we need a generic allocator for memory with non-default
> permissions in the direct map and that code allocation should build on
> that allocator.

It seems this generalization of the bpf prog pack, so that it can possibly
be used for modules / kprobes / ftrace, is a small step in that direction.

> All that said, the direct map fragmentation problem is currently relevant
> only to x86 because it's the only architecture that supports splitting of
> the large pages in the direct map.

I was thinking even more long term too, using this as a proof of concept.
If this practice in general helps with fragmentation, could it be used for
experimentation with compound pages later, as a way to reduce possible
fragmentation?

> > Mike Rapoport had presented about the direct map fragmentation problem
> > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > performance evaluation of whether using 2M/1G pages aggressively for
> > the kernel direct map helps performance [1] ends up generally
> > recommending huge pages. The work by Xing though was about using huge
> > pages *alone*, not about using a strategy such as the "bpf prog pack"
> > to share one 2 MiB huge page for *all* small eBPF programs, and that I
> > think is the real golden nugget here.
> > 
> > I contend therefore that the theoretical reduction of iTLB misses by
> > using huge pages for the "bpf prog pack" is not what gets your systems
> > to perform better. It should simply be that it reduces fragmentation,
> > and *this* generally can help with performance long term. If this is
> > accurate then let's please separate the two aspects of this.
> 
> The direct map fragmentation is the reason for higher TLB miss rate, both
> for iTLB and dTLB.

OK, so then whatever benchmark is running in tandem while the eBPF JIT is
hammered should *also* be measured with perf for iTLB and dTLB misses,
i.e. the patches can provide such results as justification.
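
Something as simple as the below, run with and without this series while
the eBPF-heavy workload is hammered, would already be tangible (the exact
event names vary per CPU and perf version, so treat these as illustrative):

	perf stat -a -e iTLB-load-misses,dTLB-load-misses,dTLB-store-misses \
		-- sleep 60

Comparing those counters alongside the benchmark's own throughput numbers
would make the claimed iTLB / dTLB gains reproducible by others.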

> Whenever a large page in the direct map is split, all kernel accesses via
> the direct map will use small pages, which requires dealing with 512 page
> table entries instead of one for a 2M range.
> 
> Since small pages in the direct map are never collapsed back to large
> pages, a long living system that heavily uses eBPF programs will have its
> direct map severely fragmented, a higher TLB miss rate and worse overall
> performance.

Shouldn't compaction help with those situations?

> > There are two aspects to what I would like to see, from a performance
> > perspective, actually mentioned in the commit logs:
> > 
> > 1) iTLB miss ratio with the "bpf prog pack" or this generalized solution
> >    vs not using it at all:
> 
> ...
> 
> > 2) Estimate of the reduction in direct map fragmentation by using the
> >    "bpf prog pack" or this generalized solution:
> > 
> > For this I'd expect a benchmark similar to the workload you guys run, or
> > something memory intensive, as eBPF JITs are heavily used, and after a
> > certain amount of time somehow compute how fragmented memory is. The
> > only sensible thing I can think of to measure memory fragmentation is to
> > look at the memory compaction index
> > /sys/kernel/debug/extfrag/extfrag_index, but I highly welcome others'
> > ideas as I'm an mm n00b.
> 
> The direct map fragmentation can be tracked with
> 
>    grep DirectMap /proc/meminfo
>    grep direct_map /proc/vmstat
> 
> and by looking at /sys/kernel/debug/page_tables/kernel

Thanks!

  Luis