From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 26 Feb 2024 16:05:37 -0800
From: "Paul E. McKenney"
To: Kent Overstreet
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Reply-To: paulmck@kernel.org
References: <5c6ueuv5vlyir76yssuwmfmfuof3ukxz6h5hkyzfvsm2wkncrl@7wvkfpmvy2gp>
In-Reply-To: <5c6ueuv5vlyir76yssuwmfmfuof3ukxz6h5hkyzfvsm2wkncrl@7wvkfpmvy2gp>
Content-Type: text/plain; charset=us-ascii
On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > > +cc Paul
> > >
> > > On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > > > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > > > Willy - tangential side note: I looked closer at the issue that you
> > > > > > reported (indirectly) with the small reads during heavy write
> > > > > > activity.
> > > > > >
> > > > > > Our _reading_ side is very optimized and has none of the write-side
> > > > > > oddities that I can see, and we just have
> > > > > >
> > > > > > filemap_read ->
> > > > > >   filemap_get_pages ->
> > > > > >     filemap_get_read_batch ->
> > > > > >       folio_try_get_rcu()
> > > > > >
> > > > > > and there is no page locking or other locking involved (assuming the
> > > > > > page is cached and marked uptodate etc, of course).
> > > > > >
> > > > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > > > page ref decrement afterwards).
> > > > >
> > > > > Yep, that was what the customer reported on their ancient kernel, and
> > > > > we at least didn't make that worse ...
> > > > > > We could easily do all of this without getting any ref to the page at
> > > > > > all if we did the page cache release with RCU (and the user copy with
> > > > > > "copy_to_user_atomic()"). Honestly, anything else looks like a
> > > > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > > > really *only* for tiny reads where we could have that buffer on the
> > > > > > stack.
> > > > > >
> > > > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > > > for to that degree?
> > > > > >
> > > > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > > > in general. We've had other situations where that would have been
> > > > > > nice. The main worry would be low-memory situations, I suspect.
> > > > > >
> > > > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > > > with the cacheline possibly bouncing, the system call overhead for
> > > > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > > > magnitude higher than two atomic accesses.
> > > > >
> > > > > Ah, good point about the $%^&^*^ mitigations. This was pre mitigations.
> > > > > I suspect that this customer would simply disable them; afaik the machine
> > > > > is an appliance and one interacts with it purely by sending transactions
> > > > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > > > kind of system). But that makes it even more special case, inapplicable
> > > > > to the majority of workloads and closer to smelling like a benchmark.
> > > > >
> > > > > I've thought about and rejected RCU delaying of the page cache in the
> > > > > past. With the majority of memory in anon memory & file memory, it just
> > > > > feels too risky to have so much memory waiting to be reused. We could
> > > > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > > > Not sure what workloads might benefit from that, though.
> > > >
> > > > RCU allocating and freeing of memory can already be fairly significant
> > > > depending on workload, and I'd expect that to grow - we really just need
> > > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > > counter for "amount of memory stranded until the next RCU grace
> > > > period").
> >
> > There are some APIs for that, though they are sharp-edged and mainly
> > intended for rcutorture, and there are some hooks for a CI Kconfig
> > option called RCU_STRICT_GRACE_PERIOD that could be organized into
> > something useful.
> >
> > Of course, if there is a long-running RCU reader, there is nothing
> > RCU can do. By definition, it must wait on all pre-existing readers,
> > no exceptions.
> >
> > But my guess is that you instead are thinking of memory-exhaustion
> > emergencies where you would like RCU to burn more CPU than usual to
> > reduce grace-period latency. There are definitely things that can be
> > done there.
> >
> > I am sure that there are more questions that I should ask, but the one
> > that comes immediately to mind is "Is this API call an occasional thing,
> > or does RCU need to tolerate many CPUs hammering it frequently?"
> > Either answer is fine, I just need to know. ;-)
>
> Well, we won't want it getting hammered on continuously - we should be
> able to tune reclaim so that doesn't happen.
>
> I think getting numbers on the amount of memory stranded waiting for RCU
> is probably first order of business - minor tweak to kfree_rcu() et al.
> for that; there are APIs they can query to maintain that counter.

We can easily tell you the number of blocks of memory waiting to be
freed. But RCU does not know their size. Yes, we could ferret this out
on each call to kfree_rcu(), but that might not be great for
performance. We could traverse the lists at runtime, but such traversal
must be done with interrupts disabled, which is also not great.
> then, we can add a heuristic threshold somewhere, something like
>
> if (rcu_stranded * multiplier > reclaimable_memory)
> 	kick_rcu()

If it is a heuristic anyway, it sounds best to base the heuristic on
the number of objects rather than their aggregate size.

							Thanx, Paul