From: Kent Overstreet
To: "Paul E. McKenney"
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Mon, 26 Feb 2024 18:29:43 -0500
Message-ID: <5c6ueuv5vlyir76yssuwmfmfuof3ukxz6h5hkyzfvsm2wkncrl@7wvkfpmvy2gp>

On Mon,
Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > +cc Paul
> >
> > On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > > Willy - tangential side note: I looked closer at the issue that you
> > > > > reported (indirectly) with the small reads during heavy write
> > > > > activity.
> > > > >
> > > > > Our _reading_ side is very optimized and has none of the write-side
> > > > > oddities that I can see, and we just have
> > > > >
> > > > >   filemap_read ->
> > > > >     filemap_get_pages ->
> > > > >       filemap_get_read_batch ->
> > > > >         folio_try_get_rcu()
> > > > >
> > > > > and there is no page locking or other locking involved (assuming the
> > > > > page is cached and marked uptodate etc, of course).
> > > > >
> > > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > > page ref decrement afterwards).
> > > >
> > > > Yep, that was what the customer reported on their ancient kernel, and
> > > > we at least didn't make that worse ...
> > > >
> > > > > We could easily do all of this without getting any ref to the page at
> > > > > all if we did the page cache release with RCU (and the user copy with
> > > > > "copy_to_user_atomic()"). Honestly, anything else looks like a
> > > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > > really *only* for tiny reads where we could have that buffer on the
> > > > > stack.
> > > > >
> > > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > > for to that degree?
> > > > >
> > > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > > in general. We've had other situations where that would have been
> > > > > nice. The main worry would be low-memory situations, I suspect.
> > > > >
> > > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > > with the cacheline possibly bouncing, the system call overhead for
> > > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > > magnitude higher than two atomic accesses.
> > > >
> > > > Ah, good point about the $%^&^*^ mitigations. This was pre mitigations.
> > > > I suspect that this customer would simply disable them; afaik the machine
> > > > is an appliance and one interacts with it purely by sending transactions
> > > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > > kind of system). But that makes it even more special case, inapplicable
> > > > to the majority of workloads and closer to smelling like a benchmark.
> > > >
> > > > I've thought about and rejected RCU delaying of the page cache in the
> > > > past. With the majority of memory in anon memory & file memory, it just
> > > > feels too risky to have so much memory waiting to be reused. We could
> > > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > > Not sure what workloads might benefit from that, though.
> > >
> > > RCU allocating and freeing of memory can already be fairly significant
> > > depending on workload, and I'd expect that to grow - we really just need
> > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > counter for "amount of memory stranded until the next RCU grace
> > > period").
>
> There are some APIs for that, though they are sharp-edged and mainly
> intended for rcutorture, and there are some hooks for a CI Kconfig
> option called RCU_STRICT_GRACE_PERIOD that could be organized into
> something useful.
>
> Of course, if there is a long-running RCU reader, there is nothing
> RCU can do. By definition, it must wait on all pre-existing readers,
> no exceptions.
>
> But my guess is that you instead are thinking of memory-exhaustion
> emergencies where you would like RCU to burn more CPU than usual to
> reduce grace-period latency. There are definitely things that can be done.
>
> I am sure that there are more questions that I should ask, but the one
> that comes immediately to mind is "Is this API call an occasional thing,
> or does RCU need to tolerate many CPUs hammering it frequently?"
> Either answer is fine, I just need to know. ;-)

Well, we won't want it getting hammered on continuously - we should be
able to tune reclaim so that doesn't happen.

I think getting numbers on the amount of memory stranded waiting for RCU
is probably the first order of business - a minor tweak to kfree_rcu()
et al. for that; there are APIs they can query to maintain that counter.

Then we can add a heuristic threshold somewhere, something like

	if (rcu_stranded * multiplier > reclaimable_memory)
		kick_rcu();
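
To make the accounting and the threshold check concrete, here is a rough
user-space sketch in C. It is only a model under stated assumptions: the
names stranded_stats, stranded_add(), stranded_sub() and should_kick_rcu()
are made up for illustration and are not kernel APIs, a single C11 atomic
stands in for the percpu counter mentioned earlier in the thread, and the
actual "kick RCU" step is left out because no such interface is specified
here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of the accounting described above.
 * None of these names are existing kernel APIs.
 */
struct stranded_stats {
	atomic_size_t stranded_bytes;	/* queued for free, waiting on a grace period */
};

static void stranded_init(struct stranded_stats *s)
{
	atomic_init(&s->stranded_bytes, 0);
}

/* Call when an object is queued for RCU-deferred freeing. */
static void stranded_add(struct stranded_stats *s, size_t bytes)
{
	atomic_fetch_add_explicit(&s->stranded_bytes, bytes,
				  memory_order_relaxed);
}

/* Call once the grace period has elapsed and the memory is really freed. */
static void stranded_sub(struct stranded_stats *s, size_t bytes)
{
	size_t old = atomic_fetch_sub_explicit(&s->stranded_bytes, bytes,
					       memory_order_relaxed);

	assert(old >= bytes);	/* the counter must never underflow */
	(void)old;		/* avoid an unused-variable warning with NDEBUG */
}

/* Returns true when reclaim should ask for a faster grace period. */
static bool should_kick_rcu(struct stranded_stats *s, size_t multiplier,
			    size_t reclaimable_bytes)
{
	size_t stranded = atomic_load_explicit(&s->stranded_bytes,
					       memory_order_relaxed);

	if (multiplier == 0)
		return false;

	/*
	 * Same test as stranded * multiplier > reclaimable_bytes, written
	 * as a division so the product cannot overflow size_t.
	 */
	return stranded > reclaimable_bytes / multiplier;
}
```

With a multiplier of 4 and 1 MiB reclaimable, for example, the check stays
false at 4 KiB stranded and turns true once roughly another 512 KiB is
queued. The multiplier is the tuning knob: a larger value kicks RCU
earlier, which is what would need tuning so the kick does not get hammered
on continuously.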