Date: Mon, 26 Feb 2024 21:07:51 +0000
From: Matthew Wilcox
To: Linus Torvalds
Cc: Al Viro, Kent Overstreet, Luis Chamberlain,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> Willy - tangential side note: I looked closer at the issue that you
> reported (indirectly) with the small reads during heavy write
> activity.
>
> Our _reading_ side is very optimized and has none of the write-side
> oddities that I can see, and we just have
>
>   filemap_read ->
>     filemap_get_pages ->
>       filemap_get_read_batch ->
>         folio_try_get_rcu()
>
> and there is no page locking or other locking involved (assuming the
> page is cached and marked uptodate etc, of course).
>
> So afaik, it really is just that *one* atomic access (and the matching
> page ref decrement afterwards).

Yep, that was what the customer reported on their ancient kernel,
and we at least didn't make that worse ...

> We could easily do all of this without getting any ref to the page at
> all if we did the page cache release with RCU (and the user copy with
> "copy_to_user_atomic()").  Honestly, anything else looks like a
> complete disaster. For tiny reads, a temporary buffer sounds ok, but
> really *only* for tiny reads where we could have that buffer on the
> stack.
>
> Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> for to that degree?
>
> In contrast, the RCU-delaying of the page cache might be a good idea
> in general. We've had other situations where that would have been
> nice. The main worry would be low-memory situations, I suspect.
>
> The "tiny read" optimization smells like a benchmark thing to me. Even
> with the cacheline possibly bouncing, the system call overhead for
> tiny reads (particularly with all the mitigations) should be orders of
> magnitude higher than two atomic accesses.

Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
I suspect that this customer would simply disable them; afaik the machine
is an appliance and one interacts with it purely by sending transactions
to it (it's not even an SQL system, much less a "run arbitrary javascript"
kind of system).  But that makes it even more special case, inapplicable
to the majority of workloads and closer to smelling like a benchmark.

I've thought about and rejected RCU delaying of the page cache in the
past.
With the majority of memory in anon memory & file memory, it just
feels too risky to have so much memory waiting to be reused.  We could
also improve gup-fast if we could rely on RCU freeing of anon memory.
Not sure what workloads might benefit from that, though.

It'd be cute if we could restrict free memory to be only reallocatable
by the process that had previously allocated it until an RCU grace
period had passed.  That way you could still snoop, but only on yourself
which wouldn't be all that exciting.  Doubt it'd be worth the effort of
setting up per-mm freelists, although it would allow us to skip zeroing
the page under some circumstances ...
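
To make the "only reallocatable by the previous owner until a grace
period has passed" idea concrete, here is a toy userspace model.  It is
only a sketch: it is not kernel code, every name in it (toy_pool,
toy_alloc, toy_grace_period, ...) is made up for illustration, and the
"grace period" is just a counter that the caller bumps by hand.  It also
assumes that "some circumstances" for skipping the zeroing means "the
page went straight back to the mm that freed it"; whether that is ever
safe depends on the caller, which the model ignores.

```c
/*
 * Toy userspace model of "a freed page may only go back to the mm that
 * freed it until a grace period has passed".  Every name here is
 * hypothetical; none of this is kernel API.
 */
#include <assert.h>
#include <stddef.h>

#define TOY_NPAGES   4
#define TOY_NO_OWNER (-1)

struct toy_page {
	int in_use;                /* nonzero while allocated */
	int last_owner;            /* mm id that last freed it, or TOY_NO_OWNER */
	unsigned long freed_epoch; /* pool epoch at the time it was freed */
};

struct toy_pool {
	struct toy_page pages[TOY_NPAGES];
	unsigned long epoch;       /* bumped once per simulated grace period */
};

void toy_pool_init(struct toy_pool *pool)
{
	pool->epoch = 0;
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		pool->pages[i].in_use = 0;
		pool->pages[i].last_owner = TOY_NO_OWNER;
		pool->pages[i].freed_epoch = 0;
	}
}

/* Stand-in for an RCU grace period elapsing. */
void toy_grace_period(struct toy_pool *pool)
{
	pool->epoch++;
}

/* Free page idx on behalf of mm, remembering who freed it and when. */
void toy_free(struct toy_pool *pool, int idx, int mm)
{
	assert(idx >= 0 && idx < TOY_NPAGES);
	assert(pool->pages[idx].in_use);
	pool->pages[idx].in_use = 0;
	pool->pages[idx].last_owner = mm;
	pool->pages[idx].freed_epoch = pool->epoch;
}

/*
 * Returns a page index, or -1 if nothing is available to this mm yet.
 * *same_owner is set to 1 when the page comes straight back to the mm
 * that freed it (the only case where this model would consider skipping
 * the zeroing), and to 0 otherwise.
 */
int toy_alloc(struct toy_pool *pool, int mm, int *same_owner)
{
	int fallback = -1;

	for (int i = 0; i < TOY_NPAGES; i++) {
		struct toy_page *pg = &pool->pages[i];

		if (pg->in_use)
			continue;
		if (pg->last_owner == mm) {
			/* Our own page: usable immediately. */
			pg->in_use = 1;
			*same_owner = 1;
			return i;
		}
		/* Never-used page, or someone else's after a grace period. */
		if (fallback < 0 &&
		    (pg->last_owner == TOY_NO_OWNER ||
		     pool->epoch > pg->freed_epoch))
			fallback = i;
	}
	if (fallback >= 0)
		pool->pages[fallback].in_use = 1;
	*same_owner = 0;
	return fallback;
}
```

In this model a page freed by mm 1 is invisible to mm 2 until
toy_grace_period() has been called at least once, while mm 1 can take it
back immediately.  The cost that makes the real thing doubtful is
visible even here: every free page has to carry an owner and a
timestamp, and an allocation can fail while free pages exist.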