Date: Mon, 13 Oct 2025 16:35:17 +0100
From: Kiryl Shutsemau <kirill@shutemov.name>
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, Linux-MM <linux-mm@kvack.org>, linux-fsdevel@vger.kernel.org
Subject: Re: Optimizing small reads

On Fri, Oct 10, 2025 at 10:51:40AM -0700, Linus Torvalds wrote:
> Sounds like a plan?

The patch is below. Can I use your Signed-off-by for it?

Here are some numbers. I cannot explain some of them: there should be
zero change for block sizes above 256, but...

I am not confident enough in these measurements yet and will work on
making them reproducible. It looks like there is a sizable boot-to-boot
difference.

I also need to run xfstests, unless someone can help me with that? I
don't have a ready-to-go setup.
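For reference, here is a minimal sketch of the sort of harness the numbers
below assume: 16 threads doing fixed-size pread() calls from the start of an
already-cached test file, reporting aggregate MiB/s. This is not the exact
tool behind the tables (the path, the runtime, and the "diff files" variant,
where each thread would read its own copy of the file, are placeholders):

/*
 * Hypothetical reproduction sketch, not the actual harness used for the
 * numbers below.  Build with: cc -O2 -pthread bench.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS	16
#define RUNTIME_SEC	5

static const char *path = "/tmp/readtest-4k";	/* placeholder test file */
static size_t block_size = 32;			/* the "Block size" column */

struct result {
	unsigned long long bytes;
};

static void *reader(void *arg)
{
	struct result *res = arg;
	char buf[4096];		/* large enough for any block size tested */
	struct timespec start, now;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return NULL;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		/* Small read from a fixed offset in a fully cached file */
		ssize_t n = pread(fd, buf, block_size, 0);

		if (n <= 0)
			break;
		res->bytes += n;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < RUNTIME_SEC);

	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct result res[NTHREADS] = { 0 };
	unsigned long long total = 0;

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, &res[i]);
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		total += res[i].bytes;
	}

	/* Aggregate throughput, comparable to the MiB/s columns below */
	printf("%.1f MiB/s\n", total / (double)RUNTIME_SEC / (1 << 20));
	return 0;
}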
16 threads, reads from 4k file(s), MiB/s

-----------------------------------------------------------
| Block | Baseline  | Baseline   | Patched   | Patched    |
| size  | same file | diff files | same file | diff files |
-----------------------------------------------------------
|     1 |      11.6 |       27.6 |      33.5 |       33.4 |
|    32 |       375 |        880 |      1027 |       1028 |
|   256 |      2940 |       6932 |      7872 |       7884 |
|  1024 |     11500 |      26900 |     11400 |      28300 |
|  4096 |     46500 |     103000 |     45700 |     108000 |
-----------------------------------------------------------

16 threads, reads from 1M file(s), MiB/s

-----------------------------------------------------------
| Block | Baseline  | Baseline   | Patched   | Patched    |
| size  | same file | diff files | same file | diff files |
-----------------------------------------------------------
|     1 |      26.8 |       27.4 |      32.0 |       32.2 |
|    32 |       840 |        872 |      1034 |       1033 |
|   256 |      6606 |       6904 |      7872 |       7919 |
|  1024 |     25700 |      26600 |     25300 |      28300 |
|  4096 |     96900 |      99000 |    103000 |     106000 |
-----------------------------------------------------------

diff --git a/fs/inode.c b/fs/inode.c
index ec9339024ac3..52163d28d630 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -482,6 +482,8 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
+	seqcount_spinlock_init(&mapping->i_pages_delete_seqcnt,
+			       &mapping->i_pages.xa_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->i_private_list);
 	spin_lock_init(&mapping->i_private_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9e9d7c757efe..a900214f0f3a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ struct address_space {
 	struct list_head	i_private_list;
 	struct rw_semaphore	i_mmap_rwsem;
 	void *			i_private_data;
+	seqcount_spinlock_t	i_pages_delete_seqcnt;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/mm/filemap.c b/mm/filemap.c
index 751838ef05e5..8c39a9445471 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -138,8 +138,10 @@ static void page_cache_delete(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
+	write_seqcount_begin(&mapping->i_pages_delete_seqcnt);
 	xas_store(&xas, shadow);
 	xas_init_marks(&xas);
+	write_seqcount_end(&mapping->i_pages_delete_seqcnt);
 
 	folio->mapping = NULL;
 	/* Leave folio->index set: truncation lookup relies upon it */
@@ -2659,6 +2661,92 @@ static void filemap_end_dropbehind_read(struct folio *folio)
 	}
 }
 
+static inline unsigned long filemap_read_fast_rcu(struct address_space *mapping,
+						  loff_t pos, char *buffer,
+						  size_t size)
+{
+	XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
+	struct folio *folio;
+	loff_t file_size;
+	unsigned int seq;
+
+	lockdep_assert_in_rcu_read_lock();
+
+	seq = read_seqcount_begin(&mapping->i_pages_delete_seqcnt);
+
+	xas_reset(&xas);
+	folio = xas_load(&xas);
+	if (xas_retry(&xas, folio))
+		return 0;
+
+	if (!folio || xa_is_value(folio))
+		return 0;
+
+	if (!folio_test_uptodate(folio))
+		return 0;
+
+	/* No fast-case if readahead is supposed to be started */
+	if (folio_test_readahead(folio))
+		return 0;
+	/* .. or mark it accessed */
+	if (!folio_test_referenced(folio))
+		return 0;
+
+	/* i_size check must be after folio_test_uptodate() */
+	file_size = i_size_read(mapping->host);
+	if (unlikely(pos >= file_size))
+		return 0;
+	if (size > file_size - pos)
+		size = file_size - pos;
+
+	/* Do the data copy */
+	size = memcpy_from_file_folio(buffer, folio, pos, size);
+	if (!size)
+		return 0;
+
+	/* Give up and go to slow path if raced with page_cache_delete() */
+	if (read_seqcount_retry(&mapping->i_pages_delete_seqcnt, seq))
+		return 0;
+
+	return size;
+}
+
+static inline bool filemap_read_fast(struct kiocb *iocb, struct iov_iter *iter,
+				     ssize_t *already_read,
+				     char *buffer, size_t buffer_size)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	struct file_ra_state *ra = &iocb->ki_filp->f_ra;
+	size_t count;
+
+	if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE)
+		return false;
+
+	if (iov_iter_count(iter) > buffer_size)
+		return false;
+
+	count = iov_iter_count(iter);
+
+	/* Let's see if we can just do the read under RCU */
+	rcu_read_lock();
+	count = filemap_read_fast_rcu(mapping, iocb->ki_pos, buffer, count);
+	rcu_read_unlock();
+
+	if (!count)
+		return false;
+
+	count = copy_to_iter(buffer, count, iter);
+	if (unlikely(!count))
+		return false;
+
+	iocb->ki_pos += count;
+	ra->prev_pos = iocb->ki_pos;
+	file_accessed(iocb->ki_filp);
+	*already_read += count;
+
+	return !iov_iter_count(iter);
+}
+
 /**
  * filemap_read - Read data from the page cache.
  * @iocb: The iocb to read.
@@ -2679,7 +2767,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 	struct file_ra_state *ra = &filp->f_ra;
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-	struct folio_batch fbatch;
+	union {
+		struct folio_batch fbatch;
+		__DECLARE_FLEX_ARRAY(char, buffer);
+	} area __uninitialized;
 	int i, error = 0;
 	bool writably_mapped;
 	loff_t isize, end_offset;
@@ -2693,7 +2784,20 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		return 0;
 
 	iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
-	folio_batch_init(&fbatch);
+
+	/*
+	 * Try a quick lockless read into the 'area' union. Note that
+	 * this union is intentionally marked "__uninitialized", because
+	 * any compiler initialization would be pointless since this
+	 * can fill it with garbage.
+	 */
+	if (filemap_read_fast(iocb, iter, &already_read, area.buffer, sizeof(area)))
+		return already_read;
+
+	/*
+	 * This actually properly initializes the fbatch for the slow case
+	 */
+	folio_batch_init(&area.fbatch);
 
 	do {
 		cond_resched();
@@ -2709,7 +2813,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
+		error = filemap_get_pages(iocb, iter->count, &area.fbatch, false);
 		if (error < 0)
 			break;
 
@@ -2737,11 +2841,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		 * mark it as accessed the first time.
 		 */
 		if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
-				    fbatch.folios[0]))
-			folio_mark_accessed(fbatch.folios[0]);
+				    area.fbatch.folios[0]))
+			folio_mark_accessed(area.fbatch.folios[0]);
 
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
-			struct folio *folio = fbatch.folios[i];
+		for (i = 0; i < folio_batch_count(&area.fbatch); i++) {
+			struct folio *folio = area.fbatch.folios[i];
 			size_t fsize = folio_size(folio);
 			size_t offset = iocb->ki_pos & (fsize - 1);
 			size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
@@ -2772,13 +2876,13 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_folios:
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
-			struct folio *folio = fbatch.folios[i];
+		for (i = 0; i < folio_batch_count(&area.fbatch); i++) {
+			struct folio *folio = area.fbatch.folios[i];
 
 			filemap_end_dropbehind_read(folio);
 			folio_put(folio);
 		}
-		folio_batch_init(&fbatch);
+		folio_batch_init(&area.fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
 	file_accessed(filp);

-- 
Kiryl Shutsemau / Kirill A. Shutemov