Date: Thu, 9 Oct 2025 17:22:13 +0100
From: Kiryl Shutsemau <kirill@shutemov.name>
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, Linux-MM, linux-fsdevel@vger.kernel.org
Subject: Re: Optimizing small reads
References: <4bjh23pk56gtnhutt4i46magq74zx3nlkuo4ym2tkn54rv4gjl@rhxb6t6ncewp>
 <5zq4qlllkr7zlif3dohwuraa7rukykkuu6khifumnwoltcijfc@po27djfyqbka>

On Wed, Oct 08, 2025 at 10:03:47AM -0700, Linus Torvalds wrote:
> On Wed, 8 Oct 2025 at 09:27, Linus Torvalds wrote:
> >
> > On Wed, 8 Oct 2025 at 07:54, Kiryl Shutsemau wrote:
> > >
> > > Disabling SMAP (clearcpuid=smap) makes it 45.7GiB/s for my patch and
> > > 50.9GiB/s for yours. So it cannot be fully attributed to SMAP.
> >
> > It's not just smap. It's the iov iterator overheads I mentioned.
>
> I also suspect that if the smap and iov overhead are fixed, the next
> thing in line is the folio lookup.

Below is the patch I currently have. I went for a clearer separation of
the fast and slow paths.

Objtool is not happy about calling random stuff within UACCESS. I have
ignored that for now. I am also not sure whether I use
user_access_begin()/_end() correctly. Let me know if I misunderstood or
misimplemented your idea.

This patch brings 4k reads from 512k files to ~60GiB/s. Making the
buffer 4k brings it to ~95GiB/s (the baseline is 100GiB/s).

I tried to optimize the folio walk, but it got slower for some reason.
I don't yet understand why. Maybe something silly. I will play with it
more.

Any other ideas?
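For completeness, here is my understanding of the protocol the fast path
relies on, as a condensed userspace sketch. The names (seq_cnt,
shared_buf, writer_update, reader_try_copy) are made up for illustration,
and the memory ordering is simplified; in the patch itself the kernel's
seqcount_spinlock_t and read_seqcount_retry() provide the real barriers:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <string.h>

	/* Toy seqcount: an odd value means a writer is mid-update. */
	static _Atomic unsigned int seq_cnt;
	static char shared_buf[64];	/* stands in for page cache data */

	/* Writer side, the page_cache_delete() analogue (runs locked). */
	static void writer_update(const char *src)
	{
		atomic_fetch_add_explicit(&seq_cnt, 1, memory_order_release);
		strncpy(shared_buf, src, sizeof(shared_buf) - 1);
		atomic_fetch_add_explicit(&seq_cnt, 1, memory_order_release);
	}

	/* Reader side: copy speculatively and only trust the result if
	 * the sequence count did not move underneath us. */
	static bool reader_try_copy(char *dst, size_t len)
	{
		unsigned int seq;

		seq = atomic_load_explicit(&seq_cnt, memory_order_acquire);
		if (seq & 1)
			return false;	/* writer in progress: slow path */

		memcpy(dst, shared_buf, len);	/* may observe torn data... */

		/* ...so revalidate before anyone consumes it */
		atomic_thread_fence(memory_order_acquire);
		return atomic_load_explicit(&seq_cnt,
					    memory_order_relaxed) == seq;
	}

In the patch the equivalent recheck, read_seqcount_retry(), sits between
memcpy_from_file_folio() and unsafe_copy_to_user(), so a torn copy is
thrown away before it can reach userspace.
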
diff --git a/fs/inode.c b/fs/inode.c
index ec9339024ac3..52163d28d630 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -482,6 +482,8 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
+	seqcount_spinlock_init(&mapping->i_pages_delete_seqcnt,
+			       &mapping->i_pages.xa_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->i_private_list);
 	spin_lock_init(&mapping->i_private_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9e9d7c757efe..a900214f0f3a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ struct address_space {
 	struct list_head	i_private_list;
 	struct rw_semaphore	i_mmap_rwsem;
 	void *			i_private_data;
+	seqcount_spinlock_t	i_pages_delete_seqcnt;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/mm/filemap.c b/mm/filemap.c
index 751838ef05e5..732756116b6a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -138,8 +138,10 @@ static void page_cache_delete(struct address_space *mapping,

 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

+	write_seqcount_begin(&mapping->i_pages_delete_seqcnt);
 	xas_store(&xas, shadow);
 	xas_init_marks(&xas);
+	write_seqcount_end(&mapping->i_pages_delete_seqcnt);

 	folio->mapping = NULL;
 	/* Leave folio->index set: truncation lookup relies upon it */
@@ -2659,41 +2661,106 @@ static void filemap_end_dropbehind_read(struct folio *folio)
 	}
 }

-/**
- * filemap_read - Read data from the page cache.
- * @iocb: The iocb to read.
- * @iter: Destination for the data.
- * @already_read: Number of bytes already read by the caller.
- *
- * Copies data from the page cache. If the data is not currently present,
- * uses the readahead and read_folio address_space operations to fetch it.
- *
- * Return: Total number of bytes copied, including those already read by
- * the caller. If an error happens before any bytes are copied, returns
- * a negative error number.
- */
-ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
-		ssize_t already_read)
+static bool filemap_read_fast(struct kiocb *iocb, struct iov_iter *iter,
+			      char *buffer, size_t buffer_size,
+			      ssize_t *already_read)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	struct file_ra_state *ra = &iocb->ki_filp->f_ra;
+	loff_t last_pos = ra->prev_pos;
+	struct folio *folio;
+	loff_t file_size;
+	unsigned int seq;
+
+	/* Don't bother with flush_dcache_folio() */
+	if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE)
+		return false;
+
+	if (!iter_is_ubuf(iter))
+		return false;
+
+	/* Give up and go to slow path if raced with page_cache_delete() */
+	if (!raw_seqcount_try_begin(&mapping->i_pages_delete_seqcnt, seq))
+		return false;
+
+	if (!user_access_begin(iter->ubuf + iter->iov_offset, iter->count))
+		return false;
+
+	rcu_read_lock();
+	pagefault_disable();
+
+	do {
+		size_t to_read, read;
+		XA_STATE(xas, &mapping->i_pages, iocb->ki_pos >> PAGE_SHIFT);
+
+		xas_reset(&xas);
+		folio = xas_load(&xas);
+		if (xas_retry(&xas, folio))
+			break;
+
+		if (!folio || xa_is_value(folio))
+			break;
+
+		if (!folio_test_uptodate(folio))
+			break;
+
+		/* No fast-case if readahead is supposed to be started */
+		if (folio_test_readahead(folio))
+			break;
+		/* .. or mark it accessed */
+		if (!folio_test_referenced(folio))
+			break;
+
+		/* i_size check must be after folio_test_uptodate() */
+		file_size = i_size_read(mapping->host);
+
+		do {
+			if (unlikely(iocb->ki_pos >= file_size))
+				goto out;
+
+			to_read = min(iov_iter_count(iter), buffer_size);
+			if (to_read > file_size - iocb->ki_pos)
+				to_read = file_size - iocb->ki_pos;
+
+			read = memcpy_from_file_folio(buffer, folio, iocb->ki_pos, to_read);
+
+			/* Give up and go to slow path if raced with page_cache_delete() */
+			if (read_seqcount_retry(&mapping->i_pages_delete_seqcnt, seq))
+				goto out;
+
+			unsafe_copy_to_user(iter->ubuf + iter->iov_offset, buffer,
+					    read, out);
+
+			iter->iov_offset += read;
+			iter->count -= read;
+			*already_read += read;
+			iocb->ki_pos += read;
+			last_pos = iocb->ki_pos;
+		} while (iov_iter_count(iter) && iocb->ki_pos % folio_size(folio));
+	} while (iov_iter_count(iter));
+out:
+	pagefault_enable();
+	rcu_read_unlock();
+	user_access_end();
+
+	file_accessed(iocb->ki_filp);
+	ra->prev_pos = last_pos;
+	return !iov_iter_count(iter);
+}
+
+static ssize_t filemap_read_slow(struct kiocb *iocb, struct iov_iter *iter,
+				 struct folio_batch *fbatch, ssize_t already_read)
 {
 	struct file *filp = iocb->ki_filp;
 	struct file_ra_state *ra = &filp->f_ra;
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-	struct folio_batch fbatch;
 	int i, error = 0;
 	bool writably_mapped;
 	loff_t isize, end_offset;
 	loff_t last_pos = ra->prev_pos;

-	if (unlikely(iocb->ki_pos < 0))
-		return -EINVAL;
-	if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
-		return 0;
-	if (unlikely(!iov_iter_count(iter)))
-		return 0;
-
-	iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
-	folio_batch_init(&fbatch);
+	folio_batch_init(fbatch);

 	do {
 		cond_resched();
@@ -2709,7 +2776,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;

-		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
+		error = filemap_get_pages(iocb, iter->count, fbatch, false);
 		if (error < 0)
 			break;

@@ -2737,11 +2804,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		 * mark it as accessed the first time.
 		 */
 		if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
-				    fbatch.folios[0]))
-			folio_mark_accessed(fbatch.folios[0]);
+				    fbatch->folios[0]))
+			folio_mark_accessed(fbatch->folios[0]);

-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
-			struct folio *folio = fbatch.folios[i];
+		for (i = 0; i < folio_batch_count(fbatch); i++) {
+			struct folio *folio = fbatch->folios[i];
 			size_t fsize = folio_size(folio);
 			size_t offset = iocb->ki_pos & (fsize - 1);
 			size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
@@ -2772,19 +2839,57 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_folios:
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
-			struct folio *folio = fbatch.folios[i];
+		for (i = 0; i < folio_batch_count(fbatch); i++) {
+			struct folio *folio = fbatch->folios[i];

 			filemap_end_dropbehind_read(folio);
 			folio_put(folio);
 		}
-		folio_batch_init(&fbatch);
+		folio_batch_init(fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);

 	file_accessed(filp);
 	ra->prev_pos = last_pos;

 	return already_read ? already_read : error;
 }
+
+/**
+ * filemap_read - Read data from the page cache.
+ * @iocb: The iocb to read.
+ * @iter: Destination for the data.
+ * @already_read: Number of bytes already read by the caller.
+ *
+ * Copies data from the page cache. If the data is not currently present,
+ * uses the readahead and read_folio address_space operations to fetch it.
+ *
+ * Return: Total number of bytes copied, including those already read by
+ * the caller. If an error happens before any bytes are copied, returns
+ * a negative error number.
+ */
+ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
+		ssize_t already_read)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	union {
+		struct folio_batch fbatch;
+		__DECLARE_FLEX_ARRAY(char, buffer);
+		//char __buffer[4096];
+	} area __uninitialized;
+
+	if (unlikely(iocb->ki_pos < 0))
+		return -EINVAL;
+	if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
+		return 0;
+	if (unlikely(!iov_iter_count(iter)))
+		return 0;
+
+	iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
+
+	if (filemap_read_fast(iocb, iter, area.buffer, sizeof(area), &already_read))
+		return already_read;
+
+	return filemap_read_slow(iocb, iter, &area.fbatch, already_read);
+}
 EXPORT_SYMBOL_GPL(filemap_read);

 int kiocb_write_and_wait(struct kiocb *iocb, size_t count)

-- 
Kiryl Shutsemau / Kirill A. Shutemov