From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sun, 19 Oct 2025 21:53:28 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: Kiryl Shutsemau
Cc: David Hildenbrand, Matthew Wilcox, Linus Torvalds, Alexander Viro,
 Christian Brauner, Jan Kara, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 Suren Baghdasaryan
Subject: Re: [PATCH] mm/filemap: Implement fast short reads
Message-Id: <20251019215328.3b529dc78222787226bd4ffe@linux-foundation.org>
In-Reply-To: <20251017141536.577466-1-kirill@shutemov.name>
References: <20251017141536.577466-1-kirill@shutemov.name>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Fri, 17 Oct 2025 15:15:36 +0100 Kiryl Shutsemau wrote:

> From: Kiryl Shutsemau
>
> The protocol for
> page cache lookup is as follows:
>
>  1. Locate a folio in XArray.
>  2. Obtain a reference on the folio using folio_try_get().
>  3. If successful, verify that the folio still belongs to the mapping
>     and has not been truncated or reclaimed.
>  4. Perform operations on the folio, such as copying data to userspace.
>  5. Release the reference.
>
> For short reads, the overhead of atomic operations on reference
> manipulation can be significant, particularly when multiple tasks
> access the same folio, leading to cache line bouncing.
>
> To address this issue, introduce i_pages_delete_seqcnt, which
> increments each time a folio is deleted from the page cache, and
> implement a modified page cache lookup protocol for short reads:
>
>  1. Locate a folio in XArray.
>  2. Take note of the i_pages_delete_seqcnt.
>  3. Copy the data to a local buffer on the stack.
>  4. Verify that the i_pages_delete_seqcnt has not changed.
>  5. Copy the data from the local buffer to the iterator.
>
> If any issues arise in the fast path, fall back to the slow path that
> relies on the refcount to stabilize the folio.

Well this is a fun patch.

> The new approach requires a local buffer on the stack. The size of the
> buffer determines which read requests are served by the fast path.
> Set the buffer to 1k. This seems to be a reasonable amount of stack
> usage for the function at the bottom of the call stack.

A use case for alloca() or equivalent. That would improve the
average-case stack depth but not the worst-case.

The 1k guess-or-giggle is crying out for histogramming - I bet 0.1k
covers the great majority. I suspect it wouldn't be hard to add a new
ad-hoc API into memory allocation profiling asking it to profile
something like this for us, given an explicit request to do so.

Is there really no way to copy the dang thing straight out to
userspace and skip the bouncing?

> The fast read approach demonstrates significant performance
> improvements, particularly in contended cases.
>
> 16 threads, reads from 4k file(s), mean MiB/s (StdDev)
>
> -------------------------------------------------------------
> | Block | Baseline   | Baseline   | Patched    | Patched    |
> | size  | same file  | diff files | same file  | diff files |
> -------------------------------------------------------------
> | 1     | 10.96      | 27.56      | 30.42      | 30.4       |
> |       | (0.497)    | (0.114)    | (0.130)    | (0.158)    |
> | 32    | 350.8      | 886.2      | 980.6      | 981.8      |
> |       | (13.64)    | (2.863)    | (3.361)    | (1.303)    |
> | 256   | 2798       | 7009.6     | 7641.4     | 7653.6     |
> |       | (103.9)    | (28.00)    | (33.26)    | (25.50)    |

tl;dr, 256-byte reads from the same file nearly tripled.

> | 1024  | 10780      | 27040      | 29280      | 29320      |
> |       | (389.8)    | (89.44)    | (130.3)    | (83.66)    |
> | 4096  | 43700      | 103800     | 48420      | 102000     |
> |       | (1953)     | (447.2)    | (2012)     | (0)        |
> -------------------------------------------------------------
>
> 16 threads, reads from 1M file(s), mean MiB/s (StdDev)
>
> --------------------------------------------------------------
> | Block | Baseline   | Baseline   | Patched    | Patched    |
> | size  | same file  | diff files | same file  | diff files |
> --------------------------------------------------------------
> | 1     | 26.38      | 27.34      | 30.38      | 30.36      |
> |       | (0.998)    | (0.114)    | (0.083)    | (0.089)    |
> | 32    | 824.4      | 877.2      | 977.8      | 975.8      |
> |       | (15.78)    | (3.271)    | (2.683)    | (1.095)    |
> | 256   | 6494       | 6992.8     | 7619.8     | 7625       |
> |       | (116.0)    | (32.39)    | (10.66)    | (28.19)    |
> | 1024  | 24960      | 26840      | 29100      | 29180      |
> |       | (606.6)    | (151.6)    | (122.4)    | (83.66)    |
> | 4096  | 94420      | 100520     | 95260      | 99760      |
> |       | (3144)     | (672.3)    | (2874)     | (134.1)    |
> | 32768 | 386000     | 402400     | 368600     | 397400     |
> |       | (36599)    | (10526)    | (47188)    | (6107)     |
> --------------------------------------------------------------
>
> There's also improvement on kernel build:
>
> Baseline: 61.3462 +- 0.0597 seconds time elapsed ( +- 0.10% )
> Patched:  60.6106 +- 0.0759 seconds time elapsed ( +- 0.13% )
>
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
>
> ...
>
> -ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> -		ssize_t already_read)
> +static inline unsigned long filemap_read_fast_rcu(struct address_space *mapping,
> +						  loff_t pos, char *buffer,
> +						  size_t size)
> +{
> +	XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
> +	struct folio *folio;
> +	loff_t file_size;
> +	unsigned int seq;
> +
> +	lockdep_assert_in_rcu_read_lock();
> +
> +	/* Give up and go to slow path if raced with page_cache_delete() */
> +	if (!raw_seqcount_try_begin(&mapping->i_pages_delete_seqcnt, seq))
> +		return false;

return 0;

> +
> +	folio = xas_load(&xas);
> +	if (xas_retry(&xas, folio))
> +		return 0;
> +
> +	if (!folio || xa_is_value(folio))
> +		return 0;
> +
> +	if (!folio_test_uptodate(folio))
> +		return 0;
> +
> +	/* No fast-case if readahead is supposed to started */

Please expand this comment. "explain why, not what". Why do we care
if it's under readahead? It's uptodate, so just grab it??

> +	if (folio_test_readahead(folio))
> +		return 0;
> +	/* .. or mark it accessed */

This comment disagrees with the code which it is commenting.

> +	if (!folio_test_referenced(folio))
> +		return 0;
> +
> +	/* i_size check must be after folio_test_uptodate() */

why?

> +	file_size = i_size_read(mapping->host);
> +	if (unlikely(pos >= file_size))
> +		return 0;
> +	if (size > file_size - pos)
> +		size = file_size - pos;

min() is feeling all sad?

> +	/* Do the data copy */

We can live without this comment ;)

> +	size = memcpy_from_file_folio(buffer, folio, pos, size);
> +	if (!size)
> +		return 0;
> +
> +	/* Give up and go to slow path if raced with page_cache_delete() */

I wonder if truncation is all we need to worry about here. For
example, direct-io does weird stuff.
> +	if (read_seqcount_retry(&mapping->i_pages_delete_seqcnt, seq))
> +		return 0;
> +
> +	return size;
> +}
> +
> +#define FAST_READ_BUF_SIZE 1024
> +
> +static noinline bool filemap_read_fast(struct kiocb *iocb, struct iov_iter *iter,
> +				       ssize_t *already_read)
> +{
> +	struct address_space *mapping = iocb->ki_filp->f_mapping;
> +	struct file_ra_state *ra = &iocb->ki_filp->f_ra;
> +	char buffer[FAST_READ_BUF_SIZE];
> +	size_t count;
> +
> +	if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE)
> +		return false;

why? (comment please)

> +	if (iov_iter_count(iter) > sizeof(buffer))
> +		return false;
> +
> +	count = iov_iter_count(iter);

It'd be a tiny bit tidier to swap the above to avoid evaluating
iov_iter_count() twice. Yes, iov_iter_count() happens to be fast, but
we aren't supposed to know that here.

> +	/* Let's see if we can just do the read under RCU */
> +	rcu_read_lock();
> +	count = filemap_read_fast_rcu(mapping, iocb->ki_pos, buffer, count);
> +	rcu_read_unlock();
> +
> +	if (!count)
> +		return false;
> +
> +	count = copy_to_iter(buffer, count, iter);
> +	if (unlikely(!count))
> +		return false;
> +
> +	iocb->ki_pos += count;
> +	ra->prev_pos = iocb->ki_pos;
> +	file_accessed(iocb->ki_filp);
> +	*already_read += count;
> +
> +	return !iov_iter_count(iter);
> +}
> +
> +static noinline ssize_t filemap_read_slow(struct kiocb *iocb,
> +					  struct iov_iter *iter,
> +					  ssize_t already_read)
> +{
> +	struct file *filp = iocb->ki_filp;
> +	struct file_ra_state *ra = &filp->f_ra;