Date: Tue, 6 May 2025 13:29:13 +0200
From: Jan Kara
To: Ryan Roberts
Cc: Jan Kara, Andrew Morton, "Matthew Wilcox (Oracle)", Alexander Viro,
	Christian Brauner, David Hildenbrand, Dave Chinner, Catalin Marinas,
	Will Deacon, Kalesh Singh, Zi Yan,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
References: <20250430145920.3748738-1-ryan.roberts@arm.com>
	<20250430145920.3748738-3-ryan.roberts@arm.com>
	<3myknukhnrtdb4y5i6ewcgpubg2fopxc35ii6a4oy5ffgn7xdf@uileryotgd7z>
	<67wws7qs5v3poq6sefrrt4dgdn4ejh52mg5x7ycbxqvrfdvow3@zraqczowrvrl>

On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously asynchronous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchronous readahead and given the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries) it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>>    boundary.
> >>>    This prevents needing to allocate small folios to align
> >>>    down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>>    inferring it from the folio with the readahead marker. This solves
> >>>    the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathological behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there. I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> >
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara
> >
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where optimum read size
> > isn't a power of 2. Think for example a RAID-0 device composed from three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and do optimally sized IOs of 1536k.
>
> Sorry I'm struggling to follow some of this, perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
>
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?

Sorry, I have probably been too brief in my previous message :)
bdi->ra_pages is actually set based on the optimal IO size reported by the
hardware (see blk_apply_bdi_limits() and how its callers are filling in
lim->io_opt). The 128K you speak about is just a last-resort value if the
hardware doesn't provide one. And some storage devices do report an optimal
IO size that is not a power of two.

Also note that bdi->ra_pages can be tuned in sysfs and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set but you definitely cannot assume
bdi->ra_pages is 128K :).

> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5? (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
> spotting it in the code...
>
> I guess you are also implicitly teaching me something about how the block layer
> works here too... if there are 2 read requests for an order-7 and order-8, then
> the block layer will merge those to a single read (up to the 384 optimal size?)

Correct. In fact the readahead code will already perform this merging when
submitting the IO.

> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the optimal
> size either? Have I inferred that correctly?

With the code as you modify it, you would round down ra->size from 384 to
256 and submit only one 1MB sized IO (with one order-8 page). And this will
cause a regression in read throughput for such devices because they now
don't get a buffer large enough to run at full speed.

> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see
> > noticeable drop in read throughput (not nice).
> > Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> >
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and max_pages? Like:
> >
> > 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>
> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?

My bad. It should have been:

1 << min(order, ffs(max_pages) - 1)

> I think what you are suggesting is that the patch becomes something like
> this:
>
> ---8<---
> +	end = ra->start + ra->size;
> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));

Not quite. ilog2() returns the most significant bit set but we really want
to align to the least significant bit set. So when max_pages is 384, we
want to align to at most order-7 (aligning the end more does not make sense
when you want to do IO 384 pages large). That's why I'm using ffs() and not
ilog2().

> +	if (aligned_end > ra->start)
> +		ra->size -= end - aligned_end;
> +	ra->async_size = ra->size;
> ---8<---
>
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?

No, it needs to be aligned only to the previous 512K boundary because we
want to do IOs 3*512K large. Hope things are a bit clearer now :)

								Honza
-- 
Jan Kara
SUSE Labs, CR
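
For anyone following the arithmetic, here is a rough userspace sketch of the
two alignment choices discussed above. It is not kernel code: ffs_uint(),
ilog2_uint(), round_down_pow2(), end_align_ffs() and end_align_ilog2() are
hypothetical stand-ins written for this illustration, and the sketch assumes
4K pages, a readahead window starting at page 0 and a non-zero max_pages.

```c
#include <assert.h>

/* Stand-in for the kernel's ffs(): 1-based index of the least
 * significant set bit, or 0 if no bit is set. */
static int ffs_uint(unsigned int x)
{
	int pos = 1;

	if (!x)
		return 0;
	while (!(x & 1)) {
		x >>= 1;
		pos++;
	}
	return pos;
}

/* Stand-in for ilog2(): 0-based index of the most significant set bit.
 * The caller must pass x != 0. */
static int ilog2_uint(unsigned int x)
{
	int pos = 0;

	while ((x >>= 1) != 0)
		pos++;
	return pos;
}

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Round x down to a multiple of align; align must be a power of two. */
static unsigned long round_down_pow2(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

/* End alignment in pages, limited by the least significant set bit of
 * max_pages (the ffs() variant). max_pages must be non-zero. */
static unsigned long end_align_ffs(int order, unsigned int max_pages)
{
	return 1UL << min_int(order, ffs_uint(max_pages) - 1);
}

/* End alignment in pages, limited by the most significant set bit of
 * max_pages (the ilog2() variant). max_pages must be non-zero. */
static unsigned long end_align_ilog2(int order, unsigned int max_pages)
{
	return 1UL << min_int(order, ilog2_uint(max_pages));
}

/* The three-disk RAID-0 case: max_pages = 384, folio order 8,
 * window covering pages [0, 384). */
static void check_raid0_example(void)
{
	unsigned long end = 384;

	/* ffs(): align to 128 pages (512K), the full 384-page IO survives. */
	assert(round_down_pow2(end, end_align_ffs(8, 384)) == 384);
	/* ilog2(): align to 256 pages (1MB), the IO is trimmed to 256 pages. */
	assert(round_down_pow2(end, end_align_ilog2(8, 384)) == 256);
}
```

With max_pages = 384 the ffs() variant yields an alignment of 128 pages, so a
window ending at page 384 is left untouched, while the ilog2() variant yields
256 pages and cuts the same window down to 256 pages.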