Date: Mon, 5 May 2025 11:37:38 +0200
From: Jan Kara <jack@suse.cz>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton, "Matthew Wilcox (Oracle)", Alexander Viro,
	Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
	Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
Message-ID: <67wws7qs5v3poq6sefrrt4dgdn4ejh52mg5x7ycbxqvrfdvow3@zraqczowrvrl>
References: <20250430145920.3748738-1-ryan.roberts@arm.com>
 <20250430145920.3748738-3-ryan.roberts@arm.com>
 <3myknukhnrtdb4y5i6ewcgpubg2fopxc35ii6a4oy5ffgn7xdf@uileryotgd7z>
In-Reply-To: <3myknukhnrtdb4y5i6ewcgpubg2fopxc35ii6a4oy5ffgn7xdf@uileryotgd7z>

On Mon 05-05-25 11:13:26, Jan Kara wrote:
> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> > Previously asynchronous readahead would read ra_pages (usually 128K)
> > directly after the end of the synchronous readahead and given the
> > synchronous readahead portion had no alignment guarantees (beyond page
> > boundaries) it is possible (and likely) that the end of the initial 128K
> > region would not fall on a natural boundary for the folio size being
> > used. Therefore smaller folios were used to align down to the required
> > boundary, both at the end of the previous readahead block and at the
> > start of the new one.
> >
> > In the worst cases, this can result in never properly ramping up the
> > folio size, and instead getting stuck oscillating between order-0, -1
> > and -2 folios. The next readahead will try to use folios whose order is
> > +2 bigger than the folio that had the readahead marker. But because of
> > the alignment requirements, that folio (the first one in the readahead
> > block) can end up being order-0 in some cases.
> >
> > There will be 2 modifications to solve this issue:
> >
> > 1) Calculate the readahead size so the end is aligned to a folio
> >    boundary. This prevents needing to allocate small folios to align
> >    down at the end of the window and fixes the oscillation problem.
> >
> > 2) Remember the "preferred folio order" in the ra state instead of
> >    inferring it from the folio with the readahead marker. This solves
> >    the slow ramp up problem (discussed in a subsequent patch).
> >
> > This patch addresses (1) only. A subsequent patch will address (2).
> >
> > Worked example:
> >
> > The following shows the previous pathological behaviour when the initial
> > synchronous readahead is unaligned. We start reading at page 17 in the
> > file and read sequentially from there. I'm showing a dump of the pages
> > in the page cache just after we read the first page of the folio with
> > the readahead marker.
> Looks good. When I was reading this code some time ago, I also felt we
> should rather do some rounding instead of creating small folios so thanks
> for working on this. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>

But now I've also remembered why what you do here isn't an obvious win.
There are storage devices (mostly RAID arrays) where the optimum read size
isn't a power of 2. Think for example of a RAID-0 device composed of three
disks. It will have max_pages something like 384 (512k * 3). Suppose we are
on x86 and max_order is 9. Then previously (if we were lucky with alignment)
we were alternating between order-7 and order-8 pages in the page cache and
doing optimally sized IOs of 1536k. Now you will allocate all folios of
order 8 (nice) but reads will be just 1024k and you'll see a noticeable drop
in read throughput (not nice). Note that this is not just a theoretical
example but a real case we have hit when doing performance testing of
servers and for which I was tweaking the readahead code in the past.

So I think we need to tweak this logic a bit. Perhaps we should round_down
end to the minimum alignment dictated by 'order' and max_pages? Like:

	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)

If you set a badly aligned readahead size manually, you will get small pages
in the page cache but that's just you being stupid. In practice, hardware
induced readahead sizes need not be powers of 2 but they are *sane* :).

								Honza
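For concreteness, the capping idea above can be sketched in plain user-space
C. Nothing below is kernel code or the posted patch: the helper names, the
sample numbers, and the cap formula (computed in pages, a simplification of
the expression quoted above) are illustrative assumptions only.

/*
 * Sketch: cap the alignment order used to round down the window end by
 * the largest power of two that divides max_pages, so a non-power-of-2
 * readahead size (e.g. a three-disk RAID-0 with max_pages = 384) keeps
 * its end on a max_pages-friendly boundary instead of shrinking to a
 * plain power-of-2 window.
 */
#include <stdio.h>
#include <strings.h>	/* ffs() */

static unsigned long round_down_pow2(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);	/* align must be a power of two */
}

static unsigned long aligned_size(unsigned long start, unsigned long size,
				  unsigned int align_order)
{
	unsigned long end = start + size;
	unsigned long aligned_end = round_down_pow2(end, 1UL << align_order);

	if (aligned_end > start)
		size -= end - aligned_end;
	return size;
}

int main(void)
{
	unsigned long max_pages = 384;	/* 1536k in 4k pages: 3-disk RAID-0 */
	unsigned long start = 17;	/* unaligned window start */
	unsigned long size = 384;	/* window size before rounding */
	unsigned int order = 8;		/* preferred folio order */

	/* Rounding the end down to the folio-order boundary, as in the diff below. */
	printf("align to order %u -> %lu pages\n",
	       order, aligned_size(start, size, order));

	/* Capped variant: largest order that still divides max_pages (7 here). */
	unsigned int capped = ffs((int)max_pages) - 1;
	if (capped > order)
		capped = order;
	printf("align to order %u -> %lu pages\n",
	       capped, aligned_size(start, size, capped));
	return 0;
}

With these example numbers the folio-order rounding cuts the window back to
239 pages (end at page 256), while the capped variant keeps 367 pages and
ends the window exactly at page 384, i.e. on a 1536k multiple that matches
the stripe width in the RAID-0 example above.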
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 8bb316f5a842..82f9f623f2d7 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >  	unsigned long max_pages;
> >  	struct file_ra_state *ra = ractl->ra;
> >  	pgoff_t index = readahead_index(ractl);
> > -	pgoff_t expected, start;
> > +	pgoff_t expected, start, end, aligned_end;
> >  	unsigned int order = folio_order(folio);
> >  
> >  	/* no readahead */
> > @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >  		 * the readahead window.
> >  		 */
> >  		ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
> > -		ra->async_size = ra->size;
> >  		goto readit;
> >  	}
> >  
> > @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >  	ra->size = start - index;	/* old async_size */
> >  	ra->size += req_count;
> >  	ra->size = get_next_ra_size(ra, max_pages);
> > -	ra->async_size = ra->size;
> >  readit:
> >  	order += 2;
> > +	end = ra->start + ra->size;
> > +	aligned_end = round_down(end, 1UL << order);
> > +	if (aligned_end > ra->start)
> > +		ra->size -= end - aligned_end;
> > +	ra->async_size = ra->size;
> >  	ractl->_index = ra->start;
> >  	page_cache_ra_order(ractl, ra, order);
> >  }
> > --
> > 2.43.0
> >
>
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR