From: "NeilBrown"
To: "Jan Kara"
Cc: "Anders Blomdell", "Philippe Troin", "Jan Kara", "Matthew Wilcox (Oracle)", "Andrew Morton", linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: Regression in NFS probably due to very large amounts of readahead
In-reply-to: <20241126150613.a4b57y2qmolapsuc@quack3>
References: <>, <20241126150613.a4b57y2qmolapsuc@quack3>
Date: Wed, 27 Nov 2024 19:37:10 +1100
Message-id: <173269663098.1734440.13407516531783940860@noble.neil.brown.name>

On Wed, 27 Nov 2024, Jan Kara wrote:
>
> On Tue 26-11-24 11:37:19, Jan Kara wrote:
> > On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
> > > On 2024-11-26 02:48, Philippe Troin wrote:
> > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
> > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
> > > > > we got terrible performance (lots of nfs: server x.x.x.x not
> > > > > responding).
> > > > > What triggered this problem was virtual machines with NFS-mounted
> > > > > qcow2 disks that often triggered large readaheads that generates
> > > > > long streaks of disk I/O of 150-600 MB/s (4 ordinary HDD's) that
> > > > > filled up the buffer/cache area of the machine.
> > > > >
> > > > > A git bisect gave the following suspect:
> > > > >
> > > > > git bisect start
> > > >
> > > > 8< snip >8
> > > >
> > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
> > > > > readahead: properly shorten readahead when falling back to
> > > > > do_page_cache_ra()
> > > >
> > > > Thank you for taking the time to bisect, this issue has been bugging
> > > > me, but it's been non-deterministic, and hence hard to bisect.
> > > >
> > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
> > > > slightly different setups:
> > > >
> > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a
> > > > "nfs server XXX not responding, still trying" that never recovers
> > > > (while the server remains pingable and other NFSv3 volumes from the
> > > > hanging server can be mounted).
> > > >
> > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to
> > > > several minutes) on random I/O. These stalls eventually recover.
> > > >
> > > > I've built a 6.11.10 kernel with
> > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
> > > > normal (no more NFS hangs, no more VM stalls).
> > > >
> > > Some printk debugging, seems to indicate that the problem
> > > is that the entity 'ra->size - (index - start)' goes
> > > negative, which then gets cast to a very large unsigned
> > > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true
> > > bug is still eludes me, though.
> >
> > Thanks for the report, bisection and debugging! I think I see what's going
> > on. read_pages() can go and reduce ra->size when ->readahead() callback
> > failed to read all folios prepared for reading and apparently that's what
> > happens with NFS and what can lead to negative argument to
> > do_page_cache_ra(). Now at this point I'm of the opinion that updating
> > ra->size / ra->async_size does more harm than good (because those values
> > show *desired* readahead to happen, not exact number of pages read),
> > furthermore it is problematic because ra can be shared by multiple
> > processes and so updates are inherently racy. If we indeed need to store
> > number of read pages, we could do it through ractl which is call-site local
> > and used for communication between readahead generic functions and callers.
> > But I have to do some more history digging and code reading to understand
> > what is using this logic in read_pages().
>
> Hum, checking the history the update of ra->size has been added by Neil two
> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> process all pages"). Neil, the changelog seems as there was some real
> motivation behind updating of ra->size in read_pages(). What was it? Now I
> somewhat disagree with reducing ra->size in read_pages() because it seems
> like a wrong place to do that and if we do need something like that,
> readahead window sizing logic should rather be changed to take that into
> account? But it all depends on what was the real rationale behind reducing
> ra->size in read_pages()...
>

I cannot tell you much more than what the commit itself says.
If there are any pages still in the rac, then we didn't try read-ahead
and shouldn't pretend that we did. Else the numbers will be wrong.

I think the important part of the patch was the
delete_from_page_cache(). Leaving pages in the page cache which we
didn't try to read will cause a future read-ahead to skip those pages
and they can only be read synchronously.

But maybe you are right that ra, being shared, shouldn't be modified
like this.

Thanks,
NeilBrown
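
Illustrative note on the arithmetic quoted above: the wraparound that
Anders describes for 'ra->size - (index - start)' can be reproduced in
user space. What follows is only a minimal sketch, assuming the size is
an unsigned int and that index and start are unsigned long. The names
remaining_naive() and remaining_clamped() are hypothetical and are not
kernel functions, and the clamp is shown purely to make the arithmetic
visible, not as the fix discussed in this thread.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical stand-ins for the values discussed in the thread.
 * Assumption: size is unsigned int, index and start are unsigned long.
 */

/* Naive form: wraps to a huge value once (index - start) exceeds size. */
unsigned long remaining_naive(unsigned int size, unsigned long index,
                              unsigned long start)
{
    /* size is converted to unsigned long, so the result can never be negative. */
    return size - (index - start);
}

/* Clamped form: returns 0 instead of wrapping. Assumes index >= start. */
unsigned long remaining_clamped(unsigned int size, unsigned long index,
                                unsigned long start)
{
    unsigned long consumed = index - start;
    unsigned long result = consumed >= size ? 0 : size - consumed;

    /* Postcondition: we never report more than the original window. */
    assert(result <= size);
    return result;
}
```

With size = 4 and index - start = 8, remaining_naive() returns
ULONG_MAX - 3 rather than -4, which is the "very large unsigned
nr_to_read" from the report; remaining_clamped() returns 0 for the same
inputs.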