Date: Wed, 26 Mar 2025 18:47:20 +0100
From: Jan Kara
To: Matthew Wilcox
Cc: Theodore Ts'o, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, Chris Mason, Josef Bacik,
	Luis Chamberlain
Subject: Re: [Lsf-pc] [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
References: <20250326155522.GB1459574@mit.edu>

On Wed 26-03-25 16:19:32, Matthew Wilcox wrote:
> On Wed, Mar 26, 2025 at 11:55:22AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
> > >
> > > We've got three reports now (two are syzkaller kiddie stuff, but one's a
> > > real workload) of a warning in the page allocator from filesystems
> > > doing reclaim.  Essentially they're using GFP_NOFAIL from reclaim
> > > context.  This got me thinking about bs>PS and I realised that if we fix
> > > this, then we're going to end up trying to do high order GFP_NOFAIL allocations
> > > in the memory reclaim path, and that is really no bueno.
> > >
> > > https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
> > >
> > > I'll prepare a better explainer of the problem in advance of this.
> >
> > Thanks for proposing this as a last-minute LSF/MM topic!
> >
> > I was looking at this myself, and was going to reply to the mail
> > thread above, but I'll do it here.
> >
> > From my perspective, the problem is that as part of memory reclaim,
> > there is an attempt to shrink the inode cache, and there are cases
> > where an inode's refcount was elevated (for example, because it was
> > referenced by a dentry), and when the dentry gets flushed, now the
> > inode can get evicted.  But if the inode is one that has been deleted,
> > then at eviction time the file system will try to release the blocks
> > associated with the deleted file.  This operation will require memory
> > allocation, potential I/O, and perhaps waiting for a journal
> > transaction to complete.
> >
> > So basically, there is a class of inodes where if we are in reclaim,
> > we should probably skip trying to evict them because there are very
> > likely other inodes that will be more likely to result in memory
> > getting released expeditiously.  And if we take a look at
> > inode_lru_isolate(), there's logic there already about when inodes
> > should be skipped rather than evicted.  It's probably just a matter
> > of adding some additional conditions there.
>
> This is a helpful way of looking at the problem.  I was looking at the
> problem further down where we've already entered evict_inode().  At that
> point we can't fail.  My proposal was going to be that the filesystem pin
> the metadata that it would need to modify in order to evict the inode.
> But avoiding entering evict_inode() is even better.
>
> However, I can't see how inode_lru_isolate() can know whether (looking
> at the three reports):
>
>  - the ext4 inode table has been reclaimed and ext4 would need to
>    allocate memory in order to reload the table from disc in order to
>    evict this inode
>  - the ext4 block bitmap has been reclaimed and ext4 would need to
>    allocate memory in order to reload the bitmap from disc to
>    discard the preallocation
>  - the fat cluster information has been reclaimed and fat would
>    need to allocate memory in order to reload the cluster from
>    disc to update the cluster information

Well, I think Ted was speaking about a more "big hammer" approach like
adding:

	if (current->flags & PF_MEMALLOC && !inode->i_nlink) {
		spin_unlock(&inode->i_lock);
		return LRU_SKIP;
	}

to inode_lru_isolate(). But the problem isn't with inode_lru_isolate()
here, as far as I'm reading the stacktrace. We are scanning the *dentry*
LRU list and killing the dentry, which drops the last reference to the
inode, and iput() then ends up doing all the deletion work. So we would
have to avoid dropping a dentry from the LRU if
dentry->d_inode->i_nlink == 0, and that frankly seems a bit silly to me.

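To make the distinction concrete, here is a minimal userspace sketch of the
two paths. It is only a model under stated assumptions: every model_* and
MODEL_* name is a hypothetical stand-in invented for illustration, not the
kernel's real structures, flag values or locking. It shows that a check of
the kind above would skip the unlinked inode on the inode LRU, while pruning
the dentry still drops the last reference and triggers the deletion work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PF_MEMALLOC; the value is illustrative only. */
#define MODEL_PF_MEMALLOC 0x1u

enum model_lru_status { MODEL_LRU_REMOVED, MODEL_LRU_SKIP };

/* Simplified model of an inode; not the kernel's struct inode. */
struct model_inode {
	unsigned int i_nlink;	/* 0 means the file has been unlinked */
	int i_count;		/* references held, e.g. by a dentry */
	bool deletion_ran;	/* set once the heavy deletion work runs */
};

/* Simplified model of a dentry pinning an inode. */
struct model_dentry {
	struct model_inode *d_inode;
};

/* Models the proposed check: skip unlinked inodes in reclaim context. */
enum model_lru_status model_inode_lru_isolate(unsigned int task_flags,
					      struct model_inode *inode)
{
	if ((task_flags & MODEL_PF_MEMALLOC) && !inode->i_nlink)
		return MODEL_LRU_SKIP;
	return MODEL_LRU_REMOVED;
}

/* Dropping the last reference to an unlinked inode runs the deletion. */
void model_iput(struct model_inode *inode)
{
	assert(inode->i_count > 0);
	if (--inode->i_count == 0 && inode->i_nlink == 0)
		inode->deletion_ran = true;
}

/* Dentry LRU pruning never consults the inode LRU check above. */
void model_prune_dentry(struct model_dentry *dentry)
{
	struct model_inode *inode = dentry->d_inode;

	dentry->d_inode = NULL;
	model_iput(inode);
}
```

In this model, calling model_inode_lru_isolate() with MODEL_PF_MEMALLOC set
on an inode with i_nlink == 0 returns MODEL_LRU_SKIP, yet
model_prune_dentry() on the dentry holding its last reference still sets
deletion_ran. Closing that gap would mean putting a similar check into the
dentry pruning path, which is the part that seems silly.
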
> So maybe it makes sense for ->evict_inode() to change from void to
> being able to return an errno, and then change the filesystems to not
> set GFP_NOFAIL, and instead just decline to evict the inode.

So this would help somewhat, but inode deletion is a *heavy* operation
(you can be freeing gigabytes of blocks), so you may end up doing a lot
of metadata IO through the journal. And deep in the bowels of the
filesystem we are doing GFP_NOFAIL allocations anyway, because there's
just no sane way to unroll what we've started. So I'm afraid that
->evict_inode() doing GFP_NOFAIL allocations for inodes with
inode->i_nlink == 0 is a fact of life that is very hard to change.

								Honza
-- 
Jan Kara
SUSE Labs, CR