From: Jeff Layton <jlayton@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
Matthew Wilcox <willy@infradead.org>,
lsf-pc@lists.linux-foundation.org
Subject: Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
Date: Sat, 18 Apr 2026 18:43:42 -0400
Message-ID: <49f9b0f41ed690c0e9c62f893a9199ad9b857251.camel@kernel.org>
In-Reply-To: <vpgymg4vq7fv2iebfjebxbufjvo2yre64lcwjgsbyivu2pxmpp@zhjplhrjayc5>

On Fri, 2026-04-10 at 11:43 +0200, Jan Kara wrote:
> On Thu 09-04-26 13:37:17, Jeff Layton wrote:
> > On Thu, 2026-04-09 at 09:12 -0700, Darrick J. Wong wrote:
> > > On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> > > > Hello!
> > > >
> > > > This is a recurring topic Matthew has been kicking forward for the
> > > > last year, so let me offer a fs-person point of view on the problem
> > > > and possible solutions. The problem is very simple: When a filesystem
> > > > (ext4, btrfs, vfat) is about to reclaim an inode, it sometimes needs
> > > > to perform a complex cleanup - like trimming preallocated blocks
> > > > beyond the end of the file, making sure the journalling machinery is
> > > > done with the inode, etc. This may require reading metadata into
> > > > memory, which requires memory allocations, and since inode eviction
> > > > cannot fail, these are effectively GFP_NOFAIL allocations (and there
> > > > are other reasons why it would be very difficult to make some of
> > > > these required allocations failable in the filesystems).
> > > >
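To make the constraint concrete, here's a minimal sketch of the shape of
the problem (example_map_block() is made up; this is no particular
filesystem's code):

static void example_evict_inode(struct inode *inode)
{
	struct buffer_head *bh;

	truncate_inode_pages_final(&inode->i_data);

	/*
	 * Trimming preallocations means reading metadata, and reading
	 * metadata means allocating memory. ->evict_inode() has no error
	 * return, so the allocation is effectively GFP_NOFAIL.
	 */
	bh = sb_bread(inode->i_sb, example_map_block(inode));
	if (bh) {
		/* trim the preallocated blocks recorded in the metadata */
		brelse(bh);
	}
	clear_inode(inode);
}
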
> > > > GFP_NOFAIL allocations from reclaim context (be it kswapd or direct
> > > > reclaim) trigger warnings - and for a good reason, as forward
> > > > progress isn't guaranteed. It also leaves a bad taste that we
> > > > sometimes perform rather long-running operations, blocking on IO
> > > > from reclaim context and thus stalling reclaim for a substantial
> > > > amount of time, just to free 1k worth of slab cache.
> > > >
> > > > I have been mulling over possible solutions, since I don't think
> > > > each filesystem should have to invent a complex inode lifetime
> > > > management scheme like the one XFS created to solve these issues.
> > > > Here's what I think we could do:
> > > >
> > > > 1) Filesystems will be required to mark inodes that have non-trivial
> > > > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > > > whatever :)). Usually I expect this to happen on the first inode
> > > > modification or so. This will require some per-fs work, but it
> > > > shouldn't be that difficult, and filesystems can be adapted one by
> > > > one as they decide to address these warnings from reclaim.
> > > >
> > > > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual, directly
> > > > from kswapd / direct reclaim. I'm keeping this variant of inode
> > > > reclaim for performance reasons. I expect these to be a significant
> > > > portion of inodes on average, and in particular for workloads that
> > > > scan a lot of inodes (a find(1) through the whole fs or similar),
> > > > the efficiency of inode reclaim is one of the determining factors
> > > > for their performance.
> > > >
> > > > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a
> > > > separate per-sb list s_hard_reclaim_inodes, and we'll queue work (a
> > > > per-sb work struct) to process them.
> > > >
> > > > 4) The work will walk the s_hard_reclaim_inodes list and call
> > > > evict() for each inode, doing the hard work.
> > > >
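For concreteness, a rough sketch of what 1), 3) and 4) might look like in
fs/inode.c. I_RECLAIM_HARD, s_hard_reclaim_inodes and s_hard_reclaim_work
are the proposed (not yet existing) names, and the locking of the per-sb
list is elided:

/* 1) the fs marks the inode once it acquires non-trivial cleanup state */
static void example_mark_needs_hard_reclaim(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	inode->i_state |= I_RECLAIM_HARD;
	spin_unlock(&inode->i_lock);
}

/* 3) the sb shrinker defers such inodes instead of evicting them inline */
static void example_dispose_one(struct super_block *sb, struct inode *inode)
{
	if (inode->i_state & I_RECLAIM_HARD) {
		list_move(&inode->i_lru, &sb->s_hard_reclaim_inodes);
		queue_work(system_unbound_wq, &sb->s_hard_reclaim_work);
		return;
	}
	evict(inode);	/* cheap case: reclaimed inline, as today */
}

/* 4) the per-sb work item does the hard evictions outside of reclaim */
static void example_hard_reclaim_worker(struct work_struct *work)
{
	struct super_block *sb = container_of(work, struct super_block,
					      s_hard_reclaim_work);
	struct inode *inode, *next;

	list_for_each_entry_safe(inode, next, &sb->s_hard_reclaim_inodes,
				 i_lru) {
		list_del_init(&inode->i_lru);
		evict(inode);	/* free to block on IO and allocate memory */
	}
}
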
> > > > This way, kswapd / direct reclaim doesn't wait on hard-to-reclaim
> > > > inodes and can instead work on freeing the memory needed to free
> > > > them. So the warnings about GFP_NOFAIL allocations aren't just
> > > > papered over; they are genuinely addressed.
> > >
> > > This more or less sounds fine to me.
> > >
> > > > One possible concern is that the s_hard_reclaim_inodes list could
> > > > grow out of control for some workloads (in particular because
> > > > multiple CPUs could be generating hard-to-reclaim inodes while the
> > > > cleanup is single-threaded). This could be addressed by tracking the
> > > > number of inodes on the list and, if it grows over some limit,
> > > > throttling processes when they set the I_RECLAIM_HARD inode flag.
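A rough sketch of that throttle, assuming an invented per-sb counter
(s_nr_hard_reclaim) and limit - the counter would be bumped when the
shrinker moves an inode onto the list and dropped by the worker:

static void example_set_reclaim_hard(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	spin_lock(&inode->i_lock);
	inode->i_state |= I_RECLAIM_HARD;
	spin_unlock(&inode->i_lock);

	/* backpressure: wait for the worker once the backlog is too big */
	if (atomic_long_read(&sb->s_nr_hard_reclaim) > EXAMPLE_HARD_RECLAIM_MAX)
		flush_work(&sb->s_hard_reclaim_work);
}
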
> > >
> > > <nod> XFS does that, see xfs_inodegc_want_flush_work in
> > > xfs_inodegc_queue.
> > >
> > > > There's also a simpler approach to this problem, but with more
> > > > radical changes in behavior. For example, getting rid of the inode
> > > > LRU completely - inodes no longer referenced by any dentry should be
> > > > rare, and it isn't very useful to cache them. So we could always
> > > > drop inodes on the last iput() (as we currently do, for example, for
> > > > unlinked inodes). But I have a nagging feeling that somebody is
> > > > depending on the inode LRU somewhere - I'd like to poll the
> > > > collective knowledge of what could possibly go wrong here :)
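For reference, the VFS already lets a filesystem opt into exactly that
behavior today: if ->drop_inode returns nonzero, iput_final() evicts the
inode on the last iput() instead of parking it on the LRU. A minimal
illustration (the "examplefs" super_operations are hypothetical):

static const struct super_operations example_sops = {
	/*
	 * generic_delete_inode unconditionally returns 1, so unreferenced
	 * inodes are evicted immediately and never hit the inode LRU.
	 */
	.drop_inode	= generic_delete_inode,
};
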
> > >
> > > NFS, possibly? ;)
> > >
> >
> > NFS indeed.
> >
> > Bear in mind that NFS may fail d_revalidate checks on child dentries
> > when a parent directory is changed on the server. I imagine some
> > workloads might see a performance hit if a large file's dentry has to
> > be discarded and looked up again, because we'd suddenly have thrown
> > away a bunch of useful data in the pagecache.
>
> Thanks for filling in the details! I had vague memories that NFS was
> relying on inodes with a 0 refcount not getting immediately evicted.
> I'll keep that in mind when thinking about solutions.
>
In my mind, there are two classes of filesystems when it comes to the
dcache: ones where the kernel has perfect knowledge of the directory
tree, and ones where it does not. In general, in-memory and local
filesystems are the former, and distributed/network filesystems are the
latter.

It might make sense to allow filesystems to declare that they are of
the type where the kernel has perfect knowledge, and to use that to do
this kind of optimization. It wouldn't get rid of the LRU entirely, but
it could let you shrink it significantly.
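As a sketch of what that declaration could look like - note that
FS_DCACHE_AUTHORITATIVE is an invented flag and the bit value is just a
placeholder:

/* invented: the kernel's view of this fs's namespace is authoritative */
#define FS_DCACHE_AUTHORITATIVE	(1 << 7)	/* placeholder bit */

static struct file_system_type example_fs_type = {
	.name		= "examplefs",
	.fs_flags	= FS_REQUIRES_DEV | FS_DCACHE_AUTHORITATIVE,
};

/* iput_final() could then skip the inode LRU for such filesystems */
static bool example_can_evict_immediately(struct inode *inode)
{
	return inode->i_sb->s_type->fs_flags & FS_DCACHE_AUTHORITATIVE;
}
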
--
Jeff Layton <jlayton@kernel.org>