[LSF/MM/BPF TOPIC] Filesystem inode reclaim

* [LSF/MM/BPF TOPIC] Filesystem inode reclaim
@ 2026-04-09  9:16 Jan Kara
  2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Jan Kara @ 2026-04-09  9:16 UTC
  To: linux-fsdevel; +Cc: linux-mm, Matthew Wilcox, lsf-pc

Hello!

This is a recurring topic Matthew has been kicking forward for the last
year so let me maybe offer a fs-person point of view on the problem and
possible solutions. The problem is very simple: When a filesystem (ext4,
btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
complex cleanup - like trimming of preallocated blocks beyond end of file,
making sure journalling machinery is done with the inode, etc. This may
require reading metadata into memory which requires memory allocations and
as inode eviction cannot fail, these are effectively GFP_NOFAIL
allocations (and there are other reasons why it would be very difficult to
make some of these required allocations in the filesystems failable).

GFP_NOFAIL allocations from reclaim context (be it kswapd or direct reclaim)
trigger warnings - and for a good reason as forward progress isn't
guaranteed. Also it leaves a bad taste that we sometimes perform rather
long-running operations blocking on IO from reclaim context, thus stalling
reclaim for a substantial amount of time to free 1k worth of slab cache.

I have been mulling over possible solutions since I don't think each
filesystem should be inventing a complex inode lifetime management scheme
as XFS has invented to solve these issues. Here's what I think we could do:

1) Filesystems will be required to mark inodes that have non-trivial
cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
whatever :)). Usually I expect this to happen on first inode modification
or so. This will require some per-fs work but it shouldn't be that
difficult and filesystems can be adapted one-by-one as they decide to
address these warnings from reclaim.

2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
performance reasons. I expect this to be a significant portion of inodes
on average and in particular for some workloads which scan a lot of inodes
(find through the whole fs or similar) the efficiency of inode reclaim is
one of the determining factors for their performance.

3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
to process them.

4) The work will walk s_hard_reclaim_inodes list and call evict() for each
inode, doing the hard work.

This way, kswapd / direct reclaim doesn't wait for hard-to-reclaim inodes
and they can work on freeing memory needed for freeing of hard-to-reclaim
inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
they should really be addressed.
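
To make steps 1-4 concrete, here is a minimal userspace sketch in C. It is
only a model under stated assumptions: the struct and function names
(inode_model, sb_model, shrink_inodes, hard_reclaim_work) are hypothetical,
the lists are plain singly linked lists without any locking, and "eviction"
is just a flag. Only I_RECLAIM_HARD and s_hard_reclaim_inodes come from the
proposal itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not existing kernel definitions. */
#define I_RECLAIM_HARD 0x1u

struct inode_model {
	unsigned int flags;
	bool evicted;
	struct inode_model *next;	/* link on the LRU or on the hard list */
};

struct sb_model {
	struct inode_model *lru;			/* unused inodes awaiting reclaim */
	struct inode_model *hard_reclaim_inodes;	/* models s_hard_reclaim_inodes */
	bool work_queued;				/* models the per-sb work struct */
};

/* Step 1: the filesystem marks the inode, e.g. on first modification. */
static void mark_reclaim_hard(struct inode_model *inode)
{
	inode->flags |= I_RECLAIM_HARD;
}

static void lru_add(struct sb_model *sb, struct inode_model *inode)
{
	inode->next = sb->lru;
	sb->lru = inode;
}

/*
 * Steps 2 and 3: the shrinker evicts easy inodes directly and only moves
 * hard ones to the per-sb list, queueing the work. Returns the number of
 * inodes evicted directly.
 */
static size_t shrink_inodes(struct sb_model *sb)
{
	size_t freed = 0;

	while (sb->lru) {
		struct inode_model *inode = sb->lru;

		sb->lru = inode->next;
		if (inode->flags & I_RECLAIM_HARD) {
			inode->next = sb->hard_reclaim_inodes;
			sb->hard_reclaim_inodes = inode;
			sb->work_queued = true;
		} else {
			inode->next = NULL;
			inode->evicted = true;	/* stands in for evict() */
			freed++;
		}
	}
	return freed;
}

/* Step 4: the work walks the hard list and does the expensive eviction. */
static size_t hard_reclaim_work(struct sb_model *sb)
{
	size_t freed = 0;

	sb->work_queued = false;
	while (sb->hard_reclaim_inodes) {
		struct inode_model *inode = sb->hard_reclaim_inodes;

		sb->hard_reclaim_inodes = inode->next;
		inode->next = NULL;
		inode->evicted = true;	/* stands in for evict() */
		freed++;
	}
	return freed;
}
```

In this model the shrinker never does the expensive part itself; it only
relinks hard inodes and returns, which is the property the proposal relies
on to keep reclaim from blocking.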

One possible concern is that the s_hard_reclaim_inodes list could grow out of
control for some workloads (in particular because there could be multiple
CPUs generating hard-to-reclaim inodes while the cleanup would be
single-threaded). This could be addressed by tracking the number of inodes in
that list and if it grows over some limit, we could start throttling
processes when they set the I_RECLAIM_HARD inode flag.
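
The throttling idea can be sketched the same way. This is a hypothetical
userspace model: HARD_RECLAIM_LIMIT and all function names are made up for
illustration, the proposal does not name a limit value, and a real
implementation would need an atomic or per-CPU counter plus an actual wait
mechanism instead of a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit; the proposal only says "some limit". */
#define HARD_RECLAIM_LIMIT 4

struct hard_list_model {
	size_t nr_inodes;	/* inodes queued for deferred eviction */
};

/* Called when the shrinker moves an inode to the hard list. */
static void hard_list_add(struct hard_list_model *l)
{
	l->nr_inodes++;
}

/* Called by the worker after it has evicted one inode. */
static void hard_list_del(struct hard_list_model *l)
{
	if (l->nr_inodes)
		l->nr_inodes--;
}

/*
 * Checked by a process that is about to set I_RECLAIM_HARD: true means
 * the list has grown over the limit and the caller should be throttled.
 */
static bool should_throttle(const struct hard_list_model *l)
{
	return l->nr_inodes > HARD_RECLAIM_LIMIT;
}
```

The point of checking at flag-setting time is that the pushback lands on the
processes creating hard-to-reclaim inodes rather than on reclaim itself.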

There's also a simpler approach to this problem but with more radical
changes to behavior. For example getting rid of inode LRU completely -
inodes without dentries referencing them anymore should be rare and it
isn't very useful to cache them. So we can always drop inodes on last
iput() (as we currently do for example for unlinked inodes). But I have a
nagging feeling that somebody is depending on inode LRU somewhere - I'd
like to poll the collective knowledge of what could possibly go wrong here :)
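
For comparison, the behavioral difference of the LRU-less variant can be
modelled as below. Again this is a hypothetical userspace sketch, not the
kernel's iput(): cached_inode, iput_model and the use_lru switch are invented
names, and the model ignores everything except the reference count and the
link count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an in-memory inode. */
struct cached_inode {
	unsigned int count;	/* references held on the inode */
	unsigned int nlink;	/* 0 means the inode is unlinked */
	bool on_lru;
	bool evicted;
};

/*
 * Drop one reference. With use_lru == true this models the current
 * behavior: a still-linked inode is cached on the LRU after the last
 * reference goes away, an unlinked one is dropped right away. With
 * use_lru == false it models the proposal: every inode is dropped on
 * the last reference.
 */
static void iput_model(struct cached_inode *inode, bool use_lru)
{
	assert(inode->count > 0);
	if (--inode->count > 0)
		return;
	if (use_lru && inode->nlink > 0)
		inode->on_lru = true;
	else
		inode->evicted = true;
}
```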

In the session I'd like to discuss if people see some problems with these
approaches, what they'd prefer etc.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09  9:16 [LSF/MM/BPF TOPIC] Filesystem inode reclaim Jan Kara
@ 2026-04-09 12:57 ` Amir Goldstein
  2026-04-09 16:48   ` Boris Burkov
  2026-04-10  9:54   ` Jan Kara
  2026-04-09 16:12 ` Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 13+ messages in thread
From: Amir Goldstein @ 2026-04-09 12:57 UTC
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc, Boris Burkov

On Thu, Apr 9, 2026 at 11:17 AM Jan Kara <jack@suse.cz> wrote:
>
> Hello!
>
> This is a recurring topic Matthew has been kicking forward for the last
> year so let me maybe offer a fs-person point of view on the problem and
> possible solutions. The problem is very simple: When a filesystem (ext4,
> btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> complex cleanup - like trimming of preallocated blocks beyond end of file,
> making sure journalling machinery is done with the inode, etc.. This may
> require reading metadata into memory which requires memory allocations and
> as inode eviction cannot fail, these are effectively GFP_NOFAIL
> allocations (and there are other reasons why it would be very difficult to
> make some of these required allocations in the filesystems failable).
>
> GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> trigger warnings - and for a good reason as forward progress isn't
> guaranteed. Also it leaves a bad taste that we are performing sometimes
> rather long running operations blocking on IO from reclaim context thus
> stalling reclaim for substantial amount of time to free 1k worth of slab
> cache.
>
> I have been mulling over possible solutions since I don't think each
> filesystem should be inventing a complex inode lifetime management scheme
> as XFS has invented to solve these issues. Here's what I think we could do:
>
> 1) Filesystems will be required to mark inodes that have non-trivial
> cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> whatever :)). Usually I expect this to happen on first inode modification
> or so. This will require some per-fs work but it shouldn't be that
> difficult and filesystems can be adapted one-by-one as they decide to
> address these warnings from reclaim.
>
> 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> performance reasons. I expect this to be a significant portion of inodes
> on average and in particular for some workloads which scan a lot of inodes
> (find through the whole fs or similar) the efficiency of inode reclaim is
> one of the determining factors for their performance.
>
> 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> to process them.
>
> 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> inode, doing the hard work.
>
> This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> and they can work on freeing memory needed for freeing of hard to reclaim
> inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> they should really be addressed.
>
> One possible concern is that s_hard_reclaim_inodes list could grow out of
> control for some workloads (in particular because there could be multiple
> CPUs generating hard to reclaim inodes while the cleanup would be
> single-threaded). This could be addressed by tracking number of inodes in
> that list and if it grows over some limit, we could start throttling
> processes when setting I_RECLAIM_HARD inode flag.
>
> There's also a simpler approach to this problem but with more radical
> changes to behavior. For example getting rid of inode LRU completely -
> inodes without dentries referencing them anymore should be rare and it
> isn't very useful to cache them. So we can always drop inodes on last
> iput() (as we currently do for example for unlinked inodes). But I have a
> nagging feeling that somebody is depending on inode LRU somewhere - I'd
> like poll the collective knowledge of what could possibly go wrong here :)
>
> In the session I'd like to discuss if people see some problems with these
> approaches, what they'd prefer etc.

Hi Jan,

Is this expected to be a FS+MM session or only FS+Matthew?

Boris,

Is this related to the Direct Reclaim Scalability topic you wanted to discuss?
We are still waiting for a posting on this topic.

Thanks,
Amir.



* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09  9:16 [LSF/MM/BPF TOPIC] Filesystem inode reclaim Jan Kara
  2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
@ 2026-04-09 16:12 ` Darrick J. Wong
  2026-04-09 17:37   ` Jeff Layton
  2026-04-10  7:19 ` Christoph Hellwig
  2026-04-10  9:23 ` Christian Brauner
  3 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2026-04-09 16:12 UTC
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc

On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> Hello!
> 
> This is a recurring topic Matthew has been kicking forward for the last
> year so let me maybe offer a fs-person point of view on the problem and
> possible solutions. The problem is very simple: When a filesystem (ext4,
> btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> complex cleanup - like trimming of preallocated blocks beyond end of file,
> making sure journalling machinery is done with the inode, etc.. This may
> require reading metadata into memory which requires memory allocations and
> as inode eviction cannot fail, these are effectively GFP_NOFAIL
> allocations (and there are other reasons why it would be very difficult to
> make some of these required allocations in the filesystems failable).
> 
> GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> trigger warnings - and for a good reason as forward progress isn't
> guaranteed. Also it leaves a bad taste that we are performing sometimes
> rather long running operations blocking on IO from reclaim context thus
> stalling reclaim for substantial amount of time to free 1k worth of slab
> cache.
> 
> I have been mulling over possible solutions since I don't think each
> filesystem should be inventing a complex inode lifetime management scheme
> as XFS has invented to solve these issues. Here's what I think we could do:
> 
> 1) Filesystems will be required to mark inodes that have non-trivial
> cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> whatever :)). Usually I expect this to happen on first inode modification
> or so. This will require some per-fs work but it shouldn't be that
> difficult and filesystems can be adapted one-by-one as they decide to
> address these warnings from reclaim.
> 
> 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> performance reasons. I expect this to be a significant portion of inodes
> on average and in particular for some workloads which scan a lot of inodes
> (find through the whole fs or similar) the efficiency of inode reclaim is
> one of the determining factors for their performance.
> 
> 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> to process them.
> 
> 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> inode, doing the hard work.
> 
> This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> and they can work on freeing memory needed for freeing of hard to reclaim
> inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> they should really be addressed.

This more or less sounds fine to me.

> One possible concern is that s_hard_reclaim_inodes list could grow out of
> control for some workloads (in particular because there could be multiple
> CPUs generating hard to reclaim inodes while the cleanup would be
> single-threaded). This could be addressed by tracking number of inodes in
> that list and if it grows over some limit, we could start throttling
> processes when setting I_RECLAIM_HARD inode flag.

<nod> XFS does that, see xfs_inodegc_want_flush_work in
xfs_inodegc_queue.

> There's also a simpler approach to this problem but with more radical
> changes to behavior. For example getting rid of inode LRU completely -
> inodes without dentries referencing them anymore should be rare and it
> isn't very useful to cache them. So we can always drop inodes on last
> iput() (as we currently do for example for unlinked inodes). But I have a
> nagging feeling that somebody is depending on inode LRU somewhere - I'd
> like poll the collective knowledge of what could possibly go wrong here :)

NFS, possibly? ;)

--D

> In the session I'd like to discuss if people see some problems with these
> approaches, what they'd prefer etc.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
@ 2026-04-09 16:48   ` Boris Burkov
  2026-04-10 10:00     ` Jan Kara
  2026-04-10 11:08     ` Christoph Hellwig
  2026-04-10  9:54   ` Jan Kara
  1 sibling, 2 replies; 13+ messages in thread
From: Boris Burkov @ 2026-04-09 16:48 UTC
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc

On Thu, Apr 09, 2026 at 02:57:47PM +0200, Amir Goldstein wrote:
> On Thu, Apr 9, 2026 at 11:17 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Hello!
> >
> > This is a recurring topic Matthew has been kicking forward for the last
> > year so let me maybe offer a fs-person point of view on the problem and
> > possible solutions. The problem is very simple: When a filesystem (ext4,
> > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > making sure journalling machinery is done with the inode, etc.. This may
> > require reading metadata into memory which requires memory allocations and
> > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > allocations (and there are other reasons why it would be very difficult to
> > make some of these required allocations in the filesystems failable).
> >
> > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > trigger warnings - and for a good reason as forward progress isn't
> > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > rather long running operations blocking on IO from reclaim context thus
> > stalling reclaim for substantial amount of time to free 1k worth of slab
> > cache.
> >
> > I have been mulling over possible solutions since I don't think each
> > filesystem should be inventing a complex inode lifetime management scheme
> > as XFS has invented to solve these issues. Here's what I think we could do:
> >
> > 1) Filesystems will be required to mark inodes that have non-trivial
> > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > whatever :)). Usually I expect this to happen on first inode modification
> > or so. This will require some per-fs work but it shouldn't be that
> > difficult and filesystems can be adapted one-by-one as they decide to
> > address these warnings from reclaim.
> >
> > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > performance reasons. I expect this to be a significant portion of inodes
> > on average and in particular for some workloads which scan a lot of inodes
> > (find through the whole fs or similar) the efficiency of inode reclaim is
> > one of the determining factors for their performance.
> >
> > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > to process them.
> >
> > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > inode, doing the hard work.
> >
> > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > and they can work on freeing memory needed for freeing of hard to reclaim
> > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > they should really be addressed.

One question that pops in my mind (which is similar to an issue you and
Qu debugged with the btrfs metadata reclaim floor earlier this year) is:
what if the hard to reclaim inodes are the *only* source of significant 
reclaimable space?

> >
> > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > control for some workloads (in particular because there could be multiple
> > CPUs generating hard to reclaim inodes while the cleanup would be
> > single-threaded). This could be addressed by tracking number of inodes in
> > that list and if it grows over some limit, we could start throttling
> > processes when setting I_RECLAIM_HARD inode flag.

Anything that pushes back on the "villains" sounds very good to me :)

> >
> > There's also a simpler approach to this problem but with more radical
> > changes to behavior. For example getting rid of inode LRU completely -
> > inodes without dentries referencing them anymore should be rare and it
> > isn't very useful to cache them. So we can always drop inodes on last
> > iput() (as we currently do for example for unlinked inodes). But I have a
> > nagging feeling that somebody is depending on inode LRU somewhere - I'd
> > like poll the collective knowledge of what could possibly go wrong here :)
> >
> > In the session I'd like to discuss if people see some problems with these
> > approaches, what they'd prefer etc.
> 
> Hi Jan,
> 
> Is this expected to be a FS+MM session or only FS+Matthew?
> 
> Boris,
> 
> Is this related to the Direct Reclaim Scalability topic you wanted to discuss?
> We are still waiting for posting on this topic.

Very much related. Thank you for the message. I (and others at Meta) am
working on this general class of problems, so I will send out a separate
message right after this email, but I don't want that to suggest I am
not interested in this particular aspect!

Sorry for the delay with the topic, Amir.

Thanks,
Boris

> 
> Thanks,
> Amir.



* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09 16:12 ` Darrick J. Wong
@ 2026-04-09 17:37   ` Jeff Layton
  2026-04-10  9:43     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2026-04-09 17:37 UTC
  To: Darrick J. Wong, Jan Kara; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc

On Thu, 2026-04-09 at 09:12 -0700, Darrick J. Wong wrote:
> On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> > Hello!
> > 
> > This is a recurring topic Matthew has been kicking forward for the last
> > year so let me maybe offer a fs-person point of view on the problem and
> > possible solutions. The problem is very simple: When a filesystem (ext4,
> > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > making sure journalling machinery is done with the inode, etc.. This may
> > require reading metadata into memory which requires memory allocations and
> > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > allocations (and there are other reasons why it would be very difficult to
> > make some of these required allocations in the filesystems failable).
> > 
> > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > trigger warnings - and for a good reason as forward progress isn't
> > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > rather long running operations blocking on IO from reclaim context thus
> > stalling reclaim for substantial amount of time to free 1k worth of slab
> > cache.
> > 
> > I have been mulling over possible solutions since I don't think each
> > filesystem should be inventing a complex inode lifetime management scheme
> > as XFS has invented to solve these issues. Here's what I think we could do:
> > 
> > 1) Filesystems will be required to mark inodes that have non-trivial
> > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > whatever :)). Usually I expect this to happen on first inode modification
> > or so. This will require some per-fs work but it shouldn't be that
> > difficult and filesystems can be adapted one-by-one as they decide to
> > address these warnings from reclaim.
> > 
> > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > performance reasons. I expect this to be a significant portion of inodes
> > on average and in particular for some workloads which scan a lot of inodes
> > (find through the whole fs or similar) the efficiency of inode reclaim is
> > one of the determining factors for their performance.
> > 
> > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > to process them.
> > 
> > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > inode, doing the hard work.
> > 
> > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > and they can work on freeing memory needed for freeing of hard to reclaim
> > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > they should really be addressed.
> 
> This more or less sounds fine to me.
> 
> > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > control for some workloads (in particular because there could be multiple
> > CPUs generating hard to reclaim inodes while the cleanup would be
> > single-threaded). This could be addressed by tracking number of inodes in
> > that list and if it grows over some limit, we could start throttling
> > processes when setting I_RECLAIM_HARD inode flag.
> 
> <nod> XFS does that, see xfs_inodegc_want_flush_work in
> xfs_inodegc_queue.
> 
> > There's also a simpler approach to this problem but with more radical
> > changes to behavior. For example getting rid of inode LRU completely -
> > inodes without dentries referencing them anymore should be rare and it
> > isn't very useful to cache them. So we can always drop inodes on last
> > iput() (as we currently do for example for unlinked inodes). But I have a
> > nagging feeling that somebody is depending on inode LRU somewhere - I'd
> > like poll the collective knowledge of what could possibly go wrong here :)
> 
> NFS, possibly? ;)
> 

NFS indeed.

Bear in mind that NFS may fail d_revalidate checks on child
dentries when a parent directory is changed on the server. I imagine
some workloads might see a performance hit if a large file's dentry has
to be discarded and looked back up because we suddenly threw away a
bunch of useful data in the pagecache.

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09  9:16 [LSF/MM/BPF TOPIC] Filesystem inode reclaim Jan Kara
  2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
  2026-04-09 16:12 ` Darrick J. Wong
@ 2026-04-10  7:19 ` Christoph Hellwig
  2026-04-10  9:23 ` Christian Brauner
  3 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2026-04-10  7:19 UTC
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc

I think a patch is more useful than a discussion here; that idea has been
voiced multiple times, and effectively implemented in XFS.

Trying to lift the XFS logic into the VFS and finding other consumers
for it would be very helpful.

> 1) Filesystems will be required to mark inodes that have non-trivial
> cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> whatever :)). Usually I expect this to happen on first inode modification
> or so. This will require some per-fs work but it shouldn't be that
> difficult and filesystems can be adapted one-by-one as they decide to
> address these warnings from reclaim.

I think otherwise we call this dirty :)

> 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> performance reasons. I expect this to be a significant portion of inodes
> on average and in particular for some workloads which scan a lot of inodes
> (find through the whole fs or similar) the efficiency of inode reclaim is
> one of the determining factors for their performance.

Yes, in most file systems most inodes are clean.

> There's also a simpler approach to this problem but with more radical
> changes to behavior. For example getting rid of inode LRU completely -
> inodes without dentries referencing them anymore should be rare and it
> isn't very useful to cache them. So we can always drop inodes on last
> iput() (as we currently do for example for unlinked inodes). But I have a
> nagging feeling that somebody is depending on inode LRU somewhere - I'd
> like poll the collective knowledge of what could possibly go wrong here :)

I've heard this theory multiple times, but we really need to validate that
we don't need the LRU.  It also doesn't really solve the above problem,
as we still would not want to perform the expensive inode inactivation
work inline with the last dput.

So while this might be worth investigating, please keep it separate.




* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09  9:16 [LSF/MM/BPF TOPIC] Filesystem inode reclaim Jan Kara
                   ` (2 preceding siblings ...)
  2026-04-10  7:19 ` Christoph Hellwig
@ 2026-04-10  9:23 ` Christian Brauner
  2026-04-10 10:14   ` Jan Kara
  3 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2026-04-10  9:23 UTC
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc

On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> Hello!
> 
> This is a recurring topic Matthew has been kicking forward for the last
> year so let me maybe offer a fs-person point of view on the problem and
> possible solutions. The problem is very simple: When a filesystem (ext4,
> btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> complex cleanup - like trimming of preallocated blocks beyond end of file,
> making sure journalling machinery is done with the inode, etc.. This may
> require reading metadata into memory which requires memory allocations and
> as inode eviction cannot fail, these are effectively GFP_NOFAIL
> allocations (and there are other reasons why it would be very difficult to
> make some of these required allocations in the filesystems failable).
> 
> GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> trigger warnings - and for a good reason as forward progress isn't
> guaranteed. Also it leaves a bad taste that we are performing sometimes
> rather long running operations blocking on IO from reclaim context thus
> stalling reclaim for substantial amount of time to free 1k worth of slab
> cache.
> 
> I have been mulling over possible solutions since I don't think each
> filesystem should be inventing a complex inode lifetime management scheme
> as XFS has invented to solve these issues. Here's what I think we could do:
> 
> 1) Filesystems will be required to mark inodes that have non-trivial
> cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> whatever :)). Usually I expect this to happen on first inode modification
> or so. This will require some per-fs work but it shouldn't be that
> difficult and filesystems can be adapted one-by-one as they decide to
> address these warnings from reclaim.
> 
> 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> performance reasons. I expect this to be a significant portion of inodes
> on average and in particular for some workloads which scan a lot of inodes
> (find through the whole fs or similar) the efficiency of inode reclaim is
> one of the determining factors for their performance.
> 
> 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> to process them.

I like this approach.

> 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> inode, doing the hard work.
> 
> This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> and they can work on freeing memory needed for freeing of hard to reclaim
> inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> they should really be addressed.
> 
> One possible concern is that s_hard_reclaim_inodes list could grow out of
> control for some workloads (in particular because there could be multiple
> CPUs generating hard to reclaim inodes while the cleanup would be
> single-threaded). This could be addressed by tracking number of inodes in

Hm, I don't know. With WQ_UNBOUND, is that really a concern?

> that list and if it grows over some limit, we could start throttling
> processes when setting I_RECLAIM_HARD inode flag.
> 
> There's also a simpler approach to this problem but with more radical
> changes to behavior. For example getting rid of inode LRU completely -
> inodes without dentries referencing them anymore should be rare and it
> isn't very useful to cache them. So we can always drop inodes on last
> iput() (as we currently do for example for unlinked inodes). But I have a
> nagging feeling that somebody is depending on inode LRU somewhere - I'd
> like poll the collective knowledge of what could possibly go wrong here :)

I still think we should try this - for the reduced maintenance cost
alone. Imagine living in a world where there aren't 2 different LRUs
constantly battling for review attention..

I'm split here but depending on the size of the actual work needed to
make this happen we should at least be open to try this.



* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09 17:37   ` Jeff Layton
@ 2026-04-10  9:43     ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-04-10  9:43 UTC
  To: Jeff Layton
  Cc: Darrick J. Wong, Jan Kara, linux-fsdevel, linux-mm,
	Matthew Wilcox, lsf-pc

On Thu 09-04-26 13:37:17, Jeff Layton wrote:
> On Thu, 2026-04-09 at 09:12 -0700, Darrick J. Wong wrote:
> > On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > This is a recurring topic Matthew has been kicking forward for the last
> > > year so let me maybe offer a fs-person point of view on the problem and
> > > possible solutions. The problem is very simple: When a filesystem (ext4,
> > > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > > making sure journalling machinery is done with the inode, etc.. This may
> > > require reading metadata into memory which requires memory allocations and
> > > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > > allocations (and there are other reasons why it would be very difficult to
> > > make some of these required allocations in the filesystems failable).
> > > 
> > > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > > trigger warnings - and for a good reason as forward progress isn't
> > > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > > rather long running operations blocking on IO from reclaim context thus
> > > stalling reclaim for substantial amount of time to free 1k worth of slab
> > > cache.
> > > 
> > > I have been mulling over possible solutions since I don't think each
> > > filesystem should be inventing a complex inode lifetime management scheme
> > > as XFS has invented to solve these issues. Here's what I think we could do:
> > > 
> > > 1) Filesystems will be required to mark inodes that have non-trivial
> > > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > > whatever :)). Usually I expect this to happen on first inode modification
> > > or so. This will require some per-fs work but it shouldn't be that
> > > difficult and filesystems can be adapted one-by-one as they decide to
> > > address these warnings from reclaim.
> > > 
> > > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > > performance reasons. I expect this to be a significant portion of inodes
> > > on average and in particular for some workloads which scan a lot of inodes
> > > (find through the whole fs or similar) the efficiency of inode reclaim is
> > > one of the determining factors for their performance.
> > > 
> > > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > > to process them.
> > > 
> > > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > > inode, doing the hard work.
> > > 
> > > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > > and they can work on freeing memory needed for freeing of hard to reclaim
> > > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > > they should really be addressed.
> > 
> > This more or less sounds fine to me.
> > 
> > > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > > control for some workloads (in particular because there could be multiple
> > > CPUs generating hard to reclaim inodes while the cleanup would be
> > > single-threaded). This could be addressed by tracking number of inodes in
> > > that list and if it grows over some limit, we could start throttling
> > > processes when setting I_RECLAIM_HARD inode flag.
> > 
> > <nod> XFS does that, see xfs_inodegc_want_flush_work in
> > xfs_inodegc_queue.
> > 
> > > There's also a simpler approach to this problem but with more radical
> > > changes to behavior. For example getting rid of inode LRU completely -
> > > inodes without dentries referencing them anymore should be rare and it
> > > isn't very useful to cache them. So we can always drop inodes on last
> > > iput() (as we currently do for example for unlinked inodes). But I have a
> > > nagging feeling that somebody is depending on inode LRU somewhere - I'd
> > > like poll the collective knowledge of what could possibly go wrong here :)
> > 
> > NFS, possibly? ;)
> > 
> 
> NFS indeed.
> 
> Bear in mind that the NFS may fail d_revalidate checks on child
> dentries when a parent directory is changed on the server. I imagine
> some workloads might see a performance hit if a large file's dentry has
> to be discarded and looked back up because we suddenly threw away a
> bunch of useful data in the pagecache.

Thanks for filling in the details! I had vague memories that NFS was
relying on inodes with 0 refcount not getting immediately evicted. I'll
keep that in mind when thinking about solutions.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
  2026-04-09 16:48   ` Boris Burkov
@ 2026-04-10  9:54   ` Jan Kara
  1 sibling, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-04-10  9:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc, Boris Burkov

On Thu 09-04-26 14:57:47, Amir Goldstein wrote:
> On Thu, Apr 9, 2026 at 11:17 AM Jan Kara <jack@suse.cz> wrote:
> > This is a recurring topic Matthew has been kicking forward for the last
> > year so let me maybe offer a fs-person point of view on the problem and
> > possible solutions. The problem is very simple: When a filesystem (ext4,
> > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > making sure journalling machinery is done with the inode, etc.. This may
> > require reading metadata into memory which requires memory allocations and
> > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > allocations (and there are other reasons why it would be very difficult to
> > make some of these required allocations in the filesystems failable).
> >
> > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > trigger warnings - and for a good reason as forward progress isn't
> > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > rather long running operations blocking on IO from reclaim context thus
> > stalling reclaim for substantial amount of time to free 1k worth of slab
> > cache.
> >
> > I have been mulling over possible solutions since I don't think each
> > filesystem should be inventing a complex inode lifetime management scheme
> > as XFS has invented to solve these issues. Here's what I think we could do:
> >
> > 1) Filesystems will be required to mark inodes that have non-trivial
> > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > whatever :)). Usually I expect this to happen on first inode modification
> > or so. This will require some per-fs work but it shouldn't be that
> > difficult and filesystems can be adapted one-by-one as they decide to
> > address these warnings from reclaim.
> >
> > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > performance reasons. I expect this to be a significant portion of inodes
> > on average and in particular for some workloads which scan a lot of inodes
> > (find through the whole fs or similar) the efficiency of inode reclaim is
> > one of the determining factors for their performance.
> >
> > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > to process them.
> >
> > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > inode, doing the hard work.
> >
> > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > and they can work on freeing memory needed for freeing of hard to reclaim
> > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > they should really be addressed.
> >
> > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > control for some workloads (in particular because there could be multiple
> > CPUs generating hard to reclaim inodes while the cleanup would be
> > single-threaded). This could be addressed by tracking number of inodes in
> > that list and if it grows over some limit, we could start throttling
> > processes when setting I_RECLAIM_HARD inode flag.
> >
> > There's also a simpler approach to this problem but with more radical
> > changes to behavior. For example getting rid of inode LRU completely -
> > inodes without dentries referencing them anymore should be rare and it
> > isn't very useful to cache them. So we can always drop inodes on last
> > iput() (as we currently do for example for unlinked inodes). But I have a
> > nagging feeling that somebody is depending on inode LRU somewhere - I'd
> > like poll the collective knowledge of what could possibly go wrong here :)
> >
> > In the session I'd like to discuss if people see some problems with these
> > approaches, what they'd prefer etc.
> 
> Is this expected to be a FS+MM session or only FS+Matthew?

I was thinking of FS+MM but if MM gets too crowded, we could do with FS +
Matthew. I think the problem lies mostly in fs land but it's good to get
some vetting from MM people that we aren't bending slab reclaim assumptions
too much...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09 16:48   ` Boris Burkov
@ 2026-04-10 10:00     ` Jan Kara
  2026-04-10 11:08     ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-04-10 10:00 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Amir Goldstein, Jan Kara, linux-fsdevel, linux-mm,
	Matthew Wilcox, lsf-pc

On Thu 09-04-26 09:48:34, Boris Burkov wrote:
> On Thu, Apr 09, 2026 at 02:57:47PM +0200, Amir Goldstein wrote:
> > On Thu, Apr 9, 2026 at 11:17 AM Jan Kara <jack@suse.cz> wrote:
> > > This is a recurring topic Matthew has been kicking forward for the last
> > > year so let me maybe offer a fs-person point of view on the problem and
> > > possible solutions. The problem is very simple: When a filesystem (ext4,
> > > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > > making sure journalling machinery is done with the inode, etc.. This may
> > > require reading metadata into memory which requires memory allocations and
> > > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > > allocations (and there are other reasons why it would be very difficult to
> > > make some of these required allocations in the filesystems failable).
> > >
> > > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > > trigger warnings - and for a good reason as forward progress isn't
> > > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > > rather long running operations blocking on IO from reclaim context thus
> > > stalling reclaim for substantial amount of time to free 1k worth of slab
> > > cache.
> > >
> > > I have been mulling over possible solutions since I don't think each
> > > filesystem should be inventing a complex inode lifetime management scheme
> > > as XFS has invented to solve these issues. Here's what I think we could do:
> > >
> > > 1) Filesystems will be required to mark inodes that have non-trivial
> > > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > > whatever :)). Usually I expect this to happen on first inode modification
> > > or so. This will require some per-fs work but it shouldn't be that
> > > difficult and filesystems can be adapted one-by-one as they decide to
> > > address these warnings from reclaim.
> > >
> > > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > > performance reasons. I expect this to be a significant portion of inodes
> > > on average and in particular for some workloads which scan a lot of inodes
> > > (find through the whole fs or similar) the efficiency of inode reclaim is
> > > one of the determining factors for their performance.
> > >
> > > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > > to process them.
> > >
> > > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > > inode, doing the hard work.
> > >
> > > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > > and they can work on freeing memory needed for freeing of hard to reclaim
> > > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > > they should really be addressed.
> 
> One question that pops in my mind (which is similar to an issue you and
> Qu debugged with the btrfs metadata reclaim floor earlier this year) is:
> what if the hard to reclaim inodes are the *only* source of significant 
> reclaimable space?

Then we are effectively deadlocked on ENOMEM. That's why I think we'll have
to put some throttling on the creation of hard to reclaim inodes so that
they cannot grow out of control.
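
A minimal user-space sketch of what such throttling could look like,
assuming a per-sb counter and a fixed limit. All names here
(hard_reclaim_stats, hard_reclaim_mark, hard_reclaim_done) are hypothetical
and only illustrate the accounting, not an actual kernel interface:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-superblock accounting; not an existing kernel structure. */
struct hard_reclaim_stats {
	atomic_long nr_hard;	/* inodes currently marked hard to reclaim */
	long limit;		/* above this, new markers should be throttled */
};

static void hard_reclaim_init(struct hard_reclaim_stats *st, long limit)
{
	assert(limit > 0);
	atomic_init(&st->nr_hard, 0);
	st->limit = limit;
}

/*
 * Account one more hard to reclaim inode. Returns true when the caller
 * should wait for the cleanup worker before creating more of them.
 */
static bool hard_reclaim_mark(struct hard_reclaim_stats *st)
{
	long now = atomic_fetch_add(&st->nr_hard, 1) + 1;

	return now > st->limit;
}

/* Called by the cleanup worker after one inode has been evicted. */
static void hard_reclaim_done(struct hard_reclaim_stats *st)
{
	long prev = atomic_fetch_sub(&st->nr_hard, 1);

	assert(prev > 0);	/* never drop below zero */
}
```

With a limit of 2, the third hard_reclaim_mark() call would return true and
the caller would be expected to sleep until the worker has called
hard_reclaim_done() often enough. How the waiting is done is left open here.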

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-10  9:23 ` Christian Brauner
@ 2026-04-10 10:14   ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-04-10 10:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc

Hello!

On Fri 10-04-26 11:23:26, Christian Brauner wrote:
> On Thu, Apr 09, 2026 at 11:16:44AM +0200, Jan Kara wrote:
> > This is a recurring topic Matthew has been kicking forward for the last
> > year so let me maybe offer a fs-person point of view on the problem and
> > possible solutions. The problem is very simple: When a filesystem (ext4,
> > btrfs, vfat) is about to reclaim an inode, it sometimes needs to perform a
> > complex cleanup - like trimming of preallocated blocks beyond end of file,
> > making sure journalling machinery is done with the inode, etc.. This may
> > require reading metadata into memory which requires memory allocations and
> > as inode eviction cannot fail, these are effectively GFP_NOFAIL
> > allocations (and there are other reasons why it would be very difficult to
> > make some of these required allocations in the filesystems failable).
> > 
> > GFP_NOFAIL allocation from reclaim context (be it kswapd or direct reclaim)
> > trigger warnings - and for a good reason as forward progress isn't
> > guaranteed. Also it leaves a bad taste that we are performing sometimes
> > rather long running operations blocking on IO from reclaim context thus
> > stalling reclaim for substantial amount of time to free 1k worth of slab
> > cache.
> > 
> > I have been mulling over possible solutions since I don't think each
> > filesystem should be inventing a complex inode lifetime management scheme
> > as XFS has invented to solve these issues. Here's what I think we could do:
> > 
> > 1) Filesystems will be required to mark inodes that have non-trivial
> > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > whatever :)). Usually I expect this to happen on first inode modification
> > or so. This will require some per-fs work but it shouldn't be that
> > difficult and filesystems can be adapted one-by-one as they decide to
> > address these warnings from reclaim.
> > 
> > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > performance reasons. I expect this to be a significant portion of inodes
> > on average and in particular for some workloads which scan a lot of inodes
> > (find through the whole fs or similar) the efficiency of inode reclaim is
> > one of the determining factors for their performance.
> > 
> > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > to process them.
> 
> I like this approach.
> 
> > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > inode, doing the hard work.
> > 
> > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > and they can work on freeing memory needed for freeing of hard to reclaim
> > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > they should really be addressed.
> > 
> > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > control for some workloads (in particular because there could be multiple
> > CPUs generating hard to reclaim inodes while the cleanup would be
> > single-threaded). This could be addressed by tracking number of inodes in
> 
> Hm, I don't know with WQ_UNBOUND is that really a concern?

I planned to have a single work item processing the inodes, which means a
single CPU cleaning the list even with WQ_UNBOUND. And MM folks tend to be
cautious about these pathological scenarios where all your reclaimable
memory is filled with hard to reclaim objects (dirty pages are the prime
example, which we solved long ago, but dirty / hard to reclaim inodes
aren't really different). I'm definitely open to postponing the throttling
part for later if people are willing to try.
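
To illustrate why a single work item means a single CPU doing the cleanup,
here is a rough user-space model of the deferral, assuming one list and one
work item per superblock. The model_* names are made up for this sketch,
locking is left out entirely, and setting a flag stands in for evict():

```c
#include <assert.h>
#include <stddef.h>

struct model_inode {
	int ino;
	int evicted;
	struct model_inode *next;
};

/* Hypothetical stand-in for the per-sb state described in the proposal. */
struct model_sb {
	struct model_inode *hard_head;	/* models s_hard_reclaim_inodes */
	struct model_inode **hard_tail;
	int work_queued;		/* one work item per superblock */
};

static void model_sb_init(struct model_sb *sb)
{
	sb->hard_head = NULL;
	sb->hard_tail = &sb->hard_head;
	sb->work_queued = 0;
}

/* Shrinker side: do not evict here, just hand the inode to the worker. */
static void model_defer_inode(struct model_sb *sb, struct model_inode *inode)
{
	assert(!inode->evicted);
	inode->next = NULL;
	*sb->hard_tail = inode;
	sb->hard_tail = &inode->next;
	sb->work_queued = 1;	/* queuing an already queued work is a no-op */
}

/* Worker side: one pass over the list, however many CPUs filled it. */
static int model_reclaim_work(struct model_sb *sb)
{
	struct model_inode *inode = sb->hard_head;
	int nr = 0;

	/* Detach the whole list first so new inodes can be deferred meanwhile. */
	sb->hard_head = NULL;
	sb->hard_tail = &sb->hard_head;
	sb->work_queued = 0;

	while (inode) {
		struct model_inode *next = inode->next;

		inode->next = NULL;
		inode->evicted = 1;	/* stands in for the expensive evict() */
		nr++;
		inode = next;
	}
	return nr;
}
```

Any number of producers can call model_defer_inode(), but only
model_reclaim_work() ever evicts, so the drain rate is bounded by one
thread no matter which CPU the workqueue picks to run it.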

> > that list and if it grows over some limit, we could start throttling
> > processes when setting I_RECLAIM_HARD inode flag.
> > 
> > There's also a simpler approach to this problem but with more radical
> > changes to behavior. For example getting rid of inode LRU completely -
> > inodes without dentries referencing them anymore should be rare and it
> > isn't very useful to cache them. So we can always drop inodes on last
> > iput() (as we currently do for example for unlinked inodes). But I have a
> > nagging feeling that somebody is depending on inode LRU somewhere - I'd
> > like poll the collective knowledge of what could possibly go wrong here :)
> 
> I still think we should try this - for the reduced maintenance cost
> alone. Imagine living in a world where there aren't 2 different LRUs
> constantly battling for review attention..
> 
> I'm split here but depending on the size of the actual work needed to
> make this happen we should at least be open to try this.

I'd love to but as Jeff points out, at least NFS depends on the inode LRU
today so we'd have to come up with some way to avoid purging all files in
an NFS directory from the cache on revalidate events. And I don't see a
simple solution for that...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-09 16:48   ` Boris Burkov
  2026-04-10 10:00     ` Jan Kara
@ 2026-04-10 11:08     ` Christoph Hellwig
  2026-04-10 13:58       ` Jan Kara
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2026-04-10 11:08 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Amir Goldstein, Jan Kara, linux-fsdevel, linux-mm,
	Matthew Wilcox, lsf-pc

On Thu, Apr 09, 2026 at 09:48:34AM -0700, Boris Burkov wrote:
> > > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > > and they can work on freeing memory needed for freeing of hard to reclaim
> > > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > > they should really be addressed.
> 
> One question that pops in my mind (which is similar to an issue you and
> Qu debugged with the btrfs metadata reclaim floor earlier this year) is:
> what if the hard to reclaim inodes are the *only* source of significant 
> reclaimable space?

(disk)space or memory?  If this is about disk space, make sure your file
system's ENOSPC handling triggers inode reclaim; XFS already does it.

If it is the only source of memory we just need to do the slow reclaim
to gain memory.  You better use mempools or similar to make it safe.




* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem inode reclaim
  2026-04-10 11:08     ` Christoph Hellwig
@ 2026-04-10 13:58       ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-04-10 13:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Boris Burkov, Amir Goldstein, Jan Kara, linux-fsdevel, linux-mm,
	Matthew Wilcox, lsf-pc

On Fri 10-04-26 04:08:14, Christoph Hellwig wrote:
> On Thu, Apr 09, 2026 at 09:48:34AM -0700, Boris Burkov wrote:
> > > > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > > > and they can work on freeing memory needed for freeing of hard to reclaim
> > > > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > > > they should really be addressed.
> > 
> > One question that pops in my mind (which is similar to an issue you and
> > Qu debugged with the btrfs metadata reclaim floor earlier this year) is:
> > what if the hard to reclaim inodes are the *only* source of significant 
> > reclaimable space?
> 
> (disk)space or memory?  If this about disk space, make sure your file
> system ENOSPC handling triggers inode reclaim, XFS already does it.
> 
> If it is the only source of memory we just need to do the slow reclaim
> to gain memory.  You better use mempools or similar to make it safe.

AFAIU he spoke about memory. Yes, mempools are the standard answer for
guaranteeing forward progress but in this case good luck with properly
tracking down everything that needs to be "mempoolized" (and how large the
pools should be!) to guarantee you can run transactions to clean up an
inode - for each filesystem. I'm not saying it's impossible but IMO it's
way too big an effort for the gain. So I think just throttling hard to
reclaim inode creation is a far more practical solution, although strictly
speaking you cannot guarantee forward progress in all the insane
configurations you can think of.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



end of thread, other threads:[~2026-04-10 13:58 UTC | newest]

Thread overview: 13+ messages
2026-04-09  9:16 [LSF/MM/BPF TOPIC] Filesystem inode reclaim Jan Kara
2026-04-09 12:57 ` [Lsf-pc] " Amir Goldstein
2026-04-09 16:48   ` Boris Burkov
2026-04-10 10:00     ` Jan Kara
2026-04-10 11:08     ` Christoph Hellwig
2026-04-10 13:58       ` Jan Kara
2026-04-10  9:54   ` Jan Kara
2026-04-09 16:12 ` Darrick J. Wong
2026-04-09 17:37   ` Jeff Layton
2026-04-10  9:43     ` Jan Kara
2026-04-10  7:19 ` Christoph Hellwig
2026-04-10  9:23 ` Christian Brauner
2026-04-10 10:14   ` Jan Kara
