* [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
@ 2025-03-26 15:25 Matthew Wilcox
2025-03-26 15:55 ` Theodore Ts'o
2025-03-26 21:48 ` Dave Chinner
0 siblings, 2 replies; 6+ messages in thread
From: Matthew Wilcox @ 2025-03-26 15:25 UTC (permalink / raw)
To: lsf-pc, linux-mm, linux-fsdevel
Cc: Theodore Y. Ts'o, Chris Mason, Josef Bacik, Luis Chamberlain
We've got three reports now (two are syzkaller kiddie stuff, but one's a
real workload) of a warning in the page allocator from filesystems
doing reclaim. Essentially they're using GFP_NOFAIL from reclaim
context. This got me thinking about bs>PS and I realised that if we fix
this, then we're going to end up trying to do high order GFP_NOFAIL allocations
in the memory reclaim path, and that is really no bueno.
https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
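To make it concrete, the shape of the problem is roughly this (illustrative
only -- the helper below is made up, it's not code from any of the three
reports):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical fs helper that has to bring metadata back into memory. */
static void *fs_load_metadata(unsigned int order)
{
        /*
         * The filesystem can't back out at this point, so it insists on
         * success.  But if we got here from the sb shrinker we are
         * already inside direct reclaim (PF_MEMALLOC is set), and the
         * page allocator warns about a nofail request it may have no way
         * to satisfy.  With bs > PS this becomes a high order nofail
         * request, which is worse still.
         */
        struct folio *folio = folio_alloc(GFP_NOFS | __GFP_NOFAIL, order);

        return folio_address(folio);
}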
I'll prepare a better explainer of the problem in advance of this. It
looks like we have a slot at 17:30 today?
Required attendees: Ted, Luis, Chris, Josef, other people who've wrestled
with this before.
* Re: [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
2025-03-26 15:25 [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF Matthew Wilcox
@ 2025-03-26 15:55 ` Theodore Ts'o
2025-03-26 16:19 ` Matthew Wilcox
2025-03-26 21:48 ` Dave Chinner
1 sibling, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2025-03-26 15:55 UTC (permalink / raw)
To: Matthew Wilcox
Cc: lsf-pc, linux-mm, linux-fsdevel, Chris Mason, Josef Bacik,
Luis Chamberlain
On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
>
> We've got three reports now (two are syzkaller kiddie stuff, but one's a
> real workload) of a warning in the page allocator from filesystems
> doing reclaim. Essentially they're using GFP_NOFAIL from reclaim
> context. This got me thinking about bs>PS and I realised that if we fix
> this, then we're going to end up trying to do high order GFP_NOFAIL allocations
> in the memory reclaim path, and that is really no bueno.
>
> https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
>
> I'll prepare a better explainer of the problem in advance of this.
Thanks for proposing this as a last-minute LSF/MM topic!
I was looking at this myself, and was going to reply to the mail
thread above, but I'll do it here.
From my perspective, the problem is that as part of memory reclaim,
there is an attempt to shrink the inode cache, and there are cases
where an inode's refcount was elevated (for example, because it was
referenced by a dentry), and when the dentry gets flushed, now the
inode can get evicted. But if the inode is one that has been deleted,
then at eviction time the file system will try to release the blocks
associated with the deleted-file. This operation will require memory
allocation, potential I/O, and perhaps waiting for a journal
transaction to complete.
So basically, there is a class of inodes where if we are in reclaim,
we should probably skip trying to evict them because there are very
likely other inodes that will be more likely to result in memory
getting released expeditiously. And if we take a look at
inode_lru_isolate(), there's logic there already about when inodes
should be skipped for eviction. It's probably just a matter of adding
some additional conditions there.
This seems relatively straightforward; what am I missing?
> Required attendees: Ted, Luis, Chris, Josef, other people who've wrestled
> with this before.
Happy to be there! :-)
- Ted
* Re: [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
2025-03-26 15:55 ` Theodore Ts'o
@ 2025-03-26 16:19 ` Matthew Wilcox
2025-03-26 17:47 ` [Lsf-pc] " Jan Kara
0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2025-03-26 16:19 UTC (permalink / raw)
To: Theodore Ts'o
Cc: lsf-pc, linux-mm, linux-fsdevel, Chris Mason, Josef Bacik,
Luis Chamberlain
On Wed, Mar 26, 2025 at 11:55:22AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
> >
> > We've got three reports now (two are syzkaller kiddie stuff, but one's a
> > real workload) of a warning in the page allocator from filesystems
> > doing reclaim. Essentially they're using GFP_NOFAIL from reclaim
> > context. This got me thinking about bs>PS and I realised that if we fix
> > this, then we're going to end up trying to do high order GFP_NOFAIL allocations
> > in the memory reclaim path, and that is really no bueno.
> >
> > https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
> >
> > I'll prepare a better explainer of the problem in advance of this.
>
> Thanks for proposing this as a last-minute LSF/MM topic!
>
> I was looking at this myself, and was going to reply to the mail
> thread above, but I'll do it here.
>
> From my perspective, the problem is that as part of memory reclaim,
> there is an attempt to shrink the inode cache, and there are cases
> where an inode's refcount was elevated (for example, because it was
> referenced by a dentry), and when the dentry gets flushed, now the
> inode can get evicted. But if the inode is one that has been deleted,
> then at eviction time the file system will try to release the blocks
> associated with the deleted-file. This operation will require memory
> allocation, potential I/O, and perhaps waiting for a journal
> transaction to complete.
>
> So basically, there is a class of inodes where if we are in reclaim,
> we should probably skip trying to evict them because there are very
> likely other inodes that will be more likely to result in memory
> getting released expeditiously. And if we take a look at
> inode_lru_isolate(), there's logic there already about when inodes
> should be skipped for eviction. It's probably just a matter of adding
> some additional conditions there.
This is a helpful way of looking at the problem. I was looking at the
problem further down where we've already entered evict_inode(). At that
point we can't fail. My proposal was going to be that the filesystem pin
the metadata that it would need to modify in order to evict the inode.
But avoiding entering evict_inode() is even better.
However, I can't see how inode_lru_isolate() can know whether (looking
at the three reports):
- the ext4 inode table has been reclaimed and ext4 would need to
allocate memory in order to reload the table from disc in order to
evict this inode
- the ext4 block bitmap has been reclaimed and ext4 would need to
allocate memory in order to reload the bitmap from disc to
discard the preallocation
- the fat cluster information has been reclaimed and fat would
need to allocate memory in order to reload the cluster from
disc to update the cluster information
If we did have, say, a callback from inode_lru_isolate() to the filesystem
to find out if the inode can be dropped without memory allocation, that
callback would have to pin the underlying memory in order for it to not
be reclaimed between inode_lru_isolate() and evict_inode().
So maybe it makes sense for ->evict_inode() to change from void to
being able to return an errno, and then change the filesystems to not
set GFP_NOFAIL, and instead just decline to evict the inode.
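Something like this on the filesystem side, as a sketch (foofs and
foofs_load_metadata() are stand-ins, and the caller-side handling in
evict() is the part I'm waving my hands about):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* hypothetical: reads whatever on-disk metadata eviction needs */
static void *foofs_load_metadata(struct inode *inode, gfp_t gfp);

static int foofs_evict_inode(struct inode *inode)
{
        gfp_t gfp = GFP_NOFS;

        /*
         * Only insist on success when we are not the reclaim path
         * itself; otherwise report failure and let the caller keep the
         * inode around for a later, unconstrained attempt.
         */
        if (!(current->flags & PF_MEMALLOC))
                gfp |= __GFP_NOFAIL;

        if (inode->i_nlink == 0 && !foofs_load_metadata(inode, gfp))
                return -ENOMEM;

        /* ... free the blocks, then tear the inode down as today ... */
        truncate_inode_pages_final(&inode->i_data);
        clear_inode(inode);
        return 0;
}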
* Re: [Lsf-pc] [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
2025-03-26 16:19 ` Matthew Wilcox
@ 2025-03-26 17:47 ` Jan Kara
2025-03-26 19:08 ` Chris Mason
0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2025-03-26 17:47 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Theodore Ts'o, lsf-pc, linux-mm, linux-fsdevel, Chris Mason,
Josef Bacik, Luis Chamberlain
On Wed 26-03-25 16:19:32, Matthew Wilcox wrote:
> On Wed, Mar 26, 2025 at 11:55:22AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
> > >
> > > We've got three reports now (two are syzkaller kiddie stuff, but one's a
> > > real workload) of a warning in the page allocator from filesystems
> > > doing reclaim. Essentially they're using GFP_NOFAIL from reclaim
> > > context. This got me thinking about bs>PS and I realised that if we fix
> > > this, then we're going to end up trying to do high order GFP_NOFAIL allocations
> > > in the memory reclaim path, and that is really no bueno.
> > >
> > > https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
> > >
> > > I'll prepare a better explainer of the problem in advance of this.
> >
> > Thanks for proposing this as a last-minute LSF/MM topic!
> >
> > I was looking at this myself, and was going to reply to the mail
> > thread above, but I'll do it here.
> >
> > From my perspective, the problem is that as part of memory reclaim,
> > there is an attempt to shrink the inode cache, and there are cases
> > where an inode's refcount was elevated (for example, because it was
> > referenced by a dentry), and when the dentry gets flushed, now the
> > inode can get evicted. But if the inode is one that has been deleted,
> > then at eviction time the file system will try to release the blocks
> > associated with the deleted-file. This operation will require memory
> > allocation, potential I/O, and perhaps waiting for a journal
> > transaction to complete.
> >
> > So basically, there is a class of inodes where if we are in reclaim,
> > we should probably skip trying to evict them because there are very
> > likely other inodes that will be more likely to result in memory
> > getting released expeditiously. And if we take a look at
> > inode_lru_isolate(), there's logic there already about when inodes
> > should be skipped for eviction. It's probably just a matter of adding
> > some additional conditions there.
>
> This is a helpful way of looking at the problem. I was looking at the
> problem further down where we've already entered evict_inode(). At that
> point we can't fail. My proposal was going to be that the filesystem pin
> the metadata that it would need to modify in order to evict the inode.
> But avoiding entering evict_inode() is even better.
>
> However, I can't see how inode_lru_isolate() can know whether (looking
> at the three reports):
>
> - the ext4 inode table has been reclaimed and ext4 would need to
> allocate memory in order to reload the table from disc in order to
> evict this inode
> - the ext4 block bitmap has been reclaimed and ext4 would need to
> allocate memory in order to reload the bitmap from disc to
> discard the preallocation
> - the fat cluster information has been reclaimed and fat would
> need to allocate memory in order to reload the cluster from
> disc to update the cluster information
Well, I think Ted was speaking about a more "big hammer" approach like
adding:
        if (current->flags & PF_MEMALLOC && !inode->i_nlink) {
                spin_unlock(&inode->i_lock);
                return LRU_SKIP;
        }
to inode_lru_isolate(). The problem isn't with inode_lru_isolate() here,
though, as far as I'm reading the stacktrace. We are scanning the *dentry*
LRU list, killing the dentry, which drops the last reference to the inode,
and iput() then ends up doing all the deletion work. So we would have to
avoid dropping a dentry from the LRU if dentry->d_inode->i_nlink == 0, and
that frankly seems a bit silly to me.
> So maybe it makes sense for ->evict_inode() to change from void to
> being able to return an errno, and then change the filesystems to not
> set GFP_NOFAIL, and instead just decline to evict the inode.
So this would help somewhat but inode deletion is a *heavy* operation (you
can be freeing gigabytes of blocks) so you may end up doing a lot of
metadata IO through the journal and deep in the bowels of the filesystem we
are doing GFP_NOFAIL allocations anyway because there's just no sane way to
unroll what we've started. So I'm afraid that ->evict() doing GFP_NOFAIL
allocation for inodes with inode->i_nlink == 0 is a fact of life that is very
hard to change.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
2025-03-26 17:47 ` [Lsf-pc] " Jan Kara
@ 2025-03-26 19:08 ` Chris Mason
0 siblings, 0 replies; 6+ messages in thread
From: Chris Mason @ 2025-03-26 19:08 UTC (permalink / raw)
To: Jan Kara, Matthew Wilcox
Cc: Theodore Ts'o, lsf-pc, linux-mm, linux-fsdevel, Chris Mason,
Josef Bacik, Luis Chamberlain
On 3/26/25 1:47 PM, Jan Kara wrote:
> On Wed 26-03-25 16:19:32, Matthew Wilcox wrote:
>> On Wed, Mar 26, 2025 at 11:55:22AM -0400, Theodore Ts'o wrote:
>>> On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
>>>>
>>>> We've got three reports now (two are syzkaller kiddie stuff, but one's a
>>>> real workload) of a warning in the page allocator from filesystems
>>>> doing reclaim. Essentially they're using GFP_NOFAIL from reclaim
>>>> context. This got me thinking about bs>PS and I realised that if we fix
>>>> this, then we're going to end up trying to do high order GFP_NOFAIL allocations
>>>> in the memory reclaim path, and that is really no bueno.
>>>>
>>>> https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
>>>>
>>>> I'll prepare a better explainer of the problem in advance of this.
>>>
>>> Thanks for proposing this as a last-minute LSF/MM topic!
>>>
>>> I was looking at this myself, and was going to reply to the mail
>>> thread above, but I'll do it here.
>>>
>>> From my perspective, the problem is that as part of memory reclaim,
>>> there is an attempt to shrink the inode cache, and there are cases
>>> where an inode's refcount was elevated (for example, because it was
>>> referenced by a dentry), and when the dentry gets flushed, now the
>>> inode can get evicted. But if the inode is one that has been deleted,
>>> then at eviction time the file system will try to release the blocks
>>> associated with the deleted-file. This operation will require memory
>>> allocation, potential I/O, and perhaps waiting for a journal
>>> transaction to complete.
>>>
>>> So basically, there is a class of inodes where if we are in reclaim,
>>> we should probably skip trying to evict them because there are very
>>> likely other inodes that will be more likely to result in memory
>>> getting released expeditiously. And if we take a look at
>>> inode_lru_isolate(), there's logic there already about when inodes
>>> should be skipped for eviction. It's probably just a matter of adding
>>> some additional conditions there.
>>
>> This is a helpful way of looking at the problem. I was looking at the
>> problem further down where we've already entered evict_inode(). At that
>> point we can't fail. My proposal was going to be that the filesystem pin
>> the metadata that it would need to modify in order to evict the inode.
>> But avoiding entering evict_inode() is even better.
>>
>> However, I can't see how inode_lru_isolate() can know whether (looking
>> at the three reports):
>>
>> - the ext4 inode table has been reclaimed and ext4 would need to
>> allocate memory in order to reload the table from disc in order to
>> evict this inode
>> - the ext4 block bitmap has been reclaimed and ext4 would need to
>> allocate memory in order to reload the bitmap from disc to
>> discard the preallocation
>> - the fat cluster information has been reclaimed and fat would
>> need to allocate memory in order to reload the cluster from
>> disc to update the cluster information
>
> Well, I think Ted was speaking about a more "big hammer" approach like
> adding:
>
>         if (current->flags & PF_MEMALLOC && !inode->i_nlink) {
>                 spin_unlock(&inode->i_lock);
>                 return LRU_SKIP;
>         }
>
> to inode_lru_isolate(). The problem isn't with inode_lru_isolate() here as
> far as I'm reading the stacktrace. We are scanning *dentry* LRU list,
> killing the dentry which is dropping the last reference to the inode and
> iput() then ends up doing all the deletion work. So we would have to avoid
> dropping dentry from the LRU if dentry->d_inode->i_nlink == 0 and that
> frankly seems a bit silly to me.
>
>> So maybe it makes sense for ->evict_inode() to change from void to
>> being able to return an errno, and then change the filesystems to not
>> set GFP_NOFAIL, and instead just decline to evict the inode.
>
> So this would help somewhat but inode deletion is a *heavy* operation (you
> can be freeing gigabytes of blocks) so you may end up doing a lot of
> metadata IO through the journal and deep in the bowels of the filesystem we
> are doing GFP_NOFAIL allocations anyway because there's just no sane way to
> unroll what we've started. So I'm afraid that ->evict() doing GFP_NOFAIL
> allocation for inodes with inode->i_nlink == 0 is a fact of life that is very
> hard to change.
From a memory reclaim point of view, I think we'll get more traction
from explicitly separating the page reclaim part of i_nlink==0 from
reclaiming disk space. In extreme cases we could kick disk space
reclaim off to other threads, but I'd prioritize reclaiming the pages.
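Roughly what I have in mind, as a sketch only (all the foofs names are
made up, and keeping the in-core inode alive until the worker finishes is
the real problem to solve):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/workqueue.h>

struct foofs_inode_info {
        struct inode vfs_inode;
        struct work_struct free_blocks_work;    /* INIT_WORK'd at alloc time */
        /* ... block mapping, etc ... */
};

/* hypothetical: does the journalled block freeing for an unlinked inode */
static void foofs_free_inode_blocks(struct foofs_inode_info *fi);

static void foofs_free_blocks_worker(struct work_struct *work)
{
        struct foofs_inode_info *fi =
                container_of(work, struct foofs_inode_info, free_blocks_work);

        /*
         * All the blocking, allocating, journalling work happens here,
         * well away from the reclaim path.
         */
        foofs_free_inode_blocks(fi);
}

static void foofs_evict_inode(struct inode *inode)
{
        struct foofs_inode_info *fi =
                container_of(inode, struct foofs_inode_info, vfs_inode);

        /* Drop the pages right away -- the part reclaim actually wants. */
        truncate_inode_pages_final(&inode->i_data);
        clear_inode(inode);

        /*
         * Defer the disk space reclaim.  The fs must keep the in-core
         * inode (or at least its block mapping) alive until the worker
         * is done, e.g. by delaying the final free from ->free_inode.
         */
        if (inode->i_nlink == 0)
                queue_work(system_unbound_wq, &fi->free_blocks_work);
}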
(also, sorry but I'll miss the bof session today)
Related XFS thread:
https://lore.kernel.org/linux-xfs/20190801021752.4986-1-david@fromorbit.com/
-chris
* Re: [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF
2025-03-26 15:25 [LSF/MM/BPF Topic] Filesystem reclaim & memory allocation BOF Matthew Wilcox
2025-03-26 15:55 ` Theodore Ts'o
@ 2025-03-26 21:48 ` Dave Chinner
1 sibling, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2025-03-26 21:48 UTC (permalink / raw)
To: Matthew Wilcox
Cc: lsf-pc, linux-mm, linux-fsdevel, Theodore Y. Ts'o,
Chris Mason, Josef Bacik, Luis Chamberlain
On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
>
> We've got three reports now (two are syzkaller kiddie stuff, but one's a
> real workload) of a warning in the page allocator from filesystems
> doing reclaim. Essentially they're using GFP_NOFAIL from reclaim
> context. This got me thinking about bs>PS and I realised that if we fix
> this, then we're going to end up trying to do high order GFP_NOFAIL allocations
> in the memory reclaim path, and that is really no bueno.
>
> https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@readmodwrite.com/
Anything that does IO or blocking memory allocation from evict()
context is a deadlock vector. They will also cause unpredictable
memory allocation latency as direct reclaim can get stuck on them.
The case that was brought up here is overlay dropping the last
reference to an inode from dentry cache reclaim, and that inode
having evict() run on it.
The filesystems then make journal reservations (which can block
waiting on IO), do memory allocations (which can block waiting on IO
and/or stall in direct memory reclaim), do IO directly from that
context, etc.
Memory reclaim is supposed to be a non-blocking operation, so inode
reclaim really needs to avoid blocking or doing complex stuff that
requires memory allocation or IO in the direct evict() path.
Indeed, people spent -years- complaining that XFS did IO from
evict() context from direct memory reclaim because this caused
unacceptable memory allocation latency variations. It required
significant architectural changes to XFS inode journalling and
writeback to avoid blocking RMW IO during inode reclaim. It's also
one of the driving reasons for XFS aggressively pushing *any*
XFS-specific inode reclaim work that could block to background
inodegc workers that run after ->destroy_inode has removed the inode
from VFS visibility.
As I understand it, Josef's recent inode reference counting changes
will help with this, allowing the filesystem to hold a passive
reference to the inode whilst it gets pushed to a background
context where the fs-specific cleanup code is allowed to block. This
is probably the direction we need to head to solve this problem in a
generic manner....
-Dave.
--
Dave Chinner
david@fromorbit.com