* Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
[not found] ` <ZH6K0McWBeCjaf16@dread.disaster.area>
@ 2023-06-06 21:21 ` Kirill Tkhai
2023-06-06 22:30 ` Dave Chinner
2023-06-08 16:36 ` Theodore Ts'o
1 sibling, 1 reply; 9+ messages in thread
From: Kirill Tkhai @ 2023-06-06 21:21 UTC (permalink / raw)
To: Dave Chinner, Roman Gushchin
Cc: akpm, vbabka, viro, brauner, djwong, hughd, paulmck, muchun.song,
linux-mm, linux-fsdevel, linux-xfs, linux-kernel, zhengqi.arch
On 06.06.2023 04:24, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
>> On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
>>> The kernel test robot reports a -88.8% regression in the
>>> stress-ng.ramfs.ops_per_sec test case, caused by commit f95bdb700bc6
>>> ("mm: vmscan: make global slab shrink lockless"). Qi Zheng investigated
>>> and found that the cause is the long synchronize_srcu() in
>>> unregister_shrinker().
>>>
>>> This patch fixes the problem by using new unregistration interfaces,
>>> which split unregister_shrinker() into two parts. The first part only
>>> notifies the shrinker subsystem of the unregistration and prevents
>>> further calls of the shrinker's methods. The second part completes the
>>> unregistration and ensures that the struct shrinker is no longer used
>>> during shrinker chain iteration, so the shrinker memory may be freed.
>>> Since the slow second part is called asynchronously from delayed work,
>>> it hides the synchronize_srcu() delay from the user.
>>>
>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>> ---
>>> fs/super.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/super.c b/fs/super.c
>>> index 8d8d68799b34..f3e4f205ec79 100644
>>> --- a/fs/super.c
>>> +++ b/fs/super.c
>>> @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
>>> destroy_work);
>>> int i;
>>>
>>> + unregister_shrinker_delayed_finalize(&s->s_shrink);
>>> for (i = 0; i < SB_FREEZE_LEVELS; i++)
>>> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
>>> kfree(s);
>>> @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
>>> {
>>> struct file_system_type *fs = s->s_type;
>>> if (atomic_dec_and_test(&s->s_active)) {
>>> - unregister_shrinker(&s->s_shrink);
>>> + unregister_shrinker_delayed_initiate(&s->s_shrink);
>>
>> Hm, it makes the API more complex and easier to mess with. Like what will happen
>> if the second part is never called? Or it's called without the first part being
>> called first?
>
> Bad things.
>
> Also, it doesn't fix the three other unregister_shrinker() calls in
> the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> path.
>
> Those are just some of the unregister_shrinker() calls that have
> dynamic contexts that would also need this same fix; I haven't
> audited the 3 dozen other unregister_shrinker() calls around the
> kernel to determine if any of them need similar treatment, too.
>
> IOWs, this patchset is purely a band-aid to fix the reported
> regression, not an actual fix for the underlying problems caused by
> moving the shrinker infrastructure to SRCU protection. This is why
> I really want the SRCU changeover reverted.
>
> Not only are significant changes to the API necessary, it has also
> put the entire shrinker paths under an SRCU critical section. AIUI,
> this means that while the shrinkers are running the RCU grace period
> cannot expire and no RCU-freed memory will actually get freed until
> the SRCU read lock is dropped by the shrinker.
Why so? Don't SRCU and RCU have different grace periods, so that they
do not prolong each other?
Also, it looks like every SRCU user has its own namespace, like
shrinker_srcu for the shrinkers. Different SRCU namespaces never
prolong each other, do they?
> Given that the superblock shrinkers are freeing dentry and inode
> objects by RCU freeing, this is also a fairly significant change of
> behaviour. i.e. cond_resched() in the shrinker processing loops no
> longer allows RCU grace periods to expire and memory to be freed
> while the shrinkers are running.
>
> Are there problems this will cause? I don't know, but I'm pretty
> sure they haven't even been considered until now....
>
>> Isn't it possible to hide it from a user and call the second part from a work
>> context automatically?
>
> Nope, because it has to be done before the struct shrinker is freed.
> Those are embedded into other structures rather than being
> dynamically allocated objects. Hence the synchronize_srcu() has to
> complete before the structure the shrinker is embedded in is freed.
>
> Now, this can be dealt with by having register_shrinker() return an
> allocated struct shrinker and the callers only keep a pointer, but
> that's an even bigger API change. But, IMO, it is an API change that
> should have been done before SRCU was introduced precisely because
> it allows decoupling of shrinker execution and completion from
> the owning structure.
>
> Then we can stop shrinker execution, wait for it to complete and
> prevent future execution in unregister_shrinker(), then punt the
> expensive shrinker list removal to background work where processing
> delays just don't matter for dead shrinker instances. It doesn't
> need SRCU at all...
>
> -Dave.
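The split discussed in this sub-thread amounts to roughly the following
shape (a minimal sketch only: the function names come from the patch,
but the bodies and the shrinker_rwsem/shrinker_srcu details are
assumptions for illustration, not the actual patch code):

        /* Sketch, not the real patch: stop new invocations and hide the
         * shrinker from future list walks. Fast - no synchronize_srcu(). */
        void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
        {
                if (!(shrinker->flags & SHRINKER_REGISTERED))
                        return;

                down_write(&shrinker_rwsem);
                list_del_rcu(&shrinker->list);
                shrinker->flags &= ~SHRINKER_REGISTERED;
                if (shrinker->flags & SHRINKER_MEMCG_AWARE)
                        unregister_memcg_shrinker(shrinker);
                up_write(&shrinker_rwsem);
        }

        /* Sketch: wait for concurrent SRCU-protected list walkers to drop
         * the shrinker, after which its memory may be freed. */
        void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
        {
                synchronize_srcu(&shrinker_srcu);

                kfree(shrinker->nr_deferred);
                shrinker->nr_deferred = NULL;
        }

In the super_block case the patch calls the initiate part from
deactivate_locked_super() and the finalize part from
destroy_super_work(), which already runs from a workqueue, so umount
never waits for the synchronize_srcu().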
* Re: [PATCH v2 0/3] mm: Make unregistration of super_block shrinker more faster
[not found] ` <ef1b0ecd-5a03-4256-2a7a-3e22b755aa53@ya.ru>
@ 2023-06-06 22:02 ` Dave Chinner
2023-06-07 2:51 ` Qi Zheng
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-06-06 22:02 UTC (permalink / raw)
To: Kirill Tkhai
Cc: akpm, roman.gushchin, vbabka, viro, brauner, djwong, hughd,
paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
linux-kernel, zhengqi.arch
On Wed, Jun 07, 2023 at 12:06:03AM +0300, Kirill Tkhai wrote:
> On 06.06.2023 01:32, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 10:02:46PM +0300, Kirill Tkhai wrote:
> >> This patch set introduces a new scheme of shrinker unregistration. It
> >> allows splitting the unregistration into two parts: fast and slow. This
> >> makes it possible to hide the slow part from the user, so the
> >> user-visible unregistration becomes fast.
> >>
> >> This fixes the -88.8% regression of stress-ng.ramfs.ops_per_sec noticed
> >> by kernel test robot:
> >>
> >> https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
> >>
> >> ---
> >>
> >> Kirill Tkhai (2):
> >> mm: Split unregister_shrinker() in fast and slow part
> >> fs: Use delayed shrinker unregistration
> >
> > Did you test any filesystem other than ramfs?
> >
> > Filesystems more complex than ramfs have internal shrinkers, and so
> > they will still be running the slow synchronize_srcu() - potentially
> > multiple times! - in every unmount. Both XFS and ext4 have 3
> > internal shrinker instances per mount, so they will still call
> > synchronize_srcu() at least 3 times per unmount after this change.
> >
> > What about any other subsystem that runs a shrinker - do they have
> > context-dependent shrinker instances that get frequently created and
> > destroyed? They'll need the same treatment.
>
> Of course, all of the shrinkers should be fixed. This patch set just aims
> to describe the idea more widely, because I'm not sure most people read
> replies to kernel test robot reports.
>
> This is my suggestion for the way to go. Probably Qi is the right person to
> ask whether we're going to extend this and maintain f95bdb700bc6 in the tree.
>
> There is not much time. Unfortunately, the kernel test robot reported
> this quite late.
And that's why it should be reverted rather than rushing to try to
fix it.
I'm kind of tired of finding out about mm reclaim regressions only
when I see patches making naive and/or broken changes to subsystem
shrinker implementations without any real clue about what they are
doing. If people/subsystems who maintain shrinker implementations
were cc'd on the changes to the shrinker implementation, this would
have all been resolved before merging occurred....
Lockless shrinker lists need a heap of supporting changes to be done
first so that they aren't reliant on synchronize_srcu() *at all*. If
the code had been properly designed in the first place (i.e. dynamic
shrinker structures freed via call_rcu()), we wouldn't be rushing
to fix weird regressions right now.
Can we please revert this and start again with a properly thought
out and reviewed design?
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
2023-06-06 21:21 ` [PATCH v2 3/3] fs: Use delayed shrinker unregistration Kirill Tkhai
@ 2023-06-06 22:30 ` Dave Chinner
0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2023-06-06 22:30 UTC (permalink / raw)
To: Kirill Tkhai
Cc: Roman Gushchin, akpm, vbabka, viro, brauner, djwong, hughd,
paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
linux-kernel, zhengqi.arch
On Wed, Jun 07, 2023 at 12:21:42AM +0300, Kirill Tkhai wrote:
> On 06.06.2023 04:24, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> >> On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
> >>> The kernel test robot reports a -88.8% regression in the
> >>> stress-ng.ramfs.ops_per_sec test case, caused by commit f95bdb700bc6
> >>> ("mm: vmscan: make global slab shrink lockless"). Qi Zheng investigated
> >>> and found that the cause is the long synchronize_srcu() in
> >>> unregister_shrinker().
> >>>
> >>> This patch fixes the problem by using new unregistration interfaces,
> >>> which split unregister_shrinker() into two parts. The first part only
> >>> notifies the shrinker subsystem of the unregistration and prevents
> >>> further calls of the shrinker's methods. The second part completes the
> >>> unregistration and ensures that the struct shrinker is no longer used
> >>> during shrinker chain iteration, so the shrinker memory may be freed.
> >>> Since the slow second part is called asynchronously from delayed work,
> >>> it hides the synchronize_srcu() delay from the user.
> >>>
> >>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> >>> ---
> >>> fs/super.c | 3 ++-
> >>> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/super.c b/fs/super.c
> >>> index 8d8d68799b34..f3e4f205ec79 100644
> >>> --- a/fs/super.c
> >>> +++ b/fs/super.c
> >>> @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
> >>> destroy_work);
> >>> int i;
> >>>
> >>> + unregister_shrinker_delayed_finalize(&s->s_shrink);
> >>> for (i = 0; i < SB_FREEZE_LEVELS; i++)
> >>> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> >>> kfree(s);
> >>> @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
> >>> {
> >>> struct file_system_type *fs = s->s_type;
> >>> if (atomic_dec_and_test(&s->s_active)) {
> >>> - unregister_shrinker(&s->s_shrink);
> >>> + unregister_shrinker_delayed_initiate(&s->s_shrink);
> >>
> >> Hm, it makes the API more complex and easier to mess with. Like what will happen
> >> if the second part is never called? Or it's called without the first part being
> >> called first?
> >
> > Bad things.
> >
> > Also, it doesn't fix the three other unregister_shrinker() calls in
> > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> > path.
> >
> > Those are just some of the unregister_shrinker() calls that have
> > dynamic contexts that would also need this same fix; I haven't
> > audited the 3 dozen other unregister_shrinker() calls around the
> > kernel to determine if any of them need similar treatment, too.
> >
> > IOWs, this patchset is purely a band-aid to fix the reported
> > regression, not an actual fix for the underlying problems caused by
> > moving the shrinker infrastructure to SRCU protection. This is why
> > I really want the SRCU changeover reverted.
> >
> > Not only are significant changes to the API necessary, it has also
> > put the entire shrinker paths under an SRCU critical section. AIUI,
> > this means that while the shrinkers are running the RCU grace period
> > cannot expire and no RCU-freed memory will actually get freed until
> > the SRCU read lock is dropped by the shrinker.
>
> Why so? Don't SRCU and RCU have different grace periods, so that they
> do not prolong each other?
No idea - Documentation/RCU/whatisRCU.rst doesn't describe any
differences between SRCU and RCU except for "use SRCU if you need to
sleep in the read side", and there's no discussion of how they
interact, either. Maybe there's some discussion in other RCU
documentation, but there's nothing in the "how to use RCU"
documentation that tells me they use different grace period
definitions...
> Also, it looks like every SRCU user has its own namespace, like
> shrinker_srcu for the shrinkers. Different SRCU namespaces never
> prolong each other, do they?
Right, SRCU vs SRCU is well defined. What is not clear from anything
I've read is the SRCU vs RCU interaction, so I can only assume from
the shared "RCU" in the name that there are shared implementation
details and interactions...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
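For reference, the SRCU primitives in question, with shrinker_srcu as
the domain introduced by the lockless-shrinker patches (a usage sketch,
not code lifted from those patches): each srcu_struct is its own
read-side domain, and synchronize_srcu() on it waits only for readers
of that same srcu_struct.

        DEFINE_STATIC_SRCU(shrinker_srcu);      /* per-subsystem SRCU domain */

        /* read side: what the lockless shrinker list walk relies on */
        int idx = srcu_read_lock(&shrinker_srcu);
        /* ... iterate the shrinker list; readers may sleep here ... */
        srcu_read_unlock(&shrinker_srcu, idx);

        /* write side: waits only for shrinker_srcu readers, after which
         * it is safe to free what those readers might have been using */
        synchronize_srcu(&shrinker_srcu);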
* Re: [PATCH v2 0/3] mm: Make unregistration of super_block shrinker more faster
2023-06-06 22:02 ` [PATCH v2 0/3] mm: Make unregistration of super_block shrinker more faster Dave Chinner
@ 2023-06-07 2:51 ` Qi Zheng
2023-06-08 21:58 ` Dave Chinner
0 siblings, 1 reply; 9+ messages in thread
From: Qi Zheng @ 2023-06-07 2:51 UTC (permalink / raw)
To: Dave Chinner, Kirill Tkhai
Cc: akpm, roman.gushchin, vbabka, viro, brauner, djwong, hughd,
paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
linux-kernel, Qi Zheng
On 2023/6/7 06:02, Dave Chinner wrote:
> On Wed, Jun 07, 2023 at 12:06:03AM +0300, Kirill Tkhai wrote:
>> On 06.06.2023 01:32, Dave Chinner wrote:
>>> On Mon, Jun 05, 2023 at 10:02:46PM +0300, Kirill Tkhai wrote:
>>>> This patch set introduces a new scheme of shrinker unregistration. It
>>>> allows splitting the unregistration into two parts: fast and slow. This
>>>> makes it possible to hide the slow part from the user, so the
>>>> user-visible unregistration becomes fast.
>>>>
>>>> This fixes the -88.8% regression of stress-ng.ramfs.ops_per_sec noticed
>>>> by kernel test robot:
>>>>
>>>> https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>>>>
>>>> ---
>>>>
>>>> Kirill Tkhai (2):
>>>> mm: Split unregister_shrinker() in fast and slow part
>>>> fs: Use delayed shrinker unregistration
>>>
>>> Did you test any filesystem other than ramfs?
>>>
>>> Filesystems more complex than ramfs have internal shrinkers, and so
>>> they will still be running the slow synchronize_srcu() - potentially
>>> multiple times! - in every unmount. Both XFS and ext4 have 3
>>> internal shrinker instances per mount, so they will still call
>>> synchronize_srcu() at least 3 times per unmount after this change.
>>>
>>> What about any other subsystem that runs a shrinker - do they have
>>> context-dependent shrinker instances that get frequently created and
>>> destroyed? They'll need the same treatment.
>>
>> Of course, all of the shrinkers should be fixed. This patch set just aims
>> to describe the idea more widely, because I'm not sure most people read
>> replies to kernel test robot reports.
Thank you, Kirill.
>>
>> This is my suggestion for the way to go. Probably Qi is the right person to
>> ask whether we're going to extend this and maintain f95bdb700bc6 in the tree.
>>
>> There is not much time. Unfortunately, the kernel test robot reported
>> this quite late.
>
> And that's why it should be reverted rather than rushing to try to
> fix it.
>
> I'm kind of tired of finding out about mm reclaim regressions only
> when I see patches making naive and/or broken changes to subsystem
> shrinker implementations without any real clue about what they are
> doing. If people/subsystems who maintain shrinker implementations
> were cc'd on the changes to the shrinker implementation, this would
> have all been resolved before merging occurred....
>
> Lockless shrinker lists need a heap of supporting changes to be done
> first so that they aren't reliant on synchronize_srcu() *at all*. If
> the code had been properly designed in the first place (i.e. dynamic
> shrinker structures freed via call_rcu()), we wouldn't be rushing
> to fix weird regressions right now.
>
> Can we please revert this and start again with a properly thought
> out and reviewed design?
I have no strong opinion on whether to revert this; I'll follow the
final decision of the community.
From my personal point of view, I think sacrificing only the speed of
unregistration is worth it for the benefits it brings (lockless
shrinking, etc.).
Of course, a cleaner solution would be even better.
If you have a better idea, it might be better to post the code first for
discussion. Otherwise, I am afraid that if we just revert it, the
problem of shrinker_rwsem will continue for many years.
And hi Dave, I know you're mad that I didn't cc you in the original
patch. Sorry again. How about splitting the shrinker-related code into
separate files? Then we can add a MAINTAINERS entry for them and add
linux-fsdevel@vger.kernel.org to that entry, so that future
contributors won't forget to cc the fs folks.
Qi.
>
> -Dave.
* Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
[not found] ` <ZH6K0McWBeCjaf16@dread.disaster.area>
2023-06-06 21:21 ` [PATCH v2 3/3] fs: Use delayed shrinker unregistration Kirill Tkhai
@ 2023-06-08 16:36 ` Theodore Ts'o
2023-06-08 23:17 ` Dave Chinner
1 sibling, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2023-06-08 16:36 UTC (permalink / raw)
To: Dave Chinner
Cc: Roman Gushchin, Kirill Tkhai, akpm, vbabka, viro, brauner,
djwong, hughd, paulmck, muchun.song, linux-mm, linux-fsdevel,
linux-xfs, linux-kernel, zhengqi.arch
On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > Hm, it makes the API more complex and easier to mess with. Like what will happen
> > if the second part is never called? Or it's called without the first part being
> > called first?
>
> Bad things.
>
> Also, it doesn't fix the three other unregister_shrinker() calls in
> the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> path.
>
> Those are just some of the unregister_shrinker() calls that have
> dynamic contexts that would also need this same fix; I haven't
> audited the 3 dozen other unregister_shrinker() calls around the
> kernel to determine if any of them need similar treatment, too.
>
> IOWs, this patchset is purely a band-aid to fix the reported
> regression, not an actual fix for the underlying problems caused by
> moving the shrinker infrastructure to SRCU protection. This is why
> I really want the SRCU changeover reverted.
There's been so much traffic on linux-fsdevel that I missed this thread
until Darrick pointed it out to me today. (Thanks, Darrick!)
From his description, and my quick read-through of this thread....
I'm worried.
Given that we're at -rc5 now, and the file system folks didn't get
consulted until fairly late in the progress, and the fact that this
may cause use-after-free problems that could lead to security issues,
perhaps we shoould consider reverting the SRCU changeover now, and try
again for the next merge window?
Thanks!!
- Ted
* Re: [PATCH v2 0/3] mm: Make unregistration of super_block shrinker more faster
2023-06-07 2:51 ` Qi Zheng
@ 2023-06-08 21:58 ` Dave Chinner
0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2023-06-08 21:58 UTC (permalink / raw)
To: Qi Zheng
Cc: Kirill Tkhai, akpm, roman.gushchin, vbabka, viro, brauner,
djwong, hughd, paulmck, muchun.song, linux-mm, linux-fsdevel,
linux-xfs, linux-kernel, Qi Zheng
On Wed, Jun 07, 2023 at 10:51:35AM +0800, Qi Zheng wrote:
> From my personal point of view, I think sacrificing only the speed of
> unregistration is worth it for the benefits it brings (lockless
> shrinking, etc.).
Nobody is questioning whether this is a worthwhile improvement. The
lockless shrinker instance iteration is definitely a good direction
to move in. The problem is that the -process- that has been followed
has led to a very sub-optimal result.
> Of course, a cleaner solution would be even better.
> If you have a better idea, it might be better to post the code first for
> discussion. Otherwise, I am afraid that if we just revert it, the
> problem of shrinker_rwsem will continue for many years.
No, a revert doesn't mean we don't want the change; a revert means
the way the change was attempted has caused unexpected problems.
We need to go back to the drawing board and work out a better way to
do this.
> And hi Dave, I know you're mad that I didn't cc you in the original
> patch.
No, I'm not mad at you.
If I'm annoyed at anyone, it's the senior mm developers and
maintainers that I'm annoyed at - not informing relevant parties
about modifications to shrinker infrastructure or implementations
has led to regressions escaping out to user systems multiple times
in the past.
Yet here we are again....
> Sorry again. How about splitting the shrinker-related code into
> separate files? Then we can add a MAINTAINERS entry for them and add
> linux-fsdevel@vger.kernel.org to that entry, so that future
> contributors won't forget to cc the fs folks.
I don't think that fixes the problem, because the scope is much
wider than fsdevel: look at all the different subsystems
that have a shrinker.
The whole kernel development process has always worked by the rule
that when we're changing common infrastructure, all the subsystems using
that infrastructure need to be cc'd on the changes to the
infrastructure they are using. Just cc'ing -fsdevel isn't enough,
we've also got shrinkers in graphics driver infrastructure, *RCU*,
virtio, DM, bcache and various other subsystems.
And I'm betting most of them don't know that significant changes
have been made to how the shrinkers work....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
2023-06-08 16:36 ` Theodore Ts'o
@ 2023-06-08 23:17 ` Dave Chinner
2023-06-09 0:27 ` Andrew Morton
0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2023-06-08 23:17 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Roman Gushchin, Kirill Tkhai, akpm, vbabka, viro, brauner,
djwong, hughd, paulmck, muchun.song, linux-mm, linux-fsdevel,
linux-xfs, linux-kernel, zhengqi.arch
On Thu, Jun 08, 2023 at 12:36:22PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > > Hm, it makes the API more complex and easier to mess with. Like what will happen
> > > if the second part is never called? Or it's called without the first part being
> > > called first?
> >
> > Bad things.
> >
> > Also, it doesn't fix the three other unregister_shrinker() calls in
> > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> > path.
> >
> > Those are just some of the unregister_shrinker() calls that have
> > dynamic contexts that would also need this same fix; I haven't
> > audited the 3 dozen other unregister_shrinker() calls around the
> > kernel to determine if any of them need similar treatment, too.
> >
> > IOWs, this patchset is purely a band-aid to fix the reported
> > regression, not an actual fix for the underlying problems caused by
> > moving the shrinker infrastructure to SRCU protection. This is why
> > I really want the SRCU changeover reverted.
>
> There's been so much traffic on linux-fsdevel that I missed this thread
> until Darrick pointed it out to me today. (Thanks, Darrick!)
>
> From his description, and my quick read-through of this thread....
> I'm worried.
>
> Given that we're at -rc5 now, and the file system folks didn't get
> consulted until fairly late in the process, and the fact that this
> may cause use-after-free problems that could lead to security issues,
> perhaps we should consider reverting the SRCU changeover now, and try
> again for the next merge window?
Yes, please, because I think we can fix this in a much better way
and make things a whole lot simpler at the same time.
The root cause of the SRCU usage is that mm/memcg developers still
haven't unified the non-memcg and the memcg based shrinker
implementations. shrink_slab_memcg() doesn't require SRCU
protection as it does not iterate the shrinker list at all; it
requires *shrinker instance* lifetime protection. The problem is
shrink_slab() doing the root memcg/global shrinker work - it
iterates the shrinker list directly, and that is the list walk that
SRCU is needed for in order to "make shrinkers lockless".
Going back to shrink_slab_memcg(), it does a lookup of the shrinker
instance by idr_find(); idr_find() is a wrapper around
radix_tree_lookup(), which means we can use RCU protection and
reference counting to both validate the shrinker instance *and*
guarantee that it isn't freed from under us as we execute the
shrinker.
This requires, as I mentioned elsewhere in this thread, that the
shrinker instance be dynamically allocated, not embedded in other
structures. Dynamically allocated shrinker instances and reference
counting them means we can do this in shrink_slab_memcg() to ensure
shrinker instance validity and lifetime rules are followed:
        rcu_read_lock();
        shrinker = idr_find(&shrinker_idr, i);
        if (!shrinker ||
            !refcount_inc_not_zero(&shrinker->refcount)) {
                /* shrinker is being torn down */
                clear_bit(i, info->map);
                rcu_read_unlock();
                continue;
        }
        rcu_read_unlock();

        /* do shrinker stuff */

        if (refcount_dec_and_test(&shrinker->refcount)) {
                /* shrinker is being torn down, waiting for us */
                wake_up(&shrinker->completion_wait);
        }
        /* unsafe to reference shrinker now */
And we enable the shrinker to run simply by:
shrinker_register()
{
        .....
        /* allow the shrinker to run now */
        refcount_set(&shrinker->refcount, 1);
        return 0;
}
We then shut down the shrinker so we can tear it down like so:
shrinker_unregister()
{
        DECLARE_WAITQUEUE(wait, current);

        add_wait_queue_exclusive(&shrinker->completion_wait, &wait);
        if (!refcount_dec_and_test(&shrinker->refcount)) {
                /* Wait for running instances to exit */
                __set_current_state(TASK_UNINTERRUPTIBLE);
                schedule();
        }
        remove_wait_queue(&shrinker->completion_wait, &wait);

        /* We own the entire shrinker instance now, start tearing it down */
        .....

        /* Free the shrinker itself after an RCU grace period expires */
        kfree_rcu(shrinker, rcu_head);
}
So, essentially we don't need SRCU at all to do lockless shrinker
lookup for the memcg shrinker side of the fence, nor do we need
synchronize_srcu() to wait for shrinker instances to finish running
before we can tear stuff down. There is no global state in this at
all; everything is per-shrinker instance.
But SRCU is needed to protect the global shrinker list walk because
it hasn't been converted to always use the root memcg and be
iterated as if it is just another memcg. IOWs, using SRCU to
protect the global shrinker list walk is effectively slapping a
bandaid over a more fundamental problem that we've known about for
a long time.
So the first thing that has to be done here is to unify the shrinker
infrastructure under the memcg implementation. The global shrinker
state should be tracked in the root memcg, just like any other memcg
shrinker is tracked. If memcgs are not enabled, then we should create
a dummy global memcg that a get_root_memcg() helper returns, and
track all the global shrinker state there. Then shrink_slab() just
doesn't care what memcg is passed to it; it just does the same thing.
IOWs, shrink_slab gets entirely replaced by shrink_slab_memcg(), and
the need for SRCU goes away entirely because shrinker instance
lookups are all RCU+refcount protected.
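A minimal sketch of that unification (the get_root_memcg() helper is
the hypothetical one mentioned above, and the details are assumptions
rather than working code):

        static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                         struct mem_cgroup *memcg, int priority)
        {
                /*
                 * Global/root reclaim becomes "reclaim the root memcg":
                 * no separate shrinker_list walk, hence no SRCU needed.
                 */
                if (!memcg)
                        memcg = get_root_memcg();       /* hypothetical helper */

                return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
        }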
Yes, this requires that we change how shrinker instances are instantiated
by subsystems, but that is mostly a simple transformation of existing
code. But it doesn't require changing shrinker implementations, it
doesn't require a new teardown API, and it doesn't change the
context under which shrinkers might run.
All the existing RCU protected stuff in the shrinker maps and memcgs
can remain that way. We can also keep the shrinker rwsem/mutex for
all the little internal bits of setup/teardown/non-rcu coordination
that are needed; once the list iteration is lockless, there will be
almost no contention on that lock at all...
This isn't all that hard - it's just replicating a well known design
pattern (i.e. refcount+RCU) we use all over the kernel combined with
taking advantage of IDR being backed by RCU-safe infrastructure.
If I had been cc'd on the original SRCU patches, this is exactly
what I would have suggested as a better solution to the problem. We
end up with cleaner, more robust and better performing
infrastructure. This is far better than layering more complexity on
top of what is really a poor foundation....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
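To make the caller-side impact of the above concrete, here is roughly
what the transformation could look like for the super_block shrinker.
The shrinker_alloc()/shrinker_register()/shrinker_unregister() names
are placeholders for the hypothetical dynamic-allocation API sketched
above, not an existing interface:

        /* today: the shrinker is embedded in the super_block and
         * registered in place */
        s->s_shrink.count_objects = super_cache_count;
        s->s_shrink.scan_objects = super_cache_scan;
        register_shrinker(&s->s_shrink, "sb-%s", type->name);
        ...
        unregister_shrinker(&s->s_shrink);      /* slow synchronize_srcu() here */

        /* with a dynamically allocated, refcounted instance the owner
         * only keeps a pointer (hypothetical API) */
        s->s_shrink = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
                                     "sb-%s", type->name);
        if (!s->s_shrink)
                return NULL;
        s->s_shrink->count_objects = super_cache_count;
        s->s_shrink->scan_objects = super_cache_scan;
        shrinker_register(s->s_shrink);         /* refcount -> 1, now live */
        ...
        /* waits only for running invocations of this instance, then the
         * final free is deferred via kfree_rcu() inside the core */
        shrinker_unregister(s->s_shrink);

With that shape the wait for in-flight invocations happens per
instance, and no caller ever has to block on a global
synchronize_srcu().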
* Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
2023-06-08 23:17 ` Dave Chinner
@ 2023-06-09 0:27 ` Andrew Morton
2023-06-09 2:50 ` Qi Zheng
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2023-06-09 0:27 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Roman Gushchin, Kirill Tkhai, vbabka, viro,
brauner, djwong, hughd, paulmck, muchun.song, linux-mm,
linux-fsdevel, linux-xfs, linux-kernel, zhengqi.arch
On Fri, 9 Jun 2023 09:17:54 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > Given that we're at -rc5 now, and the file system folks didn't get
> > consulted until fairly late in the process, and the fact that this
> > may cause use-after-free problems that could lead to security issues,
> > perhaps we should consider reverting the SRCU changeover now, and try
> > again for the next merge window?
>
> Yes, please, because I think we can fix this in a much better way
> and make things a whole lot simpler at the same time.
Qi Zheng, if agreeable could you please prepare and send reverts of
475733dda5a ("mm: vmscan: add shrinker_srcu_generation") and of
f95bdb700bc6bb74 ("mm: vmscan: make global slab shrink lockless")?
Thanks.
* Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
2023-06-09 0:27 ` Andrew Morton
@ 2023-06-09 2:50 ` Qi Zheng
0 siblings, 0 replies; 9+ messages in thread
From: Qi Zheng @ 2023-06-09 2:50 UTC (permalink / raw)
To: Andrew Morton, Dave Chinner
Cc: Theodore Ts'o, Roman Gushchin, Kirill Tkhai, vbabka, viro,
brauner, djwong, hughd, paulmck, muchun.song, linux-mm,
linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng
On 2023/6/9 08:27, Andrew Morton wrote:
> On Fri, 9 Jun 2023 09:17:54 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
>>> Given that we're at -rc5 now, and the file system folks didn't get
>>> consulted until fairly late in the process, and the fact that this
>>> may cause use-after-free problems that could lead to security issues,
>>> perhaps we should consider reverting the SRCU changeover now, and try
>>> again for the next merge window?
>>
>> Yes, please, because I think we can fix this in a much better way
>> and make things a whole lot simpler at the same time.
>
> Qi Zheng, if agreeable could you please prepare and send reverts of
> 475733dda5a ("mm: vmscan: add shrinker_srcu_generation") and of
> f95bdb700bc6bb74 ("mm: vmscan: make global slab shrink lockless")?
OK. After reading the proposal suggested by Dave, I think it is more
feasible. I will revert the previous changes ASAP, and then try to
implement Dave's proposal.
Thanks,
Qi
>
> Thanks.