* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify [not found] <20220319001635.4097742-1-khazhy@google.com> @ 2022-03-19 0:36 ` Trond Myklebust 2022-03-19 1:45 ` Khazhy Kumykov 2022-03-19 9:36 ` Amir Goldstein 0 siblings, 2 replies; 22+ messages in thread From: Trond Myklebust @ 2022-03-19 0:36 UTC (permalink / raw) To: bfields, khazhy, chuck.lever; +Cc: linux-mm, linux-nfs, linux-kernel On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > result > in recursing back into nfsd, resulting in deadlock. See below stack. > > nfsd D 0 1591536 2 0x80004080 > Call Trace: > __schedule+0x497/0x630 > schedule+0x67/0x90 > schedule_preempt_disabled+0xe/0x10 > __mutex_lock+0x347/0x4b0 > fsnotify_destroy_mark+0x22/0xa0 > nfsd_file_free+0x79/0xd0 [nfsd] > nfsd_file_put_noref+0x7c/0x90 [nfsd] > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > nfsd_file_lru_scan+0x57/0x80 [nfsd] > do_shrink_slab+0x1f2/0x330 > shrink_slab+0x244/0x2f0 > shrink_node+0xd7/0x490 > do_try_to_free_pages+0x12f/0x3b0 > try_to_free_pages+0x43f/0x540 > __alloc_pages_slowpath+0x6ab/0x11c0 > __alloc_pages_nodemask+0x274/0x2c0 > alloc_slab_page+0x32/0x2e0 > new_slab+0xa6/0x8b0 > ___slab_alloc+0x34b/0x520 > kmem_cache_alloc+0x1c4/0x250 > fsnotify_add_mark_locked+0x18d/0x4c0 > fsnotify_add_mark+0x48/0x70 > nfsd_file_acquire+0x570/0x6f0 [nfsd] > nfsd_read+0xa7/0x1c0 [nfsd] > nfsd3_proc_read+0xc1/0x110 [nfsd] > nfsd_dispatch+0xf7/0x240 [nfsd] > svc_process_common+0x2f4/0x610 [sunrpc] > svc_process+0xf9/0x110 [sunrpc] > nfsd+0x10e/0x180 [nfsd] > kthread+0x130/0x140 > ret_from_fork+0x35/0x40 > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > --- > fs/nfsd/filecache.c | 4 ++++ > 1 file changed, 4 insertions(+) > > Marking this RFC since I haven't actually had a chance to test this, > we > we're seeing this deadlock for some customers. > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index fdf89fcf1a0c..a14760f9b486 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > *nf) > struct fsnotify_mark *mark; > struct nfsd_file_mark *nfm = NULL, *new; > struct inode *inode = nf->nf_inode; > + unsigned int pflags; > > do { > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > *nf) > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > refcount_set(&new->nfm_ref, 1); > > + /* fsnotify allocates, avoid recursion back into nfsd > */ > + pflags = memalloc_nofs_save(); > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > 0); > + memalloc_nofs_restore(pflags); > > /* > * If the add was successful, then return the object. Isn't that stack trace showing a slab direct reclaim, and not a filesystem writeback situation? Does memalloc_nofs_save()/restore() really fix this problem? It seems to me that it cannot, particularly since knfsd is not a filesystem, and so does not ever handle writeback of dirty pages. Cc: the linux-mm mailing list in search of answers to the above 2 questions. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify
  2022-03-19  0:36 ` [PATCH RFC] nfsd: avoid recursive locking through fsnotify Trond Myklebust
@ 2022-03-19  1:45   ` Khazhy Kumykov
  2022-03-19  9:36   ` Amir Goldstein
  1 sibling, 0 replies; 22+ messages in thread
From: Khazhy Kumykov @ 2022-03-19  1:45 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, chuck.lever, linux-mm, linux-nfs, linux-kernel

On Fri, Mar 18, 2022 at 5:36 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.

Ah, you're right. An alternative would be delaying the destroy_mark: I notice
now that, on mainline, the shrinker calls nfsd_file_dispose_list_delayed,
something missing from the 5.4 version I was looking at. To my reading,
9542e6a643fc6 ("nfsd: Containerise filecache laundrette") should have the
effect of fixing this deadlock (and its commit message explicitly calls out
notifier deadlocks) - maybe something to send towards stable?

>
> Cc: the linux-mm mailing list in search of answers to the above 2
> questions.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>

^ permalink raw reply	[flat|nested] 22+ messages in thread
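To make the delayed-dispose idea above concrete, here is a minimal sketch,
assuming it slots into fs/nfsd/filecache.c next to the existing struct
nfsd_file and nfsd_file_free(); the list, lock and work-item names are
invented for illustration and do not claim to match the mainline laundrette
code.

/*
 * Sketch only: defer the actual freeing (and hence fsnotify_destroy_mark(),
 * which takes group->mark_mutex) out of the shrinker and into a workqueue.
 * struct nfsd_file, its nf_lru list head and nfsd_file_free() are the
 * existing filecache pieces seen in the stack trace above.
 */
static LIST_HEAD(nfsd_file_dispose_list);
static DEFINE_SPINLOCK(nfsd_file_dispose_lock);

static void nfsd_file_delayed_dispose(struct work_struct *work)
{
	struct nfsd_file *nf;

	spin_lock(&nfsd_file_dispose_lock);
	while (!list_empty(&nfsd_file_dispose_list)) {
		nf = list_first_entry(&nfsd_file_dispose_list,
				      struct nfsd_file, nf_lru);
		list_del_init(&nf->nf_lru);
		spin_unlock(&nfsd_file_dispose_lock);
		/* Safe: we are in workqueue context, not direct reclaim. */
		nfsd_file_free(nf);
		spin_lock(&nfsd_file_dispose_lock);
	}
	spin_unlock(&nfsd_file_dispose_lock);
}

static DECLARE_WORK(nfsd_file_dispose_work, nfsd_file_delayed_dispose);

/* Called from the LRU shrinker instead of freeing the entries in place. */
static void nfsd_file_dispose_later(struct list_head *victims)
{
	spin_lock(&nfsd_file_dispose_lock);
	list_splice_init(victims, &nfsd_file_dispose_list);
	spin_unlock(&nfsd_file_dispose_lock);
	schedule_work(&nfsd_file_dispose_work);
}

The point of the pattern is simply that mark_mutex is then only ever taken
from process or workqueue context, never from inside shrink_slab().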
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-19 0:36 ` [PATCH RFC] nfsd: avoid recursive locking through fsnotify Trond Myklebust 2022-03-19 1:45 ` Khazhy Kumykov @ 2022-03-19 9:36 ` Amir Goldstein 2022-03-21 11:23 ` Jan Kara 2022-03-21 17:06 ` Khazhy Kumykov 1 sibling, 2 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-19 9:36 UTC (permalink / raw) To: Trond Myklebust, Jan Kara Cc: bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > result > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > nfsd D 0 1591536 2 0x80004080 > > Call Trace: > > __schedule+0x497/0x630 > > schedule+0x67/0x90 > > schedule_preempt_disabled+0xe/0x10 > > __mutex_lock+0x347/0x4b0 > > fsnotify_destroy_mark+0x22/0xa0 > > nfsd_file_free+0x79/0xd0 [nfsd] > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > do_shrink_slab+0x1f2/0x330 > > shrink_slab+0x244/0x2f0 > > shrink_node+0xd7/0x490 > > do_try_to_free_pages+0x12f/0x3b0 > > try_to_free_pages+0x43f/0x540 > > __alloc_pages_slowpath+0x6ab/0x11c0 > > __alloc_pages_nodemask+0x274/0x2c0 > > alloc_slab_page+0x32/0x2e0 > > new_slab+0xa6/0x8b0 > > ___slab_alloc+0x34b/0x520 > > kmem_cache_alloc+0x1c4/0x250 > > fsnotify_add_mark_locked+0x18d/0x4c0 > > fsnotify_add_mark+0x48/0x70 > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > nfsd_read+0xa7/0x1c0 [nfsd] > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > nfsd_dispatch+0xf7/0x240 [nfsd] > > svc_process_common+0x2f4/0x610 [sunrpc] > > svc_process+0xf9/0x110 [sunrpc] > > nfsd+0x10e/0x180 [nfsd] > > kthread+0x130/0x140 > > ret_from_fork+0x35/0x40 > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > --- > > fs/nfsd/filecache.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > Marking this RFC since I haven't actually had a chance to test this, > > we > > we're seeing this deadlock for some customers. > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index fdf89fcf1a0c..a14760f9b486 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > *nf) > > struct fsnotify_mark *mark; > > struct nfsd_file_mark *nfm = NULL, *new; > > struct inode *inode = nf->nf_inode; > > + unsigned int pflags; > > > > do { > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > *nf) > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > refcount_set(&new->nfm_ref, 1); > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > */ > > + pflags = memalloc_nofs_save(); > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > 0); > > + memalloc_nofs_restore(pflags); > > > > /* > > * If the add was successful, then return the object. > > Isn't that stack trace showing a slab direct reclaim, and not a > filesystem writeback situation? > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > to me that it cannot, particularly since knfsd is not a filesystem, and > so does not ever handle writeback of dirty pages. > Maybe NOFS throttles direct reclaims to the point that the problem is harder to hit? This report came in at good timing for me. 
It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
As far as I can tell, nfsd filecache is currently the only fsnotify backend
that frees fsnotify marks in a memory shrinker. "Volatile" fanotify marks
would also be evictable in that way, so they would expose fanotify to this
deadlock.

For the short term, maybe nfsd filecache can avoid the problem by checking
mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?

Jan,

A relatively simple fix would be to allocate the fsnotify_mark_connector in
fsnotify_add_mark() and free it if a connector already exists for the object.
I don't think there is a good reason to optimize away this allocation for the
case of a non-first group setting a mark on an object.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/

^ permalink raw reply	[flat|nested] 22+ messages in thread
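A minimal sketch of the short-term workaround suggested above, assuming the
existing nfsd_file_lru, nfsd_file_lru_cb and nfsd_file_lru_dispose() helpers;
since no mutex_is_locked_by_me() helper exists, plain mutex_is_locked() is
used, which errs on the side of skipping reclaim whenever any task holds the
mutex.

/*
 * Sketch: skip this shrink attempt entirely if the group's mark_mutex is
 * held by anyone.  mutex_is_locked() cannot tell whether the holder is the
 * current task (hence the wish for a mutex_is_locked_by_me() helper), so
 * this occasionally gives up a reclaim opportunity to stay deadlock-safe.
 */
static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	LIST_HEAD(head);
	unsigned long ret;

	if (mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex))
		return SHRINK_STOP;

	ret = list_lru_shrink_walk(&nfsd_file_lru, sc, nfsd_file_lru_cb, &head);
	nfsd_file_lru_dispose(&head);
	return ret;
}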
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-19 9:36 ` Amir Goldstein @ 2022-03-21 11:23 ` Jan Kara 2022-03-21 11:56 ` Amir Goldstein 2022-03-21 22:50 ` Trond Myklebust 2022-03-21 17:06 ` Khazhy Kumykov 1 sibling, 2 replies; 22+ messages in thread From: Jan Kara @ 2022-03-21 11:23 UTC (permalink / raw) To: Amir Goldstein Cc: Trond Myklebust, Jan Kara, bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > result > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > nfsd D 0 1591536 2 0x80004080 > > > Call Trace: > > > __schedule+0x497/0x630 > > > schedule+0x67/0x90 > > > schedule_preempt_disabled+0xe/0x10 > > > __mutex_lock+0x347/0x4b0 > > > fsnotify_destroy_mark+0x22/0xa0 > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > do_shrink_slab+0x1f2/0x330 > > > shrink_slab+0x244/0x2f0 > > > shrink_node+0xd7/0x490 > > > do_try_to_free_pages+0x12f/0x3b0 > > > try_to_free_pages+0x43f/0x540 > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > __alloc_pages_nodemask+0x274/0x2c0 > > > alloc_slab_page+0x32/0x2e0 > > > new_slab+0xa6/0x8b0 > > > ___slab_alloc+0x34b/0x520 > > > kmem_cache_alloc+0x1c4/0x250 > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > fsnotify_add_mark+0x48/0x70 > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > svc_process+0xf9/0x110 [sunrpc] > > > nfsd+0x10e/0x180 [nfsd] > > > kthread+0x130/0x140 > > > ret_from_fork+0x35/0x40 > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > --- > > > fs/nfsd/filecache.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > we > > > we're seeing this deadlock for some customers. > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > struct fsnotify_mark *mark; > > > struct nfsd_file_mark *nfm = NULL, *new; > > > struct inode *inode = nf->nf_inode; > > > + unsigned int pflags; > > > > > > do { > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > refcount_set(&new->nfm_ref, 1); > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > */ > > > + pflags = memalloc_nofs_save(); > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > 0); > > > + memalloc_nofs_restore(pflags); > > > > > > /* > > > * If the add was successful, then return the object. > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > filesystem writeback situation? > > > > Does memalloc_nofs_save()/restore() really fix this problem? 
It seems > > to me that it cannot, particularly since knfsd is not a filesystem, and > > so does not ever handle writeback of dirty pages. > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > harder to hit? > > This report came in at good timing for me. > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > be evictable in that way, so they would expose fanotify to this deadlock. > > For the short term, maybe nfsd filecache can avoid the problem by checking > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > Jan, > > A relatively simple fix would be to allocate fsnotify_mark_connector in > fsnotify_add_mark() and free it, if a connector already exists for the object. > I don't think there is a good reason to optimize away this allocation > for the case of a non-first group to set a mark on an object? Indeed, nasty. Volatile marks will add group->mark_mutex into a set of locks grabbed during inode slab reclaim. So any allocation under group->mark_mutex has to be GFP_NOFS now. This is not just about connector allocations but also mark allocations for fanotify. Moving allocations from under mark_mutex is also possible solution but passing preallocated memory around is kind of ugly as well. So the cleanest solution I currently see is to come up with helpers like "fsnotify_lock_group() & fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do memalloc_nofs_save / restore magic. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
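A rough sketch of the helpers proposed above; the exact interface (for
example, where the saved allocation-context flags live) is an open question,
so here they are simply returned to the caller.

/*
 * Sketch of the proposed wrappers: take the group's mark_mutex and enter a
 * scoped NOFS section, so that any allocation done while the mutex is held
 * cannot recurse into a shrinker that needs the same mutex.
 * memalloc_nofs_save/restore come from <linux/sched/mm.h>.
 */
static inline unsigned int fsnotify_lock_group(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	return memalloc_nofs_save();
}

static inline void fsnotify_unlock_group(struct fsnotify_group *group,
					 unsigned int nofs_flags)
{
	memalloc_nofs_restore(nofs_flags);
	mutex_unlock(&group->mark_mutex);
}

Call sites in fsnotify_add_mark(), fsnotify_destroy_mark() and the backends
would then use these wrappers instead of taking group->mark_mutex directly.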
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 11:23 ` Jan Kara @ 2022-03-21 11:56 ` Amir Goldstein 2022-03-21 14:51 ` Jan Kara 2022-03-21 22:50 ` Trond Myklebust 1 sibling, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-21 11:56 UTC (permalink / raw) To: Jan Kara Cc: Trond Myklebust, bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote: > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > result > > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > Call Trace: > > > > __schedule+0x497/0x630 > > > > schedule+0x67/0x90 > > > > schedule_preempt_disabled+0xe/0x10 > > > > __mutex_lock+0x347/0x4b0 > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > do_shrink_slab+0x1f2/0x330 > > > > shrink_slab+0x244/0x2f0 > > > > shrink_node+0xd7/0x490 > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > try_to_free_pages+0x43f/0x540 > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > alloc_slab_page+0x32/0x2e0 > > > > new_slab+0xa6/0x8b0 > > > > ___slab_alloc+0x34b/0x520 > > > > kmem_cache_alloc+0x1c4/0x250 > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > fsnotify_add_mark+0x48/0x70 > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > svc_process+0xf9/0x110 [sunrpc] > > > > nfsd+0x10e/0x180 [nfsd] > > > > kthread+0x130/0x140 > > > > ret_from_fork+0x35/0x40 > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > --- > > > > fs/nfsd/filecache.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > > we > > > > we're seeing this deadlock for some customers. > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > *nf) > > > > struct fsnotify_mark *mark; > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > struct inode *inode = nf->nf_inode; > > > > + unsigned int pflags; > > > > > > > > do { > > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > *nf) > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > > */ > > > > + pflags = memalloc_nofs_save(); > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > > 0); > > > > + memalloc_nofs_restore(pflags); > > > > > > > > /* > > > > * If the add was successful, then return the object. > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > filesystem writeback situation? 
> > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > > to me that it cannot, particularly since knfsd is not a filesystem, and > > > so does not ever handle writeback of dirty pages. > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > > harder to hit? > > > > This report came in at good timing for me. > > > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > > be evictable in that way, so they would expose fanotify to this deadlock. > > > > For the short term, maybe nfsd filecache can avoid the problem by checking > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > > > Jan, > > > > A relatively simple fix would be to allocate fsnotify_mark_connector in > > fsnotify_add_mark() and free it, if a connector already exists for the object. > > I don't think there is a good reason to optimize away this allocation > > for the case of a non-first group to set a mark on an object? > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of > locks grabbed during inode slab reclaim. So any allocation under > group->mark_mutex has to be GFP_NOFS now. This is not just about connector > allocations but also mark allocations for fanotify. Moving allocations from > under mark_mutex is also possible solution but passing preallocated memory > around is kind of ugly as well. Yes, kind of, here is how it looks: https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3 https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7 > So the cleanest solution I currently see is > to come up with helpers like "fsnotify_lock_group() & > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > memalloc_nofs_save / restore magic. > Sounds good. Won't this cause a regression - more failures to setup new mark under memory pressure? Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? and set NOFS state only in that case, so at least we don't cause regression for existing applications? Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 11:56 ` Amir Goldstein @ 2022-03-21 14:51 ` Jan Kara 2022-03-22 22:41 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-21 14:51 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Trond Myklebust, bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Mon 21-03-22 13:56:47, Amir Goldstein wrote: > On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote: > > > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > > result > > > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > > Call Trace: > > > > > __schedule+0x497/0x630 > > > > > schedule+0x67/0x90 > > > > > schedule_preempt_disabled+0xe/0x10 > > > > > __mutex_lock+0x347/0x4b0 > > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > > do_shrink_slab+0x1f2/0x330 > > > > > shrink_slab+0x244/0x2f0 > > > > > shrink_node+0xd7/0x490 > > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > > try_to_free_pages+0x43f/0x540 > > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > > alloc_slab_page+0x32/0x2e0 > > > > > new_slab+0xa6/0x8b0 > > > > > ___slab_alloc+0x34b/0x520 > > > > > kmem_cache_alloc+0x1c4/0x250 > > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > > fsnotify_add_mark+0x48/0x70 > > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > > svc_process+0xf9/0x110 [sunrpc] > > > > > nfsd+0x10e/0x180 [nfsd] > > > > > kthread+0x130/0x140 > > > > > ret_from_fork+0x35/0x40 > > > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > > > we > > > > > we're seeing this deadlock for some customers. 
> > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > > *nf) > > > > > struct fsnotify_mark *mark; > > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > > struct inode *inode = nf->nf_inode; > > > > > + unsigned int pflags; > > > > > > > > > > do { > > > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > > *nf) > > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > > > */ > > > > > + pflags = memalloc_nofs_save(); > > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > > > 0); > > > > > + memalloc_nofs_restore(pflags); > > > > > > > > > > /* > > > > > * If the add was successful, then return the object. > > > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > > filesystem writeback situation? > > > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > > > to me that it cannot, particularly since knfsd is not a filesystem, and > > > > so does not ever handle writeback of dirty pages. > > > > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > > > harder to hit? > > > > > > This report came in at good timing for me. > > > > > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > > > be evictable in that way, so they would expose fanotify to this deadlock. > > > > > > For the short term, maybe nfsd filecache can avoid the problem by checking > > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > > > > > Jan, > > > > > > A relatively simple fix would be to allocate fsnotify_mark_connector in > > > fsnotify_add_mark() and free it, if a connector already exists for the object. > > > I don't think there is a good reason to optimize away this allocation > > > for the case of a non-first group to set a mark on an object? > > > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of > > locks grabbed during inode slab reclaim. So any allocation under > > group->mark_mutex has to be GFP_NOFS now. This is not just about connector > > allocations but also mark allocations for fanotify. Moving allocations from > > under mark_mutex is also possible solution but passing preallocated memory > > around is kind of ugly as well. > > Yes, kind of, here is how it looks: > https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3 > https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7 Yup, not an act of beauty but bearable in the worst case :). > > So the cleanest solution I currently see is > > to come up with helpers like "fsnotify_lock_group() & > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > memalloc_nofs_save / restore magic. > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > under memory pressure? 
Well, yes, the chances of hitting ENOMEM under heavy memory pressure are higher. But I don't think that much memory is consumed by connectors or marks that the reduced chances for direct reclaim would really substantially matter for the system as a whole. > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > and set NOFS state only in that case, so at least we don't cause regression > for existing applications? So that's a possibility I've left in my sleeve ;). We could do it but then we'd also have to tell lockdep that there are two kinds of mark_mutex locks so that it does not complain about possible reclaim deadlocks. Doable but at this point I didn't consider it worth it unless someone comes with a bug report from a real user scenario. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 14:51 ` Jan Kara @ 2022-03-22 22:41 ` Amir Goldstein 2022-03-23 10:41 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-22 22:41 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel > > > So the cleanest solution I currently see is > > > to come up with helpers like "fsnotify_lock_group() & > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > memalloc_nofs_save / restore magic. > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > under memory pressure? > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > higher. But I don't think that much memory is consumed by connectors or > marks that the reduced chances for direct reclaim would really > substantially matter for the system as a whole. > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > and set NOFS state only in that case, so at least we don't cause regression > > for existing applications? > > So that's a possibility I've left in my sleeve ;). We could do it but then > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > so that it does not complain about possible reclaim deadlocks. Doable but > at this point I didn't consider it worth it unless someone comes with a bug > report from a real user scenario. Are you sure about that? Note that fsnotify_destroy_mark() and friends already use lockdep class SINGLE_DEPTH_NESTING, so I think the lockdep annotation already assumes that deadlock from direct reclaim cannot happen and it is that assumption that was nearly broken by evictable inode marks. IIUC that means that we only need to wrap the fanotify allocations with GFP_NOFS (technically only after the first evictable mark)? Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-22 22:41 ` Amir Goldstein @ 2022-03-23 10:41 ` Jan Kara 2022-03-23 11:40 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-23 10:41 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Wed 23-03-22 00:41:28, Amir Goldstein wrote: > > > > So the cleanest solution I currently see is > > > > to come up with helpers like "fsnotify_lock_group() & > > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > > memalloc_nofs_save / restore magic. > > > > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > > under memory pressure? > > > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > > higher. But I don't think that much memory is consumed by connectors or > > marks that the reduced chances for direct reclaim would really > > substantially matter for the system as a whole. > > > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > > and set NOFS state only in that case, so at least we don't cause regression > > > for existing applications? > > > > So that's a possibility I've left in my sleeve ;). We could do it but then > > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > > so that it does not complain about possible reclaim deadlocks. Doable but > > at this point I didn't consider it worth it unless someone comes with a bug > > report from a real user scenario. > > Are you sure about that? Feel free to try it, I can be wrong... > Note that fsnotify_destroy_mark() and friends already use lockdep class > SINGLE_DEPTH_NESTING, so I think the lockdep annotation already > assumes that deadlock from direct reclaim cannot happen and it is that > assumption that was nearly broken by evictable inode marks. > > IIUC that means that we only need to wrap the fanotify allocations > with GFP_NOFS (technically only after the first evictable mark)? Well, the dependencies lockdep will infer are: Once fsnotify_destroy_mark() is called from inode reclaim, it will record mark_mutex as 'fs-reclaim-unsafe' (essentially fs_reclaim->mark_mutex dependency). Once filesystem direct reclaim happens from an allocation under mark_mutex, lockdep will record mark_mutex as 'need-to-be-fs-reclaim-safe' (mark_mutex->fs_reclaim) dependency. Hence a loop. Now I agree that SINGLE_DEPTH_NESTING (which is BTW used in several other places for unclear reasons - we should clean that up) might defeat this lockdep detection but in that case it would also defeat detection of real potential deadlocks (because the deadlock scenario you've found is real). Proper lockdep annotation needs to distinguish mark_locks which can be acquired from under fs reclaim and mark_locks which cannot be. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 10:41 ` Jan Kara @ 2022-03-23 11:40 ` Amir Goldstein 2022-03-23 13:48 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 11:40 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 12:41 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 23-03-22 00:41:28, Amir Goldstein wrote: > > > > > So the cleanest solution I currently see is > > > > > to come up with helpers like "fsnotify_lock_group() & > > > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > > > memalloc_nofs_save / restore magic. > > > > > > > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > > > under memory pressure? > > > > > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > > > higher. But I don't think that much memory is consumed by connectors or > > > marks that the reduced chances for direct reclaim would really > > > substantially matter for the system as a whole. > > > > > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > > > and set NOFS state only in that case, so at least we don't cause regression > > > > for existing applications? > > > > > > So that's a possibility I've left in my sleeve ;). We could do it but then > > > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > > > so that it does not complain about possible reclaim deadlocks. Doable but > > > at this point I didn't consider it worth it unless someone comes with a bug > > > report from a real user scenario. > > > > Are you sure about that? > > Feel free to try it, I can be wrong... > > > Note that fsnotify_destroy_mark() and friends already use lockdep class > > SINGLE_DEPTH_NESTING, so I think the lockdep annotation already > > assumes that deadlock from direct reclaim cannot happen and it is that > > assumption that was nearly broken by evictable inode marks. > > > > IIUC that means that we only need to wrap the fanotify allocations > > with GFP_NOFS (technically only after the first evictable mark)? > > Well, the dependencies lockdep will infer are: Once fsnotify_destroy_mark() > is called from inode reclaim, it will record mark_mutex as > 'fs-reclaim-unsafe' (essentially fs_reclaim->mark_mutex dependency). Once > filesystem direct reclaim happens from an allocation under mark_mutex, > lockdep will record mark_mutex as 'need-to-be-fs-reclaim-safe' > (mark_mutex->fs_reclaim) dependency. Hence a loop. Now I agree that > SINGLE_DEPTH_NESTING (which is BTW used in several other places for unclear > reasons - we should clean that up) might defeat this lockdep detection but > in that case it would also defeat detection of real potential deadlocks > (because the deadlock scenario you've found is real). Proper lockdep Definitely. My test now reproduces the deadlock very reliably within seconds lockdep is unaware of the deadlock because of the SINGLE_DEPTH_NESTING subclass missguided annotation. > annotation needs to distinguish mark_locks which can be acquired from under > fs reclaim and mark_locks which cannot be. > I see. 
So technically we can annotate the fanotify group mark_mutex with a different
key and then we have 4 subclasses of lock:
- fanotify mark_mutex is NOT reclaim safe
- non-fanotify mark_mutex is reclaim safe
- ANY mark_mutex (SINGLE_DEPTH_NESTING) is fs-reclaim unsafe

The reason I am a bit uneasy with regressing inotify is that there are users
of large recursive inotify watch crawlers out there (e.g. watchman), and when
crawling a large tree to add marks, there may be a need to reclaim some
memory by evicting inode cache (of non-marked subtrees).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 22+ messages in thread
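For reference, the annotation described above (and implemented in the commit
linked a couple of messages below) could look roughly like this; the helper
name and call site are assumptions of the sketch.

/*
 * Sketch: give the fanotify backend's mark_mutex its own lockdep class, so
 * that the reclaim-unsafe fanotify mark_mutex and the reclaim-safe
 * mark_mutex of the other backends are tracked as different locks.
 */
static struct lock_class_key fanotify_mark_mutex_key;

static void fanotify_set_mark_mutex_class(struct fsnotify_group *group)
{
	/* called once, right after the fanotify group is allocated */
	lockdep_set_class(&group->mark_mutex, &fanotify_mark_mutex_key);
}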
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 11:40 ` Amir Goldstein @ 2022-03-23 13:48 ` Jan Kara 2022-03-23 14:00 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-23 13:48 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Wed 23-03-22 13:40:18, Amir Goldstein wrote: > On Wed, Mar 23, 2022 at 12:41 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 23-03-22 00:41:28, Amir Goldstein wrote: > > > > > > So the cleanest solution I currently see is > > > > > > to come up with helpers like "fsnotify_lock_group() & > > > > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > > > > memalloc_nofs_save / restore magic. > > > > > > > > > > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > > > > under memory pressure? > > > > > > > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > > > > higher. But I don't think that much memory is consumed by connectors or > > > > marks that the reduced chances for direct reclaim would really > > > > substantially matter for the system as a whole. > > > > > > > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > > > > and set NOFS state only in that case, so at least we don't cause regression > > > > > for existing applications? > > > > > > > > So that's a possibility I've left in my sleeve ;). We could do it but then > > > > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > > > > so that it does not complain about possible reclaim deadlocks. Doable but > > > > at this point I didn't consider it worth it unless someone comes with a bug > > > > report from a real user scenario. > > > > > > Are you sure about that? > > > > Feel free to try it, I can be wrong... > > > > > Note that fsnotify_destroy_mark() and friends already use lockdep class > > > SINGLE_DEPTH_NESTING, so I think the lockdep annotation already > > > assumes that deadlock from direct reclaim cannot happen and it is that > > > assumption that was nearly broken by evictable inode marks. > > > > > > IIUC that means that we only need to wrap the fanotify allocations > > > with GFP_NOFS (technically only after the first evictable mark)? > > > > Well, the dependencies lockdep will infer are: Once fsnotify_destroy_mark() > > is called from inode reclaim, it will record mark_mutex as > > 'fs-reclaim-unsafe' (essentially fs_reclaim->mark_mutex dependency). Once > > filesystem direct reclaim happens from an allocation under mark_mutex, > > lockdep will record mark_mutex as 'need-to-be-fs-reclaim-safe' > > (mark_mutex->fs_reclaim) dependency. Hence a loop. Now I agree that > > SINGLE_DEPTH_NESTING (which is BTW used in several other places for unclear > > reasons - we should clean that up) might defeat this lockdep detection but > > in that case it would also defeat detection of real potential deadlocks > > (because the deadlock scenario you've found is real). Proper lockdep > > Definitely. My test now reproduces the deadlock very reliably within seconds > lockdep is unaware of the deadlock because of the SINGLE_DEPTH_NESTING > subclass missguided annotation. > > > annotation needs to distinguish mark_locks which can be acquired from under > > fs reclaim and mark_locks which cannot be. > > > > I see. 
So technically we can annotate the fanotify group mark_mutex with > a different key and then we have 4 subclasses of lock: > - fanotify mark_mutex are NOT reclaim safe > - non-fanotify mark_mutex are reclaim safe > - ANY mark_mutex(SINGLE_DEPTH_NESTING) are fs-reclaim unsafe > > The reason I am a bit uneasy with regressing inotify is that there are users > of large recursive inotify watch crawlers out there (e.g. watchman) and when > crawling a large tree to add marks, there may be a need to reclaim > some memory by evicting inode cache (of non-marked subtrees). Well, but reclaim from kswapd is always the main and preferred source of memory reclaim. And we will kick kswapd to do work if we are running out of memory. Doing direct filesystem slab reclaim from mark allocation is useful only to throttle possibly aggressive mark allocations to the speed of reclaim (instead of getting ENOMEM). So I'm still not convinced this is a big issue but I certainly won't stop you from implementing more fine grained GFP mode selection and lockdep annotations if you want to go that way :). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 13:48 ` Jan Kara @ 2022-03-23 14:00 ` Amir Goldstein 2022-03-23 14:28 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 14:00 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel > Well, but reclaim from kswapd is always the main and preferred source of > memory reclaim. And we will kick kswapd to do work if we are running out of > memory. Doing direct filesystem slab reclaim from mark allocation is useful > only to throttle possibly aggressive mark allocations to the speed of > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > big issue but I certainly won't stop you from implementing more fine > grained GFP mode selection and lockdep annotations if you want to go that > way :). Well it was just two lines of code to annotate the fanotify mutex as its own class, so I just did that: https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 14:00 ` Amir Goldstein @ 2022-03-23 14:28 ` Jan Kara 2022-03-23 15:46 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-23 14:28 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > Well, but reclaim from kswapd is always the main and preferred source of > > memory reclaim. And we will kick kswapd to do work if we are running out of > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > only to throttle possibly aggressive mark allocations to the speed of > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > big issue but I certainly won't stop you from implementing more fine > > grained GFP mode selection and lockdep annotations if you want to go that > > way :). > > Well it was just two lines of code to annotate the fanotify mutex as its own > class, so I just did that: > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 But this implicitely assumes there isn't any allocation under mark_mutex anywhere else where it is held. Which is likely true (I didn't check) but it is kind of fragile. So I was rather imagining we would have per-group "NOFS" flag and fsnotify_group_lock/unlock() would call memalloc_nofs_save() based on the flag. And we would use fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 14:28 ` Jan Kara @ 2022-03-23 15:46 ` Amir Goldstein 2022-03-23 19:31 ` Amir Goldstein 2022-03-24 19:17 ` Amir Goldstein 0 siblings, 2 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 15:46 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > Well, but reclaim from kswapd is always the main and preferred source of > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > only to throttle possibly aggressive mark allocations to the speed of > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > big issue but I certainly won't stop you from implementing more fine > > > grained GFP mode selection and lockdep annotations if you want to go that > > > way :). > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > class, so I just did that: > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > But this implicitely assumes there isn't any allocation under mark_mutex > anywhere else where it is held. Which is likely true (I didn't check) but > it is kind of fragile. So I was rather imagining we would have per-group > "NOFS" flag and fsnotify_group_lock/unlock() would call > memalloc_nofs_save() based on the flag. And we would use > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > I see what you mean, but looking at the code it seems quite a bit of churn to go over all the old backends and convert the locks to use wrappers where we "know" those backends are fs reclaim safe (because we did not get reports of deadlocks over the decades they existed). I think I will sleep better with a conversion to three flavors: 1. pflags = fsnotify_group_nofs_lock(fanotify_group); 2. fsnotify_group_lock(dnotify_group) => WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) mutex_lock(&group->mark_mutex) 3. fsnotify_group_lock_nested(group) => mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) Thoughts? One more UAPI question. Do you think we should require user to opt-in to NOFS and evictable marks with FAN_SHRINKABLE? If we don't, it will be harder to fix regressions in legacy fanotify workloads if those are reported without breaking the evictable marks UAPI. Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
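Spelled out, the three flavors listed above could look roughly like this;
the group->flags field and the FSNOTIFY_NOFS flag are assumptions of the
sketch, and which backend gets which flavor follows the examples in the
message above (fanotify, dnotify, and the nested call sites respectively).

/* 1. NOFS flavor for groups whose marks can be freed from a shrinker. */
static inline unsigned int fsnotify_group_nofs_lock(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	return memalloc_nofs_save();
}

/* 2. Plain flavor for legacy backends that are fs-reclaim safe. */
static inline void fsnotify_group_lock(struct fsnotify_group *group)
{
	WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS);	/* flag assumed */
	mutex_lock(&group->mark_mutex);
}

/* 3. Nested flavor, matching the existing SINGLE_DEPTH_NESTING call sites. */
static inline void fsnotify_group_lock_nested(struct fsnotify_group *group)
{
	mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);
}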
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 15:46 ` Amir Goldstein @ 2022-03-23 19:31 ` Amir Goldstein 2022-03-24 19:17 ` Amir Goldstein 1 sibling, 0 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 19:31 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 5:46 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > > Well, but reclaim from kswapd is always the main and preferred source of > > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > > only to throttle possibly aggressive mark allocations to the speed of > > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > > big issue but I certainly won't stop you from implementing more fine > > > > grained GFP mode selection and lockdep annotations if you want to go that > > > > way :). > > > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > > class, so I just did that: > > > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > > > But this implicitely assumes there isn't any allocation under mark_mutex > > anywhere else where it is held. Which is likely true (I didn't check) but > > it is kind of fragile. So I was rather imagining we would have per-group > > "NOFS" flag and fsnotify_group_lock/unlock() would call > > memalloc_nofs_save() based on the flag. And we would use > > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > > > > I see what you mean, but looking at the code it seems quite a bit of churn to go > over all the old backends and convert the locks to use wrappers where we "know" > those backends are fs reclaim safe (because we did not get reports of deadlocks > over the decades they existed). > > I think I will sleep better with a conversion to three flavors: > > 1. pflags = fsnotify_group_nofs_lock(fanotify_group); > 2. fsnotify_group_lock(dnotify_group) => > WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) > mutex_lock(&group->mark_mutex) > 3. fsnotify_group_lock_nested(group) => > mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) > Converted common code: https://github.com/amir73il/linux/commit/a21677eaf6b45445b6eb3f0befcd7525c932b9da and fanotify: https://github.com/amir73il/linux/commit/b25f22b37c488b0898de8cd7a551892eacec0dae I can convert the rest of the backends cleanup patches later. Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 15:46 ` Amir Goldstein 2022-03-23 19:31 ` Amir Goldstein @ 2022-03-24 19:17 ` Amir Goldstein 2022-03-25 9:29 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-24 19:17 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 5:46 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > > Well, but reclaim from kswapd is always the main and preferred source of > > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > > only to throttle possibly aggressive mark allocations to the speed of > > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > > big issue but I certainly won't stop you from implementing more fine > > > > grained GFP mode selection and lockdep annotations if you want to go that > > > > way :). > > > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > > class, so I just did that: > > > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > > > But this implicitely assumes there isn't any allocation under mark_mutex > > anywhere else where it is held. Which is likely true (I didn't check) but > > it is kind of fragile. So I was rather imagining we would have per-group > > "NOFS" flag and fsnotify_group_lock/unlock() would call > > memalloc_nofs_save() based on the flag. And we would use > > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > > > > I see what you mean, but looking at the code it seems quite a bit of churn to go > over all the old backends and convert the locks to use wrappers where we "know" > those backends are fs reclaim safe (because we did not get reports of deadlocks > over the decades they existed). > > I think I will sleep better with a conversion to three flavors: > > 1. pflags = fsnotify_group_nofs_lock(fanotify_group); > 2. fsnotify_group_lock(dnotify_group) => > WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) > mutex_lock(&group->mark_mutex) > 3. fsnotify_group_lock_nested(group) => > mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) > I think I might have misunderstood you and you meant that the SINGLE_DEPTH_NESTING subcalls should be eliminated and then we are left with two lock classes. Correct? Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-24 19:17 ` Amir Goldstein @ 2022-03-25 9:29 ` Jan Kara 2022-03-27 18:14 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-25 9:29 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Thu 24-03-22 21:17:09, Amir Goldstein wrote: > On Wed, Mar 23, 2022 at 5:46 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > > > Well, but reclaim from kswapd is always the main and preferred source of > > > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > > > only to throttle possibly aggressive mark allocations to the speed of > > > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > > > big issue but I certainly won't stop you from implementing more fine > > > > > grained GFP mode selection and lockdep annotations if you want to go that > > > > > way :). > > > > > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > > > class, so I just did that: > > > > > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > > > > > But this implicitely assumes there isn't any allocation under mark_mutex > > > anywhere else where it is held. Which is likely true (I didn't check) but > > > it is kind of fragile. So I was rather imagining we would have per-group > > > "NOFS" flag and fsnotify_group_lock/unlock() would call > > > memalloc_nofs_save() based on the flag. And we would use > > > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > > > > > > > I see what you mean, but looking at the code it seems quite a bit of churn to go > > over all the old backends and convert the locks to use wrappers where we "know" > > those backends are fs reclaim safe (because we did not get reports of deadlocks > > over the decades they existed). > > > > I think I will sleep better with a conversion to three flavors: > > > > 1. pflags = fsnotify_group_nofs_lock(fanotify_group); > > 2. fsnotify_group_lock(dnotify_group) => > > WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) > > mutex_lock(&group->mark_mutex) > > 3. fsnotify_group_lock_nested(group) => > > mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) > > > > I think I might have misunderstood you and you meant that the > SINGLE_DEPTH_NESTING subcalls should be eliminated and then > we are left with two lock classes. > Correct? Yeah, at least at this point I don't see a good reason for using SINGLE_DEPTH_NESTING lockdep annotation. In my opinion it has just a potential of silencing reports of real locking problems. So removing it and seeing what complains would be IMO a way to go. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-25 9:29 ` Jan Kara @ 2022-03-27 18:14 ` Amir Goldstein 0 siblings, 0 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-27 18:14 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel > > I think I might have misunderstood you and you meant that the > > SINGLE_DEPTH_NESTING subcalls should be eliminated and then > > we are left with two lock classes. > > Correct? > > Yeah, at least at this point I don't see a good reason for using > SINGLE_DEPTH_NESTING lockdep annotation. In my opinion it has just a > potential of silencing reports of real locking problems. So removing it and > seeing what complains would be IMO a way to go. > Agreed. In fact, the reason it was added is based on a misunderstanding of mutex_lock_nested(): Commit 6960b0d909cd ("fsnotify: change locking order") changed some of the mark_mutex locks in direct reclaim path to use: mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING); This change is explained: "...It uses nested locking to avoid deadlock in case we do the final iput() on an inode which still holds marks and thus would take the mutex again when calling fsnotify_inode_delete() in destroy_inode()." Pushed the fix along with conversion of all backends to fsnotify_group_lock() to fan_evictable branch, which I am still testing. It's worth noting that I found dnotify to be not fs reclaim safe, because dnotify_flush() is called from any filp_close() (problem is explained in the commit message), which contradicts my earlier argument that legacy backends "must be deadlock safe or we would have gotten deadlock reports by now". So yeh, eventually, we may set all legacy backends NOFS, but I would like to try and exempt inotify and audit for now to reduce the chance of regressions. Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 11:23 ` Jan Kara 2022-03-21 11:56 ` Amir Goldstein @ 2022-03-21 22:50 ` Trond Myklebust 2022-03-21 23:36 ` Khazhy Kumykov 2022-03-22 10:37 ` Jan Kara 1 sibling, 2 replies; 22+ messages in thread From: Trond Myklebust @ 2022-03-21 22:50 UTC (permalink / raw) To: jack, amir73il Cc: bfields, khazhy, linux-mm, linux-nfs, linux-kernel, jlayton, chuck.lever On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote: > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust > > <trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > result > > > > in recursing back into nfsd, resulting in deadlock. See below > > > > stack. > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > Call Trace: > > > > __schedule+0x497/0x630 > > > > schedule+0x67/0x90 > > > > schedule_preempt_disabled+0xe/0x10 > > > > __mutex_lock+0x347/0x4b0 > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > do_shrink_slab+0x1f2/0x330 > > > > shrink_slab+0x244/0x2f0 > > > > shrink_node+0xd7/0x490 > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > try_to_free_pages+0x43f/0x540 > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > alloc_slab_page+0x32/0x2e0 > > > > new_slab+0xa6/0x8b0 > > > > ___slab_alloc+0x34b/0x520 > > > > kmem_cache_alloc+0x1c4/0x250 > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > fsnotify_add_mark+0x48/0x70 > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > svc_process+0xf9/0x110 [sunrpc] > > > > nfsd+0x10e/0x180 [nfsd] > > > > kthread+0x130/0x140 > > > > ret_from_fork+0x35/0x40 > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > --- > > > > fs/nfsd/filecache.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > Marking this RFC since I haven't actually had a chance to test > > > > this, > > > > we > > > > we're seeing this deadlock for some customers. > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct > > > > nfsd_file > > > > *nf) > > > > struct fsnotify_mark *mark; > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > struct inode *inode = nf->nf_inode; > > > > + unsigned int pflags; > > > > > > > > do { > > > > mutex_lock(&nfsd_file_fsnotify_group- > > > > >mark_mutex); > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct > > > > nfsd_file > > > > *nf) > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > + /* fsnotify allocates, avoid recursion back > > > > into nfsd > > > > */ > > > > + pflags = memalloc_nofs_save(); > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, > > > > inode, > > > > 0); > > > > + memalloc_nofs_restore(pflags); > > > > > > > > /* > > > > * If the add was successful, then return the > > > > object. 
> > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > filesystem writeback situation? > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It > > > seems > > > to me that it cannot, particularly since knfsd is not a > > > filesystem, and > > > so does not ever handle writeback of dirty pages. > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem > > is > > harder to hit? > > > > This report came in at good timing for me. > > > > It demonstrates an issue I did not predict for "volatile"' fanotify > > marks [1]. > > As far as I can tell, nfsd filecache is currently the only fsnotify > > backend that > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks > > would also > > be evictable in that way, so they would expose fanotify to this > > deadlock. > > > > For the short term, maybe nfsd filecache can avoid the problem by > > checking > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort > > the > > shrinker. I wonder if there is a place for a helper > > mutex_is_locked_by_me()? > > > > Jan, > > > > A relatively simple fix would be to allocate > > fsnotify_mark_connector in > > fsnotify_add_mark() and free it, if a connector already exists for > > the object. > > I don't think there is a good reason to optimize away this > > allocation > > for the case of a non-first group to set a mark on an object? > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set > of > locks grabbed during inode slab reclaim. So any allocation under > group->mark_mutex has to be GFP_NOFS now. This is not just about > connector > allocations but also mark allocations for fanotify. Moving > allocations from > under mark_mutex is also possible solution but passing preallocated > memory > around is kind of ugly as well. So the cleanest solution I currently > see is > to come up with helpers like "fsnotify_lock_group() & > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also > do > memalloc_nofs_save / restore magic. As has already been reported, the problem was fixed in Linux 5.5 by the garbage collector rewrite, and so this is no longer an issue. In addition, please note that memalloc_nofs_save/restore and the use of GFP_NOFS was never a solution, because it does not prevent the kind of direct reclaim that was happening here. You'd have to enforce GFP_NOWAIT allocations, afaics. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 22+ messages in thread
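To make the disagreement above concrete: memalloc_nofs_save() does not disable direct reclaim at all. It only sets a task flag, and the page allocator then masks __GFP_FS out of the effective gfp mask for the duration, so the allocation may still enter direct reclaim and call shrinkers, just not the ones that require __GFP_FS. A simplified sketch of the relevant pieces (see include/linux/sched/mm.h for the real definitions):

static inline unsigned int memalloc_nofs_save(void)
{
        unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

        current->flags |= PF_MEMALLOC_NOFS;
        return flags;
}

/* applied by the allocator via current_gfp_context(), roughly: */
if (current->flags & PF_MEMALLOC_NOFS)
        gfp_mask &= ~__GFP_FS;

Whether that is enough therefore depends on whether the shrinkers that can end up taking the mark_mutex honour __GFP_FS, which is the point Jan returns to below.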
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 22:50 ` Trond Myklebust @ 2022-03-21 23:36 ` Khazhy Kumykov 2022-03-21 23:50 ` Trond Myklebust 2022-03-22 10:37 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: Khazhy Kumykov @ 2022-03-21 23:36 UTC (permalink / raw) To: Trond Myklebust, gregkh Cc: jack, amir73il, bfields, linux-mm, linux-nfs, linux-kernel, jlayton, chuck.lever [-- Attachment #1: Type: text/plain, Size: 486 bytes --] On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust <trondmy@hammerspace.com> wrote: > > As has already been reported, the problem was fixed in Linux 5.5 by the > garbage collector rewrite, and so this is no longer an issue. > 9542e6a643fc ("nfsd: Containerise filecache laundrette"), 36ebbdb96b69 ("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for me, which is still LTS. Since this should fix a real deadlock, would it be appropriate to include them for the 5.4 stable? [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3999 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 23:36 ` Khazhy Kumykov @ 2022-03-21 23:50 ` Trond Myklebust 0 siblings, 0 replies; 22+ messages in thread From: Trond Myklebust @ 2022-03-21 23:50 UTC (permalink / raw) To: khazhy, gregkh Cc: jack, chuck.lever, linux-mm, linux-kernel, jlayton, bfields, linux-nfs, amir73il On Mon, 2022-03-21 at 16:36 -0700, Khazhy Kumykov wrote: > On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust > <trondmy@hammerspace.com> wrote: > > > > As has already been reported, the problem was fixed in Linux 5.5 by > > the > > garbage collector rewrite, and so this is no longer an issue. > > > 9542e6a643fc ("nfsd: Containerise filecache laundrette"), > 36ebbdb96b69 > ("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for > me, which is still LTS. Since this should fix a real deadlock, would > it be appropriate to include them for the 5.4 stable? That would be OK with me. I'm not aware of any side-effects that might be a problem for 5.4. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 22:50 ` Trond Myklebust 2022-03-21 23:36 ` Khazhy Kumykov @ 2022-03-22 10:37 ` Jan Kara 1 sibling, 0 replies; 22+ messages in thread From: Jan Kara @ 2022-03-22 10:37 UTC (permalink / raw) To: Trond Myklebust Cc: jack, amir73il, bfields, khazhy, linux-mm, linux-nfs, linux-kernel, jlayton, chuck.lever On Mon 21-03-22 22:50:11, Trond Myklebust wrote: > On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote: > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust > > > <trondmy@hammerspace.com> wrote: > > > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > > result > > > > > in recursing back into nfsd, resulting in deadlock. See below > > > > > stack. > > > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > > Call Trace: > > > > > __schedule+0x497/0x630 > > > > > schedule+0x67/0x90 > > > > > schedule_preempt_disabled+0xe/0x10 > > > > > __mutex_lock+0x347/0x4b0 > > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > > do_shrink_slab+0x1f2/0x330 > > > > > shrink_slab+0x244/0x2f0 > > > > > shrink_node+0xd7/0x490 > > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > > try_to_free_pages+0x43f/0x540 > > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > > alloc_slab_page+0x32/0x2e0 > > > > > new_slab+0xa6/0x8b0 > > > > > ___slab_alloc+0x34b/0x520 > > > > > kmem_cache_alloc+0x1c4/0x250 > > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > > fsnotify_add_mark+0x48/0x70 > > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > > svc_process+0xf9/0x110 [sunrpc] > > > > > nfsd+0x10e/0x180 [nfsd] > > > > > kthread+0x130/0x140 > > > > > ret_from_fork+0x35/0x40 > > > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > Marking this RFC since I haven't actually had a chance to test > > > > > this, > > > > > we > > > > > we're seeing this deadlock for some customers. 
> > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct > > > > > nfsd_file > > > > > *nf) > > > > > struct fsnotify_mark *mark; > > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > > struct inode *inode = nf->nf_inode; > > > > > + unsigned int pflags; > > > > > > > > > > do { > > > > > mutex_lock(&nfsd_file_fsnotify_group- > > > > > >mark_mutex); > > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct > > > > > nfsd_file > > > > > *nf) > > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > > > + /* fsnotify allocates, avoid recursion back > > > > > into nfsd > > > > > */ > > > > > + pflags = memalloc_nofs_save(); > > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, > > > > > inode, > > > > > 0); > > > > > + memalloc_nofs_restore(pflags); > > > > > > > > > > /* > > > > > * If the add was successful, then return the > > > > > object. > > > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > > filesystem writeback situation? > > > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It > > > > seems > > > > to me that it cannot, particularly since knfsd is not a > > > > filesystem, and > > > > so does not ever handle writeback of dirty pages. > > > > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem > > > is > > > harder to hit? > > > > > > This report came in at good timing for me. > > > > > > It demonstrates an issue I did not predict for "volatile"' fanotify > > > marks [1]. > > > As far as I can tell, nfsd filecache is currently the only fsnotify > > > backend that > > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks > > > would also > > > be evictable in that way, so they would expose fanotify to this > > > deadlock. > > > > > > For the short term, maybe nfsd filecache can avoid the problem by > > > checking > > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort > > > the > > > shrinker. I wonder if there is a place for a helper > > > mutex_is_locked_by_me()? > > > > > > Jan, > > > > > > A relatively simple fix would be to allocate > > > fsnotify_mark_connector in > > > fsnotify_add_mark() and free it, if a connector already exists for > > > the object. > > > I don't think there is a good reason to optimize away this > > > allocation > > > for the case of a non-first group to set a mark on an object? > > > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set > > of > > locks grabbed during inode slab reclaim. So any allocation under > > group->mark_mutex has to be GFP_NOFS now. This is not just about > > connector > > allocations but also mark allocations for fanotify. Moving > > allocations from > > under mark_mutex is also possible solution but passing preallocated > > memory > > around is kind of ugly as well. So the cleanest solution I currently > > see is > > to come up with helpers like "fsnotify_lock_group() & > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also > > do > > memalloc_nofs_save / restore magic. > > As has already been reported, the problem was fixed in Linux 5.5 by the > garbage collector rewrite, and so this is no longer an issue. Sorry, I was not clear enough I guess. NFS is not a problem since 5.5 as you say. 
But Amir has changes in the works after which any filesystem inode reclaim could end up in exactly the same path (calling fsnotify_destroy_mark() from clear_inode()). So these changes would introduce the same deadlock NFS was prone to before 5.5. > In addition, please note that memalloc_nofs_save/restore and the use of > GFP_NOFS was never a solution, because it does not prevent the kind of > direct reclaim that was happening here. You'd have to enforce > GFP_NOWAIT allocations, afaics. GFP_NOFS should solve the above problem because with GFP_NOFS we cannot enter inode reclaim from the memory allocation and thus end up freeing marks. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
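The mechanism behind that last point: the path Jan describes (inode reclaim calling fsnotify_destroy_mark() from clear_inode()) is only reachable through the superblock shrinker, and that shrinker backs off when the allocation context forbids filesystem recursion. Abridged from fs/super.c:

static unsigned long super_cache_scan(struct shrinker *shrink,
                                      struct shrink_control *sc)
{
        /*
         * Deadlock avoidance: we may hold various FS locks, so do not
         * recurse into the FS that called us via clear_inode() and friends.
         */
        if (!(sc->gfp_mask & __GFP_FS))
                return SHRINK_STOP;

        /* ... prune dentries and inodes, return the number freed ... */
}

So an allocation made under memalloc_nofs_save() (or with GFP_NOFS) never reaches clear_inode(), which is why scoping the allocations done under mark_mutex to NOFS closes the lock cycle for evictable marks.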
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-19 9:36 ` Amir Goldstein 2022-03-21 11:23 ` Jan Kara @ 2022-03-21 17:06 ` Khazhy Kumykov 1 sibling, 0 replies; 22+ messages in thread From: Khazhy Kumykov @ 2022-03-21 17:06 UTC (permalink / raw) To: Amir Goldstein Cc: Trond Myklebust, Jan Kara, bfields, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton [-- Attachment #1: Type: text/plain, Size: 4985 bytes --] On Sat, Mar 19, 2022 at 2:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > result > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > nfsd D 0 1591536 2 0x80004080 > > > Call Trace: > > > __schedule+0x497/0x630 > > > schedule+0x67/0x90 > > > schedule_preempt_disabled+0xe/0x10 > > > __mutex_lock+0x347/0x4b0 > > > fsnotify_destroy_mark+0x22/0xa0 > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > do_shrink_slab+0x1f2/0x330 > > > shrink_slab+0x244/0x2f0 > > > shrink_node+0xd7/0x490 > > > do_try_to_free_pages+0x12f/0x3b0 > > > try_to_free_pages+0x43f/0x540 > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > __alloc_pages_nodemask+0x274/0x2c0 > > > alloc_slab_page+0x32/0x2e0 > > > new_slab+0xa6/0x8b0 > > > ___slab_alloc+0x34b/0x520 > > > kmem_cache_alloc+0x1c4/0x250 > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > fsnotify_add_mark+0x48/0x70 > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > svc_process+0xf9/0x110 [sunrpc] > > > nfsd+0x10e/0x180 [nfsd] > > > kthread+0x130/0x140 > > > ret_from_fork+0x35/0x40 > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > --- > > > fs/nfsd/filecache.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > we > > > we're seeing this deadlock for some customers. > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > struct fsnotify_mark *mark; > > > struct nfsd_file_mark *nfm = NULL, *new; > > > struct inode *inode = nf->nf_inode; > > > + unsigned int pflags; > > > > > > do { > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > refcount_set(&new->nfm_ref, 1); > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > */ > > > + pflags = memalloc_nofs_save(); > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > 0); > > > + memalloc_nofs_restore(pflags); > > > > > > /* > > > * If the add was successful, then return the object. > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > filesystem writeback situation? > > > > Does memalloc_nofs_save()/restore() really fix this problem? 
It seems > > to me that it cannot, particularly since knfsd is not a filesystem, and > > so does not ever handle writeback of dirty pages. > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > harder to hit? (I think I simply got confused - I don't see reason that NOFS would help with direct reclaim, though it does look like the gfp flags are passed via a shrink_control struct so one *could* react to them in the shrinker - again not an area i'm super familiar with) > > This report came in at good timing for me. > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > be evictable in that way, so they would expose fanotify to this deadlock. > > For the short term, maybe nfsd filecache can avoid the problem by checking > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? fwiw, it does look like ~5.5 nfsd did stop freeing fanotify marks during reclaim, in the commit "nfsd: Containerise filecache laundrette" (I had sent an earlier email about this, not sure where that's getting caught up, but I do see it on lore...) > > Jan, > > A relatively simple fix would be to allocate fsnotify_mark_connector in > fsnotify_add_mark() and free it, if a connector already exists for the object. > I don't think there is a good reason to optimize away this allocation > for the case of a non-first group to set a mark on an object? > > Thanks, > Amir. > > > > [1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/ [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3999 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
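Amir's short-term idea quoted above, i.e. have the mark-freeing shrinker notice that the group's mark_mutex is (or may be) held and back off, would look roughly like the sketch below. The function body follows the 5.4-era code referenced in the stack trace; the trylock guard is an illustration only, not the fix that was merged (5.5+ sidesteps the problem by deferring disposal to the filecache laundrette instead).

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
        LIST_HEAD(head);
        unsigned long ret;

        /*
         * If the fsnotify group's mark_mutex is held, possibly by this very
         * task while allocating inside fsnotify_add_mark(), then freeing
         * files here would deadlock in fsnotify_destroy_mark().  Err on the
         * side of backing off; the shrinker will simply be invoked again.
         */
        if (!mutex_trylock(&nfsd_file_fsnotify_group->mark_mutex))
                return SHRINK_STOP;
        mutex_unlock(&nfsd_file_fsnotify_group->mark_mutex);

        ret = list_lru_shrink_walk(&nfsd_file_lru, sc, nfsd_file_lru_cb, &head);
        nfsd_file_lru_dispose(&head);
        return ret;
}

A mutex_is_locked_by_me() style helper, if one were added, would make the check precise (bail out only when the current task is the holder); the trylock variant above is merely conservative.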
Thread overview: 22+ messages
[not found] <20220319001635.4097742-1-khazhy@google.com>
2022-03-19 0:36 ` [PATCH RFC] nfsd: avoid recursive locking through fsnotify Trond Myklebust
2022-03-19 1:45 ` Khazhy Kumykov
2022-03-19 9:36 ` Amir Goldstein
2022-03-21 11:23 ` Jan Kara
2022-03-21 11:56 ` Amir Goldstein
2022-03-21 14:51 ` Jan Kara
2022-03-22 22:41 ` Amir Goldstein
2022-03-23 10:41 ` Jan Kara
2022-03-23 11:40 ` Amir Goldstein
2022-03-23 13:48 ` Jan Kara
2022-03-23 14:00 ` Amir Goldstein
2022-03-23 14:28 ` Jan Kara
2022-03-23 15:46 ` Amir Goldstein
2022-03-23 19:31 ` Amir Goldstein
2022-03-24 19:17 ` Amir Goldstein
2022-03-25 9:29 ` Jan Kara
2022-03-27 18:14 ` Amir Goldstein
2022-03-21 22:50 ` Trond Myklebust
2022-03-21 23:36 ` Khazhy Kumykov
2022-03-21 23:50 ` Trond Myklebust
2022-03-22 10:37 ` Jan Kara
2022-03-21 17:06 ` Khazhy Kumykov