* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify [not found] <20220319001635.4097742-1-khazhy@google.com> @ 2022-03-19 0:36 ` Trond Myklebust 2022-03-19 1:45 ` Khazhy Kumykov 2022-03-19 9:36 ` Amir Goldstein 0 siblings, 2 replies; 22+ messages in thread From: Trond Myklebust @ 2022-03-19 0:36 UTC (permalink / raw) To: bfields, khazhy, chuck.lever; +Cc: linux-mm, linux-nfs, linux-kernel On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > result > in recursing back into nfsd, resulting in deadlock. See below stack. > > nfsd D 0 1591536 2 0x80004080 > Call Trace: > __schedule+0x497/0x630 > schedule+0x67/0x90 > schedule_preempt_disabled+0xe/0x10 > __mutex_lock+0x347/0x4b0 > fsnotify_destroy_mark+0x22/0xa0 > nfsd_file_free+0x79/0xd0 [nfsd] > nfsd_file_put_noref+0x7c/0x90 [nfsd] > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > nfsd_file_lru_scan+0x57/0x80 [nfsd] > do_shrink_slab+0x1f2/0x330 > shrink_slab+0x244/0x2f0 > shrink_node+0xd7/0x490 > do_try_to_free_pages+0x12f/0x3b0 > try_to_free_pages+0x43f/0x540 > __alloc_pages_slowpath+0x6ab/0x11c0 > __alloc_pages_nodemask+0x274/0x2c0 > alloc_slab_page+0x32/0x2e0 > new_slab+0xa6/0x8b0 > ___slab_alloc+0x34b/0x520 > kmem_cache_alloc+0x1c4/0x250 > fsnotify_add_mark_locked+0x18d/0x4c0 > fsnotify_add_mark+0x48/0x70 > nfsd_file_acquire+0x570/0x6f0 [nfsd] > nfsd_read+0xa7/0x1c0 [nfsd] > nfsd3_proc_read+0xc1/0x110 [nfsd] > nfsd_dispatch+0xf7/0x240 [nfsd] > svc_process_common+0x2f4/0x610 [sunrpc] > svc_process+0xf9/0x110 [sunrpc] > nfsd+0x10e/0x180 [nfsd] > kthread+0x130/0x140 > ret_from_fork+0x35/0x40 > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > --- > fs/nfsd/filecache.c | 4 ++++ > 1 file changed, 4 insertions(+) > > Marking this RFC since I haven't actually had a chance to test this, > we > we're seeing this deadlock for some customers. > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index fdf89fcf1a0c..a14760f9b486 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > *nf) > struct fsnotify_mark *mark; > struct nfsd_file_mark *nfm = NULL, *new; > struct inode *inode = nf->nf_inode; > + unsigned int pflags; > > do { > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > *nf) > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > refcount_set(&new->nfm_ref, 1); > > + /* fsnotify allocates, avoid recursion back into nfsd > */ > + pflags = memalloc_nofs_save(); > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > 0); > + memalloc_nofs_restore(pflags); > > /* > * If the add was successful, then return the object. Isn't that stack trace showing a slab direct reclaim, and not a filesystem writeback situation? Does memalloc_nofs_save()/restore() really fix this problem? It seems to me that it cannot, particularly since knfsd is not a filesystem, and so does not ever handle writeback of dirty pages. Cc: the linux-mm mailing list in search of answers to the above 2 questions. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify
  2022-03-19  0:36 ` [PATCH RFC] nfsd: avoid recursive locking through fsnotify Trond Myklebust
@ 2022-03-19  1:45   ` Khazhy Kumykov
  2022-03-19  9:36   ` Amir Goldstein
  1 sibling, 0 replies; 22+ messages in thread
From: Khazhy Kumykov @ 2022-03-19  1:45 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, chuck.lever, linux-mm, linux-nfs, linux-kernel

On Fri, Mar 18, 2022 at 5:36 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.

Ah, you're right. An alternative would be delaying the destroy_mark: I notice
now that, on mainline, the shrinker calls nfsd_file_dispose_list_delayed,
something missing from the 5.4 version I was looking at. To my reading,
9542e6a643fc6 ("nfsd: Containerise filecache laundrette") should have the
effect of fixing this deadlock (and its commit message explicitly calls out
notifier deadlocks) - maybe something to send towards stable?

>
> Cc: the linux-mm mailing list in search of answers to the above 2
> questions.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>

^ permalink raw reply	[flat|nested] 22+ messages in thread
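To make the delayed-dispose idea above concrete, here is a minimal sketch,
assuming it slots into fs/nfsd/filecache.c next to the existing struct
nfsd_file and nfsd_file_free(); the list, lock and work-item names are
invented for illustration and do not claim to match the mainline laundrette
code.

/*
 * Sketch only: defer the actual freeing (and hence fsnotify_destroy_mark(),
 * which takes group->mark_mutex) out of the shrinker and into a workqueue.
 * struct nfsd_file, its nf_lru list head and nfsd_file_free() are the
 * existing filecache pieces seen in the stack trace above.
 */
static LIST_HEAD(nfsd_file_dispose_list);
static DEFINE_SPINLOCK(nfsd_file_dispose_lock);

static void nfsd_file_delayed_dispose(struct work_struct *work)
{
	struct nfsd_file *nf;

	spin_lock(&nfsd_file_dispose_lock);
	while (!list_empty(&nfsd_file_dispose_list)) {
		nf = list_first_entry(&nfsd_file_dispose_list,
				      struct nfsd_file, nf_lru);
		list_del_init(&nf->nf_lru);
		spin_unlock(&nfsd_file_dispose_lock);
		/* Safe: we are in workqueue context, not direct reclaim. */
		nfsd_file_free(nf);
		spin_lock(&nfsd_file_dispose_lock);
	}
	spin_unlock(&nfsd_file_dispose_lock);
}

static DECLARE_WORK(nfsd_file_dispose_work, nfsd_file_delayed_dispose);

/* Called from the LRU shrinker instead of freeing the entries in place. */
static void nfsd_file_dispose_later(struct list_head *victims)
{
	spin_lock(&nfsd_file_dispose_lock);
	list_splice_init(victims, &nfsd_file_dispose_list);
	spin_unlock(&nfsd_file_dispose_lock);
	schedule_work(&nfsd_file_dispose_work);
}

The point of the pattern is simply that mark_mutex is then only ever taken
from process or workqueue context, never from inside shrink_slab().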
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-19 0:36 ` [PATCH RFC] nfsd: avoid recursive locking through fsnotify Trond Myklebust 2022-03-19 1:45 ` Khazhy Kumykov @ 2022-03-19 9:36 ` Amir Goldstein 2022-03-21 11:23 ` Jan Kara 2022-03-21 17:06 ` Khazhy Kumykov 1 sibling, 2 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-19 9:36 UTC (permalink / raw) To: Trond Myklebust, Jan Kara Cc: bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > result > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > nfsd D 0 1591536 2 0x80004080 > > Call Trace: > > __schedule+0x497/0x630 > > schedule+0x67/0x90 > > schedule_preempt_disabled+0xe/0x10 > > __mutex_lock+0x347/0x4b0 > > fsnotify_destroy_mark+0x22/0xa0 > > nfsd_file_free+0x79/0xd0 [nfsd] > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > do_shrink_slab+0x1f2/0x330 > > shrink_slab+0x244/0x2f0 > > shrink_node+0xd7/0x490 > > do_try_to_free_pages+0x12f/0x3b0 > > try_to_free_pages+0x43f/0x540 > > __alloc_pages_slowpath+0x6ab/0x11c0 > > __alloc_pages_nodemask+0x274/0x2c0 > > alloc_slab_page+0x32/0x2e0 > > new_slab+0xa6/0x8b0 > > ___slab_alloc+0x34b/0x520 > > kmem_cache_alloc+0x1c4/0x250 > > fsnotify_add_mark_locked+0x18d/0x4c0 > > fsnotify_add_mark+0x48/0x70 > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > nfsd_read+0xa7/0x1c0 [nfsd] > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > nfsd_dispatch+0xf7/0x240 [nfsd] > > svc_process_common+0x2f4/0x610 [sunrpc] > > svc_process+0xf9/0x110 [sunrpc] > > nfsd+0x10e/0x180 [nfsd] > > kthread+0x130/0x140 > > ret_from_fork+0x35/0x40 > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > --- > > fs/nfsd/filecache.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > Marking this RFC since I haven't actually had a chance to test this, > > we > > we're seeing this deadlock for some customers. > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index fdf89fcf1a0c..a14760f9b486 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > *nf) > > struct fsnotify_mark *mark; > > struct nfsd_file_mark *nfm = NULL, *new; > > struct inode *inode = nf->nf_inode; > > + unsigned int pflags; > > > > do { > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > *nf) > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > refcount_set(&new->nfm_ref, 1); > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > */ > > + pflags = memalloc_nofs_save(); > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > 0); > > + memalloc_nofs_restore(pflags); > > > > /* > > * If the add was successful, then return the object. > > Isn't that stack trace showing a slab direct reclaim, and not a > filesystem writeback situation? > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > to me that it cannot, particularly since knfsd is not a filesystem, and > so does not ever handle writeback of dirty pages. > Maybe NOFS throttles direct reclaims to the point that the problem is harder to hit? This report came in at good timing for me. 
It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
As far as I can tell, nfsd filecache is currently the only fsnotify backend
that frees fsnotify marks in a memory shrinker. "Volatile" fanotify marks
would also be evictable in that way, so they would expose fanotify to this
deadlock.

For the short term, maybe nfsd filecache can avoid the problem by checking
mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?

Jan,

A relatively simple fix would be to allocate the fsnotify_mark_connector in
fsnotify_add_mark() and free it if a connector already exists for the object.
I don't think there is a good reason to optimize away this allocation for the
case of a non-first group setting a mark on an object.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/

^ permalink raw reply	[flat|nested] 22+ messages in thread
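A minimal sketch of the short-term workaround suggested above, assuming the
existing nfsd_file_lru, nfsd_file_lru_cb and nfsd_file_lru_dispose() helpers;
since no mutex_is_locked_by_me() helper exists, plain mutex_is_locked() is
used, which errs on the side of skipping reclaim whenever any task holds the
mutex.

/*
 * Sketch: skip this shrink attempt entirely if the group's mark_mutex is
 * held by anyone.  mutex_is_locked() cannot tell whether the holder is the
 * current task (hence the wish for a mutex_is_locked_by_me() helper), so
 * this occasionally gives up a reclaim opportunity to stay deadlock-safe.
 */
static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	LIST_HEAD(head);
	unsigned long ret;

	if (mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex))
		return SHRINK_STOP;

	ret = list_lru_shrink_walk(&nfsd_file_lru, sc, nfsd_file_lru_cb, &head);
	nfsd_file_lru_dispose(&head);
	return ret;
}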
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-19 9:36 ` Amir Goldstein @ 2022-03-21 11:23 ` Jan Kara 2022-03-21 11:56 ` Amir Goldstein 2022-03-21 22:50 ` Trond Myklebust 2022-03-21 17:06 ` Khazhy Kumykov 1 sibling, 2 replies; 22+ messages in thread From: Jan Kara @ 2022-03-21 11:23 UTC (permalink / raw) To: Amir Goldstein Cc: Trond Myklebust, Jan Kara, bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > result > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > nfsd D 0 1591536 2 0x80004080 > > > Call Trace: > > > __schedule+0x497/0x630 > > > schedule+0x67/0x90 > > > schedule_preempt_disabled+0xe/0x10 > > > __mutex_lock+0x347/0x4b0 > > > fsnotify_destroy_mark+0x22/0xa0 > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > do_shrink_slab+0x1f2/0x330 > > > shrink_slab+0x244/0x2f0 > > > shrink_node+0xd7/0x490 > > > do_try_to_free_pages+0x12f/0x3b0 > > > try_to_free_pages+0x43f/0x540 > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > __alloc_pages_nodemask+0x274/0x2c0 > > > alloc_slab_page+0x32/0x2e0 > > > new_slab+0xa6/0x8b0 > > > ___slab_alloc+0x34b/0x520 > > > kmem_cache_alloc+0x1c4/0x250 > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > fsnotify_add_mark+0x48/0x70 > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > svc_process+0xf9/0x110 [sunrpc] > > > nfsd+0x10e/0x180 [nfsd] > > > kthread+0x130/0x140 > > > ret_from_fork+0x35/0x40 > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > --- > > > fs/nfsd/filecache.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > we > > > we're seeing this deadlock for some customers. > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > struct fsnotify_mark *mark; > > > struct nfsd_file_mark *nfm = NULL, *new; > > > struct inode *inode = nf->nf_inode; > > > + unsigned int pflags; > > > > > > do { > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > refcount_set(&new->nfm_ref, 1); > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > */ > > > + pflags = memalloc_nofs_save(); > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > 0); > > > + memalloc_nofs_restore(pflags); > > > > > > /* > > > * If the add was successful, then return the object. > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > filesystem writeback situation? > > > > Does memalloc_nofs_save()/restore() really fix this problem? 
It seems > > to me that it cannot, particularly since knfsd is not a filesystem, and > > so does not ever handle writeback of dirty pages. > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > harder to hit? > > This report came in at good timing for me. > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > be evictable in that way, so they would expose fanotify to this deadlock. > > For the short term, maybe nfsd filecache can avoid the problem by checking > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > Jan, > > A relatively simple fix would be to allocate fsnotify_mark_connector in > fsnotify_add_mark() and free it, if a connector already exists for the object. > I don't think there is a good reason to optimize away this allocation > for the case of a non-first group to set a mark on an object? Indeed, nasty. Volatile marks will add group->mark_mutex into a set of locks grabbed during inode slab reclaim. So any allocation under group->mark_mutex has to be GFP_NOFS now. This is not just about connector allocations but also mark allocations for fanotify. Moving allocations from under mark_mutex is also possible solution but passing preallocated memory around is kind of ugly as well. So the cleanest solution I currently see is to come up with helpers like "fsnotify_lock_group() & fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do memalloc_nofs_save / restore magic. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
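A rough sketch of the helpers proposed above; the exact interface (for
example, where the saved allocation-context flags live) is an open question,
so here they are simply returned to the caller.

/*
 * Sketch of the proposed wrappers: take the group's mark_mutex and enter a
 * scoped NOFS section, so that any allocation done while the mutex is held
 * cannot recurse into a shrinker that needs the same mutex.
 * memalloc_nofs_save/restore come from <linux/sched/mm.h>.
 */
static inline unsigned int fsnotify_lock_group(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	return memalloc_nofs_save();
}

static inline void fsnotify_unlock_group(struct fsnotify_group *group,
					 unsigned int nofs_flags)
{
	memalloc_nofs_restore(nofs_flags);
	mutex_unlock(&group->mark_mutex);
}

Call sites in fsnotify_add_mark(), fsnotify_destroy_mark() and the backends
would then use these wrappers instead of taking group->mark_mutex directly.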
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 11:23 ` Jan Kara @ 2022-03-21 11:56 ` Amir Goldstein 2022-03-21 14:51 ` Jan Kara 2022-03-21 22:50 ` Trond Myklebust 1 sibling, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-21 11:56 UTC (permalink / raw) To: Jan Kara Cc: Trond Myklebust, bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote: > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > result > > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > Call Trace: > > > > __schedule+0x497/0x630 > > > > schedule+0x67/0x90 > > > > schedule_preempt_disabled+0xe/0x10 > > > > __mutex_lock+0x347/0x4b0 > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > do_shrink_slab+0x1f2/0x330 > > > > shrink_slab+0x244/0x2f0 > > > > shrink_node+0xd7/0x490 > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > try_to_free_pages+0x43f/0x540 > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > alloc_slab_page+0x32/0x2e0 > > > > new_slab+0xa6/0x8b0 > > > > ___slab_alloc+0x34b/0x520 > > > > kmem_cache_alloc+0x1c4/0x250 > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > fsnotify_add_mark+0x48/0x70 > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > svc_process+0xf9/0x110 [sunrpc] > > > > nfsd+0x10e/0x180 [nfsd] > > > > kthread+0x130/0x140 > > > > ret_from_fork+0x35/0x40 > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > --- > > > > fs/nfsd/filecache.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > > we > > > > we're seeing this deadlock for some customers. > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > *nf) > > > > struct fsnotify_mark *mark; > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > struct inode *inode = nf->nf_inode; > > > > + unsigned int pflags; > > > > > > > > do { > > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > *nf) > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > > */ > > > > + pflags = memalloc_nofs_save(); > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > > 0); > > > > + memalloc_nofs_restore(pflags); > > > > > > > > /* > > > > * If the add was successful, then return the object. > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > filesystem writeback situation? 
> > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > > to me that it cannot, particularly since knfsd is not a filesystem, and > > > so does not ever handle writeback of dirty pages. > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > > harder to hit? > > > > This report came in at good timing for me. > > > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > > be evictable in that way, so they would expose fanotify to this deadlock. > > > > For the short term, maybe nfsd filecache can avoid the problem by checking > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > > > Jan, > > > > A relatively simple fix would be to allocate fsnotify_mark_connector in > > fsnotify_add_mark() and free it, if a connector already exists for the object. > > I don't think there is a good reason to optimize away this allocation > > for the case of a non-first group to set a mark on an object? > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of > locks grabbed during inode slab reclaim. So any allocation under > group->mark_mutex has to be GFP_NOFS now. This is not just about connector > allocations but also mark allocations for fanotify. Moving allocations from > under mark_mutex is also possible solution but passing preallocated memory > around is kind of ugly as well. Yes, kind of, here is how it looks: https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3 https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7 > So the cleanest solution I currently see is > to come up with helpers like "fsnotify_lock_group() & > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > memalloc_nofs_save / restore magic. > Sounds good. Won't this cause a regression - more failures to setup new mark under memory pressure? Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? and set NOFS state only in that case, so at least we don't cause regression for existing applications? Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 11:56 ` Amir Goldstein @ 2022-03-21 14:51 ` Jan Kara 2022-03-22 22:41 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-21 14:51 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Trond Myklebust, bfields, khazhy, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton On Mon 21-03-22 13:56:47, Amir Goldstein wrote: > On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote: > > > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > > result > > > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > > Call Trace: > > > > > __schedule+0x497/0x630 > > > > > schedule+0x67/0x90 > > > > > schedule_preempt_disabled+0xe/0x10 > > > > > __mutex_lock+0x347/0x4b0 > > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > > do_shrink_slab+0x1f2/0x330 > > > > > shrink_slab+0x244/0x2f0 > > > > > shrink_node+0xd7/0x490 > > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > > try_to_free_pages+0x43f/0x540 > > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > > alloc_slab_page+0x32/0x2e0 > > > > > new_slab+0xa6/0x8b0 > > > > > ___slab_alloc+0x34b/0x520 > > > > > kmem_cache_alloc+0x1c4/0x250 > > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > > fsnotify_add_mark+0x48/0x70 > > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > > svc_process+0xf9/0x110 [sunrpc] > > > > > nfsd+0x10e/0x180 [nfsd] > > > > > kthread+0x130/0x140 > > > > > ret_from_fork+0x35/0x40 > > > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > > > we > > > > > we're seeing this deadlock for some customers. 
> > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > > *nf) > > > > > struct fsnotify_mark *mark; > > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > > struct inode *inode = nf->nf_inode; > > > > > + unsigned int pflags; > > > > > > > > > > do { > > > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > > *nf) > > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > > > */ > > > > > + pflags = memalloc_nofs_save(); > > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > > > 0); > > > > > + memalloc_nofs_restore(pflags); > > > > > > > > > > /* > > > > > * If the add was successful, then return the object. > > > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > > filesystem writeback situation? > > > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > > > to me that it cannot, particularly since knfsd is not a filesystem, and > > > > so does not ever handle writeback of dirty pages. > > > > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > > > harder to hit? > > > > > > This report came in at good timing for me. > > > > > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > > > be evictable in that way, so they would expose fanotify to this deadlock. > > > > > > For the short term, maybe nfsd filecache can avoid the problem by checking > > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > > > > > Jan, > > > > > > A relatively simple fix would be to allocate fsnotify_mark_connector in > > > fsnotify_add_mark() and free it, if a connector already exists for the object. > > > I don't think there is a good reason to optimize away this allocation > > > for the case of a non-first group to set a mark on an object? > > > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of > > locks grabbed during inode slab reclaim. So any allocation under > > group->mark_mutex has to be GFP_NOFS now. This is not just about connector > > allocations but also mark allocations for fanotify. Moving allocations from > > under mark_mutex is also possible solution but passing preallocated memory > > around is kind of ugly as well. > > Yes, kind of, here is how it looks: > https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3 > https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7 Yup, not an act of beauty but bearable in the worst case :). > > So the cleanest solution I currently see is > > to come up with helpers like "fsnotify_lock_group() & > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > memalloc_nofs_save / restore magic. > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > under memory pressure? 
Well, yes, the chances of hitting ENOMEM under heavy memory pressure are higher. But I don't think that much memory is consumed by connectors or marks that the reduced chances for direct reclaim would really substantially matter for the system as a whole. > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > and set NOFS state only in that case, so at least we don't cause regression > for existing applications? So that's a possibility I've left in my sleeve ;). We could do it but then we'd also have to tell lockdep that there are two kinds of mark_mutex locks so that it does not complain about possible reclaim deadlocks. Doable but at this point I didn't consider it worth it unless someone comes with a bug report from a real user scenario. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 14:51 ` Jan Kara @ 2022-03-22 22:41 ` Amir Goldstein 2022-03-23 10:41 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-22 22:41 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel > > > So the cleanest solution I currently see is > > > to come up with helpers like "fsnotify_lock_group() & > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > memalloc_nofs_save / restore magic. > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > under memory pressure? > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > higher. But I don't think that much memory is consumed by connectors or > marks that the reduced chances for direct reclaim would really > substantially matter for the system as a whole. > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > and set NOFS state only in that case, so at least we don't cause regression > > for existing applications? > > So that's a possibility I've left in my sleeve ;). We could do it but then > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > so that it does not complain about possible reclaim deadlocks. Doable but > at this point I didn't consider it worth it unless someone comes with a bug > report from a real user scenario. Are you sure about that? Note that fsnotify_destroy_mark() and friends already use lockdep class SINGLE_DEPTH_NESTING, so I think the lockdep annotation already assumes that deadlock from direct reclaim cannot happen and it is that assumption that was nearly broken by evictable inode marks. IIUC that means that we only need to wrap the fanotify allocations with GFP_NOFS (technically only after the first evictable mark)? Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-22 22:41 ` Amir Goldstein @ 2022-03-23 10:41 ` Jan Kara 2022-03-23 11:40 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-23 10:41 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Wed 23-03-22 00:41:28, Amir Goldstein wrote: > > > > So the cleanest solution I currently see is > > > > to come up with helpers like "fsnotify_lock_group() & > > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > > memalloc_nofs_save / restore magic. > > > > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > > under memory pressure? > > > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > > higher. But I don't think that much memory is consumed by connectors or > > marks that the reduced chances for direct reclaim would really > > substantially matter for the system as a whole. > > > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > > and set NOFS state only in that case, so at least we don't cause regression > > > for existing applications? > > > > So that's a possibility I've left in my sleeve ;). We could do it but then > > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > > so that it does not complain about possible reclaim deadlocks. Doable but > > at this point I didn't consider it worth it unless someone comes with a bug > > report from a real user scenario. > > Are you sure about that? Feel free to try it, I can be wrong... > Note that fsnotify_destroy_mark() and friends already use lockdep class > SINGLE_DEPTH_NESTING, so I think the lockdep annotation already > assumes that deadlock from direct reclaim cannot happen and it is that > assumption that was nearly broken by evictable inode marks. > > IIUC that means that we only need to wrap the fanotify allocations > with GFP_NOFS (technically only after the first evictable mark)? Well, the dependencies lockdep will infer are: Once fsnotify_destroy_mark() is called from inode reclaim, it will record mark_mutex as 'fs-reclaim-unsafe' (essentially fs_reclaim->mark_mutex dependency). Once filesystem direct reclaim happens from an allocation under mark_mutex, lockdep will record mark_mutex as 'need-to-be-fs-reclaim-safe' (mark_mutex->fs_reclaim) dependency. Hence a loop. Now I agree that SINGLE_DEPTH_NESTING (which is BTW used in several other places for unclear reasons - we should clean that up) might defeat this lockdep detection but in that case it would also defeat detection of real potential deadlocks (because the deadlock scenario you've found is real). Proper lockdep annotation needs to distinguish mark_locks which can be acquired from under fs reclaim and mark_locks which cannot be. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 10:41 ` Jan Kara @ 2022-03-23 11:40 ` Amir Goldstein 2022-03-23 13:48 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 11:40 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 12:41 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 23-03-22 00:41:28, Amir Goldstein wrote: > > > > > So the cleanest solution I currently see is > > > > > to come up with helpers like "fsnotify_lock_group() & > > > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > > > memalloc_nofs_save / restore magic. > > > > > > > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > > > under memory pressure? > > > > > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > > > higher. But I don't think that much memory is consumed by connectors or > > > marks that the reduced chances for direct reclaim would really > > > substantially matter for the system as a whole. > > > > > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > > > and set NOFS state only in that case, so at least we don't cause regression > > > > for existing applications? > > > > > > So that's a possibility I've left in my sleeve ;). We could do it but then > > > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > > > so that it does not complain about possible reclaim deadlocks. Doable but > > > at this point I didn't consider it worth it unless someone comes with a bug > > > report from a real user scenario. > > > > Are you sure about that? > > Feel free to try it, I can be wrong... > > > Note that fsnotify_destroy_mark() and friends already use lockdep class > > SINGLE_DEPTH_NESTING, so I think the lockdep annotation already > > assumes that deadlock from direct reclaim cannot happen and it is that > > assumption that was nearly broken by evictable inode marks. > > > > IIUC that means that we only need to wrap the fanotify allocations > > with GFP_NOFS (technically only after the first evictable mark)? > > Well, the dependencies lockdep will infer are: Once fsnotify_destroy_mark() > is called from inode reclaim, it will record mark_mutex as > 'fs-reclaim-unsafe' (essentially fs_reclaim->mark_mutex dependency). Once > filesystem direct reclaim happens from an allocation under mark_mutex, > lockdep will record mark_mutex as 'need-to-be-fs-reclaim-safe' > (mark_mutex->fs_reclaim) dependency. Hence a loop. Now I agree that > SINGLE_DEPTH_NESTING (which is BTW used in several other places for unclear > reasons - we should clean that up) might defeat this lockdep detection but > in that case it would also defeat detection of real potential deadlocks > (because the deadlock scenario you've found is real). Proper lockdep Definitely. My test now reproduces the deadlock very reliably within seconds lockdep is unaware of the deadlock because of the SINGLE_DEPTH_NESTING subclass missguided annotation. > annotation needs to distinguish mark_locks which can be acquired from under > fs reclaim and mark_locks which cannot be. > I see. 
So technically we can annotate the fanotify group mark_mutex with a different
key and then we have 4 subclasses of lock:
- fanotify mark_mutex is NOT reclaim safe
- non-fanotify mark_mutex is reclaim safe
- ANY mark_mutex (SINGLE_DEPTH_NESTING) is fs-reclaim unsafe

The reason I am a bit uneasy with regressing inotify is that there are users
of large recursive inotify watch crawlers out there (e.g. watchman), and when
crawling a large tree to add marks, there may be a need to reclaim some
memory by evicting inode cache (of non-marked subtrees).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 22+ messages in thread
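For reference, the annotation described above (and implemented in the commit
linked a couple of messages below) could look roughly like this; the helper
name and call site are assumptions of the sketch.

/*
 * Sketch: give the fanotify backend's mark_mutex its own lockdep class, so
 * that the reclaim-unsafe fanotify mark_mutex and the reclaim-safe
 * mark_mutex of the other backends are tracked as different locks.
 */
static struct lock_class_key fanotify_mark_mutex_key;

static void fanotify_set_mark_mutex_class(struct fsnotify_group *group)
{
	/* called once, right after the fanotify group is allocated */
	lockdep_set_class(&group->mark_mutex, &fanotify_mark_mutex_key);
}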
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 11:40 ` Amir Goldstein @ 2022-03-23 13:48 ` Jan Kara 2022-03-23 14:00 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-23 13:48 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Wed 23-03-22 13:40:18, Amir Goldstein wrote: > On Wed, Mar 23, 2022 at 12:41 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 23-03-22 00:41:28, Amir Goldstein wrote: > > > > > > So the cleanest solution I currently see is > > > > > > to come up with helpers like "fsnotify_lock_group() & > > > > > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > > > > > memalloc_nofs_save / restore magic. > > > > > > > > > > > > > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > > > > > under memory pressure? > > > > > > > > Well, yes, the chances of hitting ENOMEM under heavy memory pressure are > > > > higher. But I don't think that much memory is consumed by connectors or > > > > marks that the reduced chances for direct reclaim would really > > > > substantially matter for the system as a whole. > > > > > > > > > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > > > > > and set NOFS state only in that case, so at least we don't cause regression > > > > > for existing applications? > > > > > > > > So that's a possibility I've left in my sleeve ;). We could do it but then > > > > we'd also have to tell lockdep that there are two kinds of mark_mutex locks > > > > so that it does not complain about possible reclaim deadlocks. Doable but > > > > at this point I didn't consider it worth it unless someone comes with a bug > > > > report from a real user scenario. > > > > > > Are you sure about that? > > > > Feel free to try it, I can be wrong... > > > > > Note that fsnotify_destroy_mark() and friends already use lockdep class > > > SINGLE_DEPTH_NESTING, so I think the lockdep annotation already > > > assumes that deadlock from direct reclaim cannot happen and it is that > > > assumption that was nearly broken by evictable inode marks. > > > > > > IIUC that means that we only need to wrap the fanotify allocations > > > with GFP_NOFS (technically only after the first evictable mark)? > > > > Well, the dependencies lockdep will infer are: Once fsnotify_destroy_mark() > > is called from inode reclaim, it will record mark_mutex as > > 'fs-reclaim-unsafe' (essentially fs_reclaim->mark_mutex dependency). Once > > filesystem direct reclaim happens from an allocation under mark_mutex, > > lockdep will record mark_mutex as 'need-to-be-fs-reclaim-safe' > > (mark_mutex->fs_reclaim) dependency. Hence a loop. Now I agree that > > SINGLE_DEPTH_NESTING (which is BTW used in several other places for unclear > > reasons - we should clean that up) might defeat this lockdep detection but > > in that case it would also defeat detection of real potential deadlocks > > (because the deadlock scenario you've found is real). Proper lockdep > > Definitely. My test now reproduces the deadlock very reliably within seconds > lockdep is unaware of the deadlock because of the SINGLE_DEPTH_NESTING > subclass missguided annotation. > > > annotation needs to distinguish mark_locks which can be acquired from under > > fs reclaim and mark_locks which cannot be. > > > > I see. 
So technically we can annotate the fanotify group mark_mutex with > a different key and then we have 4 subclasses of lock: > - fanotify mark_mutex are NOT reclaim safe > - non-fanotify mark_mutex are reclaim safe > - ANY mark_mutex(SINGLE_DEPTH_NESTING) are fs-reclaim unsafe > > The reason I am a bit uneasy with regressing inotify is that there are users > of large recursive inotify watch crawlers out there (e.g. watchman) and when > crawling a large tree to add marks, there may be a need to reclaim > some memory by evicting inode cache (of non-marked subtrees). Well, but reclaim from kswapd is always the main and preferred source of memory reclaim. And we will kick kswapd to do work if we are running out of memory. Doing direct filesystem slab reclaim from mark allocation is useful only to throttle possibly aggressive mark allocations to the speed of reclaim (instead of getting ENOMEM). So I'm still not convinced this is a big issue but I certainly won't stop you from implementing more fine grained GFP mode selection and lockdep annotations if you want to go that way :). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 13:48 ` Jan Kara @ 2022-03-23 14:00 ` Amir Goldstein 2022-03-23 14:28 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 14:00 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel > Well, but reclaim from kswapd is always the main and preferred source of > memory reclaim. And we will kick kswapd to do work if we are running out of > memory. Doing direct filesystem slab reclaim from mark allocation is useful > only to throttle possibly aggressive mark allocations to the speed of > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > big issue but I certainly won't stop you from implementing more fine > grained GFP mode selection and lockdep annotations if you want to go that > way :). Well it was just two lines of code to annotate the fanotify mutex as its own class, so I just did that: https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 14:00 ` Amir Goldstein @ 2022-03-23 14:28 ` Jan Kara 2022-03-23 15:46 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-23 14:28 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > Well, but reclaim from kswapd is always the main and preferred source of > > memory reclaim. And we will kick kswapd to do work if we are running out of > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > only to throttle possibly aggressive mark allocations to the speed of > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > big issue but I certainly won't stop you from implementing more fine > > grained GFP mode selection and lockdep annotations if you want to go that > > way :). > > Well it was just two lines of code to annotate the fanotify mutex as its own > class, so I just did that: > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 But this implicitely assumes there isn't any allocation under mark_mutex anywhere else where it is held. Which is likely true (I didn't check) but it is kind of fragile. So I was rather imagining we would have per-group "NOFS" flag and fsnotify_group_lock/unlock() would call memalloc_nofs_save() based on the flag. And we would use fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 14:28 ` Jan Kara @ 2022-03-23 15:46 ` Amir Goldstein 2022-03-23 19:31 ` Amir Goldstein 2022-03-24 19:17 ` Amir Goldstein 0 siblings, 2 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 15:46 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > Well, but reclaim from kswapd is always the main and preferred source of > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > only to throttle possibly aggressive mark allocations to the speed of > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > big issue but I certainly won't stop you from implementing more fine > > > grained GFP mode selection and lockdep annotations if you want to go that > > > way :). > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > class, so I just did that: > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > But this implicitely assumes there isn't any allocation under mark_mutex > anywhere else where it is held. Which is likely true (I didn't check) but > it is kind of fragile. So I was rather imagining we would have per-group > "NOFS" flag and fsnotify_group_lock/unlock() would call > memalloc_nofs_save() based on the flag. And we would use > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > I see what you mean, but looking at the code it seems quite a bit of churn to go over all the old backends and convert the locks to use wrappers where we "know" those backends are fs reclaim safe (because we did not get reports of deadlocks over the decades they existed). I think I will sleep better with a conversion to three flavors: 1. pflags = fsnotify_group_nofs_lock(fanotify_group); 2. fsnotify_group_lock(dnotify_group) => WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) mutex_lock(&group->mark_mutex) 3. fsnotify_group_lock_nested(group) => mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) Thoughts? One more UAPI question. Do you think we should require user to opt-in to NOFS and evictable marks with FAN_SHRINKABLE? If we don't, it will be harder to fix regressions in legacy fanotify workloads if those are reported without breaking the evictable marks UAPI. Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
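Spelled out, the three flavors listed above could look roughly like this;
the group->flags field and the FSNOTIFY_NOFS flag are assumptions of the
sketch, and which backend gets which flavor follows the examples in the
message above (fanotify, dnotify, and the nested call sites respectively).

/* 1. NOFS flavor for groups whose marks can be freed from a shrinker. */
static inline unsigned int fsnotify_group_nofs_lock(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	return memalloc_nofs_save();
}

/* 2. Plain flavor for legacy backends that are fs-reclaim safe. */
static inline void fsnotify_group_lock(struct fsnotify_group *group)
{
	WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS);	/* flag assumed */
	mutex_lock(&group->mark_mutex);
}

/* 3. Nested flavor, matching the existing SINGLE_DEPTH_NESTING call sites. */
static inline void fsnotify_group_lock_nested(struct fsnotify_group *group)
{
	mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);
}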
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 15:46 ` Amir Goldstein @ 2022-03-23 19:31 ` Amir Goldstein 2022-03-24 19:17 ` Amir Goldstein 1 sibling, 0 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-23 19:31 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 5:46 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > > Well, but reclaim from kswapd is always the main and preferred source of > > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > > only to throttle possibly aggressive mark allocations to the speed of > > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > > big issue but I certainly won't stop you from implementing more fine > > > > grained GFP mode selection and lockdep annotations if you want to go that > > > > way :). > > > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > > class, so I just did that: > > > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > > > But this implicitely assumes there isn't any allocation under mark_mutex > > anywhere else where it is held. Which is likely true (I didn't check) but > > it is kind of fragile. So I was rather imagining we would have per-group > > "NOFS" flag and fsnotify_group_lock/unlock() would call > > memalloc_nofs_save() based on the flag. And we would use > > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > > > > I see what you mean, but looking at the code it seems quite a bit of churn to go > over all the old backends and convert the locks to use wrappers where we "know" > those backends are fs reclaim safe (because we did not get reports of deadlocks > over the decades they existed). > > I think I will sleep better with a conversion to three flavors: > > 1. pflags = fsnotify_group_nofs_lock(fanotify_group); > 2. fsnotify_group_lock(dnotify_group) => > WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) > mutex_lock(&group->mark_mutex) > 3. fsnotify_group_lock_nested(group) => > mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) > Converted common code: https://github.com/amir73il/linux/commit/a21677eaf6b45445b6eb3f0befcd7525c932b9da and fanotify: https://github.com/amir73il/linux/commit/b25f22b37c488b0898de8cd7a551892eacec0dae I can convert the rest of the backends cleanup patches later. Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-23 15:46 ` Amir Goldstein 2022-03-23 19:31 ` Amir Goldstein @ 2022-03-24 19:17 ` Amir Goldstein 2022-03-25 9:29 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: Amir Goldstein @ 2022-03-24 19:17 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel On Wed, Mar 23, 2022 at 5:46 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > > Well, but reclaim from kswapd is always the main and preferred source of > > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > > only to throttle possibly aggressive mark allocations to the speed of > > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > > big issue but I certainly won't stop you from implementing more fine > > > > grained GFP mode selection and lockdep annotations if you want to go that > > > > way :). > > > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > > class, so I just did that: > > > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > > > But this implicitely assumes there isn't any allocation under mark_mutex > > anywhere else where it is held. Which is likely true (I didn't check) but > > it is kind of fragile. So I was rather imagining we would have per-group > > "NOFS" flag and fsnotify_group_lock/unlock() would call > > memalloc_nofs_save() based on the flag. And we would use > > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > > > > I see what you mean, but looking at the code it seems quite a bit of churn to go > over all the old backends and convert the locks to use wrappers where we "know" > those backends are fs reclaim safe (because we did not get reports of deadlocks > over the decades they existed). > > I think I will sleep better with a conversion to three flavors: > > 1. pflags = fsnotify_group_nofs_lock(fanotify_group); > 2. fsnotify_group_lock(dnotify_group) => > WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) > mutex_lock(&group->mark_mutex) > 3. fsnotify_group_lock_nested(group) => > mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) > I think I might have misunderstood you and you meant that the SINGLE_DEPTH_NESTING subcalls should be eliminated and then we are left with two lock classes. Correct? Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-24 19:17 ` Amir Goldstein @ 2022-03-25 9:29 ` Jan Kara 2022-03-27 18:14 ` Amir Goldstein 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2022-03-25 9:29 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, khazhy, linux-mm, linux-fsdevel On Thu 24-03-22 21:17:09, Amir Goldstein wrote: > On Wed, Mar 23, 2022 at 5:46 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Wed, Mar 23, 2022 at 4:28 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Wed 23-03-22 16:00:30, Amir Goldstein wrote: > > > > > Well, but reclaim from kswapd is always the main and preferred source of > > > > > memory reclaim. And we will kick kswapd to do work if we are running out of > > > > > memory. Doing direct filesystem slab reclaim from mark allocation is useful > > > > > only to throttle possibly aggressive mark allocations to the speed of > > > > > reclaim (instead of getting ENOMEM). So I'm still not convinced this is a > > > > > big issue but I certainly won't stop you from implementing more fine > > > > > grained GFP mode selection and lockdep annotations if you want to go that > > > > > way :). > > > > > > > > Well it was just two lines of code to annotate the fanotify mutex as its own > > > > class, so I just did that: > > > > > > > > https://github.com/amir73il/linux/commit/7b4b6e2c0bd1942cd130e9202c4b187a8fb468c6 > > > > > > But this implicitely assumes there isn't any allocation under mark_mutex > > > anywhere else where it is held. Which is likely true (I didn't check) but > > > it is kind of fragile. So I was rather imagining we would have per-group > > > "NOFS" flag and fsnotify_group_lock/unlock() would call > > > memalloc_nofs_save() based on the flag. And we would use > > > fsnotify_group_lock/unlock() uniformly across the whole fsnotify codebase. > > > > > > > I see what you mean, but looking at the code it seems quite a bit of churn to go > > over all the old backends and convert the locks to use wrappers where we "know" > > those backends are fs reclaim safe (because we did not get reports of deadlocks > > over the decades they existed). > > > > I think I will sleep better with a conversion to three flavors: > > > > 1. pflags = fsnotify_group_nofs_lock(fanotify_group); > > 2. fsnotify_group_lock(dnotify_group) => > > WARN_ON_ONCE(group->flags & FSNOTIFY_NOFS) > > mutex_lock(&group->mark_mutex) > > 3. fsnotify_group_lock_nested(group) => > > mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING) > > > > I think I might have misunderstood you and you meant that the > SINGLE_DEPTH_NESTING subcalls should be eliminated and then > we are left with two lock classes. > Correct? Yeah, at least at this point I don't see a good reason for using SINGLE_DEPTH_NESTING lockdep annotation. In my opinion it has just a potential of silencing reports of real locking problems. So removing it and seeing what complains would be IMO a way to go. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-25 9:29 ` Jan Kara @ 2022-03-27 18:14 ` Amir Goldstein 0 siblings, 0 replies; 22+ messages in thread From: Amir Goldstein @ 2022-03-27 18:14 UTC (permalink / raw) To: Jan Kara; +Cc: khazhy, linux-mm, linux-fsdevel > > I think I might have misunderstood you and you meant that the > > SINGLE_DEPTH_NESTING subcalls should be eliminated and then > > we are left with two lock classes. > > Correct? > > Yeah, at least at this point I don't see a good reason for using > SINGLE_DEPTH_NESTING lockdep annotation. In my opinion it has just a > potential of silencing reports of real locking problems. So removing it and > seeing what complains would be IMO a way to go. > Agreed. In fact, the reason it was added is based on a misunderstanding of mutex_lock_nested(): Commit 6960b0d909cd ("fsnotify: change locking order") changed some of the mark_mutex locks in direct reclaim path to use: mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING); This change is explained: "...It uses nested locking to avoid deadlock in case we do the final iput() on an inode which still holds marks and thus would take the mutex again when calling fsnotify_inode_delete() in destroy_inode()." Pushed the fix along with conversion of all backends to fsnotify_group_lock() to fan_evictable branch, which I am still testing. It's worth noting that I found dnotify to be not fs reclaim safe, because dnotify_flush() is called from any filp_close() (problem is explained in the commit message), which contradicts my earlier argument that legacy backends "must be deadlock safe or we would have gotten deadlock reports by now". So yeh, eventually, we may set all legacy backends NOFS, but I would like to try and exempt inotify and audit for now to reduce the chance of regressions. Thanks, Amir. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 11:23 ` Jan Kara 2022-03-21 11:56 ` Amir Goldstein @ 2022-03-21 22:50 ` Trond Myklebust 2022-03-21 23:36 ` Khazhy Kumykov 2022-03-22 10:37 ` Jan Kara 1 sibling, 2 replies; 22+ messages in thread From: Trond Myklebust @ 2022-03-21 22:50 UTC (permalink / raw) To: jack, amir73il Cc: bfields, khazhy, linux-mm, linux-nfs, linux-kernel, jlayton, chuck.lever On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote: > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust > > <trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > result > > > > in recursing back into nfsd, resulting in deadlock. See below > > > > stack. > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > Call Trace: > > > > __schedule+0x497/0x630 > > > > schedule+0x67/0x90 > > > > schedule_preempt_disabled+0xe/0x10 > > > > __mutex_lock+0x347/0x4b0 > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > do_shrink_slab+0x1f2/0x330 > > > > shrink_slab+0x244/0x2f0 > > > > shrink_node+0xd7/0x490 > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > try_to_free_pages+0x43f/0x540 > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > alloc_slab_page+0x32/0x2e0 > > > > new_slab+0xa6/0x8b0 > > > > ___slab_alloc+0x34b/0x520 > > > > kmem_cache_alloc+0x1c4/0x250 > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > fsnotify_add_mark+0x48/0x70 > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > svc_process+0xf9/0x110 [sunrpc] > > > > nfsd+0x10e/0x180 [nfsd] > > > > kthread+0x130/0x140 > > > > ret_from_fork+0x35/0x40 > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > --- > > > > fs/nfsd/filecache.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > Marking this RFC since I haven't actually had a chance to test > > > > this, > > > > we > > > > we're seeing this deadlock for some customers. > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct > > > > nfsd_file > > > > *nf) > > > > struct fsnotify_mark *mark; > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > struct inode *inode = nf->nf_inode; > > > > + unsigned int pflags; > > > > > > > > do { > > > > mutex_lock(&nfsd_file_fsnotify_group- > > > > >mark_mutex); > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct > > > > nfsd_file > > > > *nf) > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > + /* fsnotify allocates, avoid recursion back > > > > into nfsd > > > > */ > > > > + pflags = memalloc_nofs_save(); > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, > > > > inode, > > > > 0); > > > > + memalloc_nofs_restore(pflags); > > > > > > > > /* > > > > * If the add was successful, then return the > > > > object. 
> > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > filesystem writeback situation? > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It > > > seems > > > to me that it cannot, particularly since knfsd is not a > > > filesystem, and > > > so does not ever handle writeback of dirty pages. > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem > > is > > harder to hit? > > > > This report came in at good timing for me. > > > > It demonstrates an issue I did not predict for "volatile"' fanotify > > marks [1]. > > As far as I can tell, nfsd filecache is currently the only fsnotify > > backend that > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks > > would also > > be evictable in that way, so they would expose fanotify to this > > deadlock. > > > > For the short term, maybe nfsd filecache can avoid the problem by > > checking > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort > > the > > shrinker. I wonder if there is a place for a helper > > mutex_is_locked_by_me()? > > > > Jan, > > > > A relatively simple fix would be to allocate > > fsnotify_mark_connector in > > fsnotify_add_mark() and free it, if a connector already exists for > > the object. > > I don't think there is a good reason to optimize away this > > allocation > > for the case of a non-first group to set a mark on an object? > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set > of > locks grabbed during inode slab reclaim. So any allocation under > group->mark_mutex has to be GFP_NOFS now. This is not just about > connector > allocations but also mark allocations for fanotify. Moving > allocations from > under mark_mutex is also possible solution but passing preallocated > memory > around is kind of ugly as well. So the cleanest solution I currently > see is > to come up with helpers like "fsnotify_lock_group() & > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also > do > memalloc_nofs_save / restore magic. As has already been reported, the problem was fixed in Linux 5.5 by the garbage collector rewrite, and so this is no longer an issue. In addition, please note that memalloc_nofs_save/restore and the use of GFP_NOFS was never a solution, because it does not prevent the kind of direct reclaim that was happening here. You'd have to enforce GFP_NOWAIT allocations, afaics. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 22+ messages in thread
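To make the disagreement above concrete: memalloc_nofs_save() does not disable direct reclaim at all. It only sets a task flag, and the page allocator then masks __GFP_FS out of the effective gfp mask for the duration, so the allocation may still enter direct reclaim and call shrinkers, just not the ones that require __GFP_FS. A simplified sketch of the relevant pieces (see include/linux/sched/mm.h for the real definitions):

static inline unsigned int memalloc_nofs_save(void)
{
        unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

        current->flags |= PF_MEMALLOC_NOFS;
        return flags;
}

/* applied by the allocator via current_gfp_context(), roughly: */
if (current->flags & PF_MEMALLOC_NOFS)
        gfp_mask &= ~__GFP_FS;

Whether that is enough therefore depends on whether the shrinkers that can end up taking the mark_mutex honour __GFP_FS, which is the point Jan returns to below.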
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 22:50 ` Trond Myklebust @ 2022-03-21 23:36 ` Khazhy Kumykov 2022-03-21 23:50 ` Trond Myklebust 2022-03-22 10:37 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: Khazhy Kumykov @ 2022-03-21 23:36 UTC (permalink / raw) To: Trond Myklebust, gregkh Cc: jack, amir73il, bfields, linux-mm, linux-nfs, linux-kernel, jlayton, chuck.lever [-- Attachment #1: Type: text/plain, Size: 486 bytes --] On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust <trondmy@hammerspace.com> wrote: > > As has already been reported, the problem was fixed in Linux 5.5 by the > garbage collector rewrite, and so this is no longer an issue. > 9542e6a643fc ("nfsd: Containerise filecache laundrette"), 36ebbdb96b69 ("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for me, which is still LTS. Since this should fix a real deadlock, would it be appropriate to include them for the 5.4 stable? [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3999 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 23:36 ` Khazhy Kumykov @ 2022-03-21 23:50 ` Trond Myklebust 0 siblings, 0 replies; 22+ messages in thread From: Trond Myklebust @ 2022-03-21 23:50 UTC (permalink / raw) To: khazhy, gregkh Cc: jack, chuck.lever, linux-mm, linux-kernel, jlayton, bfields, linux-nfs, amir73il On Mon, 2022-03-21 at 16:36 -0700, Khazhy Kumykov wrote: > On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust > <trondmy@hammerspace.com> wrote: > > > > As has already been reported, the problem was fixed in Linux 5.5 by > > the > > garbage collector rewrite, and so this is no longer an issue. > > > 9542e6a643fc ("nfsd: Containerise filecache laundrette"), > 36ebbdb96b69 > ("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for > me, which is still LTS. Since this should fix a real deadlock, would > it be appropriate to include them for the 5.4 stable? That would be OK with me. I'm not aware of any side-effects that might be a problem for 5.4. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-21 22:50 ` Trond Myklebust 2022-03-21 23:36 ` Khazhy Kumykov @ 2022-03-22 10:37 ` Jan Kara 1 sibling, 0 replies; 22+ messages in thread From: Jan Kara @ 2022-03-22 10:37 UTC (permalink / raw) To: Trond Myklebust Cc: jack, amir73il, bfields, khazhy, linux-mm, linux-nfs, linux-kernel, jlayton, chuck.lever On Mon 21-03-22 22:50:11, Trond Myklebust wrote: > On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote: > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust > > > <trondmy@hammerspace.com> wrote: > > > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > > result > > > > > in recursing back into nfsd, resulting in deadlock. See below > > > > > stack. > > > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > > Call Trace: > > > > > __schedule+0x497/0x630 > > > > > schedule+0x67/0x90 > > > > > schedule_preempt_disabled+0xe/0x10 > > > > > __mutex_lock+0x347/0x4b0 > > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > > do_shrink_slab+0x1f2/0x330 > > > > > shrink_slab+0x244/0x2f0 > > > > > shrink_node+0xd7/0x490 > > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > > try_to_free_pages+0x43f/0x540 > > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > > alloc_slab_page+0x32/0x2e0 > > > > > new_slab+0xa6/0x8b0 > > > > > ___slab_alloc+0x34b/0x520 > > > > > kmem_cache_alloc+0x1c4/0x250 > > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > > fsnotify_add_mark+0x48/0x70 > > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > > svc_process+0xf9/0x110 [sunrpc] > > > > > nfsd+0x10e/0x180 [nfsd] > > > > > kthread+0x130/0x140 > > > > > ret_from_fork+0x35/0x40 > > > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > Marking this RFC since I haven't actually had a chance to test > > > > > this, > > > > > we > > > > > we're seeing this deadlock for some customers. 
> > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct > > > > > nfsd_file > > > > > *nf) > > > > > struct fsnotify_mark *mark; > > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > > struct inode *inode = nf->nf_inode; > > > > > + unsigned int pflags; > > > > > > > > > > do { > > > > > mutex_lock(&nfsd_file_fsnotify_group- > > > > > >mark_mutex); > > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct > > > > > nfsd_file > > > > > *nf) > > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > > > + /* fsnotify allocates, avoid recursion back > > > > > into nfsd > > > > > */ > > > > > + pflags = memalloc_nofs_save(); > > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, > > > > > inode, > > > > > 0); > > > > > + memalloc_nofs_restore(pflags); > > > > > > > > > > /* > > > > > * If the add was successful, then return the > > > > > object. > > > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > > filesystem writeback situation? > > > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It > > > > seems > > > > to me that it cannot, particularly since knfsd is not a > > > > filesystem, and > > > > so does not ever handle writeback of dirty pages. > > > > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem > > > is > > > harder to hit? > > > > > > This report came in at good timing for me. > > > > > > It demonstrates an issue I did not predict for "volatile"' fanotify > > > marks [1]. > > > As far as I can tell, nfsd filecache is currently the only fsnotify > > > backend that > > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks > > > would also > > > be evictable in that way, so they would expose fanotify to this > > > deadlock. > > > > > > For the short term, maybe nfsd filecache can avoid the problem by > > > checking > > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort > > > the > > > shrinker. I wonder if there is a place for a helper > > > mutex_is_locked_by_me()? > > > > > > Jan, > > > > > > A relatively simple fix would be to allocate > > > fsnotify_mark_connector in > > > fsnotify_add_mark() and free it, if a connector already exists for > > > the object. > > > I don't think there is a good reason to optimize away this > > > allocation > > > for the case of a non-first group to set a mark on an object? > > > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set > > of > > locks grabbed during inode slab reclaim. So any allocation under > > group->mark_mutex has to be GFP_NOFS now. This is not just about > > connector > > allocations but also mark allocations for fanotify. Moving > > allocations from > > under mark_mutex is also possible solution but passing preallocated > > memory > > around is kind of ugly as well. So the cleanest solution I currently > > see is > > to come up with helpers like "fsnotify_lock_group() & > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also > > do > > memalloc_nofs_save / restore magic. > > As has already been reported, the problem was fixed in Linux 5.5 by the > garbage collector rewrite, and so this is no longer an issue. Sorry, I was not clear enough I guess. NFS is not a problem since 5.5 as you say. 
But Amir has changes in the works after which any filesystem inode reclaim could end up in exactly the same path (calling fsnotify_destroy_mark() from clear_inode()). So these changes would introduce the same deadlock NFS was prone to before 5.5. > In addition, please note that memalloc_nofs_save/restore and the use of > GFP_NOFS was never a solution, because it does not prevent the kind of > direct reclaim that was happening here. You'd have to enforce > GFP_NOWAIT allocations, afaics. GFP_NOFS should solve the above problem because with GFP_NOFS we cannot enter inode reclaim from the memory allocation and thus end up freeing marks. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
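The mechanism behind that last point: the path Jan describes (inode reclaim calling fsnotify_destroy_mark() from clear_inode()) is only reachable through the superblock shrinker, and that shrinker backs off when the allocation context forbids filesystem recursion. Abridged from fs/super.c:

static unsigned long super_cache_scan(struct shrinker *shrink,
                                      struct shrink_control *sc)
{
        /*
         * Deadlock avoidance: we may hold various FS locks, so do not
         * recurse into the FS that called us via clear_inode() and friends.
         */
        if (!(sc->gfp_mask & __GFP_FS))
                return SHRINK_STOP;

        /* ... prune dentries and inodes, return the number freed ... */
}

So an allocation made under memalloc_nofs_save() (or with GFP_NOFS) never reaches clear_inode(), which is why scoping the allocations done under mark_mutex to NOFS closes the lock cycle for evictable marks.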
* Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify 2022-03-19 9:36 ` Amir Goldstein 2022-03-21 11:23 ` Jan Kara @ 2022-03-21 17:06 ` Khazhy Kumykov 1 sibling, 0 replies; 22+ messages in thread From: Khazhy Kumykov @ 2022-03-21 17:06 UTC (permalink / raw) To: Amir Goldstein Cc: Trond Myklebust, Jan Kara, bfields, chuck.lever, linux-mm, linux-nfs, linux-kernel, Jeff Layton [-- Attachment #1: Type: text/plain, Size: 4985 bytes --] On Sat, Mar 19, 2022 at 2:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > result > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > nfsd D 0 1591536 2 0x80004080 > > > Call Trace: > > > __schedule+0x497/0x630 > > > schedule+0x67/0x90 > > > schedule_preempt_disabled+0xe/0x10 > > > __mutex_lock+0x347/0x4b0 > > > fsnotify_destroy_mark+0x22/0xa0 > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > do_shrink_slab+0x1f2/0x330 > > > shrink_slab+0x244/0x2f0 > > > shrink_node+0xd7/0x490 > > > do_try_to_free_pages+0x12f/0x3b0 > > > try_to_free_pages+0x43f/0x540 > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > __alloc_pages_nodemask+0x274/0x2c0 > > > alloc_slab_page+0x32/0x2e0 > > > new_slab+0xa6/0x8b0 > > > ___slab_alloc+0x34b/0x520 > > > kmem_cache_alloc+0x1c4/0x250 > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > fsnotify_add_mark+0x48/0x70 > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > svc_process+0xf9/0x110 [sunrpc] > > > nfsd+0x10e/0x180 [nfsd] > > > kthread+0x130/0x140 > > > ret_from_fork+0x35/0x40 > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > --- > > > fs/nfsd/filecache.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > we > > > we're seeing this deadlock for some customers. > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > struct fsnotify_mark *mark; > > > struct nfsd_file_mark *nfm = NULL, *new; > > > struct inode *inode = nf->nf_inode; > > > + unsigned int pflags; > > > > > > do { > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > refcount_set(&new->nfm_ref, 1); > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > */ > > > + pflags = memalloc_nofs_save(); > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > 0); > > > + memalloc_nofs_restore(pflags); > > > > > > /* > > > * If the add was successful, then return the object. > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > filesystem writeback situation? > > > > Does memalloc_nofs_save()/restore() really fix this problem? 
It seems > > to me that it cannot, particularly since knfsd is not a filesystem, and > > so does not ever handle writeback of dirty pages. > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > harder to hit? (I think I simply got confused - I don't see reason that NOFS would help with direct reclaim, though it does look like the gfp flags are passed via a shrink_control struct so one *could* react to them in the shrinker - again not an area i'm super familiar with) > > This report came in at good timing for me. > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > be evictable in that way, so they would expose fanotify to this deadlock. > > For the short term, maybe nfsd filecache can avoid the problem by checking > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? fwiw, it does look like ~5.5 nfsd did stop freeing fanotify marks during reclaim, in the commit "nfsd: Containerise filecache laundrette" (I had sent an earlier email about this, not sure where that's getting caught up, but I do see it on lore...) > > Jan, > > A relatively simple fix would be to allocate fsnotify_mark_connector in > fsnotify_add_mark() and free it, if a connector already exists for the object. > I don't think there is a good reason to optimize away this allocation > for the case of a non-first group to set a mark on an object? > > Thanks, > Amir. > > > > [1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/ [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3999 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
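Amir's short-term idea quoted above, i.e. have the mark-freeing shrinker notice that the group's mark_mutex is (or may be) held and back off, would look roughly like the sketch below. The function body follows the 5.4-era code referenced in the stack trace; the trylock guard is an illustration only, not the fix that was merged (5.5+ sidesteps the problem by deferring disposal to the filecache laundrette instead).

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
        LIST_HEAD(head);
        unsigned long ret;

        /*
         * If the fsnotify group's mark_mutex is held, possibly by this very
         * task while allocating inside fsnotify_add_mark(), then freeing
         * files here would deadlock in fsnotify_destroy_mark().  Err on the
         * side of backing off; the shrinker will simply be invoked again.
         */
        if (!mutex_trylock(&nfsd_file_fsnotify_group->mark_mutex))
                return SHRINK_STOP;
        mutex_unlock(&nfsd_file_fsnotify_group->mark_mutex);

        ret = list_lru_shrink_walk(&nfsd_file_lru, sc, nfsd_file_lru_cb, &head);
        nfsd_file_lru_dispose(&head);
        return ret;
}

A mutex_is_locked_by_me() style helper, if one were added, would make the check precise (bail out only when the current task is the holder); the trylock variant above is merely conservative.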
Thread overview: 22+ messages
[not found] <20220319001635.4097742-1-khazhy@google.com>
2022-03-19 0:36 ` [PATCH RFC] nfsd: avoid recursive locking through fsnotify Trond Myklebust
2022-03-19 1:45 ` Khazhy Kumykov
2022-03-19 9:36 ` Amir Goldstein
2022-03-21 11:23 ` Jan Kara
2022-03-21 11:56 ` Amir Goldstein
2022-03-21 14:51 ` Jan Kara
2022-03-22 22:41 ` Amir Goldstein
2022-03-23 10:41 ` Jan Kara
2022-03-23 11:40 ` Amir Goldstein
2022-03-23 13:48 ` Jan Kara
2022-03-23 14:00 ` Amir Goldstein
2022-03-23 14:28 ` Jan Kara
2022-03-23 15:46 ` Amir Goldstein
2022-03-23 19:31 ` Amir Goldstein
2022-03-24 19:17 ` Amir Goldstein
2022-03-25 9:29 ` Jan Kara
2022-03-27 18:14 ` Amir Goldstein
2022-03-21 22:50 ` Trond Myklebust
2022-03-21 23:36 ` Khazhy Kumykov
2022-03-21 23:50 ` Trond Myklebust
2022-03-22 10:37 ` Jan Kara
2022-03-21 17:06 ` Khazhy Kumykov