* [PATCH 0/2] mm: zswap: simplify zswap_swapoff() @ 2024-01-20 2:40 Yosry Ahmed 2024-01-20 2:40 ` [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done Yosry Ahmed 2024-01-20 2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed 0 siblings, 2 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-20 2:40 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel, Yosry Ahmed This mini-series aims to simplify zswap_swapoff(), which should simplify in-progress work touching it (zswap tree split, rbtree to xarray conversion). Patch 1 is a code readability change in the core swap code that makes patch 2 (the actual simplification) more obviously correct. Yosry Ahmed (2): mm: swap: update inuse_pages after all cleanups are done mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() mm/swapfile.c | 4 ++-- mm/zswap.c | 9 ++------- 2 files changed, 4 insertions(+), 9 deletions(-) -- 2.43.0.429.g432eaa2c6b-goog ^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-20 2:40 [PATCH 0/2] mm: zswap: simplify zswap_swapoff() Yosry Ahmed @ 2024-01-20 2:40 ` Yosry Ahmed 2024-01-22 13:17 ` Chengming Zhou 2024-01-23 8:59 ` Huang, Ying 2024-01-20 2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed 1 sibling, 2 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-20 2:40 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel, Yosry Ahmed In swap_range_free(), we update inuse_pages then do some cleanups (arch invalidation, zswap invalidation, swap cache cleanups, etc). During swapoff, try_to_unuse() uses inuse_pages to make sure all swap entries are freed. Make sure we only update inuse_pages after we are done with the cleanups. In practice, this shouldn't matter, because swap_range_free() is called with the swap info lock held, and the swapoff code will spin for that lock after try_to_unuse() anyway. The goal is to make it obvious and more future proof that once try_to_unuse() returns, all cleanups are done. This also facilitates a following zswap cleanup patch which uses this fact to simplify zswap_swapoff(). Signed-off-by: Yosry Ahmed <yosryahmed@google.com> --- mm/swapfile.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 556ff7347d5f0..2fedb148b9404 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -737,8 +737,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, if (was_full && (si->flags & SWP_WRITEOK)) add_to_avail_list(si); } - atomic_long_add(nr_entries, &nr_swap_pages); - WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); if (si->flags & SWP_BLKDEV) swap_slot_free_notify = si->bdev->bd_disk->fops->swap_slot_free_notify; @@ -752,6 +750,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, offset++; } clear_shadow_from_swap_cache(si->type, begin, end); + atomic_long_add(nr_entries, &nr_swap_pages); + WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); } static void set_cluster_next(struct swap_info_struct *si, unsigned long next) -- 2.43.0.429.g432eaa2c6b-goog ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-20 2:40 ` [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done Yosry Ahmed @ 2024-01-22 13:17 ` Chengming Zhou 2024-01-23 8:59 ` Huang, Ying 1 sibling, 0 replies; 47+ messages in thread From: Chengming Zhou @ 2024-01-22 13:17 UTC (permalink / raw) To: Yosry Ahmed, Andrew Morton Cc: Johannes Weiner, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/20 10:40, Yosry Ahmed wrote: > In swap_range_free(), we update inuse_pages then do some cleanups (arch > invalidation, zswap invalidation, swap cache cleanups, etc). During > swapoff, try_to_unuse() uses inuse_pages to make sure all swap entries > are freed. Make sure we only update inuse_pages after we are done with > the cleanups. > > In practice, this shouldn't matter, because swap_range_free() is called > with the swap info lock held, and the swapoff code will spin for that > lock after try_to_unuse() anyway. > > The goal is to make it obvious and more future proof that once > try_to_unuse() returns, all cleanups are done. This also facilitates a > following zswap cleanup patch which uses this fact to simplify > zswap_swapoff(). > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Thanks. > --- > mm/swapfile.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 556ff7347d5f0..2fedb148b9404 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -737,8 +737,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, > if (was_full && (si->flags & SWP_WRITEOK)) > add_to_avail_list(si); > } > - atomic_long_add(nr_entries, &nr_swap_pages); > - WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > if (si->flags & SWP_BLKDEV) > swap_slot_free_notify = > si->bdev->bd_disk->fops->swap_slot_free_notify; > @@ -752,6 +750,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, > offset++; > } > clear_shadow_from_swap_cache(si->type, begin, end); > + atomic_long_add(nr_entries, &nr_swap_pages); > + WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > } > > static void set_cluster_next(struct swap_info_struct *si, unsigned long next) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-20 2:40 ` [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done Yosry Ahmed 2024-01-22 13:17 ` Chengming Zhou @ 2024-01-23 8:59 ` Huang, Ying 2024-01-23 9:40 ` Yosry Ahmed 1 sibling, 1 reply; 47+ messages in thread From: Huang, Ying @ 2024-01-23 8:59 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel Yosry Ahmed <yosryahmed@google.com> writes: > In swap_range_free(), we update inuse_pages then do some cleanups (arch > invalidation, zswap invalidation, swap cache cleanups, etc). During > swapoff, try_to_unuse() uses inuse_pages to make sure all swap entries > are freed. Make sure we only update inuse_pages after we are done with > the cleanups. > > In practice, this shouldn't matter, because swap_range_free() is called > with the swap info lock held, and the swapoff code will spin for that > lock after try_to_unuse() anyway. > > The goal is to make it obvious and more future proof that once > try_to_unuse() returns, all cleanups are done. Defines "all cleanups". Apparently, some other operations are still to be done after try_to_unuse() in swap_off(). > This also facilitates a > following zswap cleanup patch which uses this fact to simplify > zswap_swapoff(). > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > mm/swapfile.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 556ff7347d5f0..2fedb148b9404 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -737,8 +737,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, > if (was_full && (si->flags & SWP_WRITEOK)) > add_to_avail_list(si); > } > - atomic_long_add(nr_entries, &nr_swap_pages); > - WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > if (si->flags & SWP_BLKDEV) > swap_slot_free_notify = > si->bdev->bd_disk->fops->swap_slot_free_notify; > @@ -752,6 +750,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, > offset++; > } > clear_shadow_from_swap_cache(si->type, begin, end); > + atomic_long_add(nr_entries, &nr_swap_pages); > + WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); This isn't enough. You need to use smp_wmb() here and smp_rmb() in somewhere reading si->inuse_pages. > } > > static void set_cluster_next(struct swap_info_struct *si, unsigned long next) -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-23 8:59 ` Huang, Ying @ 2024-01-23 9:40 ` Yosry Ahmed 2024-01-23 9:54 ` Yosry Ahmed 2024-01-24 3:13 ` Huang, Ying 0 siblings, 2 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-23 9:40 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 1:01 AM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > > In swap_range_free(), we update inuse_pages then do some cleanups (arch > > invalidation, zswap invalidation, swap cache cleanups, etc). During > > swapoff, try_to_unuse() uses inuse_pages to make sure all swap entries > > are freed. Make sure we only update inuse_pages after we are done with > > the cleanups. > > > > In practice, this shouldn't matter, because swap_range_free() is called > > with the swap info lock held, and the swapoff code will spin for that > > lock after try_to_unuse() anyway. > > > > The goal is to make it obvious and more future proof that once > > try_to_unuse() returns, all cleanups are done. > > Defines "all cleanups". Apparently, some other operations are still > to be done after try_to_unuse() in swap_off(). I am referring to the cleanups in swap_range_free() that I mentioned above. How about s/all the cleanups/all the cleanups in swap_range_free()? > > > This also facilitates a > > following zswap cleanup patch which uses this fact to simplify > > zswap_swapoff(). > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > mm/swapfile.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 556ff7347d5f0..2fedb148b9404 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -737,8 +737,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, > > if (was_full && (si->flags & SWP_WRITEOK)) > > add_to_avail_list(si); > > } > > - atomic_long_add(nr_entries, &nr_swap_pages); > > - WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > > if (si->flags & SWP_BLKDEV) > > swap_slot_free_notify = > > si->bdev->bd_disk->fops->swap_slot_free_notify; > > @@ -752,6 +750,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, > > offset++; > > } > > clear_shadow_from_swap_cache(si->type, begin, end); > > + atomic_long_add(nr_entries, &nr_swap_pages); > > + WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > > This isn't enough. You need to use smp_wmb() here and smp_rmb() in > somewhere reading si->inuse_pages. Hmm, good point. Although as I mentioned in the commit message, this shouldn't matter today as swap_range_free() executes with the lock held, and we spin on the lock after try_to_unuse() returns. It may still be more future-proof to add the memory barriers. In swap_range_free, we want to make sure that the write to si->inuse_pages in swap_range_free() happens *after* the cleanups (specifically zswap_invalidate() in this case). In swap_off, we want to make sure that the cleanups following try_to_unuse() (e.g. zswap_swapoff) happen *after* reading si->inuse_pages == 0 in try_to_unuse(). So I think we want smp_wmb() in swap_range_free() and smp_mb() in try_to_unuse(). Does the below look correct to you? 
diff --git a/mm/swapfile.c b/mm/swapfile.c index 2fedb148b9404..a2fa2f65a8ddd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -750,6 +750,12 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, offset++; } clear_shadow_from_swap_cache(si->type, begin, end); + + /* + * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 + * only after the above cleanups are done. + */ + smp_wmb(); atomic_long_add(nr_entries, &nr_swap_pages); WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); } @@ -2130,6 +2136,11 @@ static int try_to_unuse(unsigned int type) return -EINTR; } + /* + * Make sure that further cleanups after try_to_unuse() returns happen + * after swap_range_free() reduces si->inuse_pages to 0. + */ + smp_mb(); return 0; } Alternatively, we may just hold the spinlock in try_to_unuse() when we check si->inuse_pages at the end. This will also ensure that any calls to swap_range_free() have completed. Let me know what you prefer. ^ permalink raw reply [flat|nested] 47+ messages in thread
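A note on the pairing being discussed here, since it comes up again below: the smp_wmb() on the writer side only orders anything if the reader side has a matching barrier (or acquire) between loading si->inuse_pages and the work that depends on it. As a purely illustrative sketch (not a patch from this thread), the same ordering could also be expressed with a release/acquire pair on si->inuse_pages:

        /* writer side, end of swap_range_free(): the cleanups above are
         * ordered before the release store that publishes the new count */
        atomic_long_add(nr_entries, &nr_swap_pages);
        smp_store_release(&si->inuse_pages, si->inuse_pages - nr_entries);

        /* reader side, in try_to_unuse(): the acquire load pairs with the
         * release store above, so once 0 is observed here the preceding
         * cleanups are visible as well */
        if (!smp_load_acquire(&si->inuse_pages))
                return 0;

The same acquire (or barrier) is needed at every place try_to_unuse() reads the counter, which is the point raised further down the thread.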
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-23 9:40 ` Yosry Ahmed @ 2024-01-23 9:54 ` Yosry Ahmed 2024-01-24 3:13 ` Huang, Ying 1 sibling, 0 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-23 9:54 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel > Alternatively, we may just hold the spinlock in try_to_unuse() when we > check si->inuse_pages at the end. This will also ensure that any calls > to swap_range_free() have completed. Let me know what you prefer. To elaborate, I mean replacing this patch and the memory barriers with the diff below. diff --git a/mm/swapfile.c b/mm/swapfile.c index 2fedb148b9404..9b932ecbd80a8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2046,6 +2046,7 @@ static int try_to_unuse(unsigned int type) struct swap_info_struct *si = swap_info[type]; struct folio *folio; swp_entry_t entry; + unsigned int inuse; unsigned int i; if (!READ_ONCE(si->inuse_pages)) @@ -2123,8 +2124,14 @@ static int try_to_unuse(unsigned int type) * and even shmem_writepage() could have been preempted after * folio_alloc_swap(), temporarily hiding that swap. It's easy * and robust (though cpu-intensive) just to keep retrying. + * + * Read si->inuse_pages with the lock held to make sure that cleanups in + * swap_range_free() are completed when we read si->inuse_pages == 0. */ - if (READ_ONCE(si->inuse_pages)) { + spin_lock(&si->lock); + inuse = si->inuse_pages; + spin_unlock(&si->lock); + if (inuse) { if (!signal_pending(current)) goto retry; return -EINTR; ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-23 9:40 ` Yosry Ahmed 2024-01-23 9:54 ` Yosry Ahmed @ 2024-01-24 3:13 ` Huang, Ying 2024-01-24 3:20 ` Yosry Ahmed 1 sibling, 1 reply; 47+ messages in thread From: Huang, Ying @ 2024-01-24 3:13 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel Yosry Ahmed <yosryahmed@google.com> writes: > On Tue, Jan 23, 2024 at 1:01 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yosry Ahmed <yosryahmed@google.com> writes: >> >> > In swap_range_free(), we update inuse_pages then do some cleanups (arch >> > invalidation, zswap invalidation, swap cache cleanups, etc). During >> > swapoff, try_to_unuse() uses inuse_pages to make sure all swap entries >> > are freed. Make sure we only update inuse_pages after we are done with >> > the cleanups. >> > >> > In practice, this shouldn't matter, because swap_range_free() is called >> > with the swap info lock held, and the swapoff code will spin for that >> > lock after try_to_unuse() anyway. >> > >> > The goal is to make it obvious and more future proof that once >> > try_to_unuse() returns, all cleanups are done. >> >> Defines "all cleanups". Apparently, some other operations are still >> to be done after try_to_unuse() in swap_off(). > > I am referring to the cleanups in swap_range_free() that I mentioned above. > > How about s/all the cleanups/all the cleanups in swap_range_free()? Sounds good for me. >> >> > This also facilitates a >> > following zswap cleanup patch which uses this fact to simplify >> > zswap_swapoff(). >> > >> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >> > --- >> > mm/swapfile.c | 4 ++-- >> > 1 file changed, 2 insertions(+), 2 deletions(-) >> > >> > diff --git a/mm/swapfile.c b/mm/swapfile.c >> > index 556ff7347d5f0..2fedb148b9404 100644 >> > --- a/mm/swapfile.c >> > +++ b/mm/swapfile.c >> > @@ -737,8 +737,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, >> > if (was_full && (si->flags & SWP_WRITEOK)) >> > add_to_avail_list(si); >> > } >> > - atomic_long_add(nr_entries, &nr_swap_pages); >> > - WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); >> > if (si->flags & SWP_BLKDEV) >> > swap_slot_free_notify = >> > si->bdev->bd_disk->fops->swap_slot_free_notify; >> > @@ -752,6 +750,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, >> > offset++; >> > } >> > clear_shadow_from_swap_cache(si->type, begin, end); >> > + atomic_long_add(nr_entries, &nr_swap_pages); >> > + WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); >> >> This isn't enough. You need to use smp_wmb() here and smp_rmb() in >> somewhere reading si->inuse_pages. > > Hmm, good point. Although as I mentioned in the commit message, this > shouldn't matter today as swap_range_free() executes with the lock > held, and we spin on the lock after try_to_unuse() returns. Yes. IIUC, this patch isn't needed too because we have spinlock already. > It may still be more future-proof to add the memory barriers. Yes. Without memory barriers, moving code doesn't guarantee memory order. > In swap_range_free, we want to make sure that the write to > si->inuse_pages in swap_range_free() happens *after* the cleanups > (specifically zswap_invalidate() in this case). > In swap_off, we want to make sure that the cleanups following > try_to_unuse() (e.g. zswap_swapoff) happen *after* reading > si->inuse_pages == 0 in try_to_unuse(). 
> > So I think we want smp_wmb() in swap_range_free() and smp_mb() in > try_to_unuse(). Does the below look correct to you? > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2fedb148b9404..a2fa2f65a8ddd 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -750,6 +750,12 @@ static void swap_range_free(struct > swap_info_struct *si, unsigned long offset, > offset++; > } > clear_shadow_from_swap_cache(si->type, begin, end); > + > + /* > + * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 > + * only after the above cleanups are done. > + */ > + smp_wmb(); > atomic_long_add(nr_entries, &nr_swap_pages); > WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > } > @@ -2130,6 +2136,11 @@ static int try_to_unuse(unsigned int type) > return -EINTR; > } > > + /* > + * Make sure that further cleanups after try_to_unuse() returns happen > + * after swap_range_free() reduces si->inuse_pages to 0. > + */ > + smp_mb(); > return 0; > } We need to take care of "si->inuse_pages" checking at the beginning of try_to_unuse() too. Otherwise, it looks good to me. > Alternatively, we may just hold the spinlock in try_to_unuse() when we > check si->inuse_pages at the end. This will also ensure that any calls > to swap_range_free() have completed. Let me know what you prefer. Personally, I prefer memory barriers here. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-24 3:13 ` Huang, Ying @ 2024-01-24 3:20 ` Yosry Ahmed 2024-01-24 3:27 ` Huang, Ying 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-24 3:20 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel > > In swap_range_free, we want to make sure that the write to > > si->inuse_pages in swap_range_free() happens *after* the cleanups > > (specifically zswap_invalidate() in this case). > > In swap_off, we want to make sure that the cleanups following > > try_to_unuse() (e.g. zswap_swapoff) happen *after* reading > > si->inuse_pages == 0 in try_to_unuse(). > > > > So I think we want smp_wmb() in swap_range_free() and smp_mb() in > > try_to_unuse(). Does the below look correct to you? > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 2fedb148b9404..a2fa2f65a8ddd 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -750,6 +750,12 @@ static void swap_range_free(struct > > swap_info_struct *si, unsigned long offset, > > offset++; > > } > > clear_shadow_from_swap_cache(si->type, begin, end); > > + > > + /* > > + * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 > > + * only after the above cleanups are done. > > + */ > > + smp_wmb(); > > atomic_long_add(nr_entries, &nr_swap_pages); > > WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > > } > > @@ -2130,6 +2136,11 @@ static int try_to_unuse(unsigned int type) > > return -EINTR; > > } > > > > + /* > > + * Make sure that further cleanups after try_to_unuse() returns happen > > + * after swap_range_free() reduces si->inuse_pages to 0. > > + */ > > + smp_mb(); > > return 0; > > } > > We need to take care of "si->inuse_pages" checking at the beginning of > try_to_unuse() too. Otherwise, it looks good to me. Hmm, why isn't one barrier at the end of the function enough? I think all we need is that before we return from try_to_unuse(), all the cleanups in swap_range_free() are taken care of, which the barrier at the end should be doing. We just want instructions after try_to_unuse() to not get re-ordered before si->inuse_pages is read as 0, right? > > > Alternatively, we may just hold the spinlock in try_to_unuse() when we > > check si->inuse_pages at the end. This will also ensure that any calls > > to swap_range_free() have completed. Let me know what you prefer. > > Personally, I prefer memory barriers here. Ack. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-24 3:20 ` Yosry Ahmed @ 2024-01-24 3:27 ` Huang, Ying 2024-01-24 4:15 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Huang, Ying @ 2024-01-24 3:27 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel Yosry Ahmed <yosryahmed@google.com> writes: >> > In swap_range_free, we want to make sure that the write to >> > si->inuse_pages in swap_range_free() happens *after* the cleanups >> > (specifically zswap_invalidate() in this case). >> > In swap_off, we want to make sure that the cleanups following >> > try_to_unuse() (e.g. zswap_swapoff) happen *after* reading >> > si->inuse_pages == 0 in try_to_unuse(). >> > >> > So I think we want smp_wmb() in swap_range_free() and smp_mb() in >> > try_to_unuse(). Does the below look correct to you? >> > >> > diff --git a/mm/swapfile.c b/mm/swapfile.c >> > index 2fedb148b9404..a2fa2f65a8ddd 100644 >> > --- a/mm/swapfile.c >> > +++ b/mm/swapfile.c >> > @@ -750,6 +750,12 @@ static void swap_range_free(struct >> > swap_info_struct *si, unsigned long offset, >> > offset++; >> > } >> > clear_shadow_from_swap_cache(si->type, begin, end); >> > + >> > + /* >> > + * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 >> > + * only after the above cleanups are done. >> > + */ >> > + smp_wmb(); >> > atomic_long_add(nr_entries, &nr_swap_pages); >> > WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); >> > } >> > @@ -2130,6 +2136,11 @@ static int try_to_unuse(unsigned int type) >> > return -EINTR; >> > } >> > >> > + /* >> > + * Make sure that further cleanups after try_to_unuse() returns happen >> > + * after swap_range_free() reduces si->inuse_pages to 0. >> > + */ >> > + smp_mb(); >> > return 0; >> > } >> >> We need to take care of "si->inuse_pages" checking at the beginning of >> try_to_unuse() too. Otherwise, it looks good to me. > > Hmm, why isn't one barrier at the end of the function enough? I think > all we need is that before we return from try_to_unuse(), all the > cleanups in swap_range_free() are taken care of, which the barrier at > the end should be doing. We just want instructions after > try_to_unuse() to not get re-ordered before si->inuse_pages is read as > 0, right? Because at the begin of try_to_unuse() as below, after reading, function returns directly without any memory barriers. if (!READ_ONCE(si->inuse_pages)) return 0; -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done 2024-01-24 3:27 ` Huang, Ying @ 2024-01-24 4:15 ` Yosry Ahmed 0 siblings, 0 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-24 4:15 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 7:29 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > >> > In swap_range_free, we want to make sure that the write to > >> > si->inuse_pages in swap_range_free() happens *after* the cleanups > >> > (specifically zswap_invalidate() in this case). > >> > In swap_off, we want to make sure that the cleanups following > >> > try_to_unuse() (e.g. zswap_swapoff) happen *after* reading > >> > si->inuse_pages == 0 in try_to_unuse(). > >> > > >> > So I think we want smp_wmb() in swap_range_free() and smp_mb() in > >> > try_to_unuse(). Does the below look correct to you? > >> > > >> > diff --git a/mm/swapfile.c b/mm/swapfile.c > >> > index 2fedb148b9404..a2fa2f65a8ddd 100644 > >> > --- a/mm/swapfile.c > >> > +++ b/mm/swapfile.c > >> > @@ -750,6 +750,12 @@ static void swap_range_free(struct > >> > swap_info_struct *si, unsigned long offset, > >> > offset++; > >> > } > >> > clear_shadow_from_swap_cache(si->type, begin, end); > >> > + > >> > + /* > >> > + * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 > >> > + * only after the above cleanups are done. > >> > + */ > >> > + smp_wmb(); > >> > atomic_long_add(nr_entries, &nr_swap_pages); > >> > WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > >> > } > >> > @@ -2130,6 +2136,11 @@ static int try_to_unuse(unsigned int type) > >> > return -EINTR; > >> > } > >> > > >> > + /* > >> > + * Make sure that further cleanups after try_to_unuse() returns happen > >> > + * after swap_range_free() reduces si->inuse_pages to 0. > >> > + */ > >> > + smp_mb(); > >> > return 0; > >> > } > >> > >> We need to take care of "si->inuse_pages" checking at the beginning of > >> try_to_unuse() too. Otherwise, it looks good to me. > > > > Hmm, why isn't one barrier at the end of the function enough? I think > > all we need is that before we return from try_to_unuse(), all the > > cleanups in swap_range_free() are taken care of, which the barrier at > > the end should be doing. We just want instructions after > > try_to_unuse() to not get re-ordered before si->inuse_pages is read as > > 0, right? > > Because at the begin of try_to_unuse() as below, after reading, function > returns directly without any memory barriers. > > if (!READ_ONCE(si->inuse_pages)) > return 0; Right, I missed this one. Let me fix this up and send a v2. Thanks! ^ permalink raw reply [flat|nested] 47+ messages in thread
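To make the conclusion of this sub-thread concrete: both reads of si->inuse_pages in try_to_unuse() have to be ordered before whatever the caller does next. One possible shape for that (a sketch only, with unchanged parts of the function elided; the actual v2 may look different) is to funnel both return paths past a single barrier:

        static int try_to_unuse(unsigned int type)
        {
                struct swap_info_struct *si = swap_info[type];
                ...
                if (!READ_ONCE(si->inuse_pages))
                        goto success;
                ...
        retry:
                ...
                if (READ_ONCE(si->inuse_pages)) {
                        if (!signal_pending(current))
                                goto retry;
                        return -EINTR;
                }
        success:
                /*
                 * Make sure that further cleanups after try_to_unuse() returns
                 * happen after swap_range_free() reduces si->inuse_pages to 0.
                 * Pairs with the smp_wmb() added in swap_range_free().
                 */
                smp_mb();
                return 0;
        }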
* [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-20 2:40 [PATCH 0/2] mm: zswap: simplify zswap_swapoff() Yosry Ahmed 2024-01-20 2:40 ` [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done Yosry Ahmed @ 2024-01-20 2:40 ` Yosry Ahmed 2024-01-22 13:13 ` Chengming Zhou ` (3 more replies) 1 sibling, 4 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-20 2:40 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel, Yosry Ahmed During swapoff, try_to_unuse() makes sure that zswap_invalidate() is called for all swap entries before zswap_swapoff() is called. This means that all zswap entries should already be removed from the tree. Simplify zswap_swapoff() by removing the tree cleanup loop, and leaving an assertion in its place. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> --- Chengming, Chris, I think this should make the tree split and the xarray conversion patches simpler (especially the former). If others agree, both changes can be rebased on top of this. --- mm/zswap.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index f8bc9e0892687..9675c3c27f9d1 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) void zswap_swapoff(int type) { struct zswap_tree *tree = zswap_trees[type]; - struct zswap_entry *entry, *n; if (!tree) return; - /* walk the tree and free everything */ - spin_lock(&tree->lock); - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) - zswap_free_entry(entry); - tree->rbroot = RB_ROOT; - spin_unlock(&tree->lock); + /* try_to_unuse() invalidated all entries already */ + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); kfree(tree); zswap_trees[type] = NULL; } -- 2.43.0.429.g432eaa2c6b-goog ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-20 2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed @ 2024-01-22 13:13 ` Chengming Zhou 2024-01-22 20:19 ` Johannes Weiner ` (2 subsequent siblings) 3 siblings, 0 replies; 47+ messages in thread From: Chengming Zhou @ 2024-01-22 13:13 UTC (permalink / raw) To: Yosry Ahmed, Andrew Morton Cc: Johannes Weiner, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/20 10:40, Yosry Ahmed wrote: > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. Ok. Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Thanks. > --- > mm/zswap.c | 9 ++------- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f8bc9e0892687..9675c3c27f9d1 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) > void zswap_swapoff(int type) > { > struct zswap_tree *tree = zswap_trees[type]; > - struct zswap_entry *entry, *n; > > if (!tree) > return; > > - /* walk the tree and free everything */ > - spin_lock(&tree->lock); > - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) > - zswap_free_entry(entry); > - tree->rbroot = RB_ROOT; > - spin_unlock(&tree->lock); > + /* try_to_unuse() invalidated all entries already */ > + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); > kfree(tree); > zswap_trees[type] = NULL; > } ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-20 2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed 2024-01-22 13:13 ` Chengming Zhou @ 2024-01-22 20:19 ` Johannes Weiner 2024-01-22 20:39 ` Yosry Ahmed 2024-01-22 21:21 ` Nhat Pham 2024-01-22 22:31 ` Chris Li 3 siblings, 1 reply; 47+ messages in thread From: Johannes Weiner @ 2024-01-22 20:19 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> That's a great simplification. Removing the tree->lock made me double take, but at this point the swapfile and its cache should be fully dead and I don't see how any of the zswap operations that take tree->lock could race at this point. > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. The resulting code is definitely simpler, but this patch is not a completely trivial cleanup, either. If you put it before Chengming's patch and it breaks something, it would be difficult to pull out without affecting the tree split. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-22 20:19 ` Johannes Weiner @ 2024-01-22 20:39 ` Yosry Ahmed 2024-01-23 15:38 ` Johannes Weiner 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-22 20:39 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > called for all swap entries before zswap_swapoff() is called. This means > > that all zswap entries should already be removed from the tree. Simplify > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > assertion in its place. > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > That's a great simplification. > > Removing the tree->lock made me double take, but at this point the > swapfile and its cache should be fully dead and I don't see how any of > the zswap operations that take tree->lock could race at this point. It took me a while staring at the code to realize this loop is pointless. However, while I have your attention on the swapoff path, there's a slightly irrelevant problem that I think might be there, but I am not sure. It looks to me like swapoff can race with writeback, and there may be a chance of UAF for the zswap tree. For example, if zswap_swapoff() races with shrink_memcg_cb(), I feel like we may free the tree as it is being used. For example if zswap_swapoff()->kfree(tree) happen right before shrink_memcg_cb()->list_lru_isolate(l, item). Please tell me that I am being paranoid and that there is some protection against zswap writeback racing with swapoff. It feels like we are very careful with zswap entries refcounting, but not with the zswap tree itself. > > > --- > > Chengming, Chris, I think this should make the tree split and the xarray > > conversion patches simpler (especially the former). If others agree, > > both changes can be rebased on top of this. > > The resulting code is definitely simpler, but this patch is not a > completely trivial cleanup, either. If you put it before Chengming's > patch and it breaks something, it would be difficult to pull out > without affecting the tree split. Are you suggesting I rebase this on top of Chengming's patches? I can definitely do this, but the patch will be slightly less straightforward, and if the tree split patches break something it would be difficult to pull out as well. If you feel like this patch is more likely to break things, I can rebase. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-22 20:39 ` Yosry Ahmed @ 2024-01-23 15:38 ` Johannes Weiner 2024-01-23 15:54 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Johannes Weiner @ 2024-01-23 15:38 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > called for all swap entries before zswap_swapoff() is called. This means > > > that all zswap entries should already be removed from the tree. Simplify > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > assertion in its place. > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > That's a great simplification. > > > > Removing the tree->lock made me double take, but at this point the > > swapfile and its cache should be fully dead and I don't see how any of > > the zswap operations that take tree->lock could race at this point. > > It took me a while staring at the code to realize this loop is pointless. > > However, while I have your attention on the swapoff path, there's a > slightly irrelevant problem that I think might be there, but I am not > sure. > > It looks to me like swapoff can race with writeback, and there may be > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > races with shrink_memcg_cb(), I feel like we may free the tree as it > is being used. For example if zswap_swapoff()->kfree(tree) happen > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > Please tell me that I am being paranoid and that there is some > protection against zswap writeback racing with swapoff. It feels like > we are very careful with zswap entries refcounting, but not with the > zswap tree itself. Hm, I don't see how. Writeback operates on entries from the LRU. By the time zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should will have emptied out the LRU and tree. Writeback could have gotten a refcount to the entry and dropped the tree->lock. But then it does __read_swap_cache_async(), and while holding the page lock checks the tree under lock once more; if that finds the entry valid, it means try_to_unuse() hasn't started on this page yet, and would be held up by the page lock/writeback state. > > > Chengming, Chris, I think this should make the tree split and the xarray > > > conversion patches simpler (especially the former). If others agree, > > > both changes can be rebased on top of this. > > > > The resulting code is definitely simpler, but this patch is not a > > completely trivial cleanup, either. If you put it before Chengming's > > patch and it breaks something, it would be difficult to pull out > > without affecting the tree split. > > Are you suggesting I rebase this on top of Chengming's patches? I can > definitely do this, but the patch will be slightly less > straightforward, and if the tree split patches break something it > would be difficult to pull out as well. If you feel like this patch is > more likely to break things, I can rebase. Yeah I think it's more subtle. 
I'd only ask somebody to rebase an already tested patch on a newer one if the latter were an obvious, low-risk, prep-style patch. Your patch is good, but it doesn't quite fit into this particular category, so I'd say no jumping the queue ;) ^ permalink raw reply [flat|nested] 47+ messages in thread
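For readers following the race discussed next, the writeback path in question looks roughly like this (paraphrased and simplified from the 6.7-era mm/zswap.c, error handling elided; the exact context is visible in the reproducer diff further down):

        /* shrink_memcg_cb(), entered with the zswap LRU lock held */
        entry = container_of(item, struct zswap_entry, lru);
        swpoffset = swp_offset(entry->swpentry);
        tree = zswap_trees[swp_type(entry->swpentry)];
        list_lru_isolate(l, item);
        spin_unlock(lock);                      /* LRU lock dropped */

                                                /* <-- nothing pins the tree here */
        spin_lock(&tree->lock);                 /* check for invalidate() race */
        if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
                goto unlock;
        zswap_entry_get(entry);                 /* pin the entry for writeback */
        spin_unlock(&tree->lock);

        zswap_writeback_entry(entry, tree);     /* __read_swap_cache_async() + recheck */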
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 15:38 ` Johannes Weiner @ 2024-01-23 15:54 ` Yosry Ahmed 2024-01-23 20:12 ` Johannes Weiner 2024-01-23 20:30 ` Nhat Pham 0 siblings, 2 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-23 15:54 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > that all zswap entries should already be removed from the tree. Simplify > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > assertion in its place. > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > That's a great simplification. > > > > > > Removing the tree->lock made me double take, but at this point the > > > swapfile and its cache should be fully dead and I don't see how any of > > > the zswap operations that take tree->lock could race at this point. > > > > It took me a while staring at the code to realize this loop is pointless. > > > > However, while I have your attention on the swapoff path, there's a > > slightly irrelevant problem that I think might be there, but I am not > > sure. > > > > It looks to me like swapoff can race with writeback, and there may be > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > Please tell me that I am being paranoid and that there is some > > protection against zswap writeback racing with swapoff. It feels like > > we are very careful with zswap entries refcounting, but not with the > > zswap tree itself. > > Hm, I don't see how. > > Writeback operates on entries from the LRU. By the time > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > will have emptied out the LRU and tree. > > Writeback could have gotten a refcount to the entry and dropped the > tree->lock. But then it does __read_swap_cache_async(), and while > holding the page lock checks the tree under lock once more; if that > finds the entry valid, it means try_to_unuse() hasn't started on this > page yet, and would be held up by the page lock/writeback state. Consider the following race: CPU 1 CPU 2 # In shrink_memcg_cb() # In swap_off list_lru_isolate() zswap_invalidate() .. zswap_swapoff() -> kfree(tree) spin_lock(&tree->lock); Isn't this a UAF or am I missing something here? > > > > > Chengming, Chris, I think this should make the tree split and the xarray > > > > conversion patches simpler (especially the former). If others agree, > > > > both changes can be rebased on top of this. > > > > > > The resulting code is definitely simpler, but this patch is not a > > > completely trivial cleanup, either. If you put it before Chengming's > > > patch and it breaks something, it would be difficult to pull out > > > without affecting the tree split. 
> > > > Are you suggesting I rebase this on top of Chengming's patches? I can > > definitely do this, but the patch will be slightly less > > straightforward, and if the tree split patches break something it > > would be difficult to pull out as well. If you feel like this patch is > > more likely to break things, I can rebase. > > Yeah I think it's more subtle. I'd only ask somebody to rebase an > already tested patch on a newer one if the latter were an obvious, > low-risk, prep-style patch. Your patch is good, but it doesn't quite > fit into this particular category, so I'd say no jumping the queue ;) My intention was to reduce the diff in both this patch and the tree split patches, but I do understand this is more subtle. I can rebase on top of Chengming's patches instead. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 15:54 ` Yosry Ahmed @ 2024-01-23 20:12 ` Johannes Weiner 2024-01-23 21:02 ` Yosry Ahmed 2024-01-23 20:30 ` Nhat Pham 1 sibling, 1 reply; 47+ messages in thread From: Johannes Weiner @ 2024-01-23 20:12 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote: > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > > that all zswap entries should already be removed from the tree. Simplify > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > > assertion in its place. > > > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > That's a great simplification. > > > > > > > > Removing the tree->lock made me double take, but at this point the > > > > swapfile and its cache should be fully dead and I don't see how any of > > > > the zswap operations that take tree->lock could race at this point. > > > > > > It took me a while staring at the code to realize this loop is pointless. > > > > > > However, while I have your attention on the swapoff path, there's a > > > slightly irrelevant problem that I think might be there, but I am not > > > sure. > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > Please tell me that I am being paranoid and that there is some > > > protection against zswap writeback racing with swapoff. It feels like > > > we are very careful with zswap entries refcounting, but not with the > > > zswap tree itself. > > > > Hm, I don't see how. > > > > Writeback operates on entries from the LRU. By the time > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > will have emptied out the LRU and tree. > > > > Writeback could have gotten a refcount to the entry and dropped the > > tree->lock. But then it does __read_swap_cache_async(), and while > > holding the page lock checks the tree under lock once more; if that > > finds the entry valid, it means try_to_unuse() hasn't started on this > > page yet, and would be held up by the page lock/writeback state. > > Consider the following race: > > CPU 1 CPU 2 > # In shrink_memcg_cb() # In swap_off > list_lru_isolate() > zswap_invalidate() > .. > zswap_swapoff() -> kfree(tree) > spin_lock(&tree->lock); > > Isn't this a UAF or am I missing something here? Oof. You're right, it looks like there is a bug. Digging through the history, I think this is actually quite old: the original backend shrinkers would pluck something off their LRU, drop all locks, then try to acquire tree->lock. There is no protection against swapoff. 
The lock that is supposed to protect this is the LRU lock. That's where reclaim and invalidation should synchronize. But it's not right: 1. We drop the LRU lock before acquiring the tree lock. We should instead trylock the tree while still holding the LRU lock to make sure the tree is safe against swapoff. 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But it must always cycle the LRU lock before freeing the tree for that synchronization to work. Once we're holding a refcount to the entry, it's safe to drop all locks for the next step because we'll then work against the swapcache and entry: __read_swap_cache_async() will try to pin and lock whatever swap entry is at that type+offset. This also pins the type's current tree. HOWEVER, if swapoff + swapon raced, this could be a different tree than what we had in @tree, so 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to look up zswap_trees[] again after __read_swap_cache_async() succeeded to validate the entry. Once it succeeded, we can validate the entry. The entry is valid due to our refcount. The zswap_trees[type] is valid due to the cache pin. However, if validation failed and we have a non-zero writeback_result, there is one last bug: 4. the original entry's tree is no longer valid for the entry put. The current scheme handles invalidation fine (which is good because that's quite common). But it's fundamentally unsynchronized against swapoff (which has probably gone undetected because that's rare). I can't think of an immediate solution to this, but I wanted to put my analysis out for comments. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 20:12 ` Johannes Weiner @ 2024-01-23 21:02 ` Yosry Ahmed 2024-01-24 6:57 ` Yosry Ahmed 2024-01-24 7:20 ` Chengming Zhou 0 siblings, 2 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-23 21:02 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote: > > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > > > that all zswap entries should already be removed from the tree. Simplify > > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > > > assertion in its place. > > > > > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > > > That's a great simplification. > > > > > > > > > > Removing the tree->lock made me double take, but at this point the > > > > > swapfile and its cache should be fully dead and I don't see how any of > > > > > the zswap operations that take tree->lock could race at this point. > > > > > > > > It took me a while staring at the code to realize this loop is pointless. > > > > > > > > However, while I have your attention on the swapoff path, there's a > > > > slightly irrelevant problem that I think might be there, but I am not > > > > sure. > > > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > > > Please tell me that I am being paranoid and that there is some > > > > protection against zswap writeback racing with swapoff. It feels like > > > > we are very careful with zswap entries refcounting, but not with the > > > > zswap tree itself. > > > > > > Hm, I don't see how. > > > > > > Writeback operates on entries from the LRU. By the time > > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > > will have emptied out the LRU and tree. > > > > > > Writeback could have gotten a refcount to the entry and dropped the > > > tree->lock. But then it does __read_swap_cache_async(), and while > > > holding the page lock checks the tree under lock once more; if that > > > finds the entry valid, it means try_to_unuse() hasn't started on this > > > page yet, and would be held up by the page lock/writeback state. > > > > Consider the following race: > > > > CPU 1 CPU 2 > > # In shrink_memcg_cb() # In swap_off > > list_lru_isolate() > > zswap_invalidate() > > .. > > zswap_swapoff() -> kfree(tree) > > spin_lock(&tree->lock); > > > > Isn't this a UAF or am I missing something here? > > Oof. You're right, it looks like there is a bug. 
Digging through the > history, I think this is actually quite old: the original backend > shrinkers would pluck something off their LRU, drop all locks, then > try to acquire tree->lock. There is no protection against swapoff. > > The lock that is supposed to protect this is the LRU lock. That's > where reclaim and invalidation should synchronize. But it's not right: > > 1. We drop the LRU lock before acquiring the tree lock. We should > instead trylock the tree while still holding the LRU lock to make > sure the tree is safe against swapoff. > > 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But > it must always cycle the LRU lock before freeing the tree for that > synchronization to work. > > Once we're holding a refcount to the entry, it's safe to drop all > locks for the next step because we'll then work against the swapcache > and entry: __read_swap_cache_async() will try to pin and lock whatever > swap entry is at that type+offset. This also pins the type's current > tree. HOWEVER, if swapoff + swapon raced, this could be a different > tree than what we had in @tree, so > > 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to > look up zswap_trees[] again after __read_swap_cache_async() > succeeded to validate the entry. > > Once it succeeded, we can validate the entry. The entry is valid due > to our refcount. The zswap_trees[type] is valid due to the cache pin. > > However, if validation failed and we have a non-zero writeback_result, > there is one last bug: > > 4. the original entry's tree is no longer valid for the entry put. > > The current scheme handles invalidation fine (which is good because > that's quite common). But it's fundamentally unsynchronized against > swapoff (which has probably gone undetected because that's rare). > > I can't think of an immediate solution to this, but I wanted to put my > analysis out for comments. Thanks for the great analysis, I missed the swapoff/swapon race myself :) The first solution that came to mind for me was refcounting the zswap tree with RCU with percpu-refcount, similar to how cgroup refs are handled (init in zswap_swapon() and kill in zswap_swapoff()). I think the percpu-refcount may be an overkill in terms of memory usage though. I think we can still do our own refcounting with RCU, but it may be more complicated. ^ permalink raw reply [flat|nested] 47+ messages in thread
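To make the percpu-refcount idea concrete, a rough sketch of what it could look like (the field name, call sites and tryget flavour are assumptions; making the zswap_trees[type] lookup itself safe is the extra complication mentioned above and is not shown):

        #include <linux/percpu-refcount.h>

        struct zswap_tree {
                struct rb_root rbroot;
                spinlock_t lock;
                struct percpu_ref ref;          /* assumed new field: pins the tree */
        };

        static void zswap_tree_release(struct percpu_ref *ref)
        {
                struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

                percpu_ref_exit(&tree->ref);
                kfree(tree);                    /* runs only after the last user drops its ref */
        }

        /* zswap_swapon(): percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL); */

        void zswap_swapoff(int type)
        {
                struct zswap_tree *tree = zswap_trees[type];

                if (!tree)
                        return;
                /* try_to_unuse() invalidated all entries already */
                WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot));
                zswap_trees[type] = NULL;
                percpu_ref_kill(&tree->ref);    /* drop the initial ref; no kfree() here */
        }

        /*
         * Writeback would bracket its use of the tree with
         * percpu_ref_tryget_live(&tree->ref) / percpu_ref_put(&tree->ref),
         * bailing out when tryget fails because swapoff has already killed
         * the ref.
         */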
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 21:02 ` Yosry Ahmed @ 2024-01-24 6:57 ` Yosry Ahmed 2024-01-25 5:28 ` Chris Li 2024-01-24 7:20 ` Chengming Zhou 1 sibling, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-24 6:57 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 5613 bytes --] > > > > > However, while I have your attention on the swapoff path, there's a > > > > > slightly irrelevant problem that I think might be there, but I am not > > > > > sure. > > > > > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > > > > > Please tell me that I am being paranoid and that there is some > > > > > protection against zswap writeback racing with swapoff. It feels like > > > > > we are very careful with zswap entries refcounting, but not with the > > > > > zswap tree itself. > > > > > > > > Hm, I don't see how. > > > > > > > > Writeback operates on entries from the LRU. By the time > > > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > > > will have emptied out the LRU and tree. > > > > > > > > Writeback could have gotten a refcount to the entry and dropped the > > > > tree->lock. But then it does __read_swap_cache_async(), and while > > > > holding the page lock checks the tree under lock once more; if that > > > > finds the entry valid, it means try_to_unuse() hasn't started on this > > > > page yet, and would be held up by the page lock/writeback state. > > > > > > Consider the following race: > > > > > > CPU 1 CPU 2 > > > # In shrink_memcg_cb() # In swap_off > > > list_lru_isolate() > > > zswap_invalidate() > > > .. > > > zswap_swapoff() -> kfree(tree) > > > spin_lock(&tree->lock); > > > > > > Isn't this a UAF or am I missing something here? > > > > Oof. You're right, it looks like there is a bug. Digging through the > > history, I think this is actually quite old: the original backend > > shrinkers would pluck something off their LRU, drop all locks, then > > try to acquire tree->lock. There is no protection against swapoff. > > > > The lock that is supposed to protect this is the LRU lock. That's > > where reclaim and invalidation should synchronize. But it's not right: > > > > 1. We drop the LRU lock before acquiring the tree lock. We should > > instead trylock the tree while still holding the LRU lock to make > > sure the tree is safe against swapoff. > > > > 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But > > it must always cycle the LRU lock before freeing the tree for that > > synchronization to work. > > > > Once we're holding a refcount to the entry, it's safe to drop all > > locks for the next step because we'll then work against the swapcache > > and entry: __read_swap_cache_async() will try to pin and lock whatever > > swap entry is at that type+offset. This also pins the type's current > > tree. HOWEVER, if swapoff + swapon raced, this could be a different > > tree than what we had in @tree, so > > > > 3. we shouldn't pass @tree to zswap_writeback_entry(). 
It needs to > > look up zswap_trees[] again after __read_swap_cache_async() > > succeeded to validate the entry. > > > > Once it succeeded, we can validate the entry. The entry is valid due > > to our refcount. The zswap_trees[type] is valid due to the cache pin. > > > > However, if validation failed and we have a non-zero writeback_result, > > there is one last bug: > > > > 4. the original entry's tree is no longer valid for the entry put. > > > > The current scheme handles invalidation fine (which is good because > > that's quite common). But it's fundamentally unsynchronized against > > swapoff (which has probably gone undetected because that's rare). > > > > I can't think of an immediate solution to this, but I wanted to put my > > analysis out for comments. > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > The first solution that came to mind for me was refcounting the zswap > tree with RCU with percpu-refcount, similar to how cgroup refs are > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > the percpu-refcount may be an overkill in terms of memory usage > though. I think we can still do our own refcounting with RCU, but it > may be more complicated. FWIW, I was able to reproduce the problem in a vm with the following kernel diff: diff --git a/mm/zswap.c b/mm/zswap.c index 78df16d307aa8..6580a4be52a18 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -880,6 +880,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o */ spin_unlock(lock); + pr_info("Sleeping in shrink_memcg_cb() before spin_lock(&tree->lock)\n"); + schedule_timeout_uninterruptible(msecs_to_jiffies(10 * 1000)); + /* Check for invalidate() race */ spin_lock(&tree->lock); if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) This basically expands the race window to 10 seconds. I have a reproducer script attached that utilizes the zswap shrinker (which makes this much easier to reproduce btw). The steps are as follows: - Compile the kernel with the above diff, and both ZRAM & KASAN enabled. - In one terminal, run zswap_wb_swapoff_race.sh. - In a different terminal, once the "Sleeping in shrink_memcg_cb()" message is logged, run "swapoff /dev/zram0". - In the first terminal, once the 10 seconds elapse, I get a UAF BUG from KASAN (log attached as well). 
[-- Attachment #2: zswap_wb_swapoff_race.sh --] [-- Type: text/x-sh, Size: 988 bytes --] #!/bin/bash TOTAL_MB=100 TOTAL_BYTES=$((TOTAL_MB << 20)) ROOT_CG=/sys/fs/cgroup ZRAM_DEV=/dev/zram0 TMPFS=./tmpfs A_ASCII=$(printf '%d' "'A") function cleanup() { rm -rf $TMPFS/* umount $TMPFS rmdir $TMPFS if swapon -s | grep -q $ZRAM_DEV; then swapoff $ZRAM_DEV fi echo 1 > /sys/block/zram0/reset rmdir $ROOT_CG/test } trap cleanup EXIT echo 1 > /sys/module/zswap/parameters/shrinker_enabled # Initialize tmpfs mkdir $TMPFS mount -t tmpfs none $TMPFS # Initialize zram echo $((TOTAL_MB*2))m > /sys/block/zram0/disksize mkswap -L zram0 $ZRAM_DEV swapon $ZRAM_DEV # Initialize cgroup echo "+memory" > $ROOT_CG/cgroup.subtree_control mkdir $ROOT_CG/test # Create a test tmpfs file containing the alphabet on repeat ( echo 0 > $ROOT_CG/test/cgroup.procs s=$(printf '%s' {A..Z}) yes $s | dd iflag=fullblock of=$TMPFS/test bs=1M count=$TOTAL_MB status=none ) # Reclaim everything (invoking the zswap shrinker) echo $TOTAL_BYTES > $ROOT_CG/test/memory.reclaim [-- Attachment #3: uaf_log --] [-- Type: application/octet-stream, Size: 5550 bytes --] [ 30.251114] zswap: Sleeping in shrink_memcg_cb() before spin_lock(&tree->lock) [ 40.408431] ================================================================== [ 40.413062] BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x59/0x110 [ 40.417247] Write of size 4 at addr ffff88810252dc98 by task zswap_wb_swapof/383 [ 40.421609] [ 40.422648] CPU: 0 PID: 383 Comm: zswap_wb_swapof Not tainted 6.7.0+ #19 [ 40.426663] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 40.432151] Call Trace: [ 40.433717] <TASK> [ 40.435125] dump_stack_lvl+0x98/0xd0 [ 40.437389] print_address_description+0x7f/0x3a0 [ 40.440272] print_report+0x119/0x220 [ 40.442540] ? _raw_spin_lock+0x59/0x110 [ 40.444971] ? kasan_complete_mode_report_info+0x66/0x80 [ 40.448170] ? _raw_spin_lock+0x59/0x110 [ 40.450571] kasan_report+0xd7/0x110 [ 40.452762] ? _raw_spin_lock+0x59/0x110 [ 40.455281] kasan_check_range+0x250/0x2f0 [ 40.457782] ? _raw_spin_lock+0x59/0x110 [ 40.460161] __kasan_check_write+0x18/0x20 [ 40.462638] _raw_spin_lock+0x59/0x110 [ 40.464933] shrink_memcg_cb+0x11c/0x700 [ 40.467360] ? xa_load+0xcd/0x140 [ 40.469429] __list_lru_walk_one+0x236/0x5c0 [ 40.472080] ? zswap_shrinker_count+0x330/0x330 [ 40.474841] ? zswap_shrinker_count+0x330/0x330 [ 40.477586] list_lru_walk_one+0xa3/0x100 [ 40.480023] zswap_shrinker_scan+0x2c7/0x450 [ 40.482649] do_shrink_slab+0x3a6/0x8f0 [ 40.485005] shrink_slab_memcg+0x3fe/0x7c0 [ 40.487499] shrink_slab+0x7c/0x3a0 [ 40.489658] ? mem_cgroup_iter+0x2ae/0x780 [ 40.492137] shrink_node_memcgs+0x397/0x670 [ 40.494671] shrink_node+0x25f/0xce0 [ 40.496877] shrink_zones+0x531/0x9e0 [ 40.499160] do_try_to_free_pages+0x1f4/0xb10 [ 40.501838] try_to_free_mem_cgroup_pages+0x37d/0x7c0 [ 40.504893] memory_reclaim+0x301/0x370 [ 40.507249] ? _copy_from_iter+0x162/0x11e0 [ 40.509851] ? check_heap_object+0x160/0x410 [ 40.512463] cgroup_file_write+0x1e3/0x3d0 [ 40.514960] ? 
cgroup_seqfile_stop+0xb0/0xb0 [ 40.517601] kernfs_fop_write_iter+0x28e/0x390 [ 40.520349] vfs_write+0x7d2/0xb10 [ 40.522478] ksys_write+0xde/0x1a0 [ 40.524574] __x64_sys_write+0x7f/0x90 [ 40.526891] do_syscall_64+0x6c/0x150 [ 40.529140] entry_SYSCALL_64_after_hwframe+0x63/0x6b [ 40.532189] RIP: 0033:0x7f718a982b00 [ 40.534411] Code: 40 00 48 8b 15 19 b3 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d e1 3a 0e 00 00 74 17 b8 01 00 00 00 09 [ 40.545338] RSP: 002b:00007ffff5fb5c68 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 40.549846] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f718a982b00 [ 40.554106] RDX: 000000000000000a RSI: 00005564fb1d5880 RDI: 0000000000000001 [ 40.558342] RBP: 00005564fb1d5880 R08: 0000000000000073 R09: 0000000000000001 [ 40.562586] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000a [ 40.566859] R13: 00007f718aa5f780 R14: 000000000000000a R15: 00007f718aa5ad00 [ 40.571103] </TASK> [ 40.572489] [ 40.573512] Allocated by task 388: [ 40.575650] kasan_save_track+0x30/0x60 [ 40.577986] kasan_save_alloc_info+0x69/0x70 [ 40.580572] __kasan_kmalloc+0xa2/0xb0 [ 40.582879] __kmalloc_node+0x1ee/0x420 [ 40.585235] kvmalloc_node+0x54/0x1a0 [ 40.587451] zswap_swapon+0x4a/0x260 [ 40.589650] __do_sys_swapon+0x12ad/0x1fc0 [ 40.592109] __x64_sys_swapon+0x5e/0x70 [ 40.594434] do_syscall_64+0x6c/0x150 [ 40.596652] entry_SYSCALL_64_after_hwframe+0x63/0x6b [ 40.599685] [ 40.600688] Freed by task 400: [ 40.602610] kasan_save_track+0x30/0x60 [ 40.604943] kasan_save_free_info+0x52/0x60 [ 40.607476] poison_slab_object+0x11a/0x190 [ 40.610006] __kasan_slab_free+0x39/0x70 [ 40.612375] kfree+0xf3/0x280 [ 40.614294] kvfree+0x29/0x30 [ 40.616136] zswap_swapoff+0x103/0x1a0 [ 40.618454] __se_sys_swapoff+0x19d8/0x21d0 [ 40.620990] __x64_sys_swapoff+0x3c/0x40 [ 40.623361] do_syscall_64+0x6c/0x150 [ 40.625605] entry_SYSCALL_64_after_hwframe+0x63/0x6b [ 40.628620] [ 40.629644] The buggy address belongs to the object at ffff88810252dc80 [ 40.629644] which belongs to the cache kmalloc-64 of size 64 [ 40.636813] The buggy address is located 24 bytes inside of [ 40.636813] freed 64-byte region [ffff88810252dc80, ffff88810252dcc0) [ 40.643854] [ 40.644871] The buggy address belongs to the physical page: [ 40.648187] page:00000000985903b6 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10252d [ 40.653713] flags: 0x400000000000800(slab|node=0|zone=1) [ 40.656880] page_type: 0xffffffff() [ 40.659031] raw: 0400000000000800 ffff888100042640 dead000000000100 dead000000000122 [ 40.663635] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 [ 40.668235] page dumped because: kasan: bad access detected [ 40.671564] [ 40.672569] Memory state around the buggy address: [ 40.675463] ffff88810252db80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc [ 40.679766] ffff88810252dc00: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc [ 40.684120] >ffff88810252dc80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ 40.688423] ^ [ 40.690860] ffff88810252dd00: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc [ 40.695136] ffff88810252dd80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc [ 40.699455] ================================================================== [ 40.704258] Disabling lock debugging due to kernel taint ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-24 6:57 ` Yosry Ahmed @ 2024-01-25 5:28 ` Chris Li 2024-01-25 7:59 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Chris Li @ 2024-01-25 5:28 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel Hi Yosry, On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > The first solution that came to mind for me was refcounting the zswap > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > the percpu-refcount may be an overkill in terms of memory usage > > though. I think we can still do our own refcounting with RCU, but it > > may be more complicated. > > FWIW, I was able to reproduce the problem in a vm with the following > kernel diff: Thanks for the great find. I was worry about the usage after free situation in this email: https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/ Glad you are able to find a reproducible case. That is one of the reasons I change the free to invalidate entries in my xarray patch. I think the swap_off code should remove the entry from the tree, just wait for each zswap entry to drop to zero. Then free it. That way you shouldn't need to refcount the tree. The tree refcount is effectively the combined refcount of all the zswap entries. Having refcount on the tree would be very high contention. Chris > diff --git a/mm/zswap.c b/mm/zswap.c > index 78df16d307aa8..6580a4be52a18 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -880,6 +880,9 @@ static enum lru_status shrink_memcg_cb(struct > list_head *item, struct list_lru_o > */ > spin_unlock(lock); > > + pr_info("Sleeping in shrink_memcg_cb() before > spin_lock(&tree->lock)\n"); > + schedule_timeout_uninterruptible(msecs_to_jiffies(10 * 1000)); > + > /* Check for invalidate() race */ > spin_lock(&tree->lock); > if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) > > This basically expands the race window to 10 seconds. I have a > reproducer script attached that utilizes the zswap shrinker (which > makes this much easier to reproduce btw). The steps are as follows: > - Compile the kernel with the above diff, and both ZRAM & KASAN enabled. > - In one terminal, run zswap_wb_swapoff_race.sh. > - In a different terminal, once the "Sleeping in shrink_memcg_cb()" > message is logged, run "swapoff /dev/zram0". > - In the first terminal, once the 10 seconds elapse, I get a UAF BUG > from KASAN (log attached as well). ^ permalink raw reply [flat|nested] 47+ messages in thread
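One way to read Chris's suggestion above is to give each tree a count of live entries and have swapoff wait for that count to drain before freeing the tree. A rough sketch of that reading follows; the nr_entries counter, where it is maintained, and the polling wait are assumptions made purely for illustration:

/* assumed extra field in struct zswap_tree: atomic_t nr_entries,
 * incremented when an entry is added to the tree and decremented in
 * zswap_free_entry() when the last reference to the entry is dropped */

void zswap_swapoff(int type)
{
	struct zswap_tree *tree = zswap_trees[type];

	if (!tree)
		return;

	/* try_to_unuse() already emptied the tree via zswap_invalidate();
	 * wait for writeback that still holds entry references to finish */
	while (atomic_read(&tree->nr_entries))
		schedule_timeout_uninterruptible(1);

	kfree(tree);
	zswap_trees[type] = NULL;
}

The reply that follows points out the remaining gap: writeback starts from an entry isolated off the LRU and still has to find and lock the tree before it can take any reference, so a per-entry count by itself does not close that window.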
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 5:28 ` Chris Li @ 2024-01-25 7:59 ` Yosry Ahmed 2024-01-25 18:55 ` Chris Li 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 7:59 UTC (permalink / raw) To: Chris Li Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@google.com> wrote: > > Hi Yosry, > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > > > The first solution that came to mind for me was refcounting the zswap > > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > > the percpu-refcount may be an overkill in terms of memory usage > > > though. I think we can still do our own refcounting with RCU, but it > > > may be more complicated. > > > > FWIW, I was able to reproduce the problem in a vm with the following > > kernel diff: > > Thanks for the great find. > > I was worry about the usage after free situation in this email: > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/ > > Glad you are able to find a reproducible case. That is one of the > reasons I change the free to invalidate entries in my xarray patch. > > I think the swap_off code should remove the entry from the tree, just > wait for each zswap entry to drop to zero. Then free it. This doesn't really help. The swapoff code is already removing all the entries from the trees before zswap_swapoff() is called through zswap_invalidate(). The race I described occurs because the writeback code is accessing the entries through the LRU, not the tree. The writeback code could have isolated a zswap entry from the LRU before swapoff, then tried to access it after swapoff. Although the zswap entry itself is referenced and safe to use, accessing the tree to grab the tree lock and check if the entry is still in the tree is the problem. > > That way you shouldn't need to refcount the tree. The tree refcount is > effectively the combined refcount of all the zswap entries. The problem is that given a zswap entry, you have no way to stabilize the zswap tree before trying to deference it with the current code. Chengming's suggestion of moving the swap cache pin before accessing the tree seems like the right way to go. > Having refcount on the tree would be very high contention. A percpu refcount cannot be contended by definition :) ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 7:59 ` Yosry Ahmed @ 2024-01-25 18:55 ` Chris Li 2024-01-25 20:57 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Chris Li @ 2024-01-25 18:55 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel Hi Yosry, On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@google.com> wrote: > > > > Hi Yosry, > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > > > > > The first solution that came to mind for me was refcounting the zswap > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > > > the percpu-refcount may be an overkill in terms of memory usage > > > > though. I think we can still do our own refcounting with RCU, but it > > > > may be more complicated. > > > > > > FWIW, I was able to reproduce the problem in a vm with the following > > > kernel diff: > > > > Thanks for the great find. > > > > I was worry about the usage after free situation in this email: > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/ > > > > Glad you are able to find a reproducible case. That is one of the > > reasons I change the free to invalidate entries in my xarray patch. > > > > I think the swap_off code should remove the entry from the tree, just > > wait for each zswap entry to drop to zero. Then free it. > > This doesn't really help. The swapoff code is already removing all the > entries from the trees before zswap_swapoff() is called through > zswap_invalidate(). The race I described occurs because the writeback > code is accessing the entries through the LRU, not the tree. The Why? Entry not in the tree is fine. What you describe is that, swap_off code will not see the entry because it is already not in the tree. The last one holding the reference count would free it. > writeback code could have isolated a zswap entry from the LRU before > swapoff, then tried to access it after swapoff. Although the zswap The writeback should have a reference count to the zswap entry it holds. The entry will not be free until the LRU is done and drop the reference count to zero. > entry itself is referenced and safe to use, accessing the tree to grab > the tree lock and check if the entry is still in the tree is the > problem. The swap off should wait until all the LRU list from that tree has been consumed before destroying the tree. In swap off, it walks all the process MM anyway, walking all the memcg and finding all the zswap entries in zswap LRU should solve that problem. Anyway, I think it is easier to reason about the behavior that way. Swap off will take the extra hit, but not the normal access of the tree. > > > > > That way you shouldn't need to refcount the tree. The tree refcount is > > effectively the combined refcount of all the zswap entries. > > The problem is that given a zswap entry, you have no way to stabilize > the zswap tree before trying to deference it with the current code. > Chengming's suggestion of moving the swap cache pin before accessing > the tree seems like the right way to go. > > > Having refcount on the tree would be very high contention. 
> > A percpu refcount cannot be contended by definition :) It still adds overhead on the normal access path, and it consumes memory. If we can avoid it, it would be a win. Chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 18:55 ` Chris Li @ 2024-01-25 20:57 ` Yosry Ahmed 2024-01-25 22:31 ` Chris Li 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 20:57 UTC (permalink / raw) To: Chris Li Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@kernel.org> wrote: > > Hi Yosry, > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@google.com> wrote: > > > > > > Hi Yosry, > > > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > > > > > > > The first solution that came to mind for me was refcounting the zswap > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > > > > the percpu-refcount may be an overkill in terms of memory usage > > > > > though. I think we can still do our own refcounting with RCU, but it > > > > > may be more complicated. > > > > > > > > FWIW, I was able to reproduce the problem in a vm with the following > > > > kernel diff: > > > > > > Thanks for the great find. > > > > > > I was worry about the usage after free situation in this email: > > > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/ > > > > > > Glad you are able to find a reproducible case. That is one of the > > > reasons I change the free to invalidate entries in my xarray patch. > > > > > > I think the swap_off code should remove the entry from the tree, just > > > wait for each zswap entry to drop to zero. Then free it. > > > > This doesn't really help. The swapoff code is already removing all the > > entries from the trees before zswap_swapoff() is called through > > zswap_invalidate(). The race I described occurs because the writeback > > code is accessing the entries through the LRU, not the tree. The > > Why? > Entry not in the tree is fine. What you describe is that, swap_off > code will not see the entry because it is already not in the tree. > The last one holding the reference count would free it. > > > writeback code could have isolated a zswap entry from the LRU before > > swapoff, then tried to access it after swapoff. Although the zswap > > The writeback should have a reference count to the zswap entry it > holds. The entry will not be free until the LRU is done and drop the > reference count to zero. > > > entry itself is referenced and safe to use, accessing the tree to grab > > the tree lock and check if the entry is still in the tree is the > > problem. > > The swap off should wait until all the LRU list from that tree has > been consumed before destroying the tree. > In swap off, it walks all the process MM anyway, walking all the memcg > and finding all the zswap entries in zswap LRU should solve that > problem. At that point, the entry is isolated from the zswap LRU list as well. So even if swap off iterates the zswap LRUs, it cannot find it to wait for it. Take a closer look at the race condition I described. 
The problem is that after the entry is isolated from the zswap LRU, we need to grab the tree lock to make sure it's still there and get a ref, and just trying to lock the tree may be a UAF if we race with swapoff. > Anyway, I think it is easier to reason about the behavior that way. > Swap off will take the extra hit, but not the normal access of the > tree. I think this adds a lot of unnecessary complexity tbh. I think the operations reordering that Chengming described may be the way to go here. It does not include adding more logic or synchronization primitives, just reordering operations to be closer to what we do in zswap_load() and leverage existing synchronization. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 20:57 ` Yosry Ahmed @ 2024-01-25 22:31 ` Chris Li 2024-01-25 22:33 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Chris Li @ 2024-01-25 22:31 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel Hi Yosry, On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Yosry, > > > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@google.com> wrote: > > > > > > > > Hi Yosry, > > > > > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > > > > > > > > > The first solution that came to mind for me was refcounting the zswap > > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > > > > > the percpu-refcount may be an overkill in terms of memory usage > > > > > > though. I think we can still do our own refcounting with RCU, but it > > > > > > may be more complicated. > > > > > > > > > > FWIW, I was able to reproduce the problem in a vm with the following > > > > > kernel diff: > > > > > > > > Thanks for the great find. > > > > > > > > I was worry about the usage after free situation in this email: > > > > > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/ > > > > > > > > Glad you are able to find a reproducible case. That is one of the > > > > reasons I change the free to invalidate entries in my xarray patch. > > > > > > > > I think the swap_off code should remove the entry from the tree, just > > > > wait for each zswap entry to drop to zero. Then free it. > > > > > > This doesn't really help. The swapoff code is already removing all the > > > entries from the trees before zswap_swapoff() is called through > > > zswap_invalidate(). The race I described occurs because the writeback > > > code is accessing the entries through the LRU, not the tree. The > > > > Why? > > Entry not in the tree is fine. What you describe is that, swap_off > > code will not see the entry because it is already not in the tree. > > The last one holding the reference count would free it. > > > > > writeback code could have isolated a zswap entry from the LRU before > > > swapoff, then tried to access it after swapoff. Although the zswap > > > > The writeback should have a reference count to the zswap entry it > > holds. The entry will not be free until the LRU is done and drop the > > reference count to zero. > > > > > entry itself is referenced and safe to use, accessing the tree to grab > > > the tree lock and check if the entry is still in the tree is the > > > problem. > > > > The swap off should wait until all the LRU list from that tree has > > been consumed before destroying the tree. > > In swap off, it walks all the process MM anyway, walking all the memcg > > and finding all the zswap entries in zswap LRU should solve that > > problem. > > At that point, the entry is isolated from the zswap LRU list as well. 
> So even if swap off iterates the zswap LRUs, it cannot find it to wait It just means that we need to defer removing the entry from the LRU, and only remove it after the writeback has completed its critical steps. > for it. Take a closer look at the race condition I described. The I took a closer look at the sequence Chengming describes; it also delays removing the entry from the LRU. > problem is that after the entry is isolated from the zswap LRU, we > need to grab the tree lock to make sure it's still there and get a > ref, and just trying to lock the tree may be a UAF if we race with > swapoff. I feel it is very wrong to have the tree freed while entries allocated from it are still outstanding. I would want to avoid that situation if possible. > > > Anyway, I think it is easier to reason about the behavior that way. > > Swap off will take the extra hit, but not the normal access of the > > tree. > > I think this adds a lot of unnecessary complexity tbh. I think the > operations reordering that Chengming described may be the way to go > here. It does not include adding more logic or synchronization It does not require adding the tree reference count, right? > primitives, just reordering operations to be closer to what we do in > zswap_load() and leverage existing synchronization. The complexity is mostly there to avoid the tree reference count. If we can avoid both the tree refcount and the extra complexity of having the tree wait for the LRU, that sounds great to me. Chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 22:31 ` Chris Li @ 2024-01-25 22:33 ` Yosry Ahmed 2024-01-26 1:09 ` Chris Li 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 22:33 UTC (permalink / raw) To: Chris Li Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 2:31 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Yosry, > > On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > Hi Yosry, > > > > > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@google.com> wrote: > > > > > > > > > > Hi Yosry, > > > > > > > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > > > > > > > > > > > The first solution that came to mind for me was refcounting the zswap > > > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > > > > > > the percpu-refcount may be an overkill in terms of memory usage > > > > > > > though. I think we can still do our own refcounting with RCU, but it > > > > > > > may be more complicated. > > > > > > > > > > > > FWIW, I was able to reproduce the problem in a vm with the following > > > > > > kernel diff: > > > > > > > > > > Thanks for the great find. > > > > > > > > > > I was worry about the usage after free situation in this email: > > > > > > > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/ > > > > > > > > > > Glad you are able to find a reproducible case. That is one of the > > > > > reasons I change the free to invalidate entries in my xarray patch. > > > > > > > > > > I think the swap_off code should remove the entry from the tree, just > > > > > wait for each zswap entry to drop to zero. Then free it. > > > > > > > > This doesn't really help. The swapoff code is already removing all the > > > > entries from the trees before zswap_swapoff() is called through > > > > zswap_invalidate(). The race I described occurs because the writeback > > > > code is accessing the entries through the LRU, not the tree. The > > > > > > Why? > > > Entry not in the tree is fine. What you describe is that, swap_off > > > code will not see the entry because it is already not in the tree. > > > The last one holding the reference count would free it. > > > > > > > writeback code could have isolated a zswap entry from the LRU before > > > > swapoff, then tried to access it after swapoff. Although the zswap > > > > > > The writeback should have a reference count to the zswap entry it > > > holds. The entry will not be free until the LRU is done and drop the > > > reference count to zero. > > > > > > > entry itself is referenced and safe to use, accessing the tree to grab > > > > the tree lock and check if the entry is still in the tree is the > > > > problem. > > > > > > The swap off should wait until all the LRU list from that tree has > > > been consumed before destroying the tree. 
> > > In swap off, it walks all the process MM anyway, walking all the memcg > > > and finding all the zswap entries in zswap LRU should solve that > > > problem. > > > > At that point, the entry is isolated from the zswap LRU list as well. > > So even if swap off iterates the zswap LRUs, it cannot find it to wait > > It just means that we need to defer removing the entry from LRU, only > remove it after most of the write back is complete to some critical > steps. > > > for it. Take a closer look at the race condition I described. The > > I take a closer look at the sequence Chengming describe, it has the > element of delay removing entry from the LRU as well. > > > problem is that after the entry is isolated from the zswap LRU, we > > need to grab the tree lock to make sure it's still there and get a > > ref, and just trying to lock the tree may be a UAF if we race with > > swapoff. > > I feel it is very wrong to have the tree freed while having > outstanding entry allocationed from the tree pending. > I would want to avoid that situation if possible. This should be the case with Chengming's solution. > > > > > > Anyway, I think it is easier to reason about the behavior that way. > > > Swap off will take the extra hit, but not the normal access of the > > > tree. > > > > I think this adds a lot of unnecessary complexity tbh. I think the > > operations reordering that Chengming described may be the way to go > > here. It does not include adding more logic or synchronization > > Does not require adding the tree reference count right? No, just reordering of operations in the writeback path. > > > primitives, just reordering operations to be closer to what we do in > > zswap_load() and leverage existing synchronization. > > The complexity is mostly for avoiding the tree reference count. If we > don't add the tree refcount and we don't need the extra complexity in > the tree waiting for LRU, that sounds great to me. I think that's what Chengming's proposal provides. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 22:33 ` Yosry Ahmed @ 2024-01-26 1:09 ` Chris Li 0 siblings, 0 replies; 47+ messages in thread From: Chris Li @ 2024-01-26 1:09 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel Hi Yosry, On Thu, Jan 25, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > problem is that after the entry is isolated from the zswap LRU, we > > > need to grab the tree lock to make sure it's still there and get a > > > ref, and just trying to lock the tree may be a UAF if we race with > > > swapoff. > > > > I feel it is very wrong to have the tree freed while having > > outstanding entry allocationed from the tree pending. > > I would want to avoid that situation if possible. > > This should be the case with Chengming's solution. Thanks for confirming. Looking forward to Chenming's patch. Chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 21:02 ` Yosry Ahmed 2024-01-24 6:57 ` Yosry Ahmed @ 2024-01-24 7:20 ` Chengming Zhou 2024-01-25 5:44 ` Chris Li 2024-01-25 7:53 ` Yosry Ahmed 1 sibling, 2 replies; 47+ messages in thread From: Chengming Zhou @ 2024-01-24 7:20 UTC (permalink / raw) To: Yosry Ahmed, Johannes Weiner Cc: Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/24 05:02, Yosry Ahmed wrote: > On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner <hannes@cmpxchg.org> wrote: >> >> On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote: >>> On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: >>>> >>>> On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: >>>>> On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: >>>>>> >>>>>> On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: >>>>>>> During swapoff, try_to_unuse() makes sure that zswap_invalidate() is >>>>>>> called for all swap entries before zswap_swapoff() is called. This means >>>>>>> that all zswap entries should already be removed from the tree. Simplify >>>>>>> zswap_swapoff() by removing the tree cleanup loop, and leaving an >>>>>>> assertion in its place. >>>>>>> >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >>>>>> >>>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org> >>>>>> >>>>>> That's a great simplification. >>>>>> >>>>>> Removing the tree->lock made me double take, but at this point the >>>>>> swapfile and its cache should be fully dead and I don't see how any of >>>>>> the zswap operations that take tree->lock could race at this point. >>>>> >>>>> It took me a while staring at the code to realize this loop is pointless. >>>>> >>>>> However, while I have your attention on the swapoff path, there's a >>>>> slightly irrelevant problem that I think might be there, but I am not >>>>> sure. >>>>> >>>>> It looks to me like swapoff can race with writeback, and there may be >>>>> a chance of UAF for the zswap tree. For example, if zswap_swapoff() >>>>> races with shrink_memcg_cb(), I feel like we may free the tree as it >>>>> is being used. For example if zswap_swapoff()->kfree(tree) happen >>>>> right before shrink_memcg_cb()->list_lru_isolate(l, item). >>>>> >>>>> Please tell me that I am being paranoid and that there is some >>>>> protection against zswap writeback racing with swapoff. It feels like >>>>> we are very careful with zswap entries refcounting, but not with the >>>>> zswap tree itself. >>>> >>>> Hm, I don't see how. >>>> >>>> Writeback operates on entries from the LRU. By the time >>>> zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should >>>> will have emptied out the LRU and tree. >>>> >>>> Writeback could have gotten a refcount to the entry and dropped the >>>> tree->lock. But then it does __read_swap_cache_async(), and while >>>> holding the page lock checks the tree under lock once more; if that >>>> finds the entry valid, it means try_to_unuse() hasn't started on this >>>> page yet, and would be held up by the page lock/writeback state. >>> >>> Consider the following race: >>> >>> CPU 1 CPU 2 >>> # In shrink_memcg_cb() # In swap_off >>> list_lru_isolate() >>> zswap_invalidate() >>> .. >>> zswap_swapoff() -> kfree(tree) >>> spin_lock(&tree->lock); >>> >>> Isn't this a UAF or am I missing something here? >> >> Oof. You're right, it looks like there is a bug. 
Digging through the >> history, I think this is actually quite old: the original backend >> shrinkers would pluck something off their LRU, drop all locks, then >> try to acquire tree->lock. There is no protection against swapoff. >> >> The lock that is supposed to protect this is the LRU lock. That's >> where reclaim and invalidation should synchronize. But it's not right: >> >> 1. We drop the LRU lock before acquiring the tree lock. We should >> instead trylock the tree while still holding the LRU lock to make >> sure the tree is safe against swapoff. >> >> 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But >> it must always cycle the LRU lock before freeing the tree for that >> synchronization to work. >> >> Once we're holding a refcount to the entry, it's safe to drop all >> locks for the next step because we'll then work against the swapcache >> and entry: __read_swap_cache_async() will try to pin and lock whatever >> swap entry is at that type+offset. This also pins the type's current >> tree. HOWEVER, if swapoff + swapon raced, this could be a different >> tree than what we had in @tree, so >> >> 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to >> look up zswap_trees[] again after __read_swap_cache_async() >> succeeded to validate the entry. >> >> Once it succeeded, we can validate the entry. The entry is valid due >> to our refcount. The zswap_trees[type] is valid due to the cache pin. >> >> However, if validation failed and we have a non-zero writeback_result, >> there is one last bug: >> >> 4. the original entry's tree is no longer valid for the entry put. >> >> The current scheme handles invalidation fine (which is good because >> that's quite common). But it's fundamentally unsynchronized against >> swapoff (which has probably gone undetected because that's rare). >> >> I can't think of an immediate solution to this, but I wanted to put my >> analysis out for comments. > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > The first solution that came to mind for me was refcounting the zswap > tree with RCU with percpu-refcount, similar to how cgroup refs are > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > the percpu-refcount may be an overkill in terms of memory usage > though. I think we can still do our own refcounting with RCU, but it > may be more complicated. Hello, I also thought about this problem for some time, maybe something like below can be changed to fix it? It's likely I missed something, just some thoughts. IMHO, the problem is caused by the different way in which we use zswap entry in the writeback, that should be much like zswap_load(). The zswap_load() comes in with the folio locked in swap cache, so it has stable zswap tree to search and lock... But in writeback case, we don't, shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, then release lru lock to get tree lock, which maybe freed already. So we should change here, we read swpentry from entry with lru list lock held, then release lru lock, to try to lock corresponding folio in swap cache, if we success, the following things is much the same like zswap_load(). We can get tree lock, to recheck the invalidate race, if no race happened, we can make sure the entry is still right and get refcount of it, then release the tree lock. The main differences between this writeback with zswap_load() is the handling of lru entry and the tree lifetime. 
The whole zswap_load() function holds a stable reference on the zswap tree, but that's not true for the shrink_memcg_cb() bottom half after __swap_writepage(), since we unlock the folio after that. So we can't reference the tree after that. This problem is easy to fix: we can zswap_invalidate_entry(tree, entry) early under the tree lock, since thereafter the writeback can't fail. BTW, I think we should also zswap_invalidate_entry() early in zswap_load() and only support the zswap_exclusive_loads_enabled mode, but that's another topic. The second difference is the handling of the lru entry, which is easy: we just zswap_lru_del() under the tree lock. So no path can reference the entry from the tree or the lru after we release the tree lock, and we can just zswap_free_entry() after writeback. Thanks!

// lru list lock held
shrink_memcg_cb()
	swpentry = entry->swpentry
	// Don't isolate entry from lru list here, just use list_lru_putback()
	spin_unlock(lru list lock)

	folio = __read_swap_cache_async(swpentry)
	if (!folio)
		return

	if (!folio_was_allocated)
		folio_put(folio)
		return

	// folio is locked, swapcache is secured against swapoff
	tree = get tree from swpentry
	spin_lock(&tree->lock)

	// check invalidate race? No
	if (entry == zswap_rb_search())
		// so we can make sure this entry is still right
		// zswap_invalidate_entry() since the below writeback can't fail
		zswap_entry_get(entry)
		zswap_invalidate_entry(tree, entry)

		// remove from lru list
		zswap_lru_del()

	spin_unlock(&tree->lock)

	__zswap_load()

	__swap_writepage()	// folio unlock
	folio_put(folio)

	// entry is safe to free, since it's removed from tree and lru above
	zswap_free_entry(entry)

^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-24 7:20 ` Chengming Zhou @ 2024-01-25 5:44 ` Chris Li 2024-01-25 8:01 ` Yosry Ahmed 2024-01-25 7:53 ` Yosry Ahmed 1 sibling, 1 reply; 47+ messages in thread From: Chris Li @ 2024-01-25 5:44 UTC (permalink / raw) To: Chengming Zhou Cc: Yosry Ahmed, Johannes Weiner, Andrew Morton, Nhat Pham, Huang Ying, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 11:21 PM Chengming Zhou <zhouchengming@bytedance.com> wrote: > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > The first solution that came to mind for me was refcounting the zswap > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > the percpu-refcount may be an overkill in terms of memory usage > > though. I think we can still do our own refcounting with RCU, but it > > may be more complicated. > Hello, > > I also thought about this problem for some time, maybe something like below > can be changed to fix it? It's likely I missed something, just some thoughts. > > IMHO, the problem is caused by the different way in which we use zswap entry > in the writeback, that should be much like zswap_load(). It is possible to keep the behavior difference and get the tree right. Follow a few rules: 1) Always bump the zswap entry refcount after finding the entry (inside the RCU read lock) 2) Always free entry after refcount drops to zero. 3) Swap off should just wait for the each entry ref count drop to zero (or 1 including the swap off code holding one) then free it. Swap off is the slow path anyway. BTW, the swap off is where swap device locks get a lot of contention. > > The zswap_load() comes in with the folio locked in swap cache, so it has > stable zswap tree to search and lock... But in writeback case, we don't, > shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, > then release lru lock to get tree lock, which maybe freed already. > > So we should change here, we read swpentry from entry with lru list lock held, > then release lru lock, to try to lock corresponding folio in swap cache, > if we success, the following things is much the same like zswap_load(). > We can get tree lock, to recheck the invalidate race, if no race happened, > we can make sure the entry is still right and get refcount of it, then > release the tree lock. > > The main differences between this writeback with zswap_load() is the handling > of lru entry and the tree lifetime. The whole zswap_load() function has the > stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half > after __swap_writepage() since we unlock the folio after that. So we can't > reference the tree after that. > > This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early > in tree lock, since thereafter writeback can't fail. BTW, I think we should > also zswap_invalidate_entry() early in zswap_load() and only support the > zswap_exclusive_loads_enabled mode, but that's another topic. > > The second difference is the handling of lru entry, which is easy that we > just zswap_lru_del() in tree lock. > > So no any path can reference the entry from tree or lru after we release > the tree lock, so we can just zswap_free_entry() after writeback. > > Thanks! 
>
> // lru list lock held
> shrink_memcg_cb()
>	swpentry = entry->swpentry
>	// Don't isolate entry from lru list here, just use list_lru_putback()
>	spin_unlock(lru list lock)
>
>	folio = __read_swap_cache_async(swpentry)
>	if (!folio)
>		return
>
>	if (!folio_was_allocated)
>		folio_put(folio)
>		return
>
>	// folio is locked, swapcache is secured against swapoff
>	tree = get tree from swpentry
>	spin_lock(&tree->lock)

That will not work well with zswap to xarray change. We want to remove the tree lock and only use the xarray lock. The lookup should just hold xarray RCU read lock and return the entry with ref count increased.

Chris

>
>	// check invalidate race? No
>	if (entry == zswap_rb_search())
>
>		// so we can make sure this entry is still right
>		// zswap_invalidate_entry() since the below writeback can't fail
>		zswap_entry_get(entry)
>		zswap_invalidate_entry(tree, entry)
>
>		// remove from lru list
>		zswap_lru_del()
>
>	spin_unlock(&tree->lock)
>
>	__zswap_load()
>
>	__swap_writepage()	// folio unlock
>	folio_put(folio)
>
>	// entry is safe to free, since it's removed from tree and lru above
>	zswap_free_entry(entry)
>
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 5:44 ` Chris Li @ 2024-01-25 8:01 ` Yosry Ahmed 2024-01-25 19:03 ` Chris Li 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 8:01 UTC (permalink / raw) To: Chris Li Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, Nhat Pham, Huang Ying, linux-mm, linux-kernel > > // lru list lock held > > shrink_memcg_cb() > > swpentry = entry->swpentry > > // Don't isolate entry from lru list here, just use list_lru_putback() > > spin_unlock(lru list lock) > > > > folio = __read_swap_cache_async(swpentry) > > if (!folio) > > return > > > > if (!folio_was_allocated) > > folio_put(folio) > > return > > > > // folio is locked, swapcache is secured against swapoff > > tree = get tree from swpentry > > spin_lock(&tree->lock) > > That will not work well with zswap to xarray change. We want to remove > the tree lock and only use the xarray lock. > The lookup should just hold xarray RCU read lock and return the entry > with ref count increased. In this path, we also invalidate the zswap entry, which would require holding the xarray lock anyway. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 8:01 ` Yosry Ahmed @ 2024-01-25 19:03 ` Chris Li 2024-01-25 21:01 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Chris Li @ 2024-01-25 19:03 UTC (permalink / raw) To: Yosry Ahmed Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, Nhat Pham, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 12:02 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > // lru list lock held > > > shrink_memcg_cb() > > > swpentry = entry->swpentry > > > // Don't isolate entry from lru list here, just use list_lru_putback() > > > spin_unlock(lru list lock) > > > > > > folio = __read_swap_cache_async(swpentry) > > > if (!folio) > > > return > > > > > > if (!folio_was_allocated) > > > folio_put(folio) > > > return > > > > > > // folio is locked, swapcache is secured against swapoff > > > tree = get tree from swpentry > > > spin_lock(&tree->lock) > > > > That will not work well with zswap to xarray change. We want to remove > > the tree lock and only use the xarray lock. > > The lookup should just hold xarray RCU read lock and return the entry > > with ref count increased. > > In this path, we also invalidate the zswap entry, which would require > holding the xarray lock anyway. It will drop the RCU read lock after finding the entry and re-acquire the xarray spin lock on invalidation. In between there is a brief moment without locks. Chris ^ permalink raw reply [flat|nested] 47+ messages in thread
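For concreteness, the lookup discipline described here (RCU read lock for the search, a reference taken before the lock is dropped, xarray lock re-acquired only to invalidate) could look roughly like the sketch below. The xarray-per-swapfile layout and the try-get form of the entry pin are assumptions, since the xarray conversion was still under review at this point:

/* find the entry at @offset and pin it, without holding the xa_lock */
static struct zswap_entry *zswap_xa_find_get(struct xarray *xa, pgoff_t offset)
{
	struct zswap_entry *entry;

	rcu_read_lock();
	entry = xa_load(xa, offset);
	if (entry && !zswap_entry_tryget(entry))	/* assumed try-get helper */
		entry = NULL;
	rcu_read_unlock();

	return entry;	/* caller re-takes the xa_lock only to invalidate */
}

For the unlocked xa_load() to be safe, the entry itself would have to be freed only after an RCU grace period, or via a reference that can no longer be taken once it has dropped to zero.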
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 19:03 ` Chris Li @ 2024-01-25 21:01 ` Yosry Ahmed 0 siblings, 0 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 21:01 UTC (permalink / raw) To: Chris Li Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, Nhat Pham, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 11:04 AM Chris Li <chrisl@kernel.org> wrote: > > On Thu, Jan 25, 2024 at 12:02 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > // lru list lock held > > > > shrink_memcg_cb() > > > > swpentry = entry->swpentry > > > > // Don't isolate entry from lru list here, just use list_lru_putback() > > > > spin_unlock(lru list lock) > > > > > > > > folio = __read_swap_cache_async(swpentry) > > > > if (!folio) > > > > return > > > > > > > > if (!folio_was_allocated) > > > > folio_put(folio) > > > > return > > > > > > > > // folio is locked, swapcache is secured against swapoff > > > > tree = get tree from swpentry > > > > spin_lock(&tree->lock) > > > > > > That will not work well with zswap to xarray change. We want to remove > > > the tree lock and only use the xarray lock. > > > The lookup should just hold xarray RCU read lock and return the entry > > > with ref count increased. > > > > In this path, we also invalidate the zswap entry, which would require > > holding the xarray lock anyway. > > It will drop the RCU read lock after finding the entry and re-acquire > the xarray spin lock on invalidation. In between there is a brief > moment without locks. If my understanding is correct, at that point in the code writeback is guaranteed to succeed unless the entry had already been removed from the tree. So we can use xa_cmpxchg() as I described earlier to find and remove the entry in the tree only if it exists. If it does, we continue with the writeback, otherwise we abort. No need for separate load and invalidation. zswap_invalidate_entry() can return a boolean (whether the entry was found and removed from the tree or not). ^ permalink raw reply [flat|nested] 47+ messages in thread
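The xa_cmpxchg() approach described above, where the entry is removed from the tree only if it is still present and the caller learns whether it was, could be sketched like this. The xarray-based tree and the boolean-returning invalidate helper are assumptions matching the discussion, not mainline code:

/* remove @entry from the tree iff it is still the entry at @offset;
 * returns true if we removed it, false if it was already invalidated */
static bool zswap_xa_invalidate_entry(struct xarray *xa, pgoff_t offset,
				      struct zswap_entry *entry)
{
	if (xa_cmpxchg(xa, offset, entry, NULL, GFP_KERNEL) != entry)
		return false;		/* already gone, abort writeback */

	zswap_entry_put(entry);		/* drop the tree's reference (simplified) */
	return true;
}

Writeback would call this once, after __read_swap_cache_async() has pinned the swap entry, instead of doing a separate search followed by an invalidation.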
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-24 7:20 ` Chengming Zhou 2024-01-25 5:44 ` Chris Li @ 2024-01-25 7:53 ` Yosry Ahmed 2024-01-25 8:03 ` Yosry Ahmed ` (2 more replies) 1 sibling, 3 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 7:53 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel > Hello, > > I also thought about this problem for some time, maybe something like below > can be changed to fix it? It's likely I missed something, just some thoughts. > > IMHO, the problem is caused by the different way in which we use zswap entry > in the writeback, that should be much like zswap_load(). > > The zswap_load() comes in with the folio locked in swap cache, so it has > stable zswap tree to search and lock... But in writeback case, we don't, > shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, > then release lru lock to get tree lock, which maybe freed already. > > So we should change here, we read swpentry from entry with lru list lock held, > then release lru lock, to try to lock corresponding folio in swap cache, > if we success, the following things is much the same like zswap_load(). > We can get tree lock, to recheck the invalidate race, if no race happened, > we can make sure the entry is still right and get refcount of it, then > release the tree lock. Hmm I think you may be onto something here. Moving the swap cache allocation ahead before referencing the tree should give us the same guarantees as zswap_load() indeed. We can also consolidate the invalidate race checks (right now we have one in shrink_memcg_cb() and another one inside zswap_writeback_entry()). We will have to be careful about the error handling path to make sure we delete the folio from the swap cache only after we know the tree won't be referenced anymore. Anyway, I think this can work. On a separate note, I think there is a bug in zswap_writeback_entry() when we delete a folio from the swap cache. I think we are missing a folio_unlock() there. > > The main differences between this writeback with zswap_load() is the handling > of lru entry and the tree lifetime. The whole zswap_load() function has the > stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half > after __swap_writepage() since we unlock the folio after that. So we can't > reference the tree after that. > > This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early > in tree lock, since thereafter writeback can't fail. BTW, I think we should > also zswap_invalidate_entry() early in zswap_load() and only support the > zswap_exclusive_loads_enabled mode, but that's another topic. zswap_invalidate_entry() actually doesn't seem to be using the tree at all. > > The second difference is the handling of lru entry, which is easy that we > just zswap_lru_del() in tree lock. Why do we need zswap_lru_del() at all? We should have already isolated the entry at that point IIUC. ^ permalink raw reply [flat|nested] 47+ messages in thread
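On the separate swap-cache point raised above, the error-path cleanup being described would look roughly like the helper below. The helper name and its exact placement inside zswap_writeback_entry() are assumptions; the folio_unlock() is the call noted as missing:

/* undo a freshly-allocated, still-locked swap cache folio when
 * writeback bails out before __swap_writepage() */
static void zswap_writeback_abort(struct folio *folio)
{
	delete_from_swap_cache(folio);
	folio_unlock(folio);
	folio_put(folio);
}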
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 7:53 ` Yosry Ahmed @ 2024-01-25 8:03 ` Yosry Ahmed 2024-01-25 8:30 ` Chengming Zhou 2024-01-26 0:03 ` Chengming Zhou 2 siblings, 0 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 8:03 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On Wed, Jan 24, 2024 at 11:53 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > Hello, > > > > I also thought about this problem for some time, maybe something like below > > can be changed to fix it? It's likely I missed something, just some thoughts. > > > > IMHO, the problem is caused by the different way in which we use zswap entry > > in the writeback, that should be much like zswap_load(). > > > > The zswap_load() comes in with the folio locked in swap cache, so it has > > stable zswap tree to search and lock... But in writeback case, we don't, > > shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, > > then release lru lock to get tree lock, which maybe freed already. > > > > So we should change here, we read swpentry from entry with lru list lock held, > > then release lru lock, to try to lock corresponding folio in swap cache, > > if we success, the following things is much the same like zswap_load(). > > We can get tree lock, to recheck the invalidate race, if no race happened, > > we can make sure the entry is still right and get refcount of it, then > > release the tree lock. > > Hmm I think you may be onto something here. Moving the swap cache > allocation ahead before referencing the tree should give us the same > guarantees as zswap_load() indeed. We can also consolidate the > invalidate race checks (right now we have one in shrink_memcg_cb() and > another one inside zswap_writeback_entry()). > > We will have to be careful about the error handling path to make sure > we delete the folio from the swap cache only after we know the tree > won't be referenced anymore. Anyway, I think this can work. > > On a separate note, I think there is a bug in zswap_writeback_entry() > when we delete a folio from the swap cache. I think we are missing a > folio_unlock() there. > > > > > The main differences between this writeback with zswap_load() is the handling > > of lru entry and the tree lifetime. The whole zswap_load() function has the > > stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half > > after __swap_writepage() since we unlock the folio after that. So we can't > > reference the tree after that. > > > > This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early > > in tree lock, since thereafter writeback can't fail. BTW, I think we should > > also zswap_invalidate_entry() early in zswap_load() and only support the > > zswap_exclusive_loads_enabled mode, but that's another topic. > > zswap_invalidate_entry() actually doesn't seem to be using the tree at all. Never mind, I was looking at zswap_entry_put(). ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 7:53 ` Yosry Ahmed 2024-01-25 8:03 ` Yosry Ahmed @ 2024-01-25 8:30 ` Chengming Zhou 2024-01-25 8:42 ` Yosry Ahmed 2024-01-26 0:03 ` Chengming Zhou 2 siblings, 1 reply; 47+ messages in thread From: Chengming Zhou @ 2024-01-25 8:30 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/25 15:53, Yosry Ahmed wrote: >> Hello, >> >> I also thought about this problem for some time, maybe something like below >> can be changed to fix it? It's likely I missed something, just some thoughts. >> >> IMHO, the problem is caused by the different way in which we use zswap entry >> in the writeback, that should be much like zswap_load(). >> >> The zswap_load() comes in with the folio locked in swap cache, so it has >> stable zswap tree to search and lock... But in writeback case, we don't, >> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >> then release lru lock to get tree lock, which maybe freed already. >> >> So we should change here, we read swpentry from entry with lru list lock held, >> then release lru lock, to try to lock corresponding folio in swap cache, >> if we success, the following things is much the same like zswap_load(). >> We can get tree lock, to recheck the invalidate race, if no race happened, >> we can make sure the entry is still right and get refcount of it, then >> release the tree lock. > > Hmm I think you may be onto something here. Moving the swap cache > allocation ahead before referencing the tree should give us the same > guarantees as zswap_load() indeed. We can also consolidate the > invalidate race checks (right now we have one in shrink_memcg_cb() and > another one inside zswap_writeback_entry()). Right, if we successfully lock folio in the swap cache, we can get the tree lock and check the invalidate race, only once. > > We will have to be careful about the error handling path to make sure > we delete the folio from the swap cache only after we know the tree > won't be referenced anymore. Anyway, I think this can work. Yes, we can't reference tree if we early return or after unlocking folio, since the reference of zswap entry can't protect the tree. > > On a separate note, I think there is a bug in zswap_writeback_entry() > when we delete a folio from the swap cache. I think we are missing a > folio_unlock() there. Ah, yes, and folio_put(). > >> >> The main differences between this writeback with zswap_load() is the handling >> of lru entry and the tree lifetime. The whole zswap_load() function has the >> stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half >> after __swap_writepage() since we unlock the folio after that. So we can't >> reference the tree after that. >> >> This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early >> in tree lock, since thereafter writeback can't fail. BTW, I think we should >> also zswap_invalidate_entry() early in zswap_load() and only support the >> zswap_exclusive_loads_enabled mode, but that's another topic. > > zswap_invalidate_entry() actually doesn't seem to be using the tree at all. > >> >> The second difference is the handling of lru entry, which is easy that we >> just zswap_lru_del() in tree lock. > > Why do we need zswap_lru_del() at all? We should have already isolated > the entry at that point IIUC. 
I was thinking about how to handle the "zswap_lru_putback()" when we end up not writing back, in which case we can't actually use the entry since we haven't got a reference to it. So we could skip isolating the entry up front, and only zswap_lru_del() when we are actually going to write it back. Thanks! ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 8:30 ` Chengming Zhou @ 2024-01-25 8:42 ` Yosry Ahmed 2024-01-25 8:52 ` Chengming Zhou 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 8:42 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 12:30 AM Chengming Zhou <zhouchengming@bytedance.com> wrote: > > On 2024/1/25 15:53, Yosry Ahmed wrote: > >> Hello, > >> > >> I also thought about this problem for some time, maybe something like below > >> can be changed to fix it? It's likely I missed something, just some thoughts. > >> > >> IMHO, the problem is caused by the different way in which we use zswap entry > >> in the writeback, that should be much like zswap_load(). > >> > >> The zswap_load() comes in with the folio locked in swap cache, so it has > >> stable zswap tree to search and lock... But in writeback case, we don't, > >> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, > >> then release lru lock to get tree lock, which maybe freed already. > >> > >> So we should change here, we read swpentry from entry with lru list lock held, > >> then release lru lock, to try to lock corresponding folio in swap cache, > >> if we success, the following things is much the same like zswap_load(). > >> We can get tree lock, to recheck the invalidate race, if no race happened, > >> we can make sure the entry is still right and get refcount of it, then > >> release the tree lock. > > > > Hmm I think you may be onto something here. Moving the swap cache > > allocation ahead before referencing the tree should give us the same > > guarantees as zswap_load() indeed. We can also consolidate the > > invalidate race checks (right now we have one in shrink_memcg_cb() and > > another one inside zswap_writeback_entry()). > > Right, if we successfully lock folio in the swap cache, we can get the > tree lock and check the invalidate race, only once. > > > > > We will have to be careful about the error handling path to make sure > > we delete the folio from the swap cache only after we know the tree > > won't be referenced anymore. Anyway, I think this can work. > > Yes, we can't reference tree if we early return or after unlocking folio, > since the reference of zswap entry can't protect the tree. > > > > > On a separate note, I think there is a bug in zswap_writeback_entry() > > when we delete a folio from the swap cache. I think we are missing a > > folio_unlock() there. > > Ah, yes, and folio_put(). Yes. I am preparing a fix. > > > > >> > >> The main differences between this writeback with zswap_load() is the handling > >> of lru entry and the tree lifetime. The whole zswap_load() function has the > >> stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half > >> after __swap_writepage() since we unlock the folio after that. So we can't > >> reference the tree after that. > >> > >> This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early > >> in tree lock, since thereafter writeback can't fail. BTW, I think we should > >> also zswap_invalidate_entry() early in zswap_load() and only support the > >> zswap_exclusive_loads_enabled mode, but that's another topic. > > > > zswap_invalidate_entry() actually doesn't seem to be using the tree at all. > > > >> > >> The second difference is the handling of lru entry, which is easy that we > >> just zswap_lru_del() in tree lock. 
> > > > Why do we need zswap_lru_del() at all? We should have already isolated > > the entry at that point IIUC. > > I was thinking how to handle the "zswap_lru_putback()" if not writeback, > in which case we can't use the entry actually since we haven't got reference > of it. So we can don't isolate at the entry, and only zswap_lru_del() when > we are going to writeback actually. Why not just call zswap_lru_putback() before we unlock the folio? ^ permalink raw reply [flat|nested] 47+ messages in thread
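Regarding the missing folio_unlock() (and folio_put()) acknowledged a few messages up: a sketch of what the bail-out path in zswap_writeback_entry() needs once it has newly allocated and locked a folio in the swap cache; hypothetical, assuming the folio-based writeback code of the tree under discussion:

	/*
	 * Bail-out sketch: this writeback attempt allocated the folio in
	 * the swap cache itself, so on failure it must remove it again --
	 * and the folio is still locked and referenced, so the unlock and
	 * put must not be forgotten.
	 */
	delete_from_swap_cache(folio);
	folio_unlock(folio);
	folio_put(folio);
	return -ENOMEM;	/* or whatever the surrounding error path uses */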
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 8:42 ` Yosry Ahmed @ 2024-01-25 8:52 ` Chengming Zhou 2024-01-25 9:03 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Chengming Zhou @ 2024-01-25 8:52 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/25 16:42, Yosry Ahmed wrote: > On Thu, Jan 25, 2024 at 12:30 AM Chengming Zhou > <zhouchengming@bytedance.com> wrote: >> >> On 2024/1/25 15:53, Yosry Ahmed wrote: >>>> Hello, >>>> >>>> I also thought about this problem for some time, maybe something like below >>>> can be changed to fix it? It's likely I missed something, just some thoughts. >>>> >>>> IMHO, the problem is caused by the different way in which we use zswap entry >>>> in the writeback, that should be much like zswap_load(). >>>> >>>> The zswap_load() comes in with the folio locked in swap cache, so it has >>>> stable zswap tree to search and lock... But in writeback case, we don't, >>>> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >>>> then release lru lock to get tree lock, which maybe freed already. >>>> >>>> So we should change here, we read swpentry from entry with lru list lock held, >>>> then release lru lock, to try to lock corresponding folio in swap cache, >>>> if we success, the following things is much the same like zswap_load(). >>>> We can get tree lock, to recheck the invalidate race, if no race happened, >>>> we can make sure the entry is still right and get refcount of it, then >>>> release the tree lock. >>> >>> Hmm I think you may be onto something here. Moving the swap cache >>> allocation ahead before referencing the tree should give us the same >>> guarantees as zswap_load() indeed. We can also consolidate the >>> invalidate race checks (right now we have one in shrink_memcg_cb() and >>> another one inside zswap_writeback_entry()). >> >> Right, if we successfully lock folio in the swap cache, we can get the >> tree lock and check the invalidate race, only once. >> >>> >>> We will have to be careful about the error handling path to make sure >>> we delete the folio from the swap cache only after we know the tree >>> won't be referenced anymore. Anyway, I think this can work. >> >> Yes, we can't reference tree if we early return or after unlocking folio, >> since the reference of zswap entry can't protect the tree. >> >>> >>> On a separate note, I think there is a bug in zswap_writeback_entry() >>> when we delete a folio from the swap cache. I think we are missing a >>> folio_unlock() there. >> >> Ah, yes, and folio_put(). > > Yes. I am preparing a fix. > >> >>> >>>> >>>> The main differences between this writeback with zswap_load() is the handling >>>> of lru entry and the tree lifetime. The whole zswap_load() function has the >>>> stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half >>>> after __swap_writepage() since we unlock the folio after that. So we can't >>>> reference the tree after that. >>>> >>>> This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early >>>> in tree lock, since thereafter writeback can't fail. BTW, I think we should >>>> also zswap_invalidate_entry() early in zswap_load() and only support the >>>> zswap_exclusive_loads_enabled mode, but that's another topic. >>> >>> zswap_invalidate_entry() actually doesn't seem to be using the tree at all. 
>>> >>>> >>>> The second difference is the handling of lru entry, which is easy that we >>>> just zswap_lru_del() in tree lock. >>> >>> Why do we need zswap_lru_del() at all? We should have already isolated >>> the entry at that point IIUC. >> >> I was thinking how to handle the "zswap_lru_putback()" if not writeback, >> in which case we can't use the entry actually since we haven't got reference >> of it. So we can don't isolate at the entry, and only zswap_lru_del() when >> we are going to writeback actually. > > Why not just call zswap_lru_putback() before we unlock the folio? When early return because __read_swap_cache_async() return NULL or !folio_was_allocated, we don't have a locked folio yet. The entry maybe invalidated and freed concurrently. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 8:52 ` Chengming Zhou @ 2024-01-25 9:03 ` Yosry Ahmed 2024-01-25 9:22 ` Chengming Zhou 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 9:03 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel > >>>> The second difference is the handling of lru entry, which is easy that we > >>>> just zswap_lru_del() in tree lock. > >>> > >>> Why do we need zswap_lru_del() at all? We should have already isolated > >>> the entry at that point IIUC. > >> > >> I was thinking how to handle the "zswap_lru_putback()" if not writeback, > >> in which case we can't use the entry actually since we haven't got reference > >> of it. So we can don't isolate at the entry, and only zswap_lru_del() when > >> we are going to writeback actually. > > > > Why not just call zswap_lru_putback() before we unlock the folio? > > When early return because __read_swap_cache_async() return NULL or !folio_was_allocated, > we don't have a locked folio yet. The entry maybe invalidated and freed concurrently. Oh, that path, right. If we don't isolate the entry straightaway, concurrent reclaimers will see the same entry, call __read_swap_cache_async(), find the folio already in the swapcache and stop shrinking. This is because usually this means we are racing with swapin and hitting the warmer part of the zswap LRU. I am not sure if this would matter in practice, maybe Nhat knows better. Perhaps we can rotate the entry in the LRU before calling __read_swap_cache_async() to minimize the chances of such a race? Or we can serialize the calls to __read_swap_cache_async() but this may be an overkill. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 9:03 ` Yosry Ahmed @ 2024-01-25 9:22 ` Chengming Zhou 2024-01-25 9:26 ` Yosry Ahmed 0 siblings, 1 reply; 47+ messages in thread From: Chengming Zhou @ 2024-01-25 9:22 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/25 17:03, Yosry Ahmed wrote: >>>>>> The second difference is the handling of lru entry, which is easy that we >>>>>> just zswap_lru_del() in tree lock. >>>>> >>>>> Why do we need zswap_lru_del() at all? We should have already isolated >>>>> the entry at that point IIUC. >>>> >>>> I was thinking how to handle the "zswap_lru_putback()" if not writeback, >>>> in which case we can't use the entry actually since we haven't got reference >>>> of it. So we can don't isolate at the entry, and only zswap_lru_del() when >>>> we are going to writeback actually. >>> >>> Why not just call zswap_lru_putback() before we unlock the folio? >> >> When early return because __read_swap_cache_async() return NULL or !folio_was_allocated, >> we don't have a locked folio yet. The entry maybe invalidated and freed concurrently. > > Oh, that path, right. > > If we don't isolate the entry straightaway, concurrent reclaimers will > see the same entry, call __read_swap_cache_async(), find the folio > already in the swapcache and stop shrinking. This is because usually > this means we are racing with swapin and hitting the warmer part of > the zswap LRU. > > I am not sure if this would matter in practice, maybe Nhat knows > better. Perhaps we can rotate the entry in the LRU before calling > __read_swap_cache_async() to minimize the chances of such a race? Or > we can serialize the calls to __read_swap_cache_async() but this may > be an overkill. Also, not sure, rotate the entry maybe good IMHO since we will zswap_lru_del() once we checked the invalidate race. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 9:22 ` Chengming Zhou @ 2024-01-25 9:26 ` Yosry Ahmed 2024-01-25 9:38 ` Chengming Zhou 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-25 9:26 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 1:22 AM Chengming Zhou <zhouchengming@bytedance.com> wrote: > > On 2024/1/25 17:03, Yosry Ahmed wrote: > >>>>>> The second difference is the handling of lru entry, which is easy that we > >>>>>> just zswap_lru_del() in tree lock. > >>>>> > >>>>> Why do we need zswap_lru_del() at all? We should have already isolated > >>>>> the entry at that point IIUC. > >>>> > >>>> I was thinking how to handle the "zswap_lru_putback()" if not writeback, > >>>> in which case we can't use the entry actually since we haven't got reference > >>>> of it. So we can don't isolate at the entry, and only zswap_lru_del() when > >>>> we are going to writeback actually. > >>> > >>> Why not just call zswap_lru_putback() before we unlock the folio? > >> > >> When early return because __read_swap_cache_async() return NULL or !folio_was_allocated, > >> we don't have a locked folio yet. The entry maybe invalidated and freed concurrently. > > > > Oh, that path, right. > > > > If we don't isolate the entry straightaway, concurrent reclaimers will > > see the same entry, call __read_swap_cache_async(), find the folio > > already in the swapcache and stop shrinking. This is because usually > > this means we are racing with swapin and hitting the warmer part of > > the zswap LRU. > > > > I am not sure if this would matter in practice, maybe Nhat knows > > better. Perhaps we can rotate the entry in the LRU before calling > > __read_swap_cache_async() to minimize the chances of such a race? Or > > we can serialize the calls to __read_swap_cache_async() but this may > > be an overkill. > > Also, not sure, rotate the entry maybe good IMHO since we will zswap_lru_del() > once we checked the invalidate race. Not sure what you mean. If we rotate first, we won't have anything to do in the failure case (if the folio is not locked). We will have to do zswap_lru_del() if actually writeback, yes, but in this case no synchronization is needed because the folio is locked, right? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 9:26 ` Yosry Ahmed @ 2024-01-25 9:38 ` Chengming Zhou 0 siblings, 0 replies; 47+ messages in thread From: Chengming Zhou @ 2024-01-25 9:38 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/25 17:26, Yosry Ahmed wrote: > On Thu, Jan 25, 2024 at 1:22 AM Chengming Zhou > <zhouchengming@bytedance.com> wrote: >> >> On 2024/1/25 17:03, Yosry Ahmed wrote: >>>>>>>> The second difference is the handling of lru entry, which is easy that we >>>>>>>> just zswap_lru_del() in tree lock. >>>>>>> >>>>>>> Why do we need zswap_lru_del() at all? We should have already isolated >>>>>>> the entry at that point IIUC. >>>>>> >>>>>> I was thinking how to handle the "zswap_lru_putback()" if not writeback, >>>>>> in which case we can't use the entry actually since we haven't got reference >>>>>> of it. So we can don't isolate at the entry, and only zswap_lru_del() when >>>>>> we are going to writeback actually. >>>>> >>>>> Why not just call zswap_lru_putback() before we unlock the folio? >>>> >>>> When early return because __read_swap_cache_async() return NULL or !folio_was_allocated, >>>> we don't have a locked folio yet. The entry maybe invalidated and freed concurrently. >>> >>> Oh, that path, right. >>> >>> If we don't isolate the entry straightaway, concurrent reclaimers will >>> see the same entry, call __read_swap_cache_async(), find the folio >>> already in the swapcache and stop shrinking. This is because usually >>> this means we are racing with swapin and hitting the warmer part of >>> the zswap LRU. >>> >>> I am not sure if this would matter in practice, maybe Nhat knows >>> better. Perhaps we can rotate the entry in the LRU before calling >>> __read_swap_cache_async() to minimize the chances of such a race? Or >>> we can serialize the calls to __read_swap_cache_async() but this may >>> be an overkill. >> >> Also, not sure, rotate the entry maybe good IMHO since we will zswap_lru_del() >> once we checked the invalidate race. > > Not sure what you mean. If we rotate first, we won't have anything to > do in the failure case (if the folio is not locked). We will have to > do zswap_lru_del() if actually writeback, yes, but in this case no > synchronization is needed because the folio is locked, right? Right, sorry for my confusing expression. We rotate first in lru lock, and only zswap_lru_del() later if actually writeback. What I want to mean is that the possibility of seeing the entry on lru list by another reclaimer is very low, since we rotate and the timing between __read_swap_cache_async() and zswap_lru_del() should be short. ^ permalink raw reply [flat|nested] 47+ messages in thread
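A compact sketch of the ordering Chengming describes here, as it would look at the top of the list_lru walk callback (item, l and lock being the shrink_memcg_cb() callback arguments); illustrative only, assuming the zswap_entry is linked on the list_lru via its lru list_head:

	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
	swp_entry_t swpentry;

	/*
	 * Rotate to the LRU tail first, so that if this attempt bails out
	 * or races, concurrent reclaimers move on to other entries instead
	 * of repeatedly picking this one.
	 */
	list_move_tail(item, &l->list);

	/* Copy the swap entry while the LRU lock still pins @entry. */
	swpentry = entry->swpentry;
	spin_unlock(lock);

	/*
	 * From here on only @swpentry is used until the entry has been
	 * re-validated under tree->lock (see the earlier sketch);
	 * zswap_lru_del() happens only if writeback actually proceeds.
	 */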
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-25 7:53 ` Yosry Ahmed 2024-01-25 8:03 ` Yosry Ahmed 2024-01-25 8:30 ` Chengming Zhou @ 2024-01-26 0:03 ` Chengming Zhou 2024-01-26 0:05 ` Yosry Ahmed 2 siblings, 1 reply; 47+ messages in thread From: Chengming Zhou @ 2024-01-26 0:03 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/25 15:53, Yosry Ahmed wrote: >> Hello, >> >> I also thought about this problem for some time, maybe something like below >> can be changed to fix it? It's likely I missed something, just some thoughts. >> >> IMHO, the problem is caused by the different way in which we use zswap entry >> in the writeback, that should be much like zswap_load(). >> >> The zswap_load() comes in with the folio locked in swap cache, so it has >> stable zswap tree to search and lock... But in writeback case, we don't, >> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >> then release lru lock to get tree lock, which maybe freed already. >> >> So we should change here, we read swpentry from entry with lru list lock held, >> then release lru lock, to try to lock corresponding folio in swap cache, >> if we success, the following things is much the same like zswap_load(). >> We can get tree lock, to recheck the invalidate race, if no race happened, >> we can make sure the entry is still right and get refcount of it, then >> release the tree lock. > > Hmm I think you may be onto something here. Moving the swap cache > allocation ahead before referencing the tree should give us the same > guarantees as zswap_load() indeed. We can also consolidate the > invalidate race checks (right now we have one in shrink_memcg_cb() and > another one inside zswap_writeback_entry()). > > We will have to be careful about the error handling path to make sure > we delete the folio from the swap cache only after we know the tree > won't be referenced anymore. Anyway, I think this can work. > > On a separate note, I think there is a bug in zswap_writeback_entry() > when we delete a folio from the swap cache. I think we are missing a > folio_unlock() there. > Hi, want to know if you are preparing the fix patch, I would just wait to review if you are. Or I can work on it if you are busy with other thing. Thanks! ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-26 0:03 ` Chengming Zhou @ 2024-01-26 0:05 ` Yosry Ahmed 2024-01-26 0:10 ` Chengming Zhou 0 siblings, 1 reply; 47+ messages in thread From: Yosry Ahmed @ 2024-01-26 0:05 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On Thu, Jan 25, 2024 at 4:03 PM Chengming Zhou <zhouchengming@bytedance.com> wrote: > > On 2024/1/25 15:53, Yosry Ahmed wrote: > >> Hello, > >> > >> I also thought about this problem for some time, maybe something like below > >> can be changed to fix it? It's likely I missed something, just some thoughts. > >> > >> IMHO, the problem is caused by the different way in which we use zswap entry > >> in the writeback, that should be much like zswap_load(). > >> > >> The zswap_load() comes in with the folio locked in swap cache, so it has > >> stable zswap tree to search and lock... But in writeback case, we don't, > >> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, > >> then release lru lock to get tree lock, which maybe freed already. > >> > >> So we should change here, we read swpentry from entry with lru list lock held, > >> then release lru lock, to try to lock corresponding folio in swap cache, > >> if we success, the following things is much the same like zswap_load(). > >> We can get tree lock, to recheck the invalidate race, if no race happened, > >> we can make sure the entry is still right and get refcount of it, then > >> release the tree lock. > > > > Hmm I think you may be onto something here. Moving the swap cache > > allocation ahead before referencing the tree should give us the same > > guarantees as zswap_load() indeed. We can also consolidate the > > invalidate race checks (right now we have one in shrink_memcg_cb() and > > another one inside zswap_writeback_entry()). > > > > We will have to be careful about the error handling path to make sure > > we delete the folio from the swap cache only after we know the tree > > won't be referenced anymore. Anyway, I think this can work. > > > > On a separate note, I think there is a bug in zswap_writeback_entry() > > when we delete a folio from the swap cache. I think we are missing a > > folio_unlock() there. > > > > Hi, want to know if you are preparing the fix patch, I would just wait to > review if you are. Or I can work on it if you are busy with other thing. If you're talking about implementing your solution, I was assuming you were going to send a patch out (and hoping others would chime in in case I missed something). I can take a stab at implementing it if you prefer that, just let me know. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-26 0:05 ` Yosry Ahmed @ 2024-01-26 0:10 ` Chengming Zhou 0 siblings, 0 replies; 47+ messages in thread From: Chengming Zhou @ 2024-01-26 0:10 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm, linux-kernel On 2024/1/26 08:05, Yosry Ahmed wrote: > On Thu, Jan 25, 2024 at 4:03 PM Chengming Zhou > <zhouchengming@bytedance.com> wrote: >> >> On 2024/1/25 15:53, Yosry Ahmed wrote: >>>> Hello, >>>> >>>> I also thought about this problem for some time, maybe something like below >>>> can be changed to fix it? It's likely I missed something, just some thoughts. >>>> >>>> IMHO, the problem is caused by the different way in which we use zswap entry >>>> in the writeback, that should be much like zswap_load(). >>>> >>>> The zswap_load() comes in with the folio locked in swap cache, so it has >>>> stable zswap tree to search and lock... But in writeback case, we don't, >>>> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >>>> then release lru lock to get tree lock, which maybe freed already. >>>> >>>> So we should change here, we read swpentry from entry with lru list lock held, >>>> then release lru lock, to try to lock corresponding folio in swap cache, >>>> if we success, the following things is much the same like zswap_load(). >>>> We can get tree lock, to recheck the invalidate race, if no race happened, >>>> we can make sure the entry is still right and get refcount of it, then >>>> release the tree lock. >>> >>> Hmm I think you may be onto something here. Moving the swap cache >>> allocation ahead before referencing the tree should give us the same >>> guarantees as zswap_load() indeed. We can also consolidate the >>> invalidate race checks (right now we have one in shrink_memcg_cb() and >>> another one inside zswap_writeback_entry()). >>> >>> We will have to be careful about the error handling path to make sure >>> we delete the folio from the swap cache only after we know the tree >>> won't be referenced anymore. Anyway, I think this can work. >>> >>> On a separate note, I think there is a bug in zswap_writeback_entry() >>> when we delete a folio from the swap cache. I think we are missing a >>> folio_unlock() there. >>> >> >> Hi, want to know if you are preparing the fix patch, I would just wait to >> review if you are. Or I can work on it if you are busy with other thing. > > If you're talking about implementing your solution, I was assuming you > were going to send a patch out (and hoping others would chime in in > case I missed something). Ok, I will prepare a patch to send out for further discussion. > > I can take a stab at implementing it if you prefer that, just let me know. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 15:54 ` Yosry Ahmed 2024-01-23 20:12 ` Johannes Weiner @ 2024-01-23 20:30 ` Nhat Pham 2024-01-23 21:04 ` Yosry Ahmed 1 sibling, 1 reply; 47+ messages in thread From: Nhat Pham @ 2024-01-23 20:30 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, Andrew Morton, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 7:55 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > > that all zswap entries should already be removed from the tree. Simplify > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > > assertion in its place. > > > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > That's a great simplification. > > > > > > > > Removing the tree->lock made me double take, but at this point the > > > > swapfile and its cache should be fully dead and I don't see how any of > > > > the zswap operations that take tree->lock could race at this point. > > > > > > It took me a while staring at the code to realize this loop is pointless. > > > > > > However, while I have your attention on the swapoff path, there's a > > > slightly irrelevant problem that I think might be there, but I am not > > > sure. > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > Please tell me that I am being paranoid and that there is some > > > protection against zswap writeback racing with swapoff. It feels like > > > we are very careful with zswap entries refcounting, but not with the > > > zswap tree itself. > > > > Hm, I don't see how. > > > > Writeback operates on entries from the LRU. By the time > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > will have emptied out the LRU and tree. > > > > Writeback could have gotten a refcount to the entry and dropped the > > tree->lock. But then it does __read_swap_cache_async(), and while > > holding the page lock checks the tree under lock once more; if that > > finds the entry valid, it means try_to_unuse() hasn't started on this > > page yet, and would be held up by the page lock/writeback state. > > Consider the following race: > > CPU 1 CPU 2 > # In shrink_memcg_cb() # In swap_off > list_lru_isolate() > zswap_invalidate() > .. > zswap_swapoff() -> kfree(tree) > spin_lock(&tree->lock); > > Isn't this a UAF or am I missing something here? I need to read this code closer. But this smells like a race to me as well. Long term speaking, I think decoupling swap and zswap will fix this, no? We won't need to kfree(tree) inside swapoff. 
IOW, if we have a single zswap tree that is not tied down to any swapfile, then we can't have this race. There might be other races introduced by the decoupling that I might have not foreseen tho :) Short term, no clue hmm. Let me think a bit more about this. > > > > > > > > Chengming, Chris, I think this should make the tree split and the xarray > > > > > conversion patches simpler (especially the former). If others agree, > > > > > both changes can be rebased on top of this. > > > > > > > > The resulting code is definitely simpler, but this patch is not a > > > > completely trivial cleanup, either. If you put it before Chengming's > > > > patch and it breaks something, it would be difficult to pull out > > > > without affecting the tree split. > > > > > > Are you suggesting I rebase this on top of Chengming's patches? I can > > > definitely do this, but the patch will be slightly less > > > straightforward, and if the tree split patches break something it > > > would be difficult to pull out as well. If you feel like this patch is > > > more likely to break things, I can rebase. > > > > Yeah I think it's more subtle. I'd only ask somebody to rebase an > > already tested patch on a newer one if the latter were an obvious, > > low-risk, prep-style patch. Your patch is good, but it doesn't quite > > fit into this particular category, so I'd say no jumping the queue ;) > > My intention was to reduce the diff in both this patch and the tree > split patches, but I do understand this is more subtle. I can rebase > on top of Chengming's patches instead. ^ permalink raw reply [flat|nested] 47+ messages in thread
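For readability, the race quoted above restated as a timeline:

/*
 *   CPU 1 (shrink_memcg_cb)           CPU 2 (swapoff)
 *   -----------------------           --------------------------------
 *   list_lru_isolate(l, item)
 *   ... drops the LRU lock; holds     try_to_unuse()
 *       no reference on the tree        zswap_invalidate() on all entries
 *                                     zswap_swapoff()
 *                                       kfree(tree)
 *   spin_lock(&tree->lock)    <-- dereferences the freed tree (UAF)
 */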
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-23 20:30 ` Nhat Pham @ 2024-01-23 21:04 ` Yosry Ahmed 0 siblings, 0 replies; 47+ messages in thread From: Yosry Ahmed @ 2024-01-23 21:04 UTC (permalink / raw) To: Nhat Pham Cc: Johannes Weiner, Andrew Morton, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Tue, Jan 23, 2024 at 12:30 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Jan 23, 2024 at 7:55 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > > > that all zswap entries should already be removed from the tree. Simplify > > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > > > assertion in its place. > > > > > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > > > That's a great simplification. > > > > > > > > > > Removing the tree->lock made me double take, but at this point the > > > > > swapfile and its cache should be fully dead and I don't see how any of > > > > > the zswap operations that take tree->lock could race at this point. > > > > > > > > It took me a while staring at the code to realize this loop is pointless. > > > > > > > > However, while I have your attention on the swapoff path, there's a > > > > slightly irrelevant problem that I think might be there, but I am not > > > > sure. > > > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > > > Please tell me that I am being paranoid and that there is some > > > > protection against zswap writeback racing with swapoff. It feels like > > > > we are very careful with zswap entries refcounting, but not with the > > > > zswap tree itself. > > > > > > Hm, I don't see how. > > > > > > Writeback operates on entries from the LRU. By the time > > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > > will have emptied out the LRU and tree. > > > > > > Writeback could have gotten a refcount to the entry and dropped the > > > tree->lock. But then it does __read_swap_cache_async(), and while > > > holding the page lock checks the tree under lock once more; if that > > > finds the entry valid, it means try_to_unuse() hasn't started on this > > > page yet, and would be held up by the page lock/writeback state. > > > > Consider the following race: > > > > CPU 1 CPU 2 > > # In shrink_memcg_cb() # In swap_off > > list_lru_isolate() > > zswap_invalidate() > > .. > > zswap_swapoff() -> kfree(tree) > > spin_lock(&tree->lock); > > > > Isn't this a UAF or am I missing something here? > > I need to read this code closer. But this smells like a race to me as well. 
> > Long term speaking, I think decoupling swap and zswap will fix this, > no? We won't need to kfree(tree) inside swapoff. IOW, if we have a > single zswap tree that is not tied down to any swapfile, then we can't > have this race. There might be other races introduced by the > decoupling that I might have not foreseen tho :) Ideally yes, it should make the zswap <-> swapfile interaction more straightforward. Ideally in that world, writeback is handled in the swap core and is just moving an entry from one backend to another, so synchronization with things like swapoff should be easier. Also, I think in that world we don't need a zswap tree to begin with, there's a global tree(s) that maps a swap index directly to a swapfile swap entry or a zswap entry. > > Short term, no clue hmm. Let me think a bit more about this. See my response to Johannes, I think we may be able to fix this with refcounting + RCU. ^ permalink raw reply [flat|nested] 47+ messages in thread
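A rough sketch of the "refcounting + RCU" direction mentioned above; none of these fields or helpers exist in mm/zswap.c, and the zswap_trees[] array would additionally need __rcu annotation -- this only illustrates the shape such a fix could take:

/* Hypothetical illustration only. */
struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
	refcount_t ref;		/* hypothetical: initial ref held by swapon */
	struct rcu_head rcu;	/* hypothetical */
};

static struct zswap_tree *zswap_tree_get(int type)
{
	struct zswap_tree *tree;

	rcu_read_lock();
	tree = rcu_dereference(zswap_trees[type]);
	if (tree && !refcount_inc_not_zero(&tree->ref))
		tree = NULL;
	rcu_read_unlock();
	return tree;
}

static void zswap_tree_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct zswap_tree, rcu));
}

static void zswap_tree_put(struct zswap_tree *tree)
{
	if (refcount_dec_and_test(&tree->ref))
		call_rcu(&tree->rcu, zswap_tree_free_rcu);
}

/*
 * zswap_swapoff() would then publish NULL and drop the initial reference
 * instead of kfree()-ing directly:
 *
 *	rcu_assign_pointer(zswap_trees[type], NULL);
 *	zswap_tree_put(tree);
 *
 * and writeback would bracket its tree accesses with
 * zswap_tree_get()/zswap_tree_put().
 */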
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-20 2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed 2024-01-22 13:13 ` Chengming Zhou 2024-01-22 20:19 ` Johannes Weiner @ 2024-01-22 21:21 ` Nhat Pham 2024-01-22 22:31 ` Chris Li 3 siblings, 0 replies; 47+ messages in thread From: Nhat Pham @ 2024-01-22 21:21 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Chris Li, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Fri, Jan 19, 2024 at 6:40 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. > --- > mm/zswap.c | 9 ++------- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f8bc9e0892687..9675c3c27f9d1 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) > void zswap_swapoff(int type) > { > struct zswap_tree *tree = zswap_trees[type]; > - struct zswap_entry *entry, *n; > > if (!tree) > return; > > - /* walk the tree and free everything */ > - spin_lock(&tree->lock); > - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) > - zswap_free_entry(entry); > - tree->rbroot = RB_ROOT; > - spin_unlock(&tree->lock); > + /* try_to_unuse() invalidated all entries already */ > + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); > kfree(tree); > zswap_trees[type] = NULL; > } > -- > 2.43.0.429.g432eaa2c6b-goog > Oh man this is sweet! FWIW: Acked-by: Nhat Pham <nphamcs@gmail.com> ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() 2024-01-20 2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed ` (2 preceding siblings ...) 2024-01-22 21:21 ` Nhat Pham @ 2024-01-22 22:31 ` Chris Li 3 siblings, 0 replies; 47+ messages in thread From: Chris Li @ 2024-01-22 22:31 UTC (permalink / raw) To: Yosry Ahmed Cc: Andrew Morton, Johannes Weiner, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm, linux-kernel On Fri, Jan 19, 2024 at 6:40 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. I was wondering why those need to be there if all the zswap entries should have been swapped in already. In my testing I never see this delete of an entry so think this is kind of just in case. Nice clean up and will help simplify my zswap to xarray patch. Thanks for doing this. Acked-by: Chris Li <chrisl@kernel.org> (Google) Chris > --- > mm/zswap.c | 9 ++------- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f8bc9e0892687..9675c3c27f9d1 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) > void zswap_swapoff(int type) > { > struct zswap_tree *tree = zswap_trees[type]; > - struct zswap_entry *entry, *n; > > if (!tree) > return; > > - /* walk the tree and free everything */ > - spin_lock(&tree->lock); > - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) > - zswap_free_entry(entry); > - tree->rbroot = RB_ROOT; > - spin_unlock(&tree->lock); > + /* try_to_unuse() invalidated all entries already */ > + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); > kfree(tree); > zswap_trees[type] = NULL; > } > -- > 2.43.0.429.g432eaa2c6b-goog > ^ permalink raw reply [flat|nested] 47+ messages in thread