linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nhat Pham <nphamcs@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: chengming.zhou@linux.dev, yosryahmed@google.com,
	akpm@linux-foundation.org,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	 Chengming Zhou <zhouchengming@bytedance.com>
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
Date: Fri, 26 Jan 2024 11:31:40 -0800	[thread overview]
Message-ID: <CAKEwX=OOf2M0Zm50Vkv6nBUzHR1p7dSGoQ99EofBPxZpxQHKHA@mail.gmail.com> (raw)
In-Reply-To: <20240126153126.GG1567330@cmpxchg.org>

On Fri, Jan 26, 2024 at 7:31 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jan 26, 2024 at 08:30:15AM +0000, chengming.zhou@linux.dev wrote:
> > From: Chengming Zhou <zhouchengming@bytedance.com>
> >
> > LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
> >
> > CPU1                  CPU2
> > shrink_memcg_cb               swap_off
> >   list_lru_isolate      zswap_invalidate
> >                         zswap_swapoff
> >                           kfree(tree)
> >   // UAF
> >   spin_lock(&tree->lock)
> >
> > The problem is that the entry in lru list can't protect the tree from
> > being swapoff and freed, and the entry also can be invalidated and freed
> > concurrently after we unlock the lru lock.
> >
> > We can fix it by moving the swap cache allocation ahead before
> > referencing the tree, then check invalidate race with tree lock,
> > only after that we can safely deref the entry. Note we couldn't
> > deref entry or tree anymore after we unlock the folio, since we
> > depend on this to hold on swapoff.
>
> This is a great simplification on top of being a bug fix.
>
> > So this patch moves all tree and entry usage to zswap_writeback_entry(),
> > we only use the copied swpentry on the stack to allocate swap cache
> > and return with folio locked, after which we can reference the tree.
> > Then check invalidate race with tree lock, the following things is
> > much the same like zswap_load().
> >
> > Since we can't deref the entry after zswap_writeback_entry(), we
> > can't use zswap_lru_putback() anymore, instead we rotate the entry
> > in the LRU list so concurrent reclaimers have little chance to see it.
> > Or it will be deleted from LRU list if writeback success.
> >
> > Another confusing part to me is the update of memcg nr_zswap_protected
> > in zswap_lru_putback(). I'm not sure why it's needed here since
> > if we raced with swapin, memcg nr_zswap_protected has already been
> > updated in zswap_folio_swapin(). So not include this part for now.
>
> Good observation.
>
> Technically, it could also fail on -ENOMEM, but in practice these size
> allocations don't fail, especially since the shrinker runs in
> PF_MEMALLOC context. The shrink_worker might be affected, since it
> doesn't But the common case is -EEXIST, which indeed double counts.

Yup. At the time, I was thinking more along the lines of what
mechanisms should trigger protection size increase. "swapin" and "LRU
list rotations" were two different mechanisms in my head. I was aware
that there could be double counting, but deemed it OK, as the cost of
over-shrinking (increase in swapin) was fairly serious, and if we have
a fairly aggressive decaying strategy if we protect too much.

But yes, I doubt it mattered that much in hindsight :) And the case
where it is double counted far outnumber the case where it does not,
so I'm fine with removing it here.

>
> To make it "correct", you'd have to grab an objcg reference with the
> LRU lock, and also re-order the objcg put on entry freeing after the
> LRU del. This is probably not worth doing. But it could use a comment.
>
> I was going to ask if you could reorder objcg uncharging after LRU
> deletion to make it more robust for future changes in that direction.
> However, staring at this, I notice this is a second UAF bug:
>
>         if (entry->objcg) {
>                 obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
>                 obj_cgroup_put(entry->objcg);
>         }
>         if (!entry->length)
>                 atomic_dec(&zswap_same_filled_pages);
>         else {
>                 zswap_lru_del(&entry->pool->list_lru, entry);
>
> zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
> the put may have killed it. I'll send a separate patch on top.
>
> > @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >  {
> >       struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >       bool *encountered_page_in_swapcache = (bool *)arg;
> > -     struct zswap_tree *tree;
> > -     pgoff_t swpoffset;
> > +     swp_entry_t swpentry;
> >       enum lru_status ret = LRU_REMOVED_RETRY;
> >       int writeback_result;
> >
> > +     /*
> > +      * First rotate to the tail of lru list before unlocking lru lock,
> > +      * so the concurrent reclaimers have little chance to see it.
> > +      * It will be deleted from the lru list if writeback success.
> > +      */
> > +     list_move_tail(item, &l->list);
>
> We don't hold a reference to the object, so there could also be an
> invalidation waiting on the LRU lock, which will free the entry even
> when writeback fails.
>
> It would also be good to expand on the motivation, because it's not
> clear WHY you'd want to hide it from other reclaimers.
>
> Lastly, maybe mention the story around temporary failures? Most
> shrinkers have a lock inversion pattern (object lock -> LRU lock for
> linking versus LRU lock -> object trylock during reclaim) that can
> fail and require the same object be tried again before advancing.
>
> How about this?
>
>         /*
>          * Rotate the entry to the tail before unlocking the LRU,
>          * so that in case of an invalidation race concurrent
>          * reclaimers don't waste their time on it.
>          *
>          * If writeback succeeds, or failure is due to the entry
>          * being invalidated by the swap subsystem, the invalidation
>          * will unlink and free it.
>          *
>          * Temporary failures, where the same entry should be tried
>          * again immediately, almost never happen for this shrinker.
>          * We don't do any trylocking; -ENOMEM comes closest,
>          * but that's extremely rare and doesn't happen spuriously
>          * either. Don't bother distinguishing this case.
>          *
>          * But since they do exist in theory, the entry cannot just
>          * be unlinked, or we could leak it. Hence, rotate.
>          */
>
> Otherwise, looks great to me.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>


  reply	other threads:[~2024-01-26 19:31 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-26  8:30 [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock chengming.zhou
2024-01-26  8:30 ` [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff chengming.zhou
2024-01-26 15:31   ` Johannes Weiner
2024-01-26 19:31     ` Nhat Pham [this message]
2024-01-27 14:53     ` Chengming Zhou
2024-01-26 19:50   ` Nhat Pham
2024-01-27 15:12     ` Chengming Zhou
2024-01-26 14:28 ` [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Johannes Weiner
2024-01-27 14:33   ` Chengming Zhou
2024-01-26 18:01 ` Nhat Pham
2024-01-27 14:37   ` Chengming Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKEwX=OOf2M0Zm50Vkv6nBUzHR1p7dSGoQ99EofBPxZpxQHKHA@mail.gmail.com' \
    --to=nphamcs@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yosryahmed@google.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox