From: Chengming Zhou <zhouchengming@bytedance.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Nhat Pham <nphamcs@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chriscli@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
Date: Tue, 30 Jan 2024 10:30:20 +0800 [thread overview]
Message-ID: <527bd543-97a5-4262-be73-6a5d21c2f896@bytedance.com> (raw)
In-Reply-To: <ZbhBNkayw1hNlkpL@google.com>
On 2024/1/30 08:22, Yosry Ahmed wrote:
> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>>
>> CPU1 CPU2
>> shrink_memcg_cb swap_off
>> list_lru_isolate zswap_invalidate
>> zswap_swapoff
>> kfree(tree)
>> // UAF
>> spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and if returned with folio locked we can reference the tree safely.
>> Then we can check invalidate race with tree lock, the following things
>> is much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>> in the beginning. And it will be unlinked and freed when invalidated
>> if writeback success.
>
> You are also removing one redundant lookup from the zswap writeback path
> to check for the invalidation race, and simplifying the code. Give
> yourself full credit :)
Ah right, forgot to mention it, I will add this part in the commit message.
Thanks for your reminder!
>
>>
>> Another change is we don't update the memcg nr_zswap_protected in
>> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
>> with swapin or concurrent shrinker action, since swapin already
>> have memcg nr_zswap_protected updated, don't need double counts here.
>> For concurrent shrinker, the folio will be writeback and freed anyway.
>> -ENOMEM case is extremely rare and doesn't happen spuriously either,
>> so don't bother distinguishing this case.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Acked-by: Nhat Pham <nphamcs@gmail.com>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>> mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>> 1 file changed, 49 insertions(+), 65 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..f5cb5a46e4d7 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>> zpool_get_type((p)->zpools[0]))
>>
>> static int zswap_writeback_entry(struct zswap_entry *entry,
>> - struct zswap_tree *tree);
>> + swp_entry_t swpentry);
>> static int zswap_pool_get(struct zswap_pool *pool);
>> static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>> rcu_read_unlock();
>> }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> - struct zswap_entry *entry)
>> -{
>> - int nid = entry_to_nid(entry);
>> - spinlock_t *lock = &list_lru->node[nid].lock;
>> - struct mem_cgroup *memcg;
>> - struct lruvec *lruvec;
>> -
>> - rcu_read_lock();
>> - memcg = mem_cgroup_from_entry(entry);
>> - spin_lock(lock);
>> - /* we cannot use list_lru_add here, because it increments node's lru count */
>> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> - spin_unlock(lock);
>> -
>> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> - /* increment the protection area to account for the LRU rotation. */
>> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> - rcu_read_unlock();
>> -}
>> -
>> /*********************************
>> * rbtree functions
>> **********************************/
>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> bool *encountered_page_in_swapcache = (bool *)arg;
>> - struct zswap_tree *tree;
>> - pgoff_t swpoffset;
>> + swp_entry_t swpentry;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> + /*
>> + * Rotate the entry to the tail before unlocking the LRU,
>> + * so that in case of an invalidation race concurrent
>> + * reclaimers don't waste their time on it.
>> + *
>> + * If writeback succeeds, or failure is due to the entry
>> + * being invalidated by the swap subsystem, the invalidation
>> + * will unlink and free it.
>> + *
>> + * Temporary failures, where the same entry should be tried
>> + * again immediately, almost never happen for this shrinker.
>> + * We don't do any trylocking; -ENOMEM comes closest,
>> + * but that's extremely rare and doesn't happen spuriously
>> + * either. Don't bother distinguishing this case.
>> + *
>> + * But since they do exist in theory, the entry cannot just
>> + * be unlinked, or we could leak it. Hence, rotate.
>
> The entry cannot be unlinked because we cannot get a ref on it without
> holding the tree lock, and we cannot deref the tree before we acquire a
> swap cache ref in zswap_writeback_entry() -- or if
> zswap_writeback_entry() fails. This should be called out explicitly
> somewhere. Perhaps we can describe this whole deref dance with added
> docs to shrink_memcg_cb().
Maybe we should add some comments before or after zswap_writeback_entry()?
Or do you have some suggestions? I'm not good at this. :)
>
> We could also use a comment in zswap_writeback_entry() (or above it) to
> state that we only deref the tree *after* we get the swapcache ref.
I just notice there are some comments in zswap_writeback_entry(), should
we add more comments here?
/*
* folio is locked, and the swapcache is now secured against
* concurrent swapping to and from the slot. Verify that the
* swap entry hasn't been invalidated and recycled behind our
* backs (our zswap_entry reference doesn't prevent that), to
* avoid overwriting a new swap folio with old compressed data.
*/
next prev parent reply other threads:[~2024-01-30 2:30 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-28 13:28 [PATCH v2 0/3] " Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 1/3] mm/zswap: don't return LRU_SKIP if we have dropped lru lock Chengming Zhou
2024-01-30 0:09 ` Yosry Ahmed
2024-01-30 0:12 ` Nhat Pham
2024-01-30 0:58 ` Yosry Ahmed
2024-01-28 13:28 ` [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff Chengming Zhou
2024-01-30 0:22 ` Yosry Ahmed
2024-01-30 2:30 ` Chengming Zhou [this message]
2024-01-30 3:17 ` Johannes Weiner
2024-01-30 3:31 ` Chengming Zhou
2024-01-28 13:28 ` [PATCH v2 3/3] mm/list_lru: remove list_lru_putback() Chengming Zhou
2024-01-28 15:52 ` Johannes Weiner
2024-01-28 19:45 ` Nhat Pham
2024-01-30 0:34 ` Yosry Ahmed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=527bd543-97a5-4262-be73-6a5d21c2f896@bytedance.com \
--to=zhouchengming@bytedance.com \
--cc=akpm@linux-foundation.org \
--cc=chriscli@google.com \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox