Date: Sat, 27 Jan 2024 23:12:52 +0800
From: Chengming Zhou <chengming.zhou@linux.dev>
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
To: Nhat Pham
Cc: hannes@cmpxchg.org, yosryahmed@google.com, akpm@linux-foundation.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, Chengming Zhou
References: <20240126083015.3557006-1-chengming.zhou@linux.dev>
            <20240126083015.3557006-2-chengming.zhou@linux.dev>

On 2024/1/27 03:50, Nhat Pham wrote:
> On Fri, Jan 26, 2024 at 12:32 AM wrote:
>>
>> From: Chengming Zhou
>>
>> LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
>>
>>    CPU1                        CPU2
>>    shrink_memcg_cb             swap_off
>>      list_lru_isolate          zswap_invalidate
>>                                zswap_swapoff
>>                                  kfree(tree)
>>      // UAF
>>      spin_lock(&tree->lock)
>>
>> The problem is that the entry in the lru list can't protect the tree
>> from being swapped off and freed, and the entry can also be invalidated
>> and freed concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead, before
>> referencing the tree, then checking for the invalidate race under the
>> tree lock; only after that can we safely deref the entry. Note we can't
>> deref the entry or the tree anymore after we unlock the folio, since we
>> depend on the locked folio to hold off swapoff.
>>
>> So this patch moves all tree and entry usage into
>> zswap_writeback_entry(): we only use the copied swpentry on the stack
>> to allocate the swap cache and return with the folio locked, after
>> which we can reference the tree. Then we check for the invalidate race
>> under the tree lock; the following steps are much the same as
>> zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore; instead we rotate the entry
>
> I added list_lru_putback to the list_lru API specifically for this use
> case (zswap_lru_putback()). Now that we no longer need it, maybe we
> can also remove this as well (assuming no-one else is using this?).
>
> This can be done in a separate patch though.

Right, I can append a patch to remove it since there are no other users.
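For illustration, the reworked reclaim callback boils down to something
like the following. This is a simplified sketch of the pattern only:
the writeback-result handling and the swapcache-hit bookkeeping of the
real shrink_memcg_cb() are omitted here.

	static enum lru_status shrink_memcg_cb(struct list_head *item,
					       struct list_lru_one *l,
					       spinlock_t *lock, void *arg)
	{
		struct zswap_entry *entry =
			container_of(item, struct zswap_entry, lru);
		swp_entry_t swpentry;
		int writeback_result;

		/* Rotate to the tail so concurrent reclaimers rarely see it. */
		list_move_tail(item, &l->list);

		/* Copy to the stack; entry may be freed once the lock drops. */
		swpentry = entry->swpentry;

		spin_unlock(lock);
		writeback_result = zswap_writeback_entry(entry, swpentry);
		spin_lock(lock);

		/* On success, the entry was also deleted from the lru list. */
		return writeback_result ? LRU_RETRY : LRU_REMOVED_RETRY;
	}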
>
>> in the LRU list, so concurrent reclaimers have little chance to see
>> it. It will be deleted from the LRU list if writeback succeeds.
>>
>> Another confusing part to me is the update of memcg nr_zswap_protected
>> in zswap_lru_putback(). I'm not sure why it's needed here, since if we
>> raced with swapin, memcg nr_zswap_protected has already been updated
>> in zswap_folio_swapin(). So that part is not included for now.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Signed-off-by: Chengming Zhou
>
> LGTM! This is quite elegant.
> Acked-by: Nhat Pham
>
>> ---
>>  mm/zswap.c | 93 ++++++++++++++++++------------------------------------
>>  1 file changed, 31 insertions(+), 62 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..fa2bdb7ec1d8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>>                  zpool_get_type((p)->zpools[0]))
>>
>>  static int zswap_writeback_entry(struct zswap_entry *entry,
>> -                                 struct zswap_tree *tree);
>> +                                 swp_entry_t swpentry);
>>  static int zswap_pool_get(struct zswap_pool *pool);
>>  static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>>         rcu_read_unlock();
>>  }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> -                              struct zswap_entry *entry)
>> -{
>> -       int nid = entry_to_nid(entry);
>> -       spinlock_t *lock = &list_lru->node[nid].lock;
>> -       struct mem_cgroup *memcg;
>> -       struct lruvec *lruvec;
>> -
>> -       rcu_read_lock();
>> -       memcg = mem_cgroup_from_entry(entry);
>> -       spin_lock(lock);
>> -       /* we cannot use list_lru_add here, because it increments node's lru count */
>> -       list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> -       spin_unlock(lock);
>> -
>> -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> -       /* increment the protection area to account for the LRU rotation. */
>> -       atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> -       rcu_read_unlock();
>> -}
>> -
>>  /*********************************
>>  * rbtree functions
>>  **********************************/
>> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>  {
>>         struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>         bool *encountered_page_in_swapcache = (bool *)arg;
>> -       struct zswap_tree *tree;
>> -       pgoff_t swpoffset;
>> +       swp_entry_t swpentry;
>>         enum lru_status ret = LRU_REMOVED_RETRY;
>>         int writeback_result;
>>
>> +       /*
>> +        * First rotate to the tail of lru list before unlocking lru lock,
>> +        * so the concurrent reclaimers have little chance to see it.
>> +        * It will be deleted from the lru list if writeback succeeds.
>> +        */
>> +       list_move_tail(item, &l->list);
>> +
>>         /*
>>          * Once the lru lock is dropped, the entry might get freed. The
>> -        * swpoffset is copied to the stack, and entry isn't deref'd again
>> +        * swpentry is copied to the stack, and entry isn't deref'd again
>>          * until the entry is verified to still be alive in the tree.
>>          */
>> -       swpoffset = swp_offset(entry->swpentry);
>> -       tree = swap_zswap_tree(entry->swpentry);
>> -       list_lru_isolate(l, item);
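To make the new flow concrete, the reworked zswap_writeback_entry()
then roughly looks like the sketch below: it only touches the tree
after the swap cache folio is allocated and locked. This is abbreviated
for illustration; the decompression and writeback tail of the function
is unchanged and elided here.

	static int zswap_writeback_entry(struct zswap_entry *entry,
					 swp_entry_t swpentry)
	{
		struct zswap_tree *tree;
		struct folio *folio;
		struct mempolicy *mpol;
		bool folio_was_allocated;

		/* Allocate the swap cache folio using only the stack copy. */
		mpol = get_task_policy(current);
		folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
						NO_INTERLEAVE_INDEX,
						&folio_was_allocated, true);
		if (!folio)
			return -ENOMEM;

		/* Found an existing folio: raced with load/swapin, skip it. */
		if (!folio_was_allocated) {
			folio_put(folio);
			return -EEXIST;
		}

		/*
		 * The locked folio holds off swapoff, so the tree can be
		 * dereferenced now; check the invalidate race under tree lock.
		 */
		tree = swap_zswap_tree(swpentry);
		spin_lock(&tree->lock);
		if (entry != zswap_rb_search(&tree->rbroot, swp_offset(swpentry))) {
			spin_unlock(&tree->lock);
			delete_from_swap_cache(folio);
			folio_unlock(folio);
			folio_put(folio);
			return -ENOMEM;
		}
		/* Safe to deref entry only after it is verified above. */
		zswap_entry_get(entry);
		spin_unlock(&tree->lock);

		/* ... decompress into the folio and submit swap writeback ... */
		return 0;
	}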
>
> nit: IIUC, now that we're no longer removing the entry from the
> list_lru, we protect against concurrent shrinking action via this
> check inside zswap_writeback_entry() too, right:
>
>         if (!folio_was_allocated) {
>                 folio_put(folio);
>                 return -EEXIST;
>         }
>
> Maybe update the comment above it too?

 * Found an existing folio, we raced with load/swapin. We generally
 * writeback cold folios from zswap, and swapin means the folio just
 * became hot. Skip this folio and let the caller find another one.

So a found existing folio no longer only means load/swapin; it can
also mean a concurrent shrinking action. Yes, this comment needs to be
changed a little.

Thanks.
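P.S. One possible rewording of that comment, offered only as a
suggestion, with the exact wording to be settled in the next revision:

	/*
	 * Found an existing folio: we raced with swapin or with a
	 * concurrent shrinker action, since the entry is still alive in
	 * the tree and the folio is already in the swap cache. Skip
	 * this folio and let the caller find another one.
	 */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}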