Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
From: Chengming Zhou <zhouchengming@bytedance.com>
To: Yosry Ahmed, Johannes Weiner
Cc: Andrew Morton, Nhat Pham, Chris Li, Huang Ying, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Wed, 24 Jan 2024 15:20:13 +0800
References: <20240120024007.2850671-1-yosryahmed@google.com>
 <20240120024007.2850671-3-yosryahmed@google.com>
 <20240122201906.GA1567330@cmpxchg.org>
 <20240123153851.GA1745986@cmpxchg.org>
 <20240123201234.GC1745986@cmpxchg.org>

On 2024/1/24 05:02, Yosry Ahmed wrote:
> On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner wrote:
>>
>> On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote:
>>> On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner wrote:
>>>>
>>>> On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
>>>>> On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner wrote:
>>>>>>
>>>>>> On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
>>>>>>> During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
>>>>>>> called for all swap entries before zswap_swapoff() is called. This means
>>>>>>> that all zswap entries should already be removed from the tree. Simplify
>>>>>>> zswap_swapoff() by removing the tree cleanup loop, and leaving an
>>>>>>> assertion in its place.
>>>>>>>
>>>>>>> Signed-off-by: Yosry Ahmed
>>>>>>
>>>>>> Acked-by: Johannes Weiner
>>>>>>
>>>>>> That's a great simplification.
>>>>>>
>>>>>> Removing the tree->lock made me double take, but at this point the
>>>>>> swapfile and its cache should be fully dead, and I don't see how any of
>>>>>> the zswap operations that take tree->lock could race at this point.
>>>>>
>>>>> It took me a while staring at the code to realize this loop is pointless.
>>>>>
>>>>> However, while I have your attention on the swapoff path, there's a
>>>>> slightly unrelated problem that I think might be there, but I am not
>>>>> sure.
>>>>>
>>>>> It looks to me like swapoff can race with writeback, and there may be
>>>>> a chance of UAF for the zswap tree. For example, if zswap_swapoff()
>>>>> races with shrink_memcg_cb(), I feel like we may free the tree as it
>>>>> is being used. For example, if zswap_swapoff()->kfree(tree) happens
>>>>> right before shrink_memcg_cb()->list_lru_isolate(l, item).
>>>>>
>>>>> Please tell me that I am being paranoid and that there is some
>>>>> protection against zswap writeback racing with swapoff. It feels like
>>>>> we are very careful with zswap entry refcounting, but not with the
>>>>> zswap tree itself.
>>>>
>>>> Hm, I don't see how.
>>>>
>>>> Writeback operates on entries from the LRU. By the time
>>>> zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() will
>>>> have emptied out the LRU and tree.
>>>>
>>>> Writeback could have gotten a refcount to the entry and dropped the
>>>> tree->lock. But then it does __read_swap_cache_async(), and while
>>>> holding the page lock checks the tree under lock once more; if that
>>>> finds the entry valid, it means try_to_unuse() hasn't started on this
>>>> page yet, and would be held up by the page lock/writeback state.
>>>
>>> Consider the following race:
>>>
>>> CPU 1                          CPU 2
>>> # In shrink_memcg_cb()         # In swap_off
>>> list_lru_isolate()
>>>                                zswap_invalidate()
>>>                                ..
>>>                                zswap_swapoff() -> kfree(tree)
>>> spin_lock(&tree->lock);
>>>
>>> Isn't this a UAF, or am I missing something here?
>>
>> Oof. You're right, it looks like there is a bug. Digging through the
>> history, I think this is actually quite old: the original backend
>> shrinkers would pluck something off their LRU, drop all locks, then
>> try to acquire tree->lock. There is no protection against swapoff.
>>
>> The lock that is supposed to protect this is the LRU lock. That's
>> where reclaim and invalidation should synchronize. But it's not right:
>>
>> 1. We drop the LRU lock before acquiring the tree lock. We should
>>    instead trylock the tree while still holding the LRU lock, to make
>>    sure the tree is safe against swapoff.
>>
>> 2. zswap_invalidate() acquires the LRU lock when the refcount hits 0.
>>    But it must always cycle the LRU lock before freeing the tree for
>>    that synchronization to work.
>>
>> Once we're holding a refcount to the entry, it's safe to drop all
>> locks for the next step, because we'll then work against the swapcache
>> and the entry: __read_swap_cache_async() will try to pin and lock
>> whatever swap entry is at that type+offset. This also pins the type's
>> current tree. HOWEVER, if swapoff + swapon raced, this could be a
>> different tree than what we had in @tree, so
>>
>> 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
>>    look up zswap_trees[] again after __read_swap_cache_async()
>>    succeeded, to validate the entry.
>>
>> Once that succeeds, the entry is valid due to our refcount, and
>> zswap_trees[type] is valid due to the cache pin.
>>
>> However, if validation failed and we have a non-zero writeback_result,
>> there is one last bug:
>>
>> 4. the original entry's tree is no longer valid for the entry put.
>>
>> The current scheme handles invalidation fine (which is good, because
>> that's quite common). But it's fundamentally unsynchronized against
>> swapoff (which has probably gone undetected because that's rare).
>>
>> I can't think of an immediate solution to this, but I wanted to put my
>> analysis out for comments.
>
> Thanks for the great analysis, I missed the swapoff/swapon race myself :)
>
> The first solution that came to mind for me was refcounting the zswap
> tree with RCU with percpu-refcount, similar to how cgroup refs are
> handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> the percpu-refcount may be overkill in terms of memory usage, though.
> I think we can still do our own refcounting with RCU, but it may be
> more complicated.

Hello, I also thought about this problem for some time; maybe something
like the below can be done to fix it? It's likely that I missed
something, these are just some thoughts.

IMHO, the problem is caused by the writeback path using the zswap entry
in a different way from zswap_load(). zswap_load() comes in with the
folio locked in the swap cache, so it has a stable zswap tree to search
and lock. In the writeback case we don't: shrink_memcg_cb() comes in
with only a zswap entry and the LRU list lock held, then releases the
LRU lock to take the tree lock, which may already be freed.

So we should change that: read the swpentry from the entry while still
holding the LRU list lock, then release the LRU lock and try to lock
the corresponding folio in the swap cache. If we succeed, the following
steps are much the same as zswap_load(): take the tree lock, recheck
for the invalidate race, and if no race happened, we know the entry is
still valid, so take a refcount on it and release the tree lock.
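In rough C, that recheck might look something like the below (an
untested sketch: the function names are the zswap internals mentioned
above, but the exact signatures and the rbroot field name are from
memory, so treat it as illustration only):

    /*
     * __read_swap_cache_async() returned a locked folio, so the swap
     * entry (and hence the swap type) is pinned; it is now safe to
     * re-derive the tree instead of trusting a possibly stale pointer.
     */
    tree = zswap_trees[swp_type(swpentry)];
    spin_lock(&tree->lock);
    if (entry != zswap_rb_search(&tree->rbroot, swp_offset(swpentry))) {
        /* Raced with invalidation: drop the folio and give up. */
        spin_unlock(&tree->lock);
        folio_unlock(folio);
        folio_put(folio);
        return;
    }
    zswap_entry_get(entry);    /* pin the entry before unlocking */
    spin_unlock(&tree->lock);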
The main differences between this writeback path and zswap_load() are
the handling of the LRU entry and the tree lifetime.

zswap_load() runs entirely with a stable reference on the zswap tree,
but that's not true for the shrink_memcg_cb() bottom half after
__swap_writepage(), since we unlock the folio there, so we can't
reference the tree after that point. This is easy to fix: we can
zswap_invalidate_entry(tree, entry) early, under the tree lock, since
the writeback can't fail after that. BTW, I think we should also
zswap_invalidate_entry() early in zswap_load() and only support the
zswap_exclusive_loads_enabled mode, but that's another topic.

The second difference is the handling of the LRU entry, which is easy:
we just zswap_lru_del() under the tree lock. Then no path can reference
the entry from the tree or the LRU after we release the tree lock, so
we can simply zswap_free_entry() after writeback.

Thanks!

    // lru list lock held
    shrink_memcg_cb()
        swpentry = entry->swpentry
        // Don't isolate the entry from the lru list here;
        // just use list_lru_putback()
        spin_unlock(lru list lock)

        folio = __read_swap_cache_async(swpentry)
        if (!folio)
            return

        if (!folio_was_allocated)
            folio_put(folio)
            return

        // folio is locked, swapcache is secured against swapoff
        tree = get tree from swpentry
        spin_lock(&tree->lock)

        // check the invalidate race: no race happened,
        if (entry == zswap_rb_search())
            // so we can be sure this entry is still valid
            zswap_entry_get(entry)
            // zswap_invalidate_entry() now, since the writeback
            // below can't fail
            zswap_invalidate_entry(tree, entry)
            // remove from the lru list
            zswap_lru_del()

        spin_unlock(&tree->lock)

        __zswap_load()

        __swap_writepage()    // folio unlock
        folio_put(folio)

        // entry is safe to free, since it was removed from the
        // tree and the lru above
        zswap_free_entry(entry)
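As an aside, if we went with the percpu-refcount idea instead, the
shape could be something like the following. This is only a sketch:
the ref field, the release callback, and where the calls would go are
assumptions, not existing code, and error handling is omitted.

    #include <linux/percpu-refcount.h>

    struct zswap_tree {
        struct rb_root rbroot;
        spinlock_t lock;
        struct percpu_ref ref;    /* assumed new field: pins the tree */
    };

    /* Hypothetical release callback: runs once all refs are dropped. */
    static void zswap_tree_release(struct percpu_ref *ref)
    {
        struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

        percpu_ref_exit(&tree->ref);
        kfree(tree);
    }

    /* zswap_swapon(): initialize the ref along with the tree. */
    percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL);

    /*
     * zswap_swapoff(): drop the initial ref instead of kfree(tree);
     * the tree is freed by zswap_tree_release() once the last user
     * is done.
     */
    percpu_ref_kill(&tree->ref);

    /*
     * shrink_memcg_cb(), while still holding the LRU list lock. NB:
     * the tree lookup itself would still need e.g. RCU, so that this
     * tryget cannot race with the tree being freed.
     */
    if (!percpu_ref_tryget(&tree->ref))
        return;    /* tree is going away; skip this entry */
    /* ...now safe to drop the LRU lock and take tree->lock... */
    percpu_ref_put(&tree->ref);    /* when done touching the tree */

As Yosry said, though, this costs a percpu counter per swapfile, which
may be more memory than we want to spend on such a rare race.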