From: Nhat Pham
Date: Tue, 23 Jan 2024 12:30:42 -0800
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
To: Yosry Ahmed
Cc: Johannes Weiner, Andrew Morton, Chris Li, Chengming Zhou, Huang Ying, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240120024007.2850671-1-yosryahmed@google.com> <20240120024007.2850671-3-yosryahmed@google.com> <20240122201906.GA1567330@cmpxchg.org> <20240123153851.GA1745986@cmpxchg.org>

On Tue, Jan 23, 2024 at 7:55 AM Yosry Ahmed wrote:
>
> On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner wrote:
> >
> > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
> > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner wrote:
> > > >
> > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
> > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
> > > > > called for all swap entries before zswap_swapoff() is called. This means
> > > > > that all zswap entries should already be removed from the tree. Simplify
> > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an
> > > > > assertion in its place.
> > > > >
> > > > > Signed-off-by: Yosry Ahmed
> > > >
> > > > Acked-by: Johannes Weiner
> > > >
> > > > That's a great simplification.
> > > >
> > > > Removing the tree->lock made me double take, but at this point the
> > > > swapfile and its cache should be fully dead and I don't see how any of
> > > > the zswap operations that take tree->lock could race at this point.
> > >
> > > It took me a while staring at the code to realize this loop is pointless.
> > >
> > > However, while I have your attention on the swapoff path, there's a
> > > slightly irrelevant problem that I think might be there, but I am not
> > > sure.
> > >
> > > It looks to me like swapoff can race with writeback, and there may be
> > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > is being used. For example if zswap_swapoff()->kfree(tree) happen
> > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > >
> > > Please tell me that I am being paranoid and that there is some
> > > protection against zswap writeback racing with swapoff. It feels like
> > > we are very careful with zswap entries refcounting, but not with the
> > > zswap tree itself.
> >
> > Hm, I don't see how.
> >
> > Writeback operates on entries from the LRU. By the time
> > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should
> > will have emptied out the LRU and tree.
> >
> > Writeback could have gotten a refcount to the entry and dropped the
> > tree->lock. But then it does __read_swap_cache_async(), and while
> > holding the page lock checks the tree under lock once more; if that
> > finds the entry valid, it means try_to_unuse() hasn't started on this
> > page yet, and would be held up by the page lock/writeback state.
>
> Consider the following race:
>
> CPU 1                            CPU 2
> # In shrink_memcg_cb()           # In swap_off
> list_lru_isolate()
>                                  zswap_invalidate()
>                                  ..
>                                  zswap_swapoff() -> kfree(tree)
> spin_lock(&tree->lock);
>
> Isn't this a UAF or am I missing something here?

I need to read this code closer. But this smells like a race to me as well.

Long term speaking, I think decoupling swap and zswap will fix this,
no? We won't need to kfree(tree) inside swapoff.

IOW, if we have a single zswap tree that is not tied down to any
swapfile, then we can't have this race. There might be other races
introduced by the decoupling that I might have not foreseen tho :)

Short term, no clue hmm. Let me think a bit more about this.

>
> >
> > > > > Chengming, Chris, I think this should make the tree split and the xarray
> > > > > conversion patches simpler (especially the former). If others agree,
> > > > > both changes can be rebased on top of this.
> > > >
> > > > The resulting code is definitely simpler, but this patch is not a
> > > > completely trivial cleanup, either. If you put it before Chengming's
> > > > patch and it breaks something, it would be difficult to pull out
> > > > without affecting the tree split.
> > >
> > > Are you suggesting I rebase this on top of Chengming's patches? I can
> > > definitely do this, but the patch will be slightly less
> > > straightforward, and if the tree split patches break something it
> > > would be difficult to pull out as well. If you feel like this patch is
> > > more likely to break things, I can rebase.
> >
> > Yeah I think it's more subtle. I'd only ask somebody to rebase an
> > already tested patch on a newer one if the latter were an obvious,
> > low-risk, prep-style patch. Your patch is good, but it doesn't quite
> > fit into this particular category, so I'd say no jumping the queue ;)
>
> My intention was to reduce the diff in both this patch and the tree
> split patches, but I do understand this is more subtle. I can rebase
> on top of Chengming's patches instead.
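
For reference, the CPU 1 / CPU 2 interleaving discussed in this thread
can be replayed with a small user-space sketch. This is not kernel code
and makes no claim about what zswap actually does: toy_tree,
toy_swapoff(), toy_tree_lock() and replay() are hypothetical names, and
a "freed" flag stands in for kfree(tree) so the sketch itself never
touches freed memory. It only shows why taking the tree lock after
swapoff has run would be a use-after-free if nothing else orders the
two paths.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-swapfile zswap tree. */
struct toy_tree {
	bool freed;	/* set instead of calling free(), to avoid real UB */
	bool locked;
};

/* Stand-in for zswap_swapoff() -> kfree(tree). */
static void toy_swapoff(struct toy_tree *tree)
{
	tree->freed = true;
}

/*
 * Stand-in for spin_lock(&tree->lock). Returns false where the real
 * code would be touching freed memory.
 */
static bool toy_tree_lock(struct toy_tree *tree)
{
	if (tree->freed)
		return false;
	tree->locked = true;
	return true;
}

static void toy_tree_unlock(struct toy_tree *tree)
{
	tree->locked = false;
}

/*
 * Replay one ordering of the two paths. With swapoff_races_in set,
 * the steps follow the diagram: isolate, swapoff frees the tree, then
 * writeback locks it. Otherwise writeback finishes before swapoff.
 * Returns true if the lock was taken on a live tree.
 */
static bool replay(bool swapoff_races_in)
{
	struct toy_tree tree = { .freed = false, .locked = false };
	struct toy_tree *seen_by_writeback;
	bool safe;

	/* CPU 1: list_lru_isolate(); writeback now holds a tree pointer. */
	seen_by_writeback = &tree;

	if (swapoff_races_in) {
		/* CPU 2: zswap_invalidate() .. zswap_swapoff() */
		toy_swapoff(&tree);
		/* CPU 1: spin_lock(&tree->lock) on a freed tree */
		return toy_tree_lock(seen_by_writeback);
	}

	/* CPU 1 completes first, then CPU 2 frees the tree. */
	safe = toy_tree_lock(seen_by_writeback);
	toy_tree_unlock(seen_by_writeback);
	toy_swapoff(&tree);
	return safe;
}
```

Calling replay(true) follows the diagram and reports the unsafe lock;
replay(false) is the ordering where writeback is done before the tree
goes away.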