From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 23 Jan 2024 13:02:10 -0800
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
To: Johannes Weiner
Cc: Andrew Morton, Nhat Pham, Chris Li, Chengming Zhou, Huang Ying, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20240123201234.GC1745986@cmpxchg.org>
References: <20240120024007.2850671-1-yosryahmed@google.com> <20240120024007.2850671-3-yosryahmed@google.com> <20240122201906.GA1567330@cmpxchg.org> <20240123153851.GA1745986@cmpxchg.org> <20240123201234.GC1745986@cmpxchg.org>
Content-Type: text/plain; charset="UTF-8"
On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner wrote:
>
> On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote:
> > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner wrote:
> > >
> > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
> > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner wrote:
> > > > >
> > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
> > > > > > During swapoff,
> > > > > > try_to_unuse() makes sure that zswap_invalidate() is
> > > > > > called for all swap entries before zswap_swapoff() is called. This means
> > > > > > that all zswap entries should already be removed from the tree. Simplify
> > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an
> > > > > > assertion in its place.
> > > > > >
> > > > > > Signed-off-by: Yosry Ahmed
> > > > >
> > > > > Acked-by: Johannes Weiner
> > > > >
> > > > > That's a great simplification.
> > > > >
> > > > > Removing the tree->lock made me double take, but at this point the
> > > > > swapfile and its cache should be fully dead and I don't see how any of
> > > > > the zswap operations that take tree->lock could race at this point.
> > > >
> > > > It took me a while staring at the code to realize this loop is pointless.
> > > >
> > > > However, while I have your attention on the swapoff path, there's a
> > > > slightly irrelevant problem that I think might be there, but I am not
> > > > sure.
> > > >
> > > > It looks to me like swapoff can race with writeback, and there may be
> > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > > is being used, e.g. if zswap_swapoff()->kfree(tree) happens
> > > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > > >
> > > > Please tell me that I am being paranoid and that there is some
> > > > protection against zswap writeback racing with swapoff. It feels like
> > > > we are very careful with zswap entries refcounting, but not with the
> > > > zswap tree itself.
> > >
> > > Hm, I don't see how.
> > >
> > > Writeback operates on entries from the LRU. By the time
> > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() will
> > > have emptied out the LRU and tree.
> > >
> > > Writeback could have gotten a refcount to the entry and dropped the
> > > tree->lock.
> > > But then it does __read_swap_cache_async(), and while
> > > holding the page lock checks the tree under lock once more; if that
> > > finds the entry valid, it means try_to_unuse() hasn't started on this
> > > page yet, and would be held up by the page lock/writeback state.
> >
> > Consider the following race:
> >
> > CPU 1                        CPU 2
> > # In shrink_memcg_cb()       # In swap_off
> > list_lru_isolate()
> >                              zswap_invalidate()
> >                              ..
> >                              zswap_swapoff() -> kfree(tree)
> > spin_lock(&tree->lock);
> >
> > Isn't this a UAF or am I missing something here?
>
> Oof. You're right, it looks like there is a bug. Digging through the
> history, I think this is actually quite old: the original backend
> shrinkers would pluck something off their LRU, drop all locks, then
> try to acquire tree->lock. There is no protection against swapoff.
>
> The lock that is supposed to protect this is the LRU lock. That's
> where reclaim and invalidation should synchronize. But it's not right:
>
> 1. We drop the LRU lock before acquiring the tree lock. We should
>    instead trylock the tree while still holding the LRU lock to make
>    sure the tree is safe against swapoff.
>
> 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But
>    it must always cycle the LRU lock before freeing the tree for that
>    synchronization to work.
>
> Once we're holding a refcount to the entry, it's safe to drop all
> locks for the next step because we'll then work against the swapcache
> and entry: __read_swap_cache_async() will try to pin and lock whatever
> swap entry is at that type+offset. This also pins the type's current
> tree. HOWEVER, if swapoff + swapon raced, this could be a different
> tree than what we had in @tree, so
>
> 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
>    look up zswap_trees[] again after __read_swap_cache_async()
>    succeeded to validate the entry.
>
> Once it succeeded, we can validate the entry. The entry is valid due
> to our refcount.
> The zswap_trees[type] is valid due to the cache pin.
>
> However, if validation failed and we have a non-zero writeback_result,
> there is one last bug:
>
> 4. the original entry's tree is no longer valid for the entry put.
>
> The current scheme handles invalidation fine (which is good because
> that's quite common). But it's fundamentally unsynchronized against
> swapoff (which has probably gone undetected because that's rare).
>
> I can't think of an immediate solution to this, but I wanted to put my
> analysis out for comments.

Thanks for the great analysis, I missed the swapoff/swapon race myself :)

The first solution that came to mind for me was refcounting the zswap
tree with RCU and percpu-refcount, similar to how cgroup refs are
handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
percpu-refcount may be overkill in terms of memory usage though. I
think we can still do our own refcounting with RCU, but it may be more
complicated.