From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 32D2DC47258 for ; Thu, 25 Jan 2024 20:58:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B83B26B0081; Thu, 25 Jan 2024 15:58:08 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id B34696B0087; Thu, 25 Jan 2024 15:58:08 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9D4036B0088; Thu, 25 Jan 2024 15:58:08 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 8B12A6B0081 for ; Thu, 25 Jan 2024 15:58:08 -0500 (EST) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 169788051C for ; Thu, 25 Jan 2024 20:58:08 +0000 (UTC) X-FDA: 81719045856.28.0CDB69F Received: from mail-ej1-f53.google.com (mail-ej1-f53.google.com [209.85.218.53]) by imf25.hostedemail.com (Postfix) with ESMTP id 470BCA0010 for ; Thu, 25 Jan 2024 20:58:06 +0000 (UTC) Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=rKb+MaRy; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf25.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.53 as permitted sender) smtp.mailfrom=yosryahmed@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706216286; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=31gbTt7/WQPF0Y4UOrqjm/WIqOgckaQgYjl+qIfnXoU=; b=wSiLgyn40DHKNWF230D0dkDOhmIhHQtcO7RO7wHbfj5VrP1uJ3tZ2/6bGs8qL7bQcVFANp Cd1j570z2fSExh3ZBQbkIiSrboy3Y/5KtlhgKFsUSDqOua810Ok/g2URG3WnPurHiutonb 6xA1a722qd1YDQZeo2J8dS5jp39diGo= ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=rKb+MaRy; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf25.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.53 as permitted sender) smtp.mailfrom=yosryahmed@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706216286; a=rsa-sha256; cv=none; b=XEz5fEmE6p0ZoHuGEcA0PRsVt8TsRwyrU0DOv52Pmpyjf/C0KpNb3ziA46tfjgl/o57pCG qPzkL6WwNfs2Fkk2yW2cuvWheyepsjXbbgf6MU7VeDCBeavm2yD4ofPKMyj49/jIOADAv8 QwGGGQNkw2CAiemujN7pbkYvvB5DhjU= Received: by mail-ej1-f53.google.com with SMTP id a640c23a62f3a-a2dc7827a97so811593866b.2 for ; Thu, 25 Jan 2024 12:58:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1706216285; x=1706821085; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=31gbTt7/WQPF0Y4UOrqjm/WIqOgckaQgYjl+qIfnXoU=; b=rKb+MaRylh5bD+BA6GgueETAuJoNLcR8vn1vXJoL5ysgvJCgQrGI7nBj0lN7w8+2bE rwUqdpeh3jM+YAIwk64nsSeSmYjaI3F5uzB5vAn+FC45grzJelwuftE/WTLyjVtCwcSI NjptpBfDCfYT4QQgytR6cKOvMLqhgeo8P9pGbhUx/awv9HEebve5sIyQjg7mEaSZIzan wgWmrVcRA5N4hLBbET4VMTj+NZQwjam1W6pV2XHanTShElLVQjg/kuq+0tu38KcPb6pY 5bqdaNPWzry41l3yJ/NSG9SbLVFXP9QafY/2a3WrgUCNqpiE8QB5j4vmiDGIBuDfafP+ 5Ejw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706216285; x=1706821085; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=31gbTt7/WQPF0Y4UOrqjm/WIqOgckaQgYjl+qIfnXoU=; b=GUa3Bf/mkLsbEacB52s1TD7wwug/mNl6hnSnMn3y+rOF01q5ehflytN5fpz+kx8nju oUUaK51pvbmdmRNosodFE9hdC0RCsRjiaDKr8KxPKQd+6OTJT1z/N9PFpJ48B7PLMLdV 6DZcBO60hbGUodqHZpaKanMZ1VicCVlMs+9PDkehLGErcfcIPlSemVHGBeFdLSJNrrgZ a34cVMlpKMAH3eV6pkNfgotd2i5rEI8aL8bVgz6yo688MwdleoZh+ch66zP7R/qaScHC APAeY4S9bjZktjVUCMn0rnVq+uDLcfw/TJQJwOalcTSb+gWBA9B7oy52BNs2kXa2xndD S8nA== X-Gm-Message-State: AOJu0Ywe9y+NRCxIkKC0R5DSji6rOuflHA2GIqhE1IHrEdddMb06QVbR 3qrmvbwCgS/AMaWInX6mbdU8+vdBaenOg4T9Ay5Kh5ElHGeBCPvVsNEoCNw/EXICzxQ+gCWr8+e Armdwz+/s3+hRFEBUZJ0wD4Dd/j6hZCP8g8zS X-Google-Smtp-Source: AGHT+IHpVSOH793RNZe3XgecTMwFbygbft2xOr5Cdbg2uidPsR+U3Oxde1K1MaMLI0s62xz0XNMg6Ht/Bnv2pkdkGYQ= X-Received: by 2002:a17:906:7fd3:b0:a31:984:96a5 with SMTP id r19-20020a1709067fd300b00a31098496a5mr81342ejs.94.1706216284513; Thu, 25 Jan 2024 12:58:04 -0800 (PST) MIME-Version: 1.0 References: <20240120024007.2850671-1-yosryahmed@google.com> <20240120024007.2850671-3-yosryahmed@google.com> <20240122201906.GA1567330@cmpxchg.org> <20240123153851.GA1745986@cmpxchg.org> <20240123201234.GC1745986@cmpxchg.org> In-Reply-To: From: Yosry Ahmed Date: Thu, 25 Jan 2024 12:57:26 -0800 Message-ID: Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() To: Chris Li Cc: Johannes Weiner , Andrew Morton , Nhat Pham , Chengming Zhou , Huang Ying , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 470BCA0010 X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: ffhb5k4xs9eyjxgho3pkdsdp7r7sg6oz X-HE-Tag: 1706216286-252177 X-HE-Meta: U2FsdGVkX1+LglUarp17j0XyqztKj8L1DOCbTon0a9s6nB4OzwboKHPGPMX+RK6tln8V6+/+Si+9d6rLjDrZdS/HVBcqi0K3cPRag0KIJcDdx9K2jOTV5u76Rk9e+Y5DHyw4ohBTmOhiD7ZF5q/5xZrZP2gwpzCQ68xbkri1o5TlGgOJR2x5xpKeIYG0CINODbsiFDl3myNxnCnsDyx++TBUU3HAZ1WG2WLNNzIXKSGlolHe4qdztkfe1GEJs9w8qxOHcpc/dyfGyAJ3yw1ONG+LrTlSuH9jegCincH4cBop2fByE/20TsEgDpFzYCKtaF4sPHYqx6eDRlUmzZhFE/76qomtWuFnycit0tmR1euqO3lM8SScRoyB9CCC+BRREkhRqju1uLswNQ/jlvtloEvmLszdYjBjrWt+WERnUJbjWzTnccnZRsSaP0Hm2qZ2P14UfnNZtgUH6WQhd9CTuThv6DDLOw4i2qm2jSe0C++EqKnKC0Lrw3ge9OkRni1XqEWkbgOyxXlu4H/bWWsJov4IDin377zLDm0P6a3lDk6TPu5YCFqcNqCcHufsWniLX+PUaaiuQrodJWIo/TQ15oXLbJJ+H3/YYrB3656cJOAYj+qiRf7yi384+vCqSGVGesoHTMHVAPnvnX4c8EGoPj2sm9aWVY1Oqth3i43VJsLjQ8qhiLDBwffadZlMz1oG6yznJBCGNqEa6wcSWluZSODs3XWh7anpUX3LhX+/ME4R3vXixZinxGDcZA+RUFt60kZYIiMAabRmDJsuXsWtAZpZ5HmJLcP7LNswF2ke4YrKXBc7vGHeNZZ8ghMKqbvuxcfgowDkiC8nd752aGXn+9GDXlnpKv7po3ZHTE3BlLflZqx41++Kb0Wkaz9g24ln5EBHYEhZViT8g/5N9+zEGCR2DrXNyhGN0I7nNsMtFxeWQKzk2g3OXS5zRvsm/UnMMsGyCZqNhHCcOdlYtfx P/tXCgn9 gH+x3XQu92ErzncuNt426eHlmRGNHXUtTomLlkI8/mhAeG6+XnDam4IS+ShC9T86jjxAlOMAOMwYi9QilxGoFz3HaTLZejF6O0iRAx3ZNeXlSnSmKOrXHYLuwpa7hMJKnqXlAecVcrS3t8DkDFWiorrp2aRGbdkVgLOhMiE2L6VNQq/9eQtQvE8E2ZGERrI5mjwWccVD7QH+6oXCpHsh4vKq5++zxsdtX5BmhWPLT1yRRaaOkHeKixWlZUWL4aGDyHzZyIcQr8u1DD8iiBwetKTtwOiDqyQM+VRmnN62MCByH5Gl9rqJ9ajQBrmSnFszkyecFx1LGE8yYpzhT1XslkNnvRVlDEMn3WrhkSCZ5DbOZfj5TIIsLsqAxkZV4RnRPQ2BnHKMkEWjK8+VPVY3VTVtQBw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jan 25, 2024 at 10:55=E2=80=AFAM Chris Li wrote= : > > Hi Yosry, > > On Wed, Jan 24, 2024 at 11:59=E2=80=AFPM Yosry Ahmed wrote: > > > > On Wed, Jan 24, 2024 at 9:29=E2=80=AFPM Chris Li = wrote: > > > > > > Hi Yosry, > > > > > > On Tue, Jan 23, 2024 at 10:58=E2=80=AFPM Yosry Ahmed wrote: > > > > > > > > > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race m= yself :) > > > > > > > > > > The first solution that came to mind for me was refcounting the z= swap > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs ar= e > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I t= hink > > > > > the percpu-refcount may be an overkill in terms of memory usage > > > > > though. I think we can still do our own refcounting with RCU, but= it > > > > > may be more complicated. > > > > > > > > FWIW, I was able to reproduce the problem in a vm with the followin= g > > > > kernel diff: > > > > > > Thanks for the great find. > > > > > > I was worry about the usage after free situation in this email: > > > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e= 9FZ9OFiUG6g@mail.gmail.com/ > > > > > > Glad you are able to find a reproducible case. That is one of the > > > reasons I change the free to invalidate entries in my xarray patch. > > > > > > I think the swap_off code should remove the entry from the tree, just > > > wait for each zswap entry to drop to zero. Then free it. > > > > This doesn't really help. The swapoff code is already removing all the > > entries from the trees before zswap_swapoff() is called through > > zswap_invalidate(). The race I described occurs because the writeback > > code is accessing the entries through the LRU, not the tree. The > > Why? > Entry not in the tree is fine. What you describe is that, swap_off > code will not see the entry because it is already not in the tree. > The last one holding the reference count would free it. > > > writeback code could have isolated a zswap entry from the LRU before > > swapoff, then tried to access it after swapoff. Although the zswap > > The writeback should have a reference count to the zswap entry it > holds. The entry will not be free until the LRU is done and drop the > reference count to zero. > > > entry itself is referenced and safe to use, accessing the tree to grab > > the tree lock and check if the entry is still in the tree is the > > problem. > > The swap off should wait until all the LRU list from that tree has > been consumed before destroying the tree. > In swap off, it walks all the process MM anyway, walking all the memcg > and finding all the zswap entries in zswap LRU should solve that > problem. At that point, the entry is isolated from the zswap LRU list as well. So even if swap off iterates the zswap LRUs, it cannot find it to wait for it. Take a closer look at the race condition I described. The problem is that after the entry is isolated from the zswap LRU, we need to grab the tree lock to make sure it's still there and get a ref, and just trying to lock the tree may be a UAF if we race with swapoff. > Anyway, I think it is easier to reason about the behavior that way. > Swap off will take the extra hit, but not the normal access of the > tree. I think this adds a lot of unnecessary complexity tbh. I think the operations reordering that Chengming described may be the way to go here. It does not include adding more logic or synchronization primitives, just reordering operations to be closer to what we do in zswap_load() and leverage existing synchronization.