From: Chris Li <chrisl@kernel.org>
Date: Wed, 24 Jan 2024 21:44:00 -0800
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
To: Chengming Zhou
Cc: Yosry Ahmed, Johannes Weiner, Andrew Morton, Nhat Pham, Huang Ying,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240120024007.2850671-1-yosryahmed@google.com>
 <20240120024007.2850671-3-yosryahmed@google.com>
 <20240122201906.GA1567330@cmpxchg.org>
 <20240123153851.GA1745986@cmpxchg.org>
 <20240123201234.GC1745986@cmpxchg.org>
On Tue, Jan 23, 2024 at 11:21 PM Chengming Zhou wrote:
>
> Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> >
> > The first solution that came to mind for me was refcounting the zswap
> > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > the percpu-refcount may be an overkill in terms of memory usage
> > though.
> > I think we can still do our own refcounting with RCU, but it
> > may be more complicated.
>
> Hello,
>
> I also thought about this problem for some time; maybe something like below
> can be changed to fix it? It's likely I missed something, just some thoughts.
>
> IMHO, the problem is caused by the different way in which we use the zswap
> entry in writeback, which should be much like zswap_load().

It is possible to keep the behavior difference and still get the tree
lifetime right. Follow a few rules:

1) Always bump the zswap entry refcount after finding the entry (inside
   the RCU read lock).
2) Always free the entry after its refcount drops to zero.
3) Swapoff should just wait for each entry's refcount to drop to zero
   (or to 1, counting the reference the swapoff code itself holds), then
   free it. Swapoff is the slow path anyway.

BTW, swapoff is where the swap device locks get a lot of contention.

> The zswap_load() path comes in with the folio locked in the swap cache, so
> it has a stable zswap tree to search and lock... But in the writeback case
> we don't: shrink_memcg_cb() comes in with only a zswap entry, with the lru
> list lock held, then releases the lru lock to take the tree lock, which may
> be freed already.
>
> So we should change things here: we read swpentry from the entry with the
> lru list lock held, then release the lru lock and try to lock the
> corresponding folio in the swap cache. If we succeed, what follows is much
> the same as zswap_load(): we can take the tree lock to recheck for the
> invalidate race, and if no race happened, we know the entry is still valid,
> take a refcount on it, and then release the tree lock.
>
> The main differences between this writeback path and zswap_load() are the
> handling of the lru entry and the tree lifetime. The whole zswap_load()
> function holds a stable reference to the zswap tree, but shrink_memcg_cb()
> does not in its bottom half after __swap_writepage(), since we unlock the
> folio there. So we can't reference the tree after that.
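The three rules above can be modeled with a plain atomic refcount. Below is a minimal user-space sketch of that lifetime scheme; the names (`entry_get`, `entry_put`, `swapoff_drain`) and the struct are invented for illustration and are not the zswap code. A real kernel version would also need a get that tolerates a concurrent drop to zero (along the lines of refcount_inc_not_zero()) under the RCU read lock:

```c
#include <stdatomic.h>

/* Illustrative model only, not the zswap implementation. */
struct entry {
    atomic_int refcount;   /* rule 1: lookups bump this before use */
    int freed;             /* for demonstration: set when "freed" */
};

/* Rule 1: a successful lookup takes a reference (in kernel code this
 * would happen inside the RCU read-side critical section). */
static void entry_get(struct entry *e)
{
    atomic_fetch_add(&e->refcount, 1);
}

/* Rule 2: the last put frees the entry. */
static void entry_put(struct entry *e)
{
    if (atomic_fetch_sub(&e->refcount, 1) == 1)
        e->freed = 1;      /* stands in for zswap_free_entry() */
}

/* Rule 3: swapoff holds one reference of its own, waits until it is
 * the only holder, then drops it, which frees the entry.  Swapoff is
 * the slow path, so waiting here is acceptable. */
static void swapoff_drain(struct entry *e)
{
    while (atomic_load(&e->refcount) > 1)
        ;                  /* a real version would sleep, not spin */
    entry_put(e);
}
```

The point of the model is that no tree lock is needed to keep the entry alive once a reference is held: writeback and load paths pair entry_get()/entry_put(), and swapoff simply drains the remaining references.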
>
> This problem is easy to fix: we can zswap_invalidate_entry(tree, entry)
> early, under the tree lock, since writeback can't fail after that. BTW, I
> think we should also zswap_invalidate_entry() early in zswap_load() and
> only support the zswap_exclusive_loads_enabled mode, but that's another
> topic.
>
> The second difference is the handling of the lru entry, which is easy: we
> just zswap_lru_del() under the tree lock.
>
> So no path can reference the entry from the tree or the lru after we
> release the tree lock, and we can just zswap_free_entry() after writeback.
>
> Thanks!
>
> // lru list lock held
> shrink_memcg_cb()
>     swpentry = entry->swpentry
>     // Don't isolate the entry from the lru list here, just use
>     // list_lru_putback()
>     spin_unlock(lru list lock)
>
>     folio = __read_swap_cache_async(swpentry)
>     if (!folio)
>         return
>
>     if (!folio_was_allocated)
>         folio_put(folio)
>         return
>
>     // folio is locked, swapcache is secured against swapoff
>     tree = get tree from swpentry
>     spin_lock(&tree->lock)

That will not work well with the zswap-to-xarray change. We want to remove
the tree lock and only use the xarray lock. The lookup should just hold the
xarray RCU read lock and return the entry with its refcount increased.

Chris

>
>     // check invalidate race? No
>     if (entry == zswap_rb_search())
>
>         // so we can make sure this entry is still right
>         // zswap_invalidate_entry() since the writeback below can't fail
>         zswap_entry_get(entry)
>         zswap_invalidate_entry(tree, entry)
>
>         // remove from lru list
>         zswap_lru_del()
>
>     spin_unlock(&tree->lock)
>
>     __zswap_load()
>
>     __swap_writepage() // folio unlock
>     folio_put(folio)
>
>     // entry is safe to free, since it was removed from tree and lru above
>     zswap_free_entry(entry)
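For what it's worth, the lookup shape Chris describes (find the entry and take a reference entirely within the RCU read-side critical section, with no tree lock afterwards) can be sketched in user space as below. All names here are invented for illustration; the real code would use xas_load() between rcu_read_lock()/rcu_read_unlock(), and the refcount bump would have to handle a concurrent drop to zero:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative model only: one slot stands in for an xarray index, and
 * comments mark where the RCU read lock would sit in kernel code. */
struct zswap_entry_model {
    atomic_int refcount;   /* entry is freed when this reaches zero */
};

static struct zswap_entry_model *slot;   /* the "xarray", one index */

/* Look up the entry and return it with a reference already held, so
 * the caller never needs the tree lock to keep it alive. */
static struct zswap_entry_model *lookup_get(void)
{
    struct zswap_entry_model *e;

    /* rcu_read_lock() would go here: it keeps the entry from being
     * freed between the load and the refcount bump. */
    e = slot;                            /* xas_load() in real code */
    if (e)
        atomic_fetch_add(&e->refcount, 1);
    /* rcu_read_unlock() would go here; the caller now owns a ref. */
    return e;
}
```

With this shape, swapoff removes entries from the structure and then waits for outstanding references to drain, so the writeback path never dereferences a freed tree.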