From: Nhat Pham <nphamcs@gmail.com>
Date: Fri, 25 Oct 2024 11:19:15 -0700
Subject: Re: [RFC 1/4] mm/zswap: skip swapcache for swapping in zswap pages
To: Yosry Ahmed
Cc: Usama Arif, akpm@linux-foundation.org, linux-mm@kvack.org,
 hannes@cmpxchg.org, david@redhat.com, willy@infradead.org,
 kanchana.p.sridhar@intel.com, chengming.zhou@linux.dev, ryan.roberts@arm.com,
 ying.huang@intel.com, 21cnbao@gmail.com, riel@surriel.com,
 shakeel.butt@linux.dev, kernel-team@meta.com, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, Kairui Song
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <20241018105026.2521366-2-usamaarif642@gmail.com>
 <07e6018b-bc6f-4e1e-9bc3-07a4b5a384fc@gmail.com>

On Tue, Oct 22, 2024 at 5:46 PM Yosry Ahmed wrote:
>
> [..]
> > >> @@ -1576,6 +1576,52 @@ bool zswap_store(struct folio *folio)
> > >>                 return ret;
> > >>  }
> > >>
> > >> +static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
> > >> +{
> > >> +       return (offset >> SWAP_ADDRESS_SPACE_SHIFT) < nr_zswap_trees[type];
> > >
> > > I am not sure I understand what we are looking for here. When does
> > > this return false? Aren't the zswap trees always allocated during
> > > swapon?
> > >
> >
> > Hi Yosry,
> >
> > Thanks for the review!
> >
> > It becomes useful in patch 3 when trying to determine if a large folio can be allocated.
> >
> > For example, if the swap entry is the last entry of the last tree, and 1M folios are enabled
> > (nr_pages = 256), then the while loop in zswap_present_test will try to access a tree
> > that doesn't exist from the 2nd 4K page onwards if we don't have this check in
> > zswap_present_test.
>
> Doesn't swap_pte_batch() make sure that the range of swap entries
> passed here all corresponds to existing swap entries, and those
> entries should always have a corresponding zswap tree? How can the
> passed in range contain an entry that is not in any zswap tree?
>
> I feel like I am missing something.
>
> >
> > >> +}
> > >> +
> > >> +/* Returns true if the entire folio is in zswap */
> > >
> > > There isn't really a folio at this point, maybe "Returns true if the
> > > entire range is in zswap"?
> >
> > Will change, thanks!
> >
> > >
> > > Also, this is racy because an exclusive load, invalidation, or
> > > writeback can cause an entry to be removed from zswap. Under what
> > > conditions is this safe? The caller can probably guarantee we don't
> > > race against invalidation, but can we guarantee that concurrent
> > > exclusive loads or writebacks don't happen?
> > >
> > > If the answer is yes, this needs to be properly documented.
> >
> > swapcache_prepare should stop things from becoming racy.
> >
> > Let's say we are trying to swap in an mTHP of 32 pages:
> > - T1 is doing do_swap_page, T2 is doing zswap writeback.
> > - T1 checks whether the entire 32 pages is in zswap; swapcache_prepare(entry, nr_pages)
> >   in do_swap_page has not yet been called.
> > - T2: zswap_writeback_entry starts and, say, writes page 2 to swap. It calls
> >   __read_swap_cache_async -> swapcache_prepare, which increments the swap_map count,
> >   and writes the page to swap.
>
> Can the folio be dropped from the swapcache at this point (e.g. by
> reclaim)? If yes, it seems like swapcache_prepare() can succeed and
> try to read it from zswap.
>

I think you're onto something.

Can this happen: say T2 writes back a couple of the tail pages, but not
all of them, then drops everything from the swap cache. Then T1 can
definitely proceed. It would then check again in zswap_load(), which
returns false (thanks to the zswap_present() test).
All fine and good so far, but then in swap_read_folio() it would try to
fall back to reading the entire large folio from the swapfile, which
will contain bogus data in the pages that have not been written back
by T2.

I think the problem is that swap_read_folio() assumes it always
succeeds, and a precondition for a successful read is that the large
folio has no mixed backing state across its subpages, which we fail to
guarantee before entering swap_read_folio(). This can lead to memory
corruption.

Buuut, I think all we need to do is check the backing state again after
T1's swapcache_prepare() call. At that point we are guaranteed a stable
backing state. If the check fails there, we can just bail out and fall
back to swapping the pages in individually.
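
To make that concrete, here is roughly the shape I have in mind
(untested pseudocode against this RFC; zswap_present_test() and the
nr_pages variants of swapcache_prepare()/swapcache_clear() are the
helpers I'm assuming this series provides, and the surrounding
do_swap_page() structure is heavily abbreviated):

	/* do_swap_page(), large folio swapin path (sketch) */

	/*
	 * The first zswap_present_test() is only advisory: it decides
	 * whether we even attempt an mTHP swapin. It can be raced by
	 * concurrent writeback.
	 */
	if (zswap_present_test(entry, nr_pages)) {
		folio = alloc_swap_folio(vmf);	/* order > 0 */

		/* Take ownership of the whole range, or back off. */
		if (swapcache_prepare(entry, nr_pages))
			goto retry;

		/*
		 * Re-check now that concurrent writeback/invalidation can
		 * no longer change the backing state under us. If some
		 * subpages were written back between the first check and
		 * swapcache_prepare(), the backing state is mixed, and
		 * reading the whole folio from either zswap or the swapfile
		 * would return bogus data for part of it.
		 */
		if (!zswap_present_test(entry, nr_pages)) {
			swapcache_clear(si, entry, nr_pages);
			folio_put(folio);
			goto fallback_order0;	/* swap pages in one by one */
		}

		/* Entire range is in zswap; safe to read the whole folio. */
		swap_read_folio(folio, NULL);
	}

(alloc_swap_folio()/retry/fallback_order0 are just placeholders for
whatever patch 3 ends up doing; the point is only the second,
post-swapcache_prepare() zswap_present_test() call.)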