From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 25 Oct 2024 12:10:37 -0700
Subject: Re: [RFC 1/4] mm/zswap: skip swapcache for swapping in zswap pages
To: Nhat Pham
Cc: Usama Arif, akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org,
 david@redhat.com, willy@infradead.org, kanchana.p.sridhar@intel.com,
 chengming.zhou@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
 21cnbao@gmail.com, riel@surriel.com, shakeel.butt@linux.dev, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Kairui Song
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <20241018105026.2521366-2-usamaarif642@gmail.com>
 <07e6018b-bc6f-4e1e-9bc3-07a4b5a384fc@gmail.com>

On Fri, Oct 25, 2024 at 11:19 AM Nhat Pham wrote:
>
> On Tue, Oct 22, 2024 at 5:46 PM Yosry Ahmed wrote:
> >
> > [..]
> > > >> @@ -1576,6 +1576,52 @@ bool zswap_store(struct folio *folio)
> > > >>         return ret;
> > > >>  }
> > > >>
> > > >> +static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
> > > >> +{
> > > >> +       return (offset >> SWAP_ADDRESS_SPACE_SHIFT) < nr_zswap_trees[type];
> > > >
> > > > I am not sure I understand what we are looking for here. When does
> > > > this return false? Aren't the zswap trees always allocated during
> > > > swapon?
> > > >
> > > Hi Yosry,
> > >
> > > Thanks for the review!
> > >
> > > It becomes useful in patch 3 when trying to determine if a large folio can be allocated.
> > >
> > > For e.g. if the swap entry is the last entry of the last tree, and 1M folios are enabled
> > > (nr_pages = 256), then the while loop in zswap_present_test will try to access a tree
> > > that doesn't exist from the 2nd 4K page onwards if we dont have this check in
> > > zswap_present_test.
> >
> > Doesn't swap_pte_batch() make sure that the range of swap entries
> > passed here all corresponds to existing swap entries, and those
> > entries should always have a corresponding zswap tree? How can the
> > passed in range contain an entry that is not in any zswap tree?
> >
> > I feel like I am missing something.
> >
> > >
> > > >> +}
> > > >> +
> > > >> +/* Returns true if the entire folio is in zswap */
> > > >
> > > > There isn't really a folio at this point, maybe "Returns true if the
> > > > entire range is in zswap"?
> > >
> > > Will change, Thanks!
> > >
> > > >
> > > > Also, this is racy because an exclusive load, invalidation, or
> > > > writeback can cause an entry to be removed from zswap. Under what
> > > > conditions is this safe? The caller can probably guarantee we don't
> > > > race against invalidation, but can we guarantee that concurrent
> > > > exclusive loads or writebacks don't happen?
> > > >
> > > > If the answer is yes, this needs to be properly documented.
> > >
> > > swapcache_prepare should stop things from becoming racy.
> > >
> > > lets say trying to swapin a mTHP of size 32 pages:
> > > - T1 is doing do_swap_page, T2 is doing zswap_writeback.
> > > - T1 - Check if the entire 32 pages is in zswap, swapcache_prepare(entry, nr_pages) in do_swap_page is not yet called.
> > > - T2 - zswap_writeback_entry starts and lets say writes page 2 to swap. it calls __read_swap_cache_async -> swapcache_prepare increments swap_map count, writes page to swap.
> >
> > Can the folio be dropped from the swapcache at this point (e.g. by
> > reclaim)? If yes, it seems like swapcache_prepare() can succeed and
> > try to read it from zswap.
> >
>
> I think you're onto something.
>
> Can this happen: say T2 writebacks a couple of tail pages, but not all
> of them, then drops everything from swap cache. Then T1 can definitely
> proceed. It would then check again in zswap_load(), which returns
> false (thanks to zswap_present()) test.
>
> All fine and good so far, but then in swap_read_folio(), it would try
> to fall back to reading the entire large folio from swapfile, which
> will contain bogus data in pages that have not been written back by
> T2.
>
> I think the problem is swap_read_folio() assumes it always succeeds,
> and a precondition for successful reading is the large folio must have
> no mixed backing state for its subpages, which we fail to guarantee
> before entering swap_read_folio(). This can lead to memory corruption.
>
> Buuut, I think all we need to do is just check the backing state again
> after T1's swapcache_prepare() call. At this point, we are guaranteed
> to have a stable backing state. If it fails here, then we can just
> exit and fall back to individual page swapping in.

I think this should work, but we need to take a closer look for other
things that can go wrong along this path.
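
To make the check/pin/re-check ordering concrete, here is a minimal,
self-contained user-space sketch of the idea, not the actual kernel code:
zswap_present_test(), swapcache_prepare() and swapcache_clear() below are
stand-ins with assumed signatures, and their stub bodies just force the
problematic interleaving so the program runs. The point is only the shape
of the control flow: the first presence check is a hint, the decision to
read a large folio is made only if the check still holds after the entries
are pinned, and a mismatch falls back to per-page swapin.

#include <stdbool.h>
#include <stdio.h>

enum swapin_order { SWAPIN_LARGE_FOLIO, SWAPIN_SINGLE_PAGE, SWAPIN_RETRY };

/* Pretend backing state; a concurrent "writeback" flips it under us. */
static bool backing_in_zswap = true;

/* Stand-in: true if every entry in [entry, entry + nr) is in zswap. */
static bool zswap_present_test(unsigned long entry, int nr)
{
	(void)entry; (void)nr;
	return backing_in_zswap;
}

/* Stand-in: pin the swap entries; after this the state cannot change. */
static bool swapcache_prepare(unsigned long entry, int nr)
{
	(void)entry; (void)nr;
	/* Simulate T2 writing back a subpage before we got the pin. */
	backing_in_zswap = false;
	return true;
}

/* Stand-in: drop the pin taken by swapcache_prepare(). */
static void swapcache_clear(unsigned long entry, int nr)
{
	(void)entry; (void)nr;
}

static enum swapin_order choose_swapin_order(unsigned long entry, int nr)
{
	/* First look is only a hint; writeback may still change the state. */
	if (!zswap_present_test(entry, nr))
		return SWAPIN_SINGLE_PAGE;

	if (!swapcache_prepare(entry, nr))
		return SWAPIN_RETRY;		/* raced with another swapin */

	/*
	 * Re-check now that the entries are pinned. If T2 wrote back any
	 * subpage (and dropped it from the swap cache) between the first
	 * check and the pin, the folio has mixed backing and a large read
	 * from either backend would return bogus data for some subpages.
	 */
	if (!zswap_present_test(entry, nr)) {
		swapcache_clear(entry, nr);
		return SWAPIN_SINGLE_PAGE;	/* fall back, page by page */
	}

	return SWAPIN_LARGE_FOLIO;
}

int main(void)
{
	enum swapin_order order = choose_swapin_order(0, 32);

	printf("swapin order: %s\n",
	       order == SWAPIN_LARGE_FOLIO ? "large folio" :
	       order == SWAPIN_SINGLE_PAGE ? "single page" : "retry");
	return 0;
}

Note this sketch only re-checks the zswap->swapfile direction after the pin;
where exactly the re-check sits in do_swap_page, and whether the reverse
transition also needs handling, is one of the things to look at closely.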