From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4F425C54E58 for ; Mon, 25 Mar 2024 18:35:50 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DA26F6B0088; Mon, 25 Mar 2024 14:35:49 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D2A9A6B008A; Mon, 25 Mar 2024 14:35:49 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BA52D6B0092; Mon, 25 Mar 2024 14:35:49 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id A4BD76B0088 for ; Mon, 25 Mar 2024 14:35:49 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 6DBC3C01F1 for ; Mon, 25 Mar 2024 18:35:49 +0000 (UTC) X-FDA: 81936415218.25.CC459D7 Received: from mail-ej1-f49.google.com (mail-ej1-f49.google.com [209.85.218.49]) by imf22.hostedemail.com (Postfix) with ESMTP id 79079C000A for ; Mon, 25 Mar 2024 18:35:47 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=dmkx5QFa; spf=pass (imf22.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.49 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1711391747; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=VnaO12DYPd96KK0ZxyUCj9krmBaYuVzmySYDUn5xLRY=; b=JrOlpojRQdm4m2n/sKjc/tyga7IObDHLWxfykaW6Xc1VEGdK2pyRQtTv63ZjR2zi1KS47b P/SB8wWxLCQcEd2o7mxwZS5Xf719Z1bgPmSOxLVMvIV9EhSnHlXm3oBE7rTPVIHmerSA8U +p91XvbtZ4mybPALwLLKIqedl02uIno= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1711391747; a=rsa-sha256; cv=none; b=mn3Oz5MIleMOMwSNFz+2G/YqMv2sNTDc+BWSlOBqHtzhcIVB5sqoRF5ArJeAclpg4FcIZ4 /VCKnd3sILbnEhlwesJjoQ/9mev6U7FILl8zfpp0H9fuYRfdR0/+rKx+bcJdVp2KidyL4P wxwc88IoF3jCSReGGfhIkoF4hy1Spqw= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=dmkx5QFa; spf=pass (imf22.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.49 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-ej1-f49.google.com with SMTP id a640c23a62f3a-a46ea03c2a5so805941566b.1 for ; Mon, 25 Mar 2024 11:35:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1711391746; x=1711996546; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=VnaO12DYPd96KK0ZxyUCj9krmBaYuVzmySYDUn5xLRY=; b=dmkx5QFaNxVE8dClL5smQx1h1dIS5bjhDl5z4zcXT8Sy6Sz592l94xgy9v4oTI8Jsc iczeytCKqn9BD5eUdjoiQXbOnj5DjLc+2Ay9qxzQFeANUuURjtNWqtkNkjGKwIHD3GZ0 W6L0LHLJzLhKOFJ1rcFcbjL0FWOtjJ72nZNjdIPxa+zCB/UlfaleO9M//SZsPtqJGMqp CpfvMtNVv3WTMIqWtUkeMMB+BBxUb4Br3/tK+1S54L5YIr8sltaiHqd5oFLFUyAUEaVW 05CEZbmcA58e7SNTTq44hCRDDEuyMW7Jt1u4hQ5dJ4BbvU1cfSwg2b/PojTE8q6YAeL6 VHHg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1711391746; x=1711996546; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=VnaO12DYPd96KK0ZxyUCj9krmBaYuVzmySYDUn5xLRY=; b=DiT/xIBxYW0+G+xK+Vgc2tj4cXeexcutjwakldPvp0RRffwM+x3CLBiEvLOHuTLppF vhZtGm2nteSF89sNRmJ6mhe8ApUrScw/8PJhjQ2MX55ozHPMMi7+3qMUJm2gZWratJYo 0biTlhIXLC/NLjhpD7BpurbhCBBxSDt0x4ZL7uy6pyjueNMbxKUcfjZJs3eoGNpUuANm kJxCCwN5+DRmzCmL5/OgxbzaxXpPJ68y/qTxoNg21LbkWjN+obz0maIyc+m9uCh9/n3R h6a0Wg0y4r0f4f2KfHzv6w25PJEvznzGLpDQkN+Xdi1XJvyScYqQY9PDMwH33+iR3wpK SIrw== X-Forwarded-Encrypted: i=1; AJvYcCX6LKwXhS+7foS74ud2JUHBdBElozKEHiJEW0Tq5NAR+IaZX+lEXyNPb0EfGr29b9IyG9F8HQ/nMCFmjxjjC5VATBk= X-Gm-Message-State: AOJu0Ywj7WqVtbNQ2L6B+O/1KeCSWu7+djBPrJuYx9o3nb15sCinc/pD G2snqzJQe8fmtuKAIrE3qgx6AkQF9hipGQgi+m57wAueweYGd88Rz5v4cwF9eimDhT8ECGV/VWV QQSIwk7uzEVq7nw7FhwG738KsFjcK1o+FlGxy X-Google-Smtp-Source: AGHT+IEG4khKur4No7fwruiwfnOnyHPfaZKYoeBWu8Ce3T4gznyAzpX/p3EocHUe/K7sGfTaoxu9DnIVdLzkNo5JMRA= X-Received: by 2002:a17:907:9722:b0:a47:48d7:d393 with SMTP id jg34-20020a170907972200b00a4748d7d393mr6682243ejc.33.1711391745531; Mon, 25 Mar 2024 11:35:45 -0700 (PDT) MIME-Version: 1.0 References: <20240324210447.956973-1-hannes@cmpxchg.org> <1e7ce417-b9dd-4d62-9f54-0adf1ccdae35@linux.dev> <339639f3-4334-44a8-a811-d20b3c578f74@linux.dev> In-Reply-To: <339639f3-4334-44a8-a811-d20b3c578f74@linux.dev> From: Yosry Ahmed Date: Mon, 25 Mar 2024 11:35:06 -0700 Message-ID: Subject: Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices To: Chengming Zhou Cc: Barry Song <21cnbao@gmail.com>, Johannes Weiner , Andrew Morton , Zhongkun He , Chengming Zhou , Chris Li , Nhat Pham , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kairui Song , Minchan Kim Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 79079C000A X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: 7qtqkiznuxappwhtzr53tmndqw1tnzju X-HE-Tag: 1711391747-577774 X-HE-Meta: U2FsdGVkX19evkXb9YZ0qrvZYPcs7jr5ELe0nxw9fFBGOqJoU3RlqTmA22wHntFxi0H4ItCLUAfI+aOzy8FT1rDpsWwMWx4HVl1jxUG1XXA0aYRc2zNdBNwFiWZHaCU8LhAinpkjIxs4RXwolwcpMGJsO6K7awfIOZa7qFJ3lEMAm65BxXksjpvkWSrERFaAfRtAtRCgkL6u0uG8YRmcrhoBcDRHEmuNSoAl6tsWeUTnImsaWvE6sznn3++mDEfw/AokGNnq2+pu5qe9tZJLJ4O1R9HVORTwS32A3Ejje4Y0uhqsm1YcM5jO9OLH7lGfVxVHWPGXkwJ7M/fi3TZ0AmKwBhBD3iJmYatwtvF+ibPznjQPSlIMLT4eMRbS7Azn7XOXHo5kh26qwfsKdhNoeeiwxMpfqqFt7ltpSE97Zmn7A+eJZW1wpFI4htQ6f0WkDRDoIyPaFE5iwXmYPErhnCD2h8bjKirmB/lZFd/mxopsf2OMVObInckNjQ0jZ6CqCXKLCmcjl1O3f44SKAzcj6P3e2NGz0ObEJGV+IgODiaHccGsK4adrOXYTQsTzwtko4Ccd8ISSjJBrWzeIQwGmIsgosYp2ss+W8TyuUAOjTTvP5q6gn8BieNah5acdWH35fmILHyppFO4NYmmCkVoXRccclK/470jVmFIOxRVXilAG/6ywOOQexrV+ut1XGvY3krgiPO5FuNJ+4IYr/rMZ7UF30JjzGeUPpXju9EkiT+b1EniGqBi7dUhdADnSvuSO0KYlPnr2gwrvFYzsAPVN7PxfRtzC6aGbatol1Uufoqa7dFsn2nLpmY/E2OznraDsYC7jqhoUB/D+yhFmiS/LOA7B6Xtiyttcln67rh4muYTdg+HY1GhDDIeJ9mYqywFmpRMkcjxk7CKIaiF71goQmlyvppfDmL4dToj89oYwQpISAJGoj7Ti0lIcpY/p4v7bR1lIatb0/MNpe1ZLG/ naUAwfyM BECwTkI7hL8oP/1MzUez4hcvE96ELEq+xyXvW+m8I/hrlEGxGr+xt/oknZ26MNS3FcF8xy+5RBtbUhW5yiQhA1sxpPp4bdnL7IyPnxoMY+Hv4WrI90fpAXB2Z73uvZBVzhzDG+ttc1wOpq+xKlweIyfE0PtJpRqU6J74rkbz9TF1Tu2NTx+L9FBAVHBBoojqHYmON0zmHCuDYtXsihkJgZsK+JB3LCwCswXAO6UnQPPeBZeJOZZqBbjh/arLXdoEzW7Xq6qX9QI79FLjpLaWOMu9PoL911PJHBNFRRkAOUgd75wXorpOQnblLDV4cetRTaRwBkrlbLS50vNL/eWutAvs25xqOpnEAKjlO39AXJWjApfF9pJDiSF4w2hOtObHp75L9AnR2GR9awHTDMRlqL/IGJUi1UJI+HT8EdCvWBLLk71jmGgeoXDyh72BLTt0Wzr+KVlzjj1BuO8QrtIgHTMZ42OyxEgOnq/uvzDjT6jggimH90ehqHMpdlA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Mar 25, 2024 at 2:46=E2=80=AFAM Chengming Zhou wrote: > > On 2024/3/25 17:40, Yosry Ahmed wrote: > > On Mon, Mar 25, 2024 at 2:22=E2=80=AFAM Chengming Zhou wrote: > >> > >> On 2024/3/25 16:38, Yosry Ahmed wrote: > >>> On Mon, Mar 25, 2024 at 12:33=E2=80=AFAM Chengming Zhou > >>> wrote: > >>>> > >>>> On 2024/3/25 15:06, Yosry Ahmed wrote: > >>>>> On Sun, Mar 24, 2024 at 9:54=E2=80=AFPM Barry Song <21cnbao@gmail.c= om> wrote: > >>>>>> > >>>>>> On Mon, Mar 25, 2024 at 10:23=E2=80=AFAM Yosry Ahmed wrote: > >>>>>>> > >>>>>>> On Sun, Mar 24, 2024 at 2:04=E2=80=AFPM Johannes Weiner wrote: > >>>>>>>> > >>>>>>>> Zhongkun He reports data corruption when combining zswap with zr= am. > >>>>>>>> > >>>>>>>> The issue is the exclusive loads we're doing in zswap. They assu= me > >>>>>>>> that all reads are going into the swapcache, which can assume > >>>>>>>> authoritative ownership of the data and so the zswap copy can go= . > >>>>>>>> > >>>>>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults wi= ll try > >>>>>>>> to bypass the swapcache. This results in an optimistic read of t= he > >>>>>>>> swap data into a page that will be dismissed if the fault fails = due to > >>>>>>>> races. In this case, zswap mustn't drop its authoritative copy. > >>>>>>>> > >>>>>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=3DzV9P691B9bVq= 33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/ > >>>>>>>> Reported-by: Zhongkun He > >>>>>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > >>>>>>>> Cc: stable@vger.kernel.org [6.5+] > >>>>>>>> Signed-off-by: Johannes Weiner > >>>>>>>> Tested-by: Zhongkun He > >>>>>> > >>>>>> Acked-by: Barry Song > >>>>>> > >>>>>>> > >>>>>>> Do we also want to mention somewhere (commit log or comment) that > >>>>>>> keeping the entry in the tree is fine because we are still protec= ted > >>>>>>> from concurrent loads/invalidations/writeback by swapcache_prepar= e() > >>>>>>> setting SWAP_HAS_CACHE or so? > >>>>>> > >>>>>> It seems that Kairui's patch comprehensively addresses the issue a= t hand. > >>>>>> Johannes's solution, on the other hand, appears to align zswap beh= avior > >>>>>> more closely with that of a traditional swap device, only releasin= g an entry > >>>>>> when the corresponding swap slot is freed, particularly in the syn= c-io case. > >>>>> > >>>>> It actually worked out quite well that Kairui's fix landed shortly > >>>>> before this bug was reported, as this fix wouldn't have been possib= le > >>>>> without it as far as I can tell. > >>>>> > >>>>>> > >>>>>> Johannes' patch has inspired me to consider whether zRAM could ach= ieve > >>>>>> a comparable outcome by immediately releasing objects in swap cach= e > >>>>>> scenarios. When I have the opportunity, I plan to experiment with= zRAM. > >>>>> > >>>>> That would be interesting. I am curious if it would be as > >>>>> straightforward in zram to just mark the folio as dirty in this cas= e > >>>>> like zswap does, given its implementation as a block device. > >>>>> > >>>> > >>>> This makes me wonder who is responsible for marking folio dirty in t= his swapcache > >>>> bypass case? Should we call folio_mark_dirty() after the swap_read_f= olio()? > >>> > >>> In shrink_folio_list(), we try to add anonymous folios to the > >>> swapcache if they are not there before checking if they are dirty. > >>> add_to_swap() calls folio_mark_dirty(), so this should take care of > >> > >> Right, thanks for your clarification, so should be no problem here. > >> Although it was a fix just for MADV_FREE case. > >> > >>> it. There is an interesting comment there though. It says that PTE > >>> should be dirty, so unmapping the folio should have already marked it > >>> as dirty by the time we are adding it to the swapcache, except for th= e > >>> MADV_FREE case. > >> > >> It seems to say the folio will be dirtied when unmap later, supposing = the > >> PTE is dirty. > > > > Oh yeah it could mean that the folio will be dirted later. > > > >> > >>> > >>> However, I think we actually unmap the folio after we add it to the > >>> swapcache in shrink_folio_list(). Also, I don't immediately see why > >>> the PTE would be dirty. In do_swap_page(), making the PTE dirty seems > >> > >> If all anon pages on LRU list are faulted by write, it should be true. > >> We could just use the zero page if faulted by read, right? > > > > This applies for the initial fault that creates the folio, but this is > > a swap fault. It could be a read fault and in that case we still need > > to make the folio dirty because it's not in the swapcache and we need > > to write it out if it's reclaimed, right? > > Yes, IMHO I think it should be marked as dirty here. > > But it should be no problem with that unconditional folio_mark_dirty() > in add_to_swap(). Not sure if there are other issues. I don't believe there are any issues now. Dirtying the folio in add_to_swap() was introduced before SWP_SYNCHRONOUS_IO, so I guess things have always worked. I think we should update the comment there though to mention that dirtying the folio is also needed for this case (not just MADV_FREE), or dirty the PTE during the fault. Otherwise, if someone is making MADV_FREE changes they could end up breaking SWP_SYNCHRONOUS_IO faults. Adding Minchan here in case he can confirm that we in fact rely on add_to_swap()->folio_mark_dirty() for SWP_SYNCHRONOUS_IO to work as intended.