From: Domenico Cerasuolo
Date: Thu, 22 Jun 2023 08:32:46 +0200
Subject: Re: [PATCH] mm: zswap: fix double invalidate with exclusive loads
To: Yosry Ahmed
Cc: Andrew Morton, Hyeonggon Yoo <42.hyeyoo@gmail.com>,
 Konrad Rzeszutek Wilk, Seth Jennings, Dan Streetman, Vitaly Wool,
 Johannes Weiner, Nhat Pham, Yu Zhao, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20230621093009.637544-1-yosryahmed@google.com>

On Wed, Jun 21, 2023 at 11:23 PM Yosry Ahmed wrote:
>
> On Wed, Jun 21, 2023 at 12:36 PM Domenico Cerasuolo wrote:
> >
> > On Wed, Jun 21, 2023 at 7:26 PM Yosry Ahmed wrote:
> > >
> > > On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo wrote:
> > > >
> > > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed wrote:
> > > > >
> > > > > If exclusive loads are enabled for zswap, we invalidate the entry before
> > > > > returning from zswap_frontswap_load(), after dropping the local
> > > > > reference. However, the tree lock is dropped during decompression after
> > > > > the local reference is acquired, so the entry could be invalidated
> > > > > before we drop the local ref. If this happens, the entry is freed once
> > > > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate
> > > > > an already freed entry.
> > > > >
> > > > > Fix this by:
> > > > > (a) Making sure zswap_invalidate_entry() is always called with a local
> > > > >     ref held, to avoid being called on a freed entry.
> > > > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry
> > > > >     was actually on the rbtree. Otherwise, another invalidation could
> > > > >     have already happened, and the initial ref is already dropped.
> > > > >
> > > > > With these changes, there is no need to check that there is no need to
> > > > > make sure the entry still exists in the tree in zswap_reclaim_entry()
> > > > > before invalidating it, as zswap_reclaim_entry() will make this check
> > > > > internally.
> > > > >
> > > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> > > > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > Signed-off-by: Yosry Ahmed
> > > > > ---
> > > > >  mm/zswap.c | 21 ++++++++++++---------
> > > > >  1 file changed, 12 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index 87b204233115..62195f72bf56 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > > >  {
> > > > >         if (!RB_EMPTY_NODE(&entry->rbnode)) {
> > > > >                 rb_erase(&entry->rbnode, root);
> > > > >                 RB_CLEAR_NODE(&entry->rbnode);
> > > > > +               return true;
> > > > >         }
> > > > > +       return false;
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
> > > > >         return NULL;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * If the entry is still valid in the tree, drop the initial ref and remove it
> > > > > + * from the tree. This function must be called with an additional ref held,
> > > > > + * otherwise it may race with another invalidation freeing the entry.
> > > > > + */
> > > >
> > > > On re-reading this comment there's one thing I'm not sure I get, do we
> > > > really need to hold an additional local ref to call this? As far as I
> > > > understood, once we check that the entry was in the tree before putting
> > > > the initial ref, there's no need for an additional local one.
> > >
> > > I believe it is, but please correct me if I am wrong.
> > > Consider the following scenario:
> > >
> > > // Initially refcount is at 1
> > >
> > > CPU#1:                              CPU#2:
> > > spin_lock(tree_lock)
> > > zswap_entry_get() // 2 refs
> > > spin_unlock(tree_lock)
> > >                                     spin_lock(tree_lock)
> > >                                     zswap_invalidate_entry() // 1 ref
> > >                                     spin_unlock(tree_lock)
> > > zswap_entry_put() // 0 refs
> > > zswap_invalidate_entry() // problem
> > >
> > > That last zswap_invalidate_entry() call in CPU#1 is problematic. The
> > > entry would have already been freed. If we check that the entry is on
> > > the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are
> > > reading already freed and potentially re-used memory.
> > >
> > > We would need to search the tree to make sure the same entry still
> > > exists in the tree (aka what zswap_reclaim_entry() currently does).
> > > This is not ideal in the fault path to have to do the lookups twice.
> >
> > Thanks for the clarification, it is indeed needed in that case. I was just
> > wondering if the wording of the comment is exact, in that before calling
> > zswap_invalidate_entry one has to ensure that the entry has not been freed, not
> > specifically by holding an additional reference, if a lookup can serve the same
> > purpose.
>
>
> I am wondering if the scenario below is possible, in which case a
> lookup would not be enough. Let me try to clarify. Let's assume in
> zswap_reclaim_entry() we drop the local ref early (before we
> invalidate the entry), and rely on the lookup to ensure that the entry
> was not freed:
>
> - On CPU#1, in zswap_reclaim_entry() we release the lock during IO.
> Let's assume we drop the local ref here and rely on the lookup to make
> sure the zswap entry wasn't freed.
> - On CPU#2, invalidates the swap entry. The zswap entry is freed
> (returned to the slab allocator).
> - On CPU#2, we try to reclaim another page, and allocates the same
> swap slot (same type and offset).
> - On CPU#2, a zswap entry is allocated, and the slab allocator happens
> to hand us the same zswap_entry we just freed.
> - On CPU#1, after IO is done, we lookup the tree to make sure that the
> zswap entry was not freed. We find the same zswap entry (same address)
> at the same offset, so we assume it was not freed.
> - On CPU#1, we invalidate the zswap entry that was actually used by CPU#2.
>
> I am not entirely sure if this is possible, perhaps locking in the
> swap layer will prevent the swap entry reuse, but it seems like
> relying on the lookup can be fragile, and we should rely on the local
> ref instead to reliably prevent freeing/reuse of the zswap entry.
>
> Please correct me if I missed something.

I think it is; we definitely need an additional reference to pin down the
entry. Sorry if I was being pedantic: my original doubt was only about the
wording of the comment, where it says that an additional reference must be
held. I was wondering whether that was strictly needed, and now I see that
it is :)

> >
> > >
> > > Also, in zswap_reclaim_entry(), would it be possible if we call
> > > zswap_invalidate_entry() after we drop the local ref that the swap
> > > entry has been reused for a different page? I didn't look closely, but
> > > if yes, then the slab allocator may have repurposed the zswap_entry
> > > and we may find the entry in the tree for the same offset, even though
> > > it is referring to a different page now. This sounds practically
> > > unlikely but perhaps theoretically possible.
> >
> > I'm not sure I understood the scenario, in zswap_reclaim_entry we keep a local
> > reference until the end in order to avoid a free.
>
>
> Right, I was just trying to reason about what might happen if we call
> zswap_invalidate_entry() after dropping the local ref, as I mentioned
> above.
>
> >
> > >
> > > I think it's more reliable to call zswap_invalidate_entry() on an
> > > entry that we know is valid before dropping the local ref.
> > > Especially that it's easy to do today by just moving a few lines
> > > around.
> > >
> > > > >
> > > > >  static void zswap_invalidate_entry(struct zswap_tree *tree,
> > > > >                                     struct zswap_entry *entry)
> > > > >  {
> > > > > -       /* remove from rbtree */
> > > > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > > > -
> > > > > -       /* drop the initial reference from entry creation */
> > > > > -       zswap_entry_put(tree, entry);
> > > > > +       if (zswap_rb_erase(&tree->rbroot, entry))
> > > > > +               zswap_entry_put(tree, entry);
> > > > >  }
> > > > >
> > > > >  static int zswap_reclaim_entry(struct zswap_pool *pool)
> > > > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
> > > > >          * swapcache. Drop the entry from zswap - unless invalidate already
> > > > >          * took it out while we had the tree->lock released for IO.
> > > > >          */
> > > > > -       if (entry == zswap_rb_search(&tree->rbroot, swpoffset))
> > > > > -               zswap_invalidate_entry(tree, entry);
> > > > > +       zswap_invalidate_entry(tree, entry);
> > > > >
> > > > >  put_unlock:
> > > > >         /* Drop local reference */
> > > > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> > > > >                 count_objcg_event(entry->objcg, ZSWPIN);
> > > > >  freeentry:
> > > > >         spin_lock(&tree->lock);
> > > > > -       zswap_entry_put(tree, entry);
> > > > >         if (!ret && zswap_exclusive_loads_enabled) {
> > > > >                 zswap_invalidate_entry(tree, entry);
> > > > >                 *exclusive = true;
> > > > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> > > > >                 list_move(&entry->lru, &entry->pool->lru);
> > > > >                 spin_unlock(&entry->pool->lru_lock);
> > > > >         }
> > > > > +       zswap_entry_put(tree, entry);
> > > > >         spin_unlock(&tree->lock);
> > > > >
> > > > >         return ret;
> > > > > --
> > > > > 2.41.0.162.gfafddb0af9-goog
> > > > >

Reviewed-by: Domenico Cerasuolo