From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 21 Jun 2023 14:22:49 -0700
Subject: Re: [PATCH] mm: zswap: fix double invalidate with exclusive loads
To: Domenico Cerasuolo
Cc: Andrew Morton, Hyeonggon Yoo <42.hyeyoo@gmail.com>, Konrad Rzeszutek Wilk,
 Seth Jennings, Dan Streetman, Vitaly Wool, Johannes Weiner, Nhat Pham,
 Yu Zhao, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20230621093009.637544-1-yosryahmed@google.com>
On Wed, Jun 21, 2023 at 12:36 PM Domenico Cerasuolo wrote:
>
> On Wed, Jun 21, 2023 at 7:26 PM Yosry Ahmed wrote:
> >
> > On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo
> > wrote:
> > >
> > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed wrote:
> > > >
> > > > If exclusive loads are enabled for zswap, we invalidate the entry before
> > > > returning from zswap_frontswap_load(), after dropping the local
> > > > reference.
> > > > However, the tree lock is dropped during decompression after
> > > > the local reference is acquired, so the entry could be invalidated
> > > > before we drop the local ref. If this happens, the entry is freed once
> > > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate
> > > > an already freed entry.
> > > >
> > > > Fix this by:
> > > > (a) Making sure zswap_invalidate_entry() is always called with a local
> > > >     ref held, to avoid being called on a freed entry.
> > > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry
> > > >     was actually on the rbtree. Otherwise, another invalidation could
> > > >     have already happened, and the initial ref is already dropped.
> > > >
> > > > With these changes, there is no need to make sure the entry still exists
> > > > in the tree in zswap_reclaim_entry() before invalidating it, as
> > > > zswap_invalidate_entry() will make this check internally.
> > > >
> > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> > > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > ---
> > > >  mm/zswap.c | 21 ++++++++++++---------
> > > >  1 file changed, 12 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 87b204233115..62195f72bf56 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> > > >         return 0;
> > > >  }
> > > >
> > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > >  {
> > > >         if (!RB_EMPTY_NODE(&entry->rbnode)) {
> > > >                 rb_erase(&entry->rbnode, root);
> > > >                 RB_CLEAR_NODE(&entry->rbnode);
> > > > +               return true;
> > > >         }
> > > > +       return false;
> > > >  }
> > > >
> > > >  /*
> > > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
> > > >         return NULL;
> > > >  }
> > > >
> > > > +/*
> > > > + * If the entry is still valid in the tree, drop the initial ref and remove it
> > > > + * from the tree. This function must be called with an additional ref held,
> > > > + * otherwise it may race with another invalidation freeing the entry.
> > > > + */
> > >
> > > On re-reading this comment there's one thing I'm not sure I get: do we
> > > really need to hold an additional local ref to call this? As far as I
> > > understood, once we check that the entry was in the tree before putting
> > > the initial ref, there's no need for an additional local one.
> >
> > I believe it is, but please correct me if I am wrong. Consider the
> > following scenario:
> >
> > // Initially refcount is at 1
> >
> > CPU#1:                             CPU#2:
> > spin_lock(tree_lock)
> > zswap_entry_get() // 2 refs
> > spin_unlock(tree_lock)
> >                                    spin_lock(tree_lock)
> >                                    zswap_invalidate_entry() // 1 ref
> >                                    spin_unlock(tree_lock)
> >                                    zswap_entry_put() // 0 refs
> > zswap_invalidate_entry() // problem
> >
> > That last zswap_invalidate_entry() call in CPU#1 is problematic. The
> > entry would have already been freed. If we check that the entry is on
> > the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are
> > reading already freed and potentially re-used memory.
> >
> > We would need to search the tree to make sure the same entry still
> > exists in the tree (aka what zswap_reclaim_entry() currently does).
> > This is not ideal in the fault path to have to do the lookups twice.
>
> Thanks for the clarification, it is indeed needed in that case. I was just
> wondering if the wording of the comment is exact, in that before calling
> zswap_invalidate_entry() one has to ensure that the entry has not been freed,
> not specifically by holding an additional reference, if a lookup can serve
> the same purpose.
I am wondering if the scenario below is possible, in which case a lookup
would not be enough. Let me try to clarify. Let's assume in
zswap_reclaim_entry() we drop the local ref early (before we invalidate
the entry), and rely on the lookup to ensure that the entry was not freed:

- On CPU#1, in zswap_reclaim_entry() we release the lock during IO. Let's
  assume we drop the local ref here and rely on the lookup to make sure
  the zswap entry wasn't freed.
- On CPU#2, the swap entry is invalidated. The zswap entry is freed
  (returned to the slab allocator).
- On CPU#2, we try to reclaim another page, and allocate the same swap
  slot (same type and offset).
- On CPU#2, a zswap entry is allocated, and the slab allocator happens to
  hand us the same zswap_entry we just freed.
- On CPU#1, after IO is done, we look up the tree to make sure that the
  zswap entry was not freed. We find the same zswap entry (same address)
  at the same offset, so we assume it was not freed.
- On CPU#1, we invalidate the zswap entry that was actually used by CPU#2.

I am not entirely sure if this is possible; perhaps locking in the swap
layer will prevent the swap entry reuse. But it seems like relying on the
lookup can be fragile, and we should rely on the local ref instead to
reliably prevent freeing/reuse of the zswap entry. Please correct me if I
missed something.

>
> >
> > Also, in zswap_reclaim_entry(), would it be possible if we call
> > zswap_invalidate_entry() after we drop the local ref that the swap
> > entry has been reused for a different page? I didn't look closely, but
> > if yes, then the slab allocator may have repurposed the zswap_entry
> > and we may find the entry in the tree for the same offset, even though
> > it is referring to a different page now. This sounds practically
> > unlikely but perhaps theoretically possible.
>
> I'm not sure I understood the scenario, in zswap_reclaim_entry we keep a
> local reference until the end in order to avoid a free.
Right, I was just trying to reason about what might happen if we call
zswap_invalidate_entry() after dropping the local ref, as I mentioned
above.

>
> >
> > I think it's more reliable to call zswap_invalidate_entry() on an
> > entry that we know is valid before dropping the local ref. Especially
> > since it's easy to do today by just moving a few lines around.
> >
> > > >
> > > >
> > > >  static void zswap_invalidate_entry(struct zswap_tree *tree,
> > > >                                     struct zswap_entry *entry)
> > > >  {
> > > > -       /* remove from rbtree */
> > > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > > -
> > > > -       /* drop the initial reference from entry creation */
> > > > -       zswap_entry_put(tree, entry);
> > > > +       if (zswap_rb_erase(&tree->rbroot, entry))
> > > > +               zswap_entry_put(tree, entry);
> > > >  }
> > > >
> > > >  static int zswap_reclaim_entry(struct zswap_pool *pool)
> > > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
> > > >          * swapcache. Drop the entry from zswap - unless invalidate already
> > > >          * took it out while we had the tree->lock released for IO.
> > > >          */
> > > > -       if (entry == zswap_rb_search(&tree->rbroot, swpoffset))
> > > > -               zswap_invalidate_entry(tree, entry);
> > > > +       zswap_invalidate_entry(tree, entry);
> > > >
> > > > put_unlock:
> > > >         /* Drop local reference */
> > > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> > > >         count_objcg_event(entry->objcg, ZSWPIN);
> > > > freeentry:
> > > >         spin_lock(&tree->lock);
> > > > -       zswap_entry_put(tree, entry);
> > > >         if (!ret && zswap_exclusive_loads_enabled) {
> > > >                 zswap_invalidate_entry(tree, entry);
> > > >                 *exclusive = true;
> > > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> > > >                 list_move(&entry->lru, &entry->pool->lru);
> > > >                 spin_unlock(&entry->pool->lru_lock);
> > > >         }
> > > > +       zswap_entry_put(tree, entry);
> > > >         spin_unlock(&tree->lock);
> > > >
> > > >         return ret;
> > > > --
> > > > 2.41.0.162.gfafddb0af9-goog
> > > >