From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 25 Jan 2024 14:33:59 -0800
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
To: Chris Li
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Chengming Zhou, Huang Ying, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240120024007.2850671-1-yosryahmed@google.com> <20240120024007.2850671-3-yosryahmed@google.com> <20240122201906.GA1567330@cmpxchg.org> <20240123153851.GA1745986@cmpxchg.org> <20240123201234.GC1745986@cmpxchg.org>
On Thu, Jan 25, 2024 at 2:31 PM Chris Li wrote:
>
> Hi Yosry,
>
> On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed wrote:
> >
> > On Thu, Jan 25, 2024 at 10:55 AM Chris Li wrote:
> > >
> > > Hi Yosry,
> > >
> > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed wrote:
> > > >
> > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed wrote:
> > > > > > >
> > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> > > > > > >
> > > > > > > The first solution that came to mind for me was refcounting the zswap
> > > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > > > > > > the percpu-refcount may be overkill in terms of memory usage,
> > > > > > > though. I think we can still do our own refcounting with RCU, but it
> > > > > > > may be more complicated.
> > > > > >
> > > > > > FWIW, I was able to reproduce the problem in a VM with the following
> > > > > > kernel diff:
> > > > >
> > > > > Thanks for the great find.
> > > > >
> > > > > I was worried about the use-after-free situation in this email:
> > > > >
> > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/
> > > > >
> > > > > Glad you were able to find a reproducible case. That is one of the
> > > > > reasons I changed the free to invalidating entries in my xarray patch.
> > > > >
> > > > > I think the swapoff code should remove the entry from the tree, then
> > > > > wait for each zswap entry's refcount to drop to zero before freeing it.
> > > >
> > > > This doesn't really help. The swapoff code is already removing all the
> > > > entries from the trees before zswap_swapoff() is called, through
> > > > zswap_invalidate(). The race I described occurs because the writeback
> > > > code is accessing the entries through the LRU, not the tree. The
> > >
> > > Why?
> > > An entry not being in the tree is fine. What you describe is that the
> > > swapoff code will not see the entry because it is already not in the tree.
> > > The last holder of the reference count would free it.
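The tree-refcount scheme quoted above (init in zswap_swapon(), kill in zswap_swapoff(), writeback taking a reference before touching the tree) can be sketched as a userspace model. The names here (`ZswapTree`, `tryget`, `kill_and_wait`) are illustrative assumptions, not kernel code:

```python
import threading

class ZswapTree:
    """Userspace model of a refcounted per-swapfile tree."""
    def __init__(self):
        self.refs = 1                      # swapon holds the initial reference
        self.dead = False                  # set by swapoff ("kill")
        self.lock = threading.Lock()
        self.drained = threading.Event()   # signalled when refs hits zero

    def tryget(self):
        """Writeback takes a reference; fails once swapoff has killed the tree."""
        with self.lock:
            if self.dead:
                return False
            self.refs += 1
            return True

    def put(self):
        with self.lock:
            self.refs -= 1
            if self.refs == 0:
                self.drained.set()

    def kill_and_wait(self):
        """Swapoff: forbid new refs, drop the initial ref, wait for writeback."""
        with self.lock:
            self.dead = True
        self.put()
        self.drained.wait()                # tree may be freed only after this

tree = ZswapTree()
assert tree.tryget()                       # concurrent writeback holds a ref
t = threading.Thread(target=tree.kill_and_wait)
t.start()                                  # swapoff blocks on the writeback ref
tree.put()                                 # writeback finishes, drops last ref
t.join()                                   # only now is it safe to free the tree
```

In the kernel the get/put would be percpu_ref_tryget()/percpu_ref_put() (or RCU plus an atomic) so the writeback fast path stays cheap; the point is only that zswap_swapoff() cannot free the tree until every outstanding writeback reference is dropped.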
> > > >
> > > > writeback code could have isolated a zswap entry from the LRU before
> > > > swapoff, then tried to access it after swapoff. Although the zswap
> > >
> > > The writeback code should hold a reference count on the zswap entry.
> > > The entry will not be freed until the LRU walk is done and drops the
> > > reference count to zero.
> > >
> > > > entry itself is referenced and safe to use, accessing the tree to grab
> > > > the tree lock and check if the entry is still in the tree is the
> > > > problem.
> > >
> > > The swapoff code should wait until all of the LRU list from that tree
> > > has been consumed before destroying the tree.
> > > Swapoff walks all the process MMs anyway; walking all the memcgs
> > > and finding all the zswap entries in the zswap LRU should solve that
> > > problem.
> >
> > At that point, the entry is isolated from the zswap LRU list as well.
> > So even if swapoff iterates the zswap LRUs, it cannot find it to wait
>
> That just means we need to defer removing the entry from the LRU, and
> only remove it after most of the writeback is complete, up to some
> critical steps.
>
> > for it. Take a closer look at the race condition I described. The
>
> I took a closer look at the sequence Chengming describes; it has the
> element of delaying removal of the entry from the LRU as well.
>
> > problem is that after the entry is isolated from the zswap LRU, we
> > need to grab the tree lock to make sure it's still there and get a
> > ref, and just trying to lock the tree may be a UAF if we race with
> > swapoff.
>
> I feel it is very wrong to have the tree freed while there are
> outstanding entries allocated from the tree still pending.
> I would want to avoid that situation if possible.

This should be the case with Chengming's solution.

> > > Anyway, I think it is easier to reason about the behavior that way.
> > > Swapoff will take the extra hit, but not the normal accesses of the
> > > tree.
> >
> > I think this adds a lot of unnecessary complexity tbh. I think the
> > operations reordering that Chengming described may be the way to go
> > here. It does not include adding more logic or synchronization
>
> It does not require adding the tree reference count, right?

No, just a reordering of operations in the writeback path.

> > primitives, just reordering operations to be closer to what we do in
> > zswap_load() and leveraging the existing synchronization.
>
> The complexity is mostly for avoiding the tree reference count. If we
> don't add the tree refcount and we don't need the extra complexity of
> the tree waiting for the LRU, that sounds great to me.

I think that's what Chengming's proposal provides.
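As a rough userspace model of what "reordering to be closer to zswap_load()" buys: the writeback path re-validates the entry under the tree lock before using it, so a racing invalidation is detected and writeback bails out instead of acting on stale state. The names (`Tree`, `writeback`) and structure are assumptions for illustration, not the actual patch:

```python
import threading

class Tree:
    """Userspace stand-in for a per-swapfile zswap tree."""
    def __init__(self):
        self.lock = threading.Lock()
        self.slots = {}                    # offset -> entry

    def store(self, offset, entry):
        with self.lock:
            self.slots[offset] = entry

    def invalidate(self, offset):          # what swapoff does for each entry
        with self.lock:
            self.slots.pop(offset, None)

def writeback(tree, offset, stale_entry):
    # Reordered path: after isolating an entry from the LRU, re-look it up
    # under the tree lock (as zswap_load() does) before trusting it.
    with tree.lock:
        entry = tree.slots.get(offset)
        if entry is not stale_entry:
            return False                   # raced with invalidation: skip
        # ... would write the entry back here, then drop it from the tree ...
        del tree.slots[offset]
        return True

tree = Tree()
e = object()
tree.store(7, e)
assert writeback(tree, 7, e) is True       # normal path succeeds
tree.store(7, e)
tree.invalidate(7)                         # concurrent invalidate/swapoff wins
assert writeback(tree, 7, e) is False      # stale entry detected, bail cleanly
```

This model does not by itself cover the tree structure being freed under the lock; in the thread's framing, that part is handled by making swapoff unable to proceed while writeback is in flight, which is what the reordering is meant to guarantee.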