From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2ED5FC47258 for ; Thu, 25 Jan 2024 07:59:55 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AE43D8D0017; Thu, 25 Jan 2024 02:59:54 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A77E58D000C; Thu, 25 Jan 2024 02:59:54 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8E8C68D0017; Thu, 25 Jan 2024 02:59:54 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 757638D000C for ; Thu, 25 Jan 2024 02:59:54 -0500 (EST) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 1239016014A for ; Thu, 25 Jan 2024 07:59:54 +0000 (UTC) X-FDA: 81717084708.12.FDA4681 Received: from mail-ej1-f49.google.com (mail-ej1-f49.google.com [209.85.218.49]) by imf18.hostedemail.com (Postfix) with ESMTP id 3C1BD1C0014 for ; Thu, 25 Jan 2024 07:59:52 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Nl7iNXwR; spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.49 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706169592; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=y1H6lVGRWuFfz+Z1+0jRYL727Z9vnxtvrMU1rUX5Q8w=; b=3dLdlnxysSv24CKtg7IDxuav4M8c46OiqoGOLxRZPVi45gBZUrkzvRy9F642b1DRglmtgr 3N23ttsU140fhYZYZS3InObBHzg48qBS8GQT3KQ26qQl+F8S6TzN6D6SFHIzB7yT2N6ZX1 zMK8WPLmj5mVSPvT074SGY/cJ434Biw= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706169592; a=rsa-sha256; cv=none; b=z+32LbCPIGo/oyMKSVebhTPbI/xLcwvYkB0uY3YN4ZH7ecvwqSfw/6eEosDmQBwW7TpC2d GQa/C2AvfJerr0WjMHg0uSV92CxNq6cJ+Hj1Z/D2UV/92U3y7eBGYoj/UmMfY5eNmJMgfq +iPVTtxAJUKVvv0yXOd+YAHFV/C77E4= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Nl7iNXwR; spf=pass (imf18.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.49 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-ej1-f49.google.com with SMTP id a640c23a62f3a-a26fa294e56so648153066b.0 for ; Wed, 24 Jan 2024 23:59:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1706169591; x=1706774391; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=y1H6lVGRWuFfz+Z1+0jRYL727Z9vnxtvrMU1rUX5Q8w=; b=Nl7iNXwRd4NawVSUZsyz1ONaWfF13FimPmVx4ih4+whVPM8dO6Fw0wV69HYatLmtFe ZioVv+jMmwLS+ZpXqYyHNEcdS9O5kwauHZSD4XSbd/zbQCHVIESGG55xNQp2i/hgh/XO kzTa89uHQA1UFiXtM57QNzOjJVjZwolFMyLjyayYuIPQau99o2gLTKCc4802YlhwiFVB xEUHPXOKYhlydIznhn/nTipD9vTpzmewqAmY0UDgfbq+6Ksbm1cqXwXR0R6KoEdks+dP WeMt+3XG96OR9Ol3o/4wb9Juw/oQ/8W+kghkvOw/Z3G36/smAJsNjjulpnP4nqAsFLnF w+CA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706169591; x=1706774391; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=y1H6lVGRWuFfz+Z1+0jRYL727Z9vnxtvrMU1rUX5Q8w=; b=Vrw9IdgiQFM3dlQKituSz279MUF3ia+RPskpQBBUE+UjgoyYm0st9xTP0Bl90xS8RI traG50GYqdvh7AbarLEakOJwUWoHWCKD7hoEm51jFq0WwO5x7cnwgc5tU8ZfLISUVom/ yI5YJTgSObg44Joce5mw6IrFyhCFyv67DCM/110lbKFsQWxrV6K7Pwmbse399bAGOOJM 7MCMiy1dUFuoATC7fXqf0z4wj8nTmOIQ4yf2FDxBc1UNpe9KB1KV2DX6FI2htHpmlNy9 /LuXLtTa+SFqKWalu97yEjwOp1TEjs5/NUH4Oww9M1zy+/MyDm2HB+2TLUCjnAdqorhX nG4w== X-Gm-Message-State: AOJu0YwpdDxb334mwmAdSiFOX92GxLvf8I4sdUXr5mu9dN4QPbtr/vNF yx5vIAXkewTlMDaRU7mI3ZHZNitKPV5ZlXssVuk5thigkSbWMvW8jWU/8tFLCcR/6wYdOYmdwLv 5GDvYaVJ5XU1GD+wZ4kxd1srI0DqFaTYQYx3C X-Google-Smtp-Source: AGHT+IEvyok96y4xvtk9mejVkFC+Yj+gray9xkaI9A2frC6J9UtpcfEu2BzocGyZkRxBJTYr6tHNwqdIbQMto9kX8ow= X-Received: by 2002:a17:906:b78c:b0:a2c:88d3:754c with SMTP id dt12-20020a170906b78c00b00a2c88d3754cmr349013ejb.40.1706169590452; Wed, 24 Jan 2024 23:59:50 -0800 (PST) MIME-Version: 1.0 References: <20240120024007.2850671-1-yosryahmed@google.com> <20240120024007.2850671-3-yosryahmed@google.com> <20240122201906.GA1567330@cmpxchg.org> <20240123153851.GA1745986@cmpxchg.org> <20240123201234.GC1745986@cmpxchg.org> In-Reply-To: From: Yosry Ahmed Date: Wed, 24 Jan 2024 23:59:14 -0800 Message-ID: Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() To: Chris Li Cc: Johannes Weiner , Andrew Morton , Nhat Pham , Chengming Zhou , Huang Ying , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: zc9ss77g7wn6dy51km6ixx3msw7ms5eu X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 3C1BD1C0014 X-Rspam-User: X-HE-Tag: 1706169591-105688 X-HE-Meta: U2FsdGVkX18ov9rab/5Z0VOxRP1YOvlHsjCV4Dy7AedVTjNI/GT3YLJ9RLxZBHnHUxTpmCo5GXm69ZlZqrHRDmek7y+FA+M/gOrjQAz4mYa8j6tl0eZIVx7JUhmNVSGtkqgZ6BcG4FHyR0QTGcNrLBDgWeMtlPpVY0kL1Rf0sjfaTIUPB3AR5OcgOmNJz5dKLgahyU3SelpWdkPwPISc+l1emmvtyTZU+71xwCnfwmoXr2e3UZCn+n99r+UhuNoC0kb7/rHD+94V0/nmuXiG8uPkiTMot8zaZVTvbB1ILoeO3s8E+v+Lz6BnYfqVRu+4UGn/RspeyLjtpI5MbCHIvJIJdWWzhlpeZgyRBQvGroQi5jaYovZ+r4B0ZTbgdBnQZxMpzjGXUR7o9Myr9HNQ2NBy71hrCXZ7A+9Rxv6CYpFnUOfI/zWGOeZ3D8V2YzcRGVgyLDbE36cAPxFcAWr1x9TsfidUyWNgKZHLq4GKKBut2VWblzkL+OwUvueesR2b8kS8oA9lBms6NkX+SQtb6+GXRneT+ZVrdeoUpmnbflyLuXR7iV+coxk7d7PWrPGkR+sF3sJMOPsCuaVoZl4g2E8UY84kHDrv+Bymq6KmH/dSS/k8+16TXEo0/WsVmeFiRSRbiOYUP2LLp3DXzNwWxLYZaE8judQv9TCxPJh+oEkMrKcCYjM0O2hFoooN2VC8B6eqGuTjzOmn1JbmElzguGOBuQ87HJ/RS0ip8J3byiZHti02zbaZUIwLrZObHyDFKcoi9qdh7xObBj6447d23b15VI99BKnowG19fLSWENzSs5zfkhamvgIB0wtjtHbj0gXzXUkRM2DU6UZmI4rP+zPk3kyDAVlYemzUk9ogqvmzLFSC/VHAwIajiaeKLBAXupUihkNr2O9EgSxQntovMK/FwucKnMnE2rHKhHHhXTcuutAgpJFZOA+ggBPjadqMnMhM4NDzLlSBK17xusj /lEbAHJ1 D5Y7bmECaEHOoIh6N5OYUofVbBhXO9EjDHezQqUnE+Yecil9t7xLMMZ6Leg2EwMXsUk7OE33EStMHmDH0gN0xGO/2mDusZ4W6kNLFZuQUkdrYltpp6FX9jyQ5Xf6a3yaE9NWAc8AWJc7OT5E7MOl9aZhzg1wz6NtADIzZ3GS4kbO5a4WQKrCKgISlEC3nReKdOWH3hz9BfdxI+FuKU+XbczBZdVm3rGo1D0Hr8I9UaLykZ8/+XWQJDZj4w3/r2Eddm62ADfovu5b6eZ8Bk8Zru1wwmkPckFsiGvx7Xzgl0MHOtH0ltEQdaJ5GLgZhSxX1EZhuPOcPCBmwEIpP1vrkr/W8yBI4QMopV8m5prHCJeTLQfPwhtWfvEv+cw7A4dWwXcE0hVl2mvSO6CQJqRoJTHYuBQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Jan 24, 2024 at 9:29=E2=80=AFPM Chris Li wrot= e: > > Hi Yosry, > > On Tue, Jan 23, 2024 at 10:58=E2=80=AFPM Yosry Ahmed wrote: > > > > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race mysel= f :) > > > > > > The first solution that came to mind for me was refcounting the zswap > > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > > the percpu-refcount may be an overkill in terms of memory usage > > > though. I think we can still do our own refcounting with RCU, but it > > > may be more complicated. > > > > FWIW, I was able to reproduce the problem in a vm with the following > > kernel diff: > > Thanks for the great find. > > I was worry about the usage after free situation in this email: > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9= OFiUG6g@mail.gmail.com/ > > Glad you are able to find a reproducible case. That is one of the > reasons I change the free to invalidate entries in my xarray patch. > > I think the swap_off code should remove the entry from the tree, just > wait for each zswap entry to drop to zero. Then free it. This doesn't really help. The swapoff code is already removing all the entries from the trees before zswap_swapoff() is called through zswap_invalidate(). The race I described occurs because the writeback code is accessing the entries through the LRU, not the tree. The writeback code could have isolated a zswap entry from the LRU before swapoff, then tried to access it after swapoff. Although the zswap entry itself is referenced and safe to use, accessing the tree to grab the tree lock and check if the entry is still in the tree is the problem. > > That way you shouldn't need to refcount the tree. The tree refcount is > effectively the combined refcount of all the zswap entries. The problem is that given a zswap entry, you have no way to stabilize the zswap tree before trying to deference it with the current code. Chengming's suggestion of moving the swap cache pin before accessing the tree seems like the right way to go. > Having refcount on the tree would be very high contention. A percpu refcount cannot be contended by definition :)