From: Nhat Pham <nphamcs@gmail.com>
Date: Fri, 26 Jan 2024 11:31:40 -0800
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
To: Johannes Weiner
Cc: chengming.zhou@linux.dev, yosryahmed@google.com,
    akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Chengming Zhou
In-Reply-To: <20240126153126.GG1567330@cmpxchg.org>
References: <20240126083015.3557006-1-chengming.zhou@linux.dev>
    <20240126083015.3557006-2-chengming.zhou@linux.dev>
    <20240126153126.GG1567330@cmpxchg.org>
On Fri, Jan 26, 2024 at 7:31 AM Johannes Weiner wrote:
>
> On Fri, Jan 26, 2024 at 08:30:15AM +0000, chengming.zhou@linux.dev wrote:
> > From: Chengming Zhou
> >
> > LRU writeback has a race problem with swapoff, as spotted by Yosry[1]:
> >
> > CPU1                           CPU2
> > shrink_memcg_cb                swap_off
> >   list_lru_isolate             zswap_invalidate
> >                                  zswap_swapoff
> >                                    kfree(tree)
> >   // UAF
> >   spin_lock(&tree->lock)
> >
> > The problem is that the entry in the lru list can't protect the tree
> > from being swapped off and freed, and the entry can also be
> > invalidated and freed concurrently after we unlock the lru lock.
> >
> > We can fix it by moving the swap cache allocation ahead, before
> > referencing the tree, then checking for the invalidate race with the
> > tree lock; only after that can we safely deref the entry. Note we
> > can't deref the entry or tree anymore after we unlock the folio,
> > since we depend on the folio lock to hold off swapoff.
>
> This is a great simplification on top of being a bug fix.
>
> > So this patch moves all tree and entry usage into
> > zswap_writeback_entry(); we only use the copied swpentry on the stack
> > to allocate the swap cache and return with the folio locked, after
> > which we can reference the tree. Then check the invalidate race with
> > the tree lock; the following steps are much the same as in
> > zswap_load().
> >
> > Since we can't deref the entry after zswap_writeback_entry(), we
> > can't use zswap_lru_putback() anymore; instead we rotate the entry
> > in the LRU list so concurrent reclaimers have little chance to see
> > it. Or it will be deleted from the LRU list if writeback succeeds.
> >
> > Another confusing part to me is the update of memcg nr_zswap_protected
> > in zswap_lru_putback(). I'm not sure why it's needed here, since
> > if we raced with swapin, memcg nr_zswap_protected has already been
> > updated in zswap_folio_swapin(). So this part is not included for now.
>
> Good observation.
>
> Technically, it could also fail on -ENOMEM, but in practice these size
> allocations don't fail, especially since the shrinker runs in
> PF_MEMALLOC context. The shrink_worker might be affected, since it
> doesn't run in that context. But the common case is -EEXIST, which
> indeed double counts.

Yup. At the time, I was thinking more along the lines of what
mechanisms should trigger a protection size increase. "swapin" and
"LRU list rotations" were two different mechanisms in my head. I was
aware that there could be double counting, but deemed it OK, as the
cost of over-shrinking (an increase in swapins) was fairly serious,
and we have a fairly aggressive decaying strategy if we protect too
much. But yes, I doubt it mattered that much in hindsight :)

And the cases where it is double counted far outnumber the cases where
it is not, so I'm fine with removing it here.

> To make it "correct", you'd have to grab an objcg reference with the
> LRU lock held, and also re-order the objcg put on entry freeing after
> the LRU del. This is probably not worth doing. But it could use a
> comment.
>
> I was going to ask if you could reorder objcg uncharging after LRU
> deletion to make it more robust for future changes in that direction.
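That reordering sounds good to me, and looks mechanical enough. If I'm
reading the free path right, it would amount to something like this in
zswap_free_entry() (a rough sketch against the snippet you quote below;
untested, and I may be off on the exact helpers around the zpool free):

        if (!entry->length)
                atomic_dec(&zswap_same_filled_pages);
        else {
                /*
                 * zswap_lru_del() needs entry->objcg to look up the
                 * list_lru memcg, so unlink from the LRU before the
                 * objcg put below, not after.
                 */
                zswap_lru_del(&entry->pool->list_lru, entry);
                zpool_free(zswap_find_zpool(entry), entry->handle);
                zswap_pool_put(entry->pool);
        }
        if (entry->objcg) {
                obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
                obj_cgroup_put(entry->objcg);
        }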
> However, staring at this, I notice this is a second UAF bug:
>
>        if (entry->objcg) {
>                obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
>                obj_cgroup_put(entry->objcg);
>        }
>        if (!entry->length)
>                atomic_dec(&zswap_same_filled_pages);
>        else {
>                zswap_lru_del(&entry->pool->list_lru, entry);
>
> zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
> the put may have killed it. I'll send a separate patch on top.
>
> > @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> > {
> >         struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >         bool *encountered_page_in_swapcache = (bool *)arg;
> > -       struct zswap_tree *tree;
> > -       pgoff_t swpoffset;
> > +       swp_entry_t swpentry;
> >         enum lru_status ret = LRU_REMOVED_RETRY;
> >         int writeback_result;
> >
> > +       /*
> > +        * First rotate to the tail of lru list before unlocking lru lock,
> > +        * so the concurrent reclaimers have little chance to see it.
> > +        * It will be deleted from the lru list if writeback success.
> > +        */
> > +       list_move_tail(item, &l->list);
>
> We don't hold a reference to the object, so there could also be an
> invalidation waiting on the LRU lock, which will free the entry even
> when writeback fails.
>
> It would also be good to expand on the motivation, because it's not
> clear WHY you'd want to hide it from other reclaimers.
>
> Lastly, maybe mention the story around temporary failures? Most
> shrinkers have a lock inversion pattern (object lock -> LRU lock for
> linking versus LRU lock -> object trylock during reclaim) that can
> fail and require the same object be tried again before advancing.
>
> How about this?
>
>        /*
>         * Rotate the entry to the tail before unlocking the LRU,
>         * so that in case of an invalidation race concurrent
>         * reclaimers don't waste their time on it.
>         *
>         * If writeback succeeds, or failure is due to the entry
>         * being invalidated by the swap subsystem, the invalidation
>         * will unlink and free it.
>         *
>         * Temporary failures, where the same entry should be tried
>         * again immediately, almost never happen for this shrinker.
>         * We don't do any trylocking; -ENOMEM comes closest,
>         * but that's extremely rare and doesn't happen spuriously
>         * either. Don't bother distinguishing this case.
>         *
>         * But since they do exist in theory, the entry cannot just
>         * be unlinked, or we could leak it. Hence, rotate.
>         */
>
> Otherwise, looks great to me.
>
> Acked-by: Johannes Weiner
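The suggested comment reads great to me as well. For my own notes, the
overall shape of shrink_memcg_cb() after this patch would be roughly
the following - a schematic sketch, not the actual diff; the
zswap_writeback_entry() signature and the error-handling details here
are my assumptions from this thread:

static enum lru_status shrink_memcg_cb(struct list_head *item,
                                       struct list_lru_one *l,
                                       spinlock_t *lock, void *arg)
{
        struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
        bool *encountered_page_in_swapcache = (bool *)arg;
        swp_entry_t swpentry;
        enum lru_status ret = LRU_REMOVED_RETRY;
        int writeback_result;

        /* Rotate first, per the comment discussed above. */
        list_move_tail(item, &l->list);

        /*
         * Copy the swp_entry_t to the stack. Once the LRU lock is
         * dropped, the entry and the tree can be freed by a concurrent
         * invalidation or swapoff, so neither may be dereferenced
         * until the folio is locked inside zswap_writeback_entry().
         */
        swpentry = entry->swpentry;
        spin_unlock(lock);

        writeback_result = zswap_writeback_entry(entry, swpentry);
        if (writeback_result) {
                zswap_reject_reclaim_fail++;
                ret = LRU_RETRY;

                /*
                 * -EEXIST (folio already in the swap cache) means we
                 * are shrinking into the warmer region; signal the
                 * dynamic shrinker to stop.
                 */
                if (writeback_result == -EEXIST && encountered_page_in_swapcache)
                        *encountered_page_in_swapcache = true;
        } else {
                zswap_written_back_pages++;
        }

        spin_lock(lock);
        return ret;
}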