From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 29 Jan 2024 22:17:27 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Chengming Zhou
Cc: Yosry Ahmed, Nhat Pham, Andrew Morton, Chris Li,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff
Message-ID: <20240130031727.GA780575@cmpxchg.org>
References: <20240126-zswap-writeback-race-v2-0-b10479847099@bytedance.com>
 <20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com>
 <527bd543-97a5-4262-be73-6a5d21c2f896@bytedance.com>
In-Reply-To: <527bd543-97a5-4262-be73-6a5d21c2f896@bytedance.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >>  {
> >>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >>  	bool *encountered_page_in_swapcache =
> >> 		(bool *)arg;
> >> -	struct zswap_tree *tree;
> >> -	pgoff_t swpoffset;
> >> +	swp_entry_t swpentry;
> >>  	enum lru_status ret = LRU_REMOVED_RETRY;
> >>  	int writeback_result;
> >>
> >> +	/*
> >> +	 * Rotate the entry to the tail before unlocking the LRU,
> >> +	 * so that in case of an invalidation race concurrent
> >> +	 * reclaimers don't waste their time on it.
> >> +	 *
> >> +	 * If writeback succeeds, or failure is due to the entry
> >> +	 * being invalidated by the swap subsystem, the invalidation
> >> +	 * will unlink and free it.
> >> +	 *
> >> +	 * Temporary failures, where the same entry should be tried
> >> +	 * again immediately, almost never happen for this shrinker.
> >> +	 * We don't do any trylocking; -ENOMEM comes closest,
> >> +	 * but that's extremely rare and doesn't happen spuriously
> >> +	 * either. Don't bother distinguishing this case.
> >> +	 *
> >> +	 * But since they do exist in theory, the entry cannot just
> >> +	 * be unlinked, or we could leak it. Hence, rotate.
> >
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
>
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)

I agree with the suggestion of a central point to document this.

How about something like this:

	/*
	 * As soon as we drop the LRU lock, the entry can be freed by
	 * a concurrent invalidation. This means the following:
	 *
	 * 1. We extract the swp_entry_t to the stack, allowing
	 *    zswap_writeback_entry() to pin the swap entry and
	 *    then validate the zswap entry against that swap entry's
	 *    tree using pointer value comparison. Only when that
	 *    is successful can the entry be dereferenced.
	 *
	 * 2. Usually, objects are taken off the LRU for reclaim. In
	 *    this case this isn't possible, because if reclaim fails
	 *    for whatever reason, we have no means of knowing if the
	 *    entry is alive to put it back on the LRU.
	 *
	 *    So rotate it before dropping the lock. If the entry is
	 *    written back or invalidated, the free path will unlink
	 *    it. For failures, rotation is the right thing as well.
	 *
	 *    Temporary failures, where the same entry should be tried
	 *    again immediately, almost never happen for this shrinker.
	 *    We don't do any trylocking; -ENOMEM comes closest,
	 *    but that's extremely rare and doesn't happen spuriously
	 *    either. Don't bother distinguishing this case.
	 */

> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
>
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
>
> /*
>  * folio is locked, and the swapcache is now secured against
>  * concurrent swapping to and from the slot. Verify that the
>  * swap entry hasn't been invalidated and recycled behind our
>  * backs (our zswap_entry reference doesn't prevent that), to
>  * avoid overwriting a new swap folio with old compressed data.
>  */

The bit in () is now stale, since we're not even holding a ref ;)

Otherwise, a brief note that the entry cannot be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.

Can I ask a favor, though?

For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error-prone to rebase. I'd
like to avoid that if at all possible.
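For readers following along, the deref dance discussed above can be modeled
in a few lines of userspace C. This is a toy sketch, not kernel code: the
names mock_entry, mock_tree, mock_validate, mock_invalidate and the modulo
slot mapping are all invented stand-ins for the real zswap tree lookup, and
the real validation additionally requires pinning the swap entry in the
swap cache first. It only illustrates the core idea: copy the entry's swap
handle to the stack before dropping the LRU lock, then later trust the
entry pointer only if the tree re-derived from that handle still maps to
the same pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct zswap_entry: swpentry is the stable identity
 * that gets copied to the stack before the LRU lock is dropped. */
struct mock_entry {
	unsigned long swpentry;
	int data;
};

#define NSLOTS 8
/* Stand-in for the zswap tree: slot -> live entry, or NULL. */
static struct mock_entry *mock_tree[NSLOTS];

/* Writeback side: re-derive the slot from the copied handle and only
 * trust the entry pointer if the tree still maps the slot to it
 * (pointer value comparison, as in the proposed comment above). */
static bool mock_validate(unsigned long swpentry, struct mock_entry *entry)
{
	return mock_tree[swpentry % NSLOTS] == entry;
}

/* Invalidation side: unlink the entry (the real code also frees it,
 * which is exactly why a stale pointer must never be dereferenced). */
static void mock_invalidate(unsigned long swpentry)
{
	mock_tree[swpentry % NSLOTS] = NULL;
}
```

Usage mirrors the shrinker: the stack copy of swpentry survives a
concurrent invalidation, so a failed mock_validate() tells the caller the
entry is gone without ever touching freed memory.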