From: Yosry Ahmed
Date: Tue, 30 Jul 2024 11:46:17 -0700
Subject: Re: [PATCH 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker
To: Johannes Weiner
Cc: Nhat Pham, akpm@linux-foundation.org, shakeelb@google.com, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, flintglass@gmail.com
References: <20240725232813.2260665-1-nphamcs@gmail.com> <20240725232813.2260665-2-nphamcs@gmail.com> <20240730033929.GB2866591@cmpxchg.org>
In-Reply-To: <20240730033929.GB2866591@cmpxchg.org>

On Mon, Jul 29, 2024 at 8:39 PM Johannes Weiner wrote:
>
> On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham wrote:
> > >
> > > Current zswap shrinker's heuristics to prevent overshrinking is brittle
> > > and inaccurate, specifically in the way we decay the protection size
> > > (i.e. making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics.
> > I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
> >
> > >
> > > We currently decay protection aggressively in zswap_lru_add() calls.
> > > This leads to the following unfortunate effect: when a new batch of
> > > pages enter zswap, the protection size rapidly decays to below 25% of
> > > the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the
> > > zswap shrinker: the rate of shrinking shoots up massively right after a
> > > new batch of zswap stores. This is somewhat the opposite of what we want
> > > originally - when new pages enter zswap, we want to protect both these
> > > new pages AND the pages that are already protected in the zswap LRU.
> > >
> > > Replace existing heuristics with a second chance algorithm
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> > >    is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference
> > >    bit set, give it a second chance - only flip the reference bit and
> > >    rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At the first look, this is similar to the reclaim algorithm. A
> > fundamental difference here is that the reference bit is only set
> > once, when the entry is created. It is different from the conventional
> > second chance page reclaim/replacement algorithm.
> >
> > What this really does, is that it slows down writeback by enforcing
> > that we need to iterate entries exactly twice before we write them
> > back. This sounds a little arbitrary and not very intuitive to me.
>
> This isn't different than other second chance algorithms. Those
> usually set the reference bit again to buy the entry another round.
> In our case, another reference causes a zswapin, which removes the entry
> from the list - buying it another round. Entries will get reclaimed
> once the scan rate catches up with the longest reuse distance.
>
> The main goal, which was also the goal of the protection math, is to
> slow down writebacks in proportion to new entries showing up. This
> gives zswap a chance to solve memory pressure through compression. If
> memory pressure persists, writeback should pick up.
>
> If no new entries were to show up, then sure, this would be busy
> work. In practice, new entries do show up at a varying rate. This all
> happens in parallel to anon reclaim, after all. The key here is that
> new entries will be interleaved with rotated entries, and they consume
> scan work! This is what results in the proportional slowdown.

The fact that in practice there will be new entries showing up because
we are doing anon reclaim is exactly what I had in mind. My point is
that in practice, for most cases, the reference bit could just be
causing a steady slowdown, which can be achieved just by tuning the
seek.

But I have a better idea now about what you mean with the proportional
slowdown. (more below)

>
> > Taking a step back, what we really want is to writeback zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs).
> > At NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries, it subtracts the protected part (or recent swapins)
> > and scales by the compression ratio. So this looks reasonable at first
> > sight, perhaps we want to tune the seek to slow down writeback if we
> > think it's too much, but that doesn't explain the scenario you are
> > describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. This does not
> > apply for zswap, because the objects can be swapped in.
>
> Hm.
>
> _count() returns (objects - protected) * compression_rate, then the
> shrinker does the >> priority dance. So to_scan is expected to be a
> small portion of unprotected objects.
>
> _scan() bails if to_scan > (objects - protected).
>
> How often does this actually abort in practice?

Good question :)

>
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and writeback this
> > amount, we will keep carrying this value forward in next reclaim
> > operations, even if the number of existing zswap entries has
> > decreased due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps going forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
> >
> > If we just need to slow down writeback by making sure we scan entries
> > twice, could something similar be achieved just by tuning the seek
> > without needing any heuristics to begin with?
>
> Seek is a fixed coefficient for the scan rate.
>
> We want to slow writeback when recent zswapouts dominate the zswap
> pool (expanding or thrashing), and speed it up when recent entries
> make up a small share of the pool (stagnating).
>
> This is what the second chance accomplishes.

Yeah, I am more convinced now. I wanted to check whether just tuning
the seek achieves a similar result, but the complexity from the
reference bit is not much anyway -- so maybe it isn't worth it.

I still think the nr_deferred handling in the shrinker framework does
not seem compatible with zswap. Entries could be double counted, and
previously counted entries could go away during swapins. Unless I am
missing something, it seems like the scanning can be arbitrary, and
not truly proportional to the reclaim priority and zswap LRU size.

I also think it's important to clarify that there are two mechanisms
that control the writeback rate with this patch: the reference bit
proportional slowdown (protecting against writeback of recently
swapped out pages), and nr_swapins protection (feedback from swapins
of recently written back pages).

Maybe we can move things around to make it more obvious how these
mechanisms work hand in hand, or have a comment somewhere describing
the writeback mechanism.
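
To make the nr_deferred concern a bit more concrete, here is a rough
sketch of the arithmetic from the total_scan formula quoted earlier.
It is only a toy model of that one expression, not the actual
do_shrink_slab() logic. The pool sizes and the deferred value are made
up for illustration, the priority constants are the ones implied by
the 1/2048th and 1/8th figures, and the 4 / seeks factor is my reading
of how seek = 2 turns into the "* 2" in the formula.

```python
# Toy model of the simplified formula quoted in this thread:
#   total_scan = (count * 2 + nr_deferred) >> priority    (seeks = 2)
# Assumption: this is NOT the kernel's do_shrink_slab(); it only mirrors
# the arithmetic discussed above. All sizes below are hypothetical.

DEF_PRIORITY = 12          # implied by the "1/2048th" figure (2 / 2**12)
NODE_RECLAIM_PRIORITY = 4  # implied by the "1/8th" figure (2 / 2**4)


def total_scan(count, priority, nr_deferred=0, seeks=2):
    """Scan target for `count` reported objects at a reclaim priority."""
    # Assumed scaling: 4 / seeks, which gives the "* 2" for seeks = 2.
    return ((count * 4) // seeks + nr_deferred) >> priority


# Hypothetical pool: the count callback reports 204800 objects.
count = 204800
at_def = total_scan(count, DEF_PRIORITY)            # count / 2048
at_node = total_scan(count, NODE_RECLAIM_PRIORITY)  # count / 8
at_zero = total_scan(count, 0)                      # count * 2

# Swapins shrink the pool to 2048 objects, but work deferred from the
# earlier, larger count (409600 here) is still carried forward.
fresh = total_scan(2048, DEF_PRIORITY)
stale = total_scan(2048, DEF_PRIORITY, nr_deferred=409600)

print(at_def, at_node, at_zero, fresh, stale)
```

In this made-up case the carried-over nr_deferred, rather than the
current count, determines almost all of the scan target, which is the
effect I am worried about.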