From: Nhat Pham <nphamcs@gmail.com>
Date: Thu, 20 Jun 2024 15:45:17 -0700
Subject: Re: [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink
To: Takero Funaki
Cc: Yosry Ahmed, Johannes Weiner, Chengming Zhou, Jonathan Corbet,
 Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
MIME-Version: 1.0
References: <20240608155316.451600-1-flintglass@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Wed, Jun 19, 2024 at 6:03 PM Takero Funaki wrote:
>
> Hello,
>
> Sorry for the late reply. I am currently investigating a responsiveness
> issue I found while benchmarking with this series, possibly related to
> concurrent zswap writeback and pageouts.
>
> This series cannot be applied until the root cause is identified,
> unfortunately. Thank you all for taking the time to review.
>
> The responsiveness issue was confirmed on 6.10-rc2 with all 3 patches
> applied. Without patch 3, it still happens but is less likely.
>
> When allocating much more memory than zswap can buffer, and writeback
> and rejection by pool_limit_hit happen simultaneously, the system stops
> responding. I do not see this freeze when zswap is disabled or when
> there is no pool_limit_hit. The proactive shrinking itself seems to
> work as expected as long as writeback and pageout do not occur
> simultaneously.
>
> I suspect this issue exists in the current code but was not visible
> without this series, since the global shrinker did not write back a
> considerable amount of pages.
>
> On Sat, Jun 15, 2024 at 7:48 Nhat Pham wrote:
> >
> > BTW, I'm curious. Have you experimented with increasing the pool size?
> > That 20% number is plenty for our use cases, but maybe yours needs a
> > different cap?
> >
>
> Probably we can allocate a bit more zswap pool size, but that will keep
> more old pages once the pool limit is hit. If we can ensure no pool
> limit hits and zero writeback by allocating more memory, I will try the
> same amount of zramswap.
>
> > Also, have you experimented with the dynamic zswap shrinker? :) I'm
> > actually curious how it works out in the small machine regime, with
> > whatever workload you are running.
> >
>
> It seems the dynamic shrinker is trying to evict all pages. That does
> not fit my use case, which prefers balanced swap-in and swap-out
> performance.

Hmm, not quite. As you have noted earlier, it (tries to) shrink the
unprotected pages only.

> On Sat, Jun 15, 2024 at 9:20 Yosry Ahmed wrote:
> >
> > > 1.
> > > The visible issue is that pageout/in operations from active
> > > processes are slow when zswap is near its max pool size. This is
> > > particularly significant on small memory systems, where total swap
> > > usage exceeds what zswap can store. This means that old pages occupy
> > > most of the zswap pool space, and recent pages use the swap disk
> > > directly.
> >
> > This should be a transient state though, right? Once the shrinker
> > kicks in, it should write back the old pages and make space for the
> > hot ones. Which takes us to our next point.
> >
> > > 2.
> > > This issue is caused by zswap keeping the pool size near 100%. Since
> > > the shrinker fails to shrink the pool to accept_thr_percent and
> > > zswap rejects incoming pages, rejection occurs more frequently than
> > > it should. The rejected pages are directly written to disk while
> > > zswap protects old pages from eviction, leading to slow pageout/in
> > > performance for recent pages on the swap disk.
> >
> > Why is the shrinker failing? IIUC the first two patches fix two cases
> > where the shrinker stumbles upon offline memcgs, or memcgs with no
> > zswapped pages. Are these cases common enough in your use case that
> > every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> > putting the zswap usage below accept_thr_percent?
> >
> > This would be surprising given that we should be restarting the
> > shrinker with every swapout attempt until we can accept pages again.
> >
> > I guess one could construct a malicious case where there are some
> > sticky offline memcgs, and all the memcgs that actually have zswap
> > pages come after it in the iteration order.
> >
> > Could you shed more light on this? What does the setup look like? How
> > many memcgs are there, how many of them use zswap, and how many
> > offline memcgs are you observing?
> >
>
> Example from Ubuntu 22.04 using zswap:
>
> root@ctl:~# find /sys/fs/cgroup/ -wholename \*service/memory.zswap.current | xargs grep . | wc
>      31      31    2557
> root@ctl:~# find /sys/fs/cgroup/ -wholename \*service/memory.zswap.current | xargs grep ^0 | wc
>      11      11     911
>
> This indicates 11 out of 31 services have no pages in zswap. Without
> patch 2, shrink_worker() aborts shrinking in the second tree walk,
> before evicting about 40 pages from the services. The number varies,
> but I think it is common to see a few memcgs that have no zswap pages.
>
> > I am not saying we shouldn't fix these problems anyway, I am just
> > trying to understand how we got into this situation to begin with.
> >
> > > 3.
> > > If the pool size were shrunk proactively, rejection by pool limit
> > > hits would be less likely. New incoming pages could be accepted as
> > > the pool gains some space in advance, while older pages are written
> > > back in the background. zswap would then be filled with recent
> > > pages, as expected by the LRU logic.
> >
> > I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> > from reclaim should be doing this sort of "proactive shrinking".
> >
> > I agree that the current hysteresis around accept_thr_percent is not
> > good enough, but I am surprised you are hitting the pool limit if the
> > shrinker is being run during reclaim.
> >
> > > Patches 1 and 2 make the shrinker reduce the pool to
> > > accept_thr_percent. Patch 3 makes zswap_store trigger the shrinker
> > > before reaching the max pool size. With this series, zswap will
> > > prepare some space to reduce the probability of the problematic
> > > pool_limit_hit situation, thus reducing slow reclaim and the page
> > > priority inversion against the LRU.
> > >
> > > 4.
> > > Once proactive shrinking reduces the pool size, pageouts complete
> > > instantly as long as the space prepared by shrinking can store the
> > > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > > accept_threshold_percent will improve active process performance.
> >
> > I agree that proactive shrinking is preferable to waiting until we
> > hit the pool limit and then refusing pages until the acceptance
> > threshold is reached. I am just trying to understand whether such a
> > proactive shrinking mechanism will be needed if the reclaim shrinker
> > for zswap is being used, and how the two would work together.
>
> For my workload, the dynamic shrinker (reclaim shrinker) is disabled.
> The proposed global shrinker and the existing dynamic shrinker are both
> proactive, but their goals are different.
>
> The global shrinker starts shrinking when the zswap pool exceeds
> accept_thr_percent + 1%, then stops when it reaches accept_thr_percent.
> Pages below accept_thr_percent are protected from shrinking.
>
> The dynamic shrinker starts shrinking based on memory pressure,
> regardless of the zswap pool size, and stops when the LRU size is
> reduced to 1/4. Its goal is to wipe out all pages from zswap. It
> prefers swapout performance only.
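Just to make sure we are talking about the same hysteresis, here is a
rough sketch of the start/stop conditions as I read them. This is
illustrative only; the variable names and the percentage math below are
placeholders, not the actual code in this series.

static unsigned long zswap_pool_pages;      /* current pool size, in pages */
static unsigned long zswap_max_pool_pages;  /* limit derived from max_pool_percent */
static unsigned int accept_thr_percent = 90;    /* module parameter default */

static int global_shrink_should_start(void)
{
	/* wake the shrink worker once the pool exceeds accept_thr_percent + 1% */
	return zswap_pool_pages * 100 >
	       zswap_max_pool_pages * (accept_thr_percent + 1);
}

static int global_shrink_should_stop(void)
{
	/* stop at accept_thr_percent; pages below that stay protected */
	return zswap_pool_pages * 100 <=
	       zswap_max_pool_pages * accept_thr_percent;
}

If that matches your intent, the 1% gap is what keeps the worker from
flapping right at the threshold.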
>
> I think the current LRU logic decreases nr_zswap_protected too quickly
> for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to
> between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin()
> increments nr_zswap_protected when page-ins of evicted pages occur
> later, this technically has no effect while reclaim is in progress.
>
> While zswap_store() and zswap_lru_add() are called, the dynamic
> shrinker is likely running due to the pressure. The dynamic shrinker
> reduces the LRU size to 1/4, and then a few subsequent zswap_store()
> calls reduce the protected count to 1/4 of the LRU size. The stored
> pages will be reduced to zero through a few shrinker_scan calls.

Ah, this is a fair point. We've been observing this in
production/experiments as well - there seems to be a positive
correlation between the zswpout rate and the zswap_written_back rate.
Whenever there is a spike in zswpout, you also see a spike in
written-back pages - it looks like the flood of zswpouts weakens zswap's
LRU protection, which is not quite the intended effect.

We're working to improve this situation. We have a couple of ideas
floating around, none of which are too complicated to implement, but
they need experiments to validate before sending upstream :)
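To spell out the decay you are describing, here is a simplified model of
the bookkeeping. It is not the literal mm/zswap.c code (which uses
atomics and per-lruvec state); the names below just mirror the concepts.

static unsigned long nr_zswap_protected; /* pages shielded from the dynamic shrinker */

static void model_zswap_lru_add(unsigned long lru_size)
{
	nr_zswap_protected++;

	/*
	 * Once the protected count grows past a quarter of the LRU it is
	 * halved, so a steady stream of stores keeps it between lru_size/8
	 * and lru_size/4 (the "1/4 and 1/8" window mentioned above).
	 */
	if (nr_zswap_protected > lru_size / 4)
		nr_zswap_protected /= 2;
}

static void model_zswap_folio_swapin(void)
{
	/* a page-in of a previously evicted page bumps protection back up */
	nr_zswap_protected++;
}

Under pressure the LRU itself is shrinking at the same time, so the
protected count gets squeezed from both sides, which lines up with the
behavior you describe.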