From: Nhat Pham <nphamcs@gmail.com>
Date: Mon, 8 Jul 2024 17:53:33 -0700
Subject: Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink
To: Takero Funaki <flintglass@gmail.com>
Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
 Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20240706022523.1104080-1-flintglass@gmail.com>
References: <20240706022523.1104080-1-flintglass@gmail.com>

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>

Woah, thanks for the copious details :)

> This series addresses some issues and introduces a minor improvement in
> the zswap global shrinker.
>
> This version addresses issues discovered during the review of v1:
> https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@gmail.com/
> and includes three additional patches to fix another issue uncovered by
> applying v1.
>
> Changes in v2:
> - Added 3 patches to reduce resource contention.
> mm: zswap: fix global shrinker memcg iteration:
> - Change the loop style. (Yosry, Nhat, Shakeel)
> - To not skip online memcgs, save zswap_last_shrink to detect cursor
>   changes by the cleaner. (Yosry, Nhat, Shakeel)
> mm: zswap: fix global shrinker error handling logic:
> - Change the error code for no-writeback memcgs. (Yosry)
> - Use nr_scanned to check if the LRU is empty. (Yosry)
>
> Changes in v1:
> mm: zswap: fix global shrinker memcg iteration:
> - Drop and reacquire the spinlock before skipping a memcg.
> - Add comments to clarify the locking mechanism.
> mm: zswap: proactive shrinking before pool size limit is hit:
> - Remove an unneeded check before scheduling work.
> - Change the shrink start threshold to accept_thr_percent + 1%.
>
>
> Patches
> ===============================
>
> 1. Fix the memcg iteration logic that aborts iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at accept threshold + 1%.
> 4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag so shrinking
>    does not block pageout.
> 5. Store incompressible pages as-is to accept all pages.
> 6. Interrupt the shrinker to avoid blocking page-in/out.
>
> Patches 1 to 3 should be applied in this order to avoid potential loops
> caused by the first issue. Patch 3 and later can be applied
> independently, but the first two issues must be resolved to ensure the
> shrinker can evict pages. Patches 4 to 6 address resource contention
> issues in the current shrinker uncovered by applying patch 3.
>
> With this series applied, the shrinker will continue to evict pages
> proactively, down to accept_threshold_percent, as documented in patch
> 3. As a side effect of the changes in the hysteresis logic, zswap will
> no longer reject pages under the max pool limit.
>
> The series is rebased on mainline 6.10-rc6.
>
>
> The counters below are from vmstat and the zswap debugfs stats. The
> times are average seconds using /usr/bin/time -v.
>
> pre-patch, 6.10-rc4
> |                    | Start       | End         | Delta      |
> |--------------------|-------------|-------------|------------|
> | pool_limit_hit     | 845         | 845         | 0          |
> | pool_total_size    | 201539584   | 201539584   | 0          |
> | stored_pages       | 63138       | 63138       | 0          |
> | written_back_pages | 12          | 12          | 0          |
> | pswpin             | 387         | 32412       | 32025      |
> | pswpout            | 153078      | 197138      | 44060      |
> | zswpin             | 0           | 0           | 0          |
> | zswpout            | 63150       | 63150       | 0          |
> | zswpwb             | 12          | 12          | 0          |
>
> | Time              |              |
> |-------------------|--------------|
> | elapsed           | 8.473        |
> | user              | 1.881        |
> | system            | 0.814        |
>
> post-patch, 6.10-rc4 with patch 1 to 5

You mean 1 to 6? There are 6 patches, no?

Just out of pure curiosity, could you include the stats from patches
1-3 only?
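Also, to double-check my understanding of the new hysteresis: IIUC,
with patch 3 the limit check amounts to something like the sketch
below. This is my paraphrase in terms of the 6.10 zswap.c helpers -
the actual patch may well differ:

/*
 * Sketch only: kick the global shrinker proactively at
 * accept_thr_percent + 1% of the max pool size, and only reject
 * stores once the hard limit itself is exceeded.
 */
static void zswap_check_limits(void)
{
	unsigned long total = zswap_total_pages();
	unsigned long max = zswap_max_pages();

	/* accept threshold + 1% of max: start background writeback */
	if (total > zswap_accept_thr_pages() + max / 100)
		queue_work(shrink_wq, &zswap_shrink_work);
}

so zswap_store() itself keeps accepting pages below the max pool
size. Is that right?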
> |                    | Start       | End         | Delta      |
> |--------------------|-------------|-------------|------------|
> | pool_limit_hit     | 81861       | 81861       | 0          |
> | pool_total_size    | 75001856    | 87117824    | 12115968   |
> | reject_reclaim_fail| 0           | 32          | 32         |
> | same_filled_pages  | 135         | 135         | 0          |
> | stored_pages       | 23919       | 27549       | 3630       |
> | written_back_pages | 97665       | 106994      | 10329      |
> | pswpin             | 4981        | 8223        | 3242       |
> | pswpout            | 179554      | 188883      | 9329       |

The pswpin delta actually decreases. Nice :)

> | zswpin             | 293863      | 590014      | 296151     |
> | zswpout            | 455338      | 764882      | 309544     |
> | zswpwb             | 97665       | 106994      | 10329      |
>
> | Time              |              |
> |-------------------|--------------|
> | elapsed           | 4.525        |
> | user              | 1.839        |
> | system            | 1.467       |
>
> Although pool_limit_hit did not increase in either case, zswap_store()
> rejected pages before this patch. Note that, before this series,
> zswap_store() did not increment pool_limit_hit on rejections caused by
> the limit-hit hysteresis (only the first few hits were counted).

Yeah, hysteresis + the broken global shrinker puts the system in a
very bad state. Thanks for showing this.

> From the pre-patch result, the existing zswap global shrinker cannot
> write back effectively and locks the old pages in the pool. The
> pswpin/out numbers indicate that the active process uses the swap
> device directly.
>
> From the post-patch result, zswpin/out/wb are increased as expected,
> indicating that the active process uses zswap and the old pages of the
> background services are evicted from the pool. pswpin/out are
> significantly reduced from the pre-patch results.

Lovely :)

> System responsiveness issue (patches 4 to 6)
> ============================================
>
> After applying patches 1 to 3, I encountered severe responsiveness
> degradation while the zswap global shrinker was running under heavy
> memory pressure.
>
> Visible issue to resolve
> -------------------------------
> The visible issue happens with patches 1 to 3 applied when a large
> amount of memory allocation happens and zswap cannot store the
> incoming pages. While the global shrinker is writing back pages, the
> system stops responding as if under heavy memory thrashing.
>
> This issue is less likely to happen without patches 1 to 3 or with
> zswap disabled. I believe this is because, as described in patch 2,
> the global shrinker could not write back a meaningful amount of pages.
>
> Root cause and changes to resolve the issue
> -------------------------------
> It seems that the zswap shrinker blocking IO for memory reclaim and
> faults is the root cause of this responsiveness issue. I introduced
> three patches to reduce possible blocking in the following problematic
> situations:
>
> 1. Contention on workqueue thread pools by shrink_worker() using
> WQ_MEM_RECLAIM unnecessarily.
>
> Although the shrinker runs simultaneously with memory reclaim,
> shrinking is not required for memory reclaim to make progress, since
> zswap_store() can reject pages without interfering with reclaim.
> shrink_worker() should not use WQ_MEM_RECLAIM and should be delayed
> when other WQ_MEM_RECLAIM work is reclaiming memory. The existing code
> requires allocating memory inside shrink_worker(), potentially
> blocking other latency-sensitive reclaim work.
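Right - and IIUC the fix is essentially a one-liner where the
workqueue is allocated. Roughly (the first call is what mainline 6.10
does; the second is my reading of patch 4):

/* current mainline: claims the reclaim forward-progress guarantee
 * (rescuer thread) that shrinking does not actually need */
shrink_wq = alloc_workqueue("zswap-shrink",
			    WQ_UNBOUND | WQ_MEM_RECLAIM, 1);

/* patch 4 (as I read it): zswap_store() can always fall back to
 * rejecting pages, so shrinking is best-effort and should not
 * compete with work that is actually required for reclaim */
shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND, 1);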
> 2. Contention on swap IO.
>
> Since zswap_writeback_entry() performs write IO in 4KB pages, it
> consumes a lot of IOPS, increasing the IO latency of swapout/in. We
> should not perform IO for background shrinking while zswap_store() is
> rejecting pages or zswap_load() is failing to find stored pages. This
> series implements two mitigation mechanisms to reduce the IO
> contention:
>
> 2-a. Do not reject pages in zswap_store().
> This is mostly achieved by patch 3. With patch 3, zswap can prepare
> space proactively and accept pages while the global shrinker is
> running.
>
> To reduce rejections further, patch 5 (store incompressible pages) is
> added. When zsmalloc is used, we can accept incompressible pages with
> a small memory overhead. It is a minor optimization, but I think it is
> worth implementing. This does not improve performance on the current
> zbud, but it does not incur a performance penalty either.
>
> 2-b. Interrupt writeback while paging in/out.
> Once zswap runs out of prepared space, we cannot accept incoming
> pages, incurring direct writes to the swap disk. At this moment, the
> shrinker is proactively evicting pages, leading to IO contention with
> memory reclaim.
>
> Performing low-priority IO would be straightforward but requires
> reimplementing a low-priority version of __swap_writepage(). Instead,
> in patch 6, I implemented a heuristic that delays the next zswap
> writeback based on the elapsed time since zswap_store() last rejected
> a page.
>
> When zswap_store() hits the max pool size and rejects pages,
> swap_writepage() immediately performs the writeback to disk. The
> jiffies timestamp is saved to tell shrink_worker() to sleep for up to
> ZSWAP_GLOBAL_SHRINK_DELAY msec.
>
> The same logic applies to zswap_load(). When zswap cannot find a page
> in the stored pool, the page-in requires read IO from the swap device.
> The global shrinker should be interrupted here as well.
>
> This patch proposes a constant delay of 500 milliseconds, aligning
> with the mq-deadline target latency.
>
> Visible change
> -------------------------------
> With patches 4 to 6, the global shrinker pauses the writeback while
> pagein/out operations are using the swap device. This change reduces
> resource contention and makes memory reclaim/faults complete faster,
> thereby reducing the system responsiveness degradation.

Ah, this is interesting. Did you actually see an improvement in your
real deployment (i.e., not the benchmark) with patches 4-6 in?

> Intended scenario for memory reclaim:
> 1. zswap pool < accept_threshold as the initial state. This is
>    achieved by patch 3, proactive shrinking.
> 2. Active processes start allocating pages. Pageout is buffered by
>    zswap without IO.
> 3. zswap reaches the shrink_start_threshold. zswap continues to buffer
>    incoming pages and starts writeback immediately in the background.
> 4. zswap reaches the max pool size. zswap interrupts the global
>    shrinker and starts rejecting pages. Write IO for the rejected
>    pages will consume all IO resources.

This sounds like the proactive shrinker is still not aggressive
enough, and/or there is some sort of misspecification of the zswap
settings...

Correct me if I'm wrong, but the new proactive global shrinker begins
1% after the acceptance threshold, and shrinks down to the acceptance
threshold, right? How are we still hitting the pool limit...

My concern is that we are knowingly (and perhaps unnecessarily)
creating an LRU inversion here - preferring swapping out the rejected
pages over the colder pages in the zswap pool. Shouldn't it be the
other way around?

For instance, can we spiral into the following scenario:

1. zswap pool becomes full.
2. Memory is still tight, so anonymous memory will be reclaimed. zswap
   keeps rejecting incoming pages, putting a hold on the global
   shrinker.
3. The pages that are swapped out are warmer than the ones stored in
   the zswap pool, so they will be more likely to be swapped in (which,
   IIUC, will also further delay the global shrinker).

and the cycle keeps going on and on?

Have you experimented with synchronous reclaim in the case where the
pool is full? All the way down to the acceptance threshold is too
aggressive of course - you might need to find something in between :)

> 5. Active processes stop allocating pages. After the delay, the
>    shrinker resumes writeback down to the accept threshold.
>
> Benchmark
> -------------------------------
> To demonstrate that the shrinker writeback is not interfering with
> pagein/out operations, I measured the elapsed time of allocating 2GB
> of 3/4-compressible data with a Python script, averaged over 10 runs:
>
> |                      | elapsed| user  | sys   |
> |----------------------|--------|-------|-------|
> | With patches 1 to 3  | 13.10  | 0.183 | 2.049 |
> | With all patches     | 11.17  | 0.116 | 1.490 |
> | zswap off (baseline) | 11.81  | 0.149 | 1.381 |
>
> Although this test cannot distinguish responsiveness issues caused by
> zswap writeback from normal memory thrashing between plain pagein/out,
> the difference from the baseline indicates that the patches reduced
> the performance degradation on pageout caused by zswap writeback.

I wonder if this contention would show up in the PSI metrics
(/proc/pressure/io, or the cgroup variants if you use them). Maybe
correlate the reclaim counters (pgscan, zswpout, pswpout, zswpwb,
etc.) with IO pressure to show the pattern, i.e., that the contention
problem was there before and is now resolved? :)
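One more thought, on the patch 6 delay: my mental model of the
heuristic is roughly the sketch below. The 500 ms constant and the
ZSWAP_GLOBAL_SHRINK_DELAY name are from your description; the variable
and helper names are my guesses, not necessarily what the patch uses:

#define ZSWAP_GLOBAL_SHRINK_DELAY 500	/* ms, mq-deadline target latency */

static unsigned long zswap_last_swapdev_io;	/* jiffies */

/* called when zswap_store() rejects a page or zswap_load() misses,
 * i.e. whenever the swap device itself is about to do IO */
static void zswap_note_swapdev_io(void)
{
	WRITE_ONCE(zswap_last_swapdev_io, jiffies);
}

/* shrink_worker() backs off while page-in/out IO is recent */
static bool zswap_shrink_should_wait(void)
{
	return time_before(jiffies,
			   READ_ONCE(zswap_last_swapdev_io) +
			   msecs_to_jiffies(ZSWAP_GLOBAL_SHRINK_DELAY));
}

If that is accurate, note that every rejection/miss pushes the
deadline out again, so under sustained thrashing the writeback could
be postponed indefinitely - which ties back to the LRU inversion
concern above.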