From: Nhat Pham <nphamcs@gmail.com>
Date: Wed, 21 Aug 2024 10:42:57 -0400
Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
To: Kanchana P Sridhar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
 yosryahmed@google.com, ryan.roberts@arm.com, ying.huang@intel.com,
 21cnbao@gmail.com, akpm@linux-foundation.org, nanhai.zou@intel.com,
 wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
In-Reply-To: <20240819021621.29125-1-kanchana.p.sridhar@intel.com>
References: <20240819021621.29125-1-kanchana.p.sridhar@intel.com>

On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the
> functionality in zswap_store(), to make it more amenable to supporting
> any-order mTHPs.
>
> For instance, the determination of whether a folio is same-filled is
> based on mapping an index into the folio to derive the page. Likewise,
> there is a function "zswap_store_entry" added to store a zswap_entry in
> the xarray.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
>   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on
> swapin_readahead(), using Intel IAA hardware acceleration, which we
> would like to submit in subsequent RFC patch-series, with performance
> improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit
>    8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when
>    THP is disabled. Barry, I have also made one other change in
>    page_io.c where count_mthp_stat() is called by
>    count_swpout_vm_event(). I would appreciate it if you can review
>    this. Thanks! Hopefully this should resolve the kernel robot build
>    errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided
>    in Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once
>      per folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the
>      summary from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline,
> without and with this patch-series, on an Intel Sapphire Rapids
> server: dual-socket, 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with a 4G SSD as the backing swap device
> for ZSWAP. Core frequency was fixed at 2500 MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> Since I was constrained by the 4G SSD in getting the 70 usemem
> processes to generate swapout activity, I ended up using different
> cgroup memory.high limits for the experiments with 64K mTHP and
> 2M THP:
>
>     64K mTHP experiments: cgroup memory fixed at 60G
>     2M THP experiments  : cgroup memory fixed at 55G
>
> The vm/sysfs stats included after the performance data provide details
> on the swapout activity to SSD/ZSWAP.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>     ZSWAP Allocator   : ZSMALLOC
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP
> compressor, IAA "compression verification" is enabled. Hence each IAA
> compression will be decompressed internally by the "iaa_crypto"
> driver, the CRCs returned by the hardware will be compared, and errors
> reported in case of mismatches. Thus "deflate-iaa" helps ensure better
> data integrity as compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows, averaged across 3 runs:
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> | Kernel             | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   | KB/s       |            |
> |--------------------|-------------------|------------|------------|
> | v6.11-rc3 mainline | SSD               |    335,346 | Baseline   |
> | zswap-mTHP-Store   | ZSWAP lz4         |    271,558 | -19%       |
> | zswap-mTHP-Store   | ZSWAP deflate-iaa |    388,154 | 16%        |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> | Kernel             | mTHP SWAP-OUT     | Sys time   | Improvement|
> |                    |                   | sec        |            |
> |--------------------|-------------------|------------|------------|
> | v6.11-rc3 mainline | SSD               |      91.37 | Baseline   |
> | zswap-mTHP-Store   | ZSWAP lz4         |     265.43 | -191%      |
> | zswap-mTHP-Store   | ZSWAP deflate-iaa |     235.60 | -158%      |
>  ------------------------------------------------------------------

Yeah no, this is not good. That throughput regression is concerning...

Is this tied to lz4 only, or do you observe similar trends in other
compressors that are not deflate-iaa?
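
For instance, zstd would be a useful data point. Something along these
lines should be a quick check (assuming zstd is compiled into your
kernel; the per-order zswpout counter is the one your series adds):

    echo zstd > /sys/module/zswap/parameters/compressor
    # re-run the 64K mTHP usemem experiment, then:
    grep -E 'pswpout|zswpout' /proc/vmstat
    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout

If zstd shows the same sys time blowup, it's not an lz4 quirk and we
can rule the compressor out.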
>
>  ------------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats | v6.11-rc3 | zswap-mTHP | zswap-mTHP  |
> |                               | mainline  |   Store    |   Store     |
> |                               |           |    lz4     | deflate-iaa |
> |------------------------------------------------------------------------|
> | pswpin                        |         0 |          0 |           0 |
> | pswpout                       |   174,432 |          0 |           0 |
> | zswpin                        |       703 |        534 |         721 |
> | zswpout                       |     1,501 |  1,491,654 |   1,398,805 |
> |------------------------------------------------------------------------|
> | thp_swpout                    |         0 |          0 |           0 |
> | thp_swpout_fallback           |         0 |          0 |           0 |
> | pgmajfault                    |     3,364 |      3,650 |       3,431 |
> |------------------------------------------------------------------------|
> | hugepages-64kB/stats/zswpout  |           |     63,200 |      63,244 |
> |------------------------------------------------------------------------|
> | hugepages-64kB/stats/swpout   |    10,902 |          0 |           0 |
>  ------------------------------------------------------------------------
>

Yeah this is not good. Something fishy is going on if we see this
ginormous jump from ~175,000 (z)swpout pages to almost 1.5 million
pages. That's a massive jump. Either:

1. Your theory - zswap store keeps banging on the limit (which suggests
an incompatibility between the way zswap currently behaves and our
reclaim logic).

2. The data here is ridiculously incompressible. We're needing to
zswpout roughly 8.5 times the number of pages, so the savings per page
would have to be 8.5x smaller => we only save 11.76% of memory for each
page??? That's not right...

3. There's an outright bug somewhere.

Very suspicious.

>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> =======================================================
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> | Kernel             | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   | KB/s       |            |
> |--------------------|-------------------|------------|------------|
> | v6.11-rc3 mainline | SSD               |    190,827 | Baseline   |
> | zswap-mTHP-Store   | ZSWAP lz4         |     32,026 | -83%       |
> | zswap-mTHP-Store   | ZSWAP deflate-iaa |    203,772 | 7%         |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> | Kernel             | mTHP SWAP-OUT     | Sys time   | Improvement|
> |                    |                   | sec        |            |
> |--------------------|-------------------|------------|------------|
> | v6.11-rc3 mainline | SSD               |      27.23 | Baseline   |
> | zswap-mTHP-Store   | ZSWAP lz4         |     156.52 | -475%      |
> | zswap-mTHP-Store   | ZSWAP deflate-iaa |     171.45 | -530%      |
>  ------------------------------------------------------------------

I'm confused. This is a *regression*, right? A massive one at that -
sys time is *more* than 5 times the old value?

>
>  --------------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats   | v6.11-rc3 | zswap-mTHP | zswap-mTHP  |
> |                                 | mainline  |   Store    |   Store     |
> |                                 |           |    lz4     | deflate-iaa |
> |--------------------------------------------------------------------------|
> | pswpin                          |         0 |          0 |           0 |
> | pswpout                         |   797,184 |          0 |           0 |
> | zswpin                          |       690 |        649 |         669 |
> | zswpout                         |     1,465 |  1,596,382 |   1,540,766 |
> |--------------------------------------------------------------------------|
> | thp_swpout                      |     1,557 |          0 |           0 |
> | thp_swpout_fallback             |         0 |      3,248 |       3,752 |

This is also increased, but I suppose we're just doing more (z)swapping
out in general...
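
Back-of-envelope, reading these fallback counts against the
hugepages-2048kB zswpout counters further down (my arithmetic, so
double-check me):

    3,248 / (3,248 + 2,416) ~= 57% of 2M THP swapouts fall back
    with lz4, i.e. the folio gets split and stored page-by-page

If that's zswap_store() repeatedly failing against the limit, it would
line up with theory 1 above.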
> | pgmajfault                      |     3,726 |      6,470 |       5,691 |
> |--------------------------------------------------------------------------|
> | hugepages-2048kB/stats/zswpout  |           |      2,416 |       2,261 |
> |--------------------------------------------------------------------------|
> | hugepages-2048kB/stats/swpout   |     1,557 |          0 |           0 |
>  --------------------------------------------------------------------------
>

I'm not trying to delay this patch - I fully believe in supporting
zswap for larger pages (both mTHP and THP - whatever the memory reclaim
subsystem throws at us). But we need to get to the bottom of this :)
These numbers are very suspicious and concerning.

If this is something urgent, I can live with a gate to enable/disable
this, but I'd much prefer we understand what's going on here.
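
To be concrete about what I mean by a gate: just a module parameter
guarding the large-folio path in zswap_store(). A rough sketch - the
knob name and placement are hypothetical, not a reviewed design:

    /* Hypothetical opt-in knob for storing large folios in zswap. */
    static bool zswap_mthp_enabled;
    module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);

    bool zswap_store(struct folio *folio)
    {
            /*
             * With the gate off, reject large folios here; the
             * caller then falls back to the backing swap device,
             * matching the behavior before this series.
             */
            if (folio_test_large(folio) && !zswap_mthp_enabled)
                    return false;
            ...
    }

That way distros can flip it off if the regression bites, while we
keep digging.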