From: Yosry Ahmed
Date: Wed, 28 Aug 2024 15:37:28 -0700
Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
To: Kanchana P Sridhar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, nphamcs@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org, nanhai.zou@intel.com, wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
In-Reply-To: <20240828093516.30228-1-kanchana.p.sridhar@intel.com>
References: <20240828093516.30228-1-kanchana.p.sridhar@intel.com>
On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
>   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>    robot; as per Nhat's and Michal's suggestion to not require a separate
>    patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>    commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket, 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with 176 GiB ZRAM (35% of available RAM) as the
> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
>
>   usemem --init-time -w -O -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
>   ZSWAP Compressors : zstd, deflate-iaa
>   ZSWAP Allocator   : zsmalloc
>   SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared, and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
>
> 64KB mTHP (cgroup memory.high set to 40G):
> ==========================================
>
> ------------------------------------------------------------------------------
>                     v6.11-rc3 mainline       zswap-mTHP          Change wrt
>                         Baseline                                  Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor      zstd  deflate-      zstd  deflate-      zstd  deflate-
>                                  iaa                 iaa                iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)  161,496   156,343   140,363   151,938      -13%      -3%
> sys time (sec)      771.68    802.08    954.85    735.47      -24%       8%
> memcg_high         111,223   110,889   138,651   133,884
> memcg_swap_high          0         0         0         0
> memcg_swap_fail          0         0         0         0
> pswpin                  16        16         0         0
> pswpout          7,471,472 7,527,963         0         0
> zswpin                 635       605       624       639
> zswpout              1,509     1,478 9,453,761 9,385,910
> thp_swpout               0         0         0         0
> thp_swpout_              0         0         0         0
>  fallback
> pgmajfault           3,616     3,430     4,633     3,611
> ZSWPOUT-64kB           n/a       n/a   590,768   586,521
> SWPOUT-64kB        466,967   470,498         0         0
> ------------------------------------------------------------------------------
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> =======================================================
>
> ------------------------------------------------------------------------------
>                     v6.11-rc3 mainline       zswap-mTHP          Change wrt
>                         Baseline                                  Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor      zstd  deflate-      zstd  deflate-      zstd  deflate-
>                                  iaa                 iaa                iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)  192,164   194,643   165,005   174,536      -14%     -10%
> sys time (sec)      823.55    830.42    801.72    676.65        3%      19%
> memcg_high          16,054    15,936    14,951    16,096
> memcg_swap_high          0         0         0         0
> memcg_swap_fail          0         0         0         0
> pswpin                   0         0         0         0
> pswpout          8,629,248 8,628,907         0         0
> zswpin                 560       645     5,333       781
> zswpout              1,416     1,503 8,546,895 9,355,760
> thp_swpout          16,854    16,853         0         0
> thp_swpout_              0         0         0         0
>  fallback
> pgmajfault           3,341     3,574     8,139     3,582
> ZSWPOUT-2048kB         n/a       n/a    16,684    18,270
> SWPOUT-2048kB       16,854    16,853         0         0
> ------------------------------------------------------------------------------
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() of mTHP, both the allocations and
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> This causes degradation in throughput and sys time with zswap mTHP, more so
> in the case of zstd than deflate-iaa. Compress latency could play a part in
> this: when there is more swapout activity happening, a slower compressor
> would cause allocations to stall for any/all of the 70 processes.
>
> In my opinion, even though the test setup does not provide an accurate
> way for a direct before/after comparison (because of zswap usage being
> counted in the cgroup, hence towards memory.high), it still seems
> reasonable for zswap_store to support (m)THP, so that further performance
> improvements can be implemented.

Are you saying that in the "Before" data we end up skipping zswap
completely because of using mTHPs? Does it make more sense to turn off
CONFIG_THP_SWAP in the "Before" data to force the mTHPs to be split and
the data to be stored in zswap? This would be a fairer Before/After
comparison where the memory goes to zswap in both cases, but "Before"
has to split the folios because of zswap's lack of support for mTHP.
I assume most setups relying on zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not. Nhat, is this something you can share?