Message-ID: <4c30cc30-0f7c-4ca7-a933-c8edfadaee5c@gmail.com>
Date: Wed, 23 Oct 2024 11:48:10 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
To: Barry Song <21cnbao@gmail.com>
Cc: senozhatsky@chromium.org, minchan@kernel.org, hanchuanhua@oppo.com,
 v-songbaohua@oppo.com, akpm@linux-foundation.org, linux-mm@kvack.org,
 hannes@cmpxchg.org, david@redhat.com, willy@infradead.org,
 kanchana.p.sridhar@intel.com, yosryahmed@google.com, nphamcs@gmail.com,
 chengming.zhou@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
 riel@surriel.com, shakeel.butt@linux.dev, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <5313c721-9cf1-4ecd-ac23-1eeddabd691f@gmail.com>
Content-Language: en-US
From: Usama Arif
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 23/10/2024 11:26, Barry Song wrote:
> On Wed, Oct 23, 2024 at 11:07 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Wed, Oct 23, 2024 at 10:17 AM Usama Arif wrote:
>>>
>>> On 22/10/2024 21:46, Barry Song wrote:
>>>> On Wed, Oct 23, 2024 at 4:26 AM Usama Arif wrote:
>>>>>
>>>>> On 21/10/2024 11:40, Usama Arif wrote:
>>>>>>
>>>>>> On 21/10/2024 06:09, Barry Song wrote:
>>>>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif wrote:
>>>>>>>>
>>>>>>>> After large folio zswapout support added in [1], this patch adds
>>>>>>>> support for zswapin of large folios to bring it on par with zram.
>>>>>>>> This series makes sure that the benefits of large folios (fewer
>>>>>>>> page faults, batched PTE and rmap manipulation, reduced lru list,
>>>>>>>> TLB coalescing (for arm64 and amd)) are not lost at swap out when
>>>>>>>> using zswap.
>>>>>>>>
>>>>>>>> It builds on top of [2] which added large folio swapin support for
>>>>>>>> zram and provides the same level of large folio swapin support as
>>>>>>>> zram, i.e. only supporting swap count == 1.
>>>>>>>>
>>>>>>>> Patch 1 skips swapcache for swapping in zswap pages, this should improve
>>>>>>>> no readahead swapin performance [3], and also allows us to build on large
>>>>>>>> folio swapin support added in [2], hence is a prerequisite for patch 3.
>>>>>>>>
>>>>>>>> Patch 3 adds support for large folio zswapin. This patch does not add
>>>>>>>> support for hybrid backends (i.e. folios partly present swap and zswap).
>>>>>>>>
>>>>>>>> The main performance benefit comes from maintaining large folios *after*
>>>>>>>> swapin, large folio performance improvements have been mentioned in previous
>>>>>>>> series posted on it [2],[4], so have not added those.
>>>>>>>> Below is a simple microbenchmark to measure the time needed *for* zswpin
>>>>>>>> of 1G memory (along with memory integrity check).
>>>>>>>>
>>>>>>>>                                | no mTHP (ms) | 1M mTHP enabled (ms)
>>>>>>>> Base kernel                    |   1165       |    1163
>>>>>>>> Kernel with mTHP zswpin series |   1203       |     738
>>>>>>>
>>>>>>> Hi Usama,
>>>>>>> Do you know where this minor regression for non-mTHP comes from?
>>>>>>> As you even have skipped swapcache for small folios in zswap in patch1,
>>>>>>> that part should have some gain? Is it because of zswap_present_test()?
>>>>>>>
>>>>>>
>>>>>> Hi Barry,
>>>>>>
>>>>>> The microbenchmark does a sequential read of 1G of memory, so it probably
>>>>>> isn't very representative of real world usecases. This also means that
>>>>>> swap_vma_readahead is able to readahead accurately all pages in its window.
>>>>>> With this patch series, if doing 4K swapin, you get 1G/4K calls of fast
>>>>>> do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow
>>>>>> do_swap_page calls. I had added some prints and I was seeing 8 pages being
>>>>>> readahead in 1 do_swap_page. The larger number of calls causes the slight
>>>>>> regression (even though they are quite fast). I think in a realistic scenario,
>>>>>> where readahead window won't be as large, there won't be a regression.
>>>>>> The cost of zswap_present_test in the whole call stack of swapping page is
>>>>>> very low and I think can be ignored.
>>>>>>
>>>>>> I think the more interesting thing is what Kanchana pointed out in
>>>>>> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
>>>>>> I am curious, did you see this when testing large folio swapin and compression
>>>>>> at 4K granularity? It looks like swap thrashing so I think it would be common
>>>>>> between zswap and zram. I don't have larger granularity zswap compression done,
>>>>>> which is why I think there is a regression in time taken. (It could be because
>>>>>> it's tested on Intel as well).
>>>>>>
>>>>>> Thanks,
>>>>>> Usama
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> So I have been doing some benchmarking after Kanchana pointed out a performance
>>>>> regression in [1] of swapping in large folio. I would love to get thoughts from
>>>>> zram folks on this, as that's where large folio swapin was first added [2].
>>>>> As far as I can see, the current support in zram is doing large folio swapin
>>>>> at 4K granularity. The large granularity compression in [3] which was posted
>>>>> in March is not merged, so I am currently comparing upstream zram with this series.
>>>>>
>>>>> With the microbenchmark below of timing 1G swapin, there was a very large improvement
>>>>> in performance by using this series. I think similar numbers would be seen in zram.
>>>>
>>>> Imagine running several apps on a phone and switching
>>>> between them: A → B → C → D → E … → A → B … The app
>>>> currently on the screen retains its memory, while the ones
>>>> sent to the background are swapped out. When we bring
>>>> those apps back to the foreground, their memory is restored.
>>>> This behavior is quite similar to what you're seeing with
>>>> your microbenchmark.
>>>>
>>>
>>> Hi Barry,
>>>
>>> Thanks for explaining this! Do you know if there is some open source benchmark
>>> we could use to show an improvement in app switching with large folios?
>>>
>>
>> I’m fairly certain the Android team has this benchmark, but it’s not
>> open source.
>>
>> A straightforward way to simulate this is to use a script that
>> cyclically launches multiple applications, such as Chrome, Firefox,
>> Office, PDF, and others.
>>
>> For example:
>>
>> launch chrome;
>> launch firefox;
>> launch youtube;
>> ....
>> launch chrome;
>> launch firefox;
>> ....
>>
>> On Android, we have "Android activity manager 'am' command" to do that.
>> https://gist.github.com/tsohr/5711945
>>
>> Not quite sure if other window managers have similar tools.
>>
>>> Also I guess swap thrashing can happen when apps are brought back to foreground?
>>>
>>
>> Typically, the foreground app doesn't experience much swapping,
>> as it is the most recently or frequently used. However, this may
>> not hold for very low-end phones, where memory is significantly
>> less than the app's working set. For instance, we can't expect a
>> good user experience when playing a large game that requires 8GB
>> of memory on a 4GB phone! :-)
>> And for low-end phones, we never even enable mTHP.
>>
>>>>>
>>>>> But when doing kernel build test, Kanchana saw a regression in [1]. I believe
>>>>> it's because of swap thrashing (causing large zswap activity), due to larger page swapin.
>>>>> The part of the code that decides large folio swapin is the same between zswap and zram,
>>>>> so I believe this would be observed in zram as well.
>>>>
>>>> Is this an extreme case where the workload's working set far
>>>> exceeds the available memory by memcg limitation? I doubt mTHP
>>>> would provide any real benefit from the start if the workload is bound to
>>>> experience swap thrashing. What if we disable mTHP entirely?
>>>>
>>>
>>> I would agree, this is an extreme case. I wanted (z)swap activity to happen so limited
>>> memory.max to 4G.
>>>
>>> mTHP is beneficial in kernel test benchmarking going from no mTHP to 16K:
>>>
>>> ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G
>>> metric         no mTHP          16K mTHP=always
>>> real           1m0.613s         0m52.008s
>>> user           25m23.028s       25m19.488s
>>> sys            25m45.466s       18m11.640s
>>> zswpin         1911194          3108438
>>> zswpout        6880815          9374628
>>> pgfault        120430166        48976658
>>> pgmajfault     1580674          2327086
>>>
>>
>> Interesting! We never use a phone to build the Linux kernel, but
>> let me see if I can find some other machines to reproduce your data.
>
> Hi Usama,
>
> I suspect the regression occurs because you're running an edge case
> where the memory cgroup stays nearly full most of the time (this isn't
> an inherent issue with large folio swap-in). As a result, swapping in
> mTHP quickly triggers a memcg overflow, causing a swap-out. The
> next swap-in then recreates the overflow, leading to a repeating
> cycle.
>

Yes, agreed! Looking at the swap counters, I think this is what is going on as well.

> We need a way to stop the cup from repeatedly filling to the brim and
> overflowing. While not a definitive fix, the following change might help
> improve the situation:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17af08367c68..f2fa0eeb2d9a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>         memcg = get_mem_cgroup_from_mm(mm);
>         rcu_read_unlock();
>
> -       ret = charge_memcg(folio, memcg, gfp);
> +       if (folio_test_large(folio) && mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH)
> +               ret = -ENOMEM;
> +       else
> +               ret = charge_memcg(folio, memcg, gfp);
>
>         css_put(&memcg->css);
>         return ret;
> }
>

The diff makes sense to me. Let me test later today and get back to you.

Thanks!

> Please confirm if it makes the kernel build with memcg limitation
> faster. If so, let's work together to figure out an official patch :-)
> The above code doesn't consider the parent memcg's overflow, so it is
> not an ideal fix.
>
>>
>>>
>>>>>
>>>>> My initial thought was this might be because it's Intel, where you don't have the advantage
>>>>> of TLB coalescing, so tested on AMD and ARM, but the regression is there on AMD
>>>>> and ARM as well, though a bit less (have added the numbers below).
>>>>>
>>>>> The numbers show that the zswap activity increases and page faults decrease.
>>>>> Overall this does result in sys time increasing and real time slightly increasing,
>>>>> likely because the cost of increased zswap activity is more than the benefit of
>>>>> lower page faults.
>>>>> I can see in [3] that pagefaults reduced in zram as well.
>>>>>
>>>>> Large folio swapin shows good numbers in microbenchmarks that just target reduced page
>>>>> faults and sequential swapin only, but not in kernel build test. Is a similar regression
>>>>> observed with zram when enabling large folio swapin on kernel build test? Maybe large
>>>>> folio swapin makes more sense on workloads where mappings are kept for a longer time?
>>>>>
>>>>
>>>> I suspect this is because mTHP doesn't always benefit workloads
>>>> when available memory is quite limited compared to the working set.
>>>> In that case, mTHP swap-in might introduce more features that
>>>> exacerbate the problem. We used to have an extra control "swapin_enabled"
>>>> for swap-in, but it never gained much traction:
>>>> https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/
>>>> We can reconsider whether to include the knob, but if it's better
>>>> to disable mTHP entirely for these cases, we can still adhere to
>>>> the policy of "enabled".
>>>>
>>> Yes I think this makes sense to have. The only thing is, it's too many knobs!
>>> I personally think it's already difficult to decide up to which mTHP size we
>>> should enable (and I think this changes per workload). But if we add swapin_enabled
>>> on top of that it can make things more difficult.
>>>
>>>> Using large block compression and decompression in zRAM will
>>>> significantly reduce CPU usage, likely making the issue unnoticeable.
>>>> However, the default minimum size for large block support is currently
>>>> set to 64KB (ZSMALLOC_MULTI_PAGES_ORDER = 4).
>>>>
>>>
>>> I saw that the patch was sent in March, and there weren't any updates after?
>>> Maybe I can try and cherry-pick that and see if we can develop large
>>> granularity compression for zswap.
>>
>> will provide an updated version next week.
>>
>>>
>>>>>
>>>>> Kernel build numbers in cgroup with memory.max=4G to trigger zswap
>>>>> Command for AMD: make defconfig; time make -j$(nproc) bzImage
>>>>> Command for ARM: make defconfig; time make -j$(nproc) Image
>>>>>
>>>>>
>>>>> AMD 16K+32K THP=always
>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series
>>>>> real           1m23.038s        1m23.050s
>>>>> user           53m57.210s       53m53.437s
>>>>> sys            7m24.592s        7m48.843s
>>>>> zswpin         612070           999244
>>>>> zswpout        2226403          2347979
>>>>> pgfault        20667366         20481728
>>>>> pgmajfault     385887           269117
>>>>>
>>>>> AMD 16K+32K+64K THP=always
>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series
>>>>> real           1m22.975s        1m23.266s
>>>>> user           53m51.302s       53m51.069s
>>>>> sys            7m40.168s        7m57.104s
>>>>> zswpin         676492           1258573
>>>>> zswpout        2449839          2714767
>>>>> pgfault        17540746         17296555
>>>>> pgmajfault     429629           307495
>>>>> --------------------------
>>>>> ARM 16K+32K THP=always
>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series
>>>>> real           0m51.168s        0m52.086s
>>>>> user           25m14.715s       25m15.765s
>>>>> sys            17m18.856s       18m8.031s
>>>>> zswpin         3904129          7339245
>>>>> zswpout        11171295         13473461
>>>>> pgfault        37313345         36011338
>>>>> pgmajfault     2726253          1932642
>>>>>
>>>>>
>>>>> ARM 16K+32K+64K THP=always
>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series
>>>>> real           0m52.017s        0m53.828s
>>>>> user           25m2.742s        25m0.046s
>>>>> sys            18m24.525s       20m26.207s
>>>>> zswpin         4853571          8908664
>>>>> zswpout        12297199         15768764
>>>>> pgfault        32158152         30425519
>>>>> pgmajfault     3320717          2237015
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Usama
>>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
>>>>> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/
>>>>> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> The time measured was pretty consistent between runs (~1-2% variation).
>>>>>>>> There is 36% improvement in zswapin time with 1M folios. The percentage
>>>>>>>> improvement is likely to be more if the memcmp is removed.
>>>>>>>>
>>>>>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
>>>>>>>> index 40de679248b8..77068c577c86 100644
>>>>>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c
>>>>>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c
>>>>>>>> @@ -9,6 +9,8 @@
>>>>>>>>  #include
>>>>>>>>  #include
>>>>>>>>  #include
>>>>>>>> +#include
>>>>>>>> +#include
>>>>>>>>
>>>>>>>>  #include "../kselftest.h"
>>>>>>>>  #include "cgroup_util.h"
>>>>>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root)
>>>>>>>>         return test_zswap_writeback(root, false);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static int zswapin_perf(const char *cgroup, void *arg)
>>>>>>>> +{
>>>>>>>> +       long pagesize = sysconf(_SC_PAGESIZE);
>>>>>>>> +       size_t memsize = MB(1*1024);
>>>>>>>> +       char buf[pagesize];
>>>>>>>> +       int ret = -1;
>>>>>>>> +       char *mem;
>>>>>>>> +       struct timeval start, end;
>>>>>>>> +
>>>>>>>> +       mem = (char *)memalign(2*1024*1024, memsize);
>>>>>>>> +       if (!mem)
>>>>>>>> +               return ret;
>>>>>>>> +
>>>>>>>> +       /*
>>>>>>>> +        * Fill half of each page with increasing data, and keep other
>>>>>>>> +        * half empty, this will result in data that is still compressible
>>>>>>>> +        * and ends up in zswap, with material zswap usage.
>>>>>>>> +        */
>>>>>>>> +       for (int i = 0; i < pagesize; i++)
>>>>>>>> +               buf[i] = i < pagesize/2 ?
>>>>>>>> +                       (char) i : 0;
>>>>>>>> +
>>>>>>>> +       for (int i = 0; i < memsize; i += pagesize)
>>>>>>>> +               memcpy(&mem[i], buf, pagesize);
>>>>>>>> +
>>>>>>>> +       /* Try and reclaim allocated memory */
>>>>>>>> +       if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
>>>>>>>> +               ksft_print_msg("Failed to reclaim all of the requested memory\n");
>>>>>>>> +               goto out;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       gettimeofday(&start, NULL);
>>>>>>>> +       /* zswpin */
>>>>>>>> +       for (int i = 0; i < memsize; i += pagesize) {
>>>>>>>> +               if (memcmp(&mem[i], buf, pagesize)) {
>>>>>>>> +                       ksft_print_msg("invalid memory\n");
>>>>>>>> +                       goto out;
>>>>>>>> +               }
>>>>>>>> +       }
>>>>>>>> +       gettimeofday(&end, NULL);
>>>>>>>> +       printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000);
>>>>>>>> +       ret = 0;
>>>>>>>> +out:
>>>>>>>> +       free(mem);
>>>>>>>> +       return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int test_zswapin_perf(const char *root)
>>>>>>>> +{
>>>>>>>> +       int ret = KSFT_FAIL;
>>>>>>>> +       char *test_group;
>>>>>>>> +
>>>>>>>> +       test_group = cg_name(root, "zswapin_perf_test");
>>>>>>>> +       if (!test_group)
>>>>>>>> +               goto out;
>>>>>>>> +       if (cg_create(test_group))
>>>>>>>> +               goto out;
>>>>>>>> +
>>>>>>>> +       if (cg_run(test_group, zswapin_perf, NULL))
>>>>>>>> +               goto out;
>>>>>>>> +
>>>>>>>> +       ret = KSFT_PASS;
>>>>>>>> +out:
>>>>>>>> +       cg_destroy(test_group);
>>>>>>>> +       free(test_group);
>>>>>>>> +       return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * When trying to store a memcg page in zswap, if the memcg hits its memory
>>>>>>>>   * limit in zswap, writeback should affect only the zswapped pages of that
>>>>>>>> @@ -584,6 +654,7 @@ struct zswap_test {
>>>>>>>>         T(test_zswapin),
>>>>>>>>         T(test_zswap_writeback_enabled),
>>>>>>>>         T(test_zswap_writeback_disabled),
>>>>>>>> +       T(test_zswapin_perf),
>>>>>>>>         T(test_no_kmem_bypass),
>>>>>>>>         T(test_no_invasive_cgroup_shrink),
>>>>>>>>  };
>>>>>>>>
>>>>>>>> [1]
>>>>>>>>     https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/
>>>>>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/
>>>>>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u
>>>>>>>> [4] https://lwn.net/Articles/955575/
>>>>>>>>
>>>>>>>> Usama Arif (4):
>>>>>>>>   mm/zswap: skip swapcache for swapping in zswap pages
>>>>>>>>   mm/zswap: modify zswap_decompress to accept page instead of folio
>>>>>>>>   mm/zswap: add support for large folio zswapin
>>>>>>>>   mm/zswap: count successful large folio zswap loads
>>>>>>>>
>>>>>>>>  Documentation/admin-guide/mm/transhuge.rst |   3 +
>>>>>>>>  include/linux/huge_mm.h                    |   1 +
>>>>>>>>  include/linux/zswap.h                      |   6 ++
>>>>>>>>  mm/huge_memory.c                           |   3 +
>>>>>>>>  mm/memory.c                                |  16 +--
>>>>>>>>  mm/page_io.c                               |   2 +-
>>>>>>>>  mm/zswap.c                                 | 120 ++++++++++++++-------
>>>>>>>>  7 files changed, 99 insertions(+), 52 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.43.5
>>>>>>>>
>>>>>>>
>>>>
>>
>
> Thanks
> Barry
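
For anyone following the memcg part of this thread, the check in the
proposed mem_cgroup_swapin_charge_folio() change can be modelled in
user space. This is only a rough sketch under stated assumptions, not
kernel code: toy_margin() and toy_swapin_charge() are hypothetical
names, the margin is simplified to "limit minus usage" for a single
cgroup (no parent hierarchy, no memsw accounting), and
TOY_CHARGE_BATCH = 64 is an assumed stand-in for MEMCG_CHARGE_BATCH.

```c
#include <assert.h>
#include <errno.h>

/* Assumed stand-in for the kernel's MEMCG_CHARGE_BATCH (value is a guess). */
#define TOY_CHARGE_BATCH 64UL

/* Pages that can still be charged before the toy cgroup reaches its limit. */
unsigned long toy_margin(unsigned long limit_pages, unsigned long usage_pages)
{
	return usage_pages < limit_pages ? limit_pages - usage_pages : 0;
}

/*
 * Hypothetical model of the proposed check: a large folio (more than one
 * page) is refused with -ENOMEM when the cgroup is nearly full. A return
 * of 0 means "go on to the normal charge path", which is not modelled here.
 */
int toy_swapin_charge(unsigned long limit_pages, unsigned long usage_pages,
		      unsigned long nr_pages)
{
	assert(nr_pages > 0);

	if (nr_pages > 1 &&
	    toy_margin(limit_pages, usage_pages) < TOY_CHARGE_BATCH)
		return -ENOMEM;

	return 0;
}
```

With a 1000-page limit and 990 pages in use, toy_swapin_charge(1000,
990, 16) is refused while toy_swapin_charge(1000, 990, 1) is allowed
through. In the real fault path this would presumably let the swapin
retry with a smaller folio instead of pushing the cgroup over its
limit, but that part is an assumption and is not shown in the sketch.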