Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
From: Usama Arif <usamaarif642@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao
Date: Fri, 10 Jan 2025 10:40:25 +0000
References: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>

On 10/01/2025 10:30, Barry Song wrote:
> On Fri, Jan 10, 2025 at 11:26 PM Usama Arif wrote:
>>
>> On 10/01/2025 10:09, Barry Song wrote:
>>> Hi Usama,
>>>
>>> Please include me in the discussion. I'll try to attend, at least
>>> remotely.
>>>
>>> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif wrote:
>>>>
>>>> I would like to propose a session to discuss the work going on
>>>> around large folio swapin, whether it's traditional swap, zswap
>>>> or zram.
>>>>
>>>> Large folios have obvious advantages that have been discussed
>>>> before, like fewer page faults, batched PTE and rmap manipulation,
>>>> shorter LRU lists, and TLB coalescing (on arm64 and AMD).
>>>> However, swapping in large folios has its own drawbacks, like
>>>> higher swap thrashing.
>>>> I had initially sent an RFC for zswapin of large folios in [1],
>>>> but it caused a regression in kernel build time due to swap
>>>> thrashing, which I am confident is happening with zram large
>>>> folio swapin as well (which is merged in the kernel).
>>>>
>>>> Some of the points we could discuss in the session:
>>>>
>>>> - What is the right (preferably open source) benchmark to test for
>>>>   swapin of large folios?
>>>>   Kernel build time in a limited-memory cgroup shows a regression,
>>>>   while microbenchmarks show a massive improvement; maybe there are
>>>>   benchmarks where TLB misses are a big factor and show an
>>>>   improvement.
>>>
>>> My understanding is that it largely depends on the workload. In
>>> interactive scenarios, such as on a phone, swap thrashing is not an
>>> issue because there is minimal to no thrashing for the app occupying
>>> the screen (foreground). In such cases, swap bandwidth becomes the
>>> most critical factor in improving app switching speed, especially
>>> when multiple applications are switching between background and
>>> foreground states.
>>>
>>>> - We could have something like
>>>>   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
>>>>   to enable/disable swapin, but it's going to be difficult to tune:
>>>>   such knobs might have different optimum values based on workloads
>>>>   and are likely to be left at their default values. Is there some
>>>>   dynamic way to decide when to swap in large folios and when to
>>>>   fall back to smaller folios? The swapin_readahead swapcache path,
>>>>   which only supports 4K folios at the moment, has a readahead
>>>>   window based on hits; however, readahead is a folio flag and not
>>>>   a page flag, so this method can't be used: once a large folio is
>>>>   swapped in, we won't get a fault, and subsequent hits on other
>>>>   pages of the large folio won't be recorded.
>>>>
>>>> - For zswap and zram, it might be that doing larger-block
>>>>   compression/decompression offsets the regression from swap
>>>>   thrashing, but that brings its own issues. For example, once a
>>>>   large folio is swapped out, it could fail to swap in as a large
>>>>   folio and fall back to 4K, resulting in redundant decompressions.
>>>
>>> That's correct.
>>> My current workaround involves swapping four small folios, and
>>> zsmalloc will compress and decompress in chunks of four pages,
>>> regardless of the actual size of the mTHP; the improvement in
>>> compression ratio and speed becomes less significant after exceeding
>>> four pages, even though there is still some increase.
>>>
>>> Our recent experiments on phones also show that enabling direct
>>> reclamation for do_swap_page() to allocate order-2 mTHP results in a
>>> 0% allocation failure rate; this probably removes the need for
>>> falling back to four small folios. (Note that our experiments
>>> include Yu's TAO; Android GKI has already merged it. However, since
>>> 2 is less than PAGE_ALLOC_COSTLY_ORDER, we might achieve similar
>>> results even without Yu's TAO, although I have not confirmed this.)
>>>
>>
>> Hi Barry,
>>
>> Thanks for the comments!
>>
>> I haven't seen any activity on TAO on the mailing list recently. Do
>> you know if there are any plans for it to be sent for upstream
>> review? I have cc'ed Yu Zhao as well.
>>
>>>> This will also mean swapin of large folios from traditional swap
>>>> isn't something we should proceed with?
>>>>
>>>> - Should we even support large folio swapin? You often have high
>>>>   swap activity when the system/cgroup is close to running out of
>>>>   memory; at this point, maybe the best way forward is to just swap
>>>>   in 4K pages and let khugepaged [2], [3] collapse them if the
>>>>   surrounding pages are swapped in as well.
>>>
>>> This approach might be suitable for non-interactive scenarios, such
>>> as building a kernel within a memory control group (memcg) or
>>> running other server applications. However, performing collapse in
>>> interactive and power-sensitive scenarios would be unnecessary and
>>> could lead to wasted power due to memory migration and unmap/map
>>> operations.
>>>
>>> However, it is quite challenging to automatically determine the
>>> type of workloads the system is running.
>>> I feel we still need a global control to decide whether to enable
>>> mTHP swap-in: not necessarily per size, but at least at a global
>>> level. That said, there is evident resistance to introducing
>>> additional controls to enable or disable mTHP features.
>>>
>>> By the way, Usama, have you ever tried switching between MGLRU and
>>> the traditional active/inactive LRU? My experience shows a
>>> significant difference in swap thrashing: the active/inactive LRU
>>> exhibits much less swap thrashing in my local kernel build tests.
>>>
>>
>> I never tried with MGLRU enabled, so I am probably seeing the lowest
>> amount of swap thrashing.
>
> Are you sure, Usama, since MGLRU is enabled by default? I have to
> echo 0 to manually disable it.
>

Yes, I don't have CONFIG_LRU_GEN set in my defconfig. I don't think it
is set by default either, at least on x86:

$ make defconfig
$ grep LRU_GEN .config
# CONFIG_LRU_GEN is not set

Thanks,
Usama

>>
>> Thanks,
>> Usama
>>
>>> the latest mm-unstable
>>>
>>> *********** default mglru: ***********
>>>
>>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
>>> *** Executing round 1 ***
>>> real 6m44.561s
>>> user 46m53.274s
>>> sys 3m48.585s
>>> pswpin: 1286081
>>> pswpout: 3147936
>>> 64kB-swpout: 0
>>> 32kB-swpout: 0
>>> 16kB-swpout: 714580
>>> 64kB-swpin: 0
>>> 32kB-swpin: 0
>>> 16kB-swpin: 286881
>>> pgpgin: 17199072
>>> pgpgout: 21493892
>>> swpout_zero: 229163
>>> swpin_zero: 84353
>>>
>>> ******** disable mglru ********
>>>
>>> root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
>>>
>>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
>>> *** Executing round 1 ***
>>> real 6m27.944s
>>> user 46m41.832s
>>> sys 3m30.635s
>>> pswpin: 474036
>>> pswpout: 1434853
>>> 64kB-swpout: 0
>>> 32kB-swpout: 0
>>> 16kB-swpout: 331755
>>> 64kB-swpin: 0
>>> 32kB-swpin: 0
>>> 16kB-swpin: 106333
>>> pgpgin: 11763720
>>> pgpgout: 14551524
>>> swpout_zero: 145050
>>> swpin_zero: 87981
>>>
>>> my build script:
>>>
>>> #!/bin/bash
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> vmstat_path="/proc/vmstat"
>>> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>>>
>>> read_values() {
>>>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>>>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>>>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>>>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>>>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>>>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>>>     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
>>>     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
>>>     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
>>>     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
>>> }
>>>
>>> for ((i=1; i<=1; i++))
>>> do
>>>     echo
>>>     echo "*** Executing round $i ***"
>>>     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>     echo 3 > /proc/sys/vm/drop_caches
>>>
>>>     # kernel build
>>>     initial_values=($(read_values))
>>>     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>>>     final_values=($(read_values))
>>>
>>>     echo "pswpin: $((final_values[0] - initial_values[0]))"
>>>     echo "pswpout: $((final_values[1] - initial_values[1]))"
>>>     echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
>>>     echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
>>>     echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
>>>     echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
>>>     echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
>>>     echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
>>>     echo "pgpgin: $((final_values[8] - initial_values[8]))"
>>>     echo "pgpgout: $((final_values[9] - initial_values[9]))"
>>>     echo "swpout_zero: $((final_values[10] - initial_values[10]))"
>>>     echo "swpin_zero: $((final_values[11] - initial_values[11]))"
>>>     sync
>>>     sleep 10
>>> done
>>>
>>>>
>>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>>>> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
>>>> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>>>
>>>> Thanks,
>>>> Usama
>>>
>
> Thanks
> Barry
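
[Editor's note: the before/after counter subtraction in Barry's quoted build script can be expressed compactly in any language. Below is a minimal Python sketch of the same bookkeeping; the helper names parse_vmstat and swap_deltas are hypothetical, not part of the thread, and it assumes /proc/vmstat's simple "name value" line format.]

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def swap_deltas(before, after, keys=("pswpin", "pswpout", "pgpgin", "pgpgout")):
    """Return after-minus-before differences for the selected counters,
    mirroring the final_values/initial_values subtraction in the script."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}
```

In use, one would snapshot the file before and after the workload, e.g. `before = parse_vmstat(open("/proc/vmstat").read())`, run the build, snapshot again, and print `swap_deltas(before, after)`.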