Date: Fri, 10 Jan 2025 10:26:25 +0000
From: Usama Arif
To: Barry Song <21cnbao@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux Memory Management List,
    Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
References: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>

On 10/01/2025 10:09, Barry Song wrote:
> Hi Usama,
>
> Please include me in the discussion. I'll try to attend, at least remotely.
>
> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif wrote:
>>
>> I would like to propose a session to discuss the work going on
>> around large folio swapin, whether its traditional swap or
>> zswap or zram.
>>
>> Large folios have obvious advantages that have been discussed before
>> like fewer page faults, batched PTE and rmap manipulation, reduced
>> lru list, TLB coalescing (for arm64 and amd).
>> However, swapping in large folios has its own drawbacks like higher
>> swap thrashing.
>> I had initially sent a RFC of zswapin of large folios in [1]
>> but it causes a regression due to swap thrashing in kernel
>> build time, which I am confident is happening with zram large
>> folio swapin as well (which is merged in kernel).
>>
>> Some of the points we could discuss in the session:
>>
>> - What is the right (preferably open source) benchmark to test for
>> swapin of large folios? kernel build time in limited
>> memory cgroup shows a regression, microbenchmarks show a massive
>> improvement, maybe there are benchmarks where TLB misses is
>> a big factor and show an improvement.
>
> My understanding is that it largely depends on the workload. In interactive
> scenarios, such as on a phone, swap thrashing is not an issue because
> there is minimal to no thrashing for the app occupying the screen
> (foreground). In such cases, swap bandwidth becomes the most critical factor
> in improving app switching speed, especially when multiple applications
> are switching between background and foreground states.
>
>>
>> - We could have something like
>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
>> to enable/disable swapin but its going to be difficult to tune, might
>> have different optimum values based on workloads and are likely to be
>> left at their default values. Is there some dynamic way to decide when
>> to swapin large folios and when to fallback to smaller folios?
>> swapin_readahead swapcache path which only supports 4K folios atm has a
>> read ahead window based on hits, however readahead is a folio flag and
>> not a page flag, so this method can't be used as once a large folio
>> is swapped in, we won't get a fault and subsequent hits on other
>> pages of the large folio won't be recorded.
>>
>> - For zswap and zram, it might be that doing larger block compression/
>> decompression might offset the regression from swap thrashing, but it
>> brings about its own issues. For e.g. once a large folio is swapped
>> out, it could fail to swapin as a large folio and fallback
>> to 4K, resulting in redundant decompressions.
>
> That's correct. My current workaround involves swapping four small folios,
> and zsmalloc will compress and decompress in chunks of four pages,
> regardless of the actual size of the mTHP - The improvement in compression
> ratio and speed becomes less significant after exceeding four pages, even
> though there is still some increase.
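
To put a rough number on the redundant decompressions mentioned above, here
is a minimal sketch. It assumes, purely for illustration, that every 4K fault
has to decompress the whole compressed chunk containing the faulting page and
that nothing is cached between faults. The function names are hypothetical
and do not correspond to any kernel interface.

```bash
#!/bin/bash
# Hypothetical helper: pages decompressed when a folio of nr_pages pages,
# compressed in chunks of chunk_pages pages, is swapped back in one 4K page
# at a time. Assumes each fault decompresses its entire chunk (no caching).
fallback_decompressed_pages() {
    local nr_pages=$1 chunk_pages=$2
    echo $(( nr_pages * chunk_pages ))
}

# Redundant work compared with swapping the folio in as a whole, where each
# of the nr_pages pages is decompressed exactly once.
redundant_decompressed_pages() {
    local nr_pages=$1 chunk_pages=$2
    echo $(( nr_pages * chunk_pages - nr_pages ))
}

# Example: a 64kB folio (16 pages of 4K) compressed in chunks of four pages.
fallback_decompressed_pages 16 4     # prints 64
redundant_decompressed_pages 16 4    # prints 48
```

Under these assumptions the cost of a fallback grows with the chunk size,
which is one reason a small fixed chunk limits the damage when a large folio
cannot be allocated at swapin time.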
>
> Our recent experiments on phone also show that enabling direct reclamation
> for do_swap_page() to allocate 2-order mTHP results in a 0% allocation
> failure rate - this probably removes the need for fallbacking to 4 small
> folios. (Note that our experiments include Yu's TAO—Android GKI has
> already merged it. However, since 2 is less than
> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> without Yu's TAO, although I have not confirmed this.)
>

Hi Barry,

Thanks for the comments!

I haven't seen any activity on TAO on the mailing list recently. Do you know
if there are any plans for it to be sent for upstream review? I have cc'ed
Yu Zhao as well.

>> This will also mean swapin of large folios from traditional swap
>> isn't something we should proceed with?
>>
>> - Should we even support large folio swapin? You often have high swap
>> activity when the system/cgroup is close to running out of memory, at this
>> point, maybe the best way forward is to just swapin 4K pages and let
>> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
>> as well.
>
> This approach might be suitable for non-interactive scenarios, such as building
> a kernel within a memory control group (memcg) or running other server
> applications. However, performing collapse in interactive and power-sensitive
> scenarios would be unnecessary and could lead to wasted power due to
> memory migration and unmap/map operations.
>
> However, it is quite challenging to automatically determine the type of
> workloads the system is running. I feel we still need a global control to
> decide whether to enable mTHP swap-in—not necessarily per size, but at least
> at a global level. That said, there is evident resistance to introducing
> additional controls to enable or disable mTHP features.
>
> By the way, Usama, have you ever tried switching between mglru and the
> traditional active/inactive LRU?
> My experience shows a significant difference in swap thrashing
> —active/inactive LRU exhibits much less swap thrashing in my local kernel
> build tests.
>

I never tried with MGLRU enabled, so I am probably seeing the lowest amount
of swap-thrashing.

Thanks,
Usama

> the latest mm-unstable
>
> *********** default mglru: ***********
>
> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> *** Executing round 1 ***
> real 6m44.561s
> user 46m53.274s
> sys 3m48.585s
> pswpin: 1286081
> pswpout: 3147936
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 714580
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 286881
> pgpgin: 17199072
> pgpgout: 21493892
> swpout_zero: 229163
> swpin_zero: 84353
>
> ******** disable mglru ********
>
> root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
>
> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> *** Executing round 1 ***
> real 6m27.944s
> user 46m41.832s
> sys 3m30.635s
> pswpin: 474036
> pswpout: 1434853
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 331755
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 106333
> pgpgin: 11763720
> pgpgout: 14551524
> swpout_zero: 145050
> swpin_zero: 87981
>
> my build script:
>
> #!/bin/bash
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> vmstat_path="/proc/vmstat"
> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>
> read_values() {
>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
>     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
>     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
>     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
>     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
>     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
>     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
> }
>
> for ((i=1; i<=1; i++))
> do
>     echo
>     echo "*** Executing round $i ***"
>     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>     echo 3 > /proc/sys/vm/drop_caches
>
>     #kernel build
>     initial_values=($(read_values))
>     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>     final_values=($(read_values))
>
>     echo "pswpin: $((final_values[0] - initial_values[0]))"
>     echo "pswpout: $((final_values[1] - initial_values[1]))"
>     echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
>     echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
>     echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
>     echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
>     echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
>     echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
>     echo "pgpgin: $((final_values[8] - initial_values[8]))"
>     echo "pgpgout: $((final_values[9] - initial_values[9]))"
>     echo "swpout_zero: $((final_values[10] - initial_values[10]))"
>     echo "swpin_zero: $((final_values[11] - initial_values[11]))"
>     sync
>     sleep 10
> done
>
>>
>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
>> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>
>> Thanks,
>> Usama
>
> Thanks
> Barry
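
For comparing the two runs quoted above, a little shell arithmetic is enough.
This is only a sketch: `ratio_pct` is a hypothetical helper that is not part
of the quoted build script, and the inputs are the counter deltas copied from
the single round reported for each configuration.

```bash
#!/bin/bash
# Hypothetical helper (not in the quoted script): print a as an integer
# percentage of b. Shell integer arithmetic truncates toward zero.
ratio_pct() {
    local a=$1 b=$2
    echo $(( a * 100 / b ))
}

# Counter deltas copied from the two runs quoted above:
# first argument is the MGLRU run, second is the active/inactive LRU run.
echo "pswpin,  mglru vs active/inactive: $(ratio_pct 1286081 474036)%"    # 271%
echo "pswpout, mglru vs active/inactive: $(ratio_pct 3147936 1434853)%"   # 219%
```

On these numbers the MGLRU run swapped in roughly 2.7 times as many pages and
swapped out roughly 2.2 times as many, while wall-clock build time differed by
about 17 seconds. Since each configuration was run for one round only, the
ratios are indicative rather than statistically meaningful.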