From: Barry Song <21cnbao@gmail.com>
Date: Fri, 10 Jan 2025 23:30:29 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
To: Usama Arif
Cc: lsf-pc@lists.linux-foundation.org, Linux Memory Management List,
    Johannes Weiner, Yosry Ahmed, Shakeel Butt, Yu Zhao
References: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>
On Fri, Jan 10, 2025 at 11:26 PM Usama Arif wrote:
>
>
> On 10/01/2025 10:09, Barry Song wrote:
> > Hi Usama,
> >
> > Please include me in the discussion. I'll try to attend, at least remotely.
> >
> > On Fri, Jan 10, 2025 at 9:06 AM Usama Arif wrote:
> >>
> >> I would like to propose a session to discuss the work going on
> >> around large folio swapin, whether it's traditional swap, zswap
> >> or zram.
> >>
> >> Large folios have obvious advantages that have been discussed before,
> >> like fewer page faults, batched PTE and rmap manipulation, reduced
> >> lru list, and TLB coalescing (for arm64 and amd).
> >> However, swapping in large folios has its own drawbacks, like higher
> >> swap thrashing.
> >> I had initially sent an RFC for zswapin of large folios in [1],
> >> but it causes a regression in kernel build time due to swap
> >> thrashing, which I am confident is happening with zram large
> >> folio swapin as well (which is merged in the kernel).
> >>
> >> Some of the points we could discuss in the session:
> >>
> >> - What is the right (preferably open source) benchmark to test for
> >> swapin of large folios? Kernel build time in a limited memory
> >> cgroup shows a regression, while microbenchmarks show a massive
> >> improvement; maybe there are benchmarks where TLB misses are
> >> a big factor and show an improvement.
> >
> > My understanding is that it largely depends on the workload. In interactive
> > scenarios, such as on a phone, swap thrashing is not an issue because
> > there is minimal to no thrashing for the app occupying the screen
> > (foreground). In such cases, swap bandwidth becomes the most critical factor
> > in improving app switching speed, especially when multiple applications
> > are switching between background and foreground states.
> >
> >>
> >> - We could have something like
> >> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> >> to enable/disable swapin, but it's going to be difficult to tune, might
> >> have different optimum values based on workloads, and is likely to be
> >> left at the default values.
> >> Is there some dynamic way to decide when to swap in large
> >> folios and when to fall back to smaller folios?
> >> The swapin_readahead swapcache path, which only supports 4K folios
> >> atm, has a readahead window based on hits; however, readahead is a
> >> folio flag and not a page flag, so this method can't be used: once a
> >> large folio is swapped in, we won't get a fault, and subsequent hits
> >> on other pages of the large folio won't be recorded.
> >>
> >> - For zswap and zram, it might be that doing larger block compression/
> >> decompression could offset the regression from swap thrashing, but it
> >> brings about its own issues. E.g. once a large folio is swapped
> >> out, it could fail to swap in as a large folio and fall back
> >> to 4K, resulting in redundant decompressions.
> >
> > That's correct. My current workaround involves swapping four small folios,
> > and zsmalloc will compress and decompress in chunks of four pages,
> > regardless of the actual size of the mTHP. The improvement in compression
> > ratio and speed becomes less significant after exceeding four pages, even
> > though there is still some increase.
> >
> > Our recent experiments on phones also show that enabling direct reclamation
> > for do_swap_page() to allocate 2-order mTHPs results in a 0% allocation
> > failure rate; this probably removes the need for falling back to 4 small
> > folios. (Note that our experiments include Yu's TAO, which Android GKI has
> > already merged. However, since 2 is less than PAGE_ALLOC_COSTLY_ORDER,
> > we might achieve similar results even without Yu's TAO, although I have
> > not confirmed this.)
> >
>
> Hi Barry,
>
> Thanks for the comments!
>
> I haven't seen any activity on TAO on the mailing list recently. Do you know
> if there are any plans for it to be sent for upstream review?
> Have cc-ed Yu Zhao as well.
>
> >> This will also mean swapin of large folios from traditional swap
> >> isn't something we should proceed with?
> >>
> >> - Should we even support large folio swapin? You often have high swap
> >> activity when the system/cgroup is close to running out of memory; at
> >> this point, maybe the best way forward is to just swap in 4K pages and
> >> let khugepaged [2], [3] collapse them if the surrounding pages are
> >> swapped in as well.
> >
> > This approach might be suitable for non-interactive scenarios, such as
> > building a kernel within a memory control group (memcg) or running other
> > server applications. However, performing collapse in interactive and
> > power-sensitive scenarios would be unnecessary and could lead to wasted
> > power due to memory migration and unmap/map operations.
> >
> > However, it is quite challenging to automatically determine the type of
> > workloads the system is running. I feel we still need a global control
> > to decide whether to enable mTHP swap-in, not necessarily per size, but
> > at least at a global level. That said, there is evident resistance to
> > introducing additional controls to enable or disable mTHP features.
> >
> > By the way, Usama, have you ever tried switching between mglru and the
> > traditional active/inactive LRU? My experience shows a significant
> > difference in swap thrashing: the active/inactive LRU exhibits much
> > less swap thrashing in my local kernel build tests.
> >
>
> I never tried with MGLRU enabled, so I am probably seeing the lowest
> amount of swap thrashing.

Are you sure, Usama? mglru is enabled by default, so I have to echo 0 to
manually disable it.
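(For anyone reproducing the comparison below, this is a small sketch of how
to interpret the MGLRU toggle. The sysfs path and the hex bitmask format
assume a kernel built with CONFIG_LRU_GEN=y, and the `lru_mode` helper name
is mine, not from any kernel interface.)

```shell
#!/bin/sh
# Map the value read from /sys/kernel/mm/lru_gen/enabled to a readable mode.
# 0 (0x0000) means the classic active/inactive LRU; any non-zero bits mean
# multi-gen LRU features are enabled.
lru_mode() {
    case "$1" in
        0x0000|0) echo "classic active/inactive LRU" ;;
        *)        echo "multi-gen LRU (MGLRU)" ;;
    esac
}

# On a live system the value would come from sysfs, e.g.:
#   lru_mode "$(cat /sys/kernel/mm/lru_gen/enabled)"
lru_mode 0x0007
lru_mode 0x0000
```

The first call reports MGLRU (a plausible default bitmask on lru_gen
kernels), the second reports the classic LRU, matching the state after
`echo 0 > /sys/kernel/mm/lru_gen/enabled`.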
>
> Thanks,
> Usama

> > the latest mm-unstable
> >
> > *********** default mglru: ***********
> >
> > root@barry-desktop:/home/barry/develop/linux# ./build.sh
> > *** Executing round 1 ***
> > real    6m44.561s
> > user    46m53.274s
> > sys     3m48.585s
> > pswpin: 1286081
> > pswpout: 3147936
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 714580
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 286881
> > pgpgin: 17199072
> > pgpgout: 21493892
> > swpout_zero: 229163
> > swpin_zero: 84353
> >
> > ******** disable mglru ********
> >
> > root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
> >
> > root@barry-desktop:/home/barry/develop/linux# ./build.sh
> > *** Executing round 1 ***
> > real    6m27.944s
> > user    46m41.832s
> > sys     3m30.635s
> > pswpin: 474036
> > pswpout: 1434853
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 331755
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 106333
> > pgpgin: 11763720
> > pgpgout: 14551524
> > swpout_zero: 145050
> > swpin_zero: 87981
> >
> > my build script:
> >
> > #!/bin/bash
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >
> > vmstat_path="/proc/vmstat"
> > thp_base_path="/sys/kernel/mm/transparent_hugepage"
> >
> > read_values() {
> >     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
> >     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
> >     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
> >     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
> >     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
> >     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
> >     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
> >     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
> >     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
> >     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
> >     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
> >     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
> >     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
> > }
> >
> > for ((i=1; i<=1; i++))
> > do
> >     echo
> >     echo "*** Executing round $i ***"
> >     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >     echo 3 > /proc/sys/vm/drop_caches
> >
> >     # kernel build
> >     initial_values=($(read_values))
> >     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
> >     final_values=($(read_values))
> >
> >     echo "pswpin: $((final_values[0] - initial_values[0]))"
> >     echo "pswpout: $((final_values[1] - initial_values[1]))"
> >     echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
> >     echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
> >     echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
> >     echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
> >     echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
> >     echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
> >     echo "pgpgin: $((final_values[8] - initial_values[8]))"
> >     echo "pgpgout: $((final_values[9] - initial_values[9]))"
> >     echo "swpout_zero: $((final_values[10] - initial_values[10]))"
> >     echo "swpin_zero: $((final_values[11] - initial_values[11]))"
> >     sync
> >     sleep 10
> > done
> >
> >>
> >> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> >> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
> >> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>
> >> Thanks,
> >> Usama

Thanks
Barry
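(As an aside for anyone adapting the build script quoted above: its core
measurement idea is a before/after counter delta on a "name value" file
such as /proc/vmstat. The sketch below isolates just that pattern; the
`counter` helper, the temp files, and the sample numbers are invented for
illustration, though the deltas deliberately mirror the pswpin/pswpout
figures from the default-mglru run.)

```shell
#!/bin/sh
# Sketch of the before/after delta technique: snapshot a vmstat-style
# "name value" file twice, then subtract matching counters.
counter() {
    # $1 = counter name, $2 = snapshot file; prints the counter's value.
    awk -v k="$1" '$1 == k { print $2 }' "$2"
}

# Sample snapshots standing in for two reads of /proc/vmstat.
before=$(mktemp) && after=$(mktemp)
printf 'pswpin 100\npswpout 250\n'         > "$before"
printf 'pswpin 1286181\npswpout 3148186\n' > "$after"

echo "pswpin: $(( $(counter pswpin "$after") - $(counter pswpin "$before") ))"
echo "pswpout: $(( $(counter pswpout "$after") - $(counter pswpout "$before") ))"

rm -f "$before" "$after"
```

With these sample snapshots the output is the delta during the measured
interval (1286081 swapped-in and 3147936 swapped-out pages), which is
exactly what the script's `initial_values`/`final_values` arrays compute
field by field.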