Message-ID: <081a1173-2d71-427b-ad26-16c8d1d99628@linux.dev>
Date: Sat, 11 Jan 2025 11:52:12 +0100
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
To: Nhat Pham , Usama Arif
Cc: lsf-pc@lists.linux-foundation.org, Linux Memory Management List , Johannes Weiner , Barry Song <21cnbao@gmail.com>, Yosry Ahmed , Shakeel Butt
References: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>
From: Zhu Yanjun <yanjun.zhu@linux.dev>

On 2025/1/10 5:29, Nhat Pham wrote:
> On Fri, Jan 10, 2025 at 3:08 AM Usama Arif wrote:
>>
>> I would like to propose a session to discuss the work going on
>> around large folio swapin, whether it's traditional swap or
>> zswap or zram.
>
> I'm interested!
Count me in the discussion :) I am also interested in this topic and hope to join the meeting.

Zhu Yanjun

>
>> Large folios have obvious advantages that have been discussed before,
>> like fewer page faults, batched PTE and rmap manipulation, a shorter
>> LRU list, and TLB coalescing (on arm64 and AMD).
>> However, swapping in large folios has its own drawbacks, like higher
>> swap thrashing.
>> I had initially sent an RFC for zswapin of large folios in [1],
>> but it causes a regression in kernel build time due to swap
>> thrashing, which I am confident is happening with zram large
>> folio swapin as well (which is merged in the kernel).
>>
>> Some of the points we could discuss in the session:
>>
>> - What is the right (preferably open source) benchmark to test
>> swapin of large folios? Kernel build time in a limited-memory
>> cgroup shows a regression, while microbenchmarks show a massive
>> improvement; maybe there are benchmarks where TLB misses are
>> a big factor and show an improvement.
>>
>> - We could have something like
>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
>> to enable/disable swapin, but it's going to be difficult to tune, might
>> have different optimum values based on workloads, and is likely to be
>
> Might even be different across memory regions.
>
>> left at its default value. Is there some dynamic way to decide when
>> to swap in large folios and when to fall back to smaller folios?
>> The swapin_readahead swapcache path, which only supports 4K folios at
>> the moment, has a readahead window based on hits; however, readahead is
>> a folio flag and not a page flag, so this method can't be used: once a
>> large folio is swapped in, we won't get a fault, and subsequent hits on
>> other pages of the large folio won't be recorded.
>
> Is this beneficial/useful enough to make it into a page flag?
>
> Can we push this to the swap layer, i.e. record the hit information on
> a per-swap-entry basis instead?
> The space is a bit tight, but we're
> already in talks for the new swap abstraction layer. If we go the
> dynamic route, we can squeeze this kind of information into the
> dynamically allocated per-swap-entry metadata structure (swap
> descriptor?).
>
> However, the swap entry can go away after a swapin (see
> should_try_to_free_swap()), so that might be busted :)
>
>>
>> - For zswap and zram, it might be that larger block compression/
>> decompression offsets the regression from swap thrashing, but it
>> brings its own issues. For example, once a large folio is swapped
>> out, it could fail to swap in as a large folio and fall back
>> to 4K, resulting in redundant decompressions.
>> Would this also mean swapin of large folios from traditional swap
>> isn't something we should proceed with?
>
> Yeah, the cost/benefit analysis differs between backends. I wonder if a
> one-size-fits-all, backend-agnostic policy could ever work - maybe we
> need some backend-driven algorithm, or some sort of hinting mechanism?
>
> This would make the logic uglier, though. We've been here before with
> HDD and SSD swap, except we don't really care about the former, so we
> can prioritize optimizing for SSD swap (in fact, it looks like we're
> removing the HDD portion of the swap allocator). In this case, however,
> zswap, zram, and SSD swap are all valid options, with different
> characteristics that can make the optimal decision differ :)
>
> If we're going the block (de)compression route, there is also the
> pesky block-size question. For instance, do we want to store the
> entire 2MB in a single block? That would mean we need to decompress
> the entire 2MB block at load time. It might be more straightforward in
> the mTHP world, but we do need to consider 2MB THP users too.
>
> Finally, the calculus might change once large folio allocation becomes
> more reliable. Perhaps we can wait until Johannes and Yu make this
> work?
>
>>
>> - Should we even support large folio swapin?
>> You often have high swap
>> activity when the system/cgroup is close to running out of memory; at that
>> point, maybe the best way forward is to just swap in 4K pages and let
>> khugepaged [2], [3] collapse them if the surrounding pages are swapped in
>> as well.
>
> Perhaps this is the easiest thing to do :)
>