From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 10 Jan 2025 23:28:43 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
To: Nhat Pham
Cc: Usama Arif, lsf-pc@lists.linux-foundation.org,
 Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt

On Fri, Jan 10, 2025 at 5:29 PM Nhat Pham wrote:
>
> On Fri, Jan 10, 2025 at 3:08 AM Usama Arif wrote:
> >
> > I would like to propose a session to discuss the work going on
> > around large folio swapin, whether it's traditional swap, zswap,
> > or zram.
>
> I'm interested! Count me in the discussion :)
>
> >
> > Large folios have obvious advantages that have been discussed before,
> > like fewer page faults, batched PTE and rmap manipulation, shorter
> > LRU lists, and TLB coalescing (on arm64 and AMD).
> > However, swapping in large folios has its own drawbacks, like higher
> > swap thrashing.
> > I had initially sent an RFC for zswapin of large folios in [1],
> > but it causes a regression in kernel build time due to swap
> > thrashing, which I am confident is happening with zram large
> > folio swapin as well (which is merged in the kernel).
> >
> > Some of the points we could discuss in the session:
> >
> > - What is the right (preferably open source) benchmark to test
> > swapin of large folios? Kernel build time in a limited-memory
> > cgroup shows a regression, while microbenchmarks show a massive
> > improvement; maybe there are benchmarks where TLB misses are
> > a big factor and would show an improvement.
> >
> > - We could have something like
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> > to enable/disable swapin, but it's going to be difficult to tune,
> > might have different optimum values based on workloads, and is
> > likely to be
>
> Might even be different across memory regions.
>
> > left at the default values. Is there some dynamic way to decide
> > when to swap in large folios and when to fall back to smaller
> > folios? The swapin_readahead swapcache path, which only supports
> > 4K folios atm, has a readahead window based on hits; however,
> > readahead is a folio flag, not a page flag, so this method can't
> > be used: once a large folio is swapped in, we won't get a fault,
> > and subsequent hits on the other pages of the large folio won't
> > be recorded.
>
> Is this beneficial/useful enough to make it into a page flag?
>
> Can we push this to the swap layer, i.e. record the hit information
> on a per-swap-entry basis instead? The space is a bit tight, but
> we're already discussing the new swap abstraction layer. If we go
> the dynamic route, we can squeeze this kind of information into the
> dynamically allocated per-swap-entry metadata structure (swap
> descriptor?).
>
> However, the swap entry can go away after a swapin (see
> should_try_to_free_swap()), so that might be busted :)
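
As a very rough sketch of what that per-swap-entry route could look
like (all names below are hypothetical, none of this is existing
kernel API, and it ignores the should_try_to_free_swap() problem you
mention):

#include <stdint.h>

/* Hypothetical stand-ins for kernel types, only so the sketch is
 * self-contained; the real thing would live in the swap layer. */
typedef uint64_t swp_entry_t;

struct swap_desc {
	swp_entry_t entry;   /* the slot this descriptor covers */
	uint8_t hits;        /* saturating hit count for this entry */
};

/* On a "minor hit" -- a fault finds the entry still in swap (or in
 * the swapcache) -- remember that it was touched. */
static void swap_desc_record_hit(struct swap_desc *desc)
{
	if (desc->hits < UINT8_MAX)
		desc->hits++;
}

/* At fault time: a hot history suggests trying a large folio, a cold
 * one falls back to a 4K swapin (the threshold here is made up). */
static int swapin_suggest_order(const struct swap_desc *desc, int max_order)
{
	return desc->hits >= 4 ? max_order : 0;
}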
>
> >
> > - For zswap and zram, it might be that larger block compression/
> > decompression offsets the regression from swap thrashing, but it
> > brings its own issues. For example, once a large folio is swapped
> > out, it could fail to swap in as a large folio and fall back to
> > 4K, resulting in redundant decompressions.
> > This will also mean swapin of large folios from traditional swap
> > isn't something we should proceed with?
>
> Yeah, the cost/benefit analysis differs between backends. I wonder
> if a one-size-fits-all, backend-agnostic policy could ever work -
> maybe we need some backend-driven algorithm, or some sort of hinting
> mechanism?
>
> This would make the logic uglier, though. We've been here before with
> HDD and SSD swap, except we don't really care about the former, so we
> can prioritize optimizing for SSD swap (in fact, it looks like we're
> removing the HDD portion of the swap allocator). In this case,
> however, zswap, zram, and SSD swap are all valid options, with
> different characteristics that can make the optimal decision differ :)
>
> If we're going the block (de)compression route, there is also this
> pesky block-size question. For instance, do we want to store the
> entire 2MB in a single block? That would mean we need to decompress
> the entire 2MB block at load time. It might be more straightforward
> in the mTHP world, but we do need to consider 2MB THP users too.

I don't think we need to save the entire 2MB in a single block. Beyond
64KB, we don't see much improvement in compression ratio or speed; the
most significant gain is between 4KB and 16KB. For example, for zstd
on a 182502912-byte file:

Block size   Compression   Decompression   Compressed size   Ratio
4KB          0.967303 s    0.200064 s      66089193 bytes    36.21%
16KB         0.567167 s    0.152807 s      59159073 bytes    32.42%
32KB         0.543887 s    0.136602 s      57958701 bytes    31.76%
64KB         0.536979 s    0.127069 s      56700795 bytes    31.07%
128KB        0.540505 s    0.120685 s      55765775 bytes    30.56%
256KB        0.575515 s    0.125049 s      54203461 bytes    29.70%
512KB        0.571370 s    0.119609 s      53914422 bytes    29.54%
1024KB       0.556631 s    0.119475 s      53239893 bytes    29.17%
2048KB       0.539796 s    0.119751 s      52923234 bytes    29.00%

To simplify things (and reduce the potential decompression of large
blocks for small swap-ins), for a 2MB THP we actually save it as
2MB/16KB = 128 independent 16KB blocks in zsmalloc, as shown in the
RFC:

https://lore.kernel.org/linux-mm/20241121222521.83458-1-21cnbao@gmail.com/
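
In userspace terms, the chunking amounts to something like the toy
below (a standalone libzstd sketch, not the RFC code itself); the
point is that a later 4KB swap-in only has to decompress the single
16KB block covering the faulting page:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>	/* build with: gcc blocks.c -lzstd */

#define FOLIO_SIZE (2UL << 20)               /* one 2MB THP */
#define BLOCK_SIZE (16UL << 10)              /* 16KB compression unit */
#define NR_BLOCKS  (FOLIO_SIZE / BLOCK_SIZE) /* = 128 blocks */

int main(void)
{
	char *folio = malloc(FOLIO_SIZE);
	size_t bound = ZSTD_compressBound(BLOCK_SIZE);
	char *out = malloc(bound);
	size_t total = 0;

	if (!folio || !out)
		return 1;
	memset(folio, 'x', FOLIO_SIZE);      /* stand-in page contents */

	for (unsigned long i = 0; i < NR_BLOCKS; i++) {
		/* each 16KB block compresses independently, so any
		 * single block can be decompressed on its own later */
		size_t csize = ZSTD_compress(out, bound,
					     folio + i * BLOCK_SIZE,
					     BLOCK_SIZE, 3);
		if (ZSTD_isError(csize))
			return 1;
		total += csize;              /* would go to zsmalloc */
	}
	printf("%lu blocks, %zu bytes compressed\n",
	       (unsigned long)NR_BLOCKS, total);
	free(out);
	free(folio);
	return 0;
}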
> Finally, the calculus might change once large folio allocation
> becomes more reliable. Perhaps we can wait until Johannes and Yu
> make this work?
>
> >
> > - Should we even support large folio swapin? You often have high
> > swap activity when the system/cgroup is close to running out of
> > memory; at that point, maybe the best way forward is to just swap
> > in 4K pages and let khugepaged [2], [3] collapse them if the
> > surrounding pages are swapped in as well.
>
> Perhaps this is the easiest thing to do :)

Thanks
Barry