From: Nhat Pham <nphamcs@gmail.com>
Date: Fri, 10 Jan 2025 11:29:23 +0700
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
To: Usama Arif
Cc: lsf-pc@lists.linux-foundation.org, Linux Memory Management List <linux-mm@kvack.org>, Johannes Weiner, Barry Song <21cnbao@gmail.com>, Yosry Ahmed, Shakeel Butt
In-Reply-To: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>
On Fri, Jan 10, 2025 at 3:08 AM Usama Arif wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether it's traditional swap, zswap,
> or zram.

I'm interested! Count me in the discussion :)

> Large folios have obvious advantages that have been discussed
> before, like fewer page faults, batched PTE and rmap manipulation,
> a shorter LRU list, and TLB coalescing (on arm64 and AMD).
> However, swapping in large folios has its own drawbacks, like
> higher swap thrashing.
> I had initially sent an RFC for zswapin of large folios in [1],
> but it causes a regression in kernel build time due to swap
> thrashing, which I am confident is happening with zram large folio
> swapin as well (which is merged in the kernel).
>
> Some of the points we could discuss in the session:
>
> - What is the right (preferably open source) benchmark to test
> swapin of large folios with? Kernel build time in a limited-memory
> cgroup shows a regression, while microbenchmarks show a massive
> improvement; maybe there are benchmarks where TLB misses are a big
> factor and show an improvement.
>
> - We could have something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> to enable/disable swapin, but it's going to be difficult to tune,
> might have different optimum values based on workloads, and is
> likely to be left at its default value.

Might even be different across memory regions.

> Is there some dynamic way to decide when to swap in large folios
> and when to fall back to smaller folios? The swapin_readahead
> swapcache path, which only supports 4K folios atm, has a readahead
> window based on hits; however, readahead is a folio flag and not a
> page flag, so this method can't be used: once a large folio is
> swapped in, we won't get a fault, and subsequent hits on other
> pages of the large folio won't be recorded.

Is this beneficial/useful enough to make it into a page flag?

Can we push this to the swap layer, i.e. record the hit information
on a per-swap-entry basis instead? The space is a bit tight, but
we're already in the talk for the new swap abstraction layer. If we
go the dynamic route, we can squeeze this kind of information into
the dynamically allocated per-swap-entry metadata structure (swap
descriptor?).
However, the swap entry can go away after a swapin (see
should_try_to_free_swap()), so that might be busted :)

> - For zswap and zram, it might be that doing larger block
> compression/decompression might offset the regression from swap
> thrashing, but it brings about its own issues. For example, once a
> large folio is swapped out, it could fail to swap in as a large
> folio and fall back to 4K, resulting in redundant decompressions.
> Would this also mean swapin of large folios from traditional swap
> isn't something we should proceed with?

Yeah, the cost/benefit analysis differs between backends. I wonder
if a one-size-fits-all, backend-agnostic policy could ever work -
maybe we need some backend-driven algorithm, or some sort of hinting
mechanism? This would make the logic uglier though.

We've been here before with HDD and SSD swap, except we don't really
care about the former, so we can prioritize optimizing for SSD swap
(in fact, it looks like we're removing the HDD portion of the swap
allocator). In this case, however, zswap, zram, and SSD swap are all
valid options, with different characteristics that can make the
optimal decision differ :)

If we're going the block (de)compression route, there is also the
pesky block-size question. For instance, do we want to store the
entire 2MB in a single block? That would mean we need to decompress
the entire 2MB block at load time. It might be more straightforward
in the mTHP world, but we do need to consider 2MB THP users too.

Finally, the calculus might change once large folio allocation
becomes more reliable. Perhaps we can wait until Johannes and Yu
make this work?

> - Should we even support large folio swapin? You often have high
> swap activity when the system/cgroup is close to running out of
> memory; at this point, maybe the best way forward is to just swap
> in 4K pages and let khugepaged [2], [3] collapse them if the
> surrounding pages are swapped in as well.

Perhaps this is the easiest thing to do :)