From: Chris Li
Date: Tue, 28 May 2024 23:50:47 -0700
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Matthew Wilcox
Cc: Karim Manaouil, Jan Kara, Chuanhua Han, linux-mm, lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com

On Tue, May 28, 2024 at 8:57 PM Matthew Wilcox wrote:
>
> On Tue, May 21, 2024 at 01:40:56PM -0700, Chris Li wrote:
> > > Filesystems already implemented a lot of solutions for fragmentation
> > > avoidance that are more appropriate for slow storage media.
> >
> > Swap and file systems have very different requirements and usage
> > patterns and IO patterns.
>
> Should they, though?
> Filesystems noticed that handling pages in LRU
> order was inefficient and so they stopped doing that (see the removal
> of aops->writepage in favour of ->writepages, along with where each are
> called from).  Maybe it's time for swap to start doing writes in the order
> of virtual addresses within a VMA, instead of LRU order.

Well, swap has one fundamental difference from a file system: dirty
file system cache eventually has to be written to its backing file at
least once, otherwise you lose the data when the machine reboots. In
the anonymous memory case, a dirty page does not have to be written to
swap. It is optional, so which page you choose to swap out is critical:
you want to swap out the coldest page, the page that is least likely to
be swapped in again. Therefore, the LRU makes sense.

With VMA-order swap out, the question is: which VMA do you pick from
first? To make things more complicated, the same page can be mapped
into different processes, in more than one VMA.

> Indeed, if we're open to radical ideas, the LRU sucks.  A physical scan
> is 40x faster:
> https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@casper.infradead.org/

That simulation assumes the page struct already has the access
information. At the physical CPU level, the accessed bit lives in the
PTE. If you scan in physical page order, you need to use rmap to find
the PTEs and check the accessed bit. It is not a simple walk in pfn
order. You need to scan the PTEs first, then move the access
information into the page struct.

>
> > One challenging aspect is that the current swap back end has a very
> > low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> > byte (swap cgroup), 8 byte(swap cache pointer). The inode struct is
> > more than 64 bytes per file. That is a big jump if you map a swap
> > entry to a file. If you map more than one swap entry to a file, then
> > you need to track the mapping of file offset to swap entry, and the
> > reverse lookup of swap entry to a file with offset.
> > Whichever way you
> > cut it, it will significantly increase the per swap entry memory
> > overhead.
>
> Not necessarily, no.  If your workload uses a lot of order-2, order-4
> and order-9 folios, then the current scheme is using 11 bytes per page,
> so 44 bytes per order-2 folio, 176 per order-4 folio and 5632 per
> order-9 folio.  That's a lot of bytes we can use for an extent-based
> scheme.

Yes, if we allow dynamic allocation of swap entries, i.e. the 24B
option. Then the sub-entries inside the compound swap entry structure
can share the same compound swap struct pointer.

>
> Also, why would you compare the size of an inode to the size of an
> inode?  inode is ~equivalent to an anon_vma, not to a swap entry.

I am not assigning an inode to one swap entry. That is covered in my
description of "if you map more than one swap entry to a file". If you
want to map each inode to an anon_vma, you need a way to map the inode
and file offset into the swap entry encoding. In your anon_vma-as-inode
world, how do you deal with two different VMAs containing the same
page?

Once we have more detail on the swap entry mapping scheme, we can
analyse the pros and cons.

Chris
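
For reference, the per-folio numbers quoted above can be reproduced
with a few lines of arithmetic. This is only a sketch: the three sizes
are the ones stated in the thread (1 byte swap_map, 2 bytes swap
cgroup, 8 bytes swap cache pointer), while the constant and function
names are made up for illustration and do not correspond to kernel
symbols.

```python
# Per-swap-entry metadata sizes as quoted in the thread, in bytes.
# The names below are illustrative only, not kernel identifiers.
SWAP_MAP_BYTES = 1      # swap_map usage count
SWAP_CGROUP_BYTES = 2   # swap cgroup id
SWAP_CACHE_BYTES = 8    # swap cache pointer

PER_ENTRY_BYTES = SWAP_MAP_BYTES + SWAP_CGROUP_BYTES + SWAP_CACHE_BYTES


def folio_overhead_bytes(order: int) -> int:
    """Metadata bytes a per-page scheme spends on one folio of 2**order pages."""
    return PER_ENTRY_BYTES << order


if __name__ == "__main__":
    for order in (0, 2, 4, 9):
        pages = 1 << order
        print(f"order-{order}: {pages} pages, {folio_overhead_bytes(order)} bytes")
```

The per-page cost is constant, so the total grows linearly with the
folio size. That total is the budget an extent-based or compound-entry
scheme would have to stay under to come out ahead.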