From: Chris Li
Date: Thu, 30 May 2024 15:53:49 -0700
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Matthew Wilcox
Cc: Karim Manaouil, Jan Kara, Chuanhua Han, linux-mm, lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com> <20240307140344.4wlumk6zxustylh6@quack3>

On Wed, May 29, 2024 at 5:33 AM Matthew Wilcox wrote:
>
> On Tue, May 28, 2024 at 11:50:47PM -0700, Chris Li wrote:
> > On Tue, May 28, 2024 at 8:57 PM Matthew Wilcox wrote:
> > >
> > > On Tue, May 21, 2024 at 01:40:56PM -0700, Chris Li wrote:
> > > > > Filesystems already implemented a lot of solutions for fragmentation
> > > > > avoidance that are more apropriate for slow
> > > > > storage media.
> > > >
> > > > Swap and file systems have very different requirements and usage
> > > > patterns and IO patterns.
> > >
> > > Should they, though? Filesystems noticed that handling pages in LRU
> > > order was inefficient and so they stopped doing that (see the removal
> > > of aops->writepage in favour of ->writepages, along with where each are
> > > called from). Maybe it's time for swap to start doing writes in the order
> > > of virtual addresses within a VMA, instead of LRU order.
> >
> > Well, swap has one fundamental difference than file system:
> > the dirty file system cache will need to eventually write to file
> > backing at least once, otherwise machine reboots you lose the data.
>
> Yes, that's why we write back data from the page cache every 30 seconds
> or so. It's still important to not write back too early, otherwise
> you need to write the same block multiple times. The differences aren't
> as stark as you're implying.
>
> > Where the anonymous memory case, the dirty page does not have to write
> > to swap. It is optional, so which page you choose to swap out is
> > critical, you want to swap out the coldest page, the page that is
> > least likely to get swapin. Therefore, the LRU makes sense.
>
> Disagree. There are two things you want and the LRU serves neither
> particularly well. One is that when you want to reclaim memory, you
> want to find some memory that is likely to not be accessed in the next
> few seconds/minutes/hours. It doesn't need to be the coldest, just in
> (say) the coldest 10% or so of memory. And it needs to already be clean,
> otherwise you have to wait for it to writeback, and you can't afford that.

Do you disagree that the LRU is necessary, or with the way we use the
LRU?

To find the coldest 10% or so of pages, I assume you still need to
maintain an LRU, no?
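
To make the question concrete, here is a toy model (Python, not kernel
code; the Page fields and the coldest_clean() helper are hypothetical
names made up for this sketch). Picking clean pages out of the coldest
10% still starts from some ordering by recency, which is the
information an LRU provides:

```python
from dataclasses import dataclass


@dataclass
class Page:
    pfn: int          # hypothetical page frame number
    last_access: int  # hypothetical tick of the most recent access
    dirty: bool       # True if the page would need writeback first


def coldest_clean(pages, percent=10):
    """Return the clean pages among the coldest `percent` of `pages`."""
    # Sorting by recency is the step that needs LRU-like information.
    by_age = sorted(pages, key=lambda p: p.last_access)
    cutoff = max(1, len(by_age) * percent // 100)
    return [p for p in by_age[:cutoff] if not p.dirty]


if __name__ == "__main__":
    pages = [Page(pfn=i, last_access=100 - i, dirty=(i % 3 == 0))
             for i in range(20)]
    print([p.pfn for p in coldest_clean(pages)])
```

The sketch simply assumes last_access exists. Whatever replaces the
LRU would have to supply something equivalent.
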
>
> The second thing you need to be able to do is find pages which are
> already dirty, and not likely to be written to soon, and write those
> back so they join the pool of clean pages which are eligible for reclaim.
> Again, the LRU isn't really the best tool for the job.

It seems you need the LRU to find which pages qualify for writeback:
they should be both dirty and cold.

The question is, can you do the reclaim writeback without an LRU for
anonymous pages? If the LRU is unavoidable, then it is a necessary evil.

>
> > In VMA swap out, the question is, which VMA you choose from first? To
> > make things more complicated, the same page can map into different
> > processes in more than one VMA as well.
>
> This is why we have the anon_vma, to handle the same pages mapped from
> multiple VMAs.

Can you clarify: when you use anon_vma to organize swap out and swap
in, do you want to write a range of pages rather than just one page at
a time? Would writing back a sublist of the LRU work for you?

Ideally we shouldn't write back pages that are hot. anon_vma alone does
not give us that information.

>
> > > Indeed, if we're open to radical ideas, the LRU sucks. A physical scan
> > > is 40x faster:
> > > https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@casper.infradead.org/
> >
> > That simulation assumes the page struct has access to information already.
> > On the physical CPU level, the access bit is from the PTE. If you scan
> > from physical page order, you need to use rmap to find the PTE to
> > check the access bit. It is not a simple pfn page order walk. You need
> > to scan the PTE first then move the access information into page
> > struct.
>
> We already maintain the dirty bit on the folio when we take a write-fault
> for file memory. If we do that for anon memory as well, we don't need
> to do an rmap walk at scan time.

You need to find out which 10% of pages are cold in order to swap them
out in the first place. That is before the write-fault can happen.
The write-fault does not help select which subset of pages to swap out.

>
> > > > One challenging aspect is that the current swap back end has a very
> > > > low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> > > > byte (swap cgroup), 8 byte(swap cache pointer). The inode struct is
> > > > more than 64 bytes per file. That is a big jump if you map a swap
> > > > entry to a file. If you map more than one swap entry to a file, then
> > > > you need to track the mapping of file offset to swap entry, and the
> > > > reverse lookup of swap entry to a file with offset. Whichever way you
> > > > cut it, it will significantly increase the per swap entry memory
> > > > overhead.
> > >
> > > Not necessarily, no. If your workload uses a lot of order-2, order-4
> > > and order-9 folios, then the current scheme is using 11 bytes per page,
> > > so 44 bytes per order-2 folio, 176 per order-4 folio and 5632 per
> > > order-9 folio. That's a lot of bytes we can use for an extent-based
> > > scheme.
> >
> > Yes, if we allow dynamic allocation of swap entry, the 24B option.
> > Then sub entries inside the compound swap entry structure can share
> > the same compound swap struct pointer.
> >
> > >
> > > Also, why would you compare the size of an inode to the size of an
> > > inode? inode is ~equivalent to an anon_vma, not to a swap entry.
> >
> > I am not assigning inode to one swap entry. That is covered in my
> > description of "if you map more than one swap entry to a file". If you
> > want to map each inode to anon_vma, you need to have a way to map
> > inode and file offset into swap entry encoding. In your anon_vma as
> > inode world, how do you deal with two different vma containing the
> > same page? Once we have more detail of the swap entry mapping scheme,
> > we can analyse the pros and cons.
>
> Are you confused between an anon_vma and an anon vma? The naming in
> this area is terrible.
> Maybe we should call it an mnode instead of an
> anon_vma. The parallel with inode would be more obvious ...

Yes, I was thinking about an anon vma. I am just now taking a look at
anon_vma and what it can do. Thanks for the pointer.

Chris
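
P.S. For reference, the per-folio figures quoted earlier in the thread
(11 bytes per swap entry, hence 44, 176 and 5632 bytes for order-2,
order-4 and order-9 folios) can be checked with a few lines of Python.
This is only back-of-the-envelope arithmetic; the constant and function
names are made up for illustration and are not kernel identifiers:

```python
# Per-swap-entry metadata sizes quoted in the thread, in bytes.
SWAP_MAP = 1        # swap_map count byte
SWAP_CGROUP = 2     # swap cgroup id
SWAP_CACHE_PTR = 8  # swap cache pointer
PER_ENTRY = SWAP_MAP + SWAP_CGROUP + SWAP_CACHE_PTR  # 11 bytes


def per_folio_bytes(order):
    """Per-entry metadata bytes covering one folio of 2**order pages."""
    return PER_ENTRY * (1 << order)


if __name__ == "__main__":
    for order in (2, 4, 9):
        print(f"order-{order}: {per_folio_bytes(order)} bytes")
```

That total is the per-folio budget an extent-based scheme would have
to fit within to avoid increasing the overhead.
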