From: Chris Li
Date: Thu, 16 May 2024 20:48:41 -0700
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Zi Yan
Cc: Jan Kara, Barry Song <21cnbao@gmail.com>, Jared Hulbert, Chuanhua Han, linux-mm, lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com, david@redhat.com, Kairui Song
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com> <20240307140344.4wlumk6zxustylh6@quack3> <20240314090339.kieqv4v4m6yyewn5@quack3>

Hi Zi,

On Thu, May 16, 2024 at 8:04 AM Zi Yan wrote:
>
> On 14 Mar 2024, at 5:03, Jan Kara wrote:
>
> > On Fri 08-03-24 05:17:46, Barry Song wrote:
> >> On Fri, Mar 8, 2024 at 5:06 AM Jared Hulbert wrote:
> >>>
> >>> On Thu, Mar 7, 2024 at 9:35 AM Jan Kara wrote:
> >>>>
> >>>> Well, but then if you fill in space of a particular order and need to swap
> >>>> out a page of that order what do you do? Return ENOSPC prematurely?
> >>>>
> >>>> Frankly as I'm reading the discussions here, it seems to me you are trying
> >>>> to reinvent a lot of things from the filesystem space :) Like block
> >>>> allocation with reasonably efficient fragmentation prevention, transparent
> >>>> data compression (zswap), hierarchical storage management (i.e., moving
> >>>> data between different backing stores), efficient way to get from
> >>>> VMA+offset to the place on disk where the content is stored. Sure you still
> >>>> don't need a lot of things modern filesystems do like permissions,
> >>>> directory structure (or even more complex namespacing stuff), all the stuff
> >>>> achieving fs consistency after a crash, etc. But still what you need is a
> >>>> notable portion of what filesystems do.
> >>>>
> >>>> So maybe it would be time to implement swap as a proper filesystem? Or even
> >>>> better we could think about factoring out these bits out of some existing
> >>>> filesystem to share code?
> >>>
> >>> Yes. Thank you. I've been struggling to communicate this.
> >>>
> >>> I'm thinking you can just use existing filesystems as a first step
> >>> with a modest glue layer. See the branch of this thread where I'm
> >>> babbling on to Chris about this.
> >>>
> >>> "efficient way to get from VMA+offset to place on the disk where
> >>> content is stored"
> >>> You mean treat swapped pages like they were mmap'ed files and use the
> >>> same code paths? How big of a project is that? That seems either
> >>> deceptively easy or really hard... I've been away too long and was
> >>> never really good enough to have a clear vision of the scale.
> >>
> >> I don't understand why we need this level of complexity. All we need to
> >> know are the offsets during pageout. After that, the large folio is
> >> destroyed, and all offsets are stored in page table entries (PTEs) or xa.
> >> Swap-in doesn't depend on a complex file system; it can make its own
> >> decision on how to swap-in based on the values it reads from PTEs.
> >
> > Well, but once compression chimes in (like with zswap) or if you need to
> > perform compaction on swap space and move swapped out data, things aren't
> > that simple anymore, are they? So as I was reading this thread I had the
> > impression that swap complexity is coming close to a complexity of a
> > (relatively simple) filesystem so I was brainstorming about possibility of
> > sharing some code between filesystems and swap...

There is a session on using a filesystem as the swap back end at LSF/MM.

>
> I think all the complexity comes from that we want to preserve folios as
> a whole, thus need to handle fragmentation issues. But Barry’s approach

Yes, we want to preserve the folio as a whole. The fragmentation is in
the swap entries on the swap file. These two are at two different
layers. It should be possible to keep the folio as a whole and still
write it out to fragmented swap entries.

> is trying to get us away from it. The downside is what you mentioned
> about compression, since 64KB should give better compression ratio than
> 4KB. For swap without compression, we probably can use Barry’s
> approach to keep everything simple, just split all folios when they go
> into swap, but I am not sure about if there is disk throughput loss.

I have some ideas about writing out a large folio to non-contiguous
swap entries without breaking up the folio. It will have the same
effect, in terms of swap entries and disk write side effects, as
Barry's folio-splitting approach. We can still track those fragmented
swap entries back to the compound swap entry they belong to. That is on
the last page of my talk slides (not the reference slide).

BTW, having the option to swap in as a large folio doesn't mean we have
to swap in as a large folio all the time. It should be a policy
decision above the swap back end.
The swap back end can support large or small folios as requested.

For zram, I suppose it is possible to modify zram so that
non-contiguous I/O vectors written together are compressed into one
internal compressed buffer in zsmalloc. If that buffer is read back
using the same I/O vectors, the reader gets the same data back.

Chris

> For zswap, there will be design tradeoff between better compression ratio
> and complexity.
>
> Best Regards,
> Yan, Zi
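
To make the "large folio written to non-contiguous swap entries" idea from the reply above more concrete, here is a minimal userspace sketch in C. It is not kernel code and not the design from the talk slides. `struct compound_swap_entry`, `compound_swap_alloc()`, `compound_swap_free()`, and the toy `slot_used` map are hypothetical names made up for illustration. The sketch assumes a swap file modeled as an array of 4KB slots and only shows that one record can track scattered slots for one folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS     16  /* size of the toy swap file, in 4KB slots */
#define MAX_SUBPAGES 16  /* largest folio this sketch handles */

/* Hypothetical: one record per swapped-out large folio. */
struct compound_swap_entry {
    unsigned int nr;                  /* number of subpages in the folio */
    unsigned int slot[MAX_SUBPAGES];  /* slot[i] backs subpage i */
};

/* Toy allocation map for the swap file: true means the slot is in use. */
static bool slot_used[NR_SLOTS];

/*
 * Reserve nr slots for one folio. The slots need not be contiguous.
 * All-or-nothing: on failure no slot is marked as used.
 */
static bool compound_swap_alloc(struct compound_swap_entry *ce, unsigned int nr)
{
    unsigned int found = 0;

    assert(nr <= MAX_SUBPAGES);

    /* First pass: collect free slots wherever they are. */
    for (unsigned int s = 0; s < NR_SLOTS && found < nr; s++)
        if (!slot_used[s])
            ce->slot[found++] = s;

    /* Fail only when free space is truly exhausted, not merely fragmented. */
    if (found < nr)
        return false;

    /* Second pass: commit the reservation. */
    for (unsigned int i = 0; i < nr; i++)
        slot_used[ce->slot[i]] = true;
    ce->nr = nr;
    return true;
}

/* Release every slot that belongs to the compound entry. */
static void compound_swap_free(struct compound_swap_entry *ce)
{
    for (unsigned int i = 0; i < ce->nr; i++)
        slot_used[ce->slot[i]] = false;
    ce->nr = 0;
}
```

In this toy model the folio stays whole on the memory side while the slots it occupies are scattered, and the allocator reports failure only when there are fewer free slots than subpages. How the real swap back end would store and look up such a mapping is left open here.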