From: Jared Hulbert <jaredeh@gmail.com>
Date: Tue, 5 Mar 2024 20:16:57 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chris Li
Cc: Chengming Zhou, Matthew Wilcox, Nhat Pham, lsf-pc@lists.linux-foundation.org, linux-mm, ryan.roberts@arm.com, David Hildenbrand, Barry Song <21cnbao@gmail.com>, Chuanhua Han
On Tue, Mar 5, 2024 at 1:58 PM Chris Li wrote:
>
> On Tue, Mar 5, 2024 at 1:38 PM Jared Hulbert wrote:
> >
> > On Mon, Mar 4, 2024 at 11:49 PM Chris Li wrote:
> > >
> > > I have considered that as well; that is further than writing from
> > > one swap device to another. The current swap device can't accept
> > > writes at non-page-aligned offsets. If we allow byte-aligned
> > > write-out sizes, the whole swap entry offset scheme needs some
> > > heavy changes.
> > >
> > > If we write out 4K pages and the compression ratio is lower than
> > > 50%, a combination of two compressed pages can't fit into one
> > > page, which means some of the pages read back will need to
> > > overflow into another page.
> > > We kind of need a small file system to keep track of how the
> > > compressed data is stored, because it is not a page-aligned size
> > > any more.
> > >
> > > We can write out zsmalloc blocks of data as they are; however,
> > > there is no guarantee the data in zsmalloc blocks has the same
> > > LRU order.
> > >
> > > It makes more sense when writing higher-order (> 0) swap pages,
> > > e.g. writing 64K pages in one buffer. Then we can write out the
> > > compressed data page-boundary-aligned and in page-sized units,
> > > accepting the waste on the last compressed page, which might not
> > > fill up the whole page.
> >
> > A swap device is not a device; until recently, it was a really bad
> > filesystem with no abstractions between the block device and the
> > filesystem. Zswap and zram are, in some respects, attempts to make
> > specialized filesystems without any of the advantages of using the
> > VFS tooling.
> >
> > What stops us from using an existing compressing filesystem?
>
> The issue is that swap has a lot of usages different from a typical
> file system. Please take a look at the current different use cases
> of swap and their related data structures at the beginning of this
> email thread. If you want to use an existing file system, you still
> need to bridge the gap between the swap system and file systems.
> For example, the cgroup information is associated with each swap
> entry.
>
> You can think of swap as a special file system that can read and
> write 4K objects by key. You can always use file system extended
> attributes to track the additional information associated with each
> swap entry.

Yes. This is what I was trying to say.

While the swap dev pretends to be just a simple index, your opener
for this thread mentions a VFS-like swap interface. What exactly is
the interface you have in mind? If it's VFS-like... how does it
differ?
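To make the packing arithmetic above concrete, here is a quick
back-of-the-envelope sketch (illustrative Python, not kernel code; the
4K page size and 64K buffer come from the example above, and the 55%
figure is an assumed compression outcome, not a measured one):

```python
PAGE = 4096          # page size assumed throughout the discussion
BUF = 16 * PAGE      # a 64K higher-order swap write, per the example

def pages_needed(compressed_size):
    # Write compressed data out page-aligned: round up to whole pages,
    # accepting waste in the last, partially filled page.
    return -(-compressed_size // PAGE)  # ceiling division

# At a 50% compression ratio, a 64K buffer compresses to 32K, i.e.
# 8 pages, so the buffered write still halves the I/O.
assert pages_needed(BUF // 2) == 8

# If compression leaves more than 50% of each page (55% assumed here),
# two compressed 4K pages together exceed one page, so one of them
# must spill over into the next page.
two_compressed = 2 * int(0.55 * PAGE)
assert two_compressed > PAGE
```

The buffered 64K case is what makes the waste tolerable: it is bounded
to the one partially filled page at the end of the run.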
> At the end of the day, using an existing file system, the
> per-swap-entry metadata overhead would likely be much higher than
> the current swap back end's. I understand the current swap back end
> organizes the data around the swap offset, which makes swap data
> spread across many different places. That is one reason people might
> not like it. However, it does have pretty minimal per-swap-entry
> memory overheads.
>
> The file system can store its metadata on disk, reducing the
> in-memory overhead. That has a price: when you swap in a page, you
> might need to go through a few file system metadata reads before you
> can read in the real swapped data.

When I look at all the things being asked of modern swap backends
(compression, tiering, metadata tracking, usage metrics, caching,
backing storage), there is a lot of potential for reuse from the
filesystem world. If we truly have a VFS-like swap interface, why not
make it easy to facilitate that reuse?

Of course I don't think we should just take stock btrfs and call it a
swap backing store. When I asked "What stops us..." I meant to open a
discussion to see how far off the vision is. So let's consider the
points you mentioned.

Metadata overhead: ZRAM uses 1% of the disksize as metadata storage;
you can get to 1% or less with modern filesystems unmodified (it
depends on a lot of factors). From a fundamental architecture
standpoint it's not a stretch to think that a modified filesystem
would meet or beat existing swap engines on metadata overhead.

Too many disk ops: this is a solid argument against using most
filesystems today. But it's also one that is addressable; modern
filesystems have lots of caching and separate metadata from data.
There is no reason a variant can't be made that does not store
metadata to disk.

In the traditional VFS space, fragmentation and allocation are the
responsibility of the filesystem, not the pagecache or VFS layer
(okay, it gets complicated in the corner cases).
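For a rough sense of the metadata numbers being compared, a sketch
(illustrative Python; the per-entry byte counts are assumptions for
the sake of the arithmetic, not measured zram or filesystem values):

```python
PAGE = 4096  # swap granularity assumed in the discussion

def metadata_fraction(entry_bytes):
    # One in-memory metadata entry per 4K page of swap space.
    return entry_bytes / PAGE

# With an assumed 16-40 byte per-page entry, the overhead works out
# to roughly 0.4-1% of the disksize, which is the same ballpark as
# the ~1% ZRAM figure mentioned above.
assert round(metadata_fraction(40) * 100, 1) <= 1.0
assert metadata_fraction(16) * 100 < 0.5
```

The point of the comparison: per-entry size, not on-disk layout, is
what dominates the in-memory cost at this granularity.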
If we call swap backends "swap filesystems", then I don't think it's
hard to imagine that a modified (or new) filesystem could be rather
easily adapted to handle many of the things you're looking for, if we
made a swapping VFS-like interface that was truly a clean subset of
the VFS interface.

With a whole family of specialized swap filesystems optimized for
different systems and media types, you could do buddy allocation,
larger writes, LRU-level group allocations, sub-page allocation,
direct writes, compression, tiering, readahead hints, deduplication,
caching, etc. with nearly off-the-shelf code. And all this with a
free set of stable APIs, tools, conventions, design patterns, and
abstractions to allow for quick and easy innovation in this space.

And if that seems daunting, we can start by making existing swap
backends glue into the new VFS-like interface and punt the rest for
later. But making clear and clean VFS-like interfaces, if done right,
allows for a ton of innovation here.

> >
> > Crazy talk here. What if we handled swap pages like they were
> > mmap'd to a special swap "file(s)"?
>
> That is already the case in the kernel; the swap cache is handled
> the same way as the file cache, with a file offset. Some of them
> even share the same underlying functions, for example
> filemap_get_folio().

Right, there is some similarity in the middle. And yet the way a
swapped page is handled is very different at the "ends": the PTEs /
fault paths and the way data gets to swap media are totally
different. Those are the parts I was thinking about.

In other words, why do a VFS-like interface; why not use the VFS
interface itself? I suppose the first-level "why" gets you to
something like a circular reference issue when allocating memory for
a VFS op that could trigger swapping... but maybe that's addressable.
It gets crazy, but I have a feeling the core issues are not too
serious.

> Chris