From: Chris Li <chrisl@kernel.org>
Date: Wed, 6 Mar 2024 16:46:23 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Jared Hulbert
Cc: linux-mm

On Wed, Mar 6, 2024 at 2:44 PM Jared Hulbert wrote:
>
> On Wed, Mar 6, 2024 at 10:16 AM Chris Li wrote:
> >
> > On Wed,
> > Mar 6, 2024 at 2:39 AM Jared Hulbert wrote:
> > >
> > > On Tue, Mar 5, 2024 at 9:51 PM Chris Li wrote:
> > > >
> > > > If your file size is 4K and you need to store millions of 4K
> > > > small files, referenced by an integer-like filename, typical
> > > > file systems like ext4 or btrfs will definitely not be able to
> > > > get to 1% metadata storage for that kind of usage. 1% of 4K is
> > > > 40 bytes. Your typical inode struct is much bigger than that.
> > > > Last I checked, sizeof(struct inode) is 632.
> > >
> > > Okay, that is an interesting difference in assumptions. I see no
> > > need to have file == page; I think it would be insane to have an
> > > inode per swap page. You'd have one big "file" and do offsets. Or
> > > a file per cgroup, etc.
> >
> > Then you are back to designing your own data structure to manage
> > how to map the swap entry into large file offsets. The swap file is
> > one large file; it can group clusters into smaller large files
> > internally.
>
> No, that's not how I see it. I must be missing something. From my
> perspective I am suggesting we should NOT be designing our own data
> structures to manage how to map the swap entries into large file
> offsets.

OK, so you are suggesting not using file inodes for 4K swap pages, and
also not designing our own data structure to manage swap entry
allocation.

> This is nearly identical to the database use case, which has been a
> huge driver of filesystem and block subsystem optimizations over the
> years. In practice it's not uncommon to have a dedicated filesystem
> dominated by one huge database file, a smaller transaction log, and
> some metadata files about the database. The workload for the database
> is random reads and writes at 8K, while the log file is operated like
> a write-only ring buffer most of the time. And filesystems have been
> designed and optimized for decades (and continue to be optimized) to
> properly place data on the media.
> All the data structures and grouping logic are present. Filesystems
> aren't just about directories and files. Those are the easy parts.

Then how do you allocate swap entries using this file system or
database? More detail on how swap entries map into the large file's
offsets would help me understand what you are trying to do.

> > Why not use the swap file directly? The VFS does not really help,
>
> I don't understand your question? How do you have a "swap file"
> without a clearly defined API? What am I missing?

Swap file support exists in the kernel. You can issue block IO on the
swap device at a given offset. The block device API exists. That is
how the swap back end works right now. I am not sure I understand your
question.

Chris

> > it is more of a burden to maintain all those super blocks,
> > directories, inodes etc.
>
> I mean... how is the minimum required superblock different from the
> header on a swap partition? Sure, we can strip out features that
> aren't needed. What directories and inodes are you maintaining? But
> if your swap store happened to support extra features... why does it
> matter?
>
> > > Remember I'm advocating a subset of the VFS interface, learning
> > > from it, not using it as is.
> >
> > You can't really use a subset without having the other parts drag
> > along. Most of the VFS operations, those op callback functions, do
> > not apply to swap directly anyway.
> > If you say VFS is just an inspiration, then that is more or less
> > what I had in mind earlier :-)
>
> Of course you can use a subset without having the other parts drag
> along. That's the definition of subset, at least how I intend it.
>
> Matthew Wilcox talked about integrating zswap and swap more tightly.
> I feel like it's not clear how zswap and swap _should_ interact given
> the state of the swap-related APIs as they are.
>
> On the other hand there are several canonical and easy-to-implement
> ways to do something similar in traditional fs/vfs land.
>
> 1.
> A filesystem that compresses data in RAM and does writeback to a
> blockdev; it would have to have a blockdev-aware allocator.
> 2. A filesystem that compresses data in RAM, overlaid on another
> filesystem; it would require uncompressing to do writeback (unless
> the VFS were extended with cwrite()/cread()).
> 3. A block dev that compresses data in RAM under a filesystem; it
> would have to have a blockdev-aware allocator.
>
> I'd like to talk about making this sort of thing simple and clean to
> do with swap.
>
> > > > > From a fundamental architecture standpoint it's not a stretch
> > > > > to think that a modified filesystem would meet or beat
> > > > > existing swap engines on metadata overhead.
> > > >
> > > > Please show me one file system that can beat the existing swap
> > > > system in the swap-specific usage case (load/store of individual
> > > > 4K pages); I am interested in learning.
> > >
> > > Well, mind you, I'm suggesting a modified filesystem, and this is
> > > hard to compare apples to apples, but sure... here we go :)
> > >
> > > Consider an unmodified EXT4 vs ZRAM with a backing device of the
> > > same size, on the same hardware.
> > >
> > > Using the page cache as a rough proxy for RAM caching in the EXT4
> > > case, and comparing to ZRAM without sending anything to the
> > > backing store: ZRAM is faster at reads while EXT4 is a little
> > > faster at writes.
> > >
> > >       | ZRAM     | EXT4     |
> > > ------+----------+----------+
> > > read  | 4.4 GB/s | 2.5 GB/s |
> > > write | 643 MB/s | 658 MB/s |
> > >
> > > If you look at what happens when data actually goes to and from
> > > the disk, ZRAM is a tiny bit faster at reads but much slower at
> > > writes.
> > >
> > >       | ZRAM      | EXT4      |
> > > ------+-----------+-----------+
> > > read  | 1.14 GB/s | 1.10 GB/s |
> > > write | 82.3 MB/s | 548 MB/s  |
> >
> > I am more interested in terms of per-swap-entry memory overhead.
> >
> > Without knowing how you map the swap entry into file reads/writes,
> > I have no idea how to interpret those numbers in the swap back end
> > usage context. ZRAM is just a block device; ZRAM does not
> > participate in how the swap entry is allocated or freed. ZRAM does
> > compression, which is CPU intensive, while EXT4 doesn't, so it is
> > understandable that ZRAM might have lower write bandwidth. I am not
> > sure how those numbers translate into a prediction of how a file
> > system based swap back end would perform.
>
> I randomly read/write to the zram block dev and to one large EXT4
> file with max concurrency for my system. If you mounted the file and
> the zram as swap devs, the performance from the benchmark should
> transfer to swap operations. How that maps to system performance?
> That's a more complicated benchmarking question.
>
> > Regards,
> >
> > Chris
>