From: Jared Hulbert
Date: Wed, 6 Mar 2024 14:44:42 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chris Li
Cc: linux-mm
On Wed, Mar 6, 2024 at 10:16 AM Chris Li wrote:
>
> On Wed, Mar 6, 2024 at 2:39 AM Jared Hulbert wrote:
> >
> > On Tue, Mar 5, 2024 at 9:51 PM Chris Li wrote:
> > >
> > > If your file size is 4K each and you need to store millions of 4K
> > > small files, reference it by an integer like filename. Typical file
> > > systems like ext4, btrfs will definitely not be able to get 1%
> > > metadata storage for that kind of usage. 1% of 4K is 40 bytes. Your
> > > typical inode struct is much bigger than that. Last I checked,
> > > sizeof(struct inode) is 632.
> >
> > Okay, that is an interesting difference in assumptions. I see no need
> > to have file == page; I think it would be insane to have an inode
> > per swap page.
> > You'd have one big "file" and do offsets. Or a file
> > per cgroup, etc.
>
> Then you are back to designing your own data structure to manage how
> to map the swap entry into large file offsets. The swap file is one
> large file; it can group clusters as smaller large files internally.

No, that's not how I see it. I must be missing something. From my
perspective I am suggesting we should NOT be designing our own data
structures to manage how to map the swap entries into large file
offsets.

This is nearly identical to the database use case, which has been a
huge driver of filesystem and block subsystem optimizations over the
years. In practice it's not uncommon to have a dedicated filesystem
dominated by one huge database file, a smaller transaction log, and
some metadata files about the database. The workload for the database
is random reads and writes at 8K, while the log file is operated like
a write-only ringbuffer most of the time. And filesystems have been
designed and optimized for decades (and continue to be optimized) to
properly place data on the media. All the data structures and grouping
logic are present. Filesystems aren't just about directories and
files. Those are the easy parts.

> Why not use the swap file directly? The VFS does not really help,

I don't understand your question. How do you have a "swap file"
without a clearly defined API? What am I missing?

> it is more of a burden to maintain all those super blocks,
> directories, inodes etc.

I mean... how is the minimum required superblock different than the
header on a swap partition? Sure, we can strip out features that
aren't needed. What directories and inodes are you maintaining? But if
your swap store happened to support extra features... why does it
matter?

> > Remember I'm advocating a subset of the VFS interface, learning
> > from it, not using it as is.
>
> You can't really use a subset without having the other parts drag
> along.
> Most of the VFS operations, those op callback functions, do not
> apply to swap directly anyway.
> If you say VFS is just an inspiration, then that is more or less what
> I had in mind earlier :-)

Of course you can use a subset without having the other parts drag
along. That's the definition of subset, at least how I intend it.

Matthew Wilcox talked about integrating zswap and swap more tightly. I
feel like it's not clear how zswap and swap _should_ interact given
the state of the swap related APIs such as they are.

On the other hand there are several canonical and easy-to-implement
ways to do something similar in traditional fs/vfs land:

1. A filesystem that compresses data in RAM and does writeback to a
   blockdev; it would have to have a blockdev-aware allocator.
2. A filesystem that compresses data in RAM, overlaid on another
   filesystem. It would require uncompressing to do writeback (unless
   the VFS was extended with cwrite()/cread()).
3. A blockdev that compresses data in RAM under a filesystem; it would
   have to have a blockdev-aware allocator.

I'd like to talk about making this sort of thing simple and clean to
do with swap.

> >
> > > > From a fundamental architecture standpoint it's not a stretch
> > > > to think that a modified filesystem would meet or beat existing
> > > > swap engines on metadata overhead.
> > >
> > > Please show me one file system that can beat the existing swap
> > > system in the swap specific usage case (load/store of individual
> > > 4K pages), I am interested in learning.
> >
> > Well, mind you, I'm suggesting a modified filesystem and this is
> > hard to compare apples to apples, but sure... here we go :)
> >
> > Consider an unmodified EXT4 vs ZRAM with a backing device of the
> > same sizes, same hardware.
> >
> > Using the page cache as a (bad) proxy for RAM caching in the case
> > of EXT4, and comparing to the ZRAM without sending anything to the
> > backing store.
> > The ZRAM is faster at reads while the EXT4 is a little faster at
> > writes:
> >
> >       | ZRAM     | EXT4     |
> > -----------------------------
> > read  | 4.4 GB/s | 2.5 GB/s |
> > write | 643 MB/s | 658 MB/s |
> >
> > If you look at what happens when you talk about getting things to
> > and from the disk, then the ZRAM is a tiny bit faster at reads but
> > way slower at writes:
> >
> >       | ZRAM      | EXT4      |
> > -------------------------------
> > read  | 1.14 GB/s | 1.10 GB/s |
> > write | 82.3 MB/s | 548 MB/s  |
>
> I am more interested in terms of per swap entry memory overhead.
>
> Without knowing how you map the swap entry into file read/writes, I
> have no idea how to interpret those numbers in the swap back end
> usage context. ZRAM is just a block device; ZRAM does not participate
> in how the swap entry was allocated or freed. ZRAM does compression,
> which is CPU intensive, while EXT4 doesn't, so it is understandable
> ZRAM might have lower write bandwidth. I am not sure how those
> numbers translate into a prediction of how a file system based swap
> back end system performs.

I randomly read/write to the zram block dev and to one large EXT4 file
with max concurrency for my system. If you mounted the file and the
zram as swap devs, the performance from the benchmark should transfer
to swap operations. How that maps to system performance...? That's a
more complicated benchmarking question.

> Regards,
>
> Chris