From: Chris Li
Date: Tue, 28 May 2024 20:36:04 -0700
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Jared Hulbert
Cc: Karim Manaouil, Jan Kara, Chuanhua Han, linux-mm,
 lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
 21cnbao@gmail.com, david@redhat.com
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com>
 <20240307140344.4wlumk6zxustylh6@quack3>

On Tue, May 28, 2024 at 12:08 AM Jared Hulbert wrote:
>
> On Tue, May 21, 2024 at 1:43 PM Chris Li wrote:
> >
> > Swap and file systems have very different requirements and usage
> > patterns and IO patterns.
>
> I would counter that the design requirements for a simple filesystem
> and what you are proposing doing to support heterogeneously sized
> block allocation on a block device are very similar, not very
> different.
>
> Data is owned by clients, but I've done the profiling on servers and
> Android. As I've stated before, databases have reasonably close usage
> and IO. Swap usage of block devices is not a particularly odd usage
> profile.
>
> > One challenging aspect is that the current swap back end has a very
> > low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> > byte (swap cgroup), 8 byte(swap cache pointer). The inode struct is
> > more than 64 bytes per file. That is a big jump if you map a swap
> > entry to a file. If you map more than one swap entry to a file, then
> > you need to track the mapping of file offset to swap entry, and the
> > reverse lookup of swap entry to a file with offset. Whichever way you
> > cut it, it will significantly increase the per swap entry memory
> > overhead.
>
> No it won't. Because the suggestion is NOT to add some array of inode
> structs in place of the structures you've been talking about altering.
>
> IIUC your proposals per the "Swap Abstraction LSF_MM 2024.pdf" are to
> more than double the per entry overhead from 11 B to 24 B. Is that
> correct? Of course if modernizing the structures to be properly folio
> aware requires a few bytes, that seems prudent.

The most expanded form of swap entry is 24B, the last option in the
slide. However, you get a saving from not duplicating the entries
inside a compound swap entry. E.g. with PMD-size compound swap entries,
you can have 512 identical swap entries within one compound swap entry.
Each of them only needs an 8-byte pointer to the compound entry struct.
So the average per entry is 8B + (24B + compound struct overhead)/512,
much smaller than 24B.

If all swap entries are order 0, then yes, the average is 24B per entry.
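
As a rough worked example of the average above, here is a minimal
Python sketch. The function name, its parameters, and the default of 0
for the compound struct overhead are assumptions made only for
illustration; the real overhead is not stated in this thread. Treating
the order-0 case as a flat 24B follows the statement above.

```python
# Byte counts quoted in this thread for the current swap back end:
# swap_map + swap cgroup + swap cache pointer.
LEGACY_BYTES = 1 + 2 + 8


def avg_bytes_per_entry(entries_per_compound,
                        expanded_bytes=24,
                        pointer_bytes=8,
                        compound_overhead=0):
    """Average per swap entry cost for one compound swap entry.

    compound_overhead is a placeholder (assumption): the size of the
    compound entry struct beyond the 24B expanded form.
    """
    if entries_per_compound == 1:
        # Order 0: nothing is shared, each entry pays the expanded form.
        return float(expanded_bytes)
    shared = (expanded_bytes + compound_overhead) / entries_per_compound
    return pointer_bytes + shared


pmd_avg = avg_bytes_per_entry(512)    # PMD-size compound entry
order0_avg = avg_bytes_per_entry(1)   # all entries are order 0
```

With the placeholder overhead of 0, the PMD case comes to 8B plus 24/512
of a byte per entry, which is below the 11B the current structures use.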
>
> Also IIUC 8 bytes of the 24 are a per swap entry pointer to a
> dynamically allocated structure that will be used to manage
> heterogeneous block size allocation management on block devices. I
> object to this. That's what the filesystem abstraction is for. EXT4
> too heavy for you? Then make a simpler filesystem.

You can call my compound swap entry a simpler filesystem. It is just a
different name.
If you are writing a new file system for swap, you don't need the
inode and most of the VFS ops etc. Those are unnecessary complexity to
deal with.

>
> So how do you map swap entries to a filesystem without a new mapping
> layer? Here is a simple proposal. (It assumes there are only 16
> valid folio orders. There are ways to get around that limit but it
> would take longer to explain, so let's just go with it.)
>
> * swap_types (fs inodes) map to different page sizes (page, compound
>   order, folio order, mTHP size etc).

Swap type has a preexisting meaning in the Linux swap back end code: it
refers to the swap device. Let me just call it "swap_order".

>   ex. swap_type == 1 -> 4K pages, swap_type == 15 -> 1G hugepages etc
> * swap_type = fs inode
> * swap_offset = fs file offset
> * swap_offset is selected using the same simple allocation scheme as today.
>   - because the swap entries are all the same size/order per
>     swap_type/inode you can just pick the first free slot.
> * on freeing a swap entry call fallocate(FALLOC_FL_PUNCH_HOLE)
>   - removes the blocks from the file without changing its "size".
>   - no changes are required to the swap_offsets to garbage collect blocks.

Can I assume your swap entry encoding is something like [swap_order
(your swap_type)] + [swap_offset]?
Let's forget the fact that you might not be able to get swap order
bits from the swap entry in a 32 bit system. Assume the swapfile is
small enough that this is not a problem.

Now your swap cache address space is 16x compared to the original swap
cache address space.
You may say, oh, that is "virtual" swap cache address space, you are
not using the 16x address space at the same time. That is true.
However, you can create worse fragmentation in your 16x virtual swap
cache address space. The xarray used to track the swap cache does not
handle sparse index storage well. The worst case fragmentation in
xarray is about 32-64x. So the worst fragmentation in your 16x swap
address space can be something close to the 16x end. Let's say it is
not 16x and pick a low end of 4x. 4x of the 8B swap cache pointer is
already 32B per swap entry on the swap cache alone.

FYI, the original swap cache and the compound swap entry in the pdf do
not have this swap cache address space blow up issue.

Chris

>
> This allows you the following:
> * dynamic allocation of block space between sizes/orders
> * avoids any new tracking structures in memory for all swap entries
> * places burden of tracking on filesystem
>
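
The swap cache estimate in the reply above can be sketched as follows.
This is only an illustration of the assumed [swap_order] + [swap_offset]
encoding: the 28-bit offset width and the helper names are hypothetical,
and the 4x fragmentation factor is the low-end guess from the reply, not
a measured value.

```python
ORDER_BITS = 4      # 16 valid folio orders, per the quoted proposal
OFFSET_BITS = 28    # hypothetical offset width, for illustration only


def encode(order, offset):
    """Pack [swap_order] + [swap_offset] into one integer index."""
    assert 0 <= order < (1 << ORDER_BITS)
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (order << OFFSET_BITS) | offset


def decode(entry):
    """Split an index back into (swap_order, swap_offset)."""
    return entry >> OFFSET_BITS, entry & ((1 << OFFSET_BITS) - 1)


# Index space the swap cache has to cover, relative to offset-only
# indexing: every order gets its own copy of the offset range.
index_space_ratio = (1 << (ORDER_BITS + OFFSET_BITS)) >> OFFSET_BITS

POINTER_BYTES = 8            # swap cache pointer size quoted in the thread
ASSUMED_FRAGMENTATION = 4    # low-end guess from the reply above
swap_cache_bytes_per_entry = ASSUMED_FRAGMENTATION * POINTER_BYTES
```

Under these assumptions the index space is 16x the offset-only case, and
the swap cache alone costs 32B per swap entry.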