Date: Wed, 8 Mar 2023 18:59:52 +1100
From: Dave Chinner
To: Luis Chamberlain
Cc: Matthew Wilcox, "Darrick J. Wong", James Bottomley, Keith Busch,
	Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Message-ID: <20230308075952.GU2825702@dread.disaster.area>
References: <2600732b9ed0ddabfda5831aff22fd7e4270e3be.camel@HansenPartnership.com>

On Tue, Mar 07, 2023 at 10:11:43PM -0800, Luis Chamberlain wrote:
> On Sun, Mar 05, 2023 at 05:02:43AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 04, 2023 at 08:15:50PM -0800, Luis Chamberlain wrote:
> > > On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
> > > > XFS already works with arbitrary-order folios.
> > >
> > > But block sizes > PAGE_SIZE is work which is still not merged. It
> > > *can* be with time. That would allow one to muck with larger block
> > > sizes than 4k on x86-64 for instance. Without this, you can't play
> > > ball.
> >
> > Do you mean that XFS is checking that fs block size <= PAGE_SIZE and
> > that check needs to be dropped? If so, I don't see where that happens.
>
> None of that. Back in 2018 Chinner had prototyped XFS support with
> larger block size > PAGE_SIZE:
>
> https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/

Having a working BS > PS implementation on XFS based on variable
page order support in the page cache goes back over a decade before
that.

Christoph Lameter did the page cache work, and I added support for
XFS back in 2007.
The total change to XFS required can be seen in this simple patch:

https://lore.kernel.org/linux-mm/20070423093152.GI32602149@melbourne.sgi.com/

That was when the howls of anguish about high order allocations
Willy mentioned started....

> I just did a quick attempt to rebased it and most of the left over work
> is actually on IOMAP for writeback and zero / writes requiring a new
> zero-around functionality. All bugs on the rebase are my own, only compile
> tested so far, and not happy with some of the changes I had to make so
> likely could use tons more love:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230307-larger-bs-then-ps-xfs

On a current kernel, that patchset is fundamentally broken as we
have multi-page folio support in XFS and iomap - the patchset is
inherently PAGE_SIZE based and it will do the wrong thing with
PAGE_SIZE based zero-around.

IOWs, IOMAP_F_ZERO_AROUND does not need to exist any more, nor
should any of the custom hooks it triggered in different operations
for zero-around. That's because we should now be using the same
approach to BS > PS as we first used back in 2007. We already
support multi-page folios in the page cache, so all the zero-around
and partial folio uptodate tracking we need is already in place.

Hence, like Willy said, all we need to do is have
filemap_get_folio(FGP_CREAT) always allocate folios that are at
least filesystem block sized and aligned, and insert them into the
mapping tree. Multi-page folios will always need to be sized as an
integer multiple of the filesystem block size, but once we ensure
size and alignment of folios in the page cache, we get everything
else for free.

/me cues the howls of anguish over memory fragmentation....

> But it should give you an idea of what type of things filesystems need to do.

Not really. It gives you an idea of what filesystems needed to do 5
years ago to support BS > PS. We're living in the age of folios now,
not pages.
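
As a rough illustration of the size and alignment rule above, here
is the arithmetic as a small userspace C sketch. It is not kernel
code: the helper names and the fixed 4k page size (SKETCH_PAGE_SHIFT)
are assumptions made purely for the example, and it assumes block
and folio sizes are powers of two.

```c
#include <assert.h>
#include <limits.h>

/* Assumption for this sketch: 4k pages. Real page size is per-arch. */
#define SKETCH_PAGE_SHIFT 12u

/*
 * Hypothetical helper: smallest folio order (log2 of pages per folio)
 * whose size covers one filesystem block of 2^blkbits bytes.
 */
unsigned int min_folio_order(unsigned int blkbits)
{
	return blkbits > SKETCH_PAGE_SHIFT ? blkbits - SKETCH_PAGE_SHIFT : 0;
}

/*
 * Hypothetical helper: round a page cache index down to the first
 * index of the naturally aligned folio of the given order.
 */
unsigned long folio_start_index(unsigned long index, unsigned int order)
{
	/* Guard against an undefined over-wide shift. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return index & ~((1UL << order) - 1);
}

/*
 * With power-of-two sizes, a folio is an integer multiple of the
 * block size exactly when its order is at least the minimum order.
 */
int folio_order_ok(unsigned int order, unsigned int blkbits)
{
	return order >= min_folio_order(blkbits);
}
```

For 16k blocks on 4k pages (blkbits = 14) this gives a minimum order
of 2, so a lookup at page index 7 would land in the folio starting at
index 4, and an order-1 folio would not be acceptable.
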
Willy starting work on folios was why I dropped that patch set,
firstly because it was going to make the iomap conversion to folios
harder, and secondly, we realised that none of it was necessary if
folios supported multi-page constructs in the page cache natively.

IOWs, multi-page folios in the page cache should make BS > PS mostly
trivial to support for any filesystem or block device that doesn't
have some other dependency on PAGE_SIZE objects in the page cache
(e.g. bufferheads).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com