Message-ID: <260064c68b61f4a7bc49f09499e1c107e2a28f31.camel@HansenPartnership.com>
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Javier González
Cc: Matthew Wilcox, Theodore Ts'o, Hannes Reinecke, Luis Chamberlain,
 Keith Busch, Pankaj Raghav, Daniel Gomez, lsf-pc@lists.linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org
Date: Thu, 09 Mar 2023 08:11:35 -0500
In-Reply-To: <20230309080434.tnr33rhzh3a5yc5q@ArmHalley.local>
References: <0b70deae-9fc7-ca33-5737-85d7532b3d33@suse.de>
 <20230306161214.GB959362@mit.edu>
 <1367983d4fa09dcb63e29db2e8be3030ae6f6e8c.camel@HansenPartnership.com>
 <20230309080434.tnr33rhzh3a5yc5q@ArmHalley.local>
On Thu, 2023-03-09 at 09:04 +0100, Javier González wrote:
> On 08.03.2023 13:13, James Bottomley wrote:
> > On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
> > > On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
> > > > What HDD vendors want is to be able to have 32k or even 64k
> > > > *physical* sector sizes.
> > > > This allows for much more efficient erasure codes, so it will
> > > > increase their byte capacity now that it's no longer easy to
> > > > get capacity boosts by squeezing the tracks closer and closer,
> > > > and there have been various engineering tradeoffs with SMR,
> > > > HAMR, and MAMR.  HDD vendors have been asking for this at
> > > > LSF/MM, and in other venues, for ***years***.
> > >
> > > I've been reminded by a friend who works on the drive side that a
> > > motivation for the SSD vendors is (essentially) the size of
> > > sector_t.  Once the drive needs to support more than 2/4 billion
> > > sectors, they need to move to a 64-bit sector size, so the amount
> > > of memory consumed by the FTL doubles, the CPU data cache becomes
> > > half as effective, etc.  That significantly increases the BOM for
> > > the drive, and so they have to charge more.  With a 512-byte LBA,
> > > that limit is at 2TB; with a 4096-byte LBA, it's at 16TB; and
> > > with a 64k LBA, they can keep using 32-bit LBA numbers all the
> > > way up to 256TB.
> >
> > I thought the FTL operated on physical sectors and the logical to
> > physical mapping was done as an RMW through the FTL?  In which case
> > sector_t shouldn't matter to the SSD vendors for FTL management,
> > because they can keep the logical sector size while increasing the
> > physical one.  Obviously if the physical size goes above the FS
> > block size, the drives will behave suboptimally with RMWs, which is
> > why 4k physical is the current maximum.
>
> FTL designs are complex. We have ways to maintain sector sizes under
> 64 bits, but this is a common industry problem.
>
> The media itself does not normally operate at 4K. Page sizes can be
> 16K, 32K, etc.

Right, and we've always said that if we knew what this size was, we
could make better block write decisions.
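As an aside, Willy's LBA-width arithmetic is easy to sanity-check. The sketch below uses hypothetical numbers: it assumes a flat logical-to-physical map with one entry per logical block, 4 bytes per entry while the sector count fits in 32 bits and 8 bytes once it doesn't, which is a simplification of any real FTL:

```python
# Capacity addressable with 32-bit sector numbers, per logical block size.
# 2**32 sectors is where sector_t (and each FTL map entry) must grow to
# 64 bits.
MAX_32BIT_SECTORS = 2**32

for lba_size in (512, 4096, 65536):
    capacity_tib = MAX_32BIT_SECTORS * lba_size // 2**40
    print(f"{lba_size:>6}-byte LBA -> {capacity_tib} TiB with 32-bit sectors")

def ftl_map_bytes(capacity_bytes: int, lba_size: int) -> int:
    """Size of a (hypothetical) flat L2P map: one entry per logical block,
    4 bytes per entry if the block count fits in 32 bits, else 8 bytes."""
    entries = capacity_bytes // lba_size
    entry_width = 4 if entries <= MAX_32BIT_SECTORS else 8
    return entries * entry_width

cap = 8 * 10**12  # an 8 TB drive
print(ftl_map_bytes(cap, 512) // 2**20, "MiB of map at 512-byte LBAs")
print(ftl_map_bytes(cap, 4096) // 2**20, "MiB of map at 4096-byte LBAs")
```

With these assumptions, moving the same drive from 512-byte to 4096-byte logical blocks shrinks the map by 8x from fewer entries and another 2x from the narrower entry width, which is the DRAM/BOM argument in a nutshell.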
However, today if you look at what most NVMe devices are reporting,
it's a bit sub-optimal:

jejb@lingrow:/sys/block/nvme1n1/queue> cat logical_block_size
512
jejb@lingrow:/sys/block/nvme1n1/queue> cat physical_block_size
512
jejb@lingrow:/sys/block/nvme1n1/queue> cat optimal_io_size
0

If we do get Linux to support large block sizes, are we actually going
to get better information out of the devices?

> Increasing the block size would allow for better host/device
> cooperation. As Ted mentions, this has been a requirement for HDD
> and SSD vendors for years. It seems to us that the time is right now
> and that we have mechanisms in Linux to do the plumbing. Folios are
> obviously a big part of this.

Well, a decade ago we did a lot of work to support 4k sector devices.
Ultimately the industry went with 512 logical/4k physical devices
because of problems with non-Linux proprietary OSs, but you could
still use 4k today if you wanted (I've actually still got a working 4k
SCSI drive), so why is no NVMe device doing that?

This is not to say I think larger block sizes are in any way a bad
idea ... I just think that, given the history, the change will be
driven by application needs rather than by what the manufacturers tell
us.

James