Message-ID: <9edd952e-d9b5-ed5f-29a2-981dfd9b10cd@opensource.wdc.com>
Date: Sun, 5 Mar 2023 12:06:31 +0900
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
To: Matthew Wilcox, Hannes Reinecke
Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
 Javier González, lsf-pc@lists.linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org
From: Damien Le Moal
Organization: Western Digital Research

On 3/5/23 02:54, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
>> On 3/4/23 17:47, Matthew Wilcox wrote:
>>> On Sat, Mar 04, 2023 at 12:08:36PM
+0100, Hannes Reinecke wrote:
>>>> We could implement a (virtual) zoned device, and expose each zone as a
>>>> block. That gives us the required large block characteristics, and with
>>>> a bit of luck we might be able to dial up to really large block sizes
>>>> like the 256M sizes on current SMR drives.
>>>> ublk might be a good starting point.
>>>
>>> Ummmm. Is supporting 256MB block sizes really a desired goal? I suggest
>>> that is far past the knee of the curve; if we can only write 256MB chunks
>>> as a single entity, we're looking more at a filesystem redesign than we
>>> are at making filesystems and the MM support 256MB size blocks.
>>>
>> Naa, not really. It _would_ be cool as we could get rid of all the cludges
>> which have nowadays re sequential writes.
>> And, remember, 256M is just a number someone thought to be a good
>> compromise. If we end up with a lower number (16M?) we might be able
>> to convince the powers that be to change their zone size.
>> Heck, with 16M block size there wouldn't be a _need_ for zones in
>> the first place.
>>
>> But yeah, 256M is excessive. Initially I would shoot for something
>> like 2M.
>
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
>
> Luis and I are talking about larger LBA sizes. That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB. That's doable, even if somewhat
> suboptimal.

FYI, that is already out there, even though hidden from the host for backward
compatibility reasons. Example: WD SMR drives use 64K distributed sectors,
each of which is essentially 16 4KB sectors striped together to achieve
stronger ECC. Cf.
Distributed sector format (DSEC):
https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/collateral/tech-brief/tech-brief-ultrasmr-technology.pdf

This is hidden from the host though, and the LBA remains 512B or 4KB. This
does, however, result in a measurable impact on IOPS with small reads, as a
sub-64K read needs to be internally processed as a 64KB read to get the
entire DSEC. The drop in performance is not dramatic: about 5% lower IOPS
compared to an equivalent drive without DSEC. Still, that matters considering
HDD IO density issues (IOPS/TB), but in the case of SMR, that is part of the
increased capacity trade-off.

So exposing the DSEC directly as the LBA size is not a stretch for the HDD FW,
as long as the host supports that. There are no plans to do so, but we could
try experimenting.

For host side experimentation, something like qemu/nvme device emulation or
tcmu-runner for SCSI devices should allow emulating large block sizes fairly
easily.

>
> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs. I think there are different
> solutions to these problems, and people are working on both of these
> problems.

The above example does show that the device can generally implement emulation
of a smaller LBA even with an internally larger read/write size unit. Having
that larger size unit advertised as the optimal IO size alignment (as it
should be) and being more diligent in having FSes & mm use that may be a good
approach too.

>
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

No, we are not pushing for such crazy numbers :)

And for the SMR case, smaller zone sizes are not desired, as a small zone size
leads to more real estate waste on the HDD platters, and so lower total
capacity (not desired given that SMR is all about getting higher capacity
"for free").

-- 
Damien Le Moal
Western Digital Research
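
As a footnote to the DSEC discussion above, here is a minimal sketch of how a
small host read gets widened to whole 64KB distributed sectors. It assumes a
DSEC of 16 x 4KB sectors as in the example; the function name dsec_span() and
the simple round-down/round-up logic are illustrative assumptions only, not
actual drive firmware behaviour.

```python
DSEC_SIZE = 64 * 1024  # 16 x 4 KiB sectors, per the example in this message


def dsec_span(offset, length, dsec_size=DSEC_SIZE):
    """Return (start, size) in bytes of the DSEC-aligned range that
    covers a host read of `length` bytes at byte `offset`.

    Hypothetical helper: it only models the alignment arithmetic.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Round the start down to a DSEC boundary.
    start = (offset // dsec_size) * dsec_size
    # Round the end up to a DSEC boundary (ceiling division).
    end = -(-(offset + length) // dsec_size) * dsec_size
    return start, end - start


if __name__ == "__main__":
    # A 4 KiB read inside one DSEC is served by one full 64 KiB read.
    print(dsec_span(4096, 4096))
    # An 8 KiB read straddling a DSEC boundary touches two DSECs.
    print(dsec_span(60 * 1024, 8 * 1024))
```

The same arithmetic applies on the host side when aligning IOs to an
advertised optimal IO size instead of the logical block size.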