Date: Sat, 4 Mar 2023 10:53:47 -0800
From: Luis Chamberlain <mcgrof@infradead.org>
To: Matthew Wilcox
Cc: Hannes Reinecke, Keith Busch, Theodore Ts'o, Pankaj Raghav,
    Daniel Gomez, Javier González, Klaus Jensen,
    lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

On Sat, Mar 04, 2023 at 05:54:38PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
> > On 3/4/23 17:47, Matthew Wilcox wrote:
> > > On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> > > > We could implement a (virtual) zoned device, and expose each zone as a
> > > > block. That gives us the required large block characteristics, and with
> > > > a bit of luck we might be able to dial up to really large block sizes
> > > > like the 256M sizes on current SMR drives.
> > > > ublk might be a good starting point.
> > >
> > > Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
> > > that is far past the knee of the curve; if we can only write 256MB chunks
> > > as a single entity, we're looking more at a filesystem redesign than we
> > > are at making filesystems and the MM support 256MB size blocks.
> >
> > Naa, not really. It _would_ be cool as we could get rid of all the
> > kludges we have nowadays re sequential writes.
> > And, remember, 256M is just a number someone thought to be a good
> > compromise. If we end up with a lower number (16M?) we might be able
> > to convince the powers that be to change their zone size.
> > Heck, with a 16M block size there wouldn't be a _need_ for zones in
> > the first place.
> >
> > But yeah, 256M is excessive. Initially I would shoot for something
> > like 2M.
>
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
>
> Luis and I are talking about larger LBA sizes.  That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
> suboptimal.

Yes.
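To put rough numbers on that footprint change (this is just
round-up-to-block-size arithmetic, nothing filesystem specific):

```shell
# Minimum on-disk footprint of a 1 KiB file: allocation is rounded up
# to a whole number of logical blocks.
for bs in 512 4096 16384 65536; do
    echo "block size $bs -> $(( ( (1024 + bs - 1) / bs ) * bs )) bytes"
done
```

So a 1 KiB file that today costs 4 KiB on a 4k-block filesystem would
cost a full 64 KiB with a 64k logical block size.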
> Your concern seems to be more around shingled devices (or their
> equivalent in SSD terms) where there are large zones which are
> append-only, but you can still random-read 512 byte LBAs.  I think
> there are different solutions to these problems, and people are
> working on both of these problems.
>
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

Hannes had replied to my suggestion about a way to *virtualize*
*optimally* a real storage controller with a larger LBA. In that thread
I was hinting to avoid using cache=passthrough on the hypervisor and to
instead use something like cache=writeback, or even cache=unsafe for
experimentation, for virtio-blk-pci. For a more elaborate description
of these see [0], but the skinny is that cache=passthrough uses the
host storage controller while the others rely on the host page cache.
The overhead of the latencies incurred by anything used to replicate
larger LBAs should be mitigated, so I don't think using a zoned storage
zone for it would be good. I was asking whether or not experimenting
with a different host page cache PAGE_SIZE might help replicate things
a bit more realistically, even if it was suboptimal for the host for
the reasons previously noted as stupid.

If sticking to PAGE_SIZE on the host, another idea may be to use
tmpfs + huge pages so as to at least mitigate TLB lookups.

[0] https://github.com/linux-kdevops/kdevops/commit/94844c4684a51997cb327d2fb0ce491fe4429dfc

  Luis
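P.S.: a rough sketch of the kind of experiment meant above. The image
path, memory size, and the 16k figure are made-up illustration values,
and whether a given qemu version accepts a 16k logical_block_size is
something to check; the knobs of interest are only cache=,
logical_block_size=, and physical_block_size=:

```shell
# Guest disk backed by the host page cache (cache=writeback; cache=unsafe
# additionally ignores guest flush requests), advertising a larger
# logical block size to the guest via virtio-blk-pci.
qemu-system-x86_64 -m 4G -enable-kvm \
  -drive file=/tmp/large-lba.img,if=none,id=d0,format=raw,cache=writeback \
  -device virtio-blk-pci,drive=d0,logical_block_size=16384,physical_block_size=16384

# If instead sticking to PAGE_SIZE on the host, back the image with a
# tmpfs mount using transparent huge pages to mitigate TLB pressure:
mount -t tmpfs -o size=8G,huge=always tmpfs /mnt/large-lba
```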