Date: Thu, 2 Mar 2023 23:20:26 -0500
From: "Theodore Ts'o"
To: "Martin K. Petersen"
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

On Thu, Mar 02, 2023 at 09:54:59PM -0500, Martin K. Petersen wrote:
>
> Hi Ted!
>
> > With NVMe, it is possible for a storage device to promise this without
> > requiring read-modify-write updates for sub-16k writes. All that is
> > necessary are some changes in the block layer so that the kernel does
> > not inadvertently tear a write request when splitting a bio because it
> > is too large (perhaps because it got merged with some other request,
> > and then it gets split at an inconvenient boundary).
>
> We have been working on support for atomic writes and it is not as simple
> as it sounds. Atomic operations in SCSI and NVMe have semantic
> differences which are challenging to reconcile. On top of that, both the
> SCSI and NVMe specs are buggy in the atomics department. We are working
> to get things fixed in both standards and aim to discuss our
> implementation at LSF/MM.

I'd be very interested to learn more about what you've found.  I know
more than one cloud provider is thinking about how to use the NVMe
spec to send information about how their emulated block devices work.

This has come up at our weekly ext4 video conference, and given that I
gave a talk about it in 2018[1], there's quite a lot of similarity in
what folks are thinking about.  Basically, MySQL and Postgres use 16k
database pages, and if they can drop their special doublewrite
techniques for avoiding torn writes, because they can depend on their
Cloud Block Devices Working A Certain Way, it can make for very
noticeable performance improvements.

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw

So while the standards might allow standards-compliant physical
devices to do some really weird sh*t, if all cloud vendors do things
in the same way, I could see various cloud workloads starting to
depend on extra-standard behaviour, much like a lot of sysadmins
assume that low-numbered LBAs are on the outer diameter of the HDD and
are much more performant than sectors on the i.d. of the HDD.
This is completely not guaranteed by the standard specs, but it's
become a de facto standard.

That's not a great place to be, and it would be great if we can find
ways that are much more reliable in terms of querying a
standards-compliant storage device and knowing whether we can depend
on a certain behavior --- but I also know how slowly storage standards
bodies move.  :-(

> Hinting didn't see widespread adoption because we in Linux, as well as
> the various interested databases, preferred hints to be per-I/O
> properties. Whereas $OTHER_OS insisted that hints should be statically
> assigned to LBA ranges on media. This left vendors having to choose
> between two very different approaches and consequently they chose not to
> support any of them.

I wasn't aware of that history.  Thanks for filling that bit in.

Fortunately, in 2023, it appears that for many cloud vendors, the
teams involved care a lot more about Linux than $OTHER_OS.  So
hopefully we'll have a lot more success in getting write hints
generally available to hyperscale cloud customers.

From an industry-wide perspective, it would be useful if the write
hints used by Hyperscale Cloud Vendor #1 are very similar to the write
hints supported by Hyperscale Cloud Vendor #2.  Standards committees
aren't the only way that companies can collaborate in an anti-trust
compliant way.  Open source is another way; and especially if we can
show that a set of hints work well for the Linux kernel and Linux
applications --- then what we ship in the Linux kernel can help shape
the set of "write hints" that cloud storage devices will support.

					- Ted

P.S.  From an LSF/MM program perspective, I suspect we may want to have
more than one session; one that is focused on standards and atomic
writes, and another that is focused on write hints.  The first might
be mostly block and fs focused, and the second would probably be of
interest to mm folks as well.
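
For readers who want the torn-write concern quoted near the top of this
message made concrete, here is a toy model in Python.  It is only a
sketch and is not how the kernel's block layer splits requests:
split_request(), split_request_aligned() and torn_pages() are
hypothetical names invented for this illustration, and the 16k page
size and 16k per-request limit are assumptions chosen to match the
database page size mentioned above.

```python
KIB = 1024


def split_request(start, length, max_bytes):
    """Split the byte range [start, start + length) into consecutive
    chunks of at most max_bytes, ignoring alignment (toy model)."""
    chunks = []
    offset, end = start, start + length
    while offset < end:
        size = min(max_bytes, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def split_request_aligned(start, length, max_bytes, align):
    """Like split_request(), but pull each interior cut back to a
    multiple of `align` whenever that still makes forward progress."""
    chunks = []
    offset, end = start, start + length
    while offset < end:
        boundary = min(offset + max_bytes, end)
        if boundary < end:
            aligned = boundary - boundary % align
            if aligned > offset:
                boundary = aligned
        chunks.append((offset, boundary - offset))
        offset = boundary
    return chunks


def torn_pages(chunks, page_size):
    """Return start offsets of page_size-aligned pages that are cut by
    a boundary between two chunks."""
    torn = set()
    for offset, size in chunks[:-1]:  # only cuts between chunks matter
        boundary = offset + size
        if boundary % page_size != 0:
            torn.add(boundary - boundary % page_size)
    return sorted(torn)
```

In this model, a 4k write at offset 12k that has been merged with a 16k
page write at offset 16k forms one request covering 12k..32k.  Passing
it to split_request(12 * KIB, 20 * KIB, 16 * KIB) cuts it at 28k, which
falls inside the database page, so torn_pages() reports that page.
split_request_aligned() with align set to 16k cuts at 16k instead and
leaves the page in a single chunk.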