From: Javier González <javier@javigon.com>
To: Hannes Reinecke
Cc: Luis Chamberlain, Matthew Wilcox, Keith Busch, Theodore Ts'o,
	Pankaj Raghav, Daniel Gomez, lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Sat, 4 Mar 2023 14:24:53 +0100
Message-ID: <20230304132453.fy6gu4q64wrs2mxs@mpHalley-2.localdomain>
On 04.03.2023 12:08, Hannes Reinecke wrote:
>On 3/3/23 22:45, Luis Chamberlain wrote:
>>On Fri, Mar 03, 2023 at 03:49:29AM +0000, Matthew Wilcox wrote:
>>>On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
>>>>That said, I was hoping you were going to
>>>>suggest supporting 16k logical block sizes. Not a problem on some
>>>>arches, but still problematic when PAGE_SIZE is 4k. :)
>>>
>>>I was hoping Luis was going to propose a session on LBA size >
>>>PAGE_SIZE. Funnily, while the pressure is coming from the storage
>>>vendors, I don't think there's any work to be done in the storage
>>>layers. It's purely a FS+MM problem.
>>
>>You'd hope most of it is left to FS + MM, but I'm not yet sure that's
>>quite it. Initial experimentation shows that just enabling NVMe
>>devices with physical & logical blocks > PAGE_SIZE gets brought down
>>to 512 bytes. That seems odd, to say the least. Would changing this
>>be an issue now?
>>
>>I'm gathering there is generic interest in this topic, though. So one
>>thing we *could* do is review the lay of the land and break down what
>>we all think are the things that could be done / are needed. At the
>>very least we can come out knowing the unknowns together.
>>
>>I started thinking about some of these things a while ago, and with
>>Willy's help I tried to break down some of the items I gathered from
>>him into community OKRs (a super informal itemization of goals and
>>the subtasks that would complete them) and started taking a stab at
>>them with our team, but obviously I think it would be great if we all
>>just divide and conquer here. So maybe reviewing these and extending
>>them as a community would be good:
>>
>>https://kernelnewbies.org/KernelProjects/large-block-size
>>
>>I'm recently interested in tmpfs, so I will be taking a stab at
>>higher-order page size support there to see what blows up.
>>
>Cool.
>
>>The other stuff, like the general IOMAP conversion, is pretty well
>>known, and I think we already have a proposed session on that.
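To make the FS+MM coupling above concrete, here is a minimal sketch (not from the thread; the block sizes are illustrative) of the folio order the page cache would need so that one logical block never straddles independently managed folios:

```python
# Illustrative only: a 16k logical block on a 4k PAGE_SIZE system needs
# order-2 folios so that a single block maps to one folio, not to four
# independently tracked pages.
PAGE_SIZE = 4096

def min_folio_order(lba_size: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest folio order whose size covers one logical block."""
    if lba_size % page_size:
        raise ValueError("LBA size must be a multiple of the page size")
    order = 0
    while (page_size << order) < lba_size:
        order += 1
    return order

for lba in (4096, 16384, 65536):
    print(lba, "-> order", min_folio_order(lba))
```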
>>But there are also even smaller fish to fry, like *just* doing a
>>baseline of some filesystems with a 4 KiB block size, which seems in
>>order.
>>
>>Hearing filesystem developers' thoughts on support for larger block
>>sizes in light of a lower-order PAGE_SIZE would be good, given that
>>one of the odd situations some distributions / teams find themselves
>>in is trying to support larger block sizes without easy access to
>>higher PAGE_SIZE systems. Are there ways to simplify this / help us
>>in general? Without them, it's a bit hard to muck around with some of
>>this in terms of long-term support. This also got me thinking about
>>ways to replicate larger-IO virtual devices a bit better. While
>>paying a cloud provider to test this is one option, it would be great
>>if I could just do this in-house with some hacks too. For
>>virtio-blk-pci, for instance, I wondered whether just using the host
>>page cache suffices, or whether a 4k page cache on the host would
>>significantly change the results of, say, a 16k emulated IO
>>controller. How do we most effectively virtualize 16k controllers
>>in-house?
>>
>>To help with experimenting with large IO and NVMe / virtio-blk-pci, I
>>recently added support to instantiate tons of large-IO devices to
>>kdevops [0]; with it, it should be easy to reproduce the odd issues
>>we may come up with. For instance, it should be possible to
>>subsequently extend the kdevops fstests or blktests automation with
>>just a few Kconfig files to use some of these large-IO devices and
>>see what blows up.
>
>We could implement a (virtual) zoned device and expose each zone as a
>block. That gives us the required large-block characteristics, and
>with a bit of luck we might be able to dial up to really large block
>sizes, like the 256M zone sizes on current SMR drives.

Why would we want to deal with the overhead of a zoned block device for
a generic large-block implementation?
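On the in-house 16k emulation question above: one option might be QEMU's emulated NVMe controller, whose namespace block sizes can reportedly be set from the command line. A sketch that only builds the argv (the image name, serial, and the `logical_block_size`/`physical_block_size` device properties are assumptions; verify against your QEMU build with `-device nvme-ns,help`):

```python
# Sketch only: construct (do not run) a QEMU command line for an NVMe
# namespace that reports 16 KiB logical/physical blocks. Property names
# are assumed from QEMU's generic block-device properties.
LBS = 16384  # emulated logical/physical block size in bytes

qemu_argv = [
    "qemu-system-x86_64",
    "-drive", "file=lbs.img,if=none,format=raw,id=nvm",  # hypothetical image
    "-device", "nvme,serial=lbs0",
    "-device",
    f"nvme-ns,drive=nvm,logical_block_size={LBS},physical_block_size={LBS}",
]
print(" ".join(qemu_argv))
```

The same two block-size properties should, under the same caveat, apply to a virtio-blk-pci device for the virtio comparison mentioned above.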
I can see how this is useful for block devices, but it seems to me that
they would be users of this instead. The idea would be for NVMe devices
to report an LBA format with an LBA size > 4KB.

Am I missing something?

>ublk might be a good starting point.

Similarly, I would see ublk as a user of this support, where the
underlying device is > 4KB.
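Since the idea above hinges on the device reporting such an LBA format: in NVMe, the block size surfaces through the Identify Namespace LBA format descriptors, where LBADS is a power-of-two exponent. A simplified sketch (metadata size and the relative-performance field are ignored here):

```python
# Simplified: NVMe Identify Namespace LBA format descriptors encode the
# LBA data size as the exponent LBADS (block size = 2**LBADS), so a 4k
# format reports LBADS=12 and a 16k format LBADS=14. The spec does not
# support LBADS values below 9 (512 bytes).
def lba_data_size(lbads: int) -> int:
    """LBA data size in bytes from the LBADS exponent."""
    if lbads < 9:
        raise ValueError("LBADS values below 9 are unsupported")
    return 1 << lbads

for lbads in (9, 12, 14):
    print(f"LBADS={lbads} -> {lba_data_size(lbads)} bytes")
```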