Message-ID: <7c111304-b56b-167f-bced-9e06e44241cd@linux.alibaba.com>
Date: Wed, 1 Mar 2023 12:18:30 +0800
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
From: Gao Xiang
To: Theodore Ts'o, lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org

On 2023/3/1 11:52, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VM’s can provide functionality
> to guest kernels and applications that traditionally have not been
> available to users of consumer-grade HDD and SSD’s.  For example,
> today it’s possible to create a block device in Google’s Persistent
> Disk with a 16k physical sector size, which promises that aligned 16k
> writes will be atomic.  With NVMe, it is possible for a storage
> device to promise this without requiring read-modify-write updates for
> sub-16k writes.  All that is necessary are some changes in the block
> layer so that the kernel does not inadvertently tear a write request
> when splitting a bio because it is too large (perhaps because it got
> merged with some other request, and then it gets split at an
> inconvenient boundary).

Yeah, most cloud vendors (including Alibaba Cloud) now use ext4 bigalloc
to avoid the MySQL doublewrite buffer.  In addition to improving
performance, this also minimizes unnecessary I/O traffic between
compute and storage nodes.

I once hacked up a COW-based in-house approach in XFS, using an
optimized always_cow with some tricks to avoid depending on the
storage.  But nowadays AWS and Google Cloud both use ext4 bigalloc,
so.. ;-)

>
> There are also more interesting, advanced optimizations that might be
> possible.  For example, Jens had observed that passing hints about
> journaling writes (either from file systems or databases) could
> potentially be useful.  Unfortunately most common storage devices have
> not supported write hints, and support for write hints was ripped out
> last year.  That can be easily reversed, but there are some other
> interesting related subjects that are very much suited for LSF/MM.
>
> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So it would help to be
> able to tell the cloud storage device whether a particular read
> request stems from read-ahead or not.  At the moment, as Matthew
> Wilcox has pointed out, we use the read-ahead code path for
> synchronous buffered reads.  So plumbing this information so it can be
> passed through multiple levels of the mm, fs, and block layers will
> probably be needed.

That seems useful as well.  If my understanding is correct, though,
it's somewhat unclear to me whether we could do more and come up with a
better form than the current REQ_RAHEAD (its use cases and impact are
currently quite limited).

Thanks,
Gao Xiang
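
As a footnote to the torn-write point in the quoted text: below is a
minimal sketch of the arithmetic involved, written in Python purely for
illustration.  It is not kernel code and does not reflect how the block
layer is actually implemented.  The helper name split_without_tearing
and its parameters are hypothetical; the only figure taken from the
thread is the 16k atomic unit.  The idea is simply that when a large
request has to be split, the per-chunk limit is rounded down to a whole
number of atomic units, so every split point lands on a unit boundary.

```python
ATOMIC_UNIT = 16 * 1024  # 16k physical sector size, as in the quoted example


def split_without_tearing(offset, length, max_bytes, unit=ATOMIC_UNIT):
    """Split [offset, offset + length) into chunks of at most max_bytes.

    Every internal split point is a multiple of `unit`, so no aligned
    unit is ever divided between two chunks.  Hypothetical helper for
    illustration only, not a kernel API.
    """
    if offset % unit or length % unit:
        raise ValueError("request is not aligned to the atomic unit")
    if max_bytes < unit:
        raise ValueError("max_bytes is smaller than one atomic unit")

    # Round the per-chunk limit down to a whole number of atomic units.
    step = (max_bytes // unit) * unit

    chunks = []
    end = offset + length
    while offset < end:
        size = min(step, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks
```

For instance, a 64k request with a 40k size limit would be split at the
32k boundary rather than at 40k, which would otherwise cut the third
16k unit in half.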