From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2600732b9ed0ddabfda5831aff22fd7e4270e3be.camel@HansenPartnership.com>
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
From: James Bottomley
To: Matthew Wilcox
Cc: Keith Busch, Luis Chamberlain, Theodore Ts'o, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org
Date: Sat, 04 Mar 2023 08:41:04 -0500
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

On Sat, 2023-03-04 at 07:34 +0000, Matthew Wilcox wrote:
> On Fri, Mar 03, 2023 at 08:11:47AM -0500, James Bottomley wrote:
> > On Fri, 2023-03-03 at 03:49 +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > > > That said, I was hoping you were going to suggest supporting
> > > > 16k logical block sizes. Not a problem on some arch's, but
> > > > still problematic when PAGE_SIZE is 4k. :)
> > >
> > > I was hoping Luis was going to propose a session on LBA size >
> > > PAGE_SIZE.  Funnily, while the pressure is coming from the storage
> > > vendors, I don't think there's any work to be done in the storage
> > > layers.  It's purely a FS+MM problem.
> >
> > Heh, I can do the fools rush in bit, especially if what we're
> > interested in the minimum it would take to support this ...
> >
> > The FS problem could be solved simply by saying FS block size must
> > equal device block size, then it becomes purely a MM issue.
>
> Spoken like somebody who's never converted a filesystem to
> supporting large folios.  There are a number of issues:
>
> 1. The obvious; use of PAGE_SIZE and/or PAGE_SHIFT

Well, yes, a filesystem has to be aware it's using a block size larger
than page size.

> 2. Use of kmap-family to access, eg directories.  You can't kmap
>    an entire folio, only one page at a time.  And if a dentry is
>    split across a page boundary ...

Is kmap relevant? It's only used for reading user pages in the kernel
and I can't see why a filesystem would use it unless it wants to pack
inodes into pages that also contain user data, which is an optimization
not a fundamental issue (although I grant that as the blocksize grows
it becomes more useful) so it doesn't have to be part of the minimum
viable prototype.

> 3. buffer_heads do not currently support large folios.  Working on
>    it.

Yes, I always forget filesystems still use the buffer cache. But
fundamentally the buffer_head structure can cope with buffers that span
pages so most of the logic changes would be around grow_dev_page(). It
seems somewhat messy but not too hard.

> Probably a few other things I forget.  But look through the recent
> patches to AFS, CIFS, NFS, XFS, iomap that do folio conversions.
> A lot of it is pretty mechanical, but some of it takes hard thought.
> And if you have ideas about how to handle ext2 directories, I'm all
> ears.

OK, so I can see you were waiting for someone to touch a nerve, but if
I can go back to the stated goal, I never really thought *every*
filesystem would be suitable for block size > page size, so simply
getting a few of the modern ones working would be good enough for the
minimum viable prototype.

>
> > The MM issue could be solved by adding a page order attribute to
> > struct address_space and insisting that pagecache/filemap functions
> > in mm/filemap.c all have to operate on objects that are an integer
> > multiple of the address space order.  The base allocator is
> > filemap_alloc_folio, which already has an apparently always zero
> > order parameter (hmmm...) and it always seems to be called from
> > sites that have the address_space, so it could simply be modified
> > to always operate at the address_space order.
>
> Oh, I have a patch for that.  That's the easy part.  The hard part is
> plugging your ears to the screams of the MM people who are convinced
> that fragmentation will make it impossible to mount your filesystem.

Right, so if the MM issue is solved it's picking a first FS for
conversion and solving the buffer problem. I fully understand that
eventually we'll need to get a single large buffer to span
discontiguous pages ... I noted that in the bit you cut, but I don't
see why the prototype shouldn't start with contiguous pages.

James
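
For reference, the arithmetic behind the "address_space order" idea
discussed above can be sketched in plain userspace C. This is only an
illustration under stated assumptions: the helper names
min_folio_order() and align_index_to_order() are hypothetical, the 4k
page size is hard-coded to match the case in the thread, and nothing
here is the kernel's actual API or the patch Matthew mentions.

```c
#include <assert.h>

/* Assumption for illustration: 4k base pages, as in the thread. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  (1ul << SKETCH_PAGE_SHIFT)

/*
 * Hypothetical helper: smallest order such that one naturally aligned
 * allocation of (PAGE_SIZE << order) bytes covers a whole block.
 * Block sizes at or below the page size need order 0.
 */
static unsigned int min_folio_order(unsigned long block_size)
{
	unsigned int order = 0;

	/* Block sizes are assumed to be non-zero powers of two. */
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	while ((SKETCH_PAGE_SIZE << order) < block_size)
		order++;
	return order;
}

/*
 * Hypothetical helper: round a page-cache index down to the first
 * page of the order-sized unit that contains it, so every lookup or
 * allocation operates on a multiple of the address space order.
 */
static unsigned long align_index_to_order(unsigned long index,
					  unsigned int order)
{
	return index & ~((1ul << order) - 1);
}
```

With 16k blocks on 4k pages this gives order 2, so a fault on page
index 7 would have to be served from the four-page unit starting at
index 4. The fragmentation concern raised above is about whether such
physically contiguous order-2 (or larger) allocations can always be
satisfied.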