Date: Mon, 18 Sep 2023 15:07:42 +1000
From: Dave Chinner
To: Luis Chamberlain
Cc: Pankaj Raghav, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, p.raghav@samsung.com,
	da.gomez@samsung.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, willy@infradead.org,
	djwong@kernel.org, linux-mm@kvack.org, chandan.babu@oracle.com,
	gost.dev@samsung.com
Subject: Re: [RFC 00/23] Enable block size > page size in XFS
References: <20230915183848.1018717-1-kernel@pankajraghav.com>

On Sun, Sep 17, 2023 at 07:04:24PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > > From: Pankaj Raghav
> > >
> > > There have been efforts over the last 16 years to enable Large
> > > Block Sizes (LBS), that is block sizes in filesystems where bs > page
> > > size [1] [2]. Through these efforts we have learned that one of the
> > > main blockers to supporting bs > ps in filesystems has been a way to
> > > allocate pages that are at least the filesystem block size on the page
> > > cache where bs > ps [3]. Another blocker was the changes needed in
> > > filesystems due to buffer-heads. Thanks to these previous efforts, the
> > > surgery by Matthew Wilcox in the page cache for adopting xarray's
> > > multi-index support, and iomap support, supporting bs > ps in XFS is
> > > now possible with only a few lines of change to XFS. Most of the
> > > changes are to the page cache, to support a minimum folio order for
> > > the target block size on the filesystem.
> > >
> > > A new motivation for LBS today is to support high-capacity (many
> > > terabytes) QLC SSDs where the internal Indirection Units (IUs) are
> > > typically larger than 4k [4] to help reduce DRAM and so in turn cost
> > > and space. In practice this then allows different architectures to use a
> > > base page size of 4k while still enabling support for block sizes
> > > aligned to the larger IUs by relying on high order folios on the page
> > > cache when needed. It also makes it possible to take advantage of these
> > > same drives' support for atomics larger than 4k with buffered IO
> > > support in Linux. As described this year at LSFMM, supporting large
> > > atomics greater than 4k enables databases to remove the need to rely on
> > > their own journaling, so they can disable double buffered writes [5],
> > > which is a feature different cloud providers are already innovating on
> > > and enabling for customers through custom storage solutions.
> > >
> > > This series still needs some polishing and some crashes fixed, but it
> > > is mainly intended to get initial feedback from the community and
> > > enable initial experimentation, hence the RFC.
> > > It's being posted now because our testing is showing much better
> > > results than expected and we hope to polish this up together with the
> > > community. After all, this has been a 16-year effort and none of this
> > > would have been possible without that effort.
> > >
> > > Implementation:
> > >
> > > This series only adds the notion of a minimum order of a folio in the
> > > page cache that was initially proposed by Willy. The minimum folio order
> > > requirement is set during inode creation. The minimum order will
> > > typically correspond to the filesystem block size. The page cache will
> > > in turn respect the minimum folio order requirement while allocating a
> > > folio. This series mainly changes the page cache's filemap, readahead,
> > > and truncation code to allocate and align the folios to the minimum
> > > order set for the filesystem's inode's respective address space mapping.
> > >
> > > Only XFS was enabled and tested as a part of this series as it has
> > > supported block sizes up to 64k and sector sizes up to 32k for years.
> > > The only thing missing was the page cache magic to enable bs > ps.
> > > However, any filesystem that doesn't depend on buffer-heads and already
> > > supports larger block sizes should be able to leverage this effort to
> > > also support LBS, bs > ps.
> > >
> > > This also paves the way for supporting block devices whose logical
> > > block size > page size in the future by leveraging iomap's address space
> > > operation added to the block device cache by Christoph Hellwig [6]. We
> > > have work to enable support for this, enabling LBAs > 4k on NVMe, while
> > > at the same time allowing coexistence with buffer-heads on the same
> > > block device, so that a drive can switch between filesystems which may
> > > depend on buffer-heads or need the iomap address space operations for
> > > the block device cache.
> > > Patches for this will be posted shortly after this patch series.
> >
> > Do you have a git tree branch that I can pull this from
> > somewhere?
> >
> > As it is, I'd really prefer stuff that adds significant XFS
> > functionality that we need to test to be based on a current Linus
> > TOT kernel so that we can test it without being impacted by all
> > the random unrelated breakages that regularly happen in linux-next
> > kernels....
>
> That's understandable! I just rebased onto Linus' tree, this only
> has the bs > ps support on 4k sector size:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
> test with this tree only. I haven't run fstests on it.

W/ 64k block size, generic/042 fails (maybe just a test block size
thing), generic/091 fails (data corruption on read after ~70 ops)
and then generic/095 hung with a crash in iomap_readpage_iter()
during readahead.

Looks like a null folio was passed to ifs_alloc(), which implies the
iomap_readpage_ctx didn't have a folio attached to it. Something
isn't working properly in the readahead code, which would also
explain the quick fsx failure...

> Just a heads up, using 512 byte sector size will fail for now, it's a
> regression we have to fix. Likewise using block sizes 1k, 2k will also
> regress on fsx right now. These are regressions we are aware of but
> haven't had time yet to bisect / fix.

I'm betting that the recently added sub-folio dirty tracking code
got broken by this patchset....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
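
For readers following the thread, the relationship the quoted cover letter describes between filesystem block size and minimum folio order can be sketched as plain arithmetic. This is only an illustration, assuming power-of-two sizes and a 4k base page size; min_folio_order() and align_index() are hypothetical helper names made up for this sketch, not kernel functions or anything taken from the patch series.

```python
# Hypothetical helpers for illustration only; these are not kernel APIs
# and are not taken from the patch series.

def min_folio_order(block_size: int, page_size: int = 4096) -> int:
    """Smallest folio order whose size covers one filesystem block."""
    for name, value in (("block_size", block_size), ("page_size", page_size)):
        # Both sizes are assumed to be positive powers of two.
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two")
    if block_size <= page_size:
        # bs <= ps: a single page (order 0) already covers a block.
        return 0
    # log2(block_size / page_size), computed with integer arithmetic.
    return (block_size // page_size).bit_length() - 1


def align_index(page_index: int, order: int) -> int:
    """Round a page cache index down to the start of its order-sized folio."""
    return page_index & ~((1 << order) - 1)


if __name__ == "__main__":
    for bs in (4096, 16384, 65536):
        print(f"bs={bs} -> min folio order {min_folio_order(bs)}")
```

Under these assumptions, the 16k block size used in the cursory fsx run maps to order 2 (four 4k pages per folio) and the 64k block size used for the fstests run maps to order 4 (sixteen pages per folio).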