Date: Thu, 20 Mar 2025 04:41:11 -0700
From: Luis Chamberlain
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org
Cc: lsf-pc@lists.linux-foundation.org, david@fromorbit.com, leon@kernel.org,
	hch@lst.de, kbusch@kernel.org, sagi@grimberg.me, axboe@kernel.dk,
	joro@8bytes.org, brauner@kernel.org, hare@suse.de, willy@infradead.org,
	djwong@kernel.org, john.g.garry@oracle.com, ritesh.list@gmail.com,
	p.raghav@samsung.com, gost.dev@samsung.com, da.gomez@samsung.com,
	Luis Chamberlain
Subject: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64

We've been constrained to a max single 512
KiB IO for a while now on x86_64. This is due to the number of DMA segments
and the segment size. With LBS the segments can be much bigger without using
huge pages, and so on a 64 KiB block size filesystem you can now see 2 MiB
IOs when using buffered IO. But direct IO is still crippled, because
allocations are from anonymous memory, and unless you are using mTHP you
won't get large folios. mTHP is also non-deterministic, and so you end up in
a worse situation for direct IO if you want to rely on large folios, as you
may *sometimes* end up with large folios and sometimes you might not. IO
patterns can therefore be erratic.

As I just posted in a simple RFC [0], I believe the two step DMA API helps
resolve this. Provided we move the block integrity stuff to the new DMA API
as well, the only patches really needed to support larger IOs for direct IO
for NVMe are:

  iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
  blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit

The other two nvme-pci patches in that series are just there to help with
experimentation for now and they can be ignored.

It does raise a few questions:

  - How are we computing the new max single IO anyway? Are we really
    bounded only by what devices support?
  - Do we believe this is a step in the right direction?
  - Is 2 MiB a sensible max block sector size limit for the next few years?
  - What other considerations should we have?
  - Do we want something more deterministic for large folios for direct IO?

[0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org

  Luis
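
For reference, the segment arithmetic behind the 512 KiB figure can be
sketched roughly as below. This is only a back-of-the-envelope illustration:
the 128-segment limit and the 4 KiB worst-case segment size are assumptions
picked because they reproduce the 512 KiB number, and real limits depend on
the driver and device. The helper names are hypothetical and do not
correspond to any kernel API.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumption for illustration only: a limit of 128 DMA segments per request.
ASSUMED_MAX_SEGMENTS = 128


def max_single_io(max_segments: int, segment_bytes: int) -> int:
    """Largest IO if every segment covers exactly segment_bytes (worst case)."""
    return max_segments * segment_bytes


def segments_needed(io_bytes: int, segment_bytes: int) -> int:
    """Number of segments needed to describe io_bytes, rounded up."""
    return -(-io_bytes // segment_bytes)


if __name__ == "__main__":
    # Small, physically discontiguous 4 KiB pages: one page per segment.
    print(max_single_io(ASSUMED_MAX_SEGMENTS, 4 * KIB) // KIB, "KiB")
    # A 2 MiB IO built from 64 KiB folios versus from 4 KiB pages.
    print(segments_needed(2 * MIB, 64 * KIB), "segments with 64 KiB folios")
    print(segments_needed(2 * MIB, 4 * KIB), "segments with 4 KiB pages")
```

Under these assumptions a 2 MiB IO fits comfortably within the segment
budget when memory comes in 64 KiB folios, but not when it comes in 4 KiB
pages, which is the situation direct IO can land in without large folios.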