From: Ritesh Harjani (IBM)
To: Luis Chamberlain, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org
Cc: lsf-pc@lists.linux-foundation.org, david@fromorbit.com, leon@kernel.org, hch@lst.de, kbusch@kernel.org, sagi@grimberg.me, axboe@kernel.dk, joro@8bytes.org, brauner@kernel.org, hare@suse.de, willy@infradead.org, djwong@kernel.org, john.g.garry@oracle.com, p.raghav@samsung.com, gost.dev@samsung.com, da.gomez@samsung.com, Luis Chamberlain
Subject: Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
Date: Fri, 21 Mar 2025 00:16:28 +0530
Message-ID: <87o6xvsfp7.fsf@gmail.com>

Luis Chamberlain writes:

> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> This is due to the number of DMA segments and the segment size.
> With LBS the segments can be much bigger without using huge pages, and so
> on a 64 KiB block size filesystem you can now see 2 MiB IOs when using
> buffered IO. But direct IO is still crippled, because allocations are from
> anonymous memory, and unless you are using mTHP you won't get large folios.
> mTHP is also non-deterministic, and so you end up in a worse situation for
> direct IO if you want to rely on large folios, as you may *sometimes*
> end up with large folios and sometimes you might not. IO patterns can
> therefore be erratic.
>
> As I just posted in a simple RFC [0], I believe the two step DMA API
> helps resolve this. Provided we move the block integrity stuff to the
> new DMA API as well, the only patches really needed to support larger
> IOs for direct IO for NVMe are:
>
>   iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
>   blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit

These may be naive questions, but I would like help from people who can
confirm whether my understanding here is correct.

Given that we now support large folios in buffered I/O directly on raw
block devices, applications must carefully serialize direct I/O and
buffered I/O operations on these devices, right?

IIUC, until now, mixing buffered I/O and direct I/O (for doing I/O on
/dev/xxx) on separate boundaries (blocksize == pagesize) worked fine,
since direct I/O would only invalidate its corresponding page in the
page cache. This assumes that both direct I/O and buffered I/O use the
same blocksize and pagesize (e.g. both using 4K or both using 64K).
However, with large folios now introduced in the buffered I/O path for
block devices, direct I/O may end up invalidating an entire large folio,
which could span a region where an ongoing buffered I/O operation is
taking place.

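
To make the concern concrete, here is a rough userspace sketch of the
arithmetic. It is not kernel code. It assumes folios are naturally
aligned to their size and that a direct I/O write invalidates every
folio overlapping its byte range; invalidated_range() and overlaps()
are hypothetical helper names used only for illustration.

```python
# Illustrative arithmetic only; not kernel code.
# Assumption: folios are naturally aligned to their size, and direct I/O
# invalidates every folio that overlaps the written byte range.

KIB = 1024


def invalidated_range(dio_offset, dio_len, folio_size):
    """Return (start, end), end exclusive, covered by the folios that
    overlap the direct I/O range [dio_offset, dio_offset + dio_len)."""
    if dio_len <= 0 or folio_size <= 0:
        raise ValueError("length and folio size must be positive")
    start = (dio_offset // folio_size) * folio_size
    last_byte = dio_offset + dio_len - 1
    end = (last_byte // folio_size + 1) * folio_size
    return start, end


def overlaps(a, b):
    """True if half-open byte ranges a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


# A buffered writer works on [0, 4K); a direct writer works on [4K, 8K).
buffered_region = (0, 4 * KIB)
dio_offset, dio_len = 4 * KIB, 4 * KIB

for folio_size in (4 * KIB, 64 * KIB):
    inv = invalidated_range(dio_offset, dio_len, folio_size)
    print(f"{folio_size // KIB} KiB folios: invalidates {inv}, "
          f"touches buffered region: {overlaps(inv, buffered_region)}")
```

Under these assumptions, with 4 KiB folios the invalidated range stays
within the direct I/O region, while with a 64 KiB folio it also covers
the neighbouring 4 KiB that the buffered writer is using.
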
That means, with large folio support in block devices, application
developers must now ensure that direct I/O and buffered I/O operations
on block devices are properly serialized, correct?

I was looking at the POSIX page [1] and I don't think the POSIX standard
defines the semantics for operations on block devices. So it is really
up to the individual OS implementation, correct?

And IIUC, what Linux recommends is to never mix any kind of direct I/O
and buffered I/O when doing I/O on raw block devices, but I cannot find
this recommendation in any Documentation. Can someone please point me to
a place where we recommend this?

[1]: https://pubs.opengroup.org/onlinepubs/9799919799/

-ritesh

>
> The other two nvme-pci patches in that series are to just help with
> experimentation now and they can be ignored.
>
> It does beg a few questions:
>
>  - How are we computing the new max single IO anyway? Are we really
>    bounded only by what devices support?
>  - Do we believe this is the step in the right direction?
>  - Is 2 MiB a sensible max block sector size limit for the next few years?
>  - What other considerations should we have?
>  - Do we want something more deterministic for large folios for direct IO?
>
> [0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
>
>   Luis