Date: Mon, 26 Feb 2024 23:22:46 +1100
From: Dave Chinner <david@fromorbit.com>
To: Luis Chamberlain
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox,
	Linus Torvalds
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> I recently ran a different type of simple test, focused on sequential
> writes to fill capacity, with the write workload essentially matching
> your RAM, so having parity with it. Technically, in the max-size case
> I tested, the writes were just *slightly* over the RAM; that's a
> minor technicality given that other tests with similar sizes showed
> similar results... This test should be reproducible if you have more
> than enough RAM to spare. In this case the system uses 1 TiB RAM,
> using pmem to avoid drive variance / GC / other drive shenanigans.
>
> So pmem grub setup:
>
> memmap=500G!4G memmap=3G!504G
>
> As noted earlier, surely DIO / DAX is best for pmem (and I actually
> get a difference between using just DIO and DAX, but that digresses),
> but when one wishes to test buffered IO on purpose it makes sense to
> do this. Yes, we can test tmpfs too... but I believe that topic will
> be brought up at LSFMM separately. The delta between DIO and buffered
> IO on XFS is astronomical:
>
> ~86 GiB/s pmem DIO on XFS with 64k block size, 1024 XFS agcount, on
> x86_64
>   vs
> ~7,000 MiB/s with buffered IO

You're not testing apples to apples. Buffered writes to the same
superblock serialise on IO submission, not on write() calls, so it
doesn't matter how much concurrency you have in write() syscalls.

That is, streaming buffered write throughput is entirely limited by
the number of IOs that the bdi flusher thread can submit. For ext4,
XFS and btrfs, delayed allocation means that this writeback thread is
also doing extent allocation for all IO, and hence the single
writeback thread for buffered writes is the performance limiting
factor for them. It doesn't matter how fast you can copy data into
the kernel; it can only drain as fast as that thread can submit IO.
As soon as the writeback thread is CPU bound, incoming buffered
write()s will be throttled back to the rate at which memory can be
cleaned by the writeback thread.

Direct IO doesn't have this limitation - it's an orange in comparison
because the IO is always submitted by the task that does the write()
syscall. Hence it inherently scales out to the limit of the
underlying hardware, and it is not limited by the throughput of a
single CPU the way page cache writeback is.

If you wonder why people are saying "issue sync_file_range()
periodically" to improve buffered write throughput, it's because that
moves the async writeback submission for the inode out of the single
background writeback thread and into task context, where IO
submission can be trivially parallelised. Just like direct IO....
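Roughly, a minimal userspace sketch of that pattern looks like the
following (illustrative only - the CHUNK and WINDOW values are
arbitrary, and error handling is trimmed to the bare minimum):

/*
 * Sketch: submit writeback from task context via sync_file_range()
 * instead of leaving all submission to the single bdi flusher thread.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK  (1UL << 20)	/* copy 1 MiB per write() */
#define WINDOW (64UL << 20)	/* kick writeback every 64 MiB */

static char buf[CHUNK];

int stream_out(int fd, off_t total)
{
	off_t off = 0, flushed = 0;

	memset(buf, 'x', sizeof(buf));
	while (off < total) {
		if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
			return -1;
		off += CHUNK;

		if (off - flushed >= (off_t)WINDOW) {
			/*
			 * Start async writeback of the window we just
			 * dirtied. SYNC_FILE_RANGE_WRITE submits the IO
			 * from this task's context and returns without
			 * waiting for completion, so submission is no
			 * longer serialised behind the flusher thread.
			 */
			if (sync_file_range(fd, flushed, off - flushed,
					    SYNC_FILE_RANGE_WRITE) < 0)
				return -1;
			flushed = off;
		}
	}
	return 0;
}

The usual refinement is a second sync_file_range() call with
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER on the window behind this one, so the
amount of dirty page cache stays bounded at the cost of occasionally
blocking the writer.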
IOWs, the issue you are demonstrating is the inherent limitation of
single threaded write-behind cache flushing, and the solution to that
specific bottleneck is to enable concurrent writeback submission from
the same file and/or superblock via the various manual mechanisms
that are available. An automatic way of doing this for large
streaming writes is to switch from write-behind to
near-write-through, such that the majority of write IO is submitted
asynchronously from the write() syscall.

Think of how readahead from read() context pulls in data that is
likely to be needed soon - sequential writes should trigger similar
behaviour, where we do async write-behind of the previous write()s in
the context of the current write. Track a sequential write window
like we do for readahead, and trigger async writeback for such
streaming writes from the write() context...

That doesn't solve the huge tarball problem, where we create millions
of small files in a couple of seconds and then have to wait for
single threaded writeback to drain them to storage at 50,000 files/s.
We can create files and get the data into the cache far faster, and
with far more concurrency, than the page cache can push that data
back to storage itself.

IOWs, the problems with page cache write throughput really have
nothing to do with write() scalability, folios or filesystem block
sizes. The fundamental problem is single-threaded writeback IO
submission, which throttles incoming writes to whatever speed it runs
at when CPU bound....

-Dave.

-- 
Dave Chinner
david@fromorbit.com