Date: Fri, 10 May 2024 16:57:07 -0700
From: Luis Chamberlain
To: Chris Mason, Dave Chinner, David Bueso, Kent Overstreet,
 "Paul E. McKenney"
Cc: Linus Torvalds, Matthew Wilcox, lsf-pc@lists.linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav,
 Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Sat, Feb 24, 2024 at 05:57:43PM -0500, Chris Mason wrote:
> Going back to Luis's original email, I'd echo Willy's suggestion for
> profiles. Unless we're saturating memory bandwidth, buffered should be
> able to get much closer to O_DIRECT, just at a much higher overall cost.

I finally had some time to look beyond just "what locks" could be the
main culprit; David Bueso helped me review this, thanks! Based on all
the discussions on this insanely long thread, I do believe the issue
was the single-threaded write-behind cache flushing that Chinner noted.

Lifting /proc/sys/vm/dirty_ratio from 20 to 90 keeps the profile perky
and nice, with most of the top penalties seen just in userspace, as shown
in paste exhibit a) below; but as soon as we start throttling we hit the
profile in paste exhibit b) below.
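
As an aside for anyone reproducing this: the knob change above is just
the usual sysctl write. Here is a minimal C sketch of the equivalent of
"sysctl -w vm.dirty_ratio=90" (assuming root and the standard procfs
path; this is not my actual test tooling):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/*
	 * vm.dirty_ratio is the percentage of dirtyable memory at which
	 * a process generating writes starts writing back dirty data
	 * itself, i.e. where the throttling seen in exhibit b) kicks in.
	 */
	FILE *f = fopen("/proc/sys/vm/dirty_ratio", "w");

	if (!f) {
		perror("fopen /proc/sys/vm/dirty_ratio");
		return EXIT_FAILURE;
	}
	fprintf(f, "90\n");
	fclose(f);
	return EXIT_SUCCESS;
}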
a) without the throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 1061541571785
  Children      Self  Command          Shared Object     Symbol
+   17.05%    16.85%  fio              fio               [.] get_io_u
+    3.04%     0.01%  fio              [kernel.vmlinux]  [k] entry_SYSCALL_64
+    3.03%     0.02%  fio              [kernel.vmlinux]  [k] do_syscall_64
+    1.39%     0.04%  fio              [kernel.vmlinux]  [k] __do_sys_io_uring_enter
+    1.33%     0.00%  fio              libc.so.6         [.] __GI___libc_open
+    1.33%     0.00%  fio              [kernel.vmlinux]  [k] __x64_sys_openat
+    1.33%     0.00%  fio              [kernel.vmlinux]  [k] do_sys_openat2
+    1.33%     0.00%  fio              [unknown]         [k] 0x312d6e65742f2f6d
+    1.33%     0.00%  fio              [kernel.vmlinux]  [k] do_filp_open
+    1.33%     0.00%  fio              [kernel.vmlinux]  [k] path_openat
+    1.29%     0.00%  fio              [kernel.vmlinux]  [k] down_write
+    1.29%     0.00%  fio              [kernel.vmlinux]  [k] rwsem_down_write_slowpath
+    1.26%     1.25%  fio              [kernel.vmlinux]  [k] osq_lock
+    1.14%     0.00%  fio              fio               [.] 0x000055bbb94449fa
+    1.14%     1.14%  fio              fio               [.] 0x000000000002a9f5
+    0.98%     0.00%  fio              [unknown]         [k] 0x000055bbd6310520
+    0.93%     0.00%  fio              fio               [.] 0x000055bbb94b197b
+    0.89%     0.00%  perf             libc.so.6         [.] __GI___libc_write
+    0.89%     0.00%  perf             [kernel.vmlinux]  [k] entry_SYSCALL_64
+    0.88%     0.00%  perf             [kernel.vmlinux]  [k] do_syscall_64
+    0.86%     0.00%  perf             [kernel.vmlinux]  [k] ksys_write
+    0.85%     0.01%  perf             [kernel.vmlinux]  [k] vfs_write
+    0.83%     0.00%  perf             [ext4]            [k] ext4_buffered_write_iter
+    0.81%     0.01%  perf             [kernel.vmlinux]  [k] generic_perform_write
+    0.77%     0.02%  fio              [kernel.vmlinux]  [k] io_submit_sqes
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] ret_from_fork_asm
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] ret_from_fork
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] kthread
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] worker_thread
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] process_one_work
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] wb_workfn
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] wb_writeback
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] __writeback_inodes_wb
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] writeback_sb_inodes
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] __writeback_single_inode
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] do_writepages
+    0.76%     0.00%  kworker/u513:26  [xfs]             [k] xfs_vm_writepages
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] submit_bio_noacct_nocheck
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]  [k] iomap_submit_ioend

So we see *more* penalty because of perf's own buffered IO writes of the
perf data than from any writeback from XFS.

b) when we hit throttling:
Samples: 1M of event 'cycles:P', Event count (approx.): 816903693659
  Children      Self  Command          Shared Object     Symbol
+   14.24%    14.06%  fio              fio               [.] get_io_u
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] ret_from_fork_asm
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] ret_from_fork
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] kthread
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] worker_thread
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] process_one_work
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] wb_workfn
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] wb_writeback
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] __writeback_inodes_wb
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] writeback_sb_inodes
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] __writeback_single_inode
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] do_writepages
+    4.87%     0.00%  kworker/u513:3-  [xfs]             [k] xfs_vm_writepages
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] iomap_submit_ioend
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] submit_bio_noacct_nocheck
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]  [k] __submit_bio
+    4.82%     0.04%  kworker/u513:3-  [nd_pmem]         [k] pmem_submit_bio
+    4.78%     0.05%  kworker/u513:3-  [nd_pmem]         [k] pmem_do_write

Although my focus was on measuring the limits of the page cache, this
thread also had a *slew* of ideas on how to improve the status quo,
pathological or not. We have to accept that some workloads are clearly
pathological, but that's the point of coming up with limits and testing
the page cache. And since there were a slew of unexpected ideas spread
out over this entire thread about general improvements, even for general
use cases, I've collected all of them and put them down as notes for
review for this topic at LSFMM.

Thanks all for the feedback!

  Luis