From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 22 Jan 2025 10:22:42 +0100
From: Jan Kara <jack@suse.cz>
To: Joanne Koong
Cc: Jan Kara, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, "Matthew Wilcox (Oracle)"
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

On Tue 21-01-25 16:29:57, Joanne Koong wrote:
> On Mon, Jan 20, 2025 at 2:42 PM Jan Kara wrote:
> > On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> > > On Fri, Jan 17, 2025 at 3:53 AM Jan Kara wrote:
> > > > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > > > I think tweaking min_pause is the wrong way to do this. I think that
> > > > is just a symptom. Can you run something like:
> > > >
> > > > while true; do
> > > > 	cat /sys/kernel/debug/bdi/<bdi>/stats
> > > > 	echo "---------"
> > > > 	sleep 1
> > > > done >bdi-debug.txt
> > > >
> > > > while you are writing to the FUSE filesystem, and share the output
> > > > file? That should tell us a bit more about what's happening inside
> > > > the writeback throttling. Also, do you somehow configure min/max_ratio
> > > > for the FUSE bdi? You can check in /sys/block/<device>/bdi/{min,max}_ratio.
> > > > I suspect the problem is that the BDI dirty limit does not ramp up
> > > > properly when we increase dirtied pages in large chunks.
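[Editor's aside, not part of the original thread: each snapshot the loop above emits is plain `key: value` text, so a minimal sketch like the following — the helper name and sample values are illustrative, not from the kernel tree — can turn the snapshots into dicts for diffing over time.]

```python
# Illustrative helper (not from the thread): parse one snapshot of
# /sys/kernel/debug/bdi/<bdi>/stats into a dict of integer values.
def parse_bdi_stats(text):
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skips the "---------" separator lines
        key, value = line.split(":", 1)
        fields = value.split()
        if fields:
            stats[key.strip()] = int(fields[0])  # drop the kB/kBps unit
    return stats

# Sample values taken from the snapshots quoted later in this thread.
sample = """BdiWriteback:       0 kB
BdiDirtyThresh:   896 kB
DirtyThresh:   359824 kB
BdiWriteBandwidth:  0 kBps
state: 1"""
snapshot = parse_bdi_stats(sample)
```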
> > >
> > > This is the debug info I see for FUSE large folio writes where bs=1M
> > > and size=1G:
> > >
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 896 kB
> > > DirtyThresh: 359824 kB
> > > BackgroundThresh: 179692 kB
> > > BdiDirtied: 1071104 kB
> > > BdiWritten: 4096 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3596 kB
> > > DirtyThresh: 359824 kB
> > > BackgroundThresh: 179692 kB
> > > BdiDirtied: 1290240 kB
> > > BdiWritten: 4992 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3596 kB
> > > DirtyThresh: 359824 kB
> > > BackgroundThresh: 179692 kB
> > > BdiDirtied: 1517568 kB
> > > BdiWritten: 5824 kB
> > > BdiWriteBandwidth: 25692 kBps
> > > b_dirty: 0
> > > b_io: 1
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 7
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3596 kB
> > > DirtyThresh: 359824 kB
> > > BackgroundThresh: 179692 kB
> > > BdiDirtied: 1747968 kB
> > > BdiWritten: 6720 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 896 kB
> > > DirtyThresh: 359824 kB
> > > BackgroundThresh: 179692 kB
> > > BdiDirtied: 1949696 kB
> > > BdiWritten: 7552 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3612 kB
> > > DirtyThresh: 361300 kB
> > > BackgroundThresh: 180428 kB
> > > BdiDirtied: 2097152 kB
> > > BdiWritten: 8128 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > >
> > > I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> > > This is what I see on my system:
> > >
> > > cat /sys/class/bdi/0:52/min_ratio
> > > 0
> > > cat /sys/class/bdi/0:52/max_ratio
> > > 1
> >
> > OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
> > Checking the code, this shows we are hitting the __wb_calc_thresh() logic:
> >
> > 	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> > 		unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> > 		u64 wb_scale_thresh = 0;
> >
> > 		if (limit > dtc->dirty)
> > 			wb_scale_thresh = (limit - dtc->dirty) / 100;
> > 		wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
> > 	}
> >
> > so BdiDirtyThresh is set to DirtyThresh/100. This also shows the bdi never
> > generates enough throughput to ramp up its share from this initial value.
> >
> > > > Actually, there's a patch queued in the mm tree that improves the
> > > > ramping up of the bdi dirty limit for strictlimit bdis [1]. It would
> > > > be nice if you could test whether it changes something in the
> > > > behavior you observe. Thanks!
> > > >
> > > > 								Honza
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> > >
> > > I still see the same results (~230 MiB/s throughput using fio) with
> > > this patch applied, unfortunately. Here's the debug info I see with
> > > this patch (same test scenario as above on FUSE large folio writes
> > > where bs=1M and size=1G):
> > >
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 2048 kB
> > > BdiDirtyThresh: 3588 kB
> > > DirtyThresh: 359132 kB
> > > BackgroundThresh: 179348 kB
> > > BdiDirtied: 51200 kB
> > > BdiWritten: 128 kB
> > > BdiWriteBandwidth: 102400 kBps
> > > b_dirty: 1
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 5
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3588 kB
> > > DirtyThresh: 359144 kB
> > > BackgroundThresh: 179352 kB
> > > BdiDirtied: 331776 kB
> > > BdiWritten: 1216 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3588 kB
> > > DirtyThresh: 359144 kB
> > > BackgroundThresh: 179352 kB
> > > BdiDirtied: 562176 kB
> > > BdiWritten: 2176 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 0 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3588 kB
> > > DirtyThresh: 359144 kB
> > > BackgroundThresh: 179352 kB
> > > BdiDirtied: 792576 kB
> > > BdiWritten: 3072 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> > > BdiWriteback: 64 kB
> > > BdiReclaimable: 0 kB
> > > BdiDirtyThresh: 3588 kB
> > > DirtyThresh: 359144 kB
> > > BackgroundThresh: 179352 kB
> > > BdiDirtied: 1026048 kB
> > > BdiWritten: 3904 kB
> > > BdiWriteBandwidth: 0 kBps
> > > b_dirty: 0
> > > b_io: 0
> > > b_more_io: 0
> > > b_dirty_time: 0
> > > bdi_list: 1
> > > state: 1
> > > ---------
> >
> > Yeah, here the situation is really the same.
> > As an experiment, can you
> > try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I don't
> > expect you should need to go past 10) and figure out when there's
> > enough slack space for the writeback bandwidth to ramp up to full
> > speed? Thanks!
> >
> > 								Honza
>
> When locally testing this, I'm seeing that max_ratio affects the
> bandwidth more than min_ratio does (eg the different min_ratios have
> roughly the same bandwidth per max_ratio). I'm also seeing somewhat
> high variance across runs, which makes it hard to gauge what's
> accurate, but on average this is what I'm seeing:
>
> max_ratio=1 --- bandwidth= ~230 MiB/s
> max_ratio=2 --- bandwidth= ~420 MiB/s
> max_ratio=3 --- bandwidth= ~550 MiB/s
> max_ratio=4 --- bandwidth= ~653 MiB/s
> max_ratio=5 --- bandwidth= ~700 MiB/s
> max_ratio=6 --- bandwidth= ~810 MiB/s
> max_ratio=7 --- bandwidth= ~1040 MiB/s (and then a lot of times, 561
> MiB/s on subsequent runs)

Ah, sorry. I misinterpreted your reply from a previous email:

> > > cat /sys/class/bdi/0:52/max_ratio
> > > 1

This means the amount of dirty pages for the fuse filesystem is indeed
hard-capped at 1% of the dirty limit, which happens to be ~3MB on your
machine. Checking where this is coming from, I can see that
fuse_bdi_init() does this:

	bdi_set_max_ratio(sb->s_bdi, 1);

So FUSE restricts itself, and with only a 3MB dirty limit and 2MB
dirtying granularity it is not surprising that dirty throttling doesn't
work well. I'd say there needs to be some better heuristic within FUSE
that balances the maximum folio size and the maximum dirty limit setting
for the filesystem to a sensible compromise (so that there's space for
at least, say, 10 dirty max-sized folios within the dirty limit). But I
guess this is just a shorter-term workaround. Long-term, finer-grained
dirtiness tracking within FUSE (and writeback counter tracking in MM) is
going to be a more effective solution.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
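[Editor's aside, not part of the original thread: the arithmetic behind the "10 max-sized folios" heuristic can be made concrete with a small sketch. The function name is illustrative, not a kernel API; the numbers are the ones reported above.]

```python
import math

def min_max_ratio(dirty_thresh_kb, max_folio_kb, min_folios=10):
    """Smallest bdi max_ratio (in percent) that leaves room for
    min_folios max-sized dirty folios under the global dirty limit."""
    needed_kb = min_folios * max_folio_kb
    return max(1, math.ceil(100 * needed_kb / dirty_thresh_kb))

# Thread numbers: DirtyThresh ~359824 kB, 2 MB (2048 kB) large folios.
# FUSE's hard-coded max_ratio of 1 caps the bdi at ~3598 kB, which is
# less than two 2 MB folios; fitting ten of them needs roughly 6%,
# which matches where the measured bandwidth ramps up in the table above.
cap_at_one_percent_kb = 359824 // 100
suggested_ratio = min_max_ratio(359824, 2048)
```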