Date: Thu, 16 Jan 2025 12:01:15 +0100
From: Jan Kara
To: Joanne Koong
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, "Matthew Wilcox (Oracle)"
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance

Hello!

On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> I would like to propose a discussion topic about improving large folio
> writeback performance. As more filesystems adopt large folios, it
> becomes increasingly important that writeback is made to be as
> performant as possible. There are two areas I'd like to discuss:
>
> == Granularity of dirty pages writeback ==
> Currently, the granularity of writeback is at the folio level. If one
> byte in a folio is dirty, the entire folio will be written back. This
> becomes unscalable for larger folios and significantly degrades
> performance, especially for workloads that employ random writes.
>
> One idea is to track dirty pages at a smaller granularity using a
> 64-bit bitmap stored inside the folio struct where each bit tracks a
> smaller chunk of pages (eg for 2 MB folios, each bit would track 32k
> pages), and only write back dirty chunks rather than the entire folio.

Yes, this is a known problem and, as Dave pointed out, it is currently up
to the lower layer to handle finer-grained dirtiness tracking. You can
take inspiration from the iomap layer, which already does this, or you
can convert your filesystem to use iomap (the preferred way).

> == Balancing dirty pages ==
> It was observed that the dirty page balancing logic used in
> balance_dirty_pages() fails to scale for large folios [1]. For
> example, fuse saw around a 125% drop in throughput for writes when
> using large folios vs small folios on 1MB block sizes, which was
> attributed to scheduled io waits in the dirty page balancing logic. In
> generic_perform_write(), dirty pages are balanced after every write to
> the page cache by the filesystem. With large folios, each write
> dirties a larger number of pages which can grossly exceed the
> ratelimit, whereas with small folios each write is one page and so
> pages are balanced more incrementally and adheres more closely to the
> ratelimit. In order to accommodate large folios, likely the logic in
> balancing dirty pages needs to be reworked.

I think there are several separate issues here. One is that
folio_account_dirtied() will consider the whole folio as needing
writeback, which is not necessarily the case (e.g. iomap will write back
only the dirty blocks in it). This was OKish when pages were 4k and you
were using 1k blocks (which was an uncommon configuration anyway; usually
you had 4k block size), but it starts to hurt a lot with 2M folios, so we
might need to find a way to propagate the information about the really
dirty bits into writeback accounting.

Another problem *may* be that fast increments to dirtied pages (as we
dirty 512 pages at once instead of the 16 we did in the past) cause an
over-reaction in the dirtiness balancing logic and we throttle the task
too much. The heuristics there try to find the right amount of time to
block a task so that the dirtying speed matches the writeback speed, and
it's plausible that the large increments make this logic oscillate
between two extremes, leading to suboptimal throughput. Also, since this
was observed with FUSE, I believe a significant factor is that FUSE
enables the "strictlimit" feature of the BDI, which makes dirty
throttling more aggressive (generally the amount of allowed dirty pages
is lower).

Anyway, these are mostly speculations on my end. This needs more data to
decide what exactly (if anything) needs tweaking in the dirty throttling
logic.

								Honza
--
Jan Kara
SUSE Labs, CR
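
For readers following the thread, the chunk-granularity idea quoted above
(one 64-bit bitmap per 2 MB folio, so each bit covers 32k) can be modelled
in a few lines. This is only a userspace sketch of the arithmetic, not
kernel code and not how iomap implements it; every name below
(FOLIO_SIZE, mark_dirty, dirty_ranges, bytes_to_write) is hypothetical
and chosen for illustration.

```python
# Userspace model only: not kernel code. All names here are hypothetical.
FOLIO_SIZE = 2 * 1024 * 1024             # 2 MB folio, as in the quoted example
BITMAP_BITS = 64                         # one 64-bit word of dirty bits
CHUNK_SIZE = FOLIO_SIZE // BITMAP_BITS   # bytes tracked per bit (32 KiB)


def mark_dirty(bitmap: int, offset: int, length: int) -> int:
    """Return bitmap with every chunk overlapping [offset, offset+length) set."""
    if length <= 0 or offset < 0 or offset + length > FOLIO_SIZE:
        raise ValueError("range must be non-empty and inside the folio")
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    mask = ((1 << (last - first + 1)) - 1) << first
    return bitmap | mask


def dirty_ranges(bitmap: int):
    """Yield (offset, length) byte ranges, one per run of adjacent dirty chunks."""
    bit = 0
    while bit < BITMAP_BITS:
        if not (bitmap >> bit) & 1:
            bit += 1
            continue
        start = bit
        while bit < BITMAP_BITS and (bitmap >> bit) & 1:
            bit += 1
        yield start * CHUNK_SIZE, (bit - start) * CHUNK_SIZE


def bytes_to_write(bitmap: int) -> int:
    """Bytes that chunk-granular writeback would submit for this folio."""
    return bin(bitmap).count("1") * CHUNK_SIZE
```

In this model a one-byte write marks a single chunk, so writeback would
submit CHUNK_SIZE bytes rather than the whole FOLIO_SIZE. The same gap is
what separates the "whole folio" dirty accounting from the really dirty
part discussed in the reply.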