Date: Fri, 17 Jan 2025 12:53:12 +0100
From: Jan Kara <jack@suse.cz>
To: Joanne Koong
Cc: Jan Kara, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, "Matthew Wilcox (Oracle)"
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> On Thu, Jan 16, 2025 at 3:01 AM Jan Kara wrote:
> > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > I would like to propose a discussion topic about improving large folio
> > > writeback performance. As more filesystems adopt large folios, it
> > > becomes increasingly important that writeback is made to be as
> > > performant as possible. There are two areas I'd like to discuss:
> > >
> > > == Granularity of dirty pages writeback ==
> > > Currently, the granularity of writeback is at the folio level. If one
> > > byte in a folio is dirty, the entire folio will be written back. This
> > > becomes unscalable for larger folios and significantly degrades
> > > performance, especially for workloads that employ random writes.
> > >
> > > One idea is to track dirty pages at a smaller granularity using a
> > > 64-bit bitmap stored inside the folio struct, where each bit tracks a
> > > smaller chunk of the folio (e.g. for 2MB folios, each bit would track
> > > a 32KB chunk), and only write back dirty chunks rather than the
> > > entire folio.
> >
> > Yes, this is a known problem and, as Dave pointed out, it is currently
> > up to the lower layer to handle finer-grained dirtiness. You can take
> > inspiration from the iomap layer, which already does this, or you can
> > convert your filesystem to use iomap (the preferred way).
> >
> > > == Balancing dirty pages ==
> > > It was observed that the dirty page balancing logic used in
> > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > example, fuse saw around a 125% drop in throughput for writes when
> > > using large folios vs small folios on 1MB block sizes, which was
> > > attributed to scheduled io waits in the dirty page balancing logic.
> > > In generic_perform_write(), dirty pages are balanced after every
> > > write to the page cache by the filesystem. With large folios, each
> > > write dirties a larger number of pages, which can grossly exceed the
> > > ratelimit, whereas with small folios each write is one page, so
> > > pages are balanced more incrementally and adhere more closely to
> > > the ratelimit. In order to accommodate large folios, the logic for
> > > balancing dirty pages likely needs to be reworked.
> >
> > I think there are several separate issues here. One is that
> > folio_account_dirtied() will consider the whole folio as needing
> > writeback, which is not necessarily the case (as e.g. iomap will write
> > back only the dirty blocks in it). This was OKish when pages were 4k
> > and you were using 1k blocks (an uncommon configuration anyway;
> > usually you had 4k block size), but it starts to hurt a lot with 2M
> > folios, so we might need to find a way to propagate the information
> > about the really dirty bits into writeback accounting.
>
> Agreed. The only workable solution I see is to have some sort of api
> similar to filemap_dirty_folio() that takes in the number of pages
> dirtied as an arg, but maybe there's a better solution.

Yes, something like that I suppose.

> > Another problem *may* be that fast increments to the dirtied pages (as
> > we dirty 512 pages at once instead of the 16 we did in the past) cause
> > over-reaction in the dirtiness balancing logic and we throttle the
> > task too much.
> > The heuristics there try to find the right amount of time to block a
> > task so that the dirtying speed matches the writeback speed, and it's
> > plausible that the large increments make this logic oscillate between
> > two extremes, leading to suboptimal throughput. Also, since this was
> > observed with FUSE, I believe a significant factor is that FUSE
> > enables the "strictlimit" feature of the BDI, which makes dirty
> > throttling more aggressive (generally the amount of allowed dirty
> > pages is lower). Anyway, these are mostly speculations from my end.
> > This needs more data to decide what exactly (if anything) needs
> > tweaking in the dirty throttling logic.
>
> I tested this experimentally and you're right, on FUSE this is
> impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> the dirty throttle control freerun flag (which gets used to determine
> whether throttling can be skipped) in the balance_dirty_pages() logic.
> For FUSE, we can't turn off strictlimit for unprivileged servers, but
> maybe we can make the throttling check more permissive by upping the
> value of the min_pause calculation in wb_min_pause() for writes that
> support large folios? As of right now, the current logic makes writing
> large folios unfeasible in FUSE (estimates show around a 75% drop in
> throughput).

I think tweaking min_pause is the wrong way to do this; I think that is
just a symptom. Can you run something like:

  while true; do
      cat /sys/kernel/debug/bdi//stats
      echo "---------"
      sleep 1
  done >bdi-debug.txt

while you are writing to the FUSE filesystem and share the output file?
That should tell us a bit more about what's happening inside the
writeback throttling. Also, do you somehow configure min/max_ratio for
the FUSE bdi? You can check in /sys/block//bdi/{min,max}_ratio. I
suspect the problem is that the BDI dirty limit does not ramp up
properly when we increase dirtied pages in large chunks.
Actually, there's a patch queued in the mm tree that improves the ramping
up of the bdi dirty limit for strictlimit bdis [1]. It would be nice if
you could test whether it changes something in the behavior you observe.
Thanks!

								Honza

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR