From: Joanne Koong
Date: Thu, 16 Jan 2025 15:38:54 -0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
To: Jan Kara
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, "Matthew Wilcox (Oracle)"

On Thu, Jan 16, 2025 at 3:01 AM Jan Kara wrote:
>
> Hello!
>
> On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > I would like to propose a discussion topic about improving large folio
> > writeback performance. As more filesystems adopt large folios, it
> > becomes increasingly important that writeback is made to be as
> > performant as possible.
> > There are two areas I'd like to discuss:
> >
> > == Granularity of dirty pages writeback ==
> > Currently, the granularity of writeback is at the folio level. If one
> > byte in a folio is dirty, the entire folio will be written back. This
> > becomes unscalable for larger folios and significantly degrades
> > performance, especially for workloads that employ random writes.
> >
> > One idea is to track dirty pages at a smaller granularity using a
> > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > smaller chunk of pages (e.g. for 2 MB folios, each bit would track a
> > 32k chunk), and only write back dirty chunks rather than the entire
> > folio.
>
> Yes, this is a known problem and, as Dave pointed out, currently it is up
> to the lower layer to handle finer-grained dirtiness tracking. You can take
> inspiration from the iomap layer that already does this, or you can convert
> your filesystem to use iomap (the preferred way).
>
> > == Balancing dirty pages ==
> > It was observed that the dirty page balancing logic used in
> > balance_dirty_pages() fails to scale for large folios [1]. For
> > example, fuse saw around a 125% drop in throughput for writes when
> > using large folios vs small folios on 1MB block sizes, which was
> > attributed to scheduled io waits in the dirty page balancing logic. In
> > generic_perform_write(), dirty pages are balanced after every write to
> > the page cache by the filesystem. With large folios, each write
> > dirties a larger number of pages which can grossly exceed the
> > ratelimit, whereas with small folios each write is one page and so
> > pages are balanced more incrementally and adhere more closely to the
> > ratelimit. In order to accommodate large folios, likely the logic in
> > balancing dirty pages needs to be reworked.
>
> I think there are several separate issues here.
> One is that folio_account_dirtied() will consider the whole folio as
> needing writeback, which is not necessarily the case (as e.g. iomap will
> write back only the dirty blocks in it). This was OKish when pages were 4k
> and you were using 1k blocks (which was an uncommon configuration anyway,
> usually you had a 4k block size), but it starts to hurt a lot with 2M
> folios, so we might need to find a way to propagate the information about
> the really dirty bits into writeback accounting.

Agreed. The only workable solution I see is to have some sort of API
similar to filemap_dirty_folio() that takes in the number of pages
dirtied as an arg, but maybe there's a better solution.

>
> Another problem *may* be that fast increments to dirtied pages (as we
> dirty 512 pages at once instead of the 16 we did in the past) cause
> over-reaction in the dirtiness balancing logic and we throttle the task
> too much. The heuristics there try to find the right amount of time to
> block a task so that dirtying speed matches the writeback speed, and it's
> plausible that the large increments make this logic oscillate between two
> extremes, leading to suboptimal throughput. Also, since this was observed
> with FUSE, I believe a significant factor is that FUSE enables the
> "strictlimit" feature of the BDI, which makes dirty throttling more
> aggressive (generally the amount of allowed dirty pages is lower). Anyway,
> these are mostly speculations from my end. This needs more data to decide
> what exactly (if anything) needs tweaking in the dirty throttling logic.
>

I tested this experimentally and you're right, on FUSE this is
impacted a lot by the "strictlimit". I didn't see any bottlenecks when
strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
the dirty throttle control freerun flag (which gets used to determine
whether throttling can be skipped) in the balance_dirty_pages() logic.
For FUSE, we can't turn off strictlimit for unprivileged servers, but
maybe we can make the throttling check more permissive by upping the
value of the min_pause calculation in wb_min_pause() for writes that
support large folios? As of right now, the current logic makes writing
large folios unfeasible in FUSE (estimates show around a 75% drop in
throughput).

Thanks,
Joanne

> Honza
> --
> Jan Kara
> SUSE Labs, CR
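
To make the bitmap idea from the original proposal a bit more concrete,
here is a rough userspace sketch in C. It is not kernel code and not a
proposed patch: struct sketch_folio, sketch_mark_dirty() and
sketch_dirty_bytes() are made-up names, and a 4k base page size is
assumed. It only illustrates the arithmetic: a 2 MB folio split into 64
chunks gives 32k (8 pages) per bit, and marking a range dirty can return
the number of pages that were newly dirtied, which is the kind of count
an API taking "number of pages dirtied" as an arg would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names and sizes here are illustrative assumptions, not kernel API. */
#define SKETCH_PAGE_SIZE        4096u                    /* assumed 4k pages */
#define SKETCH_FOLIO_SIZE       (2u * 1024u * 1024u)     /* 2 MB folio */
#define SKETCH_NR_CHUNKS        64u                      /* one bit per chunk */
#define SKETCH_CHUNK_SIZE       (SKETCH_FOLIO_SIZE / SKETCH_NR_CHUNKS)
#define SKETCH_PAGES_PER_CHUNK  (SKETCH_CHUNK_SIZE / SKETCH_PAGE_SIZE)

/* Hypothetical stand-in for the per-folio state: bit i covers chunk i. */
struct sketch_folio {
	uint64_t dirty_chunks;
};

/* Count set bits portably (clears the lowest set bit each iteration). */
static unsigned int sketch_count_bits(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/*
 * Mark the byte range [off, off + len) of the folio dirty and return the
 * number of pages that were not already covered by a dirty chunk.
 */
static unsigned int sketch_mark_dirty(struct sketch_folio *f,
				      size_t off, size_t len)
{
	unsigned int first, last, nbits;
	uint64_t mask, newly;

	assert(f != NULL);
	if (len == 0 || off >= SKETCH_FOLIO_SIZE)
		return 0;
	if (len > SKETCH_FOLIO_SIZE - off)
		len = SKETCH_FOLIO_SIZE - off;	/* clamp to the folio */

	first = (unsigned int)(off / SKETCH_CHUNK_SIZE);
	last = (unsigned int)((off + len - 1) / SKETCH_CHUNK_SIZE);
	nbits = last - first + 1;

	/* Avoid the undefined 64-bit shift when every chunk is covered. */
	mask = (nbits == 64) ? ~0ULL : (((1ULL << nbits) - 1) << first);

	newly = mask & ~f->dirty_chunks;
	f->dirty_chunks |= mask;
	return sketch_count_bits(newly) * SKETCH_PAGES_PER_CHUNK;
}

/* Bytes that would need writeback if only dirty chunks are written. */
static size_t sketch_dirty_bytes(const struct sketch_folio *f)
{
	return (size_t)sketch_count_bits(f->dirty_chunks) * SKETCH_CHUNK_SIZE;
}
```

With this, a one-byte write at offset 0 marks only chunk 0, so
sketch_mark_dirty() reports 8 pages and sketch_dirty_bytes() reports 32k
rather than the full 2 MB. Whether a real implementation would keep such
a bitmap in the folio struct or leave it to the lower layer (as iomap
does) is exactly the open question in the thread.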