From: Joanne Koong <joannelkoong@gmail.com>
Date: Wed, 22 Jan 2025 14:17:02 -0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
To: Jan Kara
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, "Matthew Wilcox (Oracle)"

On Wed, Jan 22, 2025 at 1:22 AM Jan Kara wrote:
>
> On Tue 21-01-25 16:29:57, Joanne Koong wrote:
> > On Mon, Jan 20, 2025 at 2:42 PM Jan Kara wrote:
> > > On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> > > > On Fri, Jan 17, 2025 at 3:53 AM Jan Kara wrote:
> > > > > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > > > > I think tweaking min_pause is a wrong way to do
> > > > > this. I think that is just a
> > > > > symptom. Can you run something like:
> > > > >
> > > > > while true; do
> > > > >     cat /sys/kernel/debug/bdi//stats
> > > > >     echo "---------"
> > > > >     sleep 1
> > > > > done >bdi-debug.txt
> > > > >
> > > > > while you are writing to the FUSE filesystem and share the output file?
> > > > > That should tell us a bit more about what's happening inside the writeback
> > > > > throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> > > > > You can check in /sys/block//bdi/{min,max}_ratio . I suspect the
> > > > > problem is that the BDI dirty limit does not ramp up properly when we
> > > > > increase dirtied pages in large chunks.
> > > >
> > > > This is the debug info I see for FUSE large folio writes where bs=1M
> > > > and size=1G:
> > > >
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 896 kB
> > > > DirtyThresh: 359824 kB
> > > > BackgroundThresh: 179692 kB
> > > > BdiDirtied: 1071104 kB
> > > > BdiWritten: 4096 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3596 kB
> > > > DirtyThresh: 359824 kB
> > > > BackgroundThresh: 179692 kB
> > > > BdiDirtied: 1290240 kB
> > > > BdiWritten: 4992 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3596 kB
> > > > DirtyThresh: 359824 kB
> > > > BackgroundThresh: 179692 kB
> > > > BdiDirtied: 1517568 kB
> > > > BdiWritten: 5824 kB
> > > > BdiWriteBandwidth: 25692 kBps
> > > > b_dirty: 0
> > > > b_io: 1
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 7
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3596 kB
> > > > DirtyThresh: 359824 kB
> > > > BackgroundThresh: 179692 kB
> > > > BdiDirtied: 1747968 kB
> > > > BdiWritten: 6720 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 896 kB
> > > > DirtyThresh: 359824 kB
> > > > BackgroundThresh: 179692 kB
> > > > BdiDirtied: 1949696 kB
> > > > BdiWritten: 7552 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3612 kB
> > > > DirtyThresh: 361300 kB
> > > > BackgroundThresh: 180428 kB
> > > > BdiDirtied: 2097152 kB
> > > > BdiWritten: 8128 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > >
> > > > I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> > > > This is what I see on my system:
> > > >
> > > > cat /sys/class/bdi/0:52/min_ratio
> > > > 0
> > > > cat /sys/class/bdi/0:52/max_ratio
> > > > 1
> > >
> > > OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
> > > Checking the code, this shows we are hitting __wb_calc_thresh() logic:
> > >
> > > if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> > >         unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> > >         u64 wb_scale_thresh = 0;
> > >
> > >         if (limit > dtc->dirty)
> > >                 wb_scale_thresh = (limit - dtc->dirty) / 100;
> > >         wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
> > > }
> > >
> > > so BdiDirtyThresh is set to DirtyThresh/100.
> > > This also shows the bdi never
> > > generates enough throughput to ramp up its share from this initial value.
> > >
> > > > > Actually, there's a patch queued in mm tree that improves the ramping up of
> > > > > bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> > > > > test whether it changes something in the behavior you observe. Thanks!
> > > > >
> > > > > Honza
> > > > >
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> > > >
> > > > I still see the same results (~230 MiB/s throughput using fio) with
> > > > this patch applied, unfortunately. Here's the debug info I see with
> > > > this patch (same test scenario as above on FUSE large folio writes
> > > > where bs=1M and size=1G):
> > > >
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 2048 kB
> > > > BdiDirtyThresh: 3588 kB
> > > > DirtyThresh: 359132 kB
> > > > BackgroundThresh: 179348 kB
> > > > BdiDirtied: 51200 kB
> > > > BdiWritten: 128 kB
> > > > BdiWriteBandwidth: 102400 kBps
> > > > b_dirty: 1
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 5
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3588 kB
> > > > DirtyThresh: 359144 kB
> > > > BackgroundThresh: 179352 kB
> > > > BdiDirtied: 331776 kB
> > > > BdiWritten: 1216 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3588 kB
> > > > DirtyThresh: 359144 kB
> > > > BackgroundThresh: 179352 kB
> > > > BdiDirtied: 562176 kB
> > > > BdiWritten: 2176 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 0 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3588 kB
> > > > DirtyThresh: 359144 kB
> > > > BackgroundThresh: 179352 kB
> > > > BdiDirtied: 792576 kB
> > > > BdiWritten: 3072 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > > > BdiWriteback: 64 kB
> > > > BdiReclaimable: 0 kB
> > > > BdiDirtyThresh: 3588 kB
> > > > DirtyThresh: 359144 kB
> > > > BackgroundThresh: 179352 kB
> > > > BdiDirtied: 1026048 kB
> > > > BdiWritten: 3904 kB
> > > > BdiWriteBandwidth: 0 kBps
> > > > b_dirty: 0
> > > > b_io: 0
> > > > b_more_io: 0
> > > > b_dirty_time: 0
> > > > bdi_list: 1
> > > > state: 1
> > > > ---------
> > >
> > > Yeah, here the situation is really the same. As an experiment, can you
> > > try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I
> > > don't expect you should need to go past 10) and figure out when there's
> > > enough slack space for the writeback bandwidth to ramp up to full speed?
> > > Thanks!
> > >
> > > Honza
> >
> > When locally testing this, I'm seeing that max_ratio affects the
> > bandwidth more than min_ratio does (e.g. the different min_ratios have
> > roughly the same bandwidth per max_ratio). I'm also seeing somewhat
> > high variance across runs, which makes it hard to gauge what's
> > accurate, but on average this is what I'm seeing:
> >
> > max_ratio=1 --- bandwidth= ~230 MiB/s
> > max_ratio=2 --- bandwidth= ~420 MiB/s
> > max_ratio=3 --- bandwidth= ~550 MiB/s
> > max_ratio=4 --- bandwidth= ~653 MiB/s
> > max_ratio=5 --- bandwidth= ~700 MiB/s
> > max_ratio=6 --- bandwidth= ~810 MiB/s
> > max_ratio=7 --- bandwidth= ~1040 MiB/s (and then a lot of times, 561
> > MiB/s on subsequent runs)
>
> Ah, sorry.
> I actually misinterpreted your reply from the previous email:
>
> > > > cat /sys/class/bdi/0:52/max_ratio
> > > > 1
>
> This means the amount of dirty pages for the fuse filesystem is indeed
> hard-capped at 1% of the dirty limit, which happens to be ~3MB on your machine.
> Checking where this is coming from, I can see that fuse_bdi_init() does
> this by:
>
>     bdi_set_max_ratio(sb->s_bdi, 1);
>
> So FUSE restricts itself, and with only a 3MB dirty limit and 2MB dirtying
> granularity it is not surprising that dirty throttling doesn't work well.
>
> I'd say there needs to be some better heuristic within FUSE that balances
> maximum folio size and the maximum dirty limit setting for the filesystem
> to a sensible compromise (so that there's space for at least, say, 10 dirty
> max-sized folios within the dirty limit).
>
> But I guess this is just a shorter-term workaround. Long-term, finer
> grained dirtiness tracking within FUSE (and writeback counter tracking in
> MM) is going to be a more effective solution.

Thanks for taking a look, Jan. I'll play around with the bdi limits,
though I don't think we'll be able to up this for unprivileged FUSE
servers. I'm planning to add finer grained dirtiness tracking to FUSE
and the associated mm writeback counter changes, but even then, having
full writes be that much slower is probably a no-go, so I'll
experiment with limiting the fgf order.

Thanks,
Joanne

> Honza
> --
> Jan Kara
> SUSE Labs, CR