From: Joanne Koong <joannelkoong@gmail.com>
Date: Fri, 17 Jan 2025 14:45:01 -0800
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
To: Jan Kara
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, "Matthew Wilcox (Oracle)"

On Fri, Jan 17, 2025 at 3:53 AM Jan Kara wrote:
>
> On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > On Thu, Jan 16, 2025 at 3:01 AM Jan Kara wrote:
> > > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > > I would like to propose a discussion topic about improving large folio
> > > > writeback performance.
> > > > As more filesystems adopt large folios, it becomes increasingly
> > > > important that writeback is made to be as performant as possible.
> > > > There are two areas I'd like to discuss:
> > > >
> > > > == Granularity of dirty pages writeback ==
> > > > Currently, the granularity of writeback is at the folio level. If one
> > > > byte in a folio is dirty, the entire folio will be written back. This
> > > > becomes unscalable for larger folios and significantly degrades
> > > > performance, especially for workloads that employ random writes.
> > > >
> > > > One idea is to track dirty pages at a smaller granularity using a
> > > > 64-bit bitmap stored inside the folio struct, where each bit tracks a
> > > > smaller chunk of the folio (e.g. for 2 MB folios, each bit would track
> > > > a 32 KB chunk), and only write back dirty chunks rather than the
> > > > entire folio.
> > >
> > > Yes, this is a known problem and, as Dave pointed out, it is currently
> > > up to the lower layer to handle finer-grained dirtiness. You can take
> > > inspiration from the iomap layer, which already does this, or you can
> > > convert your filesystem to use iomap (the preferred way).
> > >
> > > > == Balancing dirty pages ==
> > > > It was observed that the dirty page balancing logic used in
> > > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > > example, fuse saw around a 125% drop in throughput for writes when
> > > > using large folios vs small folios on 1MB block sizes, which was
> > > > attributed to scheduled io waits in the dirty page balancing logic. In
> > > > generic_perform_write(), dirty pages are balanced after every write to
> > > > the page cache by the filesystem. With large folios, each write
> > > > dirties a larger number of pages, which can grossly exceed the
> > > > ratelimit, whereas with small folios each write is one page, so
> > > > pages are balanced more incrementally and adhere more closely to the
> > > > ratelimit.
> > > > In order to accommodate large folios, the logic for balancing dirty
> > > > pages likely needs to be reworked.
> > >
> > > I think there are several separate issues here. One is that
> > > folio_account_dirtied() will consider the whole folio as needing
> > > writeback, which is not necessarily the case (e.g. iomap will write
> > > back only the dirty blocks in it). This was OKish when pages were 4k
> > > and you were using 1k blocks (an uncommon configuration anyway; usually
> > > you had a 4k block size), but it starts to hurt a lot with 2M folios,
> > > so we may need to find a way to propagate the information about the
> > > really dirty bits into writeback accounting.
> >
> > Agreed. The only workable solution I see is to have some sort of api
> > similar to filemap_dirty_folio() that takes in the number of pages
> > dirtied as an arg, but maybe there's a better solution.
>
> Yes, something like that I suppose.
>
> > > Another problem *may* be that fast increments to dirtied pages (as we
> > > dirty 512 pages at once instead of the 16 we did in the past) cause
> > > over-reaction in the dirtiness balancing logic and we throttle the task
> > > too much. The heuristics there try to find the right amount of time to
> > > block a task so that the dirtying speed matches the writeback speed,
> > > and it's plausible that the large increments make this logic oscillate
> > > between two extremes, leading to suboptimal throughput. Also, since
> > > this was observed with FUSE, I believe a significant factor is that
> > > FUSE enables the "strictlimit" feature of the BDI, which makes dirty
> > > throttling more aggressive (generally the amount of allowed dirty pages
> > > is lower). Anyway, these are mostly speculations from my end. This
> > > needs more data to decide what exactly (if anything) needs tweaking in
> > > the dirty throttling logic.
> >
> > I tested this experimentally and you're right, on FUSE this is
> > impacted a lot by the "strictlimit".
> > I didn't see any bottlenecks when strictlimit wasn't enabled on FUSE.
> > AFAICT, the strictlimit affects the dirty throttle control freerun flag
> > (which gets used to determine whether throttling can be skipped) in the
> > balance_dirty_pages() logic. For FUSE, we can't turn off strictlimit for
> > unprivileged servers, but maybe we can make the throttling check more
> > permissive by upping the value of the min_pause calculation in
> > wb_min_pause() for writes that support large folios? As of right now,
> > the current logic makes writing large folios infeasible in FUSE
> > (estimates show around a 75% drop in throughput).
>
> I think tweaking min_pause is the wrong way to do this. I think that is
> just a symptom. Can you run something like:
>
>   while true; do
>       cat /sys/kernel/debug/bdi/<bdi>/stats
>       echo "---------"
>       sleep 1
>   done > bdi-debug.txt
>
> while you are writing to the FUSE filesystem and share the output file?
> That should tell us a bit more about what's happening inside the
> writeback throttling. Also, do you somehow configure min/max_ratio for
> the FUSE bdi? You can check in /sys/block/<device>/bdi/{min,max}_ratio.
> I suspect the problem is that the BDI dirty limit does not ramp up
> properly when we increase dirtied pages in large chunks.
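[Editorial aside, not part of the thread: per-interval deltas from such a
bdi-debug.txt log can be summarized with a short script. This is a sketch
under two assumptions: snapshots are separated by "---------" lines and
were taken 1 second apart, as in Jan's loop; the field names (BdiDirtied,
BdiWritten, BdiDirtyThresh) match the stats output pasted below.]

```python
# Sketch: summarize a bdi-debug.txt log captured by a polling loop over
# /sys/kernel/debug/bdi/<bdi>/stats. Assumes "---------" separators and
# a 1-second sample interval; skips non-numeric values.

def parse_snapshots(text):
    """Split the log on '---------' and extract integer 'Key: value' fields."""
    snapshots = []
    for chunk in text.split("---------"):
        fields = {}
        for line in chunk.strip().splitlines():
            key, sep, rest = line.partition(":")
            parts = rest.split()
            if sep and parts:
                try:
                    fields[key.strip()] = int(parts[0])
                except ValueError:
                    pass  # e.g. malformed lines; units like kB/kBps are ignored
        if fields:
            snapshots.append(fields)
    return snapshots

def rates(snapshots, interval_secs=1):
    """Per-interval dirtying vs. writeback-accounting rates, in kB/s."""
    out = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        out.append({
            "dirtied_kbps": (cur["BdiDirtied"] - prev["BdiDirtied"]) / interval_secs,
            "written_kbps": (cur["BdiWritten"] - prev["BdiWritten"]) / interval_secs,
            "bdi_dirty_thresh_kb": cur["BdiDirtyThresh"],
        })
    return out
```

[On the first log below, this reports BdiDirtied advancing by roughly
147,000 to 230,000 kB per interval while BdiDirtyThresh stays between 896
and 3612 kB, far below DirtyThresh (~360,000 kB), consistent with Jan's
suspicion that the strictlimit BDI limit is not ramping up.]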
This is the debug info I see for FUSE large folio writes where bs=1M and
size=1G:

BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:         896 kB
DirtyThresh:         359824 kB
BackgroundThresh:    179692 kB
BdiDirtied:         1071104 kB
BdiWritten:            4096 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3596 kB
DirtyThresh:         359824 kB
BackgroundThresh:    179692 kB
BdiDirtied:         1290240 kB
BdiWritten:            4992 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3596 kB
DirtyThresh:         359824 kB
BackgroundThresh:    179692 kB
BdiDirtied:         1517568 kB
BdiWritten:            5824 kB
BdiWriteBandwidth:    25692 kBps
b_dirty:                  0
b_io:                     1
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    7
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3596 kB
DirtyThresh:         359824 kB
BackgroundThresh:    179692 kB
BdiDirtied:         1747968 kB
BdiWritten:            6720 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:         896 kB
DirtyThresh:         359824 kB
BackgroundThresh:    179692 kB
BdiDirtied:         1949696 kB
BdiWritten:            7552 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3612 kB
DirtyThresh:         361300 kB
BackgroundThresh:    180428 kB
BdiDirtied:         2097152 kB
BdiWritten:            8128 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------

I didn't do anything to configure/change the FUSE bdi min/max_ratio.
This is what I see on my system:

cat /sys/class/bdi/0:52/min_ratio
0
cat /sys/class/bdi/0:52/max_ratio
1

>
> Actually, there's a patch queued in the mm tree that improves the ramping
> up of the bdi dirty limit for strictlimit bdis [1]. It would be nice if
> you could test whether it changes something in the behavior you observe.
> Thanks!
>
> Honza
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

I still see the same results (~230 MiB/s throughput using fio) with this
patch applied, unfortunately. Here's the debug info I see with this patch
(same test scenario as above: FUSE large folio writes where bs=1M and
size=1G):

BdiWriteback:             0 kB
BdiReclaimable:        2048 kB
BdiDirtyThresh:        3588 kB
DirtyThresh:         359132 kB
BackgroundThresh:    179348 kB
BdiDirtied:           51200 kB
BdiWritten:             128 kB
BdiWriteBandwidth:   102400 kBps
b_dirty:                  1
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    5
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3588 kB
DirtyThresh:         359144 kB
BackgroundThresh:    179352 kB
BdiDirtied:          331776 kB
BdiWritten:            1216 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3588 kB
DirtyThresh:         359144 kB
BackgroundThresh:    179352 kB
BdiDirtied:          562176 kB
BdiWritten:            2176 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3588 kB
DirtyThresh:         359144 kB
BackgroundThresh:    179352 kB
BdiDirtied:          792576 kB
BdiWritten:            3072 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------
BdiWriteback:            64 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:        3588 kB
DirtyThresh:         359144 kB
BackgroundThresh:    179352 kB
BdiDirtied:         1026048 kB
BdiWritten:            3904 kB
BdiWriteBandwidth:        0 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1
---------

Thanks,
Joanne

> --
> Jan Kara
> SUSE Labs, CR