From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 13 Nov 2024 11:07:35 +0100
From: Jan Kara
To: Jim Zhao
Cc: jack@suse.cz, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, willy@infradead.org
Subject: Re: [PATCH] mm/page-writeback: Raise wb_thresh to prevent write blocking with strictlimit
Message-ID: <20241113100735.4jafa56p4td66z7a@quack3>
References: <20241108220215.s27rziym6mn5nzv4@quack3> <20241112084539.702485-1-jimzhao.ai@gmail.com>
In-Reply-To: <20241112084539.702485-1-jimzhao.ai@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

On Tue 12-11-24 16:45:39, Jim Zhao wrote:
> > On Fri 08-11-24 11:19:49, Jim Zhao wrote:
> > > > On Wed 23-10-24 18:00:32, Jim Zhao wrote:
> > > > > With the strictlimit flag, wb_thresh acts as a hard limit in
> > > > > balance_dirty_pages() and wb_position_ratio(). When device write
> > > > > operations are inactive, wb_thresh can drop to 0, causing writes to
> > > > > be blocked. The issue occasionally occurs in fuse fs, particularly
> > > > > with network backends, the write thread is blocked frequently during
> > > > > a period. To address it, this patch raises the minimum wb_thresh to a
> > > > > controllable level, similar to the non-strictlimit case.
> > > > >
> > > > > Signed-off-by: Jim Zhao
> > > >
> > > > ...
> > > >
> > > > > +	/*
> > > > > +	 * With strictlimit flag, the wb_thresh is treated as
> > > > > +	 * a hard limit in balance_dirty_pages() and wb_position_ratio().
> > > > > +	 * It's possible that wb_thresh is close to zero, not because
> > > > > +	 * the device is slow, but because it has been inactive.
> > > > > +	 * To prevent occasional writes from being blocked, we raise wb_thresh.
> > > > > +	 */
> > > > > +	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> > > > > +		unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> > > > > +		u64 wb_scale_thresh = 0;
> > > > > +
> > > > > +		if (limit > dtc->dirty)
> > > > > +			wb_scale_thresh = (limit - dtc->dirty) / 100;
> > > > > +		wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh / 4));
> > > > > +	}
> > > >
> > > > What you propose makes sense in principle although I'd say this is mostly a
> > > > userspace setup issue - with strictlimit enabled, you're kind of expected
> > > > to set min_ratio exactly if you want to avoid these startup issues. But I
> > > > tend to agree that we can provide a bit of a slack for a bdi without
> > > > min_ratio configured to ramp up.
> > > >
> > > > But I'd rather pick the logic like:
> > > >
> > > > 	/*
> > > > 	 * If bdi does not have min_ratio configured and it was inactive,
> > > > 	 * bump its min_ratio to 0.1% to provide it some room to ramp up.
> > > > 	 */
> > > > 	if (!wb_min_ratio && !numerator)
> > > > 		wb_min_ratio = min(BDI_RATIO_SCALE / 10, wb_max_ratio / 2);
> > > >
> > > > That would seem like a bit more systematic way than the formula you propose
> > > > above...
> > >
> > > Thanks for the advice.
> > > Here's the explanation of the formula:
> > > 1. When writes are small and intermittent, wb_thresh can approach 0, not
> > > just 0, making the numerator value difficult to verify.
> >
> > I see, ok.
> >
> > > 2. The ramp-up margin, whether 0.1% or another value, needs
> > > consideration.
> > > I based this on the logic of wb_position_ratio in the non-strictlimit
> > > scenario: wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); It seems
> > > to provide more room and ensures ramping up within a controllable range.
> >
> > I see, thanks for the explanation. So I was thinking how to make the code more
> > consistent instead of adding another special constant and workaround.
> > What I'd suggest is:
> >
> > 1) There's already code that's supposed to handle ramping up with
> > strictlimit in wb_update_dirty_ratelimit():
> >
> > 	/*
> > 	 * For strictlimit case, calculations above were based on wb counters
> > 	 * and limits (starting from pos_ratio = wb_position_ratio() and up to
> > 	 * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
> > 	 * Hence, to calculate "step" properly, we have to use wb_dirty as
> > 	 * "dirty" and wb_setpoint as "setpoint".
> > 	 *
> > 	 * We rampup dirty_ratelimit forcibly if wb_dirty is low because
> > 	 * it's possible that wb_thresh is close to zero due to inactivity
> > 	 * of backing device.
> > 	 */
> > 	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> > 		dirty = dtc->wb_dirty;
> > 		if (dtc->wb_dirty < 8)
> > 			setpoint = dtc->wb_dirty + 1;
> > 		else
> > 			setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
> > 	}
> >
> > Now I agree that increasing wb_thresh directly is more understandable and
> > transparent so I'd just drop this special case.
>
> yes, I agree.
>
> > 2) I'd just handle all the bumping of wb_thresh in a single place instead
> > of having it spread over multiple places. So __wb_calc_thresh() could have
> > code like:
> >
> > 	wb_thresh = (thresh * (100 * BDI_RATIO_SCALE - bdi_min_ratio)) / (100 * BDI_RATIO_SCALE);
> > 	wb_thresh *= numerator;
> > 	wb_thresh = div64_ul(wb_thresh, denominator);
> >
> > 	wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
> >
> > 	wb_thresh += (thresh * wb_min_ratio) / (100 * BDI_RATIO_SCALE);
> > 	limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> > 	/*
> > 	 * It's very possible that wb_thresh is close to 0 not because the
> > 	 * device is slow, but that it has remained inactive for long time.
> > 	 * Honour such devices a reasonable good (hopefully IO efficient)
> > 	 * threshold, so that the occasional writes won't be blocked and active
> > 	 * writes can rampup the threshold quickly.
> > 	 */
> > 	if (limit > dtc->dirty)
> > 		wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
> > 	if (wb_thresh > (thresh * wb_max_ratio) / (100 * BDI_RATIO_SCALE))
> > 		wb_thresh = thresh * wb_max_ratio / (100 * BDI_RATIO_SCALE);
> >
> > and we can drop the bumping from wb_position_ratio(). This way we have the
> > wb_thresh bumping in a single logical place. Since we still limit wb_thresh
> > with max_ratio, untrusted bdis for which max_ratio should be configured
> > (otherwise they can grow the amount of dirty pages up to the global threshold
> > anyway) are still under control.
> >
> > If we really wanted, we could introduce a different bumping in case of
> > strictlimit, but at this point I don't think it is warranted so I'd leave
> > that as an option if someone comes with a situation where this bumping
> > proves to be too aggressive.
>
> Thank you, this is very helpful. And I have 2 concerns:
>
> 1.
> In the current non-strictlimit logic, wb_thresh is only bumped within
> wb_position_ratio() for calculating pos_ratio, and this bump isn’t
> restricted by max_ratio. I’m unsure if moving this adjustment to
> __wb_calc_thresh() would affect existing behavior. Would it be possible
> to keep the current logic for non-strictlimit case?

You are correct that the current bumping is not limited by max_ratio, and
that is actually a bug: wb_thresh should never exceed the value corresponding
to the configured max_ratio. Furthermore, in practical configurations I don't
think the max_ratio limiting will actually make a big difference, because
bumping should happen when wb_thresh is really low. So for consistency I
would apply it to the non-strictlimit case as well.

> 2. Regarding the formula:
> wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
>
> Consider a case:
> With 100 fuse devices (with high max_ratio) experiencing high writeback
> delays, the pages being written back are accounted in NR_WRITEBACK_TEMP,
> not dtc->dirty. As a result, the bumped wb_thresh may remain high.
> While individual devices are under control, the total could exceed
> expectations.

I agree, but this is a potential problem with any kind of bumping based on
'limit - dtc->dirty'. It is just a matter of how many fuse devices you have
and how exactly you have max_ratio configured.

> Although lowering the max_ratio can avoid this issue, how about reducing
> the bumped wb_thresh?
>
> The formula in my patch:
> wb_scale_thresh = (limit - dtc->dirty) / 100;
> The intention is to use the default fuse max_ratio (1%) as the multiplier.

So basically you propose to use the "/ 8" factor for the normal case and the
"/ 100" factor for the strictlimit case. My position is that I would not
complicate the logic unless somebody comes up with a real-world setup where
the simpler logic is causing real problems. But if you feel strongly about
this, I'm fine with that option.

								Honza
--
Jan Kara
SUSE Labs, CR
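
For illustration only, here is a small userspace sketch of the bump-then-clamp
step discussed in this thread, so the "/ 8" and "/ 100" variants can be
compared with plain numbers. It is not kernel code: model_bump_wb_thresh(),
RATIO_SCALE and the bump_div parameter are hypothetical names made up for this
example, and RATIO_SCALE is only assumed to play the role of BDI_RATIO_SCALE.

```c
#include <stdint.h>

/* Assumption: stands in for the kernel's BDI_RATIO_SCALE; value is illustrative. */
#define RATIO_SCALE 10000ULL

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical model of the bump + clamp from the thread (all values in pages):
 *   wb_thresh - proportional threshold already computed for this wb
 *   thresh    - dirty threshold of the domain
 *   limit     - hard dirty limit of the domain
 *   dirty     - dirty pages currently accounted in the domain
 *   max_ratio - max_ratio of the bdi, in units of 1/RATIO_SCALE percent
 *   bump_div  - 8 for the simple variant, 100 for the strictlimit variant
 */
uint64_t model_bump_wb_thresh(uint64_t wb_thresh, uint64_t thresh,
			      uint64_t limit, uint64_t dirty,
			      uint64_t max_ratio, uint64_t bump_div)
{
	/* Upper bound implied by max_ratio. */
	uint64_t cap = thresh * max_ratio / (100 * RATIO_SCALE);

	/* Give a nearly idle wb some room, based on the free dirty space. */
	if (limit > dirty)
		wb_thresh = max_u64(wb_thresh, (limit - dirty) / bump_div);

	/* The bump must never exceed what max_ratio allows. */
	if (wb_thresh > cap)
		wb_thresh = cap;

	return wb_thresh;
}
```

With thresh = limit = 100000 and dirty = 20000, an idle wb with max_ratio at
100% gets 10000 pages from the "/ 8" variant and 800 pages from the "/ 100"
variant. If 100 such devices all saw the same dirty count, that would add up
to 1000000 versus 80000 pages against a limit of 100000, which is the concern
raised above. With max_ratio at 1%, the "/ 8" result is clamped to 1000 pages.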