From: Hillf Danton
To: Jan Kara
Cc: Andrew Morton, linux-fsdevel@vger.kernel.org, Michael Stapelberg, linux-mm@kvack.org
Subject: Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
Date: Thu, 8 Jul 2021 20:17:51 +0800
Message-Id: <20210708121751.327-1-hdanton@sina.com>
In-Reply-To: <20210707095138.GC5335@quack2.suse.cz>
References: <20210705161610.19406-1-jack@suse.cz> <20210707074017.2195-1-hdanton@sina.com>

On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >
>> >Michael Stapelberg has reported that for workload with short big spikes
>> >of writes (GCC linker seem to trigger this frequently) the write
>> >throughput is heavily underestimated and tends to steadily sink until it
>> >reaches zero. This has rather bad impact on writeback throttling
>> >(causing stalls). The problem is that writeback throughput estimate gets
>> >updated at most once per 200 ms. One update happens early after we
>> >submit pages for writeback (at that point writeout of only small
>> >fraction of pages is completed and thus observed throughput is tiny).
>> >Next update happens only during the next write spike (updates happen
>> >only from inode writeback and dirty throttling code) and if that is
>> >more than 1s after previous spike, we decide system was idle and just
>> >ignore whatever was written until this moment.
>> >
>> >Fix the problem by making sure writeback throughput estimate is also
>> >updated shortly after writeback completes to get reasonable estimate of
>> >throughput for spiky workloads.
>> >
>> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >Reported-by: Michael Stapelberg
>> >Signed-off-by: Jan Kara
>...
>> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >--- a/mm/page-writeback.c
>> >+++ b/mm/page-writeback.c
>> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> > 	unsigned long dirtied;
>> > 	unsigned long written;
>> >
>> >-	lockdep_assert_held(&wb->list_lock);
>> >-
>> >-	/*
>> >-	 * rate-limit, only update once every 200ms.
>> >-	 */
>> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >-		return;
>>
>> Please leave it as it is if you are not dumping the 200ms rule.
>
>Well, that could break the delayed updated scheduled after the end of
>writeback and for no good reason. The problematic ordering is like:

After another look at 2/5, you are cutting the rule, which is worth a
separate patch.

>
>end writeback on inode1
>  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>
>__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>  wb->bw_time_stamp = now;
>
>end writeback on inode2
>  queue_delayed_work() - does nothing since work is already queued
>
>delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>our estimates.

Your example says the estimate based on inode2 is torpedoed by a random
update, and you are looking to make that estimate meaningful at the cost
of breaking the rule - how different is it from the current one if the
estimate is derived from a 20ms-elapsed interval at inode2? Is it likely
to see another palpably different result at inode3 from a 50ms-elapsed
interval?

>
>> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
>> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
>> > {
>> > 	atomic_dec(&wb->writeback_inodes);
>> >+	/*
>> >+	 * Make sure estimate of writeback throughput gets
>> >+	 * updated after writeback completed.
>> >+	 */
>> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
>> > }
>>
>> This is a bogus estimate - it does not break the 200ms rule but walks
>> around it without specifying why 300ms is not good.
>
>Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
>do want some batching of bandwidth updates after writeback completes for
>the cases where lots of inodes end their writeback in a quick succession.
>I've picked BANDWIDTH_INTERVAL here as that's the batching of other
>bandwidth updates as well so it kind of makes sense. I'll add a comment why
>BANDWIDTH_INTERVAL is picked here.
>
>								Honza
>-- 
>Jan Kara
>SUSE Labs, CR
>
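
For readers following the ordering argument, here is a minimal userspace sketch that replays the quoted sequence (inode1 ends, an unrelated update from balance_dirty_pages(), inode2 ends, delayed work fires) with and without the 200ms check. It is not kernel code: struct wb_model, the model_* helpers, replay_ordering(), the millisecond timestamps and the page counts are all hypothetical and chosen only for illustration. Only the BANDWIDTH_INTERVAL value of 200ms and the sequence of events come from the thread.

```c
#include <stdbool.h>

#define BANDWIDTH_INTERVAL_MS 200	/* models the 200ms rule from the thread */

/* Hypothetical, simplified stand-in for the relevant bdi_writeback state. */
struct wb_model {
	long bw_time_stamp;	/* time of the last estimate update, in ms */
	long written;		/* pages written back so far */
	long written_stamp;	/* value of 'written' at the last update */
	bool work_pending;	/* is the delayed update already queued? */
	long work_due;		/* when the queued update fires, in ms */
};

/* Models __wb_update_bandwidth(); returns true if the estimate was refreshed. */
static bool model_update_bandwidth(struct wb_model *wb, long now, bool rate_limit)
{
	long elapsed = now - wb->bw_time_stamp;

	if (rate_limit && elapsed < BANDWIDTH_INTERVAL_MS)
		return false;	/* the check the patch removes */

	wb->written_stamp = wb->written;
	wb->bw_time_stamp = now;
	return true;
}

/* Models wb_inode_writeback_end(): account pages, queue work if not queued. */
static void model_writeback_end(struct wb_model *wb, long now, long pages)
{
	wb->written += pages;
	if (!wb->work_pending) {
		wb->work_pending = true;
		wb->work_due = now + BANDWIDTH_INTERVAL_MS;
	}
}

/*
 * Replays the quoted ordering and returns how many written pages are still
 * missing from the estimate after the delayed work has run.
 */
static long replay_ordering(bool rate_limit)
{
	struct wb_model wb = {0};

	model_writeback_end(&wb, 300, 1000);		/* inode1, work due at 500 */
	model_update_bandwidth(&wb, 400, rate_limit);	/* balance_dirty_pages() */
	model_writeback_end(&wb, 450, 1000);		/* inode2, already queued */

	wb.work_pending = false;			/* delayed work fires */
	model_update_bandwidth(&wb, wb.work_due, rate_limit);

	return wb.written - wb.written_stamp;
}
```

Under these made-up numbers, replay_ordering(true) leaves the 1000 pages of inode2 unaccounted, because the delayed work runs only 100ms after the unrelated update, while replay_ordering(false) leaves none. The sketch says nothing about how much a short-interval sample would change the resulting estimate, which is the question raised above.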