From mboxrd@z Thu Jan  1 00:00:00 1970
From: Hillf Danton <hdanton@sina.com>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org,
	Michael Stapelberg <stapelberg+linux@google.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
Date: Thu,  8 Jul 2021 20:17:51 +0800
Message-Id: <20210708121751.327-1-hdanton@sina.com>
In-Reply-To: <20210707095138.GC5335@quack2.suse.cz>
References: <20210705161610.19406-1-jack@suse.cz> <20210707074017.2195-1-hdanton@sina.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >
>> >Michael Stapelberg has reported that for workload with short big spikes
>> >of writes (GCC linker seem to trigger this frequently) the write
>> >throughput is heavily underestimated and tends to steadily sink until it
>> >reaches zero. This has rather bad impact on writeback throttling
>> >(causing stalls). The problem is that writeback throughput estimate gets
>> >updated at most once per 200 ms. One update happens early after we
>> >submit pages for writeback (at that point writeout of only small
>> >fraction of pages is completed and thus observed throughput is tiny).
>> >Next update happens only during the next write spike (updates happen
>> >only from inode writeback and dirty throttling code) and if that is
>> >more than 1s after previous spike, we decide system was idle and just
>> >ignore whatever was written until this moment.
>> >
>> >Fix the problem by making sure writeback throughput estimate is also
>> >updated shortly after writeback completes to get reasonable estimate of
>> >throughput for spiky workloads.
>> >
>> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >Signed-off-by: Jan Kara <jack@suse.cz>
>...
>> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >--- a/mm/page-writeback.c
>> >+++ b/mm/page-writeback.c
>> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> > 	unsigned long dirtied;
>> > 	unsigned long written;
>> >
>> >-	lockdep_assert_held(&wb->list_lock);
>> >-
>> >-	/*
>> >-	 * rate-limit, only update once every 200ms.
>> >-	 */
>> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >-		return;
>>
>> Please leave it as it is if you are not dumping the 200ms rule.
>
>Well, that could break the delayed update scheduled after the end of
>writeback and for no good reason. The problematic ordering is like:

After another look at 2/5, you are cutting the rule, which is worth a
separate patch.
>
>end writeback on inode1
>  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>
>__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>  wb->bw_time_stamp = now;
>
>end writeback on inode2
>  queue_delayed_work() - does nothing since work is already queued
>
>delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>our estimates.

Your example says the estimate based on inode2 is torpedoed by a random
update, and you are looking to make that estimate meaningful at the cost
of breaking the rule - how different is it from the current one if the
estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
see another palpably different result at inode3 from a 50ms-elapsed
interval?
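
The ordering quoted above can be replayed in a small userspace model. This
is only a sketch, not kernel code: struct wb_model, update_bandwidth(),
writeback_end(), run_scenario(), the millisecond timestamps and the page
counts are all made up for illustration, and BANDWIDTH_INTERVAL is assumed
to be 200 ms as in the removed comment. It returns how many written pages
stay unaccounted after the delayed work runs, with and without the elapsed
check inside the update function.

```c
#include <assert.h>
#include <stdbool.h>

#define BANDWIDTH_INTERVAL_MS 200L /* assumed value, from the "200ms" comment */

/* Hypothetical userspace stand-in for the few fields that matter here. */
struct wb_model {
	long bw_time_stamp;  /* time of the last estimate update, in ms */
	long written;        /* pages written back so far */
	long written_stamp;  /* pages already reflected in the estimate */
	bool work_pending;   /* models a queued wb->bw_dwork */
};

/* Models __wb_update_bandwidth(); returns true if the estimate was updated. */
static bool update_bandwidth(struct wb_model *wb, long now, bool rate_limit)
{
	long elapsed = now - wb->bw_time_stamp;

	assert(elapsed >= 0);
	if (rate_limit && elapsed < BANDWIDTH_INTERVAL_MS)
		return false;
	wb->written_stamp = wb->written;
	wb->bw_time_stamp = now;
	return true;
}

/* Models the end of writeback on one inode. */
static void writeback_end(struct wb_model *wb, long pages)
{
	wb->written += pages;
	if (!wb->work_pending)
		wb->work_pending = true; /* already queued work is left alone */
}

/* Replays the quoted ordering; returns pages the estimate never saw. */
static long run_scenario(bool rate_limit)
{
	struct wb_model wb = { 0, 0, 0, false };

	writeback_end(&wb, 100);                 /* t=1000: inode1 ends, work due at t=1200 */
	update_bandwidth(&wb, 1150, rate_limit); /* t=1150: balance_dirty_pages() */
	writeback_end(&wb, 50);                  /* t=1180: inode2 ends, work already queued */
	wb.work_pending = false;
	update_bandwidth(&wb, 1200, rate_limit); /* t=1200: delayed work runs */
	return wb.written - wb.written_stamp;
}
```

Under these assumed numbers, run_scenario(true) leaves the 50 pages of
inode2 unaccounted while run_scenario(false) leaves none; the price is an
estimate taken over a 50 ms window, which is the concern raised above.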
>
>> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
>> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
>> > {
>> > 	atomic_dec(&wb->writeback_inodes);
>> >+	/*
>> >+	 * Make sure estimate of writeback throughput gets
>> >+	 * updated after writeback completed.
>> >+	 */
>> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
>> > }
>>
>> This is a bogus estimate - it does not break the 200ms rule but walks
>> around it without specifying why 300ms is not good.
>
>Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
>do want some batching of bandwidth updates after writeback completes for
>the cases where lots of inodes end their writeback in a quick succession.
>I've picked BANDWIDTH_INTERVAL here as that's the batching of other
>bandwidth updates as well so it kind of makes sense. I'll add a comment why
>BANDWIDTH_INTERVAL is picked here.
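
The batching described here follows from queue_delayed_work() doing nothing
when the work is already pending. A rough userspace model of that behaviour
is below; struct dwork_model, model_queue_delayed_work() and count_updates()
are hypothetical names, time is in ms, and the 200 ms interval is assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define BANDWIDTH_INTERVAL_MS 200L /* assumed value of BANDWIDTH_INTERVAL */

/* Hypothetical model of one delayed work item. */
struct dwork_model {
	bool pending; /* work is queued and has not run yet */
	long due;     /* time at which it will run, in ms */
};

/* Mirrors the "already queued, do nothing" behaviour of queue_delayed_work(). */
static bool model_queue_delayed_work(struct dwork_model *w, long now, long delay)
{
	assert(delay >= 0);
	if (w->pending)
		return false;
	w->pending = true;
	w->due = now + delay;
	return true;
}

/* Counts bandwidth updates for n writeback completions spaced gap_ms apart. */
static int count_updates(int n, long gap_ms)
{
	struct dwork_model w = { false, 0 };
	int runs = 0;

	for (int i = 0; i < n; i++) {
		long now = i * gap_ms;

		/* Run the work first if its timer has expired by now. */
		if (w.pending && w.due <= now) {
			w.pending = false;
			runs++;
		}
		model_queue_delayed_work(&w, now, BANDWIDTH_INTERVAL_MS);
	}
	if (w.pending)
		runs++; /* the last queued work eventually runs */
	return runs;
}
```

With these assumptions, 100 completions spaced 1 ms apart collapse into a
single update, while completions 300 ms apart get one update each.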
>
>								Honza
>-- 
>Jan Kara <jack@suse.com>
>SUSE Labs, CR
>
>