From: Hillf Danton <hdanton@sina.com>
To: Andrew Morton
Cc: Hillf Danton, linux-mm, fsdev, Fengguang Wu, Tejun Heo, Jan Kara,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman, linux-kernel
Subject: Re: [RFC v2] writeback: add elastic bdi in cgwb bdp
Date: Sat, 9 Nov 2019 18:31:12 +0800
Message-Id: <20191109103112.6480-1-hdanton@sina.com>
In-Reply-To: <20191026104656.15176-1-hdanton@sina.com>
References: <20191026104656.15176-1-hdanton@sina.com>

On Fri, 8 Nov 2019 13:00:17 -0800 Andrew Morton wrote:
>
> On Sat, 26 Oct 2019 18:46:56 +0800 Hillf Danton wrote:
> >
> > The elastic bdi (ebdi) is the mirror bdi of spinning disks, SSDs,
> > USB sticks and other storage devices on the market.  The performance
> > of an ebdi goes up and down as the pattern of dispatched IO changes,
> > roughly estimated as below.
> >
> > 	P = j(..., IO pattern);
> >
> > In ebdi's view, the bandwidth currently measured when balancing dirty
> > pages is closely related to its performance, because the former is a
> > part of the latter.
> >
> > 	B = y(P);
> >
> > The functions above suggest there may be a layering violation, since
> > B could be better measured somewhere below the fs layer.
> >
> > B is nevertheless measured well enough to satisfy every judge, and it
> > plays a role in dispatching IO while the IO pattern, volatile in
> > nature, is entirely ignored.
> >
> > It also helps to throttle the dirty speed, ignoring the fact that
> > DRAM is in general 10x faster than an ebdi.  If B is half of P, for
> > instance, then B is near 5% of the dirty speed, just two points from
> > the figure in the snippet below.
> >
> > /*
> >  * If ratelimit_pages is too high then we can get into dirty-data overload
> >  * if a large number of processes all perform writes at the same time.
> >  * If it is too low then SMP machines will call the (expensive)
> >  * get_writeback_state too often.
> >  *
> >  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
> >  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> >  * thresholds.
> >  */
> >
> > To keep the dirty speed from running away from the laundry speed,
> > ebdi suggests the walk-dog method: putting a leash in bdp, which
> > seems to churn the IO pattern less.
>
> I'm finding both the changelog and the patch rather hard to understand.

Will change both, Sir.

> The absence of code comments doesn't help.

Will add.

> But the test robot performance results look nice.
>
> Presumably you did your own performance testing.  Please share the
> results of that in the changelog.
>
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -170,6 +170,8 @@ struct bdi_writeback {
> >
> >  	struct list_head bdi_node;	/* anchored at bdi->wb_list */
> >
> > +	struct wait_queue_head bdp_waitq;
>
> Please add comments which help the reader find out what "bdp" stands
> for.

Yes, Sir.  bdp is an acronym for "balance dirty pages".

> >  #ifdef CONFIG_CGROUP_WRITEBACK
> >  	struct percpu_ref refcnt;	/* used only for !root wb's */
> >  	struct fprop_local_percpu memcg_completions;
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -324,6 +324,8 @@ static int wb_init(struct bdi_writeback
> >  		goto out_destroy_stat;
> >  	}
> >
> > +	init_waitqueue_head(&wb->bdp_waitq);
> > +
> >  	return 0;
> >
> >  out_destroy_stat:
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1551,6 +1551,39 @@ static inline void wb_dirty_limits(struc
> >  	}
> >  }
> >
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
>
> Please document this function.  Describe the "why" not the "what".

Will do.

> Comment should help readers understand what "cgwb" means.
>
> > +{
> > +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > +	if (fatal_signal_pending(current))
> > +		return false;
> > +
> > +	gdtc.avail = global_dirtyable_memory();
> > +
> > +	domain_dirty_limits(&gdtc);
> > +
> > +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > +		     global_node_page_state(NR_UNSTABLE_NFS) +
> > +		     global_node_page_state(NR_WRITEBACK);
> > +
> > +	if (gdtc.dirty < gdtc.bg_thresh)
> > +		return false;
> > +
> > +	if (!writeback_in_progress(wb))
> > +		wb_start_background_writeback(wb);
>
> Add a comment explaining what's going on here and why we're doing this.

Will do.

> > +	return gdtc.dirty > gdtc.thresh &&
> > +	       wb_stat(wb, WB_DIRTIED) >
> > +	       wb_stat(wb, WB_WRITTEN) +
> > +	       wb_stat_error();
>
> Why?

Will add explanation.
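
The intent, roughly: keep the dirtier waiting only while the global
dirty count is over the hard threshold AND this wb has dirtied more
pages than it has written back, beyond the per-cpu counter error.  A
minimal userspace model of that test, for illustration only (the names
mirror the patch; the thresholds and counts are made-up sample values):

	#include <stdbool.h>
	#include <stdio.h>

	static bool should_throttle(long dirty, long thresh,
				    long wb_dirtied, long wb_written,
				    long stat_error)
	{
		if (dirty <= thresh)
			return false;	/* globally under the hard limit */
		/* keep the dirtier leashed only while it outruns the flusher */
		return wb_dirtied > wb_written + stat_error;
	}

	int main(void)
	{
		/* over the threshold, writer 200 pages ahead of the flusher */
		printf("%d\n", should_throttle(5000, 4096, 1200, 1000, 64));
		/* flusher has nearly caught up: stop throttling */
		printf("%d\n", should_throttle(5000, 4096, 1200, 1190, 64));
		return 0;
	}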
> > +}
> > +
> > +static inline void cgwb_bdp(struct bdi_writeback *wb)
> > +{
> > +	wait_event_interruptible_timeout(wb->bdp_waitq,
> > +			!cgwb_bdp_should_throttle(wb), HZ);
> > +}
> > +
> >  /*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> > @@ -1910,7 +1943,7 @@ void balance_dirty_pages_ratelimited(str
> >  	preempt_enable();
> >
> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > -		balance_dirty_pages(wb, current->nr_dirtied);
> > +		cgwb_bdp(wb);
> >
> >  	wb_put(wb);
> >  }
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
> >  	if (nr_pages == LONG_MAX)
> >  		return LONG_MAX;
> >
> > +	return nr_pages;
> > +
> >  	/*
> >  	 * This may be called on clean wb's and proportional distribution
> >  	 * may not make sense, just use the original @nr_pages in those
> > @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
> >  		pages = min(pages, work->nr_pages);
> >  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
> >  				   MIN_WRITEBACK_PAGES);
> > +		pages = work->nr_pages;
> >  	}
> >
> >  	return pages;
> > @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
> >  		wb_wakeup_delayed(wb);
> >
> >  	current->flags &= ~PF_SWAPWRITE;
> > +
> > +	if (waitqueue_active(&wb->bdp_waitq))
> > +		wake_up_all(&wb->bdp_waitq);
> >  }
>
> Does this patch affect both cgroup writeback and global writeback?
> Were both tested?  Performance results of both?

Both will be affected, IMHO, as "bdi->wb remains unchanged and will keep
serving the root cgroup"; please see 52ebea749aae ("writeback: make
backing_dev_info host cgroup-specific bdi_writebacks").

More tests are needed, and it would help a lot if folks could pinpoint
the test cases they are interested in.

Thanks
Hillf
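
PS: for reviewers who want to see the leash in isolation, below is a
compact pthread model of the handshake the patch introduces.  The
dirtier parks on a condvar with a one-second timeout, standing in for
wait_event_interruptible_timeout() on bdp_waitq, and the flusher wakes
all waiters after each chunk of writeback, standing in for the
wake_up_all() added to wb_workfn().  It is a sketch under those
assumptions, not kernel code; all names and numbers below are mine.

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t bdp_waitq = PTHREAD_COND_INITIALIZER;
	static long dirtied, written;	/* pages, protected by lock */
	static const long gap = 100;	/* max dirtied-written distance */

	static bool should_throttle(void)	/* caller holds lock */
	{
		return dirtied - written > gap;
	}

	static void *dirtier(void *arg)
	{
		for (int i = 0; i < 500; i++) {
			pthread_mutex_lock(&lock);
			dirtied++;		/* dirty one page */
			while (should_throttle()) {
				/* the leash: wait for the flusher, 1s at most */
				struct timespec ts;
				clock_gettime(CLOCK_REALTIME, &ts);
				ts.tv_sec += 1;
				pthread_cond_timedwait(&bdp_waitq, &lock, &ts);
			}
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	static void *flusher(void *arg)
	{
		bool done = false;

		while (!done) {
			usleep(1000);	/* model device latency */
			pthread_mutex_lock(&lock);
			/* write back a chunk, never past what was dirtied */
			written = written + 10 < dirtied ? written + 10 : dirtied;
			done = written >= 500;
			/* like wb_workfn(): wake everyone parked on bdp_waitq */
			pthread_cond_broadcast(&bdp_waitq);
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t d, f;

		pthread_create(&d, NULL, dirtier, NULL);
		pthread_create(&f, NULL, flusher, NULL);
		pthread_join(d, NULL);
		pthread_join(f, NULL);
		printf("dirtied %ld, written %ld\n", dirtied, written);
		return 0;	/* build with: cc -pthread model.c */
	}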