From: Hillf Danton <hdanton@sina.com>
To: Andrew Morton
Cc: Hillf Danton, linux-mm, fsdev, Fengguang Wu, Tejun Heo, Jan Kara,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman, linux-kernel
Subject: Re: [RFC v2] writeback: add elastic bdi in cgwb bdp
Date: Sat, 9 Nov 2019 18:31:12 +0800
Message-Id: <20191109103112.6480-1-hdanton@sina.com>
In-Reply-To: <20191026104656.15176-1-hdanton@sina.com>
References: <20191026104656.15176-1-hdanton@sina.com>

On Fri, 8 Nov 2019 13:00:17 -0800 Andrew Morton wrote:
>
> On Sat, 26 Oct 2019 18:46:56 +0800 Hillf Danton wrote:
> >
> > The elastic bdi (ebdi) is the mirror bdi of spinning disks, SSDs,
> > USB sticks and other storage devices on the market.  The performance
> > of an ebdi goes up and down as the pattern of dispatched IO changes,
> > roughly estimated as below.
> >
> > 	P = j(..., IO pattern);
> >
> > In ebdi's view, the bandwidth currently measured when balancing dirty
> > pages is closely related to its performance, because the former is a
> > part of the latter.
> >
> > 	B = y(P);
> >
> > The functions above suggest there may be a layering violation, since
> > B could be better measured somewhere below the fs layer.
> >
> > B is nevertheless measured well enough to satisfy every judge, and it
> > plays a role in dispatching IO while the IO pattern, volatile in
> > nature, is entirely ignored.
> >
> > It also helps to throttle the dirty speed, ignoring the fact that
> > DRAM is in general 10x faster than an ebdi.  If B is half of P, for
> > instance, then B is near 5% of the dirty speed, just two points from
> > the figure in the snippet below.
> >
> > /*
> >  * If ratelimit_pages is too high then we can get into dirty-data overload
> >  * if a large number of processes all perform writes at the same time.
> >  * If it is too low then SMP machines will call the (expensive)
> >  * get_writeback_state too often.
> >  *
> >  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
> >  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> >  * thresholds.
> >  */
> >
> > To keep the dirty speed from running away from the laundry speed,
> > ebdi suggests the walk-dog method: putting a leash in bdp, which
> > seems to churn the IO pattern less.
>
> I'm finding both the changelog and the patch rather hard to understand.

Will change both, Sir.

> The absence of code comments doesn't help.

Will add.

> But the test robot performance results look nice.
>
> Presumably you did your own performance testing.  Please share the
> results of that in the changelog.
>
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -170,6 +170,8 @@ struct bdi_writeback {
> >
> >  	struct list_head bdi_node;	/* anchored at bdi->wb_list */
> >
> > +	struct wait_queue_head bdp_waitq;
>
> Please add comments which help the reader find out what "bdp" stands
> for.

Yes, Sir.  bdp is an acronym for "balance dirty pages".

> >  #ifdef CONFIG_CGROUP_WRITEBACK
> >  	struct percpu_ref refcnt;	/* used only for !root wb's */
> >  	struct fprop_local_percpu memcg_completions;
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -324,6 +324,8 @@ static int wb_init(struct bdi_writeback
> >  		goto out_destroy_stat;
> >  	}
> >
> > +	init_waitqueue_head(&wb->bdp_waitq);
> > +
> >  	return 0;
> >
> >  out_destroy_stat:
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1551,6 +1551,39 @@ static inline void wb_dirty_limits(struc
> >  	}
> >  }
> >
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
>
> Please document this function.  Describe the "why" not the "what".

Will do.

> Comment should help readers understand what "cgwb" means.
>
> > +{
> > +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > +	if (fatal_signal_pending(current))
> > +		return false;
> > +
> > +	gdtc.avail = global_dirtyable_memory();
> > +
> > +	domain_dirty_limits(&gdtc);
> > +
> > +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > +		     global_node_page_state(NR_UNSTABLE_NFS) +
> > +		     global_node_page_state(NR_WRITEBACK);
> > +
> > +	if (gdtc.dirty < gdtc.bg_thresh)
> > +		return false;
> > +
> > +	if (!writeback_in_progress(wb))
> > +		wb_start_background_writeback(wb);
>
> Add a comment explaining what's going on here and why we're doing this.

Will do.

> > +	return gdtc.dirty > gdtc.thresh &&
> > +	       wb_stat(wb, WB_DIRTIED) >
> > +	       wb_stat(wb, WB_WRITTEN) +
> > +	       wb_stat_error();
>
> Why?

Will add explanation.
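
The intent, roughly: keep the dirtier waiting only while the global
dirty count is over the hard threshold AND this wb has dirtied more
pages than it has written back, beyond the per-cpu counter error.  A
minimal userspace model of that test, for illustration only (the names
mirror the patch; the thresholds and counts are made-up sample values):

	#include <stdbool.h>
	#include <stdio.h>

	static bool should_throttle(long dirty, long thresh,
				    long wb_dirtied, long wb_written,
				    long stat_error)
	{
		if (dirty <= thresh)
			return false;	/* globally under the hard limit */
		/* keep the dirtier leashed only while it outruns the flusher */
		return wb_dirtied > wb_written + stat_error;
	}

	int main(void)
	{
		/* over the threshold, writer 200 pages ahead of the flusher */
		printf("%d\n", should_throttle(5000, 4096, 1200, 1000, 64));
		/* flusher has nearly caught up: stop throttling */
		printf("%d\n", should_throttle(5000, 4096, 1200, 1190, 64));
		return 0;
	}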
> > +}
> > +
> > +static inline void cgwb_bdp(struct bdi_writeback *wb)
> > +{
> > +	wait_event_interruptible_timeout(wb->bdp_waitq,
> > +			!cgwb_bdp_should_throttle(wb), HZ);
> > +}
> > +
> >  /*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> > @@ -1910,7 +1943,7 @@ void balance_dirty_pages_ratelimited(str
> >  	preempt_enable();
> >
> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > -		balance_dirty_pages(wb, current->nr_dirtied);
> > +		cgwb_bdp(wb);
> >
> >  	wb_put(wb);
> >  }
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
> >  	if (nr_pages == LONG_MAX)
> >  		return LONG_MAX;
> >
> > +	return nr_pages;
> > +
> >  	/*
> >  	 * This may be called on clean wb's and proportional distribution
> >  	 * may not make sense, just use the original @nr_pages in those
> > @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
> >  		pages = min(pages, work->nr_pages);
> >  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
> >  				   MIN_WRITEBACK_PAGES);
> > +		pages = work->nr_pages;
> >  	}
> >
> >  	return pages;
> > @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
> >  		wb_wakeup_delayed(wb);
> >
> >  	current->flags &= ~PF_SWAPWRITE;
> > +
> > +	if (waitqueue_active(&wb->bdp_waitq))
> > +		wake_up_all(&wb->bdp_waitq);
> >  }
>
> Does this patch affect both cgroup writeback and global writeback?
> Were both tested?  Performance results of both?

Both will be affected, IMHO, as "bdi->wb remains unchanged and will keep
serving the root cgroup"; please see 52ebea749aae ("writeback: make
backing_dev_info host cgroup-specific bdi_writebacks").

More tests are needed, and it would help a lot if folks could pinpoint
the test cases they are interested in.

Thanks
Hillf
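
PS: for reviewers who want to see the leash in isolation, below is a
compact pthread model of the handshake the patch introduces.  The
dirtier parks on a condvar with a one-second timeout, standing in for
wait_event_interruptible_timeout() on bdp_waitq, and the flusher wakes
all waiters after each chunk of writeback, standing in for the
wake_up_all() added to wb_workfn().  It is a sketch under those
assumptions, not kernel code; all names and numbers below are mine.

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t bdp_waitq = PTHREAD_COND_INITIALIZER;
	static long dirtied, written;	/* pages, protected by lock */
	static const long gap = 100;	/* max dirtied-written distance */

	static bool should_throttle(void)	/* caller holds lock */
	{
		return dirtied - written > gap;
	}

	static void *dirtier(void *arg)
	{
		for (int i = 0; i < 500; i++) {
			pthread_mutex_lock(&lock);
			dirtied++;		/* dirty one page */
			while (should_throttle()) {
				/* the leash: wait for the flusher, 1s at most */
				struct timespec ts;
				clock_gettime(CLOCK_REALTIME, &ts);
				ts.tv_sec += 1;
				pthread_cond_timedwait(&bdp_waitq, &lock, &ts);
			}
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	static void *flusher(void *arg)
	{
		bool done = false;

		while (!done) {
			usleep(1000);	/* model device latency */
			pthread_mutex_lock(&lock);
			/* write back a chunk, never past what was dirtied */
			written = written + 10 < dirtied ? written + 10 : dirtied;
			done = written >= 500;
			/* like wb_workfn(): wake everyone parked on bdp_waitq */
			pthread_cond_broadcast(&bdp_waitq);
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t d, f;

		pthread_create(&d, NULL, dirtier, NULL);
		pthread_create(&f, NULL, flusher, NULL);
		pthread_join(d, NULL);
		pthread_join(f, NULL);
		printf("dirtied %ld, written %ld\n", dirtied, written);
		return 0;	/* build with: cc -pthread model.c */
	}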