From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.5 required=3.0 tests=FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A112BC432C3 for ; Thu, 14 Nov 2019 09:38:52 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 66ED820709 for ; Thu, 14 Nov 2019 09:38:52 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 66ED820709 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=sina.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id EFCEB6B0003; Thu, 14 Nov 2019 04:38:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id EAD6B6B0005; Thu, 14 Nov 2019 04:38:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DEA2A6B0006; Thu, 14 Nov 2019 04:38:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0043.hostedemail.com [216.40.44.43]) by kanga.kvack.org (Postfix) with ESMTP id CB5086B0003 for ; Thu, 14 Nov 2019 04:38:51 -0500 (EST) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id 47051180AD80F for ; Thu, 14 Nov 2019 09:38:51 +0000 (UTC) X-FDA: 76154383662.13.shake60_2ff659ec2fa0b X-HE-Tag: shake60_2ff659ec2fa0b X-Filterd-Recvd-Size: 9113 Received: from r3-23.sinamail.sina.com.cn (r3-23.sinamail.sina.com.cn [202.108.3.23]) by imf43.hostedemail.com (Postfix) with SMTP for ; Thu, 14 Nov 2019 09:38:49 +0000 (UTC) Received: from unknown (HELO localhost.localdomain)([124.64.3.114]) by sina.com with ESMTP id 5DCD20A000013F7A; Thu, 14 Nov 2019 17:38:44 +0800 (CST) X-Sender: hdanton@sina.com X-Auth-ID: hdanton@sina.com X-SMAIL-MID: 82592954921274 From: Hillf Danton To: Andrew Morton Cc: Hillf Danton , linux-mm , Rong Chen , Fengguang Wu , Tejun Heo , Shakeel Butt , Minchan Kim , Mel Gorman , linux-kernel Subject: Re: [RFC v3] writeback: add elastic bdi in cgwb bdp Date: Thu, 14 Nov 2019 17:38:32 +0800 Message-Id: <20191114093832.8504-1-hdanton@sina.com> In-Reply-To: <20191112034227.3112-1-hdanton@sina.com> References: <20191112034227.3112-1-hdanton@sina.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, 12 Nov 2019 17:02:36 -0800 Andrew Morton wrote: >=20 > On Tue, 12 Nov 2019 11:42:27 +0800 Hillf Danton wrot= e: > >=20 > > The elastic bdi (ebdi) which is the mirror bdi of spinning disk, > > SSD and USB key on market is introduced to balancing dirty pages > > (bdp). > >=20 > > The risk arises that system runs out of free memory, when dirty > > pages are produced too many too soon, so bdp is needed in field. > >=20 > > Ebdi facilitates bdp in elastic time intervals e.g. from a jiffy > > to one HZ, depending on the time it would take to increase dirty > > pages by the amount which is defined by the variable > > ratelimit_pages. > >=20 > > During cgroup writeback (cgwb) bdp, ebdi helps observe the > > changes both in cgwb's dirty pages (dirty speed) and in > > written-out pages (laundry speed) in elastic time intervals, > > until a balance is established between the two parties i.e. > > the two speeds statistically equal. > >=20 > > The above mechanism of elastic equilibrium effectively prevents > > dirty page hogs, as no chance is left for dirty pages to pile up, > > thus cuts the risk that system free memory falls to unsafe level. > >=20 > > Thanks to Rong Chen for testing. >=20 > That sounds like a Tested-by: >=20 Yes, Sir, will add Tested-by: Rong Chen > The changelog has no testing results. Please prepare results which > show, amongst other things, the change in performance when the kernel > isn't tight on memory. As well as the alteration in behaviour when > memory is short. >=20 Will do. > Generally, please work on making this code much more understandable? >=20 Will do. > > > > ... > > > > --- a/fs/fs-writeback.c > > +++ b/fs/fs-writeback.c > > @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd > > if (nr_pages =3D=3D LONG_MAX) > > return LONG_MAX; > > =20 > > + return nr_pages; > > + > > /* > > * This may be called on clean wb's and proportional distribution > > * may not make sense, just use the original @nr_pages in those > > @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct > > pages =3D min(pages, work->nr_pages); > > pages =3D round_down(pages + MIN_WRITEBACK_PAGES, > > MIN_WRITEBACK_PAGES); > > + pages =3D work->nr_pages; >=20 > It's unclear what this is doing, but it makes the three preceding > statements non-operative. >=20 This change, and the above one as well, is trying to bypass the current bandwidth, and a couple of rounds of cleanup are needed after it survives the LTP. > > } > > =20 > > return pages; > > @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work) > > wb_wakeup_delayed(wb); > > =20 > > current->flags &=3D ~PF_SWAPWRITE; > > + > > + if (waitqueue_active(&wb->bdp_waitq)) > > + wake_up_all(&wb->bdp_waitq); >=20 > Please add a comment explaining why this is being done here. >=20 After writing out some dirty pages, it it a check point to see if a balance is already set up between the dirty speed and laundry speed. Those under throttling will be unthrottled after seeing a balance in place. A comment will be added. > > } > > =20 > > /* > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -1830,6 +1830,67 @@ pause: > > wb_start_background_writeback(wb); > > } > > =20 > > +/** > > + * cgwb_bdp_should_throttle() tell if a wb should be throttled > > + * @wb bdi_writeback to throttle > > + * > > + * To avoid the risk of exhausting the system free memory, we check > > + * and try much to prevent too many dirty pages from being produced > > + * too soon. > > + * > > + * For cgroup writeback, it is essencially to keep an equilibrium >=20 > "it is essential"? >=20 Yes Sir. > > + * between its dirty speed and laundry speed i.e. dirty pages are > > + * written out as fast as they are produced in an ideal state. > > + */ > > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb) > > +{ > > + struct dirty_throttle_control gdtc =3D { GDTC_INIT_NO_WB }; > > + > > + if (fatal_signal_pending(current)) > > + return false; > > + > > + gdtc.avail =3D global_dirtyable_memory(); > > + > > + domain_dirty_limits(&gdtc); > > + > > + gdtc.dirty =3D global_node_page_state(NR_FILE_DIRTY) + > > + global_node_page_state(NR_UNSTABLE_NFS) + > > + global_node_page_state(NR_WRITEBACK); > > + > > + if (gdtc.dirty < gdtc.bg_thresh) > > + return false; > > + > > + if (!writeback_in_progress(wb)) > > + wb_start_background_writeback(wb); >=20 > This is a bit ugly. Something called "bool cgwb_bdp_should_throttle()" > shoiuld just check whether we should throttle. But here it is, also > initiating writeback. That's an inappropriate thing for this function > to do? >=20 It is the current bdp behavior trying to keep dirty pages below the user-configurable background threshold by waking up flushers, because no dirty page will be sent to disk without flusher's efforts, please see 143dfe8611a6 ("writeback: IO-less balance_dirty_pages()"). Will try to find some chance to pinch it out. > Also, we don't know *why* this is being done here, because there's no > code comment explaining the reasoning to us. >=20 Will add a comment. >=20 > > + if (gdtc.dirty < gdtc.thresh) > > + return false; > > + > > + /* > > + * throttle wb if there is the risk that wb's dirty speed is > > + * running away from its laundry speed, better with statistic > > + * error taken into account. > > + */ > > + return wb_stat(wb, WB_DIRTIED) > > > + wb_stat(wb, WB_WRITTEN) + wb_stat_error(); > > +} > > + > > > > ... > > > > @@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str > > * 1000+ tasks, all of them start dirtying pages at exactly the sam= e > > * time, hence all honoured too large initial task->nr_dirtied_paus= e. > > */ > > - p =3D this_cpu_ptr(&bdp_ratelimits); > > - if (unlikely(current->nr_dirtied >=3D ratelimit)) > > - *p =3D 0; > > - else if (unlikely(*p >=3D ratelimit_pages)) { > > - *p =3D 0; > > - ratelimit =3D 0; > > - } > > + dirty =3D this_cpu_ptr(&bdp_ratelimits); > > + > > /* > > * Pick up the dirtied pages by the exited tasks. This avoids lots = of > > * short-lived tasks (eg. gcc invocations in a kernel build) escapi= ng > > * the dirty throttling and livelock other long-run dirtiers. > > */ > > - p =3D this_cpu_ptr(&dirty_throttle_leaks); > > - if (*p > 0 && current->nr_dirtied < ratelimit) { > > - unsigned long nr_pages_dirtied; > > - nr_pages_dirtied =3D min(*p, ratelimit - current->nr_dirtied); > > - *p -=3D nr_pages_dirtied; > > - current->nr_dirtied +=3D nr_pages_dirtied; > > + leak =3D this_cpu_ptr(&dirty_throttle_leaks); > > + > > + if (*dirty + *leak < ratelimit_pages) { > > + /* > > + * nothing to do as it would take some more time to > > + * eat out ratelimit_pages > > + */ > > + try_bdp =3D false; > > + } else { > > + try_bdp =3D true; > > + > > + /* > > + * bdp in flight helps detect dirty page hogs soon > > + */ >=20 > How? Please expand on this comment a lot. =20 >=20 We should be cautious here in red zone after paying the ratelimit_pages price; we might soon have to tackle a deluge of dirty page hogs. Will cut it. > > + flights =3D this_cpu_ptr(&bdp_in_flight); > > + > > + if ((*flights)++ & 1) { >=20 > What is that "& 1" doing? >=20 It helps to tell if a bdp is alredy in flight. It would have been something like if (*flights =3D=3D 0) { (*flights)++; } else { *flights =3D 0; > > + *dirty =3D *dirty + *leak - ratelimit_pages; > > + *leak =3D 0; > > + } but I was curious to see the flights in long run. Thanks Hillf > > } > > preempt_enable(); > > =20 > > - if (unlikely(current->nr_dirtied >=3D ratelimit)) > > - balance_dirty_pages(wb, current->nr_dirtied); > > + if (try_bdp) > > + cgwb_bdp(wb); > > =20 > > wb_put(wb);