From: Hillf Danton <hdanton@sina.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Hillf Danton <hdanton@sina.com>, linux-mm <linux-mm@kvack.org>,
fsdev <linux-fsdevel@vger.kernel.org>,
Fengguang Wu <fengguang.wu@intel.com>, Tejun Heo <tj@kernel.org>,
Jan Kara <jack@suse.com>, Johannes Weiner <hannes@cmpxchg.org>,
Shakeel Butt <shakeelb@google.com>,
Minchan Kim <minchan@kernel.org>, Mel Gorman <mgorman@suse.de>,
linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [RFC v2] writeback: add elastic bdi in cgwb bdp
Date: Sat, 9 Nov 2019 18:31:12 +0800 [thread overview]
Message-ID: <20191109103112.6480-1-hdanton@sina.com> (raw)
In-Reply-To: <20191026104656.15176-1-hdanton@sina.com>
On Fri, 8 Nov 2019 13:00:17 -0800 Andrew Morton wrote:
>
> On Sat, 26 Oct 2019 18:46:56 +0800 Hillf Danton <hdanton@sina.com> wrote:
> >
> > The elastic bdi (ebdi) mirrors spinning disks, SSDs, USB sticks and
> > other storage devices on the market. Its performance rises and falls
> > as the pattern of dispatched IO changes, which can be roughly
> > estimated as below.
> >
> > P = j(..., IO pattern);
> >
> > From ebdi's point of view, the bandwidth currently measured while
> > balancing dirty pages is closely tied to its performance, since the
> > former is a component of the latter.
> >
> > B = y(P);
> >
> > The functions above suggest a possible layering violation: the
> > bandwidth might be better measured somewhere below the fs layer.
> >
> > As it stands, however, the bandwidth is measured to an extent that
> > satisfies everyone, and it plays a role in dispatching IO while the
> > IO pattern, which is volatile by nature, is entirely ignored.
> >
> > It also helps throttle the dirty speed, while ignoring the fact that
> > DRAM is generally ten times faster than an ebdi. If B is half of P,
> > for instance, then B is near 5% of the dirty speed, just two points
> > from the figure in the snippet below.
> >
> > /*
> > * If ratelimit_pages is too high then we can get into dirty-data overload
> > * if a large number of processes all perform writes at the same time.
> > * If it is too low then SMP machines will call the (expensive)
> > * get_writeback_state too often.
> > *
> > * Here we set ratelimit_pages to a level which ensures that when all CPUs are
> > * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> > * thresholds.
> > */
> >
> > To keep the dirty speed from running away from the laundry
> > (writeback) speed, this patch adds a walk-dog style leash to bdp:
> > the dirtier runs ahead, then waits for writeback to catch up, which
> > churns the IO pattern less.
>
> I'm finding both the changelog and the patch rather hard to understand.
Will change both, Sir.
> The absence of code comments doesn't help. But the test robot
Will add.
> performance results look nice.
>
> Presumably you did your own performance testing. Please share the
> results of that in the changelog.
>
> >
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -170,6 +170,8 @@ struct bdi_writeback {
> >
> > struct list_head bdi_node; /* anchored at bdi->wb_list */
> >
> > + struct wait_queue_head bdp_waitq;
>
> Please add comments which help the reader find out what "bdp" stands
> for.
Yes, Sir. bdp is an acronym for "balance dirty pages".
>
> > #ifdef CONFIG_CGROUP_WRITEBACK
> > struct percpu_ref refcnt; /* used only for !root wb's */
> > struct fprop_local_percpu memcg_completions;
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -324,6 +324,8 @@ static int wb_init(struct bdi_writeback
> > goto out_destroy_stat;
> > }
> >
> > + init_waitqueue_head(&wb->bdp_waitq);
> > +
> > return 0;
> >
> > out_destroy_stat:
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1551,6 +1551,39 @@ static inline void wb_dirty_limits(struc
> > }
> > }
> >
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
>
> Please document this function. Describe the "why" not the "what".
Will do.
> Comment should help readers understand what "cgwb" means.
>
> > +{
> > + struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > + if (fatal_signal_pending(current))
> > + return false;
> > +
> > + gdtc.avail = global_dirtyable_memory();
> > +
> > + domain_dirty_limits(&gdtc);
> > +
> > + gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > + global_node_page_state(NR_UNSTABLE_NFS) +
> > + global_node_page_state(NR_WRITEBACK);
> > +
> > + if (gdtc.dirty < gdtc.bg_thresh)
> > + return false;
> > +
> > + if (!writeback_in_progress(wb))
> > + wb_start_background_writeback(wb);
>
> Add a comment explaining what's going on here and why we're doing this.
Will do.
> > + return gdtc.dirty > gdtc.thresh &&
> > + wb_stat(wb, WB_DIRTIED) >
> > + wb_stat(wb, WB_WRITTEN) +
> > + wb_stat_error();
>
> Why?
Will add explanation.
> > +}
> > +
> > +static inline void cgwb_bdp(struct bdi_writeback *wb)
> > +{
> > + wait_event_interruptible_timeout(wb->bdp_waitq,
> > + !cgwb_bdp_should_throttle(wb), HZ);
> > +}
> > +
> > /*
> > * balance_dirty_pages() must be called by processes which are generating dirty
> > * data. It looks at the number of dirty pages in the machine and will force
> > @@ -1910,7 +1943,7 @@ void balance_dirty_pages_ratelimited(str
> > preempt_enable();
> >
> > if (unlikely(current->nr_dirtied >= ratelimit))
> > - balance_dirty_pages(wb, current->nr_dirtied);
> > + cgwb_bdp(wb);
> >
> > wb_put(wb);
> > }
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
> > if (nr_pages == LONG_MAX)
> > return LONG_MAX;
> >
> > + return nr_pages;
> > +
> > /*
> > * This may be called on clean wb's and proportional distribution
> > * may not make sense, just use the original @nr_pages in those
> > @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
> > pages = min(pages, work->nr_pages);
> > pages = round_down(pages + MIN_WRITEBACK_PAGES,
> > MIN_WRITEBACK_PAGES);
> > + pages = work->nr_pages;
> > }
> >
> > return pages;
> > @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
> > wb_wakeup_delayed(wb);
> >
> > current->flags &= ~PF_SWAPWRITE;
> > +
> > + if (waitqueue_active(&wb->bdp_waitq))
> > + wake_up_all(&wb->bdp_waitq);
> > }
>
> Does this patch affect both cgroup writeback and global writeback?
> Were both tested? Performance results of both?
Both are affected IMHO, since "bdi->wb remains unchanged and will keep
serving the root cgroup"; please see 52ebea749aae ("writeback: make
backing_dev_info host cgroup-specific bdi_writebacks").
More tests are needed, and it would help a lot if folks could pinpoint
the test cases they are interested in.
Thanks
Hillf
Thread overview: 7+ messages
2019-10-26 10:46 Hillf Danton
2019-11-07 8:38 ` [writeback] 2b871886bb: fio.write_bw_MBps 40.6% improvement kernel test robot
2019-11-08 21:00 ` [RFC v2] writeback: add elastic bdi in cgwb bdp Andrew Morton
2019-11-09 10:31 ` Hillf Danton [this message]
2019-11-14 12:17 ` Jan Kara
2019-11-15 3:32 ` Hillf Danton
2019-11-15 9:53 ` Jan Kara