From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: SeongJae Park <sjpark@amazon.com>
Cc: <akpm@linux-foundation.org>, SeongJae Park <sjpark@amazon.de>,
<aarcange@redhat.com>, <acme@kernel.org>,
<alexander.shishkin@linux.intel.com>, <amit@kernel.org>,
<benh@kernel.crashing.org>, <brendan.d.gregg@gmail.com>,
<brendanhiggins@google.com>, <cai@lca.pw>,
<colin.king@canonical.com>, <corbet@lwn.net>, <dwmw@amazon.com>,
<irogers@google.com>, <jolsa@redhat.com>, <kirill@shutemov.name>,
<mark.rutland@arm.com>, <mgorman@suse.de>, <minchan@kernel.org>,
<mingo@redhat.com>, <namhyung@kernel.org>, <peterz@infradead.org>,
<rdunlap@infradead.org>, <riel@surriel.com>,
<rientjes@google.com>, <rostedt@goodmis.org>, <sblbir@amazon.com>,
<shakeelb@google.com>, <shuah@kernel.org>, <sj38.park@gmail.com>,
<snu@amazon.de>, <vbabka@suse.cz>, <vdavydov.dev@gmail.com>,
<yang.shi@linux.alibaba.com>, <ying.huang@intel.com>,
<linux-damon@amazon.com>, <linux-mm@kvack.org>,
<linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v9 00/15] Introduce Data Access MONitor (DAMON)
Date: Wed, 29 Apr 2020 10:18:06 +0100 [thread overview]
Message-ID: <20200429101806.000002f4@Huawei.com> (raw)
In-Reply-To: <20200429074954.24032-1-sjpark@amazon.com>
On Wed, 29 Apr 2020 09:49:54 +0200
SeongJae Park <sjpark@amazon.com> wrote:
> On Tue, 28 Apr 2020 17:17:13 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>
> > On Tue, 28 Apr 2020 15:23:42 +0200
> > SeongJae Park <sjpark@amazon.com> wrote:
> >
> > > On Tue, 28 Apr 2020 13:27:04 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > >
> > > > On Mon, 27 Apr 2020 14:04:27 +0200
> > > > SeongJae Park <sjpark@amazon.com> wrote:
> > > >
> > > > > From: SeongJae Park <sjpark@amazon.de>
> > > > >
> > > > > Introduction
> > > > > ============
> > > > >
> > > > > Memory management decisions can be improved if finer data access information is
> > > > > available. However, because such finer information usually comes with higher
> > > > > overhead, most systems, including Linux, forgo the potential benefit and rely
> > > > > only on coarse information or some light-weight heuristics. The pseudo-LRU and
> > > > > the aggressive THP promotions are such examples.
> > > > >
> > > > > A number of data access pattern-aware memory management optimizations (refer
> > > > > to 'Appendix A' for more details) consistently show that the potential benefit is not
> > > > > small. However, none of those has been successfully merged into the mainline Linux
> > > > > kernel, mainly due to the absence of a scalable and efficient data access
> > > > > monitoring mechanism. Refer to 'Appendix B' for the limitations of existing
> > > > > memory monitoring mechanisms.
> > > > >
> > > > > DAMON is a data access monitoring subsystem addressing this problem. It is 1) accurate
> > > > > enough to be used for DRAM-level memory management (a straightforward
> > > > > DAMON-based optimization achieved up to 2.55x speedup), 2) light-weight enough
> > > > > to be applied online (compared to a straightforward access monitoring scheme,
> > > > > DAMON is up to 94,242.42x lighter) and 3) keeps a predefined upper-bound overhead
> > > > > regardless of the size of target workloads (thus scalable). Refer to 'Appendix
> > > > > C' if you are interested in how this is possible, and 'Appendix F' for how the
> > > > > numbers were collected.
> > > > >
> > > > > DAMON has mainly been designed for the kernel's memory management mechanisms.
> > > > > However, because it is implemented as a standalone kernel module and provides
> > > > > several interfaces, it can be used by a wide range of users including kernel
> > > > > space programs, user space programs, programmers, and administrators. DAMON
> > > > > currently supports monitoring only, but it will also provide simple and
> > > > > convenient data access pattern-aware memory management by itself. Refer to
> > > > > 'Appendix D' for more detailed expected usages of DAMON.
> > > > >
> > > [...]
> > > > >
> > > > > Future Plans
> > > > > ============
> > > > >
> > > > > This patchset is only for the first stage of DAMON. As soon as this patchset
> > > > > is merged, official patchsets for the future plans below will be posted.
> > > > >
> > > [...]
> > > > >
> > > > > Support Various Address Spaces
> > > > > ------------------------------
> > > > >
> > > > > Currently, DAMON supports virtual memory address spaces using PTE Accessed bits
> > > > > as its access checking primitive. However, the core design of DAMON is not
> > > > > dependent on such implementation details. In the future, DAMON will decouple
> > > > > those and support various address spaces including physical memory. It will
> > > > > further allow users to configure and even implement the primitives by
> > > > > themselves for their special use cases. Monitoring of page cache, NUMA nodes,
> > > > > specific files, or block devices would be examples of such use cases.
> > > > >
> > > > > An RFC patchset for this plan is already available
> > > > > (https://lore.kernel.org/linux-mm/20200409094232.29680-1-sjpark@amazon.com/).
> > > > >
> > > [...]
> > > > >
> > > > > Patch History
> > > > > =============
> > > > >
> > > > > The biggest change in this version is support for a minimal region size,
> > > > > which defaults to 'PAGE_SIZE'. This change will reduce unnecessary region
> > > > > splits and thus improve the quality of the output. In the future, we will be
> > > > > able to make this configurable to support various access check primitives
> > > > > such as PMUs.
> > > >
> > > > That is a good improvement. Might be interesting to consider taking
> > > > hugepages into account as well.
> > >
> > > Thanks! Kudos to Stefan and you for giving me the comments for the change.
> > >
> > > As mentioned above in the 'Future Plans' section, DAMON will be highly
> > > configurable. You can see the plan in more detail via the RFC patchset[1].
> > > Thus, the minimal region size will also be configurable as users want,
> > > including the size of a hugepage.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20200409094232.29680-1-sjpark@amazon.com/
> > >
> > > >
> > > > One issue I've noted is that we have a degeneracy problem with the current
> > > > region merging and splitting that perhaps could do with a small tweak.
> > > >
> > > > Currently we can end up with a very small number of regions because there
> > > > is no limit on how many regions can be merged in a given pass for merging.
> > > > However, splitting only doubles the number of regions.
> > > >
> > > > I've been experimenting with a few loops of the splitting algorithm to ensure
> > > > we don't end up stuck with limited regions. I think the problem we are working
> > > > around can be roughly described as:
> > > >
> > > > 1) Program allocates a lot of memory - not really touching much of it.
> > > > 2) Damon fuses the large memory allocations into one region because the
> > > > access counts are always near 0.
> > > > 3) Program finishes setup.
> > > > 4) Program accesses a few pages in the huge region a lot, but not that much
> > > > for most of the rest. Taking an extreme example, the page in the middle
> > > > gets all the accesses and the other 1G on either side gets none.
> > > > 5) As a split always breaks the region in two, the chance of significantly
> > > > different values for the two resulting regions is low (as we only sample
> > > > the hot page occasionally).
> > > >
> > > > If we just run the splits twice if the number of regions < max regions / 4
> > > > then over time we should eventually get a region with the single hot page in it.
> > > > We will get there faster if we split more (keeping below max regions).
> > > >
> > > > As we always remain below max regions, we are still obeying the fixed
> > > > maximum overhead and actually monitoring at closer to the desired granularity.
> > >
> > > Good point. However, as you also mentioned, DAMON will slowly but eventually
> > > adjust the regions appropriately.
> > >
> > > And yes, your suggested solution will work pretty well. Indeed, one of my
> > > previous colleagues found this problem on a few special workloads and tried
> > > the solution you suggested. The improvement was clear.
> > >
> > > However, I didn't adopt the solution for the reasons below.
> > >
> > > First, IMHO, this is an accuracy improvement rather than a bug fix. But the
> > > extent of the enhancement didn't seem very critical to me. Most other
> > > workloads didn't show such a problem (and thus no improvement). Even with the
> > > workloads showing the problem, the problem didn't seem so critical.
> > >
> > > Second, if the low accuracy is a problem, users could get higher accuracy by
> > > simply adjusting the sampling interval and/or aggregation interval to lower
> > > values. This is the supposed way to trade accuracy against overhead.
> >
> > I disagree. There is very little chance of getting out of this situation with the
> > current splitting. Changing sampling and aggregation intervals doesn't actually help.
> >
> > Let's draw out an example to discuss.
> >
> > Toy state - taking just one block of memory.
> >
> > 0 = not accessed page (very cold)
> > X = accessed page (extremely hot)
> >
> > First few cycles - no accesses
> >
> > In the X.regions lists below, the value in parentheses is the average access ratio estimated by damon.
> >
> > Region C is needed to set the max and will never be aggregated.
> >
> > aggregation cycle then state.
> > 0.start
> > 0.accessed 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X X X
> > 0.regions (percent)| A (0) | B (0) | C(1)|
> > 0.merge | A | C |
> > 0.split | A | B | C |
> >
> > After a few cycles, hot page
> > 1.start
> > 1.accessed 0 0 0 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > 1.regions (acc_cnt)| A (1/18) | B (0) | C(1)|
>
> ^ not count but ratio, right?
oops. Absolutely.
>
> > 1.merge | A | C |
> > 1.split | A | B | C |
> > 2.start
> > 2.accessed 0 0 0 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > 2.regions (acc_cnt)| A (1/12) | B (0) | C(1)|
> > 2.merge | A | C |
> > 2.split | A | B | C |
> > 3.start
> > 3.accessed 0 0 0 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > 3.regions (acc_cnt)| A (0) | B (1/21) | C(1)|
> > 3.merge | A | C |
> > 3.split | A | B | C |
> >
> > Now make that 1000 pages long with the hot page at page 500.
> > So the best case average we will ever get is 1/500 * the number of sample
> > periods between aggregations.
>
> So nice example, thank you! Now I understand the point.
>
> So, the problem is that we cannot find the small hot region near the _middle_
> because we split each region into only two subregions.
Exactly. Pathological case :(
>
> >
> > So what are the chances of failing to aggregate on the sample after we split
> > at that optimal point? We need to successfully sample that one page enough that
> > we get it 10% of the time.
> >
> > I 'think' this is a case where the 10% point is on the CDF of a binomial
> > f(1/N, M) where N is the number of bins and M is the number of samples.
> >
> > Using matlab online I think the best chance you ever get is when you take 10 samples
> > and need just one of them to be in the region.
> >
> > p = 1 - binocdf(0,10,1/N)
> > For N = 500, p = 0.0198
> > For N = 1000, p = 0.0099
> >
> > Someone with better maths than me can check.
> >
> > Now this just got us to the point where we won't aggregate the region for one
> > round of aggregation. We may split it again, and if the resulting region is small
> > enough we might not merge it in the next aggregation cycle.
> >
> > So I'd argue that allowing at least 2 repeats of splitting is well worthwhile.
> > It is just a couple of additional lines of code.
>
> Nice suggestion, I will apply this suggestion in the next spin. It might be as
> below:
>
> if (nr_regions() < nr_max_regions / 4)
> split_into_4_regions();
> else if (nr_regions() < nr_max_regions / 2)
> split_into_2_regions();
>
> If this pseudo-code is missing some of your point, please let me know.
That's it. My prototype was less efficient in that it just ran the
2-way split twice if we still had too few regions, but the result is very
nearly the same (potentially some changes in the location of the split, as
it is 10-90% both times vs whatever limits you put in the 4-region version).
>
> >
> > >
> > > Finally, I would like to keep the code as simple as it can be.
> > >
> > > For the same reasons, I would like to keep the code as it currently is until a
> > > real user problem is reported. If you have different opinions, please feel free
> > > to yell at me.
> >
> > :)
>
> Appreciate your explanations and suggestions.
You are welcome.
Out of interest, do you have any comparative data on how 'accurate' the resulting
estimates are vs a more precise heatmap from a memory trace?
I'm looking at gathering such data but much happier to leverage your work if
you've already done it!
Thanks,
Jonathan
>
>
> Thanks,
> SeongJae Park