From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=elf6=27=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.5 required=3.0 tests=DKIM_ADSP_ALL,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,
	MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 95785C33CA5
	for <linux-mm@archiver.kernel.org>; Fri, 10 Jan 2020 13:16:18 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 3130C2077C
	for <linux-mm@archiver.kernel.org>; Fri, 10 Jan 2020 13:16:18 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="W6a9TSH4"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3130C2077C
Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=amazon.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 9C89C8E000A; Fri, 10 Jan 2020 08:16:11 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 92C848E0001; Fri, 10 Jan 2020 08:16:11 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 7F25E8E000A; Fri, 10 Jan 2020 08:16:11 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id 663FC8E0001
	for <linux-mm@kvack.org>; Fri, 10 Jan 2020 08:16:11 -0500 (EST)
Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with SMTP id 300B84435
	for <linux-mm@kvack.org>; Fri, 10 Jan 2020 13:16:11 +0000 (UTC)
X-FDA: 76361772942.08.truck67_18cedd9c2fd2e
X-HE-Tag: truck67_18cedd9c2fd2e
X-Filterd-Recvd-Size: 18460
Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com [207.171.184.29])
	by imf50.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Fri, 10 Jan 2020 13:16:10 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
  d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
  t=1578662171; x=1610198171;
  h=from:to:cc:subject:date:message-id:mime-version:
   content-transfer-encoding;
  bh=pSUASsSyOliXn6zxjs/WndFpkUwK2Cn3aRxKvF+GEnU=;
  b=W6a9TSH4YCFOUm4O2ollR/9ei4k43L57Bk9PNpaPTD8vDkMuvDAW1Yfg
   1seiEThTAZgqQpKnQ0++y51bWM0nHXNk6O+dMFWhv/Iy/aEdPNZEV3yjn
   os5Z0kcizkrtQmuuep5p4ko1VnlEcQbyBg2KPvfAH+15KBa5VW2g8Sjsn
   A=;
IronPort-SDR: wvy9v+eZOlLemHmt5+LPqiF/nwnr8jI03lOH6haZzbaKd74biA7bMckcaduhMrIfQGjxJXMXJ9
 Z8tOKzAkfR7Q==
X-IronPort-AV: E=Sophos;i="5.69,417,1571702400"; 
   d="scan'208";a="17962720"
Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1e-57e1d233.us-east-1.amazon.com) ([10.47.23.38])
  by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 10 Jan 2020 13:15:56 +0000
Received: from EX13MTAUEA002.ant.amazon.com (iad55-ws-svc-p15-lb9-vlan3.iad.amazon.com [10.40.159.166])
	by email-inbound-relay-1e-57e1d233.us-east-1.amazon.com (Postfix) with ESMTPS id BC3C91415D7;
	Fri, 10 Jan 2020 13:15:53 +0000 (UTC)
Received: from EX13D31EUA001.ant.amazon.com (10.43.165.15) by
 EX13MTAUEA002.ant.amazon.com (10.43.61.77) with Microsoft SMTP Server (TLS)
 id 15.0.1236.3; Fri, 10 Jan 2020 13:15:52 +0000
Received: from u886c93fd17d25d.ant.amazon.com (10.43.161.115) by
 EX13D31EUA001.ant.amazon.com (10.43.165.15) with Microsoft SMTP Server (TLS)
 id 15.0.1367.3; Fri, 10 Jan 2020 13:15:47 +0000
From: SeongJae Park <sjpark@amazon.com>
To: <akpm@linux-foundation.org>
CC: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, <corbet@lwn.net>,
	<linux-doc@vger.kernel.org>, <mgorman@suse.de>, <brendanhiggins@google.com>,
	<sj38.park@gmail.com>, SeongJae Park <sjpark@amazon.de>
Subject: [RFC PATCH 0/5] Introduce Data Access MONitor (DAMON)
Date: Fri, 10 Jan 2020 14:15:17 +0100
Message-ID: <20200110131522.29964-1-sjpark@amazon.com>
X-Mailer: git-send-email 2.17.1
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Originating-IP: [10.43.161.115]
X-ClientProxiedBy: EX13D10UWA001.ant.amazon.com (10.43.160.216) To
 EX13D31EUA001.ant.amazon.com (10.43.165.15)
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

From: SeongJae Park <sjpark@amazon.de>

This RFC patchset introduces a new kernel module for practical monitoring=
 of
data accesses, namely DAMON.

The patches are organized in the following sequence.  The first and secon=
d
patch introduces the core logic and the raw level user interface of DAMON=
,
respectively.  To provide a minimal reference to the raw level interfaces=
 and
for more convenient test of the DAMON itself, the third patch implements =
an
user space wrapper tools for the DAMON.  The fourth patch adds a document=
 for
the DAMON, and finally the fifth patch provides DAMON's unit tests, which=
 is
using the kunit framework.

The patches are based on the v5.4 plus the back-ported kunit, which retri=
eved
from v5.5-rc1.  You can also clone the complete git tree by:

    $ git clone git://github.com/sjp38/linux -b damon/rfc/v1

The web is also available:
https://github.com/sjp38/linux/releases/tag/damon/rfc/v1

----

DAMON is a kernel module that allows users to monitor the actual memory a=
ccess
pattern of specific user-space processes.  It aims to be 1) accurate enou=
gh to
be useful for performance-centric domains, and 2) sufficiently light-weig=
ht so
that it can be applied online.

For the goals, DAMON utilizes its two core mechanisms, called region-base=
d
sampling and adaptive regions adjustment.  The region-based sampling allo=
ws
users to make their own trade-off between the quality and the overhead of=
 the
monitoring and set the upperbound of the monitoring overhead.  Further, t=
he
adaptive regions adjustment mechanism makes DAMON to maximize the quality=
 and
minimize the overhead with its best efforts while preserving the users
configured trade-off.


Background
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

For performance-centric analysis and optimizations of memory management s=
chemes
(either that of kernel space or user space), the actual data access patte=
rn of
the workloads is highly useful.  The information need to be only reasonab=
le
rather than strictly correct, because some level of incorrectness can be
handled in many performance-centric domains.  It also need to be taken wi=
thin
reasonably short time with only light-weight overhead.

Manually extracting such data is not easy and time consuming if the targe=
t
workload is huge and complex, even for the developers of the programs.  T=
here
are a range of tools and techniques developed for general memory access
investigations, and some of those could be partially used for this purpos=
e.
However, most of those are not practical or unscalable, mainly because th=
ose
are designed with no consideration about the trade-off between the accura=
cy of
the output and the overhead.

The memory access instrumentation techniques which is applied to many too=
ls
such as Intel PIN is essential for correctness required cases such as inv=
alid
memory access bug detections.  However, those usually incur high overhead=
 which
is unacceptable for many of the performance-centric domains.  Periodic ac=
cess
checks based on H/W or S/W access counting features (e.g., the Accessed b=
its of
PTEs or the PG_Idle flags of pages) can dramatically decrease the overhea=
d by
forgiving some of the quality, compared to the instrumentation based
techniques.  The reduced quality is still reasonable for many of the doma=
ins,
but the overhead can arbitrarily increase as the size of the target workl=
oad
grows.  Miniature-like static region based sampling can set the upperboun=
d of
the overhead, but it will now decrease the quality of the output as the s=
ize of
the workload grows.


Related Works
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

There are a number of researches[1,2,3,4,5,6] optimizing memory managemen=
t
mechanisms based on the actual memory access patterns that shows impressi=
ve
results.  However, most of those has no deep consideration about the moni=
toring
of the accesses itself.  Some of those focused on the overhead of the
monitoring, but does not consider the accuracy scalability[6] or has addi=
tional
dependencies[7].  Indeed, one recent research[5] about the proactive
reclamation has also proposed[8] to the kernel community but the monitori=
ng
overhead was considered a main problem.

[1] Subramanya R Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram=
,
    Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2=
016.
    Data tiering in heterogeneous memory systems. In Proceedings of the 1=
1th
    European Conference on Computer Systems (EuroSys). ACM, 15.
[2] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and =
Emmett
    Witchel. 2016. Coordinated and efficient huge page management with in=
gens.
    In 12th USENIX Symposium on Operating Systems Design and Implementati=
on
    (OSDI).  705=E2=80=93721.
[3] Harald Servat, Antonio J Pe=C3=B1a, Germ=C3=A1n Llort, Estanislao Mer=
cadal,
    HansChristian Hoppe, and Jes=C3=BAs Labarta. 2017. Automating the app=
lication
    data placement in hybrid memory systems. In 2017 IEEE International
    Conference on Cluster Computing (CLUSTER). IEEE, 126=E2=80=93136.
[4] Vlad Nitu, Boris Teabe, Alain Tchana, Canturk Isci, and Daniel Hagimo=
nt.
    2018. Welcome to zombieland: practical and energy-efficient memory
    disaggregation in a datacenter. In Proceedings of the 13th European
    Conference on Computer Systems (EuroSys). ACM, 16.
[5] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Ra=
doslaw
    Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid
    Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and Parthasarathy
    Ranganathan.  2019. Software-Defined Far Memory in Warehouse-Scale
    Computers.  In Proceedings of the 24th International Conference on
    Architectural Support for Programming Languages and Operating Systems
    (ASPLOS).  ACM, New York, NY, USA, 317=E2=80=93330.
    DOI:https://doi.org/10.1145/3297858.3304053
[6] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park.
    2017. Cache Modeling and Optimization using Miniature Simulations. In=
 2017
    USENIX Annual Technical Conference (ATC). USENIX Association, Santa
    Clara, CA, 487=E2=80=93498.
    https://www.usenix.org/conference/atc17/technical-sessions/
[7] Haojie Wang, Jidong Zhai, Xiongchao Tang, Bowen Yu, Xiaosong Ma, and
    Wenguang Chen. 2018. Spindle: Informed Memory Access Monitoring. In 2=
018
    USENIX Annual Technical Conference (ATC). USENIX Association, Boston,=
 MA,
    561=E2=80=93574.  https://www.usenix.org/conference/atc18/presentatio=
n/wang-haojie
[8] Jonathan Corbet. 2019. Proactively reclaiming idle memory. (2019).
    https://lwn.net/Articles/787611/.


Expected Use-cases
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

A straightforward usecase of DAMON would be the program behavior analysis=
.
With the DAMON output, users can confirm whether the program is running a=
s
intended or not.  This will be useful for debuggings and tests of design
points.

The monitored results can also be useful for counting the dynamic working=
 set
size of workloads.  For the administration of memory overcommitted system=
s or
selection of the environments (e.g., containers providing different amoun=
t of
memory) for your workloads, this will be useful.

If you are a programmer, you can optimize your program by managing the me=
mory
based on the actual data access pattern.  For example, you can identify t=
he
dynamic hotness of your data using DAMON and call ``mlock()`` to keep you=
r hot
data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT`` to proactively
reclaim cold data.  Even though your program is guaranteed to not encount=
er
memory pressure, you can still improve the performance by applying the DA=
MON
outputs for call of ``MADV_HUGEPAGE`` and ``MADV_NOHUGEPAGE``.  More crea=
tive
optimizations would be possible.  Our evaluations of DAMON includes a
straightforward optimization using the ``mlock()``.  Please refer to the =
below
Evaluation section for more detail.

As DAMON incurs very low overhead, such optimizations can be applied not =
only
offline, but also online.  Also, there is no reason to limit such optimiz=
ations
to the user space.  Several parts of the kernel's memory management mecha=
nisms
could be also optimized using DAMON. The reclamation, the THP (de)promoti=
on
decisions, and the compaction would be such a candidates.  Nevertheless,
current version of DAMON is not highly optimized for the online/in-kernel=
 uses.


Mechanisms of DAMON
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D


Basic Access Check
------------------

DAMON basically reports what pages are how frequently accessed.  The repo=
rt is
passed to users in binary format via a ``result file`` which users can se=
t it's
path.  Note that the frequency is not an absolute number of accesses, but=
 a
relative frequency among the pages of the target workloads.

Users can also control the resolution of the reports by setting two time
intervals, ``sampling interval`` and ``aggregation interval``.  In detail=
,
DAMON checks access to each page per ``sampling interval``, aggregates th=
e
results (counts the number of the accesses to each page), and reports the
aggregated results per ``aggregation interval``.  For the access check of=
 each
page, DAMON uses the Accessed bits of PTEs.

This is thus similar to the previously mentioned periodic access checks b=
ased
mechanisms, which overhead is increasing as the size of the target proces=
s
grows.


Region Based Sampling
---------------------

To avoid the unbounded increase of the overhead, DAMON groups a number of
adjacent pages that assumed to have same access frequencies into a region=
.  As
long as the assumption (pages in a region have same access frequencies) i=
s
kept, only one page in the region is required to be checked.  Thus, for e=
ach
``sampling interval``, DAMON randomly picks one page in each region and c=
lears
its Accessed bit.  After one more ``sampling interval``, DAMON reads the
Accessed bit of the page and increases the access frequency of the region=
 if
the bit has set meanwhile.  Therefore, the monitoring overhead is control=
lable
by setting the number of regions.  DAMON allows users to set the minimal =
and
maximum number of regions for the trade-off.

Except the assumption, this is almost same with the above-mentioned
miniature-like static region based sampling.  In other words, this scheme
cannot preserve the quality of the output if the assumption is not guaran=
teed.


Adaptive Regions Adjustment
---------------------------

At the beginning of the monitoring, DAMON constructs the initial regions =
by
evenly splitting the memory mapped address space of the process into the
user-specified minimal number of regions.  In this initial state, the
assumption is normally not kept and thus the quality could be low.  To ke=
ep the
assumption as much as possible, DAMON adaptively merges and splits each r=
egion.
For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges those if the frequency difference is small.  =
Then,
after it reports and clears the aggregated access frequency of each regio=
n, it
splits each region into two regions if the total number of regions is sma=
ller
than the half of the user-specified maximum number of regions.

In this way, DAMON provides its best-effort quality and minimal overhead =
while
keeping the bounds users set for their trade-off.


Applying Dynamic Memory Mappings
--------------------------------

Only a number of small parts in the super-huge virtual address space of t=
he
processes is mapped to physical memory and accessed.  Thus, tracking the
unmapped address regions is just wasteful.  However, tracking every memor=
y
mapping change might incur an overhead.  For the reason, DAMON applies th=
e
dynamic memory mapping changes to the tracking regions only for each of a=
n
user-specified time interval (``regions update interval``).


Evaluations
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

A prototype of DAMON has evaluated on an Intel Xeon E7-8837 machine using=
 20
benchmarks that picked from SPEC CPU 2006, NAS, Tensorflow Benchmark,
SPLASH-2X, and PARSEC 3 benchmark suite.  Nonethless, this section provid=
es
only summary of the results.  For more detail, please refer to the slides=
 used
for the introduction of DAMON at the Linux Plumbers Conference 2019[1] or=
 the
MIDDLEWARE'19 industrial track paper[2].


Quality
-------

We first traced and visualized the data access pattern of each workload. =
 We
were able to confirm that the visualized results are reasonably accurate =
by
manually comparing those with the source code of the workloads.

To see the usefulness of the monitoring, we optimized 9 memory intensive
workloads among them for memory pressure situations using the DAMON outpu=
ts.
In detail, we identified frequently accessed memory regions in each workl=
oad
based on the DAMON results and protected them with ``mlock()`` system cal=
ls.
The optimized versions consistently show speedup (2.55x in best case, 1.6=
5x in
average) under memory pressure situation.


Overhead
--------

We also measured the overhead of DAMON.  It was not only under the upperb=
ound
we set, but was much lower (0.6 percent of the bound in best case, 13.288
percent of the bound in average).  This reduction of the overhead is main=
ly
resulted from the adaptive regions adjustment.  We also compared the over=
head
with that of the straightforward periodic Accessed bit check-based monito=
ring,
which checks the access of every page frame.  DAMON's overhead was much s=
maller
than the straightforward mechanism by 94,242.42x in best case, 3,159.61x =
in
average.


References
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

Prototypes of DAMON have introduced by an LPC kernel summit track talk[1]=
 and
two academic papers[2,3].  Please refer to those for more detailed inform=
ation,
especially the evaluations.

[1] SeongJae Park, Tracing Data Access Pattern with Bounded Overhead and
    Best-effort Accuracy. In The Linux Kernel Summit, September 2019.
    https://linuxplumbersconf.org/event/4/contributions/548/
[2] SeongJae Park, Yunjae Lee, Heon Y. Yeom, Profiling Dynamic Data Acces=
s
    Patterns with Controlled Overhead and Quality. In 20th ACM/IFIP
    International Middleware Conference Industry, December 2019.
    https://dl.acm.org/doi/10.1145/3366626.3368125
[3] SeongJae Park, Yunjae Lee, Yunhee Kim, Heon Y. Yeom, Profiling Dynami=
c Data
    Access Patterns with Bounded Overhead and Accuracy. In IEEE Internati=
onal
    Workshop on Foundations and Applications of Self- Systems (FAS 2019),=
 June
    2019.


SeongJae Park (5):
  mm: Introduce Data Access MONitor (DAMON)
  mm/damon: Add debugfs interface
  mm/damon: Add minimal user-space tools
  Documentation/admin-guide/mm: Add a document for DAMON
  mm/damon: Add kunit tests

 .../admin-guide/mm/data_access_monitor.rst    |  235 +++
 Documentation/admin-guide/mm/index.rst        |    1 +
 mm/Kconfig                                    |   23 +
 mm/Makefile                                   |    1 +
 mm/damon-test.h                               |  571 ++++++++
 mm/damon.c                                    | 1266 +++++++++++++++++
 tools/damon/bin2txt.py                        |   64 +
 tools/damon/damn                              |   36 +
 tools/damon/heats.py                          |  358 +++++
 tools/damon/nr_regions.py                     |  116 ++
 tools/damon/record.py                         |  182 +++
 tools/damon/report.py                         |   45 +
 tools/damon/wss.py                            |  121 ++
 13 files changed, 3019 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst
 create mode 100644 mm/damon-test.h
 create mode 100644 mm/damon.c
 create mode 100644 tools/damon/bin2txt.py
 create mode 100644 tools/damon/damn
 create mode 100644 tools/damon/heats.py
 create mode 100644 tools/damon/nr_regions.py
 create mode 100644 tools/damon/record.py
 create mode 100644 tools/damon/report.py
 create mode 100644 tools/damon/wss.py

--=20
2.17.1