From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.5 required=3.0 tests=DKIM_ADSP_ALL,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95785C33CA5 for ; Fri, 10 Jan 2020 13:16:18 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3130C2077C for ; Fri, 10 Jan 2020 13:16:18 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="W6a9TSH4" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3130C2077C Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=amazon.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 9C89C8E000A; Fri, 10 Jan 2020 08:16:11 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 92C848E0001; Fri, 10 Jan 2020 08:16:11 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7F25E8E000A; Fri, 10 Jan 2020 08:16:11 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 663FC8E0001 for ; Fri, 10 Jan 2020 08:16:11 -0500 (EST) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 300B84435 for ; Fri, 10 Jan 2020 13:16:11 +0000 (UTC) X-FDA: 76361772942.08.truck67_18cedd9c2fd2e X-HE-Tag: truck67_18cedd9c2fd2e X-Filterd-Recvd-Size: 18460 Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com [207.171.184.29]) by imf50.hostedemail.com (Postfix) with ESMTP for ; Fri, 10 Jan 2020 13:16:10 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1578662171; x=1610198171; h=from:to:cc:subject:date:message-id:mime-version: content-transfer-encoding; bh=pSUASsSyOliXn6zxjs/WndFpkUwK2Cn3aRxKvF+GEnU=; b=W6a9TSH4YCFOUm4O2ollR/9ei4k43L57Bk9PNpaPTD8vDkMuvDAW1Yfg 1seiEThTAZgqQpKnQ0++y51bWM0nHXNk6O+dMFWhv/Iy/aEdPNZEV3yjn os5Z0kcizkrtQmuuep5p4ko1VnlEcQbyBg2KPvfAH+15KBa5VW2g8Sjsn A=; IronPort-SDR: wvy9v+eZOlLemHmt5+LPqiF/nwnr8jI03lOH6haZzbaKd74biA7bMckcaduhMrIfQGjxJXMXJ9 Z8tOKzAkfR7Q== X-IronPort-AV: E=Sophos;i="5.69,417,1571702400"; d="scan'208";a="17962720" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1e-57e1d233.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 10 Jan 2020 13:15:56 +0000 Received: from EX13MTAUEA002.ant.amazon.com (iad55-ws-svc-p15-lb9-vlan3.iad.amazon.com [10.40.159.166]) by email-inbound-relay-1e-57e1d233.us-east-1.amazon.com (Postfix) with ESMTPS id BC3C91415D7; Fri, 10 Jan 2020 13:15:53 +0000 (UTC) Received: from EX13D31EUA001.ant.amazon.com (10.43.165.15) by EX13MTAUEA002.ant.amazon.com (10.43.61.77) with Microsoft SMTP Server (TLS) id 15.0.1236.3; Fri, 10 Jan 2020 13:15:52 +0000 Received: from u886c93fd17d25d.ant.amazon.com (10.43.161.115) by EX13D31EUA001.ant.amazon.com (10.43.165.15) with Microsoft SMTP Server (TLS) id 15.0.1367.3; Fri, 10 Jan 2020 13:15:47 +0000 From: SeongJae Park To: CC: , , , , , , , SeongJae Park Subject: [RFC PATCH 0/5] Introduce Data Access MONitor (DAMON) Date: Fri, 10 Jan 2020 14:15:17 +0100 Message-ID: <20200110131522.29964-1-sjpark@amazon.com> X-Mailer: git-send-email 2.17.1 MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" X-Originating-IP: [10.43.161.115] X-ClientProxiedBy: EX13D10UWA001.ant.amazon.com (10.43.160.216) To EX13D31EUA001.ant.amazon.com (10.43.165.15) Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: SeongJae Park This RFC patchset introduces a new kernel module for practical monitoring= of data accesses, namely DAMON. The patches are organized in the following sequence. The first and secon= d patch introduces the core logic and the raw level user interface of DAMON= , respectively. To provide a minimal reference to the raw level interfaces= and for more convenient test of the DAMON itself, the third patch implements = an user space wrapper tools for the DAMON. The fourth patch adds a document= for the DAMON, and finally the fifth patch provides DAMON's unit tests, which= is using the kunit framework. The patches are based on the v5.4 plus the back-ported kunit, which retri= eved from v5.5-rc1. You can also clone the complete git tree by: $ git clone git://github.com/sjp38/linux -b damon/rfc/v1 The web is also available: https://github.com/sjp38/linux/releases/tag/damon/rfc/v1 ---- DAMON is a kernel module that allows users to monitor the actual memory a= ccess pattern of specific user-space processes. It aims to be 1) accurate enou= gh to be useful for performance-centric domains, and 2) sufficiently light-weig= ht so that it can be applied online. For the goals, DAMON utilizes its two core mechanisms, called region-base= d sampling and adaptive regions adjustment. The region-based sampling allo= ws users to make their own trade-off between the quality and the overhead of= the monitoring and set the upperbound of the monitoring overhead. Further, t= he adaptive regions adjustment mechanism makes DAMON to maximize the quality= and minimize the overhead with its best efforts while preserving the users configured trade-off. Background =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D For performance-centric analysis and optimizations of memory management s= chemes (either that of kernel space or user space), the actual data access patte= rn of the workloads is highly useful. The information need to be only reasonab= le rather than strictly correct, because some level of incorrectness can be handled in many performance-centric domains. It also need to be taken wi= thin reasonably short time with only light-weight overhead. Manually extracting such data is not easy and time consuming if the targe= t workload is huge and complex, even for the developers of the programs. T= here are a range of tools and techniques developed for general memory access investigations, and some of those could be partially used for this purpos= e. However, most of those are not practical or unscalable, mainly because th= ose are designed with no consideration about the trade-off between the accura= cy of the output and the overhead. The memory access instrumentation techniques which is applied to many too= ls such as Intel PIN is essential for correctness required cases such as inv= alid memory access bug detections. However, those usually incur high overhead= which is unacceptable for many of the performance-centric domains. Periodic ac= cess checks based on H/W or S/W access counting features (e.g., the Accessed b= its of PTEs or the PG_Idle flags of pages) can dramatically decrease the overhea= d by forgiving some of the quality, compared to the instrumentation based techniques. The reduced quality is still reasonable for many of the doma= ins, but the overhead can arbitrarily increase as the size of the target workl= oad grows. Miniature-like static region based sampling can set the upperboun= d of the overhead, but it will now decrease the quality of the output as the s= ize of the workload grows. Related Works =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D There are a number of researches[1,2,3,4,5,6] optimizing memory managemen= t mechanisms based on the actual memory access patterns that shows impressi= ve results. However, most of those has no deep consideration about the moni= toring of the accesses itself. Some of those focused on the overhead of the monitoring, but does not consider the accuracy scalability[6] or has addi= tional dependencies[7]. Indeed, one recent research[5] about the proactive reclamation has also proposed[8] to the kernel community but the monitori= ng overhead was considered a main problem. [1] Subramanya R Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram= , Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2= 016. Data tiering in heterogeneous memory systems. In Proceedings of the 1= 1th European Conference on Computer Systems (EuroSys). ACM, 15. [2] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and = Emmett Witchel. 2016. Coordinated and efficient huge page management with in= gens. In 12th USENIX Symposium on Operating Systems Design and Implementati= on (OSDI). 705=E2=80=93721. [3] Harald Servat, Antonio J Pe=C3=B1a, Germ=C3=A1n Llort, Estanislao Mer= cadal, HansChristian Hoppe, and Jes=C3=BAs Labarta. 2017. Automating the app= lication data placement in hybrid memory systems. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 126=E2=80=93136. [4] Vlad Nitu, Boris Teabe, Alain Tchana, Canturk Isci, and Daniel Hagimo= nt. 2018. Welcome to zombieland: practical and energy-efficient memory disaggregation in a datacenter. In Proceedings of the 13th European Conference on Computer Systems (EuroSys). ACM, 16. [5] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Ra= doslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and Parthasarathy Ranganathan. 2019. Software-Defined Far Memory in Warehouse-Scale Computers. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, New York, NY, USA, 317=E2=80=93330. DOI:https://doi.org/10.1145/3297858.3304053 [6] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park. 2017. Cache Modeling and Optimization using Miniature Simulations. In= 2017 USENIX Annual Technical Conference (ATC). USENIX Association, Santa Clara, CA, 487=E2=80=93498. https://www.usenix.org/conference/atc17/technical-sessions/ [7] Haojie Wang, Jidong Zhai, Xiongchao Tang, Bowen Yu, Xiaosong Ma, and Wenguang Chen. 2018. Spindle: Informed Memory Access Monitoring. In 2= 018 USENIX Annual Technical Conference (ATC). USENIX Association, Boston,= MA, 561=E2=80=93574. https://www.usenix.org/conference/atc18/presentatio= n/wang-haojie [8] Jonathan Corbet. 2019. Proactively reclaiming idle memory. (2019). https://lwn.net/Articles/787611/. Expected Use-cases =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D A straightforward usecase of DAMON would be the program behavior analysis= . With the DAMON output, users can confirm whether the program is running a= s intended or not. This will be useful for debuggings and tests of design points. The monitored results can also be useful for counting the dynamic working= set size of workloads. For the administration of memory overcommitted system= s or selection of the environments (e.g., containers providing different amoun= t of memory) for your workloads, this will be useful. If you are a programmer, you can optimize your program by managing the me= mory based on the actual data access pattern. For example, you can identify t= he dynamic hotness of your data using DAMON and call ``mlock()`` to keep you= r hot data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT`` to proactively reclaim cold data. Even though your program is guaranteed to not encount= er memory pressure, you can still improve the performance by applying the DA= MON outputs for call of ``MADV_HUGEPAGE`` and ``MADV_NOHUGEPAGE``. More crea= tive optimizations would be possible. Our evaluations of DAMON includes a straightforward optimization using the ``mlock()``. Please refer to the = below Evaluation section for more detail. As DAMON incurs very low overhead, such optimizations can be applied not = only offline, but also online. Also, there is no reason to limit such optimiz= ations to the user space. Several parts of the kernel's memory management mecha= nisms could be also optimized using DAMON. The reclamation, the THP (de)promoti= on decisions, and the compaction would be such a candidates. Nevertheless, current version of DAMON is not highly optimized for the online/in-kernel= uses. Mechanisms of DAMON =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D Basic Access Check ------------------ DAMON basically reports what pages are how frequently accessed. The repo= rt is passed to users in binary format via a ``result file`` which users can se= t it's path. Note that the frequency is not an absolute number of accesses, but= a relative frequency among the pages of the target workloads. Users can also control the resolution of the reports by setting two time intervals, ``sampling interval`` and ``aggregation interval``. In detail= , DAMON checks access to each page per ``sampling interval``, aggregates th= e results (counts the number of the accesses to each page), and reports the aggregated results per ``aggregation interval``. For the access check of= each page, DAMON uses the Accessed bits of PTEs. This is thus similar to the previously mentioned periodic access checks b= ased mechanisms, which overhead is increasing as the size of the target proces= s grows. Region Based Sampling --------------------- To avoid the unbounded increase of the overhead, DAMON groups a number of adjacent pages that assumed to have same access frequencies into a region= . As long as the assumption (pages in a region have same access frequencies) i= s kept, only one page in the region is required to be checked. Thus, for e= ach ``sampling interval``, DAMON randomly picks one page in each region and c= lears its Accessed bit. After one more ``sampling interval``, DAMON reads the Accessed bit of the page and increases the access frequency of the region= if the bit has set meanwhile. Therefore, the monitoring overhead is control= lable by setting the number of regions. DAMON allows users to set the minimal = and maximum number of regions for the trade-off. Except the assumption, this is almost same with the above-mentioned miniature-like static region based sampling. In other words, this scheme cannot preserve the quality of the output if the assumption is not guaran= teed. Adaptive Regions Adjustment --------------------------- At the beginning of the monitoring, DAMON constructs the initial regions = by evenly splitting the memory mapped address space of the process into the user-specified minimal number of regions. In this initial state, the assumption is normally not kept and thus the quality could be low. To ke= ep the assumption as much as possible, DAMON adaptively merges and splits each r= egion. For each ``aggregation interval``, it compares the access frequencies of adjacent regions and merges those if the frequency difference is small. = Then, after it reports and clears the aggregated access frequency of each regio= n, it splits each region into two regions if the total number of regions is sma= ller than the half of the user-specified maximum number of regions. In this way, DAMON provides its best-effort quality and minimal overhead = while keeping the bounds users set for their trade-off. Applying Dynamic Memory Mappings -------------------------------- Only a number of small parts in the super-huge virtual address space of t= he processes is mapped to physical memory and accessed. Thus, tracking the unmapped address regions is just wasteful. However, tracking every memor= y mapping change might incur an overhead. For the reason, DAMON applies th= e dynamic memory mapping changes to the tracking regions only for each of a= n user-specified time interval (``regions update interval``). Evaluations =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D A prototype of DAMON has evaluated on an Intel Xeon E7-8837 machine using= 20 benchmarks that picked from SPEC CPU 2006, NAS, Tensorflow Benchmark, SPLASH-2X, and PARSEC 3 benchmark suite. Nonethless, this section provid= es only summary of the results. For more detail, please refer to the slides= used for the introduction of DAMON at the Linux Plumbers Conference 2019[1] or= the MIDDLEWARE'19 industrial track paper[2]. Quality ------- We first traced and visualized the data access pattern of each workload. = We were able to confirm that the visualized results are reasonably accurate = by manually comparing those with the source code of the workloads. To see the usefulness of the monitoring, we optimized 9 memory intensive workloads among them for memory pressure situations using the DAMON outpu= ts. In detail, we identified frequently accessed memory regions in each workl= oad based on the DAMON results and protected them with ``mlock()`` system cal= ls. The optimized versions consistently show speedup (2.55x in best case, 1.6= 5x in average) under memory pressure situation. Overhead -------- We also measured the overhead of DAMON. It was not only under the upperb= ound we set, but was much lower (0.6 percent of the bound in best case, 13.288 percent of the bound in average). This reduction of the overhead is main= ly resulted from the adaptive regions adjustment. We also compared the over= head with that of the straightforward periodic Accessed bit check-based monito= ring, which checks the access of every page frame. DAMON's overhead was much s= maller than the straightforward mechanism by 94,242.42x in best case, 3,159.61x = in average. References =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D Prototypes of DAMON have introduced by an LPC kernel summit track talk[1]= and two academic papers[2,3]. Please refer to those for more detailed inform= ation, especially the evaluations. [1] SeongJae Park, Tracing Data Access Pattern with Bounded Overhead and Best-effort Accuracy. In The Linux Kernel Summit, September 2019. https://linuxplumbersconf.org/event/4/contributions/548/ [2] SeongJae Park, Yunjae Lee, Heon Y. Yeom, Profiling Dynamic Data Acces= s Patterns with Controlled Overhead and Quality. In 20th ACM/IFIP International Middleware Conference Industry, December 2019. https://dl.acm.org/doi/10.1145/3366626.3368125 [3] SeongJae Park, Yunjae Lee, Yunhee Kim, Heon Y. Yeom, Profiling Dynami= c Data Access Patterns with Bounded Overhead and Accuracy. In IEEE Internati= onal Workshop on Foundations and Applications of Self- Systems (FAS 2019),= June 2019. SeongJae Park (5): mm: Introduce Data Access MONitor (DAMON) mm/damon: Add debugfs interface mm/damon: Add minimal user-space tools Documentation/admin-guide/mm: Add a document for DAMON mm/damon: Add kunit tests .../admin-guide/mm/data_access_monitor.rst | 235 +++ Documentation/admin-guide/mm/index.rst | 1 + mm/Kconfig | 23 + mm/Makefile | 1 + mm/damon-test.h | 571 ++++++++ mm/damon.c | 1266 +++++++++++++++++ tools/damon/bin2txt.py | 64 + tools/damon/damn | 36 + tools/damon/heats.py | 358 +++++ tools/damon/nr_regions.py | 116 ++ tools/damon/record.py | 182 +++ tools/damon/report.py | 45 + tools/damon/wss.py | 121 ++ 13 files changed, 3019 insertions(+) create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst create mode 100644 mm/damon-test.h create mode 100644 mm/damon.c create mode 100644 tools/damon/bin2txt.py create mode 100644 tools/damon/damn create mode 100644 tools/damon/heats.py create mode 100644 tools/damon/nr_regions.py create mode 100644 tools/damon/record.py create mode 100644 tools/damon/report.py create mode 100644 tools/damon/wss.py --=20 2.17.1