From: Andres Lagar-Cavilla <andreslc@google.com>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Minchan Kim <minchan@kernel.org>,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
Michel Lespinasse <walken@google.com>,
David Rientjes <rientjes@google.com>,
Pavel Emelyanov <xemul@parallels.com>,
Cyrill Gorcunov <gorcunov@openvz.org>,
Jonathan Corbet <corbet@lwn.net>,
linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking
Date: Tue, 21 Jul 2015 14:39:17 -0700
Message-ID: <CAJu=L5_RJkOhOgvNimi3Vj688w3WDza74_K+pg3oUx4eVK8Bjg@mail.gmail.com>
In-Reply-To: <cover.1437303956.git.vdavydov@parallels.com>

On Sun, Jul 19, 2015 at 5:31 AM, Vladimir Davydov <vdavydov@parallels.com>
wrote:
> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> It is based on top of v4.2-rc2-mmotm-2015-07-15-16-46
> It applies without conflicts to v4.2-rc2-mmotm-2015-07-17-16-04 as well
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change over time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, thereby optimizing the overall
> memory utilization.
>
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> make a better decision when considering whether to migrate the unit to
> another node within the cluster.
>
> Also, as noted by Minchan, this would be useful for per-process reclaim
> (https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
> memory manager could reclaim only idle pages.
>
> ---- USER API ----
>
> The user API consists of two new proc files:
>
> * /proc/kpageidle. This file implements a bitmap where each bit
> corresponds to a page, indexed by PFN. When the bit is set, the
> corresponding page is idle. A page is considered idle if it has not been
> accessed since it was marked idle. To mark a page idle one should set the
> bit corresponding to the page by writing to the file. A value written to
> the file is OR-ed with the current bitmap value. Only user memory pages
> can be marked idle; for other page types, input is silently ignored.
> Writing to this file beyond max PFN results in the ENXIO error. Only
> available when CONFIG_IDLE_PAGE_TRACKING is set.
>
> This file can be used to estimate the amount of pages that are not
> used by a particular workload as follows:
>
> 1. mark all pages of interest idle by setting corresponding bits in the
> /proc/kpageidle bitmap
> 2. wait until the workload accesses its working set
> 3. read /proc/kpageidle and count the number of bits set
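
For illustration, the bitmap arithmetic behind the three steps above can be sketched in Python. This is a minimal sketch, not code from the patch set. It assumes the bitmap is accessed as native-endian 64-bit words, with PFN i at bit i % 64 of word i // 64, which is consistent with the attached script writing 8-byte values. The helper names word_offset_and_bit, idle_mask and count_idle_bits are made up for this example.

```python
import struct

WORD_BITS = 64   # assumption: the bitmap is read and written in 64-bit words
WORD_BYTES = 8


def word_offset_and_bit(pfn):
    """Return (byte offset of the 64-bit word, bit index in it) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def idle_mask(pfns):
    """Build {byte offset: 64-bit mask} covering the given PFNs."""
    masks = {}
    for pfn in pfns:
        off, bit = word_offset_and_bit(pfn)
        masks[off] = masks.get(off, 0) | (1 << bit)
    return masks


def count_idle_bits(buf):
    """Count set bits in a buffer read from the bitmap.

    The buffer length is expected to be a multiple of 8 bytes.
    """
    words = struct.unpack("=%dQ" % (len(buf) // WORD_BYTES), buf)
    return sum(bin(w).count("1") for w in words)
```

To mark pages idle under these assumptions, each mask would be written as an 8-byte value at its offset in /proc/kpageidle. Because written values are OR-ed into the bitmap, bits left at zero do not disturb other pages.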
>
> * /proc/kpagecgroup. This file contains a 64-bit inode number of the
> memory cgroup each page is charged to, indexed by PFN. Only available
> when CONFIG_MEMCG is set.
>
> This file can be used to find all pages (including unmapped file
> pages) accounted to a particular cgroup. Using /proc/kpageidle, one
> can then estimate the cgroup working set size.
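
As a rough sketch of how such a file could be consumed, the Python below looks up the cgroup inode number for a PFN and lists the PFNs in a range charged to a given inode. It assumes one native-endian 64-bit entry per PFN, as the description above suggests. The function names are hypothetical, and the demo at the bottom uses an in-memory stand-in rather than the real /proc/kpagecgroup.

```python
import io
import struct

ENTRY_BYTES = 8  # one 64-bit inode number per PFN, per the description above


def cgroup_ino_of(f, pfn):
    """Read the memcg inode number recorded for one PFN from a
    /proc/kpagecgroup-like binary file object."""
    f.seek(pfn * ENTRY_BYTES)
    data = f.read(ENTRY_BYTES)
    if len(data) < ENTRY_BYTES:
        return None  # past the end of the file
    return struct.unpack("=Q", data)[0]


def pfns_charged_to(f, ino, start_pfn, count):
    """List PFNs in [start_pfn, start_pfn + count) charged to inode `ino`."""
    f.seek(start_pfn * ENTRY_BYTES)
    data = f.read(count * ENTRY_BYTES)
    n = len(data) // ENTRY_BYTES  # a short read yields fewer entries
    inos = struct.unpack("=%dQ" % n, data[:n * ENTRY_BYTES])
    return [start_pfn + i for i, v in enumerate(inos) if v == ino]


if __name__ == "__main__":
    # In-memory stand-in for the proc file: PFNs 1 and 2 belong to inode 42.
    fake = io.BytesIO(struct.pack("=3Q", 0, 42, 42))
    print(pfns_charged_to(fake, 42, 0, 3))
```

The resulting PFN list could then be used to select which bits of /proc/kpageidle to set and later check.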
>
> For an example of using these files to estimate the amount of unused
> memory pages per memory cgroup, please see the script attached
> below.
>
> ---- REASONING ----
>
> The reason to introduce the new user API instead of using
> /proc/PID/{clear_refs,smaps} is that the latter has two serious
> drawbacks:
>
> - it does not count unmapped file pages
> - it affects the reclaimer logic
>
> The new API attempts to overcome them both. For more details on how it
> is achieved, please see the comment to patch 6.
>
> ---- CHANGE LOG ----
>
> Changes in v9:
>
> - add cond_resched to /proc/kpage* read/write loop (Andres)
> - rebase on top of v4.2-rc2-mmotm-2015-07-15-16-46
>

And thanks for the perf report.

This series
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>

> Changes in v8:
>
> - clear referenced/accessed bit in secondary ptes while accessing
> /proc/kpageidle; this is required to estimate wss of KVM VMs (Andres)
> - check the young flag when collapsing a huge page
> - copy idle/young flags on page migration
>
> Changes in v7:
>
> This iteration addresses Andres's comments to v6:
>
> - do not reuse page_referenced for clearing idle flag, introduce a
> separate function instead; this way we won't issue expensive tlb
> flushes on /proc/kpageidle read/write
> - propagate young/idle flags from head to tail pages on thp split
> - skip compound tail pages while reading/writing /proc/kpageidle
> - cleanup page_referenced_one
>
> Changes in v6:
>
> - Split the patch introducing page_cgroup_ino helper to ease review.
> - Rebase on top of v4.1-rc7-mmotm-2015-06-09-16-55
>
> Changes in v5:
>
> - Fix possible race between kpageidle_clear_pte_refs() and
> __page_set_anon_rmap() by checking that a page is on an LRU list
> under zone->lru_lock (Minchan).
> - Export idle flag via /proc/kpageflags (Minchan).
> - Rebase on top of 4.1-rc3.
>
> Changes in v4:
>
> This iteration primarily addresses Minchan's comments to v3:
>
> - Implement /proc/kpageidle as a bitmap instead of using a u64 per page,
> because there do not seem to be any future uses for the other 63 bits.
> - Do not double-increase pra->referenced in page_referenced_one() if the
> page was young and referenced recently.
> - Remove the pointless (page_count == 0) check from kpageidle_get_page().
> - Rename kpageidle_clear_refs() to kpageidle_clear_pte_refs().
> - Improve comments to kpageidle-related functions.
> - Rebase on top of 4.1-rc2.
>
> Note it does not address Minchan's concern of possible
> __page_set_anon_rmap vs page_referenced race (see
> https://lkml.org/lkml/2015/5/3/220) since it is still unclear if this
> race can really happen (see https://lkml.org/lkml/2015/5/4/160)
>
> Changes in v3:
>
> - Enable CONFIG_IDLE_PAGE_TRACKING for 32 bit. Since this feature
> requires two extra page flags and there is no space for them on 32
> bit, page ext is used (thanks to Minchan Kim).
> - Minor code cleanups and comments improved.
> - Rebase on top of 4.1-rc1.
>
> Changes in v2:
>
> - The main difference from v1 is the API change. In v1 the user can
> only set the idle flag for all pages at once, and for clearing the
> Idle flag on pages accessed via page tables /proc/PID/clear_refs
> should be used.
> The main drawback of the v1 approach, as noted by Minchan, is that on
> big machines setting the idle flag for every page can result in CPU
> bursts, which would be especially frustrating if the user only wanted
> to estimate the amount of idle pages for a particular process or VMA.
> With the new API a more fine-grained approach is possible: one can
> read a process's /proc/PID/pagemap and set/check the Idle flag only
> for those pages of the process's address space he or she is
> interested in.
> Another good point about the v2 API is that it is possible to limit
> /proc/kpage* scanning rate when the user wants to estimate the total
> number of idle pages, which is unachievable with the v1 approach.
> - Make /proc/kpagecgroup return the ino of the closest online ancestor
> in case the cgroup a page is charged to is offline.
> - Fix /proc/PID/clear_refs not clearing Young page flag.
> - Rebase on top of v4.0-rc6-mmotm-2015-04-01-14-54
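
The fine-grained approach described for v2 (read a process's /proc/PID/pagemap, then touch only the matching idle bits) depends on turning pagemap entries into PFNs. A minimal Python sketch of that decoding step follows. It relies on the pagemap entry layout in Documentation/vm/pagemap.txt (bit 63 = present, bits 0-54 = PFN). The function names and the default page size are assumptions for illustration, and reading real PFNs typically requires root.

```python
import struct

PM_PFN_MASK = (1 << 55) - 1   # bits 0-54: PFN when the page is present
PM_PRESENT = 1 << 63          # bit 63: page is present in RAM


def pagemap_offset(vaddr, page_size=4096):
    """Byte offset in /proc/PID/pagemap of the entry for a virtual address.

    page_size defaults to 4096 here; a real tool would query it, e.g.
    with os.sysconf("SC_PAGE_SIZE").
    """
    return (vaddr // page_size) * 8


def pfns_from_pagemap(buf):
    """Decode 64-bit pagemap entries, returning the PFNs of present pages."""
    n = len(buf) // 8
    entries = struct.unpack("=%dQ" % n, buf[:n * 8])
    return [e & PM_PFN_MASK for e in entries if e & PM_PRESENT]
```

Only the PFNs returned here would then need their bits set and checked in /proc/kpageidle, rather than the whole bitmap.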
>
> v8: https://lkml.org/lkml/2015/7/15/587
> v7: https://lkml.org/lkml/2015/7/11/119
> v6: https://lkml.org/lkml/2015/6/12/301
> v5: https://lkml.org/lkml/2015/5/12/449
> v4: https://lkml.org/lkml/2015/5/7/580
> v3: https://lkml.org/lkml/2015/4/28/224
> v2: https://lkml.org/lkml/2015/4/7/260
> v1: https://lkml.org/lkml/2015/3/18/794
>
> ---- PATCH SET STRUCTURE ----
>
> The patch set is organized as follows:
>
> - patch 1 adds page_cgroup_ino() helper for the sake of
> /proc/kpagecgroup and patches 2-3 do related cleanup
> - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
> charged to
> - patch 5 introduces a new mmu notifier callback, clear_young, which is
> a lightweight version of clear_flush_young; it is used in patch 6
> - patch 6 implements the idle page tracking feature, including the
> userspace API, /proc/kpageidle
> - patch 7 exports idle flag via /proc/kpageflags
>
> ---- SIMILAR WORKS ----
>
> Originally, the patch for tracking idle memory was proposed back in 2011
> by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
> difference between Michel's patch and this one is that Michel
> implemented a kernel space daemon for estimating idle memory size per
> cgroup while this patch only provides the userspace with the minimal API
> for doing the job, leaving the rest up to the userspace. However, they
> both share the same idea of Idle/Young page flags to avoid affecting the
> reclaimer logic.
>
> ---- PERFORMANCE EVALUATION ----
>
> SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
> performance impact introduced by this patch set. Three runs were carried
> out:
>
> - base: kernel without the patch
> - patched: patched kernel, the feature is not used
> - patched-active: patched kernel, a daemon with a 1-minute period is used
> for tracking idle memory
>
> For tracking idle memory, idlememstat utility was used:
> https://github.com/locker/idlememstat
>
> testcase       base              patched           patched-active
>
> compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
> compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
> crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
> derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
> mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
> scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
> scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
> serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
> startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
> sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
> xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
>
> composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
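
The figures in parentheses appear to be relative differences from the base run. A quick check in Python, assuming that reading (the helper name is made up for this example):

```python
def rel_diff_percent(base, value):
    """Relative difference of `value` from `base`, in percent, 2 decimals."""
    return round((value - base) / base * 100, 2)


# compiler, patched vs base: 532.26 against 537.40
compiler_patched = rel_diff_percent(537.40, 532.26)
# serial, patched vs base: 213.46 against 204.23
serial_patched = rel_diff_percent(204.23, 213.46)
```

Both values agree with the table entries (-0.96 and 4.52) under this interpretation.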
>
> time idlememstat:
>
> 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
> 448inputs+40outputs (1major+36052minor)pagefaults 0swaps
>
> ---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
> #! /usr/bin/python
> #
>
> import os
> import stat
> import errno
> import struct
>
> CGROUP_MOUNT = "/sys/fs/cgroup/memory"
> BUFSIZE = 8 * 1024  # must be multiple of 8
>
>
> def get_hugepage_size():
>     with open("/proc/meminfo", "r") as f:
>         for s in f:
>             k, v = s.split(":")
>             if k == "Hugepagesize":
>                 return int(v.split()[0]) * 1024
>
> PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
> HUGEPAGE_SIZE = get_hugepage_size()
>
>
> def set_idle():
>     f = open("/proc/kpageidle", "wb", BUFSIZE)
>     while True:
>         try:
>             f.write(struct.pack("Q", pow(2, 64) - 1))
>         except IOError as err:
>             if err.errno == errno.ENXIO:
>                 break
>             raise
>     f.close()
>
>
> def count_idle():
>     f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
>     f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
>
>     with open("/proc/kpageidle", "rb", BUFSIZE) as f:
>         while f.read(BUFSIZE): pass  # update idle flag
>
>     idlememsz = {}
>     while True:
>         s1, s2 = f_flags.read(8), f_cgroup.read(8)
>         if not s1 or not s2:
>             break
>
>         flags, = struct.unpack('Q', s1)
>         cgino, = struct.unpack('Q', s2)
>
>         unevictable = (flags >> 18) & 1
>         huge = (flags >> 22) & 1
>         idle = (flags >> 25) & 1
>
>         if idle and not unevictable:
>             idlememsz[cgino] = idlememsz.get(cgino, 0) + \
>                 (HUGEPAGE_SIZE if huge else PAGE_SIZE)
>
>     f_flags.close()
>     f_cgroup.close()
>     return idlememsz
>
>
> if __name__ == "__main__":
>     print "Setting the idle flag for each page..."
>     set_idle()
>
>     raw_input("Wait until the workload accesses its working set, "
>               "then press Enter")
>
>     print "Counting idle pages..."
>     idlememsz = count_idle()
>
>     for dir, subdirs, files in os.walk(CGROUP_MOUNT):
>         ino = os.stat(dir)[stat.ST_INO]
>         print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
> ---- END SCRIPT ----
>
> Comments are more than welcome.
>
> Thanks,
>
> Vladimir Davydov (8):
> memcg: add page_cgroup_ino helper
> hwpoison: use page_cgroup_ino for filtering by memcg
> memcg: zap try_get_mem_cgroup_from_page
> proc: add kpagecgroup file
> mmu-notifier: add clear_young callback
> proc: add kpageidle file
> proc: export idle flag via kpageflags
> proc: add cond_resched to /proc/kpage* read/write loop
>
> Documentation/vm/pagemap.txt | 22 ++-
> fs/proc/page.c | 282 +++++++++++++++++++++++++++++++++
> fs/proc/task_mmu.c | 4 +-
> include/linux/memcontrol.h | 10 +-
> include/linux/mm.h | 98 ++++++++++++
> include/linux/mmu_notifier.h | 44 +++++
> include/linux/page-flags.h | 11 ++
> include/linux/page_ext.h | 4 +
> include/uapi/linux/kernel-page-flags.h | 1 +
> mm/Kconfig | 12 ++
> mm/debug.c | 4 +
> mm/huge_memory.c | 11 +-
> mm/hwpoison-inject.c | 5 +-
> mm/memcontrol.c | 71 ++++-----
> mm/memory-failure.c | 16 +-
> mm/migrate.c | 5 +
> mm/mmu_notifier.c | 17 ++
> mm/page_ext.c | 3 +
> mm/rmap.c | 5 +
> mm/swap.c | 2 +
> virt/kvm/kvm_main.c | 18 +++
> 21 files changed, 579 insertions(+), 66 deletions(-)
>
> --
> 2.1.4
>
>
--
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com