linux-mm.kvack.org archive mirror
From: Andres Lagar-Cavilla <andreslc@google.com>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Minchan Kim <minchan@kernel.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking
Date: Tue, 21 Jul 2015 14:39:17 -0700
Message-ID: <CAJu=L5_RJkOhOgvNimi3Vj688w3WDza74_K+pg3oUx4eVK8Bjg@mail.gmail.com>
In-Reply-To: <cover.1437303956.git.vdavydov@parallels.com>

On Sun, Jul 19, 2015 at 5:31 AM, Vladimir Davydov <vdavydov@parallels.com> wrote:

> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> It is based on top of v4.2-rc2-mmotm-2015-07-15-16-46
> It applies without conflicts to v4.2-rc2-mmotm-2015-07-17-16-04 as well
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change over time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> make a better decision when considering migrating the unit to another
> node within the cluster.
>
> Also, as noted by Minchan, this would be useful for per-process reclaim
> (https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
> memory manager could reclaim only idle pages.
>
> ---- USER API ----
>
> The user API consists of two new proc files:
>
>  * /proc/kpageidle.  This file implements a bitmap where each bit
>    corresponds to a page, indexed by PFN. When the bit is set, the
>    corresponding page is idle. A page is considered idle if it has not
>    been accessed since it was marked idle. To mark a page idle one should
>    set the bit corresponding to the page by writing to the file. A value
>    written to the file is OR-ed with the current bitmap value. Only user
>    memory pages can be marked idle; for other page types input is
>    silently ignored. Writing to this file beyond max PFN results in the
>    ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.
>
>    This file can be used to estimate the amount of pages that are not
>    used by a particular workload as follows:
>
>    1. mark all pages of interest idle by setting corresponding bits in the
>       /proc/kpageidle bitmap
>    2. wait until the workload accesses its working set
>    3. read /proc/kpageidle and count the number of bits set
>
>  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
>    memory cgroup each page is charged to, indexed by PFN. Only available
>    when CONFIG_MEMCG is set.
>
>    This file can be used to find all pages (including unmapped file
>    pages) accounted to a particular cgroup. Using /proc/kpageidle, one
>    can then estimate the cgroup working set size.
>
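
A minimal sketch of how the contents of the two files described above might
be decoded once they have been read into memory. It assumes the bitmap is
laid out as native-endian 8-byte words, with PFN i at bit i % 64 of word
i / 64 (suggested by the 8-byte writes in the script further down, not
stated in the cover letter), and that kpagecgroup holds one native-endian
u64 per PFN. The function names are made up for illustration.

```python
import struct
from collections import Counter

WORD_BYTES = 8              # assumption: bitmap is read/written in 8-byte words
WORD_BITS = WORD_BYTES * 8


def pfn_to_bit(pfn):
    """Return (byte offset of the word holding pfn, bit index inside it)."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def idle_pfns(buf, first_pfn=0):
    """Yield PFNs whose idle bit is set in buf (bytes read from the bitmap).

    first_pfn is the PFN of bit 0 of buf and should be a multiple of 64.
    """
    usable = len(buf) - len(buf) % WORD_BYTES   # ignore a trailing partial word
    for off in range(0, usable, WORD_BYTES):
        (word,) = struct.unpack_from("=Q", buf, off)
        for bit in range(WORD_BITS):
            if (word >> bit) & 1:
                yield first_pfn + off * 8 + bit


def pages_per_cgroup(buf):
    """Count pages per memcg inode in buf (bytes read from kpagecgroup)."""
    n = len(buf) // 8
    return Counter(struct.unpack_from("=%dQ" % n, buf))
```

The helpers only work on byte strings, so they can be tried without root;
reading the real files needs a kernel with this series applied and the
config options named above.
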
> For an example of using these files for estimating the amount of unused
> memory pages per each memory cgroup, please see the script attached
> below.
>
> ---- REASONING ----
>
> The reason to introduce the new user API instead of using
> /proc/PID/{clear_refs,smaps} is that the latter has two serious
> drawbacks:
>
>  - it does not count unmapped file pages
>  - it affects the reclaimer logic
>
> The new API attempts to overcome them both. For more details on how it
> is achieved, please see the comment to patch 6.
>
> ---- CHANGE LOG ----
>
> Changes in v9:
>
>  - add cond_resched to /proc/kpage* read/write loop (Andres)
>  - rebase on top of v4.2-rc2-mmotm-2015-07-15-16-46
>

And thanks for the perf report.

This series
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>


> Changes in v8:
>
>  - clear referenced/accessed bit in secondary ptes while accessing
>    /proc/kpageidle; this is required to estimate wss of KVM VMs (Andres)
>  - check the young flag when collapsing a huge page
>  - copy idle/young flags on page migration
>
> Changes in v7:
>
> This iteration addresses Andres's comments to v6:
>
>  - do not reuse page_referenced for clearing idle flag, introduce a
>    separate function instead; this way we won't issue expensive tlb
>    flushes on /proc/kpageidle read/write
>  - propagate young/idle flags from head to tail pages on thp split
>  - skip compound tail pages while reading/writing /proc/kpageidle
>  - cleanup page_referenced_one
>
> Changes in v6:
>
>  - Split the patch introducing page_cgroup_ino helper to ease review.
>  - Rebase on top of v4.1-rc7-mmotm-2015-06-09-16-55
>
> Changes in v5:
>
>  - Fix possible race between kpageidle_clear_pte_refs() and
>    __page_set_anon_rmap() by checking that a page is on an LRU list
>    under zone->lru_lock (Minchan).
>  - Export idle flag via /proc/kpageflags (Minchan).
>  - Rebase on top of 4.1-rc3.
>
> Changes in v4:
>
> This iteration primarily addresses Minchan's comments to v3:
>
>  - Implement /proc/kpageidle as a bitmap instead of using u64 per each
>    page, because there does not seem to be any future uses for the other
>    63 bits.
>  - Do not double-increase pra->referenced in page_referenced_one() if the
>    page was young and referenced recently.
>  - Remove the pointless (page_count == 0) check from kpageidle_get_page().
>  - Rename kpageidle_clear_refs() to kpageidle_clear_pte_refs().
>  - Improve comments to kpageidle-related functions.
>  - Rebase on top of 4.1-rc2.
>
> Note it does not address Minchan's concern of possible
> __page_set_anon_rmap vs page_referenced race (see
> https://lkml.org/lkml/2015/5/3/220) since it is still unclear if this
> race can really happen (see https://lkml.org/lkml/2015/5/4/160)
>
> Changes in v3:
>
>  - Enable CONFIG_IDLE_PAGE_TRACKING for 32 bit. Since this feature
>    requires two extra page flags and there is no space for them on 32
>    bit, page ext is used (thanks to Minchan Kim).
>  - Minor code cleanups and comments improved.
>  - Rebase on top of 4.1-rc1.
>
> Changes in v2:
>
>  - The main difference from v1 is the API change. In v1 the user can
>    only set the idle flag for all pages at once, and for clearing the
>    Idle flag on pages accessed via page tables /proc/PID/clear_refs
>    should be used.
>    The main drawback of the v1 approach, as noted by Minchan, is that on
>    big machines setting the idle flag for each page can result in CPU
>    bursts, which would be especially frustrating if the user only wanted
>    to estimate the number of idle pages for a particular process or VMA.
>    With the new API a more fine-grained approach is possible: one can
>    read a process's /proc/PID/pagemap and set/check the Idle flag only
>    for those pages of the process's address space he or she is
>    interested in.
>    Another good point about the v2 API is that it is possible to limit
>    /proc/kpage* scanning rate when the user wants to estimate the total
>    number of idle pages, which is unachievable with the v1 approach.
>  - Make /proc/kpagecgroup return the ino of the closest online ancestor
>    in case the cgroup a page is charged to is offline.
>  - Fix /proc/PID/clear_refs not clearing Young page flag.
>  - Rebase on top of v4.0-rc6-mmotm-2015-04-01-14-54
>
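
The finer-grained approach mentioned in the v2 notes above starts from
/proc/PID/pagemap. A rough sketch of that first step, assuming the entry
layout documented in Documentation/vm/pagemap.txt (one 64-bit entry per
virtual page, PFN in bits 0-54, bit 63 set when the page is present in
RAM). The helper names and the default page size are illustrative only.

```python
import struct

PM_ENTRY_BYTES = 8
PM_PRESENT = 1 << 63            # bit 63: page present in RAM
PM_PFN_MASK = (1 << 55) - 1     # bits 0-54: PFN when present


def pagemap_offset(vaddr, page_size=4096):
    """File offset in /proc/PID/pagemap of the entry covering vaddr."""
    return (vaddr // page_size) * PM_ENTRY_BYTES


def entry_to_pfn(entry):
    """Return the PFN stored in a pagemap entry, or None if not present."""
    if not entry & PM_PRESENT:
        return None
    return entry & PM_PFN_MASK


def present_pfns(pagemap_bytes):
    """Decode a run of pagemap entries into the PFNs of present pages."""
    n = len(pagemap_bytes) // PM_ENTRY_BYTES
    entries = struct.unpack_from("=%dQ" % n, pagemap_bytes)
    return [pfn for pfn in map(entry_to_pfn, entries) if pfn is not None]
```

The resulting PFNs would then select which bits to set and later check in
/proc/kpageidle, instead of touching the bitmap for every page in the
system.
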
> v8: https://lkml.org/lkml/2015/7/15/587
> v7: https://lkml.org/lkml/2015/7/11/119
> v6: https://lkml.org/lkml/2015/6/12/301
> v5: https://lkml.org/lkml/2015/5/12/449
> v4: https://lkml.org/lkml/2015/5/7/580
> v3: https://lkml.org/lkml/2015/4/28/224
> v2: https://lkml.org/lkml/2015/4/7/260
> v1: https://lkml.org/lkml/2015/3/18/794
>
> ---- PATCH SET STRUCTURE ----
>
> The patch set is organized as follows:
>
>  - patch 1 adds page_cgroup_ino() helper for the sake of
>    /proc/kpagecgroup and patches 2-3 do related cleanup
>  - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
>    charged to
>  - patch 5 introduces a new mmu notifier callback, clear_young, which is
>    a lightweight version of clear_flush_young; it is used in patch 6
>  - patch 6 implements the idle page tracking feature, including the
>    userspace API, /proc/kpageidle
>  - patch 7 exports idle flag via /proc/kpageflags
>  - patch 8 adds cond_resched to the /proc/kpage* read/write loop
>
> ---- SIMILAR WORKS ----
>
> Originally, the patch for tracking idle memory was proposed back in 2011
> by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
> difference between Michel's patch and this one is that Michel
> implemented a kernel space daemon for estimating idle memory size per
> cgroup while this patch only provides the userspace with the minimal API
> for doing the job, leaving the rest up to the userspace. However, they
> both share the same idea of Idle/Young page flags to avoid affecting the
> reclaimer logic.
>
> ---- PERFORMANCE EVALUATION ----
>
> SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
> performance impact introduced by this patch set. Three runs were carried
> out:
>
>  - base: kernel without the patch
>  - patched: patched kernel, the feature is not used
>  - patched-active: patched kernel, 1 minute-period daemon is used for
>    tracking idle memory
>
> For tracking idle memory, idlememstat utility was used:
> https://github.com/locker/idlememstat
>
> testcase            base            patched        patched-active
>
> compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
> compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
> crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
> derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
> mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
> scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
> scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
> serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
> startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
> sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
> xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
>
> composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
>
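
The figures in parentheses look like the relative change from the base
run, in percent. That reading is an interpretation, since the cover letter
does not define the column, but it reproduces the table values. A small
check, with a made-up function name:

```python
def rel_change_percent(base, value):
    """Relative change of value against base, in percent."""
    return (value - base) / base * 100.0


# "compiler" row: base 537.40, patched 532.26
compiler_patched = round(rel_change_percent(537.40, 532.26), 2)
# "serial" row: base 204.23, patched 213.46
serial_patched = round(rel_change_percent(204.23, 213.46), 2)
```
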
> time idlememstat:
>
> 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
> 448inputs+40outputs (1major+36052minor)pagefaults 0swaps
>
> ---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
> #! /usr/bin/python
> #
>
> import os
> import stat
> import errno
> import struct
>
> CGROUP_MOUNT = "/sys/fs/cgroup/memory"
> BUFSIZE = 8 * 1024  # must be multiple of 8
>
>
> def get_hugepage_size():
>     with open("/proc/meminfo", "r") as f:
>         for s in f:
>             k, v = s.split(":")
>             if k == "Hugepagesize":
>                 return int(v.split()[0]) * 1024
>
> PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
> HUGEPAGE_SIZE = get_hugepage_size()
>
>
> def set_idle():
>     f = open("/proc/kpageidle", "wb", BUFSIZE)
>     while True:
>         try:
>             f.write(struct.pack("Q", pow(2, 64) - 1))
>         except IOError as err:
>             if err.errno == errno.ENXIO:
>                 break
>             raise
>     f.close()
>
>
> def count_idle():
>     f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
>     f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
>
>     with open("/proc/kpageidle", "rb", BUFSIZE) as f:
>         while f.read(BUFSIZE): pass  # update idle flag
>
>     idlememsz = {}
>     while True:
>         s1, s2 = f_flags.read(8), f_cgroup.read(8)
>         if not s1 or not s2:
>             break
>
>         flags, = struct.unpack('Q', s1)
>         cgino, = struct.unpack('Q', s2)
>
>         unevictable = (flags >> 18) & 1
>         huge = (flags >> 22) & 1
>         idle = (flags >> 25) & 1
>
>         if idle and not unevictable:
>             idlememsz[cgino] = idlememsz.get(cgino, 0) + \
>                 (HUGEPAGE_SIZE if huge else PAGE_SIZE)
>
>     f_flags.close()
>     f_cgroup.close()
>     return idlememsz
>
>
> if __name__ == "__main__":
>     print "Setting the idle flag for each page..."
>     set_idle()
>
>     raw_input("Wait until the workload accesses its working set, "
>               "then press Enter")
>
>     print "Counting idle pages..."
>     idlememsz = count_idle()
>
>     for dir, subdirs, files in os.walk(CGROUP_MOUNT):
>         ino = os.stat(dir)[stat.ST_INO]
>         print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
> ---- END SCRIPT ----
>
> Comments are more than welcome.
>
> Thanks,
>
> Vladimir Davydov (8):
>   memcg: add page_cgroup_ino helper
>   hwpoison: use page_cgroup_ino for filtering by memcg
>   memcg: zap try_get_mem_cgroup_from_page
>   proc: add kpagecgroup file
>   mmu-notifier: add clear_young callback
>   proc: add kpageidle file
>   proc: export idle flag via kpageflags
>   proc: add cond_resched to /proc/kpage* read/write loop
>
>  Documentation/vm/pagemap.txt           |  22 ++-
>  fs/proc/page.c                         | 282 +++++++++++++++++++++++++++++++++
>  fs/proc/task_mmu.c                     |   4 +-
>  include/linux/memcontrol.h             |  10 +-
>  include/linux/mm.h                     |  98 ++++++++++++
>  include/linux/mmu_notifier.h           |  44 +++++
>  include/linux/page-flags.h             |  11 ++
>  include/linux/page_ext.h               |   4 +
>  include/uapi/linux/kernel-page-flags.h |   1 +
>  mm/Kconfig                             |  12 ++
>  mm/debug.c                             |   4 +
>  mm/huge_memory.c                       |  11 +-
>  mm/hwpoison-inject.c                   |   5 +-
>  mm/memcontrol.c                        |  71 ++++-----
>  mm/memory-failure.c                    |  16 +-
>  mm/migrate.c                           |   5 +
>  mm/mmu_notifier.c                      |  17 ++
>  mm/page_ext.c                          |   3 +
>  mm/rmap.c                              |   5 +
>  mm/swap.c                              |   2 +
>  virt/kvm/kvm_main.c                    |  18 +++
>  21 files changed, 579 insertions(+), 66 deletions(-)
>
> --
> 2.1.4
>
>


-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com
