[PATCH v3 0/6] mm: reduce the memory footprint of dying memory cgroups

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Roman Gushchin <guroan@gmail.com>
To: linux-mm@kvack.org, kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	Rik van Riel <riel@surriel.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>, Roman Gushchin <guro@fb.com>
Subject: [PATCH v3 0/6] mm: reduce the memory footprint of dying memory cgroups
Date: Wed, 13 Mar 2019 11:39:47 -0700	[thread overview]
Message-ID: <20190313183953.17854-1-guro@fb.com> (raw)

A cgroup can remain in the dying state for a long time, being pinned in the
memory by any kernel object. It can be pinned by a page, shared with other
cgroup (e.g. mlocked by a process in the other cgroup). It can be pinned
by a vfs cache object, etc.

Mostly because of percpu data, the size of a memcg structure in the kernel
memory is quite large. Depending on the machine size and the kernel config,
it can easily reach hundreds of kilobytes per cgroup.

Depending on the memory pressure and the reclaim approach (which is a separate
topic), it looks like several hundreds (if not single thousands) of dying
cgroups is a typical number. On a moderately sized machine the overall memory
footprint is measured in hundreds of megabytes.

So if we can't completely get rid of dying cgroups, let's make them smaller.
This patchset aims to reduce the size of a dying memory cgroup by the premature
release of percpu data during the cgroup removal, and use of atomic counterparts
instead. Currently it covers per-memcg vmstat_percpu, per-memcg per-node
lruvec_stat_cpu. The same approach can be further applied to other percpu data.

Results on my test machine (32 CPUs, singe node):

  With the patchset:              Originally:

  nr_dying_descendants 0
  Slab:              66640 kB	  Slab:              67644 kB
  Percpu:             6912 kB	  Percpu:             6912 kB

  nr_dying_descendants 1000
  Slab:              85912 kB	  Slab:              84704 kB
  Percpu:            26880 kB	  Percpu:            64128 kB

So one dying cgroup went from 75 kB to 39 kB, which is almost twice smaller.
The difference will be even bigger on a bigger machine
(especially, with NUMA).

To test the patchset, I used the following script:
  CG=/sys/fs/cgroup/percpu_test/

  mkdir ${CG}
  echo "+memory" > ${CG}/cgroup.subtree_control

  cat ${CG}/cgroup.stat | grep nr_dying_descendants
  cat /proc/meminfo | grep -e Percpu -e Slab

  for i in `seq 1 1000`; do
      mkdir ${CG}/${i}
      echo $$ > ${CG}/${i}/cgroup.procs
      dd if=/dev/urandom of=/tmp/test-${i} count=1 2> /dev/null
      echo $$ > /sys/fs/cgroup/cgroup.procs
      rmdir ${CG}/${i}
  done

  cat /sys/fs/cgroup/cgroup.stat | grep nr_dying_descendants
  cat /proc/meminfo | grep -e Percpu -e Slab

  rmdir ${CG}

v3:
  - replaced get_cpu_mask() with cpumask_of() (by Johannes)

v2:
  - several renamings suggested by Johannes Weiner
  - added a patch, which merges cpu offlining and percpu flush code

Roman Gushchin (6):
  mm: prepare to premature release of memcg->vmstats_percpu
  mm: prepare to premature release of per-node lruvec_stat_cpu
  mm: release memcg percpu data prematurely
  mm: release per-node memcg percpu data prematurely
  mm: flush memcg percpu stats and events before releasing
  mm: refactor memcg_hotplug_cpu_dead() to use
    memcg_flush_offline_percpu()

 include/linux/memcontrol.h |  66 ++++++++++----
 mm/memcontrol.c            | 179 ++++++++++++++++++++++++++++---------
 2 files changed, 186 insertions(+), 59 deletions(-)

-- 
2.20.1

next             reply	other threads:[~2019-03-13 18:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-13 18:39 Roman Gushchin [this message]
2019-03-13 18:39 ` [PATCH v3 1/6] mm: prepare to premature release of memcg->vmstats_percpu Roman Gushchin
2019-03-13 18:39 ` [PATCH v3 2/6] mm: prepare to premature release of per-node lruvec_stat_cpu Roman Gushchin
2019-03-13 18:39 ` [PATCH v3 3/6] mm: release memcg percpu data prematurely Roman Gushchin
2019-03-13 18:39 ` [PATCH v3 4/6] mm: release per-node " Roman Gushchin
2019-03-13 18:39 ` [PATCH v3 5/6] mm: flush memcg percpu stats and events before releasing Roman Gushchin
2019-03-13 18:39 ` [PATCH v3 6/6] mm: refactor memcg_hotplug_cpu_dead() to use memcg_flush_offline_percpu() Roman Gushchin
2019-03-13 19:48   ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190313183953.17854-1-guro@fb.com \
    --to=guroan@gmail.com \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=riel@surriel.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox