From: 王贇 <yun.wang@linux.alibaba.com>
To: Peter Zijlstra <peterz@infradead.org>,
hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH 0/5] NUMA Balancer Suite
Date: Mon, 22 Apr 2019 10:10:10 +0800 [thread overview]
Message-ID: <209d247e-c1b2-3235-2722-dd7c1f896483@linux.alibaba.com> (raw)
The NUMA Balancing feature always tries to move a task's pages to the node where it
executes most, but it still has issues:
* page cache is not handled
* there is no cgroup-level balancing
Suppose we have a box with 4 CPUs and two cgroups A & B, each running 4 tasks;
the following scenario can easily be observed:
NODE0 | NODE1
|
CPU0 CPU1 | CPU2 CPU3
task_A0 task_A1 | task_A2 task_A3
task_B0 task_B1 | task_B2 task_B3
and usually, when the tasks behave similarly, memory consumption is about equal on each node.
In this case NUMA balancing tries to move the pages of task_A0,1 & task_B0,1 to node 0 and
the pages of task_A2,3 & task_B2,3 to node 1, but page cache ends up placed more or less
randomly, depending on which CPU did the first read/write.
Now consider another scenario:
NODE0 | NODE1
|
CPU0 CPU1 | CPU2 CPU3
task_A0 task_A1 | task_B0 task_B1
task_A2 task_A3 | task_B2 task_B3
By swapping the CPU & memory resources of task_A0,1 and task_B0,1, the workload of cgroup A
now runs entirely on node 0 and that of cgroup B entirely on node 1. Resource consumption is
the same, but related tasks can share a closer CPU cache; the page cache, however, is still
randomly located.
Now what if the workloads generate lots of page cache, and most of the memory accesses are
page cache writes?
Page cache generated by task_A0 on NODE1 will not follow it to NODE0; but if task_A0 had
already been on NODE0 before it read/wrote those files, the cache would be there. So how do
we make sure that happens?
Usually we could solve this problem by binding the workload to a single node: if cgroup A is
bound to CPU0,1, then all the page cache it generates will be on NODE0 and the NUMA bonus is
maximal, for example via the cpuset controller (see the sketch just below).
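A minimal userspace sketch of such a binding, assuming a v1 cpuset cgroup named "A" and that
CPU0,1 plus memory node 0 make up NODE0 (the cgroup name and mount path are hypothetical):

#include <stdio.h>

/* Write a value into a cpuset interface file of the (hypothetical) cgroup "A". */
static int cpuset_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/A/%s", file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	cpuset_write("cpuset.cpus", "0-1");	/* CPUs of NODE0 */
	cpuset_write("cpuset.mems", "0");	/* allocate memory only from node 0 */
	return 0;
}

After this, every page (including page cache) the cgroup touches is first allocated on node 0.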
However, this requires very careful administration of each specific workload; suppose in our
case A & B have CPU requirements that vary anywhere from 0% to 400%, then binding each to a
single node would be a bad idea.
So what we need is a way to detect the memory topology at the cgroup level, and to migrate
CPU/memory resources to the node that already holds most of the cgroup's caches, as long as
resources on that node are plentiful (a rough sketch of that selection idea follows).
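A rough C sketch of the selection idea, with illustrative names that are not the module's
actual symbols:

/* Per-node view of a cgroup: how many of its pages (including page cache)
 * live on this node, and how much free memory the node still has. */
struct cg_node_stat {
	unsigned long pages;
	unsigned long free_pages;
};

/*
 * Pick the node that already holds most of the cgroup's pages, but only if
 * it also has enough free memory to host the whole working set; return -1
 * to keep the current placement.
 */
int pick_preferred_node(const struct cg_node_stat *stat, int nr_nodes,
			unsigned long needed_pages)
{
	int best = -1;
	unsigned long best_pages = 0;

	for (int nid = 0; nid < nr_nodes; nid++) {
		if (stat[nid].pages > best_pages &&
		    stat[nid].free_pages >= needed_pages) {
			best = nid;
			best_pages = stat[nid].pages;
		}
	}
	return best;
}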
This patch set introduces:
* advanced per-cgroup NUMA statistics (one plausible form is sketched after this list)
* a NUMA preferred node feature
* the NUMA Balancer module
Together these allow easy and flexible NUMA resource assignment, to gain as much of the NUMA
bonus as possible.
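For the statistics part, a minimal sketch of what a per-cgroup locality figure could look
like, assuming it is derived from NUMA-balancing hinting faults; the names are illustrative,
not the series' actual symbols:

#include <stdbool.h>

/* Per-cgroup counters of hinting faults that hit a local vs. a remote node. */
struct cg_numa_locality {
	unsigned long local_faults;
	unsigned long remote_faults;
};

/* Conceptually called on each NUMA hinting fault of a task in the cgroup. */
void account_numa_fault(struct cg_numa_locality *st, bool is_local)
{
	if (is_local)
		st->local_faults++;
	else
		st->remote_faults++;
}

/* Locality in percent: 100 means all sampled accesses were node-local. */
unsigned int locality_pct(const struct cg_numa_locality *st)
{
	unsigned long total = st->local_faults + st->remote_faults;

	return total ? (unsigned int)(st->local_faults * 100 / total) : 100;
}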
Michael Wang (5):
numa: introduce per-cgroup numa balancing locality statistic
numa: append per-node execution info in memory.numa_stat
numa: introduce per-cgroup preferred numa node
numa: introduce numa balancer infrastructure
numa: numa balancer
drivers/Makefile | 1 +
drivers/numa/Makefile | 1 +
drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
include/linux/memcontrol.h | 99 ++++++
include/linux/sched.h | 9 +-
kernel/sched/debug.c | 8 +
kernel/sched/fair.c | 41 +++
mm/huge_memory.c | 7 +-
mm/memcontrol.c | 246 +++++++++++++++
mm/memory.c | 9 +-
mm/mempolicy.c | 4 +
11 files changed, 1133 insertions(+), 7 deletions(-)
create mode 100644 drivers/numa/Makefile
create mode 100644 drivers/numa/numa_balancer.c
--
2.14.4.44.g2045bb6
Thread overview: 59+ messages
2019-04-22 2:10 王贇 [this message]
2019-04-22 2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-04-23 8:44 ` Peter Zijlstra
2019-04-23 9:14 ` 王贇
2019-04-23 8:46 ` Peter Zijlstra
2019-04-23 9:32 ` 王贇
2019-04-23 8:47 ` Peter Zijlstra
2019-04-23 9:33 ` 王贇
2019-04-23 9:46 ` Peter Zijlstra
2019-04-22 2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
2019-04-23 8:52 ` Peter Zijlstra
2019-04-23 9:36 ` 王贇
2019-04-23 9:46 ` Peter Zijlstra
2019-04-23 10:01 ` 王贇
2019-04-22 2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
2019-04-23 8:55 ` Peter Zijlstra
2019-04-23 9:41 ` 王贇
2019-04-22 2:14 ` [RFC PATCH 4/5] numa: introduce numa balancer infrastructure 王贇
2019-04-22 2:21 ` [RFC PATCH 5/5] numa: numa balancer 王贇
2019-04-23 9:05 ` Peter Zijlstra
2019-04-23 9:59 ` 王贇
2019-04-22 14:34 ` [RFC PATCH 0/5] NUMA Balancer Suite 禹舟键
2019-04-23 2:14 ` 王贇
2019-07-03 3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
2019-07-03 3:28 ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-07-11 13:43 ` Peter Zijlstra
2019-07-12 3:15 ` 王贇
2019-07-11 13:47 ` Peter Zijlstra
2019-07-12 3:43 ` 王贇
2019-07-12 7:58 ` Peter Zijlstra
2019-07-12 9:11 ` 王贇
2019-07-12 9:42 ` Peter Zijlstra
2019-07-12 10:10 ` 王贇
2019-07-15 2:09 ` 王贇
2019-07-15 12:10 ` Michal Koutný
2019-07-16 2:41 ` 王贇
2019-07-19 16:47 ` Michal Koutný
2019-07-03 3:29 ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
2019-07-11 13:45 ` Peter Zijlstra
2019-07-12 3:17 ` 王贇
2019-07-03 3:32 ` [PATCH 3/4] numa: introduce numa group per task group 王贇
2019-07-11 14:10 ` Peter Zijlstra
2019-07-12 4:03 ` 王贇
2019-07-03 3:34 ` [PATCH 4/4] numa: introduce numa cling feature 王贇
2019-07-08 2:25 ` [PATCH v2 " 王贇
2019-07-11 14:27 ` [PATCH " Peter Zijlstra
2019-07-12 3:10 ` 王贇
2019-07-12 7:53 ` Peter Zijlstra
2019-07-12 8:58 ` 王贇
2019-07-22 3:44 ` 王贇
2019-07-11 9:00 ` [PATCH 0/4] per cgroup numa suite 王贇
2019-07-16 3:38 ` [PATCH v2 0/4] per-cgroup " 王贇
2019-07-16 3:39 ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
2019-07-16 3:40 ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
2019-07-19 16:39 ` Michal Koutný
2019-07-22 2:36 ` 王贇
2019-07-16 3:41 ` [PATCH v2 3/4] numa: introduce numa group per task group 王贇
2019-07-25 2:33 ` [PATCH v2 0/4] per-cgroup numa suite 王贇
2019-08-06 1:33 ` 王贇