* [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
@ 2024-09-20 22:11 kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Currently in Linux, there is no concept of fairness in memory tiering. Depending
on the memory usage and access patterns of other colocated applications, an
application cannot be sure how much memory it will get in each tier, or how
much its performance will suffer or benefit.
Fairness is, however, important in a multi-tenant system. For example, an
application may need to meet a certain tail latency requirement, which can be
difficult to satisfy without a certain amount of frequently accessed pages in top-tier
memory. Similarly, an application may want to declare a minimum throughput when
running on a system for capacity planning purposes, but without fairness
controls in memory tiering its throughput can fluctuate wildly as other
applications come and go on the system.
In this proposal, we amend the memory.low control in memcg to protect a cgroup’s
memory usage in top-tier memory. A low protection for top-tier memory is scaled
proportionally to the ratio of top-tier memory to total memory on the system.
The protection is then applied during reclaim of top-tier memory. Promotion by
NUMA balancing is also throttled through a reduced scan window when top-tier
memory is contended and the cgroup is over its protection.
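As a rough sketch of that scaling (illustration only: the helper name is
hypothetical, it hardcodes the same node IDs as the patches do per the caveat
below, and the actual change lives in patch 2's memory_low_write()):

static unsigned long scale_low_to_locallow(unsigned long low)
{
	struct sysinfo si;
	unsigned long local_capacity, total_capacity;

	/* Hardcoded: node 0 is the top tier, node 1 the CXL-backed node. */
	si_meminfo_node(&si, 0);
	local_capacity = si.totalram;		/* in pages */
	total_capacity = local_capacity;
	si_meminfo_node(&si, 1);
	total_capacity += si.totalram;

	/* locallow is memory.low scaled by the top tier's share of capacity. */
	return low * local_capacity / total_capacity;
}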
Experiments we did with microbenchmarks exhibiting a range of memory access
patterns and memory sizes confirmed that when top-tier memory is contended, the
system moves towards a stable memory distribution where each cgroup’s memory
usage in local DRAM converges to the protected amounts.
One notable missing part in the patches is determining which NUMA nodes have
top-tier memory; currently they use hardcoded node 0 for top-tier memory and
node 1 for a CPU-less node backed by CXL memory. We’re working on removing
this artifact and correctly applying the protection to top-tier nodes in the system.
Your feedback is greatly appreciated!
Kaiyang Zhao (4):
Add get_cgroup_local_usage for estimating the top-tier memory usage
calculate memory.low for the local node and track its usage
use memory.low local node protection for local node reclaim
reduce NUMA balancing scan size of cgroups over their local memory.low
include/linux/memcontrol.h | 25 ++++++++-----
include/linux/page_counter.h | 16 ++++++---
kernel/sched/fair.c | 54 +++++++++++++++++++++++++---
mm/hugetlb_cgroup.c | 4 +--
mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++------
mm/page_counter.c | 52 +++++++++++++++++++++------
mm/vmscan.c | 19 +++++++---
7 files changed, 192 insertions(+), 46 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Approximate the usage of top-tier memory of a cgroup by its anon,
file, shmem and slab sizes in the top-tier.
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..94aba4498fca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -648,6 +648,8 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
memcg == target;
}
+unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
+
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f19a58c252f0..20b715441332 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -855,6 +855,30 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
return READ_ONCE(memcg->vmstats->events_local[i]);
}
+/* Usage is in pages. */
+unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush)
+{
+ struct lruvec *lruvec;
+ const int local_nid = 0;
+
+ if (!memcg)
+ return 0;
+
+ if (flush)
+ mem_cgroup_flush_stats_ratelimited(memcg);
+
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(local_nid));
+ unsigned long anon = lruvec_page_state(lruvec, NR_ANON_MAPPED);
+ unsigned long file = lruvec_page_state(lruvec, NR_FILE_PAGES);
+ unsigned long shmem = lruvec_page_state(lruvec, NR_SHMEM);
+ /* Slab sizes are in bytes */
+ unsigned long slab =
+ lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE_B) / PAGE_SIZE
+ + lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE_B) / PAGE_SIZE;
+
+ return anon + file + shmem + slab;
+}
+
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
/*
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-09-22 8:39 ` kernel test robot
2024-10-15 22:05 ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
` (2 subsequent siblings)
4 siblings, 2 replies; 9+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Add a memory.low for the top-tier node (locallow) and track its usage.
locallow is set by scaling low by the ratio of node 0 capacity and
node 0 + node 1 capacity.
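As a usage sketch only (not part of this patch; the cgroup path
/sys/fs/cgroup/test is an assumption for illustration), userspace writes
memory.low as before and can read back the derived, read-only
memory.locallow:

#include <stdio.h>

int main(void)
{
	const char *dir = "/sys/fs/cgroup/test";	/* assumed cgroup */
	char path[256], buf[64];
	FILE *f;

	/* Set the overall low protection; the kernel derives locallow. */
	snprintf(path, sizeof(path), "%s/memory.low", dir);
	f = fopen(path, "w");
	if (!f)
		return 1;
	fputs("1073741824\n", f);	/* 1G */
	fclose(f);

	/* Read the scaled top-tier protection added by this patch. */
	snprintf(path, sizeof(path), "%s/memory.locallow", dir);
	f = fopen(path, "r");
	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("memory.locallow: %s", buf);
	fclose(f);
	return 0;
}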
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
include/linux/page_counter.h | 16 ++++++++---
mm/hugetlb_cgroup.c | 4 +--
mm/memcontrol.c | 42 ++++++++++++++++++++++-------
mm/page_counter.c | 52 ++++++++++++++++++++++++++++--------
4 files changed, 88 insertions(+), 26 deletions(-)
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 79dbd8bc35a7..aa56c93415ef 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -13,6 +13,7 @@ struct page_counter {
* memcg->memory.usage is a hot member of struct mem_cgroup.
*/
atomic_long_t usage;
+ struct mem_cgroup *memcg; /* memcg that owns this counter */
CACHELINE_PADDING(_pad1_);
/* effective memory.min and memory.min usage tracking */
@@ -25,6 +26,10 @@ struct page_counter {
atomic_long_t low_usage;
atomic_long_t children_low_usage;
+ unsigned long elocallow;
+ atomic_long_t locallow_usage;
+ atomic_long_t children_locallow_usage;
+
unsigned long watermark;
/* Latest cg2 reset watermark */
unsigned long local_watermark;
@@ -36,6 +41,7 @@ struct page_counter {
bool protection_support;
unsigned long min;
unsigned long low;
+ unsigned long locallow;
unsigned long high;
unsigned long max;
struct page_counter *parent;
@@ -52,12 +58,13 @@ struct page_counter {
*/
static inline void page_counter_init(struct page_counter *counter,
struct page_counter *parent,
- bool protection_support)
+ bool protection_support, struct mem_cgroup *memcg)
{
counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
counter->max = PAGE_COUNTER_MAX;
counter->parent = parent;
counter->protection_support = protection_support;
+ counter->memcg = memcg;
}
static inline unsigned long page_counter_read(struct page_counter *counter)
@@ -72,7 +79,8 @@ bool page_counter_try_charge(struct page_counter *counter,
struct page_counter **fail);
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
-void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
+ unsigned long nr_pages_local);
static inline void page_counter_set_high(struct page_counter *counter,
unsigned long nr_pages)
@@ -99,11 +107,11 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
#ifdef CONFIG_MEMCG
void page_counter_calculate_protection(struct page_counter *root,
struct page_counter *counter,
- bool recursive_protection);
+ bool recursive_protection, int is_local);
#else
static inline void page_counter_calculate_protection(struct page_counter *root,
struct page_counter *counter,
- bool recursive_protection) {}
+ bool recursive_protection, int is_local) {}
#endif
#endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index d8d0e665caed..0e07a7a1d5b8 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
}
page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
idx),
- fault_parent, false);
+ fault_parent, false, NULL);
page_counter_init(
hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
- rsvd_parent, false);
+ rsvd_parent, false, NULL);
limit = round_down(PAGE_COUNTER_MAX,
pages_per_huge_page(&hstates[idx]));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20b715441332..d7c5fff12105 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1497,6 +1497,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
vm_event_name(memcg_vm_event_stat[i]),
memcg_events(memcg, memcg_vm_event_stat[i]));
}
+
+ seq_buf_printf(s, "local_usage %lu\n",
+ get_cgroup_local_usage(memcg, true));
}
static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
@@ -3597,8 +3600,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
- page_counter_init(&memcg->memory, &parent->memory, true);
- page_counter_init(&memcg->swap, &parent->swap, false);
+ page_counter_init(&memcg->memory, &parent->memory, true, memcg);
+ page_counter_init(&memcg->swap, &parent->swap, false, NULL);
#ifdef CONFIG_MEMCG_V1
WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
page_counter_init(&memcg->kmem, &parent->kmem, false);
@@ -3607,8 +3610,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
} else {
init_memcg_stats();
init_memcg_events();
- page_counter_init(&memcg->memory, NULL, true);
- page_counter_init(&memcg->swap, NULL, false);
+ page_counter_init(&memcg->memory, NULL, true, memcg);
+ page_counter_init(&memcg->swap, NULL, false, NULL);
#ifdef CONFIG_MEMCG_V1
page_counter_init(&memcg->kmem, NULL, false);
page_counter_init(&memcg->tcpmem, NULL, false);
@@ -3677,7 +3680,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
memcg1_css_offline(memcg);
page_counter_set_min(&memcg->memory, 0);
- page_counter_set_low(&memcg->memory, 0);
+ page_counter_set_low(&memcg->memory, 0, 0);
zswap_memcg_offline_cleanup(memcg);
@@ -3748,7 +3751,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
#endif
page_counter_set_min(&memcg->memory, 0);
- page_counter_set_low(&memcg->memory, 0);
+ page_counter_set_low(&memcg->memory, 0, 0);
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
memcg1_soft_limit_reset(memcg);
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
@@ -4051,6 +4054,12 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_locallow_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->memory.locallow));
+}
+
static int memory_low_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -4061,7 +4070,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
- unsigned long low;
+ struct sysinfo si;
+ unsigned long low, locallow, local_capacity, total_capacity;
int err;
buf = strstrip(buf);
@@ -4069,7 +4079,15 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
if (err)
return err;
- page_counter_set_low(&memcg->memory, low);
+ /* Hardcoded 0 for local node and 1 for remote. */
+ si_meminfo_node(&si, 0);
+ local_capacity = si.totalram; /* In pages. */
+ total_capacity = local_capacity;
+ si_meminfo_node(&si, 1);
+ total_capacity += si.totalram;
+ locallow = low * local_capacity / total_capacity;
+
+ page_counter_set_low(&memcg->memory, low, locallow);
return nbytes;
}
@@ -4394,6 +4412,11 @@ static struct cftype memory_files[] = {
.seq_show = memory_low_show,
.write = memory_low_write,
},
+ {
+ .name = "locallow",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_locallow_show,
+ },
{
.name = "high",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -4483,7 +4506,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
if (!root)
root = root_mem_cgroup;
- page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+ page_counter_calculate_protection(&root->memory, &memcg->memory,
+ recursive_protection, false);
}
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b249d15af9dd..97205aafab46 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -18,8 +18,10 @@ static bool track_protection(struct page_counter *c)
return c->protection_support;
}
+extern unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
+
static void propagate_protected_usage(struct page_counter *c,
- unsigned long usage)
+ unsigned long usage, unsigned long local_usage)
{
unsigned long protected, old_protected;
long delta;
@@ -44,6 +46,15 @@ static void propagate_protected_usage(struct page_counter *c,
if (delta)
atomic_long_add(delta, &c->parent->children_low_usage);
}
+
+ protected = min(local_usage, READ_ONCE(c->locallow));
+ old_protected = atomic_long_read(&c->locallow_usage);
+ if (protected != old_protected) {
+ old_protected = atomic_long_xchg(&c->locallow_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_locallow_usage);
+ }
}
/**
@@ -63,7 +74,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
atomic_long_set(&counter->usage, new);
}
if (track_protection(counter))
- propagate_protected_usage(counter, new);
+ propagate_protected_usage(counter, new,
+ get_cgroup_local_usage(counter->memcg, false));
}
/**
@@ -83,7 +95,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
new = atomic_long_add_return(nr_pages, &c->usage);
if (protection)
- propagate_protected_usage(c, new);
+ propagate_protected_usage(c, new,
+ get_cgroup_local_usage(counter->memcg, false));
/*
* This is indeed racy, but we can live with some
* inaccuracy in the watermark.
@@ -151,7 +164,8 @@ bool page_counter_try_charge(struct page_counter *counter,
goto failed;
}
if (protection)
- propagate_protected_usage(c, new);
+ propagate_protected_usage(c, new,
+ get_cgroup_local_usage(counter->memcg, false));
/* see comment on page_counter_charge */
if (new > READ_ONCE(c->local_watermark)) {
@@ -238,7 +252,8 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
WRITE_ONCE(counter->min, nr_pages);
for (c = counter; c; c = c->parent)
- propagate_protected_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage),
+ get_cgroup_local_usage(counter->memcg, false));
}
/**
@@ -248,14 +263,17 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
*
* The caller must serialize invocations on the same counter.
*/
-void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
+ unsigned long nr_pages_local)
{
struct page_counter *c;
WRITE_ONCE(counter->low, nr_pages);
+ WRITE_ONCE(counter->locallow, nr_pages_local);
for (c = counter; c; c = c->parent)
- propagate_protected_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage),
+ get_cgroup_local_usage(counter->memcg, false));
}
/**
@@ -421,9 +439,9 @@ static unsigned long effective_protection(unsigned long usage,
*/
void page_counter_calculate_protection(struct page_counter *root,
struct page_counter *counter,
- bool recursive_protection)
+ bool recursive_protection, int is_local)
{
- unsigned long usage, parent_usage;
+ unsigned long usage, parent_usage, local_usage, parent_local_usage;
struct page_counter *parent = counter->parent;
/*
@@ -437,16 +455,19 @@ void page_counter_calculate_protection(struct page_counter *root,
return;
usage = page_counter_read(counter);
- if (!usage)
+ local_usage = get_cgroup_local_usage(counter->memcg, true);
+ if (!usage || !local_usage)
return;
if (parent == root) {
counter->emin = READ_ONCE(counter->min);
counter->elow = READ_ONCE(counter->low);
+ counter->elocallow = READ_ONCE(counter->locallow);
return;
}
parent_usage = page_counter_read(parent);
+ parent_local_usage = get_cgroup_local_usage(parent->memcg, true);
WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
READ_ONCE(counter->min),
@@ -454,7 +475,16 @@ void page_counter_calculate_protection(struct page_counter *root,
atomic_long_read(&parent->children_min_usage),
recursive_protection));
- WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
+ if (is_local)
+ WRITE_ONCE(counter->elocallow,
+ effective_protection(local_usage, parent_local_usage,
+ READ_ONCE(counter->locallow),
+ READ_ONCE(parent->elocallow),
+ atomic_long_read(&parent->children_locallow_usage),
+ recursive_protection));
+ else
+ WRITE_ONCE(counter->elow,
+ effective_protection(usage, parent_usage,
READ_ONCE(counter->low),
READ_ONCE(parent->elow),
atomic_long_read(&parent->children_low_usage),
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-10-15 21:52 ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
2024-11-08 19:01 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
4 siblings, 1 reply; 9+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
When reclaim targets the top-tier node usage by the root memcg,
apply local memory.low protection instead of global protection.
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
include/linux/memcontrol.h | 23 ++++++++++++++---------
mm/memcontrol.c | 4 ++--
mm/vmscan.c | 19 ++++++++++++++-----
3 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 94aba4498fca..256912b91922 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -586,9 +586,9 @@ static inline bool mem_cgroup_disabled(void)
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
- unsigned long *low)
+ unsigned long *low, unsigned long *locallow)
{
- *min = *low = 0;
+ *min = *low = *locallow = 0;
if (mem_cgroup_disabled())
return;
@@ -631,10 +631,11 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
*min = READ_ONCE(memcg->memory.emin);
*low = READ_ONCE(memcg->memory.elow);
+ *locallow = READ_ONCE(memcg->memory.elocallow);
}
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
- struct mem_cgroup *memcg);
+ struct mem_cgroup *memcg, int is_local);
static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
struct mem_cgroup *memcg)
@@ -651,13 +652,17 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
if (mem_cgroup_unprotected(target, memcg))
return false;
- return READ_ONCE(memcg->memory.elow) >=
- page_counter_read(&memcg->memory);
+ if (is_local)
+ return READ_ONCE(memcg->memory.elocallow) >=
+ get_cgroup_local_usage(memcg, true);
+ else
+ return READ_ONCE(memcg->memory.elow) >=
+ page_counter_read(&memcg->memory);
}
static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
@@ -1159,13 +1164,13 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
- unsigned long *low)
+ unsigned long *low, unsigned long *locallow)
{
*min = *low = 0;
}
static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
}
@@ -1175,7 +1180,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
return true;
}
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
return false;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7c5fff12105..61718ba998fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4495,7 +4495,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
* of a top-down tree iteration, not for isolated queries.
*/
void mem_cgroup_calculate_protection(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg, int is_local)
{
bool recursive_protection =
cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
@@ -4507,7 +4507,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
root = root_mem_cgroup;
page_counter_calculate_protection(&root->memory, &memcg->memory,
- recursive_protection, false);
+ recursive_protection, is_local);
}
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce471d686a88..a2681d52fc5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2377,6 +2377,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
enum scan_balance scan_balance;
unsigned long ap, fp;
enum lru_list lru;
+ int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
/* If we have no swap space, do not bother scanning anon folios. */
if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
@@ -2457,12 +2458,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
for_each_evictable_lru(lru) {
bool file = is_file_lru(lru);
unsigned long lruvec_size;
- unsigned long low, min;
+ unsigned long low, min, locallow;
unsigned long scan;
lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
mem_cgroup_protection(sc->target_mem_cgroup, memcg,
- &min, &low);
+ &min, &low, &locallow);
+ if (is_local)
+ low = locallow;
if (min || low) {
/*
@@ -2494,7 +2497,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
* again by how much of the total memory used is under
* hard protection.
*/
- unsigned long cgroup_size = mem_cgroup_size(memcg);
+ unsigned long cgroup_size;
+
+ if (is_local)
+ cgroup_size = get_cgroup_local_usage(memcg, true);
+ else
+ cgroup_size = mem_cgroup_size(memcg);
unsigned long protection;
/* memory.low scaling, make sure we retry before OOM */
@@ -5869,6 +5877,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
};
struct mem_cgroup_reclaim_cookie *partial = &reclaim;
struct mem_cgroup *memcg;
+ int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
/*
* In most cases, direct reclaimers can do partial walks
@@ -5896,7 +5905,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
*/
cond_resched();
- mem_cgroup_calculate_protection(target_memcg, memcg);
+ mem_cgroup_calculate_protection(target_memcg, memcg, is_local);
if (mem_cgroup_below_min(target_memcg, memcg)) {
/*
@@ -5904,7 +5913,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
* If there is no reclaimable memory, OOM.
*/
continue;
- } else if (mem_cgroup_below_low(target_memcg, memcg)) {
+ } else if (mem_cgroup_below_low(target_memcg, memcg, is_local)) {
/*
* Soft protection.
* Respect the protection only as long as
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
` (2 preceding siblings ...)
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
@ 2024-09-20 22:11 ` kaiyang2
2024-11-08 19:01 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
4 siblings, 0 replies; 9+ messages in thread
From: kaiyang2 @ 2024-09-20 22:11 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
When the top-tier node has less free memory than the promotion watermark,
reduce the scan size of cgroups that are over their local memory.low
proportional to their overage. In this case, the top-tier memory usage
of the cgroup should be reduced, and demotion is working towards the
goal. A smaller scan size should cause a slower rate of promotion for
the cgroup so that it does not work against demotion.
A minimum of 1/16th of sysctl_numa_balancing_scan_size is still allowed
for such cgroups because identifying hot pages trapped in the slow tier is
still a worthy goal in this case (although a secondary objective).
The divisor of 16 is arbitrary and may need tuning.
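For reference, a standalone sketch of the throttling arithmetic described
above (the helper name is illustrative only; the real logic is inlined in
task_numa_work() in the diff below):

static long throttled_scan_pages(long pages, unsigned long local_usage,
				 unsigned long locallow)
{
	long min_scan_pages = pages / 16;	/* 1/16th floor */
	int i;

	/* Only throttle cgroups over their local protection. */
	if (local_usage <= locallow)
		return pages;

	/* Scale by roughly (locallow / usage)^4. */
	for (i = 0; i < 4; i++)
		pages = pages * locallow / (local_usage + 1);

	return max(pages, min_scan_pages);
}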
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1b756f927b2..1737b2369f56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1727,14 +1727,21 @@ static inline bool cpupid_valid(int cpupid)
* advantage of fast memory capacity, all recently accessed slow
* memory pages will be migrated to fast memory node without
* considering hot threshold.
+ * This is also used for detecting memory pressure and deciding whether
+ * limiting the promotion scan size is needed, for which we don't require
+ * more free pages than the promo watermark.
*/
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+static bool pgdat_free_space_enough(struct pglist_data *pgdat,
+ bool require_extra)
{
int z;
unsigned long enough_wmark;
- enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
- pgdat->node_present_pages >> 4);
+ if (require_extra)
+ enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+ pgdat->node_present_pages >> 4);
+ else
+ enough_wmark = 0;
for (z = pgdat->nr_zones - 1; z >= 0; z--) {
struct zone *zone = pgdat->node_zones + z;
@@ -1846,7 +1853,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
unsigned int latency, th, def_th;
pgdat = NODE_DATA(dst_nid);
- if (pgdat_free_space_enough(pgdat)) {
+ if (pgdat_free_space_enough(pgdat, true)) {
/* workload changed, reset hot threshold */
pgdat->nbp_threshold = 0;
return true;
@@ -3214,10 +3221,14 @@ static void task_numa_work(struct callback_head *work)
struct vm_area_struct *vma;
unsigned long start, end;
unsigned long nr_pte_updates = 0;
- long pages, virtpages;
+ long pages, virtpages, min_scan_pages;
struct vma_iterator vmi;
bool vma_pids_skipped;
bool vma_pids_forced = false;
+ struct pglist_data *pgdat = NODE_DATA(0); /* hardcoded node 0 */
+ struct mem_cgroup *memcg;
+ unsigned long cgroup_size, cgroup_locallow;
+ const long min_scan_pages_fraction = 16; /* 1/16th of the scan size */
SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
@@ -3262,6 +3273,39 @@ static void task_numa_work(struct callback_head *work)
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+
+ min_scan_pages = pages;
+ min_scan_pages /= min_scan_pages_fraction;
+
+ memcg = get_mem_cgroup_from_current();
+ /*
+ * Reduce the scan size when the local node is under pressure
+ * (WMARK_PROMO is not satisfied),
+ * proportional to a cgroup's overage of local memory guarantee.
+ * 10% over: 68% of scan size
+ * 20% over: 48% of scan size
+ * 50% over: 20% of scan size
+ * 100% over: 6% of scan size
+ */
+ if (likely(memcg)) {
+ if (!pgdat_free_space_enough(pgdat, false)) {
+ cgroup_size = get_cgroup_local_usage(memcg, false);
+ /*
+ * Protection needs refreshing, but reclaim on the cgroup
+ * should have refreshed recently.
+ */
+ cgroup_locallow = READ_ONCE(memcg->memory.elocallow);
+ if (cgroup_size > cgroup_locallow) {
+ /* 1/x^4 */
+ for (int i = 0; i < 4; i++)
+ pages = pages * cgroup_locallow / (cgroup_size + 1);
+ /* Lower bound to min_scan_pages. */
+ pages = max(pages, min_scan_pages);
+ }
+ }
+ css_put(&memcg->css);
+ }
+
virtpages = pages * 8; /* Scan up to this much virtual space */
if (!pages)
return;
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
@ 2024-09-22 8:39 ` kernel test robot
2024-10-15 22:05 ` Gregory Price
1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2024-09-22 8:39 UTC (permalink / raw)
To: kaiyang2
Cc: oe-lkp, lkp, linux-kernel, linux-mm, cgroups, roman.gushchin,
shakeel.butt, muchun.song, akpm, mhocko, nehagholkar, abhishekd,
hannes, weixugc, rientjes, Kaiyang Zhao, oliver.sang
Hello,
kernel test robot noticed "BUG:kernel_NULL_pointer_dereference,address" on:
commit: 6f4c005a5f8b8ff1ce674731545b302af5f28f3f ("[RFC PATCH 2/4] calculate memory.low for the local node and track its usage")
url: https://github.com/intel-lab-lkp/linux/commits/kaiyang2-cs-cmu-edu/Add-get_cgroup_local_usage-for-estimating-the-top-tier-memory-usage/20240921-061404
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20240920221202.1734227-3-kaiyang2@cs.cmu.edu/
patch subject: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
in testcase: boot
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+---------------------------------------------+------------+------------+
| | 0af685cc17 | 6f4c005a5f |
+---------------------------------------------+------------+------------+
| boot_successes | 12 | 0 |
| boot_failures | 0 | 12 |
| BUG:kernel_NULL_pointer_dereference,address | 0 | 12 |
| Oops | 0 | 12 |
| RIP:si_meminfo_node | 0 | 12 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 12 |
+---------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202409221625.1e974ac-oliver.sang@intel.com
[ 14.204830][ T1] BUG: kernel NULL pointer dereference, address: 0000000000000090
[ 14.206729][ T1] #PF: supervisor read access in kernel mode
[ 14.208090][ T1] #PF: error_code(0x0000) - not-present page
[ 14.209393][ T1] PGD 0 P4D 0
[ 14.210212][ T1] Oops: Oops: 0000 [#1] SMP PTI
[ 14.211269][ T1] CPU: 1 UID: 0 PID: 1 Comm: systemd Not tainted 6.11.0-rc6-00570-g6f4c005a5f8b #1
[ 14.213284][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 14.215290][ T1] RIP: 0010:si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[ 14.216523][ T1] Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 48 63 c6 55 31 d2 4c 8b 04 c5 c0 a7 fb 8c 53 48 89 c5 48 89 fb 4c 89 c0 49 8d b8 00 1e 00 00 <48> 8b 88 90 00 00 00 48 05 00 06 00 00 48 01 ca 48 39 f8 75 eb 48
All code
========
0: 90 nop
1: 90 nop
2: 66 0f 1f 00 nopw (%rax)
6: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
b: 48 63 c6 movslq %esi,%rax
e: 55 push %rbp
f: 31 d2 xor %edx,%edx
11: 4c 8b 04 c5 c0 a7 fb mov -0x73045840(,%rax,8),%r8
18: 8c
19: 53 push %rbx
1a: 48 89 c5 mov %rax,%rbp
1d: 48 89 fb mov %rdi,%rbx
20: 4c 89 c0 mov %r8,%rax
23: 49 8d b8 00 1e 00 00 lea 0x1e00(%r8),%rdi
2a:* 48 8b 88 90 00 00 00 mov 0x90(%rax),%rcx <-- trapping instruction
31: 48 05 00 06 00 00 add $0x600,%rax
37: 48 01 ca add %rcx,%rdx
3a: 48 39 f8 cmp %rdi,%rax
3d: 75 eb jne 0x2a
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 8b 88 90 00 00 00 mov 0x90(%rax),%rcx
7: 48 05 00 06 00 00 add $0x600,%rax
d: 48 01 ca add %rcx,%rdx
10: 48 39 f8 cmp %rdi,%rax
13: 75 eb jne 0x0
15: 48 rex.W
[ 14.220364][ T1] RSP: 0018:ffffb14b40013d68 EFLAGS: 00010246
[ 14.221717][ T1] RAX: 0000000000000000 RBX: ffffb14b40013d88 RCX: 00000000003a19a2
[ 14.223496][ T1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000001e00
[ 14.225170][ T1] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000008
[ 14.226964][ T1] R10: 0000000000000008 R11: 0fffffffffffffff R12: ffffb14b40013d88
[ 14.228774][ T1] R13: 00000000003e7ac3 R14: ffffb14b40013e88 R15: ffff98ab0434f7a0
[ 14.230421][ T1] FS: 00007f9569ae9940(0000) GS:ffff98adefd00000(0000) knlGS:0000000000000000
[ 14.234569][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.235900][ T1] CR2: 0000000000000090 CR3: 0000000100072000 CR4: 00000000000006f0
[ 14.237620][ T1] Call Trace:
[ 14.238502][ T1] <TASK>
[ 14.239254][ T1] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[ 14.240189][ T1] ? page_fault_oops (arch/x86/mm/fault.c:715)
[ 14.241254][ T1] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
[ 14.242297][ T1] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:623)
[ 14.243313][ T1] ? si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[ 14.244443][ T1] ? si_meminfo_node (mm/show_mem.c:114)
[ 14.245460][ T1] memory_low_write (mm/memcontrol.c:4088)
[ 14.246547][ T1] kernfs_fop_write_iter (fs/kernfs/file.c:338)
[ 14.247804][ T1] vfs_write (fs/read_write.c:497 fs/read_write.c:590)
[ 14.248830][ T1] ksys_write (fs/read_write.c:643)
[ 14.249783][ T1] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
[ 14.250800][ T1] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 14.252260][ T1] RIP: 0033:0x7f956a64b240
[ 14.253276][ T1] Code: 40 00 48 8b 15 c1 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
All code
========
0: 40 00 48 8b add %cl,-0x75(%rax)
4: 15 c1 9b 0d 00 adc $0xd9bc1,%eax
9: f7 d8 neg %eax
b: 64 89 02 mov %eax,%fs:(%rdx)
e: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
15: eb b7 jmp 0xffffffffffffffce
17: 0f 1f 00 nopl (%rax)
1a: 80 3d a1 23 0e 00 00 cmpb $0x0,0xe23a1(%rip) # 0xe23c2
21: 74 17 je 0x3a
23: b8 01 00 00 00 mov $0x1,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 58 ja 0x8a
32: c3 retq
33: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
3a: 48 83 ec 28 sub $0x28,%rsp
3e: 48 rex.W
3f: 89 .byte 0x89
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 58 ja 0x60
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10: 48 83 ec 28 sub $0x28,%rsp
14: 48 rex.W
15: 89 .byte 0x89
[ 14.257195][ T1] RSP: 002b:00007ffcc66594e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 14.259009][ T1] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f956a64b240
[ 14.260848][ T1] RDX: 0000000000000002 RSI: 00007ffcc6659740 RDI: 000000000000001b
[ 14.262500][ T1] RBP: 00007ffcc6659740 R08: 0000000000000000 R09: 0000000000000001
[ 14.264147][ T1] R10: 00007f956a6c4820 R11: 0000000000000202 R12: 0000000000000002
[ 14.265934][ T1] R13: 000055fd63872c10 R14: 0000000000000002 R15: 00007f956a7219e0
[ 14.267589][ T1] </TASK>
[ 14.268340][ T1] Modules linked in: ip_tables
[ 14.269410][ T1] CR2: 0000000000000090
[ 14.270478][ T1] ---[ end trace 0000000000000000 ]---
[ 14.271717][ T1] RIP: 0010:si_meminfo_node (arch/x86/include/asm/atomic64_64.h:15 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:2583 (discriminator 3) include/linux/atomic/atomic-long.h:38 (discriminator 3) include/linux/atomic/atomic-instrumented.h:3189 (discriminator 3) include/linux/mmzone.h:1042 (discriminator 3) mm/show_mem.c:98 (discriminator 3))
[ 14.272874][ T1] Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 48 63 c6 55 31 d2 4c 8b 04 c5 c0 a7 fb 8c 53 48 89 c5 48 89 fb 4c 89 c0 49 8d b8 00 1e 00 00 <48> 8b 88 90 00 00 00 48 05 00 06 00 00 48 01 ca 48 39 f8 75 eb 48
All code
========
0: 90 nop
1: 90 nop
2: 66 0f 1f 00 nopw (%rax)
6: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
b: 48 63 c6 movslq %esi,%rax
e: 55 push %rbp
f: 31 d2 xor %edx,%edx
11: 4c 8b 04 c5 c0 a7 fb mov -0x73045840(,%rax,8),%r8
18: 8c
19: 53 push %rbx
1a: 48 89 c5 mov %rax,%rbp
1d: 48 89 fb mov %rdi,%rbx
20: 4c 89 c0 mov %r8,%rax
23: 49 8d b8 00 1e 00 00 lea 0x1e00(%r8),%rdi
2a:* 48 8b 88 90 00 00 00 mov 0x90(%rax),%rcx <-- trapping instruction
31: 48 05 00 06 00 00 add $0x600,%rax
37: 48 01 ca add %rcx,%rdx
3a: 48 39 f8 cmp %rdi,%rax
3d: 75 eb jne 0x2a
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 8b 88 90 00 00 00 mov 0x90(%rax),%rcx
7: 48 05 00 06 00 00 add $0x600,%rax
d: 48 01 ca add %rcx,%rdx
10: 48 39 f8 cmp %rdi,%rax
13: 75 eb jne 0x0
15: 48 rex.W
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240922/202409221625.1e974ac-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 3/4] use memory.low local node protection for local node reclaim
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
@ 2024-10-15 21:52 ` Gregory Price
0 siblings, 0 replies; 9+ messages in thread
From: Gregory Price @ 2024-10-15 21:52 UTC (permalink / raw)
To: kaiyang2
Cc: linux-mm, cgroups, roman.gushchin, shakeel.butt, muchun.song,
akpm, mhocko, nehagholkar, abhishekd, hannes, weixugc, rientjes
On Fri, Sep 20, 2024 at 10:11:50PM +0000, kaiyang2@cs.cmu.edu wrote:
> From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
>
> When reclaim targets the top-tier node usage by the root memcg,
> apply local memory.low protection instead of global protection.
>
Changelog probably needs a little more context about the intended
effect of this change. What exactly is the implication of this
change compared to applying it against elow?
> Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> ---
> include/linux/memcontrol.h | 23 ++++++++++++++---------
> mm/memcontrol.c | 4 ++--
> mm/vmscan.c | 19 ++++++++++++++-----
> 3 files changed, 30 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 94aba4498fca..256912b91922 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -586,9 +586,9 @@ static inline bool mem_cgroup_disabled(void)
> static inline void mem_cgroup_protection(struct mem_cgroup *root,
> struct mem_cgroup *memcg,
> unsigned long *min,
> - unsigned long *low)
> + unsigned long *low, unsigned long *locallow)
> {
> - *min = *low = 0;
> + *min = *low = *locallow = 0;
>
"locallow" can be read as "loc allow" or "local low", probably you
want to change all the references to local_low.
Sorry for not saying this on earlier feedback.
> if (mem_cgroup_disabled())
> return;
> @@ -631,10 +631,11 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
>
> *min = READ_ONCE(memcg->memory.emin);
> *low = READ_ONCE(memcg->memory.elow);
> + *locallow = READ_ONCE(memcg->memory.elocallow);
> }
>
> void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> - struct mem_cgroup *memcg);
> + struct mem_cgroup *memcg, int is_local);
>
> static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> struct mem_cgroup *memcg)
> @@ -651,13 +652,17 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
>
> static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> if (mem_cgroup_unprotected(target, memcg))
> return false;
>
> - return READ_ONCE(memcg->memory.elow) >=
> - page_counter_read(&memcg->memory);
> + if (is_local)
> + return READ_ONCE(memcg->memory.elocallow) >=
> + get_cgroup_local_usage(memcg, true);
> + else
> + return READ_ONCE(memcg->memory.elow) >=
> + page_counter_read(&memcg->memory);
Don't need the else case here since the if block returns.
> }
>
> static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> @@ -1159,13 +1164,13 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> static inline void mem_cgroup_protection(struct mem_cgroup *root,
> struct mem_cgroup *memcg,
> unsigned long *min,
> - unsigned long *low)
> + unsigned long *low, unsigned long *locallow)
> {
> *min = *low = 0;
> }
>
> static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> }
>
> @@ -1175,7 +1180,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
> return true;
> }
> static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> return false;
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7c5fff12105..61718ba998fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4495,7 +4495,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
> * of a top-down tree iteration, not for isolated queries.
> */
> void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> - struct mem_cgroup *memcg)
> + struct mem_cgroup *memcg, int is_local)
> {
> bool recursive_protection =
> cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> @@ -4507,7 +4507,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> root = root_mem_cgroup;
>
> page_counter_calculate_protection(&root->memory, &memcg->memory,
> - recursive_protection, false);
> + recursive_protection, is_local);
> }
>
> static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ce471d686a88..a2681d52fc5f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2377,6 +2377,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> enum scan_balance scan_balance;
> unsigned long ap, fp;
> enum lru_list lru;
> + int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
int should be bool to be more explicit as to what the valid values are.
Should be addressed across the patch set.
>
> /* If we have no swap space, do not bother scanning anon folios. */
> if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
> @@ -2457,12 +2458,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> for_each_evictable_lru(lru) {
> bool file = is_file_lru(lru);
> unsigned long lruvec_size;
> - unsigned long low, min;
> + unsigned long low, min, locallow;
> unsigned long scan;
>
> lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> - &min, &low);
> + &min, &low, &locallow);
> + if (is_local)
> + low = locallow;
>
> if (min || low) {
> /*
> @@ -2494,7 +2497,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> * again by how much of the total memory used is under
> * hard protection.
> */
> - unsigned long cgroup_size = mem_cgroup_size(memcg);
> + unsigned long cgroup_size;
> +
> + if (is_local)
> + cgroup_size = get_cgroup_local_usage(memcg, true);
> + else
> + cgroup_size = mem_cgroup_size(memcg);
> unsigned long protection;
>
> /* memory.low scaling, make sure we retry before OOM */
> @@ -5869,6 +5877,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> };
> struct mem_cgroup_reclaim_cookie *partial = &reclaim;
> struct mem_cgroup *memcg;
> + int is_local = (pgdat->node_id == 0) && root_reclaim(sc);
>
> /*
> * In most cases, direct reclaimers can do partial walks
> @@ -5896,7 +5905,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> */
> cond_resched();
>
> - mem_cgroup_calculate_protection(target_memcg, memcg);
> + mem_cgroup_calculate_protection(target_memcg, memcg, is_local);
>
> if (mem_cgroup_below_min(target_memcg, memcg)) {
> /*
> @@ -5904,7 +5913,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> * If there is no reclaimable memory, OOM.
> */
> continue;
> - } else if (mem_cgroup_below_low(target_memcg, memcg)) {
> + } else if (mem_cgroup_below_low(target_memcg, memcg, is_local)) {
> /*
> * Soft protection.
> * Respect the protection only as long as
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 2/4] calculate memory.low for the local node and track its usage
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
2024-09-22 8:39 ` kernel test robot
@ 2024-10-15 22:05 ` Gregory Price
1 sibling, 0 replies; 9+ messages in thread
From: Gregory Price @ 2024-10-15 22:05 UTC (permalink / raw)
To: kaiyang2
Cc: linux-mm, cgroups, roman.gushchin, shakeel.butt, muchun.song,
akpm, mhocko, nehagholkar, abhishekd, hannes, weixugc, rientjes
On Fri, Sep 20, 2024 at 10:11:49PM +0000, kaiyang2@cs.cmu.edu wrote:
> From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
>
> Add a memory.low for the top-tier node (locallow) and track its usage.
> locallow is set by scaling low by the ratio of node 0 capacity and
> node 0 + node 1 capacity.
>
> Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
> ---
> include/linux/page_counter.h | 16 ++++++++---
> mm/hugetlb_cgroup.c | 4 +--
> mm/memcontrol.c | 42 ++++++++++++++++++++++-------
> mm/page_counter.c | 52 ++++++++++++++++++++++++++++--------
> 4 files changed, 88 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 79dbd8bc35a7..aa56c93415ef 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -13,6 +13,7 @@ struct page_counter {
> * memcg->memory.usage is a hot member of struct mem_cgroup.
> */
> atomic_long_t usage;
> + struct mem_cgroup *memcg; /* memcg that owns this counter */
Can you make some comments on the lifetime of this new memcg reference?
How is it referenced, how is it cleaned up, etc.
Probably it's worth adding this in a separate patch so it's easier
to review the reference tracking.
> CACHELINE_PADDING(_pad1_);
>
> /* effective memory.min and memory.min usage tracking */
> @@ -25,6 +26,10 @@ struct page_counter {
> atomic_long_t low_usage;
> atomic_long_t children_low_usage;
>
> + unsigned long elocallow;
> + atomic_long_t locallow_usage;
per note on other email - probably want local_low_* instead of locallow.
> + atomic_long_t children_locallow_usage;
> +
> unsigned long watermark;
> /* Latest cg2 reset watermark */
> unsigned long local_watermark;
> @@ -36,6 +41,7 @@ struct page_counter {
> bool protection_support;
> unsigned long min;
> unsigned long low;
> + unsigned long locallow;
> unsigned long high;
> unsigned long max;
> struct page_counter *parent;
> @@ -52,12 +58,13 @@ struct page_counter {
> */
> static inline void page_counter_init(struct page_counter *counter,
> struct page_counter *parent,
> - bool protection_support)
> + bool protection_support, struct mem_cgroup *memcg)
> {
> counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
> counter->max = PAGE_COUNTER_MAX;
> counter->parent = parent;
> counter->protection_support = protection_support;
> + counter->memcg = memcg;
> }
>
> static inline unsigned long page_counter_read(struct page_counter *counter)
> @@ -72,7 +79,8 @@ bool page_counter_try_charge(struct page_counter *counter,
> struct page_counter **fail);
> void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
> -void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
> +void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
> + unsigned long nr_pages_local);
>
> static inline void page_counter_set_high(struct page_counter *counter,
> unsigned long nr_pages)
> @@ -99,11 +107,11 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
> #ifdef CONFIG_MEMCG
> void page_counter_calculate_protection(struct page_counter *root,
> struct page_counter *counter,
> - bool recursive_protection);
> + bool recursive_protection, int is_local);
`bool is_local` is preferred
> #else
> static inline void page_counter_calculate_protection(struct page_counter *root,
> struct page_counter *counter,
> - bool recursive_protection) {}
> + bool recursive_protection, int is_local) {}
> #endif
>
> #endif /* _LINUX_PAGE_COUNTER_H */
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index d8d0e665caed..0e07a7a1d5b8 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
> }
> page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
> idx),
> - fault_parent, false);
> + fault_parent, false, NULL);
> page_counter_init(
> hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
> - rsvd_parent, false);
> + rsvd_parent, false, NULL);
>
> limit = round_down(PAGE_COUNTER_MAX,
> pages_per_huge_page(&hstates[idx]));
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 20b715441332..d7c5fff12105 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1497,6 +1497,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> vm_event_name(memcg_vm_event_stat[i]),
> memcg_events(memcg, memcg_vm_event_stat[i]));
> }
> +
> + seq_buf_printf(s, "local_usage %lu\n",
> + get_cgroup_local_usage(memcg, true));
> }
>
> static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> @@ -3597,8 +3600,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> if (parent) {
> WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
>
> - page_counter_init(&memcg->memory, &parent->memory, true);
> - page_counter_init(&memcg->swap, &parent->swap, false);
> + page_counter_init(&memcg->memory, &parent->memory, true, memcg);
> + page_counter_init(&memcg->swap, &parent->swap, false, NULL);
> #ifdef CONFIG_MEMCG_V1
> WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
> page_counter_init(&memcg->kmem, &parent->kmem, false);
> @@ -3607,8 +3610,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> } else {
> init_memcg_stats();
> init_memcg_events();
> - page_counter_init(&memcg->memory, NULL, true);
> - page_counter_init(&memcg->swap, NULL, false);
> + page_counter_init(&memcg->memory, NULL, true, memcg);
> + page_counter_init(&memcg->swap, NULL, false, NULL);
> #ifdef CONFIG_MEMCG_V1
> page_counter_init(&memcg->kmem, NULL, false);
> page_counter_init(&memcg->tcpmem, NULL, false);
> @@ -3677,7 +3680,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> memcg1_css_offline(memcg);
>
> page_counter_set_min(&memcg->memory, 0);
> - page_counter_set_low(&memcg->memory, 0);
> + page_counter_set_low(&memcg->memory, 0, 0);
>
> zswap_memcg_offline_cleanup(memcg);
>
> @@ -3748,7 +3751,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
> page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
> #endif
> page_counter_set_min(&memcg->memory, 0);
> - page_counter_set_low(&memcg->memory, 0);
> + page_counter_set_low(&memcg->memory, 0, 0);
> page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
> memcg1_soft_limit_reset(memcg);
> page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> @@ -4051,6 +4054,12 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static int memory_locallow_show(struct seq_file *m, void *v)
> +{
> + return seq_puts_memcg_tunable(m,
> + READ_ONCE(mem_cgroup_from_seq(m)->memory.locallow));
> +}
> +
> static int memory_low_show(struct seq_file *m, void *v)
> {
> return seq_puts_memcg_tunable(m,
> @@ -4061,7 +4070,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> - unsigned long low;
> + struct sysinfo si;
> + unsigned long low, locallow, local_capacity, total_capacity;
> int err;
>
> buf = strstrip(buf);
> @@ -4069,7 +4079,15 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
> if (err)
> return err;
>
> - page_counter_set_low(&memcg->memory, low);
> + /* Hardcoded 0 for local node and 1 for remote. */
I know we've talked about this before, but this is obviously broken for
multi-socket systems. If so, this needs at least a FIXME or a TODO so that
it's obvious that this patch isn't ready for upstream - even as an RFC.
Probably we can't move forward until we figure out how to solve this problem
ahead of this patch set. Worth discussing this issue explicitly.
Maybe rather than guessing, a preferred node should be set for local and
remote if this mechanism is in use. Otherwise just guessing which node is
local and which is remote seems like it will be wrong - especially for
sufficiently large-threaded processes.
> + si_meminfo_node(&si, 0);
> + local_capacity = si.totalram; /* In pages. */
> + total_capacity = local_capacity;
> + si_meminfo_node(&si, 1);
> + total_capacity += si.totalram;
> + locallow = low * local_capacity / total_capacity;
> +
> + page_counter_set_low(&memcg->memory, low, locallow);
>
> return nbytes;
> }
> @@ -4394,6 +4412,11 @@ static struct cftype memory_files[] = {
> .seq_show = memory_low_show,
> .write = memory_low_write,
> },
> + {
> + .name = "locallow",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_locallow_show,
> + },
> {
> .name = "high",
> .flags = CFTYPE_NOT_ON_ROOT,
> @@ -4483,7 +4506,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> if (!root)
> root = root_mem_cgroup;
>
> - page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
> + page_counter_calculate_protection(&root->memory, &memcg->memory,
> + recursive_protection, false);
> }
>
> static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b249d15af9dd..97205aafab46 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -18,8 +18,10 @@ static bool track_protection(struct page_counter *c)
> return c->protection_support;
> }
>
> +extern unsigned long get_cgroup_local_usage(struct mem_cgroup *memcg, bool flush);
> +
> static void propagate_protected_usage(struct page_counter *c,
> - unsigned long usage)
> + unsigned long usage, unsigned long local_usage)
> {
> unsigned long protected, old_protected;
> long delta;
> @@ -44,6 +46,15 @@ static void propagate_protected_usage(struct page_counter *c,
> if (delta)
> atomic_long_add(delta, &c->parent->children_low_usage);
> }
> +
> + protected = min(local_usage, READ_ONCE(c->locallow));
> + old_protected = atomic_long_read(&c->locallow_usage);
> + if (protected != old_protected) {
> + old_protected = atomic_long_xchg(&c->locallow_usage, protected);
> + delta = protected - old_protected;
> + if (delta)
> + atomic_long_add(delta, &c->parent->children_locallow_usage);
> + }
> }
>
> /**
> @@ -63,7 +74,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> atomic_long_set(&counter->usage, new);
> }
> if (track_protection(counter))
> - propagate_protected_usage(counter, new);
> + propagate_protected_usage(counter, new,
> + get_cgroup_local_usage(counter->memcg, false));
> }
>
> /**
> @@ -83,7 +95,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>
> new = atomic_long_add_return(nr_pages, &c->usage);
> if (protection)
> - propagate_protected_usage(c, new);
> + propagate_protected_usage(c, new,
> + get_cgroup_local_usage(counter->memcg, false));
> /*
> * This is indeed racy, but we can live with some
> * inaccuracy in the watermark.
> @@ -151,7 +164,8 @@ bool page_counter_try_charge(struct page_counter *counter,
> goto failed;
> }
> if (protection)
> - propagate_protected_usage(c, new);
> + propagate_protected_usage(c, new,
> + get_cgroup_local_usage(counter->memcg, false));
>
> /* see comment on page_counter_charge */
> if (new > READ_ONCE(c->local_watermark)) {
> @@ -238,7 +252,8 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
> WRITE_ONCE(counter->min, nr_pages);
>
> for (c = counter; c; c = c->parent)
> - propagate_protected_usage(c, atomic_long_read(&c->usage));
> + propagate_protected_usage(c, atomic_long_read(&c->usage),
> + get_cgroup_local_usage(counter->memcg, false));
> }
>
> /**
> @@ -248,14 +263,17 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
> *
> * The caller must serialize invocations on the same counter.
> */
> -void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
> +void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages,
> + unsigned long nr_pages_local)
> {
> struct page_counter *c;
>
> WRITE_ONCE(counter->low, nr_pages);
> + WRITE_ONCE(counter->locallow, nr_pages_local);
>
> for (c = counter; c; c = c->parent)
> - propagate_protected_usage(c, atomic_long_read(&c->usage));
> + propagate_protected_usage(c, atomic_long_read(&c->usage),
> + get_cgroup_local_usage(counter->memcg, false));
> }
>
> /**
> @@ -421,9 +439,9 @@ static unsigned long effective_protection(unsigned long usage,
> */
> void page_counter_calculate_protection(struct page_counter *root,
> struct page_counter *counter,
> - bool recursive_protection)
> + bool recursive_protection, int is_local)
> {
> - unsigned long usage, parent_usage;
> + unsigned long usage, parent_usage, local_usage, parent_local_usage;
> struct page_counter *parent = counter->parent;
>
> /*
> @@ -437,16 +455,19 @@ void page_counter_calculate_protection(struct page_counter *root,
> return;
>
> usage = page_counter_read(counter);
> - if (!usage)
> + local_usage = get_cgroup_local_usage(counter->memcg, true);
> + if (!usage || !local_usage)
> return;
>
> if (parent == root) {
> counter->emin = READ_ONCE(counter->min);
> counter->elow = READ_ONCE(counter->low);
> + counter->elocallow = READ_ONCE(counter->locallow);
> return;
> }
>
> parent_usage = page_counter_read(parent);
> + parent_local_usage = get_cgroup_local_usage(parent->memcg, true);
>
> WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage,
> READ_ONCE(counter->min),
> @@ -454,7 +475,16 @@ void page_counter_calculate_protection(struct page_counter *root,
> atomic_long_read(&parent->children_min_usage),
> recursive_protection));
>
> - WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage,
> + if (is_local)
> + WRITE_ONCE(counter->elocallow,
> + effective_protection(local_usage, parent_local_usage,
> + READ_ONCE(counter->locallow),
> + READ_ONCE(parent->elocallow),
> + atomic_long_read(&parent->children_locallow_usage),
> + recursive_protection));
> + else
> + WRITE_ONCE(counter->elow,
> + effective_protection(usage, parent_usage,
> READ_ONCE(counter->low),
> READ_ONCE(parent->elow),
> atomic_long_read(&parent->children_low_usage),
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
` (3 preceding siblings ...)
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
@ 2024-11-08 19:01 ` kaiyang2
4 siblings, 0 replies; 9+ messages in thread
From: kaiyang2 @ 2024-11-08 19:01 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: roman.gushchin, shakeel.butt, muchun.song, akpm, mhocko,
nehagholkar, abhishekd, hannes, weixugc, rientjes, gourry,
Kaiyang Zhao
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Adding some performance results from testing on a *real* system with CXL memory
to demonstrate the value of the patches.
The system has 256GB local DRAM + 64GB CXL memory. We stack two workloads
together in two cgroups. One is a microbenchmark that allocates memory and
accesses it at a tunable hotness level; here it allocates 256GB and accesses it
in sequential passes with a very hot access pattern (~1 second per pass). The
other workload is 64 instances of 520.omnetpp_r from SPEC CPU 2017, which
together use about 14GB of memory. We apply memory bandwidth limits (1 Gbps of
memory bandwidth per logical core) and mitigate LLC contention by setting a
cpuset for each cgroup.
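For reference, the microbenchmark's access pattern is roughly along the lines
of the sketch below (not the actual benchmark; the buffer size and pass count
here are placeholders - the test uses 256GB):

	#include <stdlib.h>
	#include <string.h>

	#define STRIDE 64			/* one cache line */

	/* Sweep the buffer in sequential passes, touching every line. */
	static void sweep(char *buf, size_t len, int passes)
	{
		for (int p = 0; p < passes; p++)
			for (size_t i = 0; i < len; i += STRIDE)
				buf[i]++;
	}

	int main(void)
	{
		size_t len = 1UL << 30;		/* tunable; 256GB in the test */
		char *buf = malloc(len);

		if (!buf)
			return 1;
		memset(buf, 0, len);		/* fault all pages in */
		sweep(buf, len, 100);
		free(buf);
		return 0;
	}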
Case 1: omnetpp running without the microbenchmark.
It can use local memory exclusively and runs without resource contention. This
is the optimal case.
Avg rate reported by SPEC = 84.7
Case 2: The two workloads stacked without the fairness patches, with the
microbenchmark started first.
Avg = 62.7 (-25.9%)
Case 3: Set memory.low = 19GB for both workloads. This provides enough local
low protection to cover the entire memory usage of omnetpp.
Avg = 75.3 (-11.1%)
Analysis: omnetpp still uses a significant amount of CXL memory (up to 3GB) by
the time it finishes, because its NUMA hint faults only trigger for a few
seconds of the ~20 minute runtime. Due to the short runtime of the workload and
how tiering currently works, it finishes before its memory usage converges to
the point where all of its memory is local. However, this still represents a
significant improvement over case 2.
Case 4: Set memory.low = 19GB for both workloads. Set memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference from case 1)
Analysis: with both memory.low and memory.high set, the local memory available
to the microbenchmark is essentially provisioned up front. Therefore, even
though the microbenchmark starts first, omnetpp can get all the local memory it
needs from the very beginning and achieves near non-colocated performance.
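As a back-of-the-envelope check of how the 19GB memory.low in cases 3 and 4
translates into local protection under the proportional scaling in patch 2
(our arithmetic only; it assumes the 256GB + 64GB capacities above are exact):

	#include <stdio.h>

	int main(void)
	{
		double local_gib = 256.0, cxl_gib = 64.0, low_gib = 19.0;
		/* locallow = low * local_capacity / total_capacity */
		double locallow_gib = low_gib * local_gib / (local_gib + cxl_gib);

		/* Prints ~15.2 GiB, above omnetpp's ~14GB footprint. */
		printf("effective memory.locallow ~= %.1f GiB\n", locallow_gib);
		return 0;
	}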
We’re working on getting performance data from Meta’s production workloads.
Stay tuned for more results.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread
Thread overview: 9+ messages
2024-09-20 22:11 [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2
2024-09-20 22:11 ` [RFC PATCH 1/4] Add get_cgroup_local_usage for estimating the top-tier memory usage kaiyang2
2024-09-20 22:11 ` [RFC PATCH 2/4] calculate memory.low for the local node and track its usage kaiyang2
2024-09-22 8:39 ` kernel test robot
2024-10-15 22:05 ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 3/4] use memory.low local node protection for local node reclaim kaiyang2
2024-10-15 21:52 ` Gregory Price
2024-09-20 22:11 ` [RFC PATCH 4/4] reduce NUMA balancing scan size of cgroups over their local memory.low kaiyang2
2024-11-08 19:01 ` [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion kaiyang2