linux-mm.kvack.org archive mirror
* [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
@ 2024-07-02  8:44 Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 1/4] mm: memcg: pmc framework Huan Yang
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-02  8:44 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Huan Yang, Kairui Song, cgroups, linux-mm, linux-kernel,
	Christian Brauner
  Cc: opensource.kernel

This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).

Background
===

Modern computer systems always have performance gaps between hardware
components, such as the performance differences between CPU, memory, and
disk. Due to the principle of locality of reference in data access:

  Programs often access data that has been accessed before
  Programs access the next set of data after accessing a particular piece of data
As a result:
  1. CPU caches are used to speed up access to already-accessed data
     in memory
  2. Disk prefetching techniques are used to prepare the next set of data
     to be accessed in advance (to avoid direct disk access)
This basic exploitation of locality greatly enhances computer performance.

PMC (per-MEMCG-cache) is similar: it exploits the principle of locality to
enhance program performance.

In modern computers, especially in smartphones, services are provided to
users on a per-application basis (such as Camera, Chat, etc.),
where an application is composed of multiple processes working together to
provide services.

The basic unit for managing resources in a computer is the process,
which in turn uses threads to share memory and accomplish tasks.
Memory is shared among threads within a process.

However, modern computers have the following locality deficiencies:

  1. Different forms of memory exist and are not interconnected (anonymous
     pages, file pages, special memory such as DMA-BUF, various memory
     allocated in kernel mode, etc.)
  2. Memory isolation exists between processes; apart from explicitly
     shared memory, they do not communicate with each other.
  3. During a transition of functionality within an application, one process
     usually releases memory while another process requests memory, and in
     this process memory has to be obtained from the lowest level through
     competition.

Take the camera application as an example:

Camera applications typically provide photo capture services as well as photo
preview services.
The photo capture process usually utilizes DMA-BUF to facilitate the sharing
of image data between the CPU and DMA devices.
When it comes to image preview, multiple algorithm processes are typically
involved in processing the image data, which may also involve heap memory
and other resources.

During the switch between photo capture and preview, the application typically
needs to release DMA-BUF memory and then the algorithms need to allocate
heap memory. The flow of system memory during this process is managed by
the PCP-BUDDY system.

However, the PCP and BUDDY systems are shared, so subsequently requested
memory may not be available because previously released memory has already
been used by others (such as for file reading), requiring a competitive
(memory reclamation) process to obtain it.

So, if the released memory can be allocated again with high priority within
the application, this meets the locality requirement, improves performance,
and avoids unnecessary memory reclaim.

PMC solutions are similar to PCP, as they both establish cache pools according
to certain rules.

Why base on MEMCG?
===

The MEMCG container can allocate selected processes to a MEMCG based on certain
grouping strategies (typical examples include grouping by app or UID).
Processes within the same MEMCG can then be used for statistics, upper limit
restrictions, and reclamation control.

All processes within a MEMCG are considered as a single memory unit,
sharing memory among themselves. As a result, when one process releases
memory, another process within the same group can obtain it with the
highest priority, fully utilizing the locality of memory allocation
characteristics within the MEMCG (such as APP grouping).

In addition, MEMCG provides feature interfaces that can be dynamically toggled
and are fully controllable by policy. This provides greater flexibility
and does not impact performance when not enabled (controlled through a static key).


About the PMC implementation
===
Here, a cache switch is provided for each MEMCG (not on the root).
When the user enables the cache, processes within the MEMCG will share memory
through this cache.

The cache pool is positioned before the PCP. All order-0 pages released by
processes in the MEMCG are released to the cache pool first, and when memory
is requested, it is also obtained from the cache pool with priority.
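
To make this concrete, here is a simplified sketch of the order-0 fast paths
as wired up in patch 1 (an illustrative flow, not the exact code):

  free_unref_page():  PMC (if enabled and zone free pages are above the
                      high watermark + `watermark`) -> PCP -> buddy
  rmqueue():          PMC (if enabled and the pool is non-empty) -> PCP -> buddy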

`memory.cache` is the sole entry point for controlling PMC; it accepts the
following nested keys (see the example after this list):
  1. "enable=[y|n]" to enable or disable the targeted MEMCG's cache
  2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" to control an already
  enabled PMC's behavior:
    a) `nid` targets a node whose keys should change; otherwise all nodes
       are changed.
    b) `watermark` controls caching behavior: on release, pages are cached
       only when a zone's free pages exceed the zone's high watermark plus
       this watermark. (unit bytes, default 50MB, min 10MB per-node-all-zone)
    c) `reaper_time` controls the reaper interval; each time it elapses, all
       cache in this MEMCG is reaped. (unit us, default 5s, 0 disables it)
    d) `limit` caps the maximum memory used by the cache pool. (unit bytes,
       default 100MB, max 500MB per-node-all-zone)
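
For example, assuming the application's memcg is mounted at
/sys/fs/cgroup/app (a hypothetical path), the interface is meant to be used
roughly like this:

  # enable the cache for this memcg
  echo enable=y > /sys/fs/cgroup/app/memory.cache
  # on node 0, raise the watermark to 150MB and the pool limit to 200MB
  echo keys=nid=0,watermark=157286400,limit=209715200 > /sys/fs/cgroup/app/memory.cache
  # show per-node settings and per-zone cache/hit statistics
  cat /sys/fs/cgroup/app/memory.cache
  # disable the cache, draining all cached pages back to the buddy
  echo enable=n > /sys/fs/cgroup/app/memory.cache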

Performance
===
PMC is based on MEMCG and requires performance measurement through the
sharing of complex workloads between application processes.
Therefore, at the moment, we are unable to provide a better testing solution
for this patchset.

Here are the internal test results we can provide, using the camera
application as an example. (1 node, 1 zone, 8GB RAM)

Test Case: Capture in rear portrait HDR mode
1. Test mode: rear portrait HDR mode. This scene needs more than 800MB of RAM,
   with memory types including DMA-BUF (470MB), PSS (150MB), and APU (200MB)
2. Test steps: take a photo, then click the thumbnail to view the full image

The overall latency from clicking the shutter button to showing the whole
image improves by 500ms, and the total slowpath cost of all camera threads is
reduced from 958ms to 495ms.
Especially for shot-to-shot in this mode, the preview delay of each frame
shows a significant improvement.

Some questions
===
1. The current patchset ignores the migrate type because the original
   requirement is to share between DMA-BUF and heap memory. However,
   this behavior will cause serious system fragmentation,
   so is there a better solution?

2. The current patchset only supports order 0 and uses a reaper to reclaim
   the cache. It may be better to adapt it to drain work and higher orders.

3. Actually, the internal test above placed cache freeing before the PCP but
   cache allocation after buddy allocation, so tasks consume common memory
   first and the cache is only used in emergency situations (just before
   entering the slowpath). This results in better performance, but it may
   impact the system, even if the cache is only enabled during application
   startup. So, which is better? (See the sketch after this list.)

4. The current patchset is a rough sketch for discussion; some structs may
   need refcounts/locks to fix racy accesses.
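
For reference, the two orderings contrasted in question 3 look like this
(a rough sketch, not the exact code):

  posted patchset: free: PMC -> PCP -> buddy   alloc: PMC -> PCP -> buddy
  internal test:   free: PMC -> PCP -> buddy   alloc: PCP -> buddy -> PMC
                   (the cache is consumed only just before the slowpath)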

Huan Yang (4):
  mm: memcg: pmc framework
  mm: memcg: pmc support change attribute
  mm: memcg: pmc: support reaper
  mm: memcg: pmc: support oom release

 include/linux/memcontrol.h |  41 ++++
 include/linux/mmzone.h     |  34 +++
 include/linux/swap.h       |   1 +
 mm/memcontrol.c            | 481 +++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            | 147 ++++++++++++
 5 files changed, 704 insertions(+)


base-commit: 727900b675b749c40ba1f6669c7ae5eb7eb8e837
-- 
2.45.2




* [RFC PATCH 1/4] mm: memcg: pmc framework
  2024-07-02  8:44 [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
@ 2024-07-02  8:44 ` Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 2/4] mm: memcg: pmc support change attribute Huan Yang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-02  8:44 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Huan Yang, Kairui Song, cgroups, linux-mm, linux-kernel,
	Christian Brauner
  Cc: opensource.kernel

pmc - per-memcg cache

This patch adds the PMC feature to each memcg except the root memcg.
Users can enable PMC on a target memcg so that all tasks in that memcg
share a cache pool: order-0 page allocation and freeing go through this
cache pool with high priority.

Signed-off-by: Huan Yang <link@vivo.com>
---
 include/linux/memcontrol.h |  41 +++++++
 include/linux/mmzone.h     |  25 ++++
 include/linux/swap.h       |   1 +
 mm/memcontrol.c            | 237 +++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            | 146 +++++++++++++++++++++++
 5 files changed, 450 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8f332b4ae84c..5ec4c64bc515 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -130,6 +130,7 @@ struct mem_cgroup_per_node {
 	bool			on_tree;
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
 						/* use container_of	   */
+	struct mem_cgroup_per_node_cache *cachep;
 };
 
 struct mem_cgroup_threshold {
@@ -336,6 +337,8 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+	bool cache_enabled;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
@@ -557,6 +560,8 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
 	return memcg;
 }
 
+extern struct static_key_true pmc_key;
+
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set.
@@ -1185,6 +1190,25 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 
+static inline bool pmc_disabled(void)
+{
+	return static_branch_likely(&pmc_key);
+}
+
+static inline bool mem_cgroup_cache_disabled(struct mem_cgroup *memcg)
+{
+	return !READ_ONCE(memcg->cache_enabled);
+}
+
+
+static inline struct mem_cgroup_per_node_cache *
+mem_cgroup_get_node_cachep(struct mem_cgroup *memcg, int nid)
+{
+	struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+
+	return nodeinfo->cachep;
+}
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1648,6 +1672,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 {
 	return 0;
 }
+
+static inline bool pmc_disabled(void)
+{
+	return true;
+}
+
+static inline bool mem_cgroup_cache_disabled(struct mem_cgroup *memcg)
+{
+	return true;
+}
+
+
+static inline struct mem_cgroup_per_node_cache *
+mem_cgroup_get_node_cachep(struct mem_cgroup *memcg, int nid)
+{
+	return NULL;
+}
 #endif /* CONFIG_MEMCG */
 
 /*
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..773b89e214c9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -603,6 +603,31 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 
 #endif /* CONFIG_LRU_GEN */
 
+struct mem_cgroup_zone_cache {
+	/* cached pages; currently only order 0 is held */
+	struct list_head pages;
+	spinlock_t pages_lock;
+	atomic_t nr_pages;
+	atomic_t nr_alloced;
+};
+
+struct mem_cgroup_per_node_cache {
+	/* per zone cache */
+	struct mem_cgroup_zone_cache zone_cachep[MAX_NR_ZONES];
+	struct mem_cgroup *memcg;
+
+	/* max number of pages to hold; unit pages, default 100MB */
+#define DEFAULT_PMC_HOLD_LIMIT ((100 << 20) >> PAGE_SHIFT)
+	unsigned int hold_limit;
+
+#define DEFAULT_PMC_GAP_WATERMARK ((50 << 20) >> PAGE_SHIFT)
+	/*
+	 * Pages are cached only when zone free pages are above the high
+	 * watermark + allow_watermark; unit pages, default 50MB
+	 */
+	unsigned int allow_watermark;
+};
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	/* per lruvec lru_lock for memcg */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..d7b5e0a8317c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -420,6 +420,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
+extern int mem_cgroup_release_cache(struct mem_cgroup_per_node_cache *fc);
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1b3c3394a2ba..404fcb96bf68 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,15 @@ static bool cgroup_memory_nokmem __ro_after_init;
 /* BPF memory accounting disabled? */
 static bool cgroup_memory_nobpf __ro_after_init;
 
+/*
+ * How many memcgs have the cache enabled? If none, the static branch is
+ * enabled so that no task's free/alloc path enters PMC.
+ * Otherwise the static branch is disabled and pages are cached per memcg.
+ */
+static atomic_t pmc_nr_enabled;
+DEFINE_STATIC_KEY_TRUE(pmc_key);
+
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
@@ -5738,6 +5747,8 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
 	lru_gen_release_memcg(memcg);
 }
 
+static int __disable_mem_cgroup_cache(struct mem_cgroup *memcg);
+
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5762,6 +5773,8 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	free_shrinker_info(memcg);
+	if (READ_ONCE(memcg->cache_enabled))
+		__disable_mem_cgroup_cache(memcg);
 	mem_cgroup_free(memcg);
 }
 
@@ -7088,6 +7101,223 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int __enable_mem_cgroup_cache(struct mem_cgroup *memcg)
+{
+	int nid, idx;
+
+	if (!mem_cgroup_cache_disabled(memcg))
+		return -EINVAL;
+
+	for_each_node(nid) {
+		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+		struct mem_cgroup_per_node_cache *p = kvzalloc_node(
+			sizeof(struct mem_cgroup_per_node_cache),
+			GFP_KERNEL, nid);
+
+		if (unlikely(!p))
+			goto fail;
+
+		nodeinfo->cachep = p;
+	}
+
+	for_each_node(nid) {
+		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+		pg_data_t *pgdat = NODE_DATA(nid);
+		struct mem_cgroup_per_node_cache *p = nodeinfo->cachep;
+
+		for (idx = 0; idx < MAX_NR_ZONES; idx++) {
+			struct zone *z = &pgdat->node_zones[idx];
+			struct mem_cgroup_zone_cache *zc;
+
+			if (!populated_zone(z))
+				continue;
+
+			zc = &p->zone_cachep[idx];
+
+			INIT_LIST_HEAD(&zc->pages);
+			spin_lock_init(&zc->pages_lock);
+		}
+
+		p->memcg = memcg;
+		p->hold_limit = DEFAULT_PMC_HOLD_LIMIT;
+		p->allow_watermark = DEFAULT_PMC_GAP_WATERMARK;
+
+		/* pmc_nr_enabled is bumped once per memcg, after this loop */
+	}
+
+	if (static_branch_likely(&pmc_key))
+		static_branch_disable(&pmc_key);
+
+	/* make the cache fully visible before publishing cache_enabled */
+	smp_wmb();
+	WRITE_ONCE(memcg->cache_enabled, true);
+	atomic_inc(&pmc_nr_enabled);
+
+	return 0;
+
+fail:
+	for_each_node(nid) {
+		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+
+		if (nodeinfo->cachep) {
+			kvfree(nodeinfo->cachep);
+			nodeinfo->cachep = NULL;
+		}
+	}
+
+	return -ENOMEM;
+}
+
+static int __disable_mem_cgroup_cache(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	if (unlikely(mem_cgroup_cache_disabled(memcg)))
+		return -EINVAL;
+
+	/* mark the cache offline first so no new pages enter it */
+	WRITE_ONCE(memcg->cache_enabled, false);
+
+	for_each_node(nid) {
+		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+		struct mem_cgroup_per_node_cache *p;
+
+		p = nodeinfo->cachep;
+
+		mem_cgroup_release_cache(p);
+
+		kvfree(p);
+	}
+
+	if (atomic_dec_and_test(&pmc_nr_enabled))
+		static_branch_enable(&pmc_key);
+
+	return 0;
+}
+
+static int mem_cgroup_cache_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg;
+	int nid;
+
+	if (static_branch_likely(&pmc_key))
+		return -EINVAL;
+
+	memcg = mem_cgroup_from_seq(m);
+	if (!READ_ONCE(memcg->cache_enabled))
+		return -EINVAL;
+
+	seq_printf(m, "%4s %16s %16s\n", "NODE", "WATERMARK", "HOLD_LIMIT");
+	for_each_online_node(nid) {
+		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+		struct mem_cgroup_per_node_cache *p;
+
+		p = nodeinfo->cachep;
+		if (!p)
+			continue;
+
+		seq_printf(m, "%4d %14uKB %14uKB\n", nid,
+			   (READ_ONCE(p->allow_watermark) << (PAGE_SHIFT - 10)),
+			   (READ_ONCE(p->hold_limit) << (PAGE_SHIFT - 10)));
+	}
+
+	seq_puts(m, "===========\n");
+	seq_printf(m, "%4s %16s %16s %16s\n", "NODE", "ZONE", "CACHE", "HIT");
+
+	for_each_online_node(nid) {
+		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
+		struct mem_cgroup_per_node_cache *p;
+		pg_data_t *pgdat = NODE_DATA(nid);
+		int idx;
+
+		p = nodeinfo->cachep;
+		if (!p)
+			continue;
+
+		for (idx = 0; idx < MAX_NR_ZONES; idx++) {
+			struct mem_cgroup_zone_cache *zc;
+			struct zone *z = &pgdat->node_zones[idx];
+
+			if (!populated_zone(z))
+				continue;
+
+			zc = &p->zone_cachep[idx];
+			seq_printf(m, "%4d %16s %14dKB %14dKB\n", nid, z->name,
+				   (atomic_read(&zc->nr_pages)
+				    << (PAGE_SHIFT - 10)),
+				   (atomic_read(&zc->nr_alloced)
+				    << (PAGE_SHIFT - 10)));
+		}
+	}
+
+	return 0;
+}
+
+enum {
+	OPT_CTRL_ENABLE,
+	OPT_CTRL_ERR,
+	OPT_CTRL_NR = OPT_CTRL_ERR,
+};
+
+static const match_table_t ctrl_tokens = {
+					   { OPT_CTRL_ENABLE, "enable=%s" },
+					   { OPT_CTRL_ERR, NULL } };
+
+/**
+ * This function controls the target memcg's cache, including enable/keys set.
+ * To enable/disable the cache, use `echo enable=[y|n] > memory.cache`
+ * in the target memcg.
+ */
+static ssize_t mem_cgroup_cache_control(struct kernfs_open_file *of, char *buf,
+					size_t nbytes, loff_t off)
+{
+	bool enable;
+	bool opt_enable_set = false;
+	int err = 0;
+	char *sub;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	buf = strstrip(buf);
+	if (!strlen(buf))
+		return -EINVAL;
+
+	while ((sub = strsep(&buf, " ")) != NULL) {
+		int token;
+		substring_t args[MAX_OPT_ARGS];
+		char tbuf[256];
+
+		sub = strstrip(sub);
+
+		token = match_token(sub, ctrl_tokens, args);
+		switch (token) {
+		case OPT_CTRL_ENABLE:
+			if (match_strlcpy(tbuf, &args[0], sizeof(tbuf)) >=
+			    sizeof(tbuf))
+				return -EINVAL;
+
+			err = kstrtobool(tbuf, &enable);
+			if (err)
+				return -EINVAL;
+			opt_enable_set = true;
+			break;
+		case OPT_CTRL_ERR:
+		default:
+			return -EINVAL;
+		}
+	}
+
+	if (opt_enable_set) {
+		if (enable) {
+			__enable_mem_cgroup_cache(memcg);
+		} else {
+			__disable_mem_cgroup_cache(memcg);
+			return nbytes;
+		}
+	}
+
+	return err ? err : nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -7156,6 +7386,13 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	/* per-memcg cache control */
+	{
+		.name = "cache",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = mem_cgroup_cache_control,
+		.seq_show = mem_cgroup_cache_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1beb56f75319..54c4d00c2506 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -530,6 +530,14 @@ static inline int pindex_to_order(unsigned int pindex)
 	return order;
 }
 
+/*
+ * The per-memcg cache currently only allows order 0.
+ */
+static inline bool pmc_allow_order(unsigned int order)
+{
+	return !order;
+}
+
 static inline bool pcp_allowed_order(unsigned int order)
 {
 	if (order <= PAGE_ALLOC_COSTLY_ORDER)
@@ -1271,6 +1279,43 @@ void __free_pages_core(struct page *page, unsigned int order)
 	__free_pages_ok(page, order, FPI_TO_TAIL);
 }
 
+int mem_cgroup_release_cache(struct mem_cgroup_per_node_cache *nodep)
+{
+	LIST_HEAD(temp_list);
+	int zid, num = 0;
+
+	for (zid = 0; zid < MAX_NR_ZONES; ++zid) {
+		struct mem_cgroup_zone_cache *zc = &nodep->zone_cachep[zid];
+		int i = 0;
+
+		if (!atomic_read(&zc->nr_pages))
+			continue;
+
+		spin_lock(&zc->pages_lock);
+		list_splice_init(&zc->pages, &temp_list);
+		spin_unlock(&zc->pages_lock);
+
+		while (!list_empty(&temp_list)) {
+			struct page *page =
+				list_first_entry(&temp_list, struct page, lru);
+			struct zone *zone = page_zone(page);
+			unsigned long pfn = page_to_pfn(page);
+
+			list_del(&page->lru);
+
+
+			/* would these be better put back into the pcp? */
+			free_one_page(zone, page, pfn, 0, FPI_NONE);
+			++i;
+		}
+
+		num += i;
+		atomic_sub(i, &zc->nr_pages);
+	}
+
+	return num;
+}
+
 /*
  * Check that the whole (or subset of) a pageblock given by the interval of
  * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
@@ -2603,6 +2648,41 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	}
 }
 
+static bool free_unref_page_to_pmc(struct page *page, struct zone *zone,
+				   int order)
+{
+	struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node_cache *cachp;
+	struct mem_cgroup_zone_cache *zc;
+	unsigned long flags;
+	bool ret = false;
+
+	if (pmc_disabled())
+		return false;
+
+	memcg = get_mem_cgroup_from_current();
+	if (!memcg || mem_cgroup_is_root(memcg) ||
+	    mem_cgroup_cache_disabled(memcg))
+		goto out;
+
+	cachp = mem_cgroup_get_node_cachep(memcg, page_to_nid(page));
+	zc = &cachp->zone_cachep[page_zonenum(page)];
+
+	if (high_wmark_pages(zone) + READ_ONCE(cachp->allow_watermark) >=
+	    zone_page_state(zone, NR_FREE_PAGES))
+		goto out;
+
+	spin_lock_irqsave(&zc->pages_lock, flags);
+	list_add(&page->lru, &zc->pages);
+	spin_unlock_irqrestore(&zc->pages_lock, flags);
+	atomic_inc(&zc->nr_pages);
+
+	ret = true;
+out:
+	mem_cgroup_put(memcg);
+	return ret;
+}
+
 /*
  * Free a pcp page
  */
@@ -2634,6 +2714,17 @@ void free_unref_page(struct page *page, unsigned int order)
 	}
 
 	zone = page_zone(page);
+
+	/*
+	 * Cache the released page before freeing it into the pcp if the
+	 * current memcg has the cache feature enabled.
+	 * Compared to the PCP, the PMC is private: only processes in the
+	 * memcg can access it. So, if the conditions are met, the page is
+	 * preferentially released to the PMC before the public CPU cache.
+	 */
+	if (pmc_allow_order(order) && free_unref_page_to_pmc(page, zone, order))
+		return;
+
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
@@ -3012,6 +3103,49 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	return page;
 }
 
+static struct page *rmqueue_mem_cgroup_cache(struct zone *preferred_zone,
+					     struct zone *zone,
+					     unsigned int order,
+					     int migratetype)
+{
+	struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node_cache *cachp;
+	struct mem_cgroup_zone_cache *zc;
+	unsigned long flags;
+	int nid = zone->zone_pgdat->node_id;
+	struct page *page = NULL;
+
+	if (pmc_disabled())
+		return NULL;
+
+	memcg = get_mem_cgroup_from_current();
+	if (!memcg || mem_cgroup_is_root(memcg) ||
+	    mem_cgroup_cache_disabled(memcg))
+		goto out;
+
+	cachp = mem_cgroup_get_node_cachep(memcg, nid);
+
+	zc = &cachp->zone_cachep[zone_idx(zone)];
+	if (!atomic_read(&zc->nr_pages))
+		goto out;
+
+	spin_lock_irqsave(&zc->pages_lock, flags);
+	if (list_empty(&zc->pages)) {
+		spin_unlock_irqrestore(&zc->pages_lock, flags);
+		goto out;
+	}
+	page = list_first_entry(&zc->pages, struct page, lru);
+	list_del(&page->lru);
+	spin_unlock_irqrestore(&zc->pages_lock, flags);
+
+	atomic_dec(&zc->nr_pages);
+	atomic_inc(&zc->nr_alloced);
+
+out:
+	mem_cgroup_put(memcg);
+	return page;
+}
+
 /*
  * Allocate a page from the given zone.
  * Use pcplists for THP or "cheap" high-order allocations.
@@ -3038,6 +3172,18 @@ struct page *rmqueue(struct zone *preferred_zone,
 	 */
 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
 
+	/*
+	 * Before disturbing the public pcp or buddy, current may be in a
+	 * memcg which has the cache feature enabled. If so, getting the
+	 * page from the private pool first can speed up allocation.
+	 */
+	if (pmc_allow_order(order)) {
+		page = rmqueue_mem_cgroup_cache(preferred_zone, zone, order,
+						migratetype);
+		if (page)
+			goto out;
+	}
+
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				       migratetype, alloc_flags);
-- 
2.45.2




* [RFC PATCH 2/4] mm: memcg: pmc support change attribute
  2024-07-02  8:44 [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 1/4] mm: memcg: pmc framework Huan Yang
@ 2024-07-02  8:44 ` Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 3/4] mm: memcg: pmc: support reaper Huan Yang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-02  8:44 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Huan Yang, Kairui Song, cgroups, linux-mm, linux-kernel,
	Christian Brauner
  Cc: opensource.kernel

PMC has the following attributes:
  watermark: pages are cached only when zone free pages are above
high watermark + watermark
  limit: the maximum amount of memory the cache can hold
This patch lets users change each attribute through `memory.cache`.

To change an attribute, write `keys=attribute=value` to the memcg's
`memory.cache` if the cache is enabled.

For example:
  echo keys=watermark=157286400,limit=209715200 > memory.cache
After this, the memcg caches pages only when free pages are above the
high watermark + 150MB, and it caches up to a maximum of 200MB.
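
For reference, after the example above, the `memory.cache` output added in
patch 1 would look roughly like this (illustrative output based on the
seq_show format in this series, not a captured log):

  NODE        WATERMARK       HOLD_LIMIT
     0         153600KB         204800KB
  ===========
  NODE             ZONE            CACHE              HIT
     0           Normal            512KB          24576KB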

Signed-off-by: Huan Yang <link@vivo.com>
---
 mm/memcontrol.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 151 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 404fcb96bf68..9db5bbe63b34 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7253,29 +7253,168 @@ static int mem_cgroup_cache_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#define STATUS_UNSET_DEFAULT_VALUE -1
+enum {
+	OPT_KEY_NID,
+	OPT_KEY_WATERMARK,
+	OPT_KEY_HOLD_LIMIT,
+	OPT_KEY_ERR,
+	NR_PMC_KEY_OPTS = OPT_KEY_ERR
+};
+
+static const match_table_t fc_tokens = {
+	{ OPT_KEY_NID, "nid=%d" },
+	{ OPT_KEY_WATERMARK, "watermark=%u" },
+	{ OPT_KEY_HOLD_LIMIT, "limit=%u" },
+	{ OPT_KEY_ERR, NULL}
+};
+
+static void
+__apply_status_for_mem_cgroup_cache(struct mem_cgroup_per_node_cache *p,
+				    unsigned int opts[])
+{
+	int i;
+
+	for (i = OPT_KEY_WATERMARK; i < NR_PMC_KEY_OPTS; ++i) {
+		switch (i) {
+		case OPT_KEY_WATERMARK:
+			if (opts[OPT_KEY_WATERMARK] !=
+			    STATUS_UNSET_DEFAULT_VALUE)
+				p->allow_watermark = opts[OPT_KEY_WATERMARK];
+			break;
+		case OPT_KEY_HOLD_LIMIT:
+			if (opts[OPT_KEY_HOLD_LIMIT] !=
+			    STATUS_UNSET_DEFAULT_VALUE)
+				p->hold_limit = opts[OPT_KEY_HOLD_LIMIT];
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+static __always_inline int
+mem_cgroup_apply_cache_status(struct mem_cgroup *memcg,
+				   unsigned int opts[])
+{
+	struct mem_cgroup_per_node_cache *p;
+	unsigned int nid = opts[OPT_KEY_NID];
+
+	if (nid != STATUS_UNSET_DEFAULT_VALUE) {
+		p = memcg->nodeinfo[nid]->cachep;
+		if (unlikely(!p))
+			return -EINVAL;
+		__apply_status_for_mem_cgroup_cache(p, opts);
+		return 0;
+	}
+
+	for_each_node(nid) {
+		p = memcg->nodeinfo[nid]->cachep;
+		if (!p)
+			continue;
+		__apply_status_for_mem_cgroup_cache(p, opts);
+	}
+
+	return 0;
+}
+
+/**
+ * Support nid=x,watermark=bytes,limit=bytes args
+ */
+static int __mem_cgroup_cache_control_key(char *buf,
+					      struct mem_cgroup *memcg)
+{
+	char *p;
+	unsigned int opts[NR_PMC_KEY_OPTS];
+
+	memset(opts, STATUS_UNSET_DEFAULT_VALUE, sizeof(opts));
+
+	if (!READ_ONCE(memcg->cache_enabled))
+		return -EINVAL;
+
+	if (!buf)
+		return -EINVAL;
+
+	while ((p = strsep(&buf, ",")) != NULL) {
+		int token;
+		u32 v;
+		substring_t args[MAX_OPT_ARGS];
+
+		p = strstrip(p);
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, fc_tokens, args);
+		switch (token) {
+		case OPT_KEY_NID:
+			if (match_uint(&args[0], &v) || v >= MAX_NUMNODES)
+				return -EINVAL;
+			opts[OPT_KEY_NID] = v;
+			break;
+		case OPT_KEY_WATERMARK:
+#define MIN_WATERMARK_LIMIT ((10 << 20) >> PAGE_SHIFT)
+			if (match_uint(&args[0], &v))
+				return -EINVAL;
+			v >>= PAGE_SHIFT;
+			if (v < MIN_WATERMARK_LIMIT)
+				return -EINVAL;
+			opts[OPT_KEY_WATERMARK] = v;
+			break;
+		case OPT_KEY_HOLD_LIMIT:
+			if (match_uint(&args[0], &v))
+				return -EINVAL;
+			v >>= PAGE_SHIFT;
+#define MAX_CACHE_LIMIT_NR ((500 << 20) >> PAGE_SHIFT)
+			if (v > MAX_CACHE_LIMIT_NR)
+				return -EINVAL;
+			opts[OPT_KEY_HOLD_LIMIT] = v;
+			break;
+		case OPT_KEY_ERR:
+		default:
+			break;
+		}
+	}
+
+	if (mem_cgroup_apply_cache_status(memcg, opts))
+		return -EINVAL;
+
+	return 0;
+}
+
 enum {
 	OPT_CTRL_ENABLE,
+	OPT_CTRL_KEYS,
 	OPT_CTRL_ERR,
 	OPT_CTRL_NR = OPT_CTRL_ERR,
 };
 
 static const match_table_t ctrl_tokens = {
 					   { OPT_CTRL_ENABLE, "enable=%s" },
+					   { OPT_CTRL_KEYS, "keys=%s" },
 					   { OPT_CTRL_ERR, NULL } };
 
 /**
  * This function controls the target memcg's cache, including enable/keys set.
  * To enable/disable the cache, use `echo enable=[y|n] > memory.cache`
  * in the target memcg.
+ * To set keys, use `echo keys=[key=args,..] > memory.cache`. Supported keys:
+ *   1. nid=x: if given, only change the target node's cache status; else all nodes.
+ *   2. watermark=bytes: change cache hold behavior; pages are cached only when
+ *      zone free pages exceed high watermark + watermark.
+ *   3. limit=bytes: change the max pages the cache can hold, up to 500MB.
+ * Enable and keys can be given together, separated by a space, so keys can be
+ * set right after enable; if the cache is not enabled, keys cannot be set.
  */
 static ssize_t mem_cgroup_cache_control(struct kernfs_open_file *of, char *buf,
 					size_t nbytes, loff_t off)
 {
 	bool enable;
-	bool opt_enable_set = false;
+	bool opt_enable_set = false, opt_key_set = false;
 	int err = 0;
 	char *sub;
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char keybuf[256];
 
 	buf = strstrip(buf);
 	if (!strlen(buf))
@@ -7300,6 +7439,14 @@ static ssize_t mem_cgroup_cache_control(struct kernfs_open_file *of, char *buf,
 				return -EINVAL;
 			opt_enable_set = true;
 			break;
+		case OPT_CTRL_KEYS:
+			if (match_strlcpy(tbuf, &args[0], sizeof(tbuf)) >=
+			    sizeof(tbuf))
+				return -EINVAL;
+
+			memcpy(keybuf, tbuf, sizeof(keybuf));
+			opt_key_set = true;
+			break;
 		case OPT_CTRL_ERR:
 		default:
 			return -EINVAL;
@@ -7315,6 +7462,9 @@ static ssize_t mem_cgroup_cache_control(struct kernfs_open_file *of, char *buf,
 		}
 	}
 
+	if (opt_key_set)
+		err = __mem_cgroup_cache_control_key(keybuf, memcg);
+
 	return err ? err : nbytes;
 }
 
-- 
2.45.2




* [RFC PATCH 3/4] mm: memcg: pmc: support reaper
  2024-07-02  8:44 [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 1/4] mm: memcg: pmc framework Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 2/4] mm: memcg: pmc support change attribute Huan Yang
@ 2024-07-02  8:44 ` Huan Yang
  2024-07-02  8:44 ` [RFC PATCH 4/4] mm: memcg: pmc: support oom release Huan Yang
  2024-07-02 19:27 ` [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Roman Gushchin
  4 siblings, 0 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-02  8:44 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Huan Yang, Kairui Song, cgroups, linux-mm, linux-kernel,
	Christian Brauner
  Cc: opensource.kernel

If a memcg enables PMC, it will cache some pages. However, if all
processes in the memcg exit while some pages remain in the cache, those
pages will not be used unless the memcg is deleted.

To avoid this situation, a periodic reaping job is added to each
memcg when PMC is enabled, which reclaims all the cached memory in the
memcg at regular intervals (default 5s).

Users can also change the reaper interval like below:
  echo keys=reaper_time=8000000 > memory.cache
This memcg will then reap its cache every 8s (the unit is us).

Signed-off-by: Huan Yang <link@vivo.com>
---
 include/linux/mmzone.h |  6 ++++
 mm/memcontrol.c        | 77 ++++++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c        |  1 +
 3 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 773b89e214c9..b56dd462232b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -608,6 +608,7 @@ struct mem_cgroup_zone_cache {
 	struct list_head pages;
 	spinlock_t pages_lock;
 	atomic_t nr_pages;
+	atomic_t nr_reapered;
 	atomic_t nr_alloced;
 };
 
@@ -616,6 +617,11 @@ struct mem_cgroup_per_node_cache {
 	struct mem_cgroup_zone_cache zone_cachep[MAX_NR_ZONES];
 	struct mem_cgroup *memcg;
 
+	/* cache reclaim interval; unit us, default 5s, 0 disables the reaper */
+#define DEFAULT_PMC_REAPER_TIME ((5 * 1000 * 1000))
+	unsigned int reaper_wait;
+	struct delayed_work reaper_work;
+
 	/* max number of pages to hold; unit pages, default 100MB */
 #define DEFAULT_PMC_HOLD_LIMIT ((100 << 20) >> PAGE_SHIFT)
 	unsigned int hold_limit;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9db5bbe63b34..ae6917de91cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7101,6 +7101,39 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+/**
+ * This function reaps all cached pages via a periodic scan.
+ * The scan interval depends on @reaper_wait, which can be set via the
+ * `keys` nested key.
+ * By default, each memcg with the cache enabled is reaped every 5s.
+ */
+static void pmc_reaper(struct work_struct *worker)
+{
+	struct mem_cgroup_per_node_cache *node_cachep = container_of(
+		to_delayed_work(worker), struct mem_cgroup_per_node_cache,
+		reaper_work);
+	struct mem_cgroup *memcg;
+	int num;
+
+	if (!READ_ONCE(node_cachep->reaper_wait))
+		return;
+
+	memcg = node_cachep->memcg;
+	rcu_read_lock();
+	if (!css_tryget(&memcg->css)) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+
+	num = mem_cgroup_release_cache(node_cachep);
+
+	css_put(&memcg->css);
+
+	schedule_delayed_work(&node_cachep->reaper_work,
+			      usecs_to_jiffies(node_cachep->reaper_wait));
+}
+
 static int __enable_mem_cgroup_cache(struct mem_cgroup *memcg)
 {
 	int nid, idx;
@@ -7141,8 +7174,13 @@ static int __enable_mem_cgroup_cache(struct mem_cgroup *memcg)
 		p->memcg = memcg;
 		p->hold_limit = DEFAULT_PMC_HOLD_LIMIT;
 		p->allow_watermark = DEFAULT_PMC_GAP_WATERMARK;
+		p->reaper_wait = DEFAULT_PMC_REAPER_TIME;
 
 		/* pmc_nr_enabled is bumped once per memcg, after this loop */
+
+		INIT_DELAYED_WORK(&p->reaper_work, pmc_reaper);
+		schedule_delayed_work(&p->reaper_work,
+				      usecs_to_jiffies(p->reaper_wait));
 	}
 
 	if (static_branch_likely(&pmc_key))
@@ -7184,6 +7222,7 @@ static int __disable_mem_cgroup_cache(struct mem_cgroup *memcg)
 
 		p = nodeinfo->cachep;
 
+		cancel_delayed_work_sync(&p->reaper_work);
 		mem_cgroup_release_cache(p);
 
 		kvfree(p);
@@ -7207,7 +7246,8 @@ static int mem_cgroup_cache_show(struct seq_file *m, void *v)
 	if (!READ_ONCE(memcg->cache_enabled))
 		return -EINVAL;
 
-	seq_printf(m, "%4s %16s %16s\n", "NODE", "WATERMARK", "HOLD_LIMIT");
+	seq_printf(m, "%4s %16s %16s %16s\n", "NODE", "WATERMARK",
+		   "HOLD_LIMIT", "REAPER_TIME");
 	for_each_online_node(nid) {
 		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
 		struct mem_cgroup_per_node_cache *p;
@@ -7216,13 +7256,15 @@ static int mem_cgroup_cache_show(struct seq_file *m, void *v)
 		if (!p)
 			continue;
 
-		seq_printf(m, "%4d %14uKB %14uKB\n", nid,
+		seq_printf(m, "%4d %14uKB %14uKB %16u\n", nid,
 			   (READ_ONCE(p->allow_watermark) << (PAGE_SHIFT - 10)),
-			   (READ_ONCE(p->hold_limit) << (PAGE_SHIFT - 10)));
+			   (READ_ONCE(p->hold_limit) << (PAGE_SHIFT - 10)),
+			   READ_ONCE(p->reaper_wait));
 	}
 
 	seq_puts(m, "===========\n");
-	seq_printf(m, "%4s %16s %16s %16s\n", "NODE", "ZONE", "CACHE", "HIT");
+	seq_printf(m, "%4s %16s %16s %16s %16s\n", "NODE", "ZONE", "CACHE",
+		   "REAPER", "HIT");
 
 	for_each_online_node(nid) {
 		struct mem_cgroup_per_node *nodeinfo = memcg->nodeinfo[nid];
@@ -7242,9 +7284,12 @@ static int mem_cgroup_cache_show(struct seq_file *m, void *v)
 				continue;
 
 			zc = &p->zone_cachep[idx];
-			seq_printf(m, "%4d %16s %14dKB %14dKB\n", nid, z->name,
+			seq_printf(m, "%4d %16s %14dKB %14dKB %14dKB\n", nid,
+				   z->name,
 				   (atomic_read(&zc->nr_pages)
 				    << (PAGE_SHIFT - 10)),
+				   (atomic_read(&zc->nr_reapered)
+				    << (PAGE_SHIFT - 10)),
 				   (atomic_read(&zc->nr_alloced)
 				    << (PAGE_SHIFT - 10)));
 		}
@@ -7257,6 +7302,7 @@ static int mem_cgroup_cache_show(struct seq_file *m, void *v)
 enum {
 	OPT_KEY_NID,
 	OPT_KEY_WATERMARK,
+	OPT_KEY_REAPER_TIME,
 	OPT_KEY_HOLD_LIMIT,
 	OPT_KEY_ERR,
 	NR_PMC_KEY_OPTS = OPT_KEY_ERR
@@ -7265,6 +7311,7 @@ enum {
 static const match_table_t fc_tokens = {
 	{ OPT_KEY_NID, "nid=%d" },
 	{ OPT_KEY_WATERMARK, "watermark=%u" },
+	{ OPT_KEY_REAPER_TIME, "reaper_time=%u" },
 	{ OPT_KEY_HOLD_LIMIT, "limit=%u" },
 	{ OPT_KEY_ERR, NULL}
 };
@@ -7282,6 +7329,12 @@ __apply_status_for_mem_cgroup_cache(struct mem_cgroup_per_node_cache *p,
 			    STATUS_UNSET_DEFAULT_VALUE)
 				p->allow_watermark = opts[OPT_KEY_WATERMARK];
 			break;
+		case OPT_KEY_REAPER_TIME:
+			if (opts[OPT_KEY_REAPER_TIME] !=
+			    STATUS_UNSET_DEFAULT_VALUE)
+				WRITE_ONCE(p->reaper_wait,
+					   opts[OPT_KEY_REAPER_TIME]);
+			break;
 		case OPT_KEY_HOLD_LIMIT:
 			if (opts[OPT_KEY_HOLD_LIMIT] !=
 			    STATUS_UNSET_DEFAULT_VALUE)
@@ -7319,7 +7372,7 @@ mem_cgroup_apply_cache_status(struct mem_cgroup *memcg,
 }
 
 /**
- * Support nid=x,watermark=bytes,limit=bytes args
+ * Support nid=x,watermark=bytes,limit=bytes,reaper_time=us args
  */
 static int __mem_cgroup_cache_control_key(char *buf,
 					      struct mem_cgroup *memcg)
@@ -7361,6 +7414,14 @@ static int __mem_cgroup_cache_control_key(char *buf,
 				return -EINVAL;
 			opts[OPT_KEY_WATERMARK] = v;
 			break;
+		case OPT_KEY_REAPER_TIME:
+			if (match_uint(&args[0], &v))
+				return -EINVAL;
+#define MAX_REAPER_TIME ((10 * 1000 * 1000))
+			if (v > MAX_REAPER_TIME)
+				return -EINVAL;
+			opts[OPT_KEY_REAPER_TIME] = v;
+			break;
 		case OPT_KEY_HOLD_LIMIT:
 			if (match_uint(&args[0], &v))
 				return -EINVAL;
@@ -7402,7 +7463,9 @@ static const match_table_t ctrl_tokens = {
  *   1. nid=x: if given, only change the target node's cache status; else all nodes.
  *   2. watermark=bytes: change cache hold behavior; pages are cached only when
  *      zone free pages exceed high watermark + watermark.
- *   3. limit=bytes: change the max pages the cache can hold, up to 500MB.
+ *   3. reaper_time=us: change the reaper interval (default 5s; 0 disables it;
+ *      max 10s).
+ *   4. limit=bytes: change the max pages the cache can hold, up to 500MB.
  * Enable and keys can be given together, separated by a space, so keys can be
  * set right after enable; if the cache is not enabled, keys cannot be set.
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54c4d00c2506..1fe02f4f3b33 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1310,6 +1310,7 @@ int mem_cgroup_release_cache(struct mem_cgroup_per_node_cache *nodep)
 		}
 
 		num += i;
+		atomic_add(i, &zc->nr_reapered);
 		atomic_sub(i, &zc->nr_pages);
 	}
 
-- 
2.45.2




* [RFC PATCH 4/4] mm: memcg: pmc: support oom release
  2024-07-02  8:44 [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
                   ` (2 preceding siblings ...)
  2024-07-02  8:44 ` [RFC PATCH 3/4] mm: memcg: pmc: support reaper Huan Yang
@ 2024-07-02  8:44 ` Huan Yang
  2024-07-02 19:27 ` [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Roman Gushchin
  4 siblings, 0 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-02  8:44 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Huan Yang, Kairui Song, cgroups, linux-mm, linux-kernel,
	Christian Brauner, Jan Kara
  Cc: opensource.kernel

This patch makes each memcg with PMC enabled register an OOM listener,
so that when an OOM is about to trigger, all held pages are released.

Signed-off-by: Huan Yang <link@vivo.com>
---
 include/linux/mmzone.h |  3 +++
 mm/memcontrol.c        | 31 +++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b56dd462232b..640a9cf51791 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -622,6 +622,9 @@ struct mem_cgroup_per_node_cache {
 	unsigned int reaper_wait;
 	struct delayed_work reaper_work;
 
+	/* listen for OOM events and release the held cache */
+	struct notifier_block oom_nb;
+
 	/* max number of pages to hold; unit pages, default 100MB */
 #define DEFAULT_PMC_HOLD_LIMIT ((100 << 20) >> PAGE_SHIFT)
 	unsigned int hold_limit;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae6917de91cc..3dfb2a17c1fd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7101,6 +7101,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+/**
+ * This function listens for OOM events; when an OOM is about to trigger,
+ * it checks and releases all held pages.
+ */
+static int pmc_oom_notify(struct notifier_block *self,
+				     unsigned long notused, void *nfreed)
+{
+	struct mem_cgroup_per_node_cache *node_cachep =
+		container_of(self, struct mem_cgroup_per_node_cache, oom_nb);
+	struct mem_cgroup *memcg = node_cachep->memcg;
+
+	unsigned long *nf = (unsigned long *)nfreed;
+
+	rcu_read_lock();
+	if (!css_tryget(&memcg->css)) {
+		rcu_read_unlock();
+		return NOTIFY_STOP;
+	}
+	rcu_read_unlock();
+
+	*nf += mem_cgroup_release_cache(node_cachep);
+
+	css_put(&memcg->css);
+
+	return NOTIFY_OK;
+}
+
 /**
  * This function use to reaper all cache pages by cycling scan.
  * The scan interval time depends on @reaper_wait which can set by `keys` nest
@@ -7176,6 +7203,9 @@ static int __enable_mem_cgroup_cache(struct mem_cgroup *memcg)
 		p->allow_watermark = DEFAULT_PMC_GAP_WATERMARK;
 		p->reaper_wait = DEFAULT_PMC_REAPER_TIME;
 
+		p->oom_nb.notifier_call = pmc_oom_notify;
+		register_oom_notifier(&p->oom_nb);
+
 		atomic_inc(&pmc_nr_enabled);
 
 		INIT_DELAYED_WORK(&p->reaper_work, pmc_reaper);
@@ -7222,6 +7252,7 @@ static int __disable_mem_cgroup_cache(struct mem_cgroup *memcg)
 
 		p = nodeinfo->cachep;
 
+		unregister_oom_notifier(&p->oom_nb);
 		cancel_delayed_work_sync(&p->reaper_work);
 		mem_cgroup_release_cache(p);
 
-- 
2.45.2




* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-02  8:44 [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
                   ` (3 preceding siblings ...)
  2024-07-02  8:44 ` [RFC PATCH 4/4] mm: memcg: pmc: support oom release Huan Yang
@ 2024-07-02 19:27 ` Roman Gushchin
  2024-07-03  2:23   ` Huan Yang
  4 siblings, 1 reply; 12+ messages in thread
From: Roman Gushchin @ 2024-07-02 19:27 UTC (permalink / raw)
  To: Huan Yang
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel

On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
> This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).
> 
> Background
> ===
> 
> Modern computer systems always have performance gaps between hardware
> components, such as the performance differences between CPU, memory, and
> disk. Due to the principle of locality of reference in data access:
> 
>   Programs often access data that has been accessed before
>   Programs access the next set of data after accessing a particular piece of data
> As a result:
>   1. CPU caches are used to speed up access to already-accessed data
>      in memory
>   2. Disk prefetching techniques are used to prepare the next set of data
>      to be accessed in advance (to avoid direct disk access)
> This basic exploitation of locality greatly enhances computer performance.
> 
> PMC (per-MEMCG-cache) is similar: it exploits the principle of locality to
> enhance program performance.
> 
> In modern computers, especially in smartphones, services are provided to
> users on a per-application basis (such as Camera, Chat, etc.),
> where an application is composed of multiple processes working together to
> provide services.
> 
> The basic unit for managing resources in a computer is the process,
> which in turn uses threads to share memory and accomplish tasks.
> Memory is shared among threads within a process.
> 
> However, modern computers have the following locality deficiencies:
> 
>   1. Different forms of memory exist and are not interconnected (anonymous
>      pages, file pages, special memory such as DMA-BUF, various memory
>      allocated in kernel mode, etc.)
>   2. Memory isolation exists between processes; apart from explicitly
>      shared memory, they do not communicate with each other.
>   3. During a transition of functionality within an application, one process
>      usually releases memory while another process requests memory, and in
>      this process memory has to be obtained from the lowest level through
>      competition.
> 
> Take the camera application as an example:
> 
> Camera applications typically provide photo capture services as well as photo
> preview services.
> The photo capture process usually utilizes DMA-BUF to facilitate the sharing
> of image data between the CPU and DMA devices.
> When it comes to image preview, multiple algorithm processes are typically
> involved in processing the image data, which may also involve heap memory
> and other resources.
> 
> During the switch between photo capture and preview, the application typically
> needs to release DMA-BUF memory and then the algorithms need to allocate
> heap memory. The flow of system memory during this process is managed by
> the PCP-BUDDY system.
> 
> However, the PCP and BUDDY systems are shared, so subsequently requested
> memory may not be available because previously released memory has already
> been used by others (such as for file reading), requiring a competitive
> (memory reclamation) process to obtain it.
> 
> So, if the released memory can be allocated again with high priority within
> the application, this meets the locality requirement, improves performance,
> and avoids unnecessary memory reclaim.
> 
> PMC solutions are similar to PCP, as they both establish cache pools according
> to certain rules.
> 
> Why base on MEMCG?
> ===
> 
> The MEMCG container can allocate selected processes to a MEMCG based on certain
> grouping strategies (typical examples include grouping by app or UID).
> Processes within the same MEMCG can then be used for statistics, upper limit
> restrictions, and reclamation control.
> 
> All processes within a MEMCG are considered as a single memory unit,
> sharing memory among themselves. As a result, when one process releases
> memory, another process within the same group can obtain it with the
> highest priority, fully utilizing the locality of memory allocation
> characteristics within the MEMCG (such as APP grouping).
> 
> In addition, MEMCG provides feature interfaces that can be dynamically toggled
> and are fully controllable by policy. This provides greater flexibility
> and does not impact performance when not enabled (controlled through a static key).
> 
> 
> About the PMC implementation
> ===
> Here, a cache switch is provided for each MEMCG (not on the root).
> When the user enables the cache, processes within the MEMCG will share memory
> through this cache.
> 
> The cache pool is positioned before the PCP. All order-0 pages released by
> processes in the MEMCG are released to the cache pool first, and when memory
> is requested, it is also obtained from the cache pool with priority.
> 
> `memory.cache` is the sole entry point for controlling PMC; it accepts the
> following nested keys:
>   1. "enable=[y|n]" to enable or disable the targeted MEMCG's cache
>   2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" to control an already
>   enabled PMC's behavior:
>     a) `nid` targets a node whose keys should change; otherwise all nodes
>        are changed.
>     b) `watermark` controls caching behavior: on release, pages are cached
>        only when a zone's free pages exceed the zone's high watermark plus
>        this watermark. (unit bytes, default 50MB, min 10MB per-node-all-zone)
>     c) `reaper_time` controls the reaper interval; each time it elapses, all
>        cache in this MEMCG is reaped. (unit us, default 5s, 0 disables it)
>     d) `limit` caps the maximum memory used by the cache pool. (unit bytes,
>        default 100MB, max 500MB per-node-all-zone)
> 
> Performance
> ===
> PMC is based on MEMCG and requires performance measurement through the
> sharing of complex workloads between application processes.
> Therefore, at the moment, we are unable to provide a better testing solution
> for this patchset.
> 
> Here are the internal test results we can provide, using the camera
> application as an example. (1 node, 1 zone, 8GB RAM)
> 
> Test Case: Capture in rear portrait HDR mode
> 1. Test mode: rear portrait HDR mode. This scene needs more than 800MB of RAM,
>    with memory types including DMA-BUF (470MB), PSS (150MB), and APU (200MB)
> 2. Test steps: take a photo, then click the thumbnail to view the full image
> 
> The overall latency from clicking the shutter button to showing the whole
> image improves by 500ms, and the total slowpath cost of all camera threads is
> reduced from 958ms to 495ms.
> Especially for shot-to-shot in this mode, the preview delay of each frame
> shows a significant improvement.

Hello Huan,

thank you for sharing your work.

Some high-level thoughts:
1) Naming is hard, but it took me quite a while to realize that you're talking
about free memory. Cache is obviously an overloaded term, but per-memcg-cache
can mean absolutely anything (pagecache? cpu cache? ...), so maybe it's not
the best choice.
2) Overall an idea to have a per-memcg free memory pool makes sense to me,
especially if we talk 2MB or 1GB pages (or order > 0 in general).
3) You absolutely have to integrate the reclaim mechanism with a generic
memory reclaim mechanism, which is driven by the memory pressure.
4) You claim a ~50% performance win in your workload, which is a lot. It's not
clear to me where it's coming from. It's hard to believe the page allocation/release
paths are taking 50% of the cpu time. Please, clarify.

There are a lot of other questions, and you highlighted some of them below
(and these are indeed right questions to ask), but let's start with something.

Thanks



* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-02 19:27 ` [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Roman Gushchin
@ 2024-07-03  2:23   ` Huan Yang
  2024-07-03 17:27     ` Shakeel Butt
  2024-07-03 22:59     ` T.J. Mercier
  0 siblings, 2 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-03  2:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel


在 2024/7/3 3:27, Roman Gushchin 写道:
> On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
>> This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).
>>
>> Background
>> ===
>>
>> Modern computer systems always have performance gaps between hardware
>> components, such as the performance differences between CPU, memory, and
>> disk. Due to the principle of locality of reference in data access:
>>
>>    Programs often access data that has been accessed before
>>    Programs access the next set of data after accessing a particular piece of data
>> As a result:
>>    1. CPU caches are used to speed up access to already-accessed data
>>       in memory
>>    2. Disk prefetching techniques are used to prepare the next set of data
>>       to be accessed in advance (to avoid direct disk access)
>> This basic exploitation of locality greatly enhances computer performance.
>>
>> PMC (per-MEMCG-cache) is similar: it exploits the principle of locality to
>> enhance program performance.
>>
>> In modern computers, especially in smartphones, services are provided to
>> users on a per-application basis (such as Camera, Chat, etc.),
>> where an application is composed of multiple processes working together to
>> provide services.
>>
>> The basic unit for managing resources in a computer is the process,
>> which in turn uses threads to share memory and accomplish tasks.
>> Memory is shared among threads within a process.
>>
>> However, modern computers have the following locality deficiencies:
>>
>>    1. Different forms of memory exist and are not interconnected (anonymous
>>       pages, file pages, special memory such as DMA-BUF, various memory
>>       allocated in kernel mode, etc.)
>>    2. Memory isolation exists between processes; apart from explicitly
>>       shared memory, they do not communicate with each other.
>>    3. During a transition of functionality within an application, one process
>>       usually releases memory while another process requests memory, and in
>>       this process memory has to be obtained from the lowest level through
>>       competition.
>>
>> Take the camera application as an example:
>>
>> Camera applications typically provide photo capture services as well as photo
>> preview services.
>> The photo capture process usually utilizes DMA-BUF to facilitate the sharing
>> of image data between the CPU and DMA devices.
>> When it comes to image preview, multiple algorithm processes are typically
>> involved in processing the image data, which may also involve heap memory
>> and other resources.
>>
>> During the switch between photo capture and preview, the application typically
>> needs to release DMA-BUF memory and then the algorithms need to allocate
>> heap memory. The flow of system memory during this process is managed by
>> the PCP-BUDDY system.
>>
>> However, the PCP and BUDDY systems are shared, so subsequently requested
>> memory may not be available because previously released memory has already
>> been used by others (such as for file reading), requiring a competitive
>> (memory reclamation) process to obtain it.
>>
>> So, if the released memory can be allocated again with high priority within
>> the application, this meets the locality requirement, improves performance,
>> and avoids unnecessary memory reclaim.
>>
>> PMC solutions are similar to PCP, as they both establish cache pools according
>> to certain rules.
>>
>> Why base on MEMCG?
>> ===
>>
>> The MEMCG container can allocate selected processes to a MEMCG based on certain
>> grouping strategies (typical examples include grouping by app or UID).
>> Processes within the same MEMCG can then be used for statistics, upper limit
>> restrictions, and reclamation control.
>>
>> All processes within a MEMCG are considered as a single memory unit,
>> sharing memory among themselves. As a result, when one process releases
>> memory, another process within the same group can obtain it with the
>> highest priority, fully utilizing the locality of memory allocation
>> characteristics within the MEMCG (such as APP grouping).
>>
>> In addition, MEMCG provides feature interfaces that can be dynamically toggled
>> and are fully controllable by policy. This provides greater flexibility
>> and does not impact performance when not enabled (controlled through a static key).
>>
>>
>> About the PMC implementation
>> ===
>> Here, a cache switch is provided for each MEMCG (not on the root).
>> When the user enables the cache, processes within the MEMCG will share memory
>> through this cache.
>>
>> The cache pool is positioned before the PCP. All order-0 pages released by
>> processes in the MEMCG are released to the cache pool first, and when memory
>> is requested, it is also obtained from the cache pool with priority.
>>
>> `memory.cache` is the sole entry point for controlling PMC. The nested
>> keys below control it (a usage sketch follows the list):
>>    1. "enable=[y|n]" enables or disables the targeted MEMCG's cache
>>    2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" tunes an already
>>    enabled PMC:
>>      a) `nid` targets a single node whose keys should change; if omitted,
>>         all nodes are affected.
>>      b) `watermark` controls caching behavior: pages are cached on release
>>         only while the zone's free pages exceed the zone high watermark plus
>>         this value (unit: bytes, default 50MB, min 10MB per-node-all-zone)
>>      c) `reaper_time` controls the reaper interval; each time it expires, all
>>         cache in this MEMCG is reaped (unit: us, default 5s, 0 disables it)
>>      d) `limit` caps the maximum memory used by the cache pool (unit: bytes,
>>         default 100MB, max 500MB per-node-all-zone)
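>>
>> For example, from userspace (the cgroup path here is an assumption; the
>> key syntax is as described above):
>>
>>   #include <fcntl.h>
>>   #include <string.h>
>>   #include <unistd.h>
>>
>>   static int write_cache(const char *val)
>>   {
>>           int fd = open("/sys/fs/cgroup/camera/memory.cache", O_WRONLY);
>>           ssize_t ret;
>>
>>           if (fd < 0)
>>                   return -1;
>>           ret = write(fd, val, strlen(val));
>>           close(fd);
>>           return ret < 0 ? -1 : 0;
>>   }
>>
>>   int main(void)
>>   {
>>           /* Enable the cache, then allow up to 200MB pooled on node 0,
>>            * reaped every 10 seconds (values in bytes and microseconds). */
>>           if (write_cache("enable=y"))
>>                   return 1;
>>           return write_cache("keys=nid=0,watermark=67108864,"
>>                              "reaper_time=10000000,limit=209715200") ? 1 : 0;
>>   }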
>>
>> Performance
>> ===
>> PMC is based on MEMCG and requires performance measurement through the
>> sharing of complex workloads between application processes.
>> Therefore, at the moment, we are unable to provide a better testing
>> solution for this patchset.
>>
>> Here is the internal testing data we can provide, using the camera
>> application as an example. (1 node, 1 zone, 8GB RAM)
>>
>> Test Case: Capture in rear portrait HDR mode
>> 1. Test mode: rear portrait HDR mode. This scene needs more than 800MB of
>>    RAM, with memory types including dmabuf (470MB), PSS (150MB) and
>>    APU (200MB)
>> 2. Test steps: take a photo, then click the thumbnail to view the full image
>>
>> The overall latency from clicking the shutter button to showing the whole
>> image improves by 500ms, and the total slowpath cost of all camera threads
>> is reduced from 958ms to 495ms.
>> In particular, for shot-to-shot in this mode, the preview delay of each
>> frame improves significantly.
> Hello Huan,
>
> thank you for sharing your work.
thanks
>
> Some high-level thoughts:
> 1) Naming is hard, but it took me quite a while to realize that you're talking
Haha, sorry for my poor English.
> about free memory. Cache is obviously an overloaded term, but per-memcg-cache
> can mean absolutely anything (pagecache? cpu cache? ...), so maybe it's not

Currently, my idea is that all memory released by processes under the
memcg goes into the `cache`; its original attributes are ignored, and it
can be freely requested by any process under the memcg (so dma-buf, page
cache, heap, driver memory, and so on). Maybe the name PMP would be
friendlier? :)

> the best choice.
> 2) Overall an idea to have a per-memcg free memory pool makes sense to me,
> especially if we talk 2MB or 1GB pages (or order > 0 in general).
I like it too :)
> 3) You absolutely have to integrate the reclaim mechanism with a generic
> memory reclaim mechanism, which is driven by the memory pressure.
Yes, I have been thinking about that as well.
> 4) You claim a ~50% performance win in your workload, which is a lot. It's not
> clear to me where it's coming from. It's hard to believe the page allocation/release
> paths are taking 50% of the cpu time. Please, clarify.

Let me describe it more specifically. In our test scenario, we have 8GB
of RAM, and our camera application has a complex set of algorithms with
a peak memory requirement of up to 3GB.

Therefore, in a multi-application background scenario, starting the
camera and taking photos creates very high memory pressure. Under these
conditions, any released memory is quickly taken by other processes
(for file pages, for example).

So, while switching from camera capture to preview, DMA-BUF memory is
released at the same time as the preview algorithms request memory.

We have to take the slow path many times to obtain enough memory for the
preview algorithms, and the just-released DMA-BUF memory does not seem
to help much.

But with PMC (let's call it that for now), we can quickly satisfy the
memory needs of the subsequent preview stage with the just-released
DMA-BUF memory, without going through the slow path, which results in a
significant performance improvement.

(Of course, breaking the migratetype may not be good.)

>
> There are a lot of other questions, and you highlighted some of them below
> (and these are indeed right questions to ask), but let's start with something.
>
> Thanks
Thanks


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-03  2:23   ` Huan Yang
@ 2024-07-03 17:27     ` Shakeel Butt
  2024-07-04  2:49       ` Huan Yang
  2024-07-03 22:59     ` T.J. Mercier
  1 sibling, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2024-07-03 17:27 UTC (permalink / raw)
  To: Huan Yang
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Muchun Song,
	Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel

On Wed, Jul 03, 2024 at 10:23:35AM GMT, Huan Yang wrote:
> 
> 在 2024/7/3 3:27, Roman Gushchin 写道:
[...]
> > 4) You claim a ~50% performance win in your workload, which is a lot. It's not
> > clear to me where it's coming from. It's hard to believe the page allocation/release
> > paths are taking 50% of the cpu time. Please, clarify.
> 
>> Let me describe it more specifically. In our test scenario, we have 8GB
>> of RAM, and our camera application has a complex set of algorithms with
>> a peak memory requirement of up to 3GB.
>>
>> Therefore, in a multi-application background scenario, starting the
>> camera and taking photos creates very high memory pressure. Under these
>> conditions, any released memory is quickly taken by other processes
>> (for file pages, for example).
>>
>> So, while switching from camera capture to preview, DMA-BUF memory is
>> released at the same time as the preview algorithms request memory.
>>
>> We have to take the slow path many times to obtain enough memory for the
>> preview algorithms, and the just-released DMA-BUF memory does not seem
>> to help much.
>>
>> But with PMC (let's call it that for now), we can quickly satisfy the
>> memory needs of the subsequent preview stage with the just-released
>> DMA-BUF memory, without going through the slow path, which results in a
>> significant performance improvement.
>>
>> (Of course, breaking the migratetype may not be good.)
> 

Please correct me if I am wrong. IIUC you have applications with
different latency or performance requirements running on the same
system, but the system is memory constrained. You want applications with
stringent performance requirements to go less into the allocation
slowpath, and want the lower priority (or no perf requirement)
applications to do more slowpath work (reclaim/compaction) for
themselves as well as for the high priority applications.

What about allocations from softirqs, or non-memcg-aware kernel
allocations?

An alternative approach would be watermark based: low priority
applications (or kswapds) do reclaim/compaction at a higher, newly
defined watermark, while the higher priority applications are protected
through the usual memcg protection.
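
Something along these lines, entirely hypothetical (the helper and the
field below do not exist today):

  static bool zone_ok_for_alloc(struct zone *z, struct mem_cgroup *memcg,
                                unsigned long free_pages)
  {
          unsigned long mark = high_wmark_pages(z);

          /* Low-priority groups allocate against a raised watermark and
           * so enter reclaim/compaction first; high-priority groups keep
           * the normal mark plus memory.low/min protection. */
          if (memcg_low_priority(memcg))          /* made-up helper */
                  mark += z->prio_wmark_boost;    /* the new watermark */

          return free_pages > mark;
  }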

I can see another use-case for whatever solution we come up with, and
that is a reliable userspace oom-killer.

Shakeel



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-03  2:23   ` Huan Yang
  2024-07-03 17:27     ` Shakeel Butt
@ 2024-07-03 22:59     ` T.J. Mercier
  2024-07-04  2:29       ` Huan Yang
  1 sibling, 1 reply; 12+ messages in thread
From: T.J. Mercier @ 2024-07-03 22:59 UTC (permalink / raw)
  To: Huan Yang
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel

On Tue, Jul 2, 2024 at 7:23 PM Huan Yang <link@vivo.com> wrote:
>
>
> 在 2024/7/3 3:27, Roman Gushchin 写道:
> > On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
[...]
> > 4) You claim a ~50% performance win in your workload, which is a lot. It's not
> > clear to me where it's coming from. It's hard to believe the page allocation/release
> > paths are taking 50% of the cpu time. Please, clarify.
>
> Let me describe it more specifically. In our test scenario, we have 8GB
> of RAM, and our camera application has a complex set of algorithms with
> a peak memory requirement of up to 3GB.
>
> Therefore, in a multi-application background scenario, starting the
> camera and taking photos creates very high memory pressure. Under these
> conditions, any released memory is quickly taken by other processes
> (for file pages, for example).
>
> So, while switching from camera capture to preview, DMA-BUF memory is
> released at the same time as the preview algorithms request memory.
>
> We have to take the slow path many times to obtain enough memory for the
> preview algorithms, and the just-released DMA-BUF memory does not seem
> to help much.
>
Hi Huan,

I find this part surprising. Assuming the dmabuf memory doesn't first
go into a page pool (used for some buffers, not all) and actually does
get freed synchronously with fput, this would mean it gets sucked up
by other supposedly background processes before it can be allocated by
the preview process. I thought the preview process was the one most
desperate for memory? You mention file pages, but where is this
newly-freed memory actually going if not to the preview process? My
initial reaction was the same as Roman's that the PMC should be hooked
up to reclaim instead of depending on the reaper. But I think this
might suggest that wouldn't work because the system is under such high
memory pressure that it'd be likely reclaim would have emptied the
PMCs before the preview process could use it.

One more thing I find odd is that for this to work a significant
portion of your dmabuf pages would have to be order 0, but we're
talking about a ~500M buffer. Does whatever exports this buffer not
try to use higher order pages like here?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54
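
That is, roughly (paraphrasing the tree at the link above):

  #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
  #define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
  #define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
                                  | __GFP_NORETRY) & ~__GFP_RECLAIM) \
                                  | __GFP_COMP)
  static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
  static const unsigned int orders[] = {8, 4, 0};

so I'd have expected a good chunk of the buffer to come from order-8
and order-4 allocations.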

Thanks!
-T.J.

[...]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-03 22:59     ` T.J. Mercier
@ 2024-07-04  2:29       ` Huan Yang
  2024-07-09  0:11         ` T.J. Mercier
  0 siblings, 1 reply; 12+ messages in thread
From: Huan Yang @ 2024-07-04  2:29 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel


在 2024/7/4 6:59, T.J. Mercier 写道:
> On Tue, Jul 2, 2024 at 7:23 PM Huan Yang <link@vivo.com> wrote:
>>
>> 在 2024/7/3 3:27, Roman Gushchin 写道:
>>> On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
[...]
>> So, while switching from camera capture to preview, DMA-BUF memory is
>> released at the same time as the preview algorithms request memory.
>>
>> We have to take the slow path many times to obtain enough memory for the
>> preview algorithms, and the just-released DMA-BUF memory does not seem
>> to help much.
>>
> Hi Huan,
Hi T.J.
>
> I find this part surprising. Assuming the dmabuf memory doesn't first
> go into a page pool (used for some buffers, not all) and actually does
Actually, when PMC is enabled, we make page freeing bypass the page pool.
> get freed synchronously with fput, this would mean it gets sucked up
> by other supposedly background processes before it can be allocated by
> the preview process. I thought the preview process was the one most
> desperate for memory? You mention file pages, but where is this
> newly-freed memory actually going if not to the preview process? My
This was discovered through a meminfo observation program.
When the dma-buf is released, there is a noticeable increase in cache.

This may be triggered by the page cache when loading the algorithm model.

Additionally, the algorithm's heap memory cannot benefit from the release
of the dma-buf. I believe this is related to the migratetype: the
stack/heap cannot get priority access to the dma-buf memory released by
the kernel (HIGHUSER_MOVABLE).

So PMC breaks that barrier and shares all of the memory, even if that is
not strictly correct. :) (If my understanding of the fragmentation issue
is wrong, please correct me.)

> initial reaction was the same as Roman's that the PMC should be hooked
> up to reclaim instead of depending on the reaper. But I think this
> might suggest that wouldn't work because the system is under such high
> memory pressure that it'd be likely reclaim would have emptied the
> PMCs before the preview process could use it.
The point you raised is indeed very likely to happen, as there is
immense memory pressure.
Currently, we only open the PMC when the application is in the
foreground, and close it when it goes to the background.
It is indeed unnecessary to drain the PMC while the application is in
the foreground, and a longer reaper timeout would be more useful.
(Thanks to the flexibility provided by memcg.)
>
> One more thing I find odd is that for this to work a significant
> portion of your dmabuf pages would have to be order 0, but we're
> talking about a ~500M buffer. Does whatever exports this buffer not
> try to use higher order pages like here?
Yes, actually our heap is configured with orders 8, 4, 0, but in our
practical application and observation, high-order allocations are often
hard to satisfy, so falling back to order 0 is the most common case.
Therefore, for our MID_ORDER allocations we use LOW_ORDER_GFP.
Just like the testing scenario I mentioned earlier, with 8GB of RAM and
the camera peaking at around 3GB, fragmentation at that point causes
most of the DMA-BUF allocations to fall back to order 0.
PMC is intended for real-world, high-load applications; I don't think
it's very practical for regular applications.
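
(For reference, the fallback in the heap T.J. linked is roughly the
usual system-heap loop below; under heavy fragmentation alloc_pages()
fails for orders 8 and 4 and everything lands at order 0. orders[] and
order_flags[] are the {8, 4, 0} tables from that file.)

  static struct page *alloc_largest_available(unsigned long size,
                                              unsigned int max_order)
  {
          struct page *page;
          int i;

          for (i = 0; i < NUM_ORDERS; i++) {
                  if (size < (PAGE_SIZE << orders[i]))
                          continue;
                  if (max_order < orders[i])
                          continue;
                  page = alloc_pages(order_flags[i], orders[i]);
                  if (!page)
                          continue;
                  return page;
          }
          return NULL;
  }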

Thanks
HY

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54
>
> Thanks!
> -T.J.
>
[...]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-03 17:27     ` Shakeel Butt
@ 2024-07-04  2:49       ` Huan Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Huan Yang @ 2024-07-04  2:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Muchun Song,
	Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel


在 2024/7/4 1:27, Shakeel Butt 写道:
> On Wed, Jul 03, 2024 at 10:23:35AM GMT, Huan Yang wrote:
>> 在 2024/7/3 3:27, Roman Gushchin 写道:
> [...]
> Please correct me if I am wrong. IIUC you have applications with
> different latency or performance requirements running on the same
> system, but the system is memory constrained. You want applications with
> stringent performance requirements to go less into the allocation
> slowpath, and want the lower priority (or no perf requirement)
> applications to do more slowpath work (reclaim/compaction) for
> themselves as well as for the high priority applications.
Yes, PMC does include the idea of priority control.
On a smartphone, what users perceive most strongly is the foreground app.
In the scenario I described, the camera application should have absolute
priority for memory, and its internal memory usage should be satisfied
first. (Especially when we place the PMC's caching right after the
buddy free.)
>
> What about allocations from softirqs, or non-memcg-aware kernel
> allocations?

Sorry softirqs I can't explain. But, many kernel thread also set into 
root memcg.

In our scenario, we set all processes related to the camera application 
to the same memcg.(both user
and kernel thread)
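
(For example, something like this is enough to move one task into that
group; the cgroup path is specific to our product's hierarchy:)

  #include <stdio.h>
  #include <sys/types.h>

  static int move_to_camera_memcg(pid_t pid)
  {
          /* cgroup v2: writing a PID to cgroup.procs migrates the task. */
          FILE *f = fopen("/sys/fs/cgroup/camera/cgroup.procs", "w");

          if (!f)
                  return -1;
          fprintf(f, "%d\n", (int)pid);
          return fclose(f);
  }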

>
> An alternative approach would be watermark based: low priority
> applications (or kswapds) do reclaim/compaction at a higher, newly
> defined watermark, while the higher priority applications are protected
> through the usual memcg protection.

Also, please correct me if I am wrong.

My understanding is that even with a boost, watermark control cannot
finely choose which applications or processes should reclaim at the
higher watermark; application grouping and selection would need to be
re-implemented.

Through PMC, we can proactively group the processes an application
requires, opening the cache only when the application enters the
foreground and closing it when it goes to the background.

>
> I can see another use-case for whatever solution we come up with, and
> that is a reliable userspace oom-killer.
Yes, LMKD is helpful.
Unfortunately, our product also has other dimensions of assessment,
including application persistence.
This means that when the camera is launched, we can only kill unnecessary
applications to free up a small amount of memory to meet its startup
requirements. And when it then requests memory for taking a photo, the
allocation is still relatively slow during the kill-and-check phase.

And one more thing: the memory released by killing applications may not
meet the instantaneous memory requirement. (Many pages sit compressed in
zram, which is not fast to bring back.)

Thanks,

HY

>
> Shakeel
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE)
  2024-07-04  2:29       ` Huan Yang
@ 2024-07-09  0:11         ` T.J. Mercier
  0 siblings, 0 replies; 12+ messages in thread
From: T.J. Mercier @ 2024-07-09  0:11 UTC (permalink / raw)
  To: Huan Yang
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle),
	David Hildenbrand, Ryan Roberts, Chris Li, Dan Schatzberg,
	Kairui Song, cgroups, linux-mm, linux-kernel, Christian Brauner,
	opensource.kernel

On Wed, Jul 3, 2024 at 7:29 PM Huan Yang <link@vivo.com> wrote:
>
>
> 在 2024/7/4 6:59, T.J. Mercier 写道:
> > On Tue, Jul 2, 2024 at 7:23 PM Huan Yang <link@vivo.com> wrote:
> >>
> >> 在 2024/7/3 3:27, Roman Gushchin 写道:
> >>> On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
[...]
> >> We have to take the slow path many times to obtain enough memory for
> >> the preview algorithms, and the just-released DMA-BUF memory does not
> >> seem to help much.
> >>
> > Hi Huan,
> Hi T.J.
> >
> > I find this part surprising. Assuming the dmabuf memory doesn't first
> > go into a page pool (used for some buffers, not all) and actually does
> Actually, when PMC is enabled, we make page freeing bypass the page pool.
> > get freed synchronously with fput, this would mean it gets sucked up
> > by other supposedly background processes before it can be allocated by
> > the preview process. I thought the preview process was the one most
> > desperate for memory? You mention file pages, but where is this
> > newly-freed memory actually going if not to the preview process? My
> This was discovered through a meminfo observation program.
> When the dma-buf is released, there is a noticeable increase in cache.
>
> This may be triggered by the page cache when loading the algorithm model.
>
> Additionally, the algorithm's heap memory cannot benefit from the release
> of the dma-buf. I believe this is related to the migratetype: the
> stack/heap cannot get priority access to the dma-buf memory released by
> the kernel (HIGHUSER_MOVABLE).
>
> So PMC breaks that barrier and shares all of the memory, even if that is
> not strictly correct. :) (If my understanding of the fragmentation issue
> is wrong, please correct me.)
>
Oh, that would make sense, but then the memory *is* going to your
preview process, just not in the form you were hoping for. So model
loading and your heap allocations were fighting for memory, probably
thrashing the file pages? I guess it's more important for your app's
performance to get the heap allocations done first, and I think I can
understand how PMC would give a sort of priority to those over the
file pages during the preview transition. Ok. Sorry, I don't have an
opinion on this part yet if that's what's happening.

> > initial reaction was the same as Roman's that the PMC should be hooked
> > up to reclaim instead of depending on the reaper. But I think this
> > might suggest that wouldn't work because the system is under such high
> > memory pressure that it'd be likely reclaim would have emptied the
> > PMCs before the preview process could use it.
> The point you raised is indeed very likely to happen, as there is
> immense memory pressure.
> Currently, we only open the PMC when the application is in the
> foreground, and close it when it goes to the background.
> It is indeed unnecessary to drain the PMC while the application is in
> the foreground, and a longer reaper timeout would be more useful.
> (Thanks to the flexibility provided by memcg.)
> >
> > One more thing I find odd is that for this to work a significant
> > portion of your dmabuf pages would have to be order 0, but we're
> > talking about a ~500M buffer. Does whatever exports this buffer not
> > try to use higher order pages like here?
> Yes, actually our heap is configured with orders 8, 4, 0, but in our
> practical application and observation, high-order allocations are often
> hard to satisfy, so falling back to order 0 is the most common case.
> Therefore, for our MID_ORDER allocations we use LOW_ORDER_GFP.
> Just like the testing scenario I mentioned earlier, with 8GB of RAM and
> the camera peaking at around 3GB, fragmentation at that point causes
> most of the DMA-BUF allocations to fall back to order 0.
> PMC is intended for real-world, high-load applications; I don't think
> it's very practical for regular applications.

Got it, thanks.
>
> Thanks
> HY
>
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54
> >
> > Thanks!
> > -T.J.
> >
[...]


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-07-09  0:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-02  8:44 [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Huan Yang
2024-07-02  8:44 ` [RFC PATCH 1/4] mm: memcg: pmc framework Huan Yang
2024-07-02  8:44 ` [RFC PATCH 2/4] mm: memcg: pmc support change attribute Huan Yang
2024-07-02  8:44 ` [RFC PATCH 3/4] mm: memcg: pmc: support reaper Huan Yang
2024-07-02  8:44 ` [RFC PATCH 4/4] mm: memcg: pmc: support oom release Huan Yang
2024-07-02 19:27 ` [RFC PATCH 0/4] Introduce PMC(PER-MEMCG-CACHE) Roman Gushchin
2024-07-03  2:23   ` Huan Yang
2024-07-03 17:27     ` Shakeel Butt
2024-07-04  2:49       ` Huan Yang
2024-07-03 22:59     ` T.J. Mercier
2024-07-04  2:29       ` Huan Yang
2024-07-09  0:11         ` T.J. Mercier
