* [PATCH v4 0/9] mm: workingset reporting
@ 2024-11-27 2:57 Yuanchu Xie
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
This patch series provides workingset reporting of user pages in
lruvecs, whose coldness can be tracked via accessed bits and fd
references. However, the concept of a workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or userspace.
Another interesting idea might be a hugepage workingset, so that we can
measure the proportion of hugepages backing cold memory. However, with
architectures like arm, there may be too many hugepage sizes, leading to
a combinatorial explosion when exporting stats to userspace.
Nonetheless, the kernel should provide a set of workingset interfaces
that is generic enough to accommodate the various use cases, and extensible
to potential future use cases.
Use cases
==========
Job scheduling
On overcommitted hosts, workingset information improves efficiency and
reliability by giving the job scheduler better stats on the exact
memory requirements of each job. This can improve efficiency by packing
more jobs onto the same host or NUMA node. On the other hand, the
job scheduler can also ensure each node has a sufficient amount of memory
and does not enter direct reclaim or the kernel OOM path. With workingset
information and job priority, the userspace OOM killing or proactive
reclaim policy can kick in before the system is under memory pressure.
If the job shape is very different from the machine shape, knowing the
workingset per-node can also help inform page allocation policies.
Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory without impacting a job's performance. While PSI may
provide a reactive measure of when proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.
Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to report
the guest workingset.
Balloon policies benefit from workingset information to more precisely
determine the size of the memory balloon. On end-user devices where
memory is scarce and overcommitted, balloon sizing across multiple VMs
running on the same device can be orchestrated with workingset reports
from each one.
On the server side, workingset reporting allows the balloon controller to
inflate the balloon without causing too much file cache to be reclaimed in
the guest.
Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, consider a promotion hot-page threshold defined as a
reaccess distance of N seconds (promote pages accessed more often than
once every N seconds). The threshold N should be set so that ~80% (e.g.)
of pages on the fast memory node pass the threshold. This calculation
can be done with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].
[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea is that we break down the workingset per-node, per-memcg into time
intervals (ms), e.g.
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
Implementation
==========
The reporting of user pages is based on MGLRU, and therefore requires
CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
fine-grained workingset report, but we can already gather a lot of data
with just four generations. The workingset reporting mechanism is gated
behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
CONFIG_WORKINGSET_REPORT_AGING.
Benchmarks
==========
Ghait Ouled Amar Ben Cheikh implemented a simple policy and ran Linux
compile and redis benchmarks from openbenchmarking.org. The policy and
runner are referred to as WMO (Workload Memory Optimization).
The results were based on v3 of the series, but v4 doesn't change the core
of the working set reporting and just adds the ballooning counterpart.
The timed Linux kernel compilation benchmark shows improvements in peak
memory usage with a policy of "swap out all bytes colder than 10 seconds
every 40 seconds". A swapfile is configured on SSD.
--------------------------------------------
peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%
--------------------------------------------
Benchmark | Experimental |Control | Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
--------------------------------------------
Seconds, fewer is better
The redis benchmark employs the same policy:
--------------------------------------------
peak memory usage (with WMO): 375.9023 MiB
peak memory usage (control): 509.765 MiB
peak memory reduction: 26%
--------------------------------------------
Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev
Redis - LPOP (Reqs/sec) | 2023130 (98.22%) | 2059849 (100%) | 1.2% | 2%
Redis - SADD (Reqs/sec) | 2539662 (98.63%) | 2574811 (100%) | 2.3% | 1.4%
Redis - LPUSH (Reqs/sec)| 2024880 (100%) | 2000884 (98.81%) | 1.1% | 0.8%
Redis - GET (Reqs/sec) | 2835764 (100%) | 2763722 (97.46%) | 2.7% | 1.6%
Redis - SET (Reqs/sec) | 2340723 (100%) | 2327372 (99.43%) | 2.4% | 1.8%
--------------------------------------------
Reqs/sec, more is better
The detailed report and benchmarking results are in Ghait's repo:
https://github.com/miloudi98/WMO
Changelog
==========
Changes from PATCH v3 -> v4:
- Added documentation for cgroup-v2
(Waiman Long)
- Fixed types in documentation
(Randy Dunlap)
- Added implementation for the ballooning use case
- Added detailed description of benchmark results
(Andrew Morton)
Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation
(Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG
Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
(Muhammad Usama Anjum)
- Included more use cases in the cover letter
(Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
to avoid allocating for each lruvec.
[v1] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.com/
[v2] https://lore.kernel.org/linux-mm/20240604020549.1017540-1-yuanchu@google.com/
[v3] https://lore.kernel.org/linux-mm/20240813165619.748102-1-yuanchu@google.com/
Yuanchu Xie (9):
mm: aggregate workingset information into histograms
mm: use refresh interval to rate-limit workingset report aggregation
mm: report workingset during memory pressure driven scanning
mm: extend workingset reporting to memcgs
mm: add kernel aging thread for workingset reporting
selftest: test system-wide workingset reporting
Docs/admin-guide/mm/workingset_report: document sysfs and memcg
interfaces
Docs/admin-guide/cgroup-v2: document workingset reporting
virtio-balloon: add workingset reporting
Documentation/admin-guide/cgroup-v2.rst | 35 +
Documentation/admin-guide/mm/index.rst | 1 +
.../admin-guide/mm/workingset_report.rst | 105 +++
drivers/base/node.c | 6 +
drivers/virtio/virtio_balloon.c | 390 ++++++++++-
include/linux/balloon_compaction.h | 1 +
include/linux/memcontrol.h | 21 +
include/linux/mmzone.h | 13 +
include/linux/workingset_report.h | 167 +++++
include/uapi/linux/virtio_balloon.h | 30 +
mm/Kconfig | 15 +
mm/Makefile | 2 +
mm/internal.h | 19 +
mm/memcontrol.c | 162 ++++-
mm/mm_init.c | 2 +
mm/mmzone.c | 2 +
mm/vmscan.c | 56 +-
mm/workingset_report.c | 653 ++++++++++++++++++
mm/workingset_report_aging.c | 127 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/run_vmtests.sh | 5 +
.../testing/selftests/mm/workingset_report.c | 306 ++++++++
.../testing/selftests/mm/workingset_report.h | 39 ++
.../selftests/mm/workingset_report_test.c | 330 +++++++++
25 files changed, 2482 insertions(+), 9 deletions(-)
create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
create mode 100644 include/linux/workingset_report.h
create mode 100644 mm/workingset_report.c
create mode 100644 mm/workingset_report_aging.c
create mode 100644 tools/testing/selftests/mm/workingset_report.c
create mode 100644 tools/testing/selftests/mm/workingset_report.h
create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 1/9] mm: aggregate workingset information into histograms
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
Hierarchically aggregate all memcgs' MGLRU generations and their
page counts into working set page age histograms.
The histograms break down the system's workingset per-node,
per-anon/file.
The sysfs interfaces are as follows:
/sys/devices/system/node/nodeX/workingset_report/page_age
A per-node page age histogram, showing an aggregate of the
node's lruvecs. The information is extracted from MGLRU's
per-generation page counters. Reading this file causes a
hierarchical aging of all lruvecs, scanning pages and creating a
new generation in each lruvec.
For example:
1000 anon=0 file=0
2000 anon=0 file=0
100000 anon=5533696 file=5566464
18446744073709551615 anon=0 file=0
/sys/devices/system/node/nodeX/workingset_report/page_age_interval
A comma-separated list of times in milliseconds that configures
the intervals the page age histogram uses for aggregation.
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
drivers/base/node.c | 6 +
include/linux/mmzone.h | 9 +
include/linux/workingset_report.h | 79 ++++++
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/internal.h | 5 +
mm/memcontrol.c | 2 +
mm/mm_init.c | 2 +
mm/mmzone.c | 2 +
mm/vmscan.c | 10 +-
mm/workingset_report.c | 451 ++++++++++++++++++++++++++++++
11 files changed, 572 insertions(+), 4 deletions(-)
create mode 100644 include/linux/workingset_report.h
create mode 100644 mm/workingset_report.c
diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..ba5b8720dbfa 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,8 @@
#include <linux/pm_runtime.h>
#include <linux/swap.h>
#include <linux/slab.h>
+#include <linux/memcontrol.h>
+#include <linux/workingset_report.h>
static const struct bus_type node_subsys = {
.name = "node",
@@ -626,6 +628,7 @@ static int register_node(struct node *node, int num)
} else {
hugetlb_register_node(node);
compaction_register_node(node);
+ wsr_init_sysfs(node);
}
return error;
@@ -642,6 +645,9 @@ void unregister_node(struct node *node)
{
hugetlb_unregister_node(node);
compaction_unregister_node(node);
+ wsr_remove_sysfs(node);
+ wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id)));
+ wsr_destroy_pgdat(NODE_DATA(node->dev.id));
node_remove_accesses(node);
node_remove_caches(node);
device_unregister(&node->dev);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 80bc5640bb60..ee728c0c5a3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,6 +24,7 @@
#include <linux/local_lock.h>
#include <linux/zswap.h>
#include <asm/page.h>
+#include <linux/workingset_report.h>
/* Free memory management - zoned buddy allocator. */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
@@ -630,6 +631,9 @@ struct lruvec {
struct lru_gen_mm_state mm_state;
#endif
#endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_WORKINGSET_REPORT
+ struct wsr_state wsr;
+#endif /* CONFIG_WORKINGSET_REPORT */
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
@@ -1424,6 +1428,11 @@ typedef struct pglist_data {
struct lru_gen_memcg memcg_lru;
#endif
+#ifdef CONFIG_WORKINGSET_REPORT
+ struct mutex wsr_update_mutex;
+ struct wsr_report_bins __rcu *wsr_page_age_bins;
+#endif
+
CACHELINE_PADDING(_pad2_);
/* Per-node vmstats */
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
new file mode 100644
index 000000000000..d7c2ee14ec87
--- /dev/null
+++ b/include/linux/workingset_report.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WORKINGSET_REPORT_H
+#define _LINUX_WORKINGSET_REPORT_H
+
+#include <linux/types.h>
+#include <linux/mutex.h>
+
+struct mem_cgroup;
+struct pglist_data;
+struct node;
+struct lruvec;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+
+#define WORKINGSET_REPORT_MIN_NR_BINS 2
+#define WORKINGSET_REPORT_MAX_NR_BINS 32
+
+#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+#define ANON_AND_FILE 2
+
+struct wsr_report_bin {
+ unsigned long idle_age;
+ unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr_report_bins {
+ /* excludes the WORKINGSET_INTERVAL_MAX bin */
+ unsigned long nr_bins;
+ /* last bin contains WORKINGSET_INTERVAL_MAX */
+ unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
+ struct rcu_head rcu;
+};
+
+struct wsr_page_age_histo {
+ unsigned long timestamp;
+ struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct wsr_state {
+ /* breakdown of workingset by page age */
+ struct mutex page_age_lock;
+ struct wsr_page_age_histo *page_age;
+};
+
+void wsr_init_lruvec(struct lruvec *lruvec);
+void wsr_destroy_lruvec(struct lruvec *lruvec);
+void wsr_init_pgdat(struct pglist_data *pgdat);
+void wsr_destroy_pgdat(struct pglist_data *pgdat);
+void wsr_init_sysfs(struct node *node);
+void wsr_remove_sysfs(struct node *node);
+
+/*
+ * Scans and aggregates all lruvecs under @root into the page age
+ * histogram. Returns true if a histogram is allocated.
+ */
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+ struct pglist_data *pgdat);
+#else
+static inline void wsr_init_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_init_sysfs(struct node *node)
+{
+}
+static inline void wsr_remove_sysfs(struct node *node)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
+#endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b016808..be949786796d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,15 @@ config ARCH_HAS_USER_SHADOW_STACK
The architecture has hardware support for userspace shadow call
stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
+config WORKINGSET_REPORT
+ bool "Working set reporting"
+ depends on LRU_GEN && SYSFS
+ help
+ Report system and per-memcg working set to userspace.
+
+ This option exports stats and events giving the user more insight
+ into its memory working set.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b036166..f5ef0768253a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
diff --git a/mm/internal.h b/mm/internal.h
index 64c2eb0b160e..bbd3c1501bac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn;
/*
* in mm/vmscan.c:
*/
+struct scan_control;
+bool isolate_lru_page(struct page *page);
bool folio_isolate_lru(struct folio *folio);
void folio_putback_lru(struct folio *folio);
extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
+ bool force_scan);
+void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);
/*
* in mm/rmap.c:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53db98d2c4a1..096856b35fbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/seq_buf.h>
#include <linux/sched/isolation.h>
#include <linux/kmemleak.h>
+#include <linux/workingset_report.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -3453,6 +3454,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return;
+ wsr_destroy_lruvec(&pn->lruvec);
free_percpu(pn->lruvec_stats_percpu);
kfree(pn->lruvec_stats);
kfree(pn);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ba5607aaf19..b4f7c904ce33 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@
#include <linux/crash_dump.h>
#include <linux/execmem.h>
#include <linux/vmstat.h>
+#include <linux/workingset_report.h>
#include "internal.h"
#include "slab.h"
#include "shuffle.h"
@@ -1378,6 +1379,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
+ wsr_init_pgdat(pgdat);
}
static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f9baa8882fbf..0352a2018067 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec)
*/
list_del(&lruvec->lists[LRU_UNEVICTABLE]);
+ wsr_init_lruvec(lruvec);
+
lru_gen_init_lruvec(lruvec);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 28ba2b06fc7d..89da4d8dfb5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
#include <linux/rculist_nulls.h>
#include <linux/random.h>
#include <linux/mmu_notifier.h>
+#include <linux/workingset_report.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -271,8 +272,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
}
#endif
-static void set_task_reclaim_state(struct task_struct *task,
- struct reclaim_state *rs)
+void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs)
{
/* Check for an overwrite */
WARN_ON_ONCE(rs && task->reclaim_state);
@@ -3861,8 +3861,8 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
return success;
}
-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
- bool can_swap, bool force_scan)
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
+ bool force_scan)
{
bool success;
struct lru_gen_mm_walk *walk;
@@ -5640,6 +5640,8 @@ static int __init init_lru_gen(void)
if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
pr_err("lru_gen: failed to create sysfs group\n");
+ wsr_init_sysfs(NULL);
+
debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
new file mode 100644
index 000000000000..a4dcf62fcd96
--- /dev/null
+++ b/mm/workingset_report.c
@@ -0,0 +1,451 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/export.h>
+#include <linux/lockdep.h>
+#include <linux/jiffies.h>
+#include <linux/kernfs.h>
+#include <linux/memcontrol.h>
+#include <linux/rcupdate.h>
+#include <linux/mutex.h>
+#include <linux/err.h>
+#include <linux/atomic.h>
+#include <linux/node.h>
+#include <linux/mmzone.h>
+#include <linux/mm.h>
+#include <linux/mm_inline.h>
+#include <linux/workingset_report.h>
+
+#include "internal.h"
+
+void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+ mutex_init(&pgdat->wsr_update_mutex);
+ RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL);
+}
+
+void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+ struct wsr_report_bins __rcu *bins;
+
+ mutex_lock(&pgdat->wsr_update_mutex);
+ bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL,
+ lockdep_is_held(&pgdat->wsr_update_mutex));
+ kfree_rcu(bins, rcu);
+ mutex_unlock(&pgdat->wsr_update_mutex);
+ mutex_destroy(&pgdat->wsr_update_mutex);
+}
+
+void wsr_init_lruvec(struct lruvec *lruvec)
+{
+ struct wsr_state *wsr = &lruvec->wsr;
+
+ memset(wsr, 0, sizeof(*wsr));
+ mutex_init(&wsr->page_age_lock);
+}
+
+void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+ struct wsr_state *wsr = &lruvec->wsr;
+
+ mutex_destroy(&wsr->page_age_lock);
+ kfree(wsr->page_age);
+ memset(wsr, 0, sizeof(*wsr));
+}
+
+static int workingset_report_intervals_parse(char *src,
+ struct wsr_report_bins *bins)
+{
+ int err = 0, i = 0;
+ char *cur, *next = strim(src);
+
+ if (*next == '\0')
+ return 0;
+
+ while ((cur = strsep(&next, ","))) {
+ unsigned int interval;
+
+ err = kstrtouint(cur, 0, &interval);
+ if (err)
+ goto out;
+
+ bins->idle_age[i] = msecs_to_jiffies(interval);
+ if (i > 0 && bins->idle_age[i] <= bins->idle_age[i - 1]) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (++i == WORKINGSET_REPORT_MAX_NR_BINS) {
+ err = -ERANGE;
+ goto out;
+ }
+ }
+
+ if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) {
+ err = -ERANGE;
+ goto out;
+ }
+
+ bins->nr_bins = i;
+ bins->idle_age[i] = WORKINGSET_INTERVAL_MAX;
+out:
+ return err ?: i;
+}
+
+static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen,
+ unsigned long seq,
+ unsigned long max_seq,
+ unsigned long curr_timestamp)
+{
+ int younger_gen;
+
+ if (seq == max_seq)
+ return curr_timestamp;
+ younger_gen = lru_gen_from_seq(seq + 1);
+ return READ_ONCE(lrugen->timestamps[younger_gen]);
+}
+
+static void collect_page_age_type(const struct lru_gen_folio *lrugen,
+ struct wsr_report_bin *bin,
+ unsigned long max_seq, unsigned long min_seq,
+ unsigned long curr_timestamp, int type)
+{
+ unsigned long seq;
+
+ for (seq = max_seq; seq + 1 > min_seq; seq--) {
+ int gen, zone;
+ unsigned long gen_end, gen_start, size = 0;
+
+ gen = lru_gen_from_seq(seq);
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
+ size += max(
+ READ_ONCE(lrugen->nr_pages[gen][type][zone]),
+ 0L);
+
+ gen_start = get_gen_start_time(lrugen, seq, max_seq,
+ curr_timestamp);
+ gen_end = READ_ONCE(lrugen->timestamps[gen]);
+
+ while (bin->idle_age != WORKINGSET_INTERVAL_MAX &&
+ time_before(gen_end + bin->idle_age, curr_timestamp)) {
+ unsigned long gen_in_bin = (long)gen_start -
+ (long)curr_timestamp +
+ (long)bin->idle_age;
+ unsigned long gen_len = (long)gen_start - (long)gen_end;
+
+ if (!gen_len)
+ break;
+ if (gen_in_bin) {
+ unsigned long split_bin =
+ size / gen_len * gen_in_bin;
+
+ bin->nr_pages[type] += split_bin;
+ size -= split_bin;
+ }
+ gen_start = curr_timestamp - bin->idle_age;
+ bin++;
+ }
+ bin->nr_pages[type] += size;
+ }
+}
+
+/*
+ * proportionally aggregate Multi-gen LRU bins into a working set report
+ * MGLRU generations:
+ * current time
+ * | max_seq timestamp
+ * | | max_seq - 1 timestamp
+ * | | | unbounded
+ * | | | |
+ * --------------------------------
+ * | max_seq | ... | ... | min_seq
+ * --------------------------------
+ *
+ * Bins:
+ *
+ * current time
+ * | current - idle_age[0]
+ * | | current - idle_age[1]
+ * | | | unbounded
+ * | | | |
+ * ------------------------------
+ * | bin 0 | ... | ... | bin n-1
+ * ------------------------------
+ *
+ * Assume the heuristic that pages are in the MGLRU generation
+ * through uniform accesses, so we can aggregate them
+ * proportionally into bins.
+ */
+static void collect_page_age(struct wsr_page_age_histo *page_age,
+ const struct lruvec *lruvec)
+{
+ int type;
+ const struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ unsigned long curr_timestamp = jiffies;
+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+ unsigned long min_seq[ANON_AND_FILE] = {
+ READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+ READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+ };
+ struct wsr_report_bin *bin = &page_age->bins[0];
+
+ for (type = 0; type < ANON_AND_FILE; type++)
+ collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
+ curr_timestamp, type);
+}
+
+/* First step: hierarchically scan child memcgs. */
+static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
+ struct pglist_data *pgdat)
+{
+ struct mem_cgroup *memcg;
+ unsigned int flags;
+ struct reclaim_state rs = { 0 };
+
+ set_task_reclaim_state(current, &rs);
+ flags = memalloc_noreclaim_save();
+
+ memcg = mem_cgroup_iter(root, NULL, NULL);
+ do {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+
+ /*
+ * setting can_swap=true and force_scan=true ensures
+ * proper workingset stats when the system cannot swap.
+ */
+ try_to_inc_max_seq(lruvec, max_seq, true, true);
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+}
+
+/* Second step: aggregate child memcgs into the page age histogram. */
+static void refresh_aggregate(struct wsr_page_age_histo *page_age,
+ struct mem_cgroup *root,
+ struct pglist_data *pgdat)
+{
+ struct mem_cgroup *memcg;
+ struct wsr_report_bin *bin;
+
+ for (bin = page_age->bins;
+ bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
+ bin->nr_pages[0] = 0;
+ bin->nr_pages[1] = 0;
+ }
+ /* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */
+ bin->nr_pages[0] = 0;
+ bin->nr_pages[1] = 0;
+
+ memcg = mem_cgroup_iter(root, NULL, NULL);
+ do {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+ collect_page_age(page_age, lruvec);
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+ WRITE_ONCE(page_age->timestamp, jiffies);
+}
+
+static void copy_node_bins(struct pglist_data *pgdat,
+ struct wsr_page_age_histo *page_age)
+{
+ struct wsr_report_bins *node_page_age_bins;
+ int i = 0;
+
+ rcu_read_lock();
+ node_page_age_bins = rcu_dereference(pgdat->wsr_page_age_bins);
+ if (!node_page_age_bins)
+ goto nocopy;
+ for (i = 0; i < node_page_age_bins->nr_bins; ++i)
+ page_age->bins[i].idle_age = node_page_age_bins->idle_age[i];
+
+nocopy:
+ page_age->bins[i].idle_age = WORKINGSET_INTERVAL_MAX;
+ rcu_read_unlock();
+}
+
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+ struct pglist_data *pgdat)
+{
+ struct wsr_page_age_histo *page_age;
+
+ if (!READ_ONCE(wsr->page_age))
+ return false;
+
+ refresh_scan(wsr, root, pgdat);
+ mutex_lock(&wsr->page_age_lock);
+ page_age = READ_ONCE(wsr->page_age);
+ if (page_age) {
+ copy_node_bins(pgdat, page_age);
+ refresh_aggregate(page_age, root, pgdat);
+ }
+ mutex_unlock(&wsr->page_age_lock);
+ return !!page_age;
+}
+EXPORT_SYMBOL_GPL(wsr_refresh_report);
+
+static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
+{
+ int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
+ first_memory_node;
+
+ return NODE_DATA(nid);
+}
+
+static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
+{
+ return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
+}
+
+static ssize_t page_age_intervals_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct wsr_report_bins *bins;
+ int len = 0;
+ struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+ rcu_read_lock();
+ bins = rcu_dereference(pgdat->wsr_page_age_bins);
+ if (bins) {
+ int i;
+ int nr_bins = bins->nr_bins;
+
+ for (i = 0; i < bins->nr_bins; ++i) {
+ len += sysfs_emit_at(
+ buf, len, "%u",
+ jiffies_to_msecs(bins->idle_age[i]));
+ if (i + 1 < nr_bins)
+ len += sysfs_emit_at(buf, len, ",");
+ }
+ }
+ len += sysfs_emit_at(buf, len, "\n");
+ rcu_read_unlock();
+
+ return len;
+}
+
+static ssize_t page_age_intervals_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *src, size_t len)
+{
+ struct wsr_report_bins *bins = NULL, __rcu *old;
+ char *buf = NULL;
+ int err = 0;
+ struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+ buf = kstrdup(src, GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto failed;
+ }
+
+ bins =
+ kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL);
+
+ if (!bins) {
+ err = -ENOMEM;
+ goto failed;
+ }
+
+ err = workingset_report_intervals_parse(buf, bins);
+ if (err < 0)
+ goto failed;
+
+ if (err == 0) {
+ kfree(bins);
+ bins = NULL;
+ }
+
+ mutex_lock(&pgdat->wsr_update_mutex);
+ old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins,
+ lockdep_is_held(&pgdat->wsr_update_mutex));
+ mutex_unlock(&pgdat->wsr_update_mutex);
+ kfree_rcu(old, rcu);
+ kfree(buf);
+ return len;
+failed:
+ kfree(bins);
+ kfree(buf);
+
+ return err;
+}
+
+static struct kobj_attribute page_age_intervals_attr =
+ __ATTR_RW(page_age_intervals);
+
+static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct wsr_report_bin *bin;
+ int ret = 0;
+ struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+
+ mutex_lock(&wsr->page_age_lock);
+ if (!wsr->page_age)
+ wsr->page_age =
+ kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
+ mutex_unlock(&wsr->page_age_lock);
+
+ wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+
+ mutex_lock(&wsr->page_age_lock);
+ if (!wsr->page_age)
+ goto unlock;
+ for (bin = wsr->page_age->bins;
+ bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++)
+ ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n",
+ jiffies_to_msecs(bin->idle_age),
+ bin->nr_pages[0] * PAGE_SIZE,
+ bin->nr_pages[1] * PAGE_SIZE);
+
+ ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n",
+ WORKINGSET_INTERVAL_MAX,
+ bin->nr_pages[0] * PAGE_SIZE,
+ bin->nr_pages[1] * PAGE_SIZE);
+
+unlock:
+ mutex_unlock(&wsr->page_age_lock);
+ return ret;
+}
+
+static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
+
+static struct attribute *workingset_report_attrs[] = {
+ &page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+};
+
+static const struct attribute_group workingset_report_attr_group = {
+ .name = "workingset_report",
+ .attrs = workingset_report_attrs,
+};
+
+void wsr_init_sysfs(struct node *node)
+{
+ struct kobject *kobj = node ? &node->dev.kobj : mm_kobj;
+ struct wsr_state *wsr;
+
+ if (IS_ENABLED(CONFIG_NUMA) && !node)
+ return;
+
+ wsr = kobj_to_wsr(kobj);
+
+ if (sysfs_create_group(kobj, &workingset_report_attr_group))
+ pr_warn("Workingset report failed to create sysfs files\n");
+}
+EXPORT_SYMBOL_GPL(wsr_init_sysfs);
+
+void wsr_remove_sysfs(struct node *node)
+{
+ struct kobject *kobj = &node->dev.kobj;
+ struct wsr_state *wsr;
+
+ if (IS_ENABLED(CONFIG_NUMA) && !node)
+ return;
+
+ wsr = kobj_to_wsr(kobj);
+ sysfs_remove_group(kobj, &workingset_report_attr_group);
+}
+EXPORT_SYMBOL_GPL(wsr_remove_sysfs);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v4 2/9] mm: use refresh interval to rate-limit workingset report aggregation
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 1/9] mm: aggregate workingset information into histograms Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 3/9] mm: report workingset during memory pressure driven scanning Yuanchu Xie
` (7 subsequent siblings)
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
The refresh interval rate-limits reads of the workingset page age
histogram. When a workingset report is generated, the oldest
timestamp across all the lruvecs is stored as the timestamp of the
report. The same report is served until it ages beyond the refresh
interval, at which point a new report is generated.
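The expiry check described above can be sketched in userspace C. This is illustrative only: the struct and function names are made up, and the jiffies-style wraparound arithmetic is modeled on, not copied from, the patch.

```c
#include <stdbool.h>

/* Illustrative sketch of the rate-limit check described above: a
 * report generated at `timestamp` stays valid for `refresh_interval`
 * ticks. Hypothetical names; this is not the kernel code itself.
 */
struct report {
	unsigned long timestamp; /* 0 means no report generated yet */
};

static bool report_expired(const struct report *r,
			   unsigned long refresh_interval,
			   unsigned long now)
{
	if (!r->timestamp)
		return true;
	/* Signed comparison of the unsigned difference tolerates
	 * counter wraparound, like the kernel's time_after(). */
	return (long)(now - (r->timestamp + refresh_interval)) >= 0;
}
```

A reader that finds the report expired would trigger regeneration; otherwise the cached histogram is served as-is.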
Sysfs interface
/sys/devices/system/node/nodeX/workingset_report/refresh_interval
time in milliseconds specifying how long the report is valid for
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
include/linux/workingset_report.h | 1 +
mm/workingset_report.c | 101 ++++++++++++++++++++++++------
2 files changed, 83 insertions(+), 19 deletions(-)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index d7c2ee14ec87..8bae6a600410 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -37,6 +37,7 @@ struct wsr_page_age_histo {
};
struct wsr_state {
+ unsigned long refresh_interval;
/* breakdown of workingset by page age */
struct mutex page_age_lock;
struct wsr_page_age_histo *page_age;
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index a4dcf62fcd96..8678536ccfc7 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -174,9 +174,11 @@ static void collect_page_age_type(const struct lru_gen_folio *lrugen,
* Assume the heuristic that pages are in the MGLRU generation
* through uniform accesses, so we can aggregate them
* proportionally into bins.
+ *
+ * Returns the timestamp of the youngest gen in this lruvec.
*/
-static void collect_page_age(struct wsr_page_age_histo *page_age,
- const struct lruvec *lruvec)
+static unsigned long collect_page_age(struct wsr_page_age_histo *page_age,
+ const struct lruvec *lruvec)
{
int type;
const struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -191,11 +193,14 @@ static void collect_page_age(struct wsr_page_age_histo *page_age,
for (type = 0; type < ANON_AND_FILE; type++)
collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
curr_timestamp, type);
+
+ return READ_ONCE(lruvec->lrugen.timestamps[lru_gen_from_seq(max_seq)]);
}
/* First step: hierarchically scan child memcgs. */
static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ unsigned long refresh_interval)
{
struct mem_cgroup *memcg;
unsigned int flags;
@@ -208,12 +213,15 @@ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+ int gen = lru_gen_from_seq(max_seq);
+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
/*
* setting can_swap=true and force_scan=true ensures
* proper workingset stats when the system cannot swap.
*/
- try_to_inc_max_seq(lruvec, max_seq, true, true);
+ if (time_is_before_jiffies(birth + refresh_interval))
+ try_to_inc_max_seq(lruvec, max_seq, true, true);
cond_resched();
} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
@@ -228,6 +236,7 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age,
{
struct mem_cgroup *memcg;
struct wsr_report_bin *bin;
+ unsigned long oldest_lruvec_time = jiffies;
for (bin = page_age->bins;
bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
@@ -241,11 +250,15 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age,
memcg = mem_cgroup_iter(root, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ unsigned long lruvec_time =
+ collect_page_age(page_age, lruvec);
+
+ if (time_before(lruvec_time, oldest_lruvec_time))
+ oldest_lruvec_time = lruvec_time;
- collect_page_age(page_age, lruvec);
cond_resched();
} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
- WRITE_ONCE(page_age->timestamp, jiffies);
+ WRITE_ONCE(page_age->timestamp, oldest_lruvec_time);
}
static void copy_node_bins(struct pglist_data *pgdat,
@@ -270,17 +283,25 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
struct pglist_data *pgdat)
{
struct wsr_page_age_histo *page_age;
+ unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
if (!READ_ONCE(wsr->page_age))
return false;
- refresh_scan(wsr, root, pgdat);
+ if (!refresh_interval)
+ return false;
+
mutex_lock(&wsr->page_age_lock);
page_age = READ_ONCE(wsr->page_age);
- if (page_age) {
- copy_node_bins(pgdat, page_age);
- refresh_aggregate(page_age, root, pgdat);
- }
+ if (!page_age)
+ goto unlock;
+ if (page_age->timestamp &&
+ time_is_after_jiffies(page_age->timestamp + refresh_interval))
+ goto unlock;
+ refresh_scan(wsr, root, pgdat, refresh_interval);
+ copy_node_bins(pgdat, page_age);
+ refresh_aggregate(page_age, root, pgdat);
+unlock:
mutex_unlock(&wsr->page_age_lock);
return !!page_age;
}
@@ -299,6 +320,52 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
}
+static ssize_t refresh_interval_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct wsr_state *wsr = kobj_to_wsr(kobj);
+ unsigned int interval = READ_ONCE(wsr->refresh_interval);
+
+ return sysfs_emit(buf, "%u\n", jiffies_to_msecs(interval));
+}
+
+static ssize_t refresh_interval_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t len)
+{
+ unsigned int interval;
+ int err;
+ struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+ err = kstrtouint(buf, 0, &interval);
+ if (err)
+ return err;
+
+ mutex_lock(&wsr->page_age_lock);
+ if (interval && !wsr->page_age) {
+ struct wsr_page_age_histo *page_age =
+ kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
+
+ if (!page_age) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+ wsr->page_age = page_age;
+ }
+ if (!interval && wsr->page_age) {
+ kfree(wsr->page_age);
+ wsr->page_age = NULL;
+ }
+
+ WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval));
+unlock:
+ mutex_unlock(&wsr->page_age_lock);
+ return err ?: len;
+}
+
+static struct kobj_attribute refresh_interval_attr =
+ __ATTR_RW(refresh_interval);
+
static ssize_t page_age_intervals_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -382,13 +449,6 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
int ret = 0;
struct wsr_state *wsr = kobj_to_wsr(kobj);
-
- mutex_lock(&wsr->page_age_lock);
- if (!wsr->page_age)
- wsr->page_age =
- kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
- mutex_unlock(&wsr->page_age_lock);
-
wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
mutex_lock(&wsr->page_age_lock);
@@ -414,7 +474,10 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
static struct attribute *workingset_report_attrs[] = {
- &page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+ &refresh_interval_attr.attr,
+ &page_age_intervals_attr.attr,
+ &page_age_attr.attr,
+ NULL
};
static const struct attribute_group workingset_report_attr_group = {
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 3/9] mm: report workingset during memory pressure driven scanning
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 1/9] mm: aggregate workingset information into histograms Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 2/9] mm: use refresh interval to rate-limit workingset report aggregation Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 4/9] mm: extend workingset reporting to memcgs Yuanchu Xie
` (6 subsequent siblings)
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
When a node reaches its low watermarks and wakes up kswapd, notify all
userspace programs waiting on the workingset page age histogram of the
memory pressure, so a userspace agent can read the workingset report in
time and make policy decisions, such as logging, oom-killing, or
migration.
Sysfs interface:
/sys/devices/system/node/nodeX/workingset_report/report_threshold
time in milliseconds specifying how often the userspace
agent can be notified of node memory pressure.
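A userspace agent consuming this notification would follow the standard sysfs/kernfs notification pattern: read the file once to arm the notification, poll for POLLPRI/POLLERR, then seek back to offset 0 and re-read on each wakeup. The sketch below assumes that pattern; the helper names are made up, and only the sysfs path comes from this commit message.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Re-read a sysfs-style file from offset 0 into buf; returns the
 * number of bytes read, or -1 on error. NUL-terminates the buffer. */
static ssize_t reread_report(int fd, char *buf, size_t len)
{
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	ssize_t n = read(fd, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}

/* Sketch of an agent loop waiting on e.g.
 * /sys/devices/system/node/node0/workingset_report/page_age.
 * kernfs_notify() wakes pollers with POLLPRI, so poll, then re-read. */
static void wait_for_pressure(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	reread_report(fd, buf, len); /* initial read arms the poll */
	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };

		if (poll(&pfd, 1, -1) > 0 &&
		    (pfd.revents & (POLLPRI | POLLERR))) {
			reread_report(fd, buf, len);
			/* ...parse histogram, log / kill / migrate... */
		}
	}
}
```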
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
include/linux/workingset_report.h | 4 +++
mm/internal.h | 12 ++++++++
mm/vmscan.c | 46 +++++++++++++++++++++++++++++++
mm/workingset_report.c | 43 ++++++++++++++++++++++++++++-
4 files changed, 104 insertions(+), 1 deletion(-)
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 8bae6a600410..2ec8b927b200 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -37,7 +37,11 @@ struct wsr_page_age_histo {
};
struct wsr_state {
+ unsigned long report_threshold;
unsigned long refresh_interval;
+
+ struct kernfs_node *page_age_sys_file;
+
/* breakdown of workingset by page age */
struct mutex page_age_lock;
struct wsr_page_age_histo *page_age;
diff --git a/mm/internal.h b/mm/internal.h
index bbd3c1501bac..508b7d9937d6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -479,6 +479,18 @@ bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
bool force_scan);
void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);
+#ifdef CONFIG_WORKINGSET_REPORT
+/*
+ * in mm/wsr.c
+ */
+void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat);
+#else
+static inline void notify_workingset(struct mem_cgroup *memcg,
+ struct pglist_data *pgdat)
+{
+}
+#endif
+
/*
* in mm/rmap.c:
*/
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89da4d8dfb5f..2bca81271d15 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2578,6 +2578,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
return can_demote(pgdat->node_id, sc);
}
+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat, struct scan_control *sc);
+#else
+static inline void try_to_report_workingset(struct pglist_data *pgdat,
+ struct scan_control *sc)
+{
+}
+#endif
+
#ifdef CONFIG_LRU_GEN
#ifdef CONFIG_LRU_GEN_ENABLED
@@ -4004,6 +4013,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
set_initial_priority(pgdat, sc);
+ try_to_report_workingset(pgdat, sc);
+
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -5649,6 +5660,38 @@ static int __init init_lru_gen(void)
};
late_initcall(init_lru_gen);
+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = sc->target_mem_cgroup;
+ struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
+ unsigned long threshold = READ_ONCE(wsr->report_threshold);
+
+ if (sc->priority == DEF_PRIORITY)
+ return;
+
+ if (!threshold)
+ return;
+
+ if (!mutex_trylock(&wsr->page_age_lock))
+ return;
+
+ if (!wsr->page_age) {
+ mutex_unlock(&wsr->page_age_lock);
+ return;
+ }
+
+ if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) {
+ mutex_unlock(&wsr->page_age_lock);
+ return;
+ }
+
+ mutex_unlock(&wsr->page_age_lock);
+ notify_workingset(memcg, pgdat);
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
#else /* !CONFIG_LRU_GEN */
static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -6200,6 +6243,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (zone->zone_pgdat == last_pgdat)
continue;
last_pgdat = zone->zone_pgdat;
+
+ if (!sc->proactive)
+ try_to_report_workingset(zone->zone_pgdat, sc);
shrink_node(zone->zone_pgdat, sc);
}
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 8678536ccfc7..bbefb0046669 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -320,6 +320,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
}
+static ssize_t report_threshold_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct wsr_state *wsr = kobj_to_wsr(kobj);
+ unsigned int threshold = READ_ONCE(wsr->report_threshold);
+
+ return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold));
+}
+
+static ssize_t report_threshold_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t len)
+{
+ unsigned int threshold;
+ struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+ if (kstrtouint(buf, 0, &threshold))
+ return -EINVAL;
+
+ WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(threshold));
+
+ return len;
+}
+
+static struct kobj_attribute report_threshold_attr =
+ __ATTR_RW(report_threshold);
+
static ssize_t refresh_interval_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -474,6 +501,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
static struct attribute *workingset_report_attrs[] = {
+ &report_threshold_attr.attr,
&refresh_interval_attr.attr,
&page_age_intervals_attr.attr,
&page_age_attr.attr,
@@ -495,8 +523,13 @@ void wsr_init_sysfs(struct node *node)
wsr = kobj_to_wsr(kobj);
- if (sysfs_create_group(kobj, &workingset_report_attr_group))
+ if (sysfs_create_group(kobj, &workingset_report_attr_group)) {
pr_warn("Workingset report failed to create sysfs files\n");
+ return;
+ }
+
+ wsr->page_age_sys_file =
+ kernfs_walk_and_get(kobj->sd, "workingset_report/page_age");
}
EXPORT_SYMBOL_GPL(wsr_init_sysfs);
@@ -509,6 +542,14 @@ void wsr_remove_sysfs(struct node *node)
return;
wsr = kobj_to_wsr(kobj);
+ kernfs_put(wsr->page_age_sys_file);
sysfs_remove_group(kobj, &workingset_report_attr_group);
}
EXPORT_SYMBOL_GPL(wsr_remove_sysfs);
+
+void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat)
+{
+ struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
+
+ kernfs_notify(wsr->page_age_sys_file);
+}
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 4/9] mm: extend workingset reporting to memcgs
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (2 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 3/9] mm: report workingset during memory pressure driven scanning Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 5/9] mm: add kernel aging thread for workingset reporting Yuanchu Xie
` (5 subsequent siblings)
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
Break down the system-wide workingset report into per-memcg reports,
each of which aggregates its children hierarchically.
The per-node workingset reporting histograms and refresh/report
threshold files are presented as memcg files, showing a report
containing all the nodes.
The per-node page age interval is configurable in sysfs and not
available per-memcg, while the refresh interval and report threshold are
configured per-memcg.
Memcg interface:
/sys/fs/cgroup/.../memory.workingset.page_age
The memcg equivalent of the sysfs workingset page age histogram
breaks down the workingset of this memcg and its children into
page age intervals. Each node is prefixed with a node header and
a newline. Non-proactive direct reclaim on this memcg can also
wake up userspace agents that are waiting on this file.
e.g.
N0
1000 anon=0 file=0
2000 anon=0 file=0
3000 anon=0 file=0
4000 anon=0 file=0
5000 anon=0 file=0
18446744073709551615 anon=0 file=0
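The histogram format shown above is line-oriented and straightforward to parse. A minimal sketch (the helper name is hypothetical, not part of the patch):

```c
#include <stdio.h>

/* Parse one line of the page_age format shown above, e.g.
 * "1000 anon=0 file=0". The age is in milliseconds; anon/file are
 * byte counts. Returns 0 on success, -1 on malformed input.
 * Illustrative helper only. */
static int parse_page_age_line(const char *line,
			       unsigned long long *age_ms,
			       unsigned long long *anon,
			       unsigned long long *file)
{
	if (sscanf(line, "%llu anon=%llu file=%llu",
		   age_ms, anon, file) != 3)
		return -1;
	return 0;
}
```

The per-node header lines ("N0") fail this parse and can be used to detect node boundaries.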
/sys/fs/cgroup/.../memory.workingset.refresh_interval
The memcg equivalent of the sysfs refresh interval: a per-node
setting for how long a page age histogram remains valid, in
milliseconds.
e.g.
echo N0=2000 > memory.workingset.refresh_interval
/sys/fs/cgroup/.../memory.workingset.report_threshold
The memcg equivalent of the sysfs report threshold: a per-node
setting for how often a userspace agent waiting on the page age
histogram can be woken up, in milliseconds.
e.g.
echo N0=1000 > memory.workingset.report_threshold
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
include/linux/memcontrol.h | 21 ++++
include/linux/workingset_report.h | 15 ++-
mm/internal.h | 2 +
mm/memcontrol.c | 160 +++++++++++++++++++++++++++++-
mm/workingset_report.c | 50 +++++++---
5 files changed, 230 insertions(+), 18 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1b41554a5fb..fd595b33a54d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -323,6 +323,11 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+#ifdef CONFIG_WORKINGSET_REPORT
+ /* memory.workingset.page_age file */
+ struct cgroup_file workingset_page_age_file;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
};
@@ -1094,6 +1099,16 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
void split_page_memcg(struct page *head, int old_order, int new_order);
+static inline struct cgroup_file *
+mem_cgroup_page_age_file(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_WORKINGSET_REPORT
+ return &memcg->workingset_page_age_file;
+#else
+ return NULL;
+#endif
+}
+
#else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0
@@ -1511,6 +1526,12 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
static inline void split_page_memcg(struct page *head, int old_order, int new_order)
{
}
+
+static inline struct cgroup_file *
+mem_cgroup_page_age_file(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
#endif /* CONFIG_MEMCG */
/*
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 2ec8b927b200..616be6469768 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -9,6 +9,8 @@ struct mem_cgroup;
struct pglist_data;
struct node;
struct lruvec;
+struct cgroup_file;
+struct wsr_state;
#ifdef CONFIG_WORKINGSET_REPORT
@@ -40,7 +42,10 @@ struct wsr_state {
unsigned long report_threshold;
unsigned long refresh_interval;
- struct kernfs_node *page_age_sys_file;
+ union {
+ struct kernfs_node *page_age_sys_file;
+ struct cgroup_file *page_age_cgroup_file;
+ };
/* breakdown of workingset by page age */
struct mutex page_age_lock;
@@ -60,6 +65,9 @@ void wsr_remove_sysfs(struct node *node);
*/
bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
struct pglist_data *pgdat);
+
+int wsr_set_refresh_interval(struct wsr_state *wsr,
+ unsigned long refresh_interval);
#else
static inline void wsr_init_lruvec(struct lruvec *lruvec)
{
@@ -79,6 +87,11 @@ static inline void wsr_init_sysfs(struct node *node)
static inline void wsr_remove_sysfs(struct node *node)
{
}
+static inline int wsr_set_refresh_interval(struct wsr_state *wsr,
+ unsigned long refresh_interval)
+{
+ return 0;
+}
#endif /* CONFIG_WORKINGSET_REPORT */
#endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/mm/internal.h b/mm/internal.h
index 508b7d9937d6..50ca0c6e651c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -484,6 +484,8 @@ void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);
* in mm/wsr.c
*/
void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat);
+int workingset_report_intervals_parse(char *src,
+ struct wsr_report_bins *bins);
#else
static inline void notify_workingset(struct mem_cgroup *memcg,
struct pglist_data *pgdat)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 096856b35fbc..d1032c6efc66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4348,6 +4348,144 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
return nbytes;
}
+#ifdef CONFIG_WORKINGSET_REPORT
+static int memory_ws_refresh_interval_show(struct seq_file *m, void *v)
+{
+ int nid;
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ for_each_node_state(nid, N_MEMORY) {
+ struct wsr_state *wsr =
+ &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
+
+ seq_printf(m, "N%d=%u ", nid,
+ jiffies_to_msecs(READ_ONCE(wsr->refresh_interval)));
+ }
+ seq_putc(m, '\n');
+
+ return 0;
+}
+
+static ssize_t memory_wsr_threshold_parse(char *buf, size_t nbytes,
+ unsigned int *nid_out,
+ unsigned int *msecs)
+{
+ char *node, *threshold;
+ unsigned int nid;
+ int err;
+
+ buf = strstrip(buf);
+ threshold = buf;
+ node = strsep(&threshold, "=");
+
+ if (*node != 'N')
+ return -EINVAL;
+
+ err = kstrtouint(node + 1, 0, &nid);
+ if (err)
+ return err;
+
+ if (nid >= nr_node_ids || !node_state(nid, N_MEMORY))
+ return -EINVAL;
+
+ err = kstrtouint(threshold, 0, msecs);
+ if (err)
+ return err;
+
+ *nid_out = nid;
+
+ return nbytes;
+}
+
+static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ unsigned int nid, msecs;
+ struct wsr_state *wsr;
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
+
+ if (ret < 0)
+ return ret;
+
+ wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
+
+ ret = wsr_set_refresh_interval(wsr, msecs);
+ return ret ?: nbytes;
+}
+
+static int memory_ws_report_threshold_show(struct seq_file *m, void *v)
+{
+ int nid;
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ for_each_node_state(nid, N_MEMORY) {
+ struct wsr_state *wsr =
+ &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
+
+ seq_printf(m, "N%d=%u ", nid,
+ jiffies_to_msecs(READ_ONCE(wsr->report_threshold)));
+ }
+ seq_putc(m, '\n');
+
+ return 0;
+}
+
+static ssize_t memory_ws_report_threshold_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ unsigned int nid, msecs;
+ struct wsr_state *wsr;
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
+
+ if (ret < 0)
+ return ret;
+
+ wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
+ WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(msecs));
+ return ret;
+}
+
+static int memory_ws_page_age_show(struct seq_file *m, void *v)
+{
+ int nid;
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ for_each_node_state(nid, N_MEMORY) {
+ struct wsr_state *wsr =
+ &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
+ struct wsr_report_bin *bin;
+
+ if (!READ_ONCE(wsr->page_age))
+ continue;
+
+ wsr_refresh_report(wsr, memcg, NODE_DATA(nid));
+ mutex_lock(&wsr->page_age_lock);
+ if (!wsr->page_age)
+ goto unlock;
+ seq_printf(m, "N%d\n", nid);
+ for (bin = wsr->page_age->bins;
+ bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++)
+ seq_printf(m, "%u anon=%lu file=%lu\n",
+ jiffies_to_msecs(bin->idle_age),
+ bin->nr_pages[0] * PAGE_SIZE,
+ bin->nr_pages[1] * PAGE_SIZE);
+
+ seq_printf(m, "%lu anon=%lu file=%lu\n", WORKINGSET_INTERVAL_MAX,
+ bin->nr_pages[0] * PAGE_SIZE,
+ bin->nr_pages[1] * PAGE_SIZE);
+
+unlock:
+ mutex_unlock(&wsr->page_age_lock);
+ }
+
+ return 0;
+}
+#endif
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -4419,7 +4557,27 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
- { } /* terminate */
+#ifdef CONFIG_WORKINGSET_REPORT
+ {
+ .name = "workingset.refresh_interval",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_ws_refresh_interval_show,
+ .write = memory_ws_refresh_interval_write,
+ },
+ {
+ .name = "workingset.report_threshold",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_ws_report_threshold_show,
+ .write = memory_ws_report_threshold_write,
+ },
+ {
+ .name = "workingset.page_age",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .file_offset = offsetof(struct mem_cgroup, workingset_page_age_file),
+ .seq_show = memory_ws_page_age_show,
+ },
+#endif
+ {} /* terminate */
};
struct cgroup_subsys memory_cgrp_subsys = {
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index bbefb0046669..1e1bdb8bf75b 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -37,9 +37,12 @@ void wsr_destroy_pgdat(struct pglist_data *pgdat)
void wsr_init_lruvec(struct lruvec *lruvec)
{
struct wsr_state *wsr = &lruvec->wsr;
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
memset(wsr, 0, sizeof(*wsr));
mutex_init(&wsr->page_age_lock);
+ if (memcg && !mem_cgroup_is_root(memcg))
+ wsr->page_age_cgroup_file = mem_cgroup_page_age_file(memcg);
}
void wsr_destroy_lruvec(struct lruvec *lruvec)
@@ -51,8 +54,8 @@ void wsr_destroy_lruvec(struct lruvec *lruvec)
memset(wsr, 0, sizeof(*wsr));
}
-static int workingset_report_intervals_parse(char *src,
- struct wsr_report_bins *bins)
+int workingset_report_intervals_parse(char *src,
+ struct wsr_report_bins *bins)
{
int err = 0, i = 0;
char *cur, *next = strim(src);
@@ -356,20 +359,14 @@ static ssize_t refresh_interval_show(struct kobject *kobj,
return sysfs_emit(buf, "%u\n", jiffies_to_msecs(interval));
}
-static ssize_t refresh_interval_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t len)
+int wsr_set_refresh_interval(struct wsr_state *wsr,
+ unsigned long refresh_interval)
{
- unsigned int interval;
- int err;
- struct wsr_state *wsr = kobj_to_wsr(kobj);
-
- err = kstrtouint(buf, 0, &interval);
- if (err)
- return err;
+ int err = 0;
+ unsigned long old_interval = 0;
mutex_lock(&wsr->page_age_lock);
- if (interval && !wsr->page_age) {
+ if (refresh_interval && !wsr->page_age) {
struct wsr_page_age_histo *page_age =
kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
@@ -377,16 +374,34 @@ static ssize_t refresh_interval_store(struct kobject *kobj,
err = -ENOMEM;
goto unlock;
}
+ page_age->bins[0].idle_age = WORKINGSET_INTERVAL_MAX;
wsr->page_age = page_age;
}
- if (!interval && wsr->page_age) {
+ if (!refresh_interval && wsr->page_age) {
kfree(wsr->page_age);
wsr->page_age = NULL;
}
- WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval));
+ old_interval = READ_ONCE(wsr->refresh_interval);
+ WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(refresh_interval));
unlock:
mutex_unlock(&wsr->page_age_lock);
+ return err;
+}
+
+static ssize_t refresh_interval_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t len)
+{
+ unsigned int interval;
+ int err;
+ struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+ err = kstrtouint(buf, 0, &interval);
+ if (err)
+ return err;
+
+ err = wsr_set_refresh_interval(wsr, interval);
return err ?: len;
}
@@ -551,5 +566,8 @@ void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat)
{
struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
- kernfs_notify(wsr->page_age_sys_file);
+ if (mem_cgroup_is_root(memcg))
+ kernfs_notify(wsr->page_age_sys_file);
+ else
+ cgroup_file_notify(wsr->page_age_cgroup_file);
}
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 5/9] mm: add kernel aging thread for workingset reporting
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (3 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 4/9] mm: extend workingset reporting to memcgs Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 6/9] selftest: test system-wide " Yuanchu Xie
` (4 subsequent siblings)
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
For reliable and timely aging of memcgs, the page age histograms have
to be read on time. A kernel thread makes this easier by aging memcgs
with a valid refresh_interval whenever they can be refreshed, and also
reduces the latency seen by userspace consumers of the page age
histogram.
The kernel aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING.
Debugging stats may be added in the future for when aging cannot
keep up with the configured refresh_interval.
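The scheduling decision such a thread makes amounts to an earliest-deadline computation over the units it tracks. A userspace sketch of that computation (illustrative; the function and its parameters are hypothetical, not taken from the patch):

```c
/* Given per-unit report timestamps and refresh intervals (0 means
 * reporting is disabled for that unit), return the earliest time a
 * report goes stale and needs re-aging, or `none` if no unit has
 * reporting enabled. Illustrative only. */
static unsigned long next_refresh_deadline(const unsigned long *timestamps,
					   const unsigned long *intervals,
					   int n, unsigned long none)
{
	unsigned long best = none;
	int i;

	for (i = 0; i < n; i++) {
		unsigned long deadline;

		if (!intervals[i])
			continue;
		deadline = timestamps[i] + intervals[i];
		if (best == none || deadline < best)
			best = deadline;
	}
	return best;
}
```

An aging thread would sleep until the returned deadline, refresh the units that are due, and recompute.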
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
include/linux/workingset_report.h | 10 ++-
mm/Kconfig | 6 ++
mm/Makefile | 1 +
mm/memcontrol.c | 2 +-
mm/workingset_report.c | 13 ++-
mm/workingset_report_aging.c | 127 ++++++++++++++++++++++++++++++
6 files changed, 154 insertions(+), 5 deletions(-)
create mode 100644 mm/workingset_report_aging.c
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 616be6469768..f6bbde2a04c3 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -64,7 +64,15 @@ void wsr_remove_sysfs(struct node *node);
* The next refresh time is stored in refresh_time.
*/
bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
- struct pglist_data *pgdat);
+ struct pglist_data *pgdat, unsigned long *refresh_time);
+
+#ifdef CONFIG_WORKINGSET_REPORT_AGING
+void wsr_wakeup_aging_thread(void);
+#else /* CONFIG_WORKINGSET_REPORT_AGING */
+static inline void wsr_wakeup_aging_thread(void)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT_AGING */
int wsr_set_refresh_interval(struct wsr_state *wsr,
unsigned long refresh_interval);
diff --git a/mm/Kconfig b/mm/Kconfig
index be949786796d..a8def8c65610 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1310,6 +1310,12 @@ config WORKINGSET_REPORT
This option exports stats and events giving the user more insight
into its memory working set.
+config WORKINGSET_REPORT_AGING
+ bool "Workingset report kernel aging thread"
+ depends on WORKINGSET_REPORT
+ help
+ Performs aging on memcgs with their configured refresh intervals.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index f5ef0768253a..3a282510f960 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
+obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o
ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1032c6efc66..ea83f10b22a1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4462,7 +4462,7 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v)
if (!READ_ONCE(wsr->page_age))
continue;
- wsr_refresh_report(wsr, memcg, NODE_DATA(nid));
+ wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL);
mutex_lock(&wsr->page_age_lock);
if (!wsr->page_age)
goto unlock;
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 1e1bdb8bf75b..dad539e602bb 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -283,7 +283,7 @@ static void copy_node_bins(struct pglist_data *pgdat,
}
bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat, unsigned long *refresh_time)
{
struct wsr_page_age_histo *page_age;
unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
@@ -300,10 +300,14 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
goto unlock;
if (page_age->timestamp &&
time_is_after_jiffies(page_age->timestamp + refresh_interval))
- goto unlock;
+ goto time;
refresh_scan(wsr, root, pgdat, refresh_interval);
copy_node_bins(pgdat, page_age);
refresh_aggregate(page_age, root, pgdat);
+
+time:
+ if (refresh_time)
+ *refresh_time = page_age->timestamp + refresh_interval;
unlock:
mutex_unlock(&wsr->page_age_lock);
return !!page_age;
@@ -386,6 +390,9 @@ int wsr_set_refresh_interval(struct wsr_state *wsr,
WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(refresh_interval));
unlock:
mutex_unlock(&wsr->page_age_lock);
+ if (!err && refresh_interval &&
+ (!old_interval || jiffies_to_msecs(old_interval) > refresh_interval))
+ wsr_wakeup_aging_thread();
return err;
}
@@ -491,7 +498,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
int ret = 0;
struct wsr_state *wsr = kobj_to_wsr(kobj);
- wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+ wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj), NULL);
mutex_lock(&wsr->page_age_lock);
if (!wsr->page_age)
diff --git a/mm/workingset_report_aging.c b/mm/workingset_report_aging.c
new file mode 100644
index 000000000000..91ad5020778a
--- /dev/null
+++ b/mm/workingset_report_aging.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Workingset report kernel aging thread
+ *
+ * Performs aging on behalf of memcgs with their configured refresh intervals.
+ * While a userspace program can periodically read the page age breakdown
+ * per-memcg and trigger aging, having the kernel perform the aging incurs
+ * less overhead and is more consistent and reliable for the use case where
+ * every memcg should be aged according to its refresh interval.
+ */
+#define pr_fmt(fmt) "workingset report aging: " fmt
+
+#include <linux/jiffies.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/memcontrol.h>
+#include <linux/swap.h>
+#include <linux/wait.h>
+#include <linux/mmzone.h>
+#include <linux/workingset_report.h>
+
+static DECLARE_WAIT_QUEUE_HEAD(aging_wait);
+static bool refresh_pending;
+
+static bool do_aging_node(int nid, unsigned long *next_wake_time)
+{
+ struct mem_cgroup *memcg;
+ bool should_wait = true;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ struct wsr_state *wsr = &lruvec->wsr;
+ unsigned long refresh_time;
+
+ /* use returned time to decide when to wake up next */
+ if (wsr_refresh_report(wsr, memcg, pgdat, &refresh_time)) {
+ if (should_wait) {
+ should_wait = false;
+ *next_wake_time = refresh_time;
+ } else if (time_before(refresh_time, *next_wake_time)) {
+ *next_wake_time = refresh_time;
+ }
+ }
+
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ return should_wait;
+}
+
+static int do_aging(void *unused)
+{
+ while (!kthread_should_stop()) {
+ int nid;
+ long timeout_ticks;
+ unsigned long next_wake_time;
+ bool should_wait = true;
+
+ WRITE_ONCE(refresh_pending, false);
+ for_each_node_state(nid, N_MEMORY) {
+ unsigned long node_next_wake_time;
+
+ if (do_aging_node(nid, &node_next_wake_time))
+ continue;
+ if (should_wait) {
+ should_wait = false;
+ next_wake_time = node_next_wake_time;
+ } else if (time_before(node_next_wake_time,
+ next_wake_time)) {
+ next_wake_time = node_next_wake_time;
+ }
+ }
+
+ if (should_wait) {
+ wait_event_interruptible(aging_wait, refresh_pending);
+ continue;
+ }
+
+ /* sleep until next aging */
+ timeout_ticks = next_wake_time - jiffies;
+ if (timeout_ticks > 0 &&
+ timeout_ticks != MAX_SCHEDULE_TIMEOUT) {
+ schedule_timeout_idle(timeout_ticks);
+ continue;
+ }
+ }
+ return 0;
+}
+
+/* Invoked when refresh_interval shortens or changes to a non-zero value. */
+void wsr_wakeup_aging_thread(void)
+{
+ WRITE_ONCE(refresh_pending, true);
+ wake_up_interruptible(&aging_wait);
+}
+
+static struct task_struct *aging_thread;
+
+static int aging_init(void)
+{
+ struct task_struct *task;
+
+ task = kthread_run(do_aging, NULL, "kagingd");
+
+ if (IS_ERR(task)) {
+ pr_err("Failed to create aging kthread\n");
+ return PTR_ERR(task);
+ }
+
+ aging_thread = task;
+ pr_info("module loaded\n");
+ return 0;
+}
+
+static void aging_exit(void)
+{
+ kthread_stop(aging_thread);
+ aging_thread = NULL;
+ pr_info("module unloaded\n");
+}
+
+module_init(aging_init);
+module_exit(aging_exit);
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 6/9] selftest: test system-wide workingset reporting
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (4 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 5/9] mm: add kernel aging thread for workingset reporting Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces Yuanchu Xie
` (3 subsequent siblings)
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
Add a basic test that verifies the working set size of a simple memory
accessor. It should work with or without the aging thread.
When running tests with run_vmtests.sh, the file-backed workingset
report test requires the environment variable
WORKINGSET_REPORT_TEST_FILE_PATH to name a path for a temporary file,
which is passed into the test invocation as a parameter.
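For example, the file-backed test can be invoked through the selftest
runner roughly like this (the tmpfs path is an assumption; any writable
location with enough free space works):

```shell
# Point the file workingset test at a scratch file, then run only the
# workingset_report category of the mm selftests.
export WORKINGSET_REPORT_TEST_FILE_PATH=/tmp/wsr_test_file
./run_vmtests.sh -t workingset_report
```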
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/run_vmtests.sh | 5 +
.../testing/selftests/mm/workingset_report.c | 306 ++++++++++++++++
.../testing/selftests/mm/workingset_report.h | 39 +++
.../selftests/mm/workingset_report_test.c | 330 ++++++++++++++++++
6 files changed, 684 insertions(+)
create mode 100644 tools/testing/selftests/mm/workingset_report.c
create mode 100644 tools/testing/selftests/mm/workingset_report.h
create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index da030b43e43b..e5cd0085ab74 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -51,3 +51,4 @@ hugetlb_madv_vs_map
mseal_test
seal_elf
droppable
+workingset_report_test
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 0f8c110e0805..5c6a7464da6e 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
TEST_GEN_FILES += hugetlb_madv_vs_map
TEST_GEN_FILES += hugetlb_dio
TEST_GEN_FILES += droppable
+TEST_GEN_FILES += workingset_report_test
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
@@ -138,6 +139,8 @@ $(TEST_GEN_FILES): vm_util.c thp_settings.c
$(OUTPUT)/uffd-stress: uffd-common.c
$(OUTPUT)/uffd-unit-tests: uffd-common.c
+$(OUTPUT)/workingset_report_test: workingset_report.c
+
ifeq ($(ARCH),x86_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index c5797ad1d37b..63782667381a 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -75,6 +75,8 @@ separated by spaces:
read-only VMAs
- mdwe
test prctl(PR_SET_MDWE, ...)
+- workingset_report
+ test workingset reporting
example: ./run_vmtests.sh -t "hmm mmap ksm"
EOF
@@ -456,6 +458,9 @@ CATEGORY="mkdirty" run_test ./mkdirty
CATEGORY="mdwe" run_test ./mdwe_test
+CATEGORY="workingset_report" run_test ./workingset_report_test \
+ "${WORKINGSET_REPORT_TEST_FILE_PATH}"
+
echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix
echo "1..${count_total}" | tap_output
diff --git a/tools/testing/selftests/mm/workingset_report.c b/tools/testing/selftests/mm/workingset_report.c
new file mode 100644
index 000000000000..ee4dda5c371d
--- /dev/null
+++ b/tools/testing/selftests/mm/workingset_report.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "workingset_report.h"
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+
+#define SYSFS_NODE_ONLINE "/sys/devices/system/node/online"
+#define PROC_DROP_CACHES "/proc/sys/vm/drop_caches"
+
+/* Returns read len on success, or -errno on failure. */
+static ssize_t read_text(const char *path, char *buf, size_t max_len)
+{
+ ssize_t len;
+ int fd, err;
+ size_t bytes_read = 0;
+
+ if (!max_len)
+ return -EINVAL;
+
+ fd = open(path, O_RDONLY);
+ if (fd < 0)
+ return -errno;
+
+ while (bytes_read < max_len - 1) {
+ len = read(fd, buf + bytes_read, max_len - 1 - bytes_read);
+
+ if (len <= 0)
+ break;
+ bytes_read += len;
+ }
+
+ buf[bytes_read] = '\0';
+
+ err = -errno;
+ close(fd);
+ return len < 0 ? err : bytes_read;
+}
+
+/* Returns written len on success, or -errno on failure. */
+static ssize_t write_text(const char *path, const char *buf, ssize_t max_len)
+{
+ int fd, len, err;
+ size_t bytes_written = 0;
+
+ fd = open(path, O_WRONLY | O_APPEND);
+ if (fd < 0)
+ return -errno;
+
+ while (bytes_written < max_len) {
+ len = write(fd, buf + bytes_written, max_len - bytes_written);
+
+ if (len < 0)
+ break;
+ bytes_written += len;
+ }
+
+ err = -errno;
+ close(fd);
+ return len < 0 ? err : bytes_written;
+}
+
+static long read_num(const char *path)
+{
+ char buf[21];
+
+ if (read_text(path, buf, sizeof(buf)) <= 0)
+ return -1;
+ return (long)strtoul(buf, NULL, 10);
+}
+
+static int write_num(const char *path, unsigned long n)
+{
+ char buf[21];
+
+ sprintf(buf, "%lu", n);
+ if (write_text(path, buf, strlen(buf)) < 0)
+ return -1;
+ return 0;
+}
+
+long sysfs_get_refresh_interval(int nid)
+{
+ char file[128];
+
+ snprintf(file, sizeof(file),
+ "/sys/devices/system/node/node%d/workingset_report/refresh_interval",
+ nid);
+ return read_num(file);
+}
+
+int sysfs_set_refresh_interval(int nid, long interval)
+{
+ char file[128];
+
+ snprintf(file, sizeof(file),
+ "/sys/devices/system/node/node%d/workingset_report/refresh_interval",
+ nid);
+ return write_num(file, interval);
+}
+
+int sysfs_get_page_age_intervals_str(int nid, char *buf, int len)
+{
+ char path[128];
+
+ snprintf(path, sizeof(path),
+ "/sys/devices/system/node/node%d/workingset_report/page_age_intervals",
+ nid);
+ return read_text(path, buf, len);
+}
+
+int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len)
+{
+ char path[128];
+
+ snprintf(path, sizeof(path),
+ "/sys/devices/system/node/node%d/workingset_report/page_age_intervals",
+ nid);
+ return write_text(path, buf, len);
+}
+
+int sysfs_set_page_age_intervals(int nid, const char *const intervals[],
+ int nr_intervals)
+{
+ char file[128];
+ char buf[1024];
+ int i;
+ int err, len = 0;
+
+ for (i = 0; i < nr_intervals; ++i) {
+ err = snprintf(buf + len, sizeof(buf) - len, "%s", intervals[i]);
+
+ if (err < 0)
+ return err;
+ len += err;
+
+ if (i < nr_intervals - 1) {
+ err = snprintf(buf + len, sizeof(buf) - len, ",");
+ if (err < 0)
+ return err;
+ len += err;
+ }
+ }
+
+ snprintf(file, sizeof(file),
+ "/sys/devices/system/node/node%d/workingset_report/page_age_intervals",
+ nid);
+ return write_text(file, buf, len);
+}
+
+int get_nr_nodes(void)
+{
+ char buf[22];
+ char *found;
+
+ if (read_text(SYSFS_NODE_ONLINE, buf, sizeof(buf)) <= 0)
+ return -1;
+ found = strstr(buf, "-");
+ if (found)
+ return (int)strtoul(found + 1, NULL, 10) + 1;
+ return (long)strtoul(buf, NULL, 10) + 1;
+}
+
+int drop_pagecache(void)
+{
+ return write_num(PROC_DROP_CACHES, 1);
+}
+
+ssize_t sysfs_page_age_read(int nid, char *buf, size_t len)
+{
+ char file[128];
+
+ snprintf(file, sizeof(file),
+ "/sys/devices/system/node/node%d/workingset_report/page_age",
+ nid);
+ return read_text(file, buf, len);
+}
+
+/*
+ * Finds the first occurrence of "N<nid>\n"
+ * Modifies buf to terminate before the next occurrence of "N".
+ * Returns a substring of buf starting after "N<nid>\n"
+ */
+char *page_age_split_node(char *buf, int nid, char **next)
+{
+ char node_str[5];
+ char *found;
+ int node_str_len;
+
+ node_str_len = snprintf(node_str, sizeof(node_str), "N%u\n", nid);
+
+ /* find the node prefix first */
+ found = strstr(buf, node_str);
+ if (!found) {
+ ksft_print_msg("cannot find '%s' in page_age\n", node_str);
+ return NULL;
+ }
+ found += node_str_len;
+
+ *next = strchr(found, 'N');
+ if (*next)
+ *(*next - 1) = '\0';
+
+ return found;
+}
+
+ssize_t page_age_read(const char *buf, const char *interval, int pagetype)
+{
+ static const char * const type[ANON_AND_FILE] = { "anon=", "file=" };
+ char *found;
+
+ found = strstr(buf, interval);
+ if (!found) {
+ ksft_print_msg("cannot find %s in page_age\n", interval);
+ return -1;
+ }
+ found = strstr(found, type[pagetype]);
+ if (!found) {
+ ksft_print_msg("cannot find %s in page_age\n", type[pagetype]);
+ return -1;
+ }
+ found += strlen(type[pagetype]);
+ return (long)strtoul(found, NULL, 10);
+}
+
+static const char *TEMP_FILE = "/tmp/workingset_selftest";
+void cleanup_file_workingset(void)
+{
+ remove(TEMP_FILE);
+}
+
+int alloc_file_workingset(void *arg)
+{
+ int err = 0;
+ char *ptr;
+ int fd;
+ int ppid;
+ char *mapped;
+ size_t size = (size_t)arg;
+ size_t page_size = getpagesize();
+
+ ppid = getppid();
+
+ fd = open(TEMP_FILE, O_RDWR | O_CREAT, 0600);
+ if (fd < 0) {
+ err = -errno;
+ ksft_perror("failed to open temp file\n");
+ goto cleanup;
+ }
+
+ if (fallocate(fd, 0, 0, size) < 0) {
+ err = -errno;
+ ksft_perror("fallocate");
+ goto cleanup;
+ }
+
+ mapped = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ fd, 0);
+ if (mapped == MAP_FAILED) {
+ err = -errno;
+ ksft_perror("mmap");
+ goto cleanup;
+ }
+
+ while (getppid() == ppid) {
+ sync();
+ for (ptr = mapped; ptr < mapped + size; ptr += page_size)
+ *ptr = *ptr ^ 0xFF;
+ }
+
+cleanup:
+ cleanup_file_workingset();
+ return err;
+}
+
+int alloc_anon_workingset(void *arg)
+{
+ char *buf, *ptr;
+ int ppid = getppid();
+ size_t size = (size_t)arg;
+ size_t page_size = getpagesize();
+
+ buf = malloc(size);
+
+ if (!buf) {
+ ksft_print_msg("cannot allocate anon workingset\n");
+ exit(1);
+ }
+
+ while (getppid() == ppid) {
+ for (ptr = buf; ptr < buf + size; ptr += page_size)
+ *ptr = *ptr ^ 0xFF;
+ }
+
+ free(buf);
+ return 0;
+}
diff --git a/tools/testing/selftests/mm/workingset_report.h b/tools/testing/selftests/mm/workingset_report.h
new file mode 100644
index 000000000000..c5c281e4069b
--- /dev/null
+++ b/tools/testing/selftests/mm/workingset_report.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef WORKINGSET_REPORT_H_
+#define WORKINGSET_REPORT_H_
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <stdint.h>
+#include <sys/types.h>
+
+#define PAGETYPE_ANON 0
+#define PAGETYPE_FILE 1
+#define ANON_AND_FILE 2
+
+int get_nr_nodes(void);
+int drop_pagecache(void);
+
+long sysfs_get_refresh_interval(int nid);
+int sysfs_set_refresh_interval(int nid, long interval);
+
+int sysfs_get_page_age_intervals_str(int nid, char *buf, int len);
+int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len);
+
+int sysfs_set_page_age_intervals(int nid, const char *const intervals[],
+ int nr_intervals);
+
+char *page_age_split_node(char *buf, int nid, char **next);
+ssize_t sysfs_page_age_read(int nid, char *buf, size_t len);
+ssize_t page_age_read(const char *buf, const char *interval, int pagetype);
+
+int alloc_file_workingset(void *arg);
+void cleanup_file_workingset(void);
+int alloc_anon_workingset(void *arg);
+
+#endif /* WORKINGSET_REPORT_H_ */
diff --git a/tools/testing/selftests/mm/workingset_report_test.c b/tools/testing/selftests/mm/workingset_report_test.c
new file mode 100644
index 000000000000..89ff4e9d746e
--- /dev/null
+++ b/tools/testing/selftests/mm/workingset_report_test.c
@@ -0,0 +1,330 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "workingset_report.h"
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <signal.h>
+#include <time.h>
+
+#include "../clone3/clone3_selftests.h"
+
+#define REFRESH_INTERVAL 5000
+#define MB(x) ((x) << 20)
+
+static void sleep_ms(int milliseconds)
+{
+ struct timespec ts;
+
+ ts.tv_sec = milliseconds / 1000;
+ ts.tv_nsec = (milliseconds % 1000) * 1000000;
+ nanosleep(&ts, NULL);
+}
+
+/*
+ * Checks if two given values differ by less than err% of their sum.
+ */
+static inline int values_close(long a, long b, int err)
+{
+ return labs(a - b) <= (a + b) / 100 * err;
+}
+
+static const char * const PAGE_AGE_INTERVALS[] = {
+ "6000", "10000", "15000", "18446744073709551615",
+};
+#define NR_PAGE_AGE_INTERVALS (ARRAY_SIZE(PAGE_AGE_INTERVALS))
+
+static int set_page_age_intervals_all_nodes(const char *intervals, int nr_nodes)
+{
+ int i;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ int err = sysfs_set_page_age_intervals_str(
+ i, &intervals[i * 1024], strlen(&intervals[i * 1024]));
+
+ if (err < 0)
+ return err;
+ }
+ return 0;
+}
+
+static int get_page_age_intervals_all_nodes(char *intervals, int nr_nodes)
+{
+ int i;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ int err = sysfs_get_page_age_intervals_str(
+ i, &intervals[i * 1024], 1024);
+
+ if (err < 0)
+ return err;
+ }
+ return 0;
+}
+
+static int set_refresh_interval_all_nodes(const long *interval, int nr_nodes)
+{
+ int i;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ int err = sysfs_set_refresh_interval(i, interval[i]);
+
+ if (err < 0)
+ return err;
+ }
+ return 0;
+}
+
+static int get_refresh_interval_all_nodes(long *interval, int nr_nodes)
+{
+ int i;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ long val = sysfs_get_refresh_interval(i);
+
+ if (val < 0)
+ return val;
+ interval[i] = val;
+ }
+ return 0;
+}
+
+static pid_t clone_and_run(int fn(void *arg), void *arg)
+{
+ pid_t pid;
+
+ struct __clone_args args = {
+ .exit_signal = SIGCHLD,
+ };
+
+ pid = sys_clone3(&args, sizeof(struct __clone_args));
+
+ if (pid == 0)
+ exit(fn(arg));
+
+ return pid;
+}
+
+static int read_workingset(int pagetype, int nid,
+ unsigned long page_age[NR_PAGE_AGE_INTERVALS])
+{
+ int i, err;
+ char buf[4096];
+
+ err = sysfs_page_age_read(nid, buf, sizeof(buf));
+ if (err < 0)
+ return err;
+
+ for (i = 0; i < NR_PAGE_AGE_INTERVALS; ++i) {
+ err = page_age_read(buf, PAGE_AGE_INTERVALS[i], pagetype);
+ if (err < 0)
+ return err;
+ page_age[i] = err;
+ }
+
+ return 0;
+}
+
+static ssize_t read_interval_all_nodes(int pagetype, int interval)
+{
+ int i, err;
+ unsigned long page_age[NR_PAGE_AGE_INTERVALS];
+ ssize_t ret = 0;
+ int nr_nodes = get_nr_nodes();
+
+ for (i = 0; i < nr_nodes; ++i) {
+ err = read_workingset(pagetype, i, page_age);
+ if (err < 0)
+ return err;
+
+ ret += page_age[interval];
+ }
+
+ return ret;
+}
+
+#define TEST_SIZE MB(500l)
+
+static int run_test(int f(void))
+{
+ int i, err, test_result;
+ long *old_refresh_intervals;
+ long *new_refresh_intervals;
+ char *old_page_age_intervals;
+ int nr_nodes = get_nr_nodes();
+
+ if (nr_nodes <= 0) {
+ ksft_print_msg("failed to get nr_nodes\n");
+ return KSFT_FAIL;
+ }
+
+ old_refresh_intervals = calloc(nr_nodes, sizeof(long));
+ new_refresh_intervals = calloc(nr_nodes, sizeof(long));
+ old_page_age_intervals = calloc(nr_nodes, 1024);
+
+ if (!(old_refresh_intervals && new_refresh_intervals &&
+ old_page_age_intervals)) {
+ ksft_print_msg("failed to allocate memory for intervals\n");
+ return KSFT_FAIL;
+ }
+
+ err = get_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes);
+ if (err < 0) {
+ ksft_print_msg("failed to read refresh interval\n");
+ return KSFT_FAIL;
+ }
+
+ err = get_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes);
+ if (err < 0) {
+ ksft_print_msg("failed to read page age interval\n");
+ return KSFT_FAIL;
+ }
+
+ for (i = 0; i < nr_nodes; ++i)
+ new_refresh_intervals[i] = REFRESH_INTERVAL;
+
+ for (i = 0; i < nr_nodes; ++i) {
+ err = sysfs_set_page_age_intervals(i, PAGE_AGE_INTERVALS,
+ NR_PAGE_AGE_INTERVALS - 1);
+ if (err < 0) {
+ ksft_print_msg("failed to set page age interval\n");
+ test_result = KSFT_FAIL;
+ goto fail;
+ }
+ }
+
+ err = set_refresh_interval_all_nodes(new_refresh_intervals, nr_nodes);
+ if (err < 0) {
+ ksft_print_msg("failed to set refresh interval\n");
+ test_result = KSFT_FAIL;
+ goto fail;
+ }
+
+ sync();
+ drop_pagecache();
+
+ test_result = f();
+
+fail:
+ err = set_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes);
+ if (err < 0) {
+ ksft_print_msg("failed to restore refresh interval\n");
+ test_result = KSFT_FAIL;
+ }
+ err = set_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes);
+ if (err < 0) {
+ ksft_print_msg("failed to restore page age interval\n");
+ test_result = KSFT_FAIL;
+ }
+ return test_result;
+}
+
+static char *file_test_path;
+static int test_file(void)
+{
+ ssize_t ws_size_ref, ws_size_test;
+ int ret = KSFT_FAIL, i;
+ pid_t pid = 0;
+
+ if (!file_test_path) {
+ ksft_print_msg("Set a path to test file workingset\n");
+ return KSFT_SKIP;
+ }
+
+ ws_size_ref = read_interval_all_nodes(PAGETYPE_FILE, 0);
+ if (ws_size_ref < 0)
+ goto cleanup;
+
+ pid = clone_and_run(alloc_file_workingset, (void *)TEST_SIZE);
+ if (pid < 0)
+ goto cleanup;
+
+ read_interval_all_nodes(PAGETYPE_FILE, 0);
+ sleep_ms(REFRESH_INTERVAL);
+
+ for (i = 0; i < 3; ++i) {
+ sleep_ms(REFRESH_INTERVAL);
+ ws_size_test = read_interval_all_nodes(PAGETYPE_FILE, 0);
+ ws_size_test += read_interval_all_nodes(PAGETYPE_FILE, 1);
+ if (ws_size_test < 0)
+ goto cleanup;
+
+ if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) {
+ ksft_print_msg(
+ "file working set size difference too large: actual=%ld, expected=%ld\n",
+ ws_size_test - ws_size_ref, TEST_SIZE);
+ goto cleanup;
+ }
+ }
+ ret = KSFT_PASS;
+
+cleanup:
+ if (pid > 0)
+ kill(pid, SIGKILL);
+ cleanup_file_workingset();
+ return ret;
+}
+
+static int test_anon(void)
+{
+ ssize_t ws_size_ref, ws_size_test;
+ pid_t pid = 0;
+ int ret = KSFT_FAIL, i;
+
+ ws_size_ref = read_interval_all_nodes(PAGETYPE_ANON, 0);
+ ws_size_ref += read_interval_all_nodes(PAGETYPE_ANON, 1);
+ if (ws_size_ref < 0)
+ goto cleanup;
+
+ pid = clone_and_run(alloc_anon_workingset, (void *)TEST_SIZE);
+ if (pid < 0)
+ goto cleanup;
+
+ sleep_ms(REFRESH_INTERVAL);
+ read_interval_all_nodes(PAGETYPE_ANON, 0);
+
+ for (i = 0; i < 5; ++i) {
+ sleep_ms(REFRESH_INTERVAL);
+ ws_size_test = read_interval_all_nodes(PAGETYPE_ANON, 0);
+ ws_size_test += read_interval_all_nodes(PAGETYPE_ANON, 1);
+ if (ws_size_test < 0)
+ goto cleanup;
+
+ if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) {
+ ksft_print_msg(
+ "anon working set size difference too large: actual=%ld, expected=%ld\n",
+ ws_size_test - ws_size_ref, TEST_SIZE);
+ goto cleanup;
+ }
+ }
+ ret = KSFT_PASS;
+
+cleanup:
+ if (pid > 0)
+ kill(pid, SIGKILL);
+ return ret;
+}
+
+
+#define T(x) { x, #x }
+struct workingset_test {
+ int (*fn)(void);
+ const char *name;
+} tests[] = {
+ T(test_anon),
+ T(test_file),
+};
+#undef T
+
+int main(int argc, char **argv)
+{
+ int i, err;
+
+ if (argc > 1)
+ file_test_path = argv[1];
+
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ err = run_test(tests[i].fn);
+ ksft_test_result_code(err, tests[i].name, NULL);
+ }
+ return 0;
+}
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (5 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 6/9] selftest: test system-wide " Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 8/9] Docs/admin-guide/cgroup-v2: document workingset reporting Yuanchu Xie
` (2 subsequent siblings)
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
Add workingset reporting documentation for better discoverability of
its sysfs and memcg interfaces. Also document the required kernel
config to enable workingset reporting.
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
Documentation/admin-guide/mm/index.rst | 1 +
.../admin-guide/mm/workingset_report.rst | 105 ++++++++++++++++++
2 files changed, 106 insertions(+)
create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 8b35795b664b..61a2a347fc91 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -41,4 +41,5 @@ the Linux memory management.
swap_numa
transhuge
userfaultfd
+ workingset_report
zswap
diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst
new file mode 100644
index 000000000000..0969513705c4
--- /dev/null
+++ b/Documentation/admin-guide/mm/workingset_report.rst
@@ -0,0 +1,105 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Workingset Report
+=================
+Workingset report provides a view of memory coldness in user-defined
+time intervals, e.g. X bytes are Y milliseconds cold. It breaks down
+the user pages in the system, per NUMA node and per memcg, into
+anonymous and file pages, producing histograms that look like:
+::
+
+ 1000 anon=137368 file=24530
+ 20000 anon=34342 file=0
+ 30000 anon=353232 file=333608
+ 40000 anon=407198 file=206052
+ 9223372036854775807 anon=4925624 file=892892
+
+The workingset reports can be used to drive proactive reclaim, by
+identifying the number of cold bytes in a memcg, then writing to
+``memory.reclaim``.
+
+Quick start
+===========
+Build the kernel with the following configurations. The report relies
+on Multi-gen LRU for page coldness.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+* ``CONFIG_WORKINGSET_REPORT=y``
+
+Optionally, the aging kernel daemon can be enabled with the following
+configuration.
+* ``CONFIG_WORKINGSET_REPORT_AGING=y``
+
+Sysfs interfaces
+================
+``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides
+a per-node page age histogram, showing an aggregate of the node's lruvecs.
+Reading this file causes a hierarchical aging of all lruvecs, scanning
+pages and creating a new Multi-gen LRU generation in each lruvec.
+For example:
+::
+
+ 1000 anon=0 file=0
+ 2000 anon=0 file=0
+ 100000 anon=5533696 file=5566464
+ 18446744073709551615 anon=0 file=0
+
+``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals``
+is a comma-separated list of time in milliseconds that configures what
+the page age histogram uses for aggregation. For the above histogram,
+the intervals are::
+
+ 1000,2000,100000
+
+``/sys/devices/system/node/nodeX/workingset_report/refresh_interval``
+defines the amount of time the report is valid for in milliseconds.
+When a report is still valid, reading the ``page_age`` file shows
+the existing valid report, instead of generating a new one.
+
+``/sys/devices/system/node/nodeX/workingset_report/report_threshold``
+specifies how often the userspace agent can be notified of node
+memory pressure, in milliseconds. When a node reaches its low
+watermarks and wakes up kswapd, programs waiting on ``page_age`` are
+woken up so they can read the histogram and make policy decisions.
+
+Memcg interface
+===============
+While ``page_age_intervals`` is defined per-node in sysfs, ``page_age``,
+``refresh_interval`` and ``report_threshold`` are available per-memcg.
+
+``/sys/fs/cgroup/.../memory.workingset.page_age``
+The memcg equivalent of the sysfs workingset page age histogram
+breaks down the workingset of this memcg and its children into
+page age intervals. Each node is prefixed with a node header and
+a newline. Non-proactive direct reclaim on this memcg can also
+wake up userspace agents that are waiting on this file.
+E.g.
+::
+
+ N0
+ 1000 anon=0 file=0
+ 2000 anon=0 file=0
+ 3000 anon=0 file=0
+ 4000 anon=0 file=0
+ 5000 anon=0 file=0
+ 18446744073709551615 anon=0 file=0
+
+``/sys/fs/cgroup/.../memory.workingset.refresh_interval``
+The memcg equivalent of the sysfs refresh interval. A per-node
+value specifying how long a page age histogram remains valid, in
+milliseconds.
+E.g.
+::
+
+ echo N0=2000 > memory.workingset.refresh_interval
+
+``/sys/fs/cgroup/.../memory.workingset.report_threshold``
+The memcg equivalent of the sysfs report threshold. A per-node
+value specifying how often a userspace agent waiting on the page
+age histogram can be woken up, in milliseconds.
+E.g.
+::
+
+ echo N0=1000 > memory.workingset.report_threshold
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 8/9] Docs/admin-guide/cgroup-v2: document workingset reporting
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (6 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 9/9] virtio-balloon: add " Yuanchu Xie
2024-11-27 7:26 ` [PATCH v4 0/9] mm: " Johannes Weiner
9 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
Add workingset reporting documentation for better discoverability of
its memcg interfaces. Point the memcg documentation to
Documentation/admin-guide/mm/workingset_report.rst for more details.
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 2cb58daf3089..67a183f08245 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1784,6 +1784,41 @@ The following nested keys are defined.
Shows pressure stall information for memory. See
:ref:`Documentation/accounting/psi.rst <psi>` for details.
+ memory.workingset.page_age
+ A read-only histogram which exists on non-root cgroups.
+
+ This breaks down the cgroup's memory footprint into different
+ types of memory and groups them per-node into user-defined coldness
+ bins.
+
+ The output format of memory.workingset.page_age is::
+
+ N0
+ <interval 0 of node 0> type=<type bytes in interval 0 of node 0>
+ <interval 1 of node 0> type=<type bytes in interval 1 of node 0>
+ ...
+ 18446744073709551615 type=<the rest of type bytes of node 0>
+
+ The type of memory can be anon, file, or new types added later.
+ Don't rely on the types remaining fixed. See
+ :ref:`Documentation/admin-guide/mm/workingset_report.rst <workingset_report>`
+ for details.
+
+ memory.workingset.refresh_interval
+ A read-write nested-keyed file which exists on non-root cgroups.
+
+ Setting it to a non-zero value for any node enables working set
+ reporting for that node. The default is 0 for each node. See
+ :ref:`Documentation/admin-guide/mm/workingset_report.rst <workingset_report>`
+ for details.
+
+ memory.workingset.report_threshold
+ A read-write nested-keyed file which exists on non-root cgroups.
+
+ The per-node minimum interval, in milliseconds, between successive
+ working set reports. The default is 0 for each node. See
+ :ref:`Documentation/admin-guide/mm/workingset_report.rst <workingset_report>`
+ for details.
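+
+ Since non-proactive direct reclaim notifies waiters on
+ memory.workingset.page_age, a userspace agent can block on the file
+ with poll(2). A hypothetical sketch (not part of this patch),
+ assuming kernfs-style POLLPRI notification as used by memory.events:

```python
import os
import select

def wait_for_report(path, timeout_ms=None):
    """Block until the kernel notifies on a workingset page_age file,
    then return its contents, or None on timeout.

    Hypothetical sketch: assumes cgroup/kernfs files signal waiters
    with POLLPRI|POLLERR, as memory.events does.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        p = select.poll()
        p.register(fd, select.POLLPRI | select.POLLERR)
        if not p.poll(timeout_ms):
            return None  # timed out without a notification
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 1 << 16).decode()
    finally:
        os.close(fd)
```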
Usage Guidelines
~~~~~~~~~~~~~~~~
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 9/9] virtio-balloon: add workingset reporting
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (7 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 8/9] Docs/admin-guide/cgroup-v2: document workingset reporting Yuanchu Xie
@ 2024-11-27 2:57 ` Yuanchu Xie
2024-11-27 23:14 ` Daniel Verkamp
2024-11-27 7:26 ` [PATCH v4 0/9] mm: " Johannes Weiner
9 siblings, 1 reply; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 2:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
Yuanchu Xie, cgroups, linux-doc, linux-kernel, virtualization,
linux-mm, linux-kselftest
Ballooning is a way to dynamically size a VM, and it requires guest
collaboration. The amount to balloon without adversely affecting guest
performance is hard to compute without clear metrics from the guest.
Workingset reporting can provide guidance to the host to allow better
collaborative ballooning, such that the host balloon controller can
properly gauge the amount of memory the guest is actively using, i.e.,
the working set.
A draft QEMU series [1] is being worked on. Currently it is able to
configure the workingset reporting bins, refresh_interval, and report
threshold. Through QMP or HMP, a balloon controller can request a
workingset report. There is also a script [2] exercising the QMP
interface with a visual breakdown of the guest's workingset size.
According to the OASIS VIRTIO v1.3 spec, a new balloon device is in
the works; the device this series extends is the "traditional"
balloon. If the existing balloon device is not the right place for new
features, I'm more than happy to add them to the new one as well.
For technical details, this patch adds a generic mechanism to the
workingset reporting infrastructure to allow other parts of the kernel
to receive workingset reports. Two virtqueues are added to the
virtio-balloon device, notification_vq and report_vq. The notification
virtqueue allows the host to configure the guest workingset reporting
parameters and request a report. The report virtqueue sends a working
set report to the host when one is requested or due to memory pressure.
The workingset reporting feature is gated by the compilation flag
CONFIG_WORKINGSET_REPORT and the balloon feature flag
VIRTIO_BALLOON_F_WS_REPORTING.
[1] https://github.com/Dummyc0m/qemu/tree/wsr
[2] https://gist.github.com/Dummyc0m/d45b4e1b0dda8f2bc6cd8cfb37cc7e34
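For illustration only (not part of this patch or the QEMU series), a
host-side CONFIG notification can be packed according to the
little-endian layout of struct virtio_balloon_working_set_notify
introduced in the uapi header below; the field names and sizes come
from that struct, everything else here is an assumed sketch:

```python
import struct

VIRTIO_BALLOON_WS_OP_CONFIG = 2
WORKINGSET_REPORT_MAX_NR_BINS = 32
# The guest compares against WORKINGSET_INTERVAL_MAX truncated to u32.
WS_INTERVAL_MAX_U32 = 0xFFFFFFFF

def pack_ws_config(node_id, report_threshold_ms, refresh_interval_ms,
                   bin_edges_ms):
    """Pack struct virtio_balloon_working_set_notify for op=CONFIG:
    __le16 op, __le16 node_id, __le32 report_threshold,
    __le32 refresh_interval, __le32 idle_age[32].

    bin_edges_ms must be strictly increasing; the terminator is
    appended and the remainder of the array is zero-filled.
    """
    ages = list(bin_edges_ms) + [WS_INTERVAL_MAX_U32]
    ages += [0] * (WORKINGSET_REPORT_MAX_NR_BINS - len(ages))
    return struct.pack("<HHII%dI" % WORKINGSET_REPORT_MAX_NR_BINS,
                       VIRTIO_BALLOON_WS_OP_CONFIG, node_id,
                       report_threshold_ms, refresh_interval_ms, *ages)
```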
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
drivers/virtio/virtio_balloon.c | 390 +++++++++++++++++++++++++++-
include/linux/balloon_compaction.h | 1 +
include/linux/mmzone.h | 4 +
include/linux/workingset_report.h | 66 ++++-
include/uapi/linux/virtio_balloon.h | 30 +++
mm/workingset_report.c | 89 ++++++-
6 files changed, 566 insertions(+), 14 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index b36d2803674e..8eb300653dd8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -18,6 +18,7 @@
#include <linux/wait.h>
#include <linux/mm.h>
#include <linux/page_reporting.h>
+#include <linux/workingset_report.h>
/*
* Balloon device works in 4K page units. So each page is pointed to by
@@ -45,6 +46,8 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
+ VIRTIO_BALLOON_VQ_WORKING_SET,
+ VIRTIO_BALLOON_VQ_NOTIFY,
VIRTIO_BALLOON_VQ_MAX
};
@@ -124,6 +127,23 @@ struct virtio_balloon {
spinlock_t wakeup_lock;
bool processing_wakeup_event;
u32 wakeup_signal_mask;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+ struct virtqueue *working_set_vq, *notification_vq;
+
+ /* Protects node_id, wsr_receiver, and report_buf */
+ struct mutex wsr_report_lock;
+ int wsr_node_id;
+ struct wsr_receiver wsr_receiver;
+ /* Buffer to report to host */
+ struct virtio_balloon_working_set_report *report_buf;
+
+ /* Buffer to hold incoming notification from the host. */
+ struct virtio_balloon_working_set_notify *notify_buf;
+
+ struct work_struct update_balloon_working_set_work;
+ struct work_struct update_balloon_notification_work;
+#endif
};
#define VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST (1 << 0)
@@ -339,8 +359,352 @@ static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num)
return num_freed_pages;
}
-static inline void update_stat(struct virtio_balloon *vb, int idx,
- u16 tag, u64 val)
+#ifdef CONFIG_WORKINGSET_REPORT
+static bool wsr_is_configured(struct virtio_balloon *vb)
+{
+ if (node_online(READ_ONCE(vb->wsr_node_id)) &&
+ READ_ONCE(vb->wsr_receiver.wsr.refresh_interval) > 0 &&
+ READ_ONCE(vb->wsr_receiver.wsr.page_age) != NULL)
+ return true;
+ return false;
+}
+
+/* wsr_receiver callback */
+static void wsr_receiver_notify(struct wsr_receiver *receiver)
+{
+ int bin;
+ struct virtio_balloon *vb =
+ container_of(receiver, struct virtio_balloon, wsr_receiver);
+
+ /* if we fail to acquire the locks, send stale report */
+ if (!mutex_trylock(&vb->wsr_report_lock))
+ goto out;
+ if (!mutex_trylock(&receiver->wsr.page_age_lock))
+ goto out_unlock_report_buf;
+ if (!READ_ONCE(receiver->wsr.page_age))
+ goto out_unlock_page_age;
+
+ vb->report_buf->error = cpu_to_le32(0);
+ vb->report_buf->node_id = cpu_to_le32(vb->wsr_node_id);
+ for (bin = 0; bin < WORKINGSET_REPORT_MAX_NR_BINS; ++bin) {
+ struct virtio_balloon_working_set_report_bin *dest =
+ &vb->report_buf->bins[bin];
+ struct wsr_report_bin *src = &receiver->wsr.page_age->bins[bin];
+
+ dest->anon_bytes =
+ cpu_to_le64(src->nr_pages[LRU_GEN_ANON] * PAGE_SIZE);
+ dest->file_bytes =
+ cpu_to_le64(src->nr_pages[LRU_GEN_FILE] * PAGE_SIZE);
+ if (src->idle_age == WORKINGSET_INTERVAL_MAX) {
+ dest->idle_age = cpu_to_le64(WORKINGSET_INTERVAL_MAX);
+ break;
+ }
+ dest->idle_age = cpu_to_le64(jiffies_to_msecs(src->idle_age));
+ }
+
+out_unlock_page_age:
+ mutex_unlock(&receiver->wsr.page_age_lock);
+out_unlock_report_buf:
+ mutex_unlock(&vb->wsr_report_lock);
+out:
+ /* Send the working set report to the device. */
+ spin_lock(&vb->stop_update_lock);
+ if (!vb->stop_update)
+ queue_work(system_freezable_wq, &vb->update_balloon_working_set_work);
+ spin_unlock(&vb->stop_update_lock);
+}
+
+static void virtio_balloon_working_set_request(struct virtio_balloon *vb,
+ int nid)
+{
+ int err = 0;
+
+ if (!node_online(nid)) {
+ err = -EINVAL;
+ goto error;
+ }
+
+ err = wsr_refresh_receiver_report(NODE_DATA(nid));
+ if (err)
+ goto error;
+
+ return;
+error:
+ mutex_lock(&vb->wsr_report_lock);
+ vb->report_buf->error = cpu_to_le32(err);
+ vb->report_buf->node_id = cpu_to_le32(nid);
+ mutex_unlock(&vb->wsr_report_lock);
+ spin_lock(&vb->stop_update_lock);
+ if (!vb->stop_update)
+ queue_work(system_freezable_wq,
+ &vb->update_balloon_working_set_work);
+ spin_unlock(&vb->stop_update_lock);
+}
+
+static void notification_receive(struct virtqueue *vq)
+{
+ struct virtio_balloon *vb = vq->vdev->priv;
+
+ spin_lock(&vb->stop_update_lock);
+ if (!vb->stop_update)
+ queue_work(system_freezable_wq, &vb->update_balloon_notification_work);
+ spin_unlock(&vb->stop_update_lock);
+}
+
+static int
+virtio_balloon_register_working_set_receiver(struct virtio_balloon *vb)
+{
+ struct pglist_data *pgdat;
+ struct wsr_report_bins *bins = NULL, __rcu *old;
+ int nid, bin, err = 0, old_nid = vb->wsr_node_id;
+ struct virtio_balloon_working_set_notify *notify = vb->notify_buf;
+
+ nid = le16_to_cpu(notify->node_id);
+ if (!node_online(nid)) {
+ dev_warn(&vb->vdev->dev, "node not online %d\n", nid);
+ return -EINVAL;
+ }
+
+ pgdat = NODE_DATA(nid);
+ bins = kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL);
+
+ if (!bins)
+ return -ENOMEM;
+
+ for (bin = 0; bin < WORKINGSET_REPORT_MAX_NR_BINS; ++bin) {
+ u32 age_msecs = le32_to_cpu(notify->idle_age[bin]);
+ unsigned long age = msecs_to_jiffies(age_msecs);
+
+ /*
+ * A correct idle_age array should end in
+ * WORKINGSET_INTERVAL_MAX.
+ */
+ if (age_msecs == (u32)WORKINGSET_INTERVAL_MAX) {
+ bins->idle_age[bin] = WORKINGSET_INTERVAL_MAX;
+ break;
+ }
+ bins->idle_age[bin] = age;
+ if (bin > 0 && bins->idle_age[bin] <= bins->idle_age[bin - 1]) {
+ dev_warn(&vb->vdev->dev, "bins not increasing\n");
+ err = -EINVAL;
+ goto error;
+ }
+ }
+ if (bin < WORKINGSET_REPORT_MIN_NR_BINS - 1 ||
+ bin == WORKINGSET_REPORT_MAX_NR_BINS) {
+ err = -ERANGE;
+ goto error;
+ }
+ bins->nr_bins = bin;
+
+ mutex_lock(&vb->wsr_report_lock);
+ err = wsr_set_refresh_interval(
+ &vb->wsr_receiver.wsr,
+ le32_to_cpu(notify->refresh_interval));
+ if (err) {
+ mutex_unlock(&vb->wsr_report_lock);
+ goto error;
+ }
+ if (old_nid != NUMA_NO_NODE)
+ wsr_remove_receiver(&vb->wsr_receiver, NODE_DATA(old_nid));
+ WRITE_ONCE(vb->wsr_node_id, nid);
+ WRITE_ONCE(vb->wsr_receiver.wsr.report_threshold,
+ msecs_to_jiffies(le32_to_cpu(notify->report_threshold)));
+ WRITE_ONCE(vb->wsr_receiver.notify, wsr_receiver_notify);
+ mutex_unlock(&vb->wsr_report_lock);
+
+ /* update the bins for target node */
+ mutex_lock(&pgdat->wsr_update_mutex);
+ old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins,
+ lockdep_is_held(&pgdat->wsr_update_mutex));
+ mutex_unlock(&pgdat->wsr_update_mutex);
+ kfree_rcu(old, rcu);
+
+ wsr_register_receiver(&vb->wsr_receiver, pgdat);
+
+ return 0;
+error:
+ kfree(bins);
+ return err;
+}
+
+static void update_balloon_notification_func(struct work_struct *work)
+{
+ unsigned int len, op;
+ int err;
+ struct virtio_balloon *vb;
+ struct scatterlist sg_in;
+
+ vb = container_of(work, struct virtio_balloon,
+ update_balloon_notification_work);
+ op = le16_to_cpu(vb->notify_buf->op);
+
+ switch (op) {
+ case VIRTIO_BALLOON_WS_OP_REQUEST:
+ virtio_balloon_working_set_request(vb,
+ READ_ONCE(vb->wsr_node_id));
+ break;
+ case VIRTIO_BALLOON_WS_OP_CONFIG:
+ err = virtio_balloon_register_working_set_receiver(vb);
+ if (err)
+ dev_warn(&vb->vdev->dev,
+ "Error configuring working set, %d\n", err);
+ break;
+ default:
+ dev_warn(&vb->vdev->dev, "Received invalid notification, %u\n",
+ op);
+ break;
+ }
+
+ /* Detach all the used buffers from the vq */
+ while (virtqueue_get_buf(vb->notification_vq, &len))
+ ;
+ /* Add a new notification buffer for device to fill. */
+ sg_init_one(&sg_in, vb->notify_buf, sizeof(*vb->notify_buf));
+ virtqueue_add_inbuf(vb->notification_vq, &sg_in, 1, vb, GFP_KERNEL);
+ virtqueue_kick(vb->notification_vq);
+}
+
+static void update_balloon_ws_func(struct work_struct *work)
+{
+ struct virtio_balloon *vb;
+
+ vb = container_of(work, struct virtio_balloon,
+ update_balloon_working_set_work);
+
+ if (wsr_is_configured(vb)) {
+ struct scatterlist sg_out;
+ int unused;
+ int err;
+
+ /* Detach all the used buffers from the vq */
+ while (virtqueue_get_buf(vb->working_set_vq, &unused))
+ ;
+ sg_init_one(&sg_out, vb->report_buf, sizeof(*vb->report_buf));
+ err = virtqueue_add_outbuf(vb->working_set_vq, &sg_out, 1, vb, GFP_KERNEL);
+ if (unlikely(err))
+ dev_err(&vb->vdev->dev,
+ "Failed to send working set report err = %d\n",
+ err);
+ else
+ virtqueue_kick(vb->working_set_vq);
+
+ } else {
+ dev_warn(&vb->vdev->dev, "Working Set not initialized.\n");
+ }
+}
+
+static void wsr_init_vqs_info(struct virtio_balloon *vb,
+ struct virtqueue_info vqs_info[])
+{
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+ vqs_info[VIRTIO_BALLOON_VQ_WORKING_SET].name = "ws";
+ vqs_info[VIRTIO_BALLOON_VQ_WORKING_SET].callback = NULL;
+ vqs_info[VIRTIO_BALLOON_VQ_NOTIFY].name = "notify";
+ vqs_info[VIRTIO_BALLOON_VQ_NOTIFY].callback = notification_receive;
+ }
+}
+
+static int wsr_init_vq(struct virtio_balloon *vb, struct virtqueue *vqs[])
+{
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+ struct scatterlist sg;
+ int err;
+
+ vb->working_set_vq = vqs[VIRTIO_BALLOON_VQ_WORKING_SET];
+ vb->notification_vq = vqs[VIRTIO_BALLOON_VQ_NOTIFY];
+
+ /* Prime the notification virtqueue for the device to fill. */
+ sg_init_one(&sg, vb->notify_buf, sizeof(*vb->notify_buf));
+ err = virtqueue_add_inbuf(vb->notification_vq, &sg, 1, vb, GFP_KERNEL);
+ if (unlikely(err)) {
+ dev_err(&vb->vdev->dev,
+ "Failed to prepare notifications, err = %d\n", err);
+ return err;
+ }
+ virtqueue_kick(vb->notification_vq);
+ }
+ return 0;
+}
+
+static void wsr_init_work(struct virtio_balloon *vb)
+{
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+ INIT_WORK(&vb->update_balloon_working_set_work,
+ update_balloon_ws_func);
+ INIT_WORK(&vb->update_balloon_notification_work,
+ update_balloon_notification_func);
+ }
+}
+
+static int wsr_init(struct virtio_balloon *vb)
+{
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+ vb->report_buf = kzalloc(sizeof(*vb->report_buf), GFP_KERNEL);
+ if (!vb->report_buf)
+ return -ENOMEM;
+
+ vb->notify_buf = kzalloc(sizeof(*vb->notify_buf), GFP_KERNEL);
+ if (!vb->notify_buf) {
+ kfree(vb->report_buf);
+ vb->report_buf = NULL;
+ return -ENOMEM;
+ }
+
+ wsr_init_state(&vb->wsr_receiver.wsr);
+ vb->wsr_node_id = NUMA_NO_NODE;
+ vb->report_buf->bins[0].idle_age = WORKINGSET_INTERVAL_MAX;
+ }
+ return 0;
+}
+
+static void wsr_remove(struct virtio_balloon *vb)
+{
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING) &&
+ vb->wsr_node_id != NUMA_NO_NODE) {
+ wsr_remove_receiver(&vb->wsr_receiver, NODE_DATA(vb->wsr_node_id));
+ wsr_destroy_state(&vb->wsr_receiver.wsr);
+ }
+
+ kfree(vb->report_buf);
+ kfree(vb->notify_buf);
+ mutex_destroy(&vb->wsr_report_lock);
+}
+
+static void wsr_cancel_work(struct virtio_balloon *vb)
+{
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+ cancel_work_sync(&vb->update_balloon_working_set_work);
+ cancel_work_sync(&vb->update_balloon_notification_work);
+ }
+}
+#else
+static inline void wsr_init_vqs_info(struct virtio_balloon *vb,
+ struct virtqueue_info vqs_info[])
+{
+}
+static inline int wsr_init_vq(struct virtio_balloon *vb,
+ struct virtqueue *vqs[])
+{
+ return 0;
+}
+static inline void wsr_init_work(struct virtio_balloon *vb)
+{
+}
+static inline int wsr_init(struct virtio_balloon *vb)
+{
+ return 0;
+}
+static inline void wsr_remove(struct virtio_balloon *vb)
+{
+}
+static inline void wsr_cancel_work(struct virtio_balloon *vb)
+{
+}
+#endif
+
+static inline void update_stat(struct virtio_balloon *vb, int idx, u16 tag,
+ u64 val)
{
BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
vb->stats[idx].tag = cpu_to_virtio16(vb->vdev, tag);
@@ -605,6 +969,8 @@ static int init_vqs(struct virtio_balloon *vb)
vqs_info[VIRTIO_BALLOON_VQ_REPORTING].callback = balloon_ack;
}
+ wsr_init_vqs_info(vb, vqs_info);
+
err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
vqs_info, NULL);
if (err)
@@ -615,6 +981,7 @@ static int init_vqs(struct virtio_balloon *vb)
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
struct scatterlist sg;
unsigned int num_stats;
+
vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
/*
@@ -640,6 +1007,11 @@ static int init_vqs(struct virtio_balloon *vb)
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
+ err = wsr_init_vq(vb, vqs);
+
+ if (err)
+ return err;
+
return 0;
}
@@ -961,15 +1333,21 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out;
}
+ vb->vdev = vdev;
+
INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
+ wsr_init_work(vb);
spin_lock_init(&vb->stop_update_lock);
mutex_init(&vb->balloon_lock);
init_waitqueue_head(&vb->acked);
- vb->vdev = vdev;
balloon_devinfo_init(&vb->vb_dev_info);
+ err = wsr_init(vb);
+ if (err)
+ goto out_remove_wsr;
+
err = init_vqs(vb);
if (err)
goto out_free_vb;
@@ -1085,7 +1463,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (towards_target(vb))
virtballoon_changed(vdev);
return 0;
-
out_unregister_oom:
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
unregister_oom_notifier(&vb->oom_nb);
@@ -1099,6 +1476,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
vdev->config->del_vqs(vdev);
+out_remove_wsr:
+ wsr_remove(vb);
out_free_vb:
kfree(vb);
out:
return err;
}
@@ -1130,11 +1509,13 @@ static void virtballoon_remove(struct virtio_device *vdev)
unregister_oom_notifier(&vb->oom_nb);
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
virtio_balloon_unregister_shrinker(vb);
+ wsr_remove(vb);
spin_lock_irq(&vb->stop_update_lock);
vb->stop_update = true;
spin_unlock_irq(&vb->stop_update_lock);
cancel_work_sync(&vb->update_balloon_size_work);
cancel_work_sync(&vb->update_balloon_stats_work);
+ wsr_cancel_work(vb);
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
cancel_work_sync(&vb->report_free_page_work);
@@ -1200,6 +1581,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
VIRTIO_BALLOON_F_PAGE_POISON,
VIRTIO_BALLOON_F_REPORTING,
+ VIRTIO_BALLOON_F_WS_REPORTING,
};
static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 5ca2d5699620..d92b8337dbcf 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -43,6 +43,7 @@
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/list.h>
+#include <linux/workingset_report.h>
/*
* Balloon device information descriptor.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ee728c0c5a3b..9a2dc506779d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1429,8 +1429,12 @@ typedef struct pglist_data {
#endif
#ifdef CONFIG_WORKINGSET_REPORT
+ /* protects wsr_page_age_bins */
struct mutex wsr_update_mutex;
struct wsr_report_bins __rcu *wsr_page_age_bins;
+ /* protects wsr_receiver_list */
+ struct mutex wsr_receiver_mutex;
+ struct list_head wsr_receiver_list;
#endif
CACHELINE_PADDING(_pad2_);
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index f6bbde2a04c3..1074b89035e9 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -11,13 +11,14 @@ struct node;
struct lruvec;
struct cgroup_file;
struct wsr_state;
-
-#ifdef CONFIG_WORKINGSET_REPORT
+struct wsr_receiver;
#define WORKINGSET_REPORT_MIN_NR_BINS 2
#define WORKINGSET_REPORT_MAX_NR_BINS 32
#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+
+#ifdef CONFIG_WORKINGSET_REPORT
#define ANON_AND_FILE 2
struct wsr_report_bin {
@@ -52,6 +53,8 @@ struct wsr_state {
struct wsr_page_age_histo *page_age;
};
+void wsr_init_state(struct wsr_state *wsr);
+void wsr_destroy_state(struct wsr_state *wsr);
void wsr_init_lruvec(struct lruvec *lruvec);
void wsr_destroy_lruvec(struct lruvec *lruvec);
void wsr_init_pgdat(struct pglist_data *pgdat);
@@ -66,6 +69,47 @@ void wsr_remove_sysfs(struct node *node);
bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
struct pglist_data *pgdat, unsigned long *refresh_time);
+/*
+ * If refresh_interval > 0, enable working set reporting and kick
+ * the aging thread (if configured).
+ * If refresh_interval = 0, disable working set reporting and free
+ * the bookkeeping resources.
+ *
+ * @param refresh_interval milliseconds.
+ */
+int wsr_set_refresh_interval(struct wsr_state *wsr,
+ unsigned long refresh_interval);
+
+struct wsr_receiver {
+ /*
+ * Working set reporting ensures that two notify calls to
+ * the same receiver cannot interleave with one another.
+ *
+ * Must be set before calling wsr_register_receiver.
+ */
+ void (*notify)(struct wsr_receiver *receiver);
+ struct wsr_state wsr;
+ struct list_head list;
+};
+
+/*
+ * Register a per-node receiver
+ * report_threshold and refresh_interval are configured
+ * by the caller in struct wsr_state and contain valid values.
+ * page_age is allocated.
+ */
+void wsr_register_receiver(struct wsr_receiver *receiver,
+ struct pglist_data *pgdat);
+
+void wsr_remove_receiver(struct wsr_receiver *receiver,
+ struct pglist_data *pgdat);
+
+/*
+ * Refresh the report for the specified node, unless a refresh is already
+ * in progress or the parameters are being updated.
+ */
+int wsr_refresh_receiver_report(struct pglist_data *pgdat);
+
#ifdef CONFIG_WORKINGSET_REPORT_AGING
void wsr_wakeup_aging_thread(void);
#else /* CONFIG_WORKINGSET_REPORT_AGING */
@@ -77,6 +121,12 @@ static inline void wsr_wakeup_aging_thread(void)
int wsr_set_refresh_interval(struct wsr_state *wsr,
unsigned long refresh_interval);
#else
+static inline void wsr_init_state(struct wsr_state *wsr)
+{
+}
+static inline void wsr_destroy_state(struct wsr_state *wsr)
+{
+}
static inline void wsr_init_lruvec(struct lruvec *lruvec)
{
}
@@ -100,6 +150,18 @@ static inline int wsr_set_refresh_interval(struct wsr_state *wsr,
{
return 0;
}
+static inline void wsr_register_receiver(struct wsr_receiver *receiver,
+ struct pglist_data *pgdat)
+{
+}
+static inline void wsr_remove_receiver(struct wsr_receiver *receiver,
+ struct pglist_data *pgdat)
+{
+}
+static inline int wsr_refresh_receiver_report(struct pglist_data *pgdat)
+{
+ return 0;
+}
#endif /* CONFIG_WORKINGSET_REPORT */
#endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ee35a372805d..668eaa39c85b 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -25,6 +25,7 @@
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE. */
+#include "linux/workingset_report.h"
#include <linux/types.h>
#include <linux/virtio_types.h>
#include <linux/virtio_ids.h>
@@ -37,6 +38,7 @@
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
+#define VIRTIO_BALLOON_F_WS_REPORTING 6 /* Working Set Size reporting */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -128,4 +130,32 @@ struct virtio_balloon_stat {
__virtio64 val;
} __attribute__((packed));
+/* Operations from the device */
+#define VIRTIO_BALLOON_WS_OP_REQUEST 1
+#define VIRTIO_BALLOON_WS_OP_CONFIG 2
+
+struct virtio_balloon_working_set_notify {
+ /* REQUEST or CONFIG */
+ __le16 op;
+ __le16 node_id;
+ /* the following fields valid iff op=CONFIG */
+ __le32 report_threshold;
+ __le32 refresh_interval;
+ __le32 idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct virtio_balloon_working_set_report_bin {
+ __le64 idle_age;
+ /* bytes in this bucket for anon and file */
+ __le64 anon_bytes;
+ __le64 file_bytes;
+};
+
+struct virtio_balloon_working_set_report {
+ __le32 error;
+ __le32 node_id;
+ struct virtio_balloon_working_set_report_bin
+ bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
#endif /* _LINUX_VIRTIO_BALLOON_H */
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index dad539e602bb..4b3397ebdbd0 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -20,27 +20,51 @@ void wsr_init_pgdat(struct pglist_data *pgdat)
{
mutex_init(&pgdat->wsr_update_mutex);
RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL);
+ INIT_LIST_HEAD(&pgdat->wsr_receiver_list);
}
void wsr_destroy_pgdat(struct pglist_data *pgdat)
{
struct wsr_report_bins __rcu *bins;
+ struct list_head *cursor, *next;
mutex_lock(&pgdat->wsr_update_mutex);
bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL,
lockdep_is_held(&pgdat->wsr_update_mutex));
- kfree_rcu(bins, rcu);
mutex_unlock(&pgdat->wsr_update_mutex);
+ kfree_rcu(bins, rcu);
+ mutex_lock(&pgdat->wsr_receiver_mutex);
+ list_for_each_safe(cursor, next, &pgdat->wsr_receiver_list) {
+ /* pgdat does not own the receiver, so it's not free'd here */
+ list_del(cursor);
+ }
+ mutex_unlock(&pgdat->wsr_receiver_mutex);
+
mutex_destroy(&pgdat->wsr_update_mutex);
+ mutex_destroy(&pgdat->wsr_receiver_mutex);
+}
+
+void wsr_init_state(struct wsr_state *wsr)
+{
+ memset(wsr, 0, sizeof(*wsr));
+ mutex_init(&wsr->page_age_lock);
+}
+EXPORT_SYMBOL_GPL(wsr_init_state);
+
+void wsr_destroy_state(struct wsr_state *wsr)
+{
+ kfree(wsr->page_age);
+ mutex_destroy(&wsr->page_age_lock);
+ memset(wsr, 0, sizeof(*wsr));
}
+EXPORT_SYMBOL_GPL(wsr_destroy_state);
void wsr_init_lruvec(struct lruvec *lruvec)
{
struct wsr_state *wsr = &lruvec->wsr;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- memset(wsr, 0, sizeof(*wsr));
- mutex_init(&wsr->page_age_lock);
+ wsr_init_state(wsr);
if (memcg && !mem_cgroup_is_root(memcg))
wsr->page_age_cgroup_file = mem_cgroup_page_age_file(memcg);
}
@@ -49,9 +73,7 @@ void wsr_destroy_lruvec(struct lruvec *lruvec)
{
struct wsr_state *wsr = &lruvec->wsr;
- mutex_destroy(&wsr->page_age_lock);
- kfree(wsr->page_age);
- memset(wsr, 0, sizeof(*wsr));
+ wsr_destroy_state(wsr);
}
int workingset_report_intervals_parse(char *src,
@@ -395,6 +417,7 @@ int wsr_set_refresh_interval(struct wsr_state *wsr,
wsr_wakeup_aging_thread();
return err;
}
+EXPORT_SYMBOL_GPL(wsr_set_refresh_interval);
static ssize_t refresh_interval_store(struct kobject *kobj,
struct kobj_attribute *attr,
@@ -569,12 +592,62 @@ void wsr_remove_sysfs(struct node *node)
}
EXPORT_SYMBOL_GPL(wsr_remove_sysfs);
+/* wsr belongs to the root memcg or memcg is disabled */
+static int notify_receiver(struct wsr_state *wsr, struct pglist_data *pgdat)
+{
+ struct list_head *cursor;
+
+ if (!mutex_trylock(&pgdat->wsr_receiver_mutex))
+ return -EAGAIN;
+ list_for_each(cursor, &pgdat->wsr_receiver_list) {
+ struct wsr_receiver *entry =
+ list_entry(cursor, struct wsr_receiver, list);
+
+ wsr_refresh_report(&entry->wsr, NULL, pgdat, NULL);
+ entry->notify(entry);
+ }
+ mutex_unlock(&pgdat->wsr_receiver_mutex);
+ return 0;
+}
+
+int wsr_refresh_receiver_report(struct pglist_data *pgdat)
+{
+ struct wsr_state *wsr = &mem_cgroup_lruvec(NULL, pgdat)->wsr;
+
+ return notify_receiver(wsr, pgdat);
+}
+EXPORT_SYMBOL_GPL(wsr_refresh_receiver_report);
+
void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat)
{
struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
- if (mem_cgroup_is_root(memcg))
+ if (mem_cgroup_is_root(memcg)) {
kernfs_notify(wsr->page_age_sys_file);
- else
+ notify_receiver(wsr, pgdat);
+ } else
cgroup_file_notify(wsr->page_age_cgroup_file);
}
+
+void wsr_register_receiver(struct wsr_receiver *receiver,
+ struct pglist_data *pgdat)
+{
+ struct wsr_state *wsr = &receiver->wsr;
+
+ mutex_lock(&pgdat->wsr_receiver_mutex);
+ list_add_tail(&receiver->list, &pgdat->wsr_receiver_list);
+ mutex_unlock(&pgdat->wsr_receiver_mutex);
+
+ if (!!wsr->page_age && READ_ONCE(wsr->refresh_interval))
+ wsr_wakeup_aging_thread();
+}
+EXPORT_SYMBOL(wsr_register_receiver);
+
+void wsr_remove_receiver(struct wsr_receiver *receiver,
+ struct pglist_data *pgdat)
+{
+ mutex_lock(&pgdat->wsr_receiver_mutex);
+ list_del(&receiver->list);
+ mutex_unlock(&pgdat->wsr_receiver_mutex);
+}
+EXPORT_SYMBOL(wsr_remove_receiver);
--
2.47.0.338.g60cca15819-goog
* Re: [PATCH v4 1/9] mm: aggregate workingset information into histograms
2024-11-27 2:57 ` [PATCH v4 1/9] mm: aggregate workingset information into histograms Yuanchu Xie
@ 2024-11-27 4:21 ` Matthew Wilcox
2024-11-27 17:47 ` Yuanchu Xie
0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2024-11-27 4:21 UTC (permalink / raw)
To: Yuanchu Xie
Cc: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
On Tue, Nov 26, 2024 at 06:57:20PM -0800, Yuanchu Xie wrote:
> diff --git a/mm/internal.h b/mm/internal.h
> index 64c2eb0b160e..bbd3c1501bac 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn;
> /*
> * in mm/vmscan.c:
> */
> +struct scan_control;
> +bool isolate_lru_page(struct page *page);
Is this a mismerge? It doesn't exist any more.
* Re: [PATCH v4 0/9] mm: workingset reporting
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
` (8 preceding siblings ...)
2024-11-27 2:57 ` [PATCH v4 9/9] virtio-balloon: add " Yuanchu Xie
@ 2024-11-27 7:26 ` Johannes Weiner
2024-11-27 19:40 ` SeongJae Park
` (2 more replies)
9 siblings, 3 replies; 21+ messages in thread
From: Johannes Weiner @ 2024-11-27 7:26 UTC (permalink / raw)
To: Yuanchu Xie
Cc: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest, SeongJae Park
On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, of which coldness can be tracked by accessed bits and fd
> references. However, the concept of workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or the userspace.
> Another interesting idea might be hugepage workingset, so that we can
> measure the proportion of hugepages backing cold memory. However, with
> architectures like arm, there may be too many hugepage sizes leading to
> a combinatorial explosion when exporting stats to the userspace.
> Nonetheless, the kernel should provide a set of workingset interfaces
> that is generic enough to accommodate the various use cases, and extensible
> to potential future use cases.
Doesn't DAMON already provide this information?
CCing SJ.
> Use cases
> ==========
> Job scheduling
> On overcommitted hosts, workingset information improves efficiency and
> reliability by allowing the job scheduler to have better stats on the
> exact memory requirements of each job. This can improve efficiency by
> allowing more jobs to land on the same host or NUMA node. On the other hand, the
> job scheduler can also ensure each node has a sufficient amount of memory
> and does not enter direct reclaim or the kernel OOM path. With workingset
> information and job priority, the userspace OOM killing or proactive
> reclaim policy can kick in before the system is under memory pressure.
> If the job shape is very different from the machine shape, knowing the
> workingset per-node can also help inform page allocation policies.
>
> Proactive reclaim
> Workingset information allows a container manager to proactively
> reclaim memory while not impacting a job's performance. While PSI may
> provide a reactive measure of when a proactive reclaim has reclaimed too
> much, workingset reporting allows the policy to be more accurate and
> flexible.
I'm not sure about more accurate.
Access frequency is only half the picture. Whether you need to keep
memory with a given frequency resident depends on the speed of the
backing device.
There is memory compression; there is swap on flash; swap on crappy
flash; swapfiles that share IOPS with co-located filesystems. There is
zswap+writeback, where avg refault speed can vary dramatically.
You can of course offload much more to a fast zswap backend than to a
swapfile on a struggling flashdrive, with comparable app performance.
So I think you'd be hard pressed to achieve a high level of accuracy
in the usecases you list without taking the (often highly dynamic)
cost of paging / memory transfer into account.
There is a more detailed discussion of this in a paper we wrote on
proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
> Ballooning (similar to proactive reclaim)
> The last patch of the series extends the virtio-balloon device to report
> the guest workingset.
> Balloon policies benefit from workingset to more precisely determine the
> size of the memory balloon. On end-user devices where memory is scarce and
> overcommitted, the balloon sizing in multiple VMs running on the same
> device can be orchestrated with workingset reports from each one.
> On the server side, workingset reporting allows the balloon controller to
> inflate the balloon without causing too much file cache to be reclaimed in
> the guest.
>
> Promotion/Demotion
> If different mechanisms are used for promotion and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, consider a promotion hot-page threshold defined as a reaccess
> distance of N seconds (promote pages accessed more often than every N
> seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> the fast memory node pass the threshold. This calculation can be done
> with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
>
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
>
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea here is we break down the workingset per-node per-memcg into time
> intervals (ms), e.g.
>
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
>
> Implementation
> ==========
> The reporting of user pages is based on MGLRU, and therefore requires
> CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
> fine-grained workingset report, but we can already gather a lot of data
> with just four generations. The workingset reporting mechanism is gated
> behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
> CONFIG_WORKINGSET_REPORT_AGING.
>
> Benchmarks
> ==========
> Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> compile and redis benchmarks from openbenchmarking.org. The policy and
> runner are referred to as WMO (Workload Memory Optimization).
> The results were based on v3 of the series, but v4 doesn't change the core
> of the working set reporting and just adds the ballooning counterpart.
>
> The timed Linux kernel compilation benchmark shows improvements in peak
> memory usage with a policy of "swap out all bytes colder than 10 seconds
> every 40 seconds". A swapfile is configured on SSD.
> --------------------------------------------
> peak memory usage (with WMO): 4982.61328 MiB
> peak memory usage (control): 9569.1367 MiB
> peak memory reduction: 47.9%
> --------------------------------------------
> Benchmark | Experimental |Control | Experimental_Std_Dev | Control_Std_Dev
> Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
> --------------------------------------------
> Seconds, fewer is better
You can do this with a recent (>2018) upstream kernel and ~100 lines
of python [1]. It also works on both LRU implementations.
[1] https://github.com/facebookincubator/senpai
We use this approach in virtually the entire Meta fleet, to offload
unneeded memory, estimate available capacity for job scheduling, plan
future capacity needs, and provide accurate memory usage feedback to
application developers.
It works over a wide variety of CPU and storage configurations with no
specific tuning.
The paper I referenced above provides a detailed breakdown of how it
all works together.
I would be curious to see a more in-depth comparison to the prior art
in this space. At first glance, your proposal seems more complex and
less robust/versatile, at least for offloading and capacity gauging.
It does provide more detailed insight into userspace memory behavior,
which could be helpful when trying to make sense of applications that
sit on a rich layer of libraries and complicated runtimes. But here a
comparison to DAMON would be helpful.
> 25 files changed, 2482 insertions(+), 9 deletions(-)
> create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
> create mode 100644 include/linux/workingset_report.h
> create mode 100644 mm/workingset_report.c
> create mode 100644 mm/workingset_report_aging.c
> create mode 100644 tools/testing/selftests/mm/workingset_report.c
> create mode 100644 tools/testing/selftests/mm/workingset_report.h
> create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
* Re: [PATCH v4 1/9] mm: aggregate workingset information into histograms
2024-11-27 4:21 ` Matthew Wilcox
@ 2024-11-27 17:47 ` Yuanchu Xie
0 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 17:47 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
On Tue, Nov 26, 2024 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:20PM -0800, Yuanchu Xie wrote:
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 64c2eb0b160e..bbd3c1501bac 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn;
> > /*
> > * in mm/vmscan.c:
> > */
> > +struct scan_control;
> > +bool isolate_lru_page(struct page *page);
>
> Is this a mismerge? It doesn't exist any more.
Yes this is a mismerge. I'll fix it in the next version.
* Re: [PATCH v4 0/9] mm: workingset reporting
2024-11-27 7:26 ` [PATCH v4 0/9] mm: " Johannes Weiner
@ 2024-11-27 19:40 ` SeongJae Park
2024-11-27 23:33 ` Yu Zhao
2024-12-06 19:57 ` Yuanchu Xie
2 siblings, 0 replies; 21+ messages in thread
From: SeongJae Park @ 2024-11-27 19:40 UTC (permalink / raw)
To: Johannes Weiner
Cc: SeongJae Park, Yuanchu Xie, David Hildenbrand, Aneesh Kumar K.V,
Khalid Aziz, Henry Huang, Yu Zhao, Dan Williams, Gregory Price,
Huang Ying, Lance Yang, Randy Dunlap, Muhammad Usama Anjum,
Tejun Heo, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest, damon
+ damon@lists.linux.dev
I haven't thoroughly read any version of this patch series due to my laziness,
sorry. So I may be saying something completely wrong. My apologies in advance,
and please correct me if that is the case.
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references.
DAMON provides data access patterns of user pages. It is not named workingset
exactly, but it is a superset of that information, so users can derive the
workingset from DAMON-provided raw data. I therefore feel I have to ask whether
DAMON can be used for, or can help achieve, the purpose of this patch series.
Depending on the detailed definition of workingset, of course, the workingset
we can get from DAMON might not be technically the same as what this patch
series aims to provide, and the difference could be something that makes DAMON
unable to be used or to help here. But I cannot tell whether this is the case
from this cover letter alone.
> > However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
This again sounds, to me, similar to what DAMON aims to provide. DAMON is
designed to be easy to extend for various use cases and internal mechanisms.
Specifically, it separates access check mechanisms and core logic into
different layers, and provides an interface for extending DAMON with new
mechanisms. Indeed, DAMON's two access check mechanisms, for virtual address
spaces and for the physical address space, are built on that interface. There
have also been RFC patch series extending DAMON for NUMA-specific and
write-only access monitoring, using NUMA hinting faults and soft-dirty PTEs as
the internal mechanisms.
My humble understanding of the major difference between DAMON and workingset
reporting is the internal mechanism. Workingset reporting uses MGLRU as its
access check mechanism, while DAMON's current access check mechanisms mainly
rely on checking page table accessed bits. I think DAMON could be extended to
use MGLRU as another internal access check mechanism, but I understand that
there could be many things I am overlooking.
Yuanchu, I think it would help me and other reviewers better understand this
patch series if you could share such a comparison. I will also be more than
happy to help you and others better understand, through this discussion, what
DAMON can and cannot do.
>
> Doesn't DAMON already provide this information?
>
> CCing SJ.
Thank you for adding me, Johannes :)
[...]
> It does provide more detailed insight into userspace memory behavior,
> which could be helpful when trying to make sense of applications that
> sit on a rich layer of libraries and complicated runtimes. But here a
> comparison to DAMON would be helpful.
100% agree.
Thanks,
SJ
[...]
* Re: [PATCH v4 9/9] virtio-balloon: add workingset reporting
2024-11-27 2:57 ` [PATCH v4 9/9] virtio-balloon: add " Yuanchu Xie
@ 2024-11-27 23:14 ` Daniel Verkamp
2024-11-27 23:38 ` Yuanchu Xie
0 siblings, 1 reply; 21+ messages in thread
From: Daniel Verkamp @ 2024-11-27 23:14 UTC (permalink / raw)
To: Yuanchu Xie
Cc: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
On Tue, Nov 26, 2024 at 7:00 PM Yuanchu Xie <yuanchu@google.com> wrote:
[...]
> diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
> index f6bbde2a04c3..1074b89035e9 100644
> --- a/include/linux/workingset_report.h
> +++ b/include/linux/workingset_report.h
[...]
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index ee35a372805d..668eaa39c85b 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -25,6 +25,7 @@
> * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> * SUCH DAMAGE. */
> +#include "linux/workingset_report.h"
> #include <linux/types.h>
> #include <linux/virtio_types.h>
> #include <linux/virtio_ids.h>
This seems to be including a non-uapi header
(include/linux/workingset_report.h) from a uapi header
(include/uapi/linux/virtio_balloon.h), which won't compile outside the
kernel. Does anything in the uapi actually need declarations from
workingset_report.h?
> @@ -37,6 +38,7 @@
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> #define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
> +#define VIRTIO_BALLOON_F_WS_REPORTING 6 /* Working Set Size reporting */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -128,4 +130,32 @@ struct virtio_balloon_stat {
> __virtio64 val;
> } __attribute__((packed));
>
> +/* Operations from the device */
> +#define VIRTIO_BALLOON_WS_OP_REQUEST 1
> +#define VIRTIO_BALLOON_WS_OP_CONFIG 2
> +
> +struct virtio_balloon_working_set_notify {
> + /* REQUEST or CONFIG */
> + __le16 op;
> + __le16 node_id;
> + /* the following fields valid iff op=CONFIG */
> + __le32 report_threshold;
> + __le32 refresh_interval;
> + __le32 idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
> +};
> +
> +struct virtio_balloon_working_set_report_bin {
> + __le64 idle_age;
> + /* bytes in this bucket for anon and file */
> + __le64 anon_bytes;
> + __le64 file_bytes;
> +};
> +
> +struct virtio_balloon_working_set_report {
> + __le32 error;
> + __le32 node_id;
> + struct virtio_balloon_working_set_report_bin
> + bins[WORKINGSET_REPORT_MAX_NR_BINS];
> +};
> +
> #endif /* _LINUX_VIRTIO_BALLOON_H */
Have the spec changes been discussed in the virtio TC?
Thanks,
-- Daniel
* Re: [PATCH v4 0/9] mm: workingset reporting
2024-11-27 7:26 ` [PATCH v4 0/9] mm: " Johannes Weiner
2024-11-27 19:40 ` SeongJae Park
@ 2024-11-27 23:33 ` Yu Zhao
2024-12-06 19:57 ` Yuanchu Xie
2 siblings, 0 replies; 21+ messages in thread
From: Yu Zhao @ 2024-11-27 23:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yuanchu Xie, Andrew Morton, David Hildenbrand, Aneesh Kumar K.V,
Khalid Aziz, Henry Huang, Dan Williams, Gregory Price,
Huang Ying, Lance Yang, Randy Dunlap, Muhammad Usama Anjum,
Tejun Heo, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest, SeongJae Park
On Wed, Nov 27, 2024 at 12:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?
Yuanchu might be able to answer this question a lot better than I do,
since he studied DAMON and tried to leverage it in our fleet.
My impression is that there are some fundamental differences in access
detection and accounting mechanisms between the two, i.e., sampling vs
scanning-based detection and non-lruvec vs lruvec-based accounting.
> CCing SJ.
>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can improve efficiency by
> > allowing more jobs to land on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.
Agreed. This is a (very) poor argument, unless there are facts to back this up.
> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.
Along a similar line, we also need to consider use cases that don't
involve backing storage, e.g., far memory (remote node). More details below.
> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>
> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.
> >
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, consider a promotion hot-page threshold defined as a reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node pass the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > Implementation
> > ==========
> > The reporting of user pages is based on MGLRU, and therefore requires
> > CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
> > fine-grained workingset report, but we can already gather a lot of data
> > with just four generations. The workingset reporting mechanism is gated
> > behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
> > CONFIG_WORKINGSET_REPORT_AGING.
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner are referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark | Experimental |Control | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.
How would Senpai work for use cases that don't have local storage,
i.e., all memory is mapped by either the fast or the slow tier? (>95%
memory usage in our fleet is mapped and local storage for non-storage
servers is only scratch space.)
My current understanding is that its approach would not be able to
form a feedback loop because there are currently no refaults from the
slow tier (because it's also mapped), and that's where I think this
proposal or something similar can help.
Also this proposal reports histograms, not scalars. So in theory,
userspace can see the projections of its potential actions, rather
than solely rely on trial and error. Of course, this needs to be
backed with data. So yes, some comparisons from real-world use cases
would be very helpful to demonstrate the value of this proposal.
* Re: [PATCH v4 9/9] virtio-balloon: add workingset reporting
2024-11-27 23:14 ` Daniel Verkamp
@ 2024-11-27 23:38 ` Yuanchu Xie
0 siblings, 0 replies; 21+ messages in thread
From: Yuanchu Xie @ 2024-11-27 23:38 UTC (permalink / raw)
To: Daniel Verkamp
Cc: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Johannes Weiner, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
On Wed, Nov 27, 2024 at 3:15 PM Daniel Verkamp <dverkamp@chromium.org> wrote:
> > * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > * SUCH DAMAGE. */
> > +#include "linux/workingset_report.h"
> > #include <linux/types.h>
> > #include <linux/virtio_types.h>
> > #include <linux/virtio_ids.h>
>
> This seems to be including a non-uapi header
> (include/linux/workingset_report.h) from a uapi header
> (include/uapi/linux/virtio_balloon.h), which won't compile outside the
> kernel. Does anything in the uapi actually need declarations from
> workingset_report.h?
Good point. I should move the relevant constants over.
> > +
> > +struct virtio_balloon_working_set_notify {
> > + /* REQUEST or CONFIG */
> > + __le16 op;
> > + __le16 node_id;
> > + /* the following fields valid iff op=CONFIG */
> > + __le32 report_threshold;
> > + __le32 refresh_interval;
> > + __le32 idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
> > +};
> > +
> > +struct virtio_balloon_working_set_report_bin {
> > + __le64 idle_age;
> > + /* bytes in this bucket for anon and file */
> > + __le64 anon_bytes;
> > + __le64 file_bytes;
> > +};
> > +
> > +struct virtio_balloon_working_set_report {
> > + __le32 error;
> > + __le32 node_id;
> > + struct virtio_balloon_working_set_report_bin
> > + bins[WORKINGSET_REPORT_MAX_NR_BINS];
> > +};
> > +
> > #endif /* _LINUX_VIRTIO_BALLOON_H */
>
> Have the spec changes been discussed in the virtio TC?
They have not. Thanks for bringing this up. I'll post in the VIRTIO TC.
Thanks,
Yuanchu
* Re: [PATCH v4 0/9] mm: workingset reporting
2024-11-27 7:26 ` [PATCH v4 0/9] mm: " Johannes Weiner
2024-11-27 19:40 ` SeongJae Park
2024-11-27 23:33 ` Yu Zhao
@ 2024-12-06 19:57 ` Yuanchu Xie
2024-12-11 19:53 ` SeongJae Park
2 siblings, 1 reply; 21+ messages in thread
From: Yuanchu Xie @ 2024-12-06 19:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest, SeongJae Park
Thanks for the response Johannes. Some replies inline.
On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?
>
> CCing SJ.
Thanks for the CC. DAMON was really good at visualizing the memory
access frequencies last time I tried it out! For server use cases,
DAMON would benefit from integrations with cgroups. The key then would
be a standard interface for exporting a cgroup's working set to the
user. It would be good to have something that will work for different
backing implementations, DAMON, MGLRU, or active/inactive LRU.
>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.
>
> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.
>
> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>
Yes, PSI takes the paging cost into account. I'm not claiming that
workingset reporting provides a superset of that information, but
rather that it can complement PSI. Sorry for the bad wording here.
> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.
The ballooning use case is an important one. Having working set
information would allow us to inflate a balloon of the right size in
the guest.
> >
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined in reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node passes the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >...
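(As a side note on the threshold calculation quoted above: it can be sketched as follows, assuming an idle-age histogram of (age_seconds, bytes) pairs rather than any specific kernel ABI.)

```python
def promotion_threshold(buckets, target=0.8):
    """Pick the smallest idle-age threshold N (seconds) such that the
    buckets with age <= N cover at least `target` of all bytes on the
    fast node. `buckets` is a list of (age_seconds, bytes) pairs,
    assumed sorted by ascending age."""
    total = sum(nbytes for _, nbytes in buckets)
    covered = 0
    for age, nbytes in buckets:
        covered += nbytes
        if covered >= target * total:
            return age
    return buckets[-1][0] if buckets else None

# Example: 80% of the bytes sit in buckets aged <= 10s,
# so pages reaccessed more often than every 10s get promoted.
hist = [(1, 400), (10, 400), (60, 150), (600, 50)]
threshold = promotion_threshold(hist)  # -> 10
```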
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner is referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.
>
> The paper I referenced above provides a detailed breakdown of how it
> all works together.
>
> I would be curious to see a more in-depth comparison to the prior art
> in this space. At first glance, your proposal seems more complex and
> less robust/versatile, at least for offloading and capacity gauging.
We have implemented TMO PSI-based proactive reclaim and compared it to
a kstaled-based reclaimer (reclaiming based on a 2-minute working set
and refaults). The PSI-based reclaimer was able to save more memory,
but it also caused spikes of refaults and much higher
decompressions/second. Overall the test workloads had better
performance with the kstaled-based reclaimer. The conclusion was that
it was a trade-off. Since we have some app classes on which we don't
want to induce pressure but from which we still want to proactively
reclaim, there's a missing piece. I do agree there's not a good
in-depth comparison with prior art though.
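(For reference, the core of the senpai-style PSI-driven loop discussed above can indeed be sketched in a few lines of Python using the cgroup v2 memory.pressure and memory.reclaim interfaces; the cgroup path and tuning constants below are made up for illustration.)

```python
CGROUP = "/sys/fs/cgroup/example"   # hypothetical cgroup; adjust as needed

def psi_some_avg10(pressure_text):
    """Parse the 'some' avg10 value out of cgroup v2 memory.pressure
    contents, e.g. 'some avg10=0.12 avg60=0.05 avg300=0.01 total=123'."""
    fields = pressure_text.splitlines()[0].split()
    return float(dict(kv.split("=") for kv in fields[1:])["avg10"])

def reclaim_step(path=CGROUP, chunk=16 << 20, max_psi=0.10):
    """One control-loop iteration: ask the kernel to reclaim `chunk`
    bytes via the cgroup v2 memory.reclaim interface (Linux 5.19+),
    but only while recent memory pressure stays below `max_psi`."""
    with open(f"{path}/memory.pressure") as f:
        pressure = psi_some_avg10(f.read())
    if pressure < max_psi:
        with open(f"{path}/memory.reclaim", "w") as f:
            f.write(str(chunk))
```

A real controller would run this periodically and back off when pressure rises; the point of the comparison above is that such a loop sees cost (pressure) but not the working set itself.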
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v4 0/9] mm: workingset reporting
2024-12-06 19:57 ` Yuanchu Xie
@ 2024-12-11 19:53 ` SeongJae Park
2025-01-30 2:02 ` Yuanchu Xie
0 siblings, 1 reply; 21+ messages in thread
From: SeongJae Park @ 2024-12-11 19:53 UTC (permalink / raw)
To: Yuanchu Xie
Cc: SeongJae Park, David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz,
Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@google.com> wrote:
> Thanks for the response Johannes. Some replies inline.
>
> On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > This patch series provides workingset reporting of user pages in
> > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > references. However, the concept of workingset applies generically to
> > > all types of memory, which could be kernel slab caches, discardable
> > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > come from slab shrinkers, device drivers, or the userspace.
> > > Another interesting idea might be hugepage workingset, so that we can
> > > measure the proportion of hugepages backing cold memory. However, with
> > > architectures like arm, there may be too many hugepage sizes leading to
> > > a combinatorial explosion when exporting stats to the userspace.
> > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > that is generic enough to accommodate the various use cases, and extensible
> > > to potential future use cases.
> >
> > Doesn't DAMON already provide this information?
> >
> > CCing SJ.
> Thanks for the CC. DAMON was really good at visualizing the memory
> access frequencies last time I tried it out!
Thank you for this kind acknowledgement, Yuanchu!
> For server use cases,
> DAMON would benefit from integrations with cgroups. The key then would be a
> standard interface for exporting a cgroup's working set to the user.
I see two ways to make DAMON support cgroups for now. The first way is making
another DAMON operations set implementation for cgroups. I shared a rough idea
for this before, probably at a kernel summit, but I haven't had a chance to
prioritize it so far. Please let me know if you need more details. The second
way is extending DAMOS filters to provide more detailed statistics per
DAMON-region, and adding another DAMOS action that does nothing but account
the detailed statistics. Using the new DAMOS action, users will be able to
know how much of a specific DAMON-found region is filtered out by the given
filter. Because we have a DAMOS filter type for cgroups, we can know how much
of the workingset (or warm memory) belongs to a specific group. This can be
applied not only to cgroups, but to any DAMOS filter type that exists (e.g.,
anonymous page, young page).
I believe the second way is simpler to implement while providing information
that is sufficient for most possible use cases. I was planning to do this anyway.
> It would be good to have something that will work for different
> backing implementations, DAMON, MGLRU, or active/inactive LRU.
I think we can do this using the filter statistics, with new filter
types. For example, we can add a new DAMOS filter that filters pages
based on whether the page's MGLRU generation falls within a specific
range, or whether the page belongs to the active or inactive LRU lists.
>
> >
> > > Use cases
> > > ==========
[...]
> > Access frequency is only half the picture. Whether you need to keep
> > memory with a given frequency resident depends on the speed of the
> > backing device.
[...]
> > > Benchmarks
> > > ==========
> > > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > > compile and redis benchmarks from openbenchmarking.org. The policy and
> > > runner is referred to as WMO (Workload Memory Optimization).
> > > The results were based on v3 of the series, but v4 doesn't change the core
> > > of the working set reporting and just adds the ballooning counterpart.
> > >
> > > The timed Linux kernel compilation benchmark shows improvements in peak
> > > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > > every 40 seconds". A swapfile is configured on SSD.
[...]
> > You can do this with a recent (>2018) upstream kernel and ~100 lines
> > of python [1]. It also works on both LRU implementations.
> >
> > [1] https://github.com/facebookincubator/senpai
> >
> > We use this approach in virtually the entire Meta fleet, to offload
> > unneeded memory, estimate available capacity for job scheduling, plan
> > future capacity needs, and provide accurate memory usage feedback to
> > application developers.
> >
> > It works over a wide variety of CPU and storage configurations with no
> > specific tuning.
> >
> > The paper I referenced above provides a detailed breakdown of how it
> > all works together.
> >
> > I would be curious to see a more in-depth comparison to the prior art
> > in this space. At first glance, your proposal seems more complex and
> > less robust/versatile, at least for offloading and capacity gauging.
> We have implemented TMO PSI-based proactive reclaim and compared it to
> a kstaled-based reclaimer (reclaiming based on 2 minute working set
> and refaults). The PSI-based reclaimer was able to save more memory,
> but it also caused spikes of refaults and a lot higher
> decompressions/second. Overall the test workloads had better
> performance with the kstaled-based reclaimer. The conclusion was that
> it was a trade-off.
I agree it is only half of the picture, and there could be a trade-off.
Motivated by those previous works, DAMOS provides PSI-based
aggressiveness auto-tuning to combine both approaches.
> I do agree there's not a good in-depth comparison
> with prior art though.
I would be more than happy to help with comparison work against the
current DAMON implementation and its future plans, and with any
possible collaboration.
Thanks,
SJ
* Re: [PATCH v4 0/9] mm: workingset reporting
2024-12-11 19:53 ` SeongJae Park
@ 2025-01-30 2:02 ` Yuanchu Xie
2025-01-30 4:11 ` SeongJae Park
0 siblings, 1 reply; 21+ messages in thread
From: Yuanchu Xie @ 2025-01-30 2:02 UTC (permalink / raw)
To: SeongJae Park
Cc: David Hildenbrand, Aneesh Kumar K.V, Khalid Aziz, Henry Huang,
Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Lance Yang,
Randy Dunlap, Muhammad Usama Anjum, Tejun Heo, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
On Wed, Dec 11, 2024 at 11:53 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@google.com> wrote:
>
> > Thanks for the response Johannes. Some replies inline.
> >
> > On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > > This patch series provides workingset reporting of user pages in
> > > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > > references. However, the concept of workingset applies generically to
> > > > all types of memory, which could be kernel slab caches, discardable
> > > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > > come from slab shrinkers, device drivers, or the userspace.
> > > > Another interesting idea might be hugepage workingset, so that we can
> > > > measure the proportion of hugepages backing cold memory. However, with
> > > > architectures like arm, there may be too many hugepage sizes leading to
> > > > a combinatorial explosion when exporting stats to the userspace.
> > > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > > that is generic enough to accommodate the various use cases, and extensible
> > > > to potential future use cases.
> > >
> > > Doesn't DAMON already provide this information?
> > >
> > > CCing SJ.
> > Thanks for the CC. DAMON was really good at visualizing the memory
> > access frequencies last time I tried it out!
>
> Thank you for this kind acknowledgement, Yuanchu!
>
> > For server use cases,
> > DAMON would benefit from integrations with cgroups. The key then would be a
> > standard interface for exporting a cgroup's working set to the user.
>
> I see two ways to make DAMON support cgroups for now. The first way is making
> another DAMON operations set implementation for cgroups. I shared a rough idea
> for this before, probably at a kernel summit, but I haven't had a chance to
> prioritize it so far. Please let me know if you need more details. The second
> way is extending DAMOS filters to provide more detailed statistics per
> DAMON-region, and adding another DAMOS action that does nothing but account
> the detailed statistics. Using the new DAMOS action, users will be able to
> know how much of a specific DAMON-found region is filtered out by the given
> filter. Because we have a DAMOS filter type for cgroups, we can know how much
> of the workingset (or warm memory) belongs to a specific group. This can be
> applied not only to cgroups, but to any DAMOS filter type that exists (e.g.,
> anonymous page, young page).
>
> I believe the second way is simpler to implement while providing information
> that is sufficient for most possible use cases. I was planning to do this anyway.
For a container orchestrator like kubernetes, the node agents need to
be able to gather the working set stats at a per-job level. Some jobs
can create sub-hierarchies as well, so it's important that we have
hierarchical stats.
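(A node agent could aggregate such per-cgroup stats over a sub-hierarchy roughly like this; the stat file name and its line format below are hypothetical stand-ins for whatever interface is eventually merged.)

```python
import os

def hierarchical_workingset(root, fname="memory.workingset.page_age"):
    """Sum a per-cgroup working set stat file over a cgroup subtree.

    `fname` is a hypothetical per-cgroup file with "AGE BYTES" lines.
    If the kernel does not sum the stat hierarchically itself, a node
    agent can walk the subtree and aggregate the buckets, as sketched
    here. Returns {age_bucket: bytes} over root and all descendants.
    """
    totals = {}
    for dirpath, _dirs, files in os.walk(root):
        if fname not in files:
            continue
        with open(os.path.join(dirpath, fname)) as f:
            for line in f:
                age, nbytes = line.split()
                totals[int(age)] = totals.get(int(age), 0) + int(nbytes)
    return totals
```

With sub-hierarchies created by jobs themselves, this walk is what makes a per-job view possible without the job cooperating.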
Do you think it's a good idea to integrate DAMON to provide some
aggregate stats in a memory controller file? With the DAMOS cgroup
filter, there can be some kind of interface that a DAMOS action or the
damo tool could call into. I feel that would be a straightforward and
integrated way to support cgroups.
Yuanchu
* Re: [PATCH v4 0/9] mm: workingset reporting
2025-01-30 2:02 ` Yuanchu Xie
@ 2025-01-30 4:11 ` SeongJae Park
0 siblings, 0 replies; 21+ messages in thread
From: SeongJae Park @ 2025-01-30 4:11 UTC (permalink / raw)
To: Yuanchu Xie
Cc: SeongJae Park, Aneesh Kumar K.V, Khalid Aziz, Henry Huang,
Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Lance Yang,
Randy Dunlap, Muhammad Usama Anjum, Tejun Heo, Michal Koutný,
Jonathan Corbet, Greg Kroah-Hartman, Rafael J. Wysocki,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
cgroups, linux-doc, linux-kernel, virtualization, linux-mm,
linux-kselftest
Hi Yuanchu,
On Wed, 29 Jan 2025 18:02:26 -0800 Yuanchu Xie <yuanchu@google.com> wrote:
> On Wed, Dec 11, 2024 at 11:53 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@google.com> wrote:
> >
> > > Thanks for the response Johannes. Some replies inline.
> > >
> > > On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > > > This patch series provides workingset reporting of user pages in
> > > > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > > > references. However, the concept of workingset applies generically to
> > > > > all types of memory, which could be kernel slab caches, discardable
> > > > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > > > come from slab shrinkers, device drivers, or the userspace.
> > > > > Another interesting idea might be hugepage workingset, so that we can
> > > > > measure the proportion of hugepages backing cold memory. However, with
> > > > > architectures like arm, there may be too many hugepage sizes leading to
> > > > > a combinatorial explosion when exporting stats to the userspace.
> > > > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > > > that is generic enough to accommodate the various use cases, and extensible
> > > > > to potential future use cases.
> > > >
> > > > Doesn't DAMON already provide this information?
> > > >
> > > > CCing SJ.
> > > Thanks for the CC. DAMON was really good at visualizing the memory
> > > access frequencies last time I tried it out!
> >
> > Thank you for this kind acknowledgement, Yuanchu!
> >
> > > For server use cases,
> > > DAMON would benefit from integrations with cgroups. The key then would be a
> > > standard interface for exporting a cgroup's working set to the user.
> >
> > I see two ways to make DAMON support cgroups for now. The first way is making
> > another DAMON operations set implementation for cgroups. I shared a rough idea
> > for this before, probably at a kernel summit, but I haven't had a chance to
> > prioritize it so far. Please let me know if you need more details. The second
> > way is extending DAMOS filters to provide more detailed statistics per
> > DAMON-region, and adding another DAMOS action that does nothing but account
> > the detailed statistics. Using the new DAMOS action, users will be able to
> > know how much of a specific DAMON-found region is filtered out by the given
> > filter. Because we have a DAMOS filter type for cgroups, we can know how much
> > of the workingset (or warm memory) belongs to a specific group. This can be
> > applied not only to cgroups, but to any DAMOS filter type that exists (e.g.,
> > anonymous page, young page).
> >
> > I believe the second way is simpler to implement while providing information
> > that is sufficient for most possible use cases. I was planning to do this anyway.
I implemented the feature via the second approach I mentioned above. The
initial version of the feature was recently merged[1] into the mainline as
part of the 6.14-rc1 MM pull request. The DAMON user-space tool (damo) has
also been updated for basic support of it. I forgot to update this thread, sorry.
> For a container orchestrator like kubernetes, the node agents need to
> be able to gather the working set stats at a per-job level. Some jobs
> can create sub-hierarchies as well, so it's important that we have
> hierarchical stats.
This makes sense to me. And yes, I believe DAMOS filters for memcg could also
be used for this use case, since we can install and use multiple DAMOS filters
in combination.
The documentation of the feature is not that good yet, and there is much room
for improvement. You might not be able to get exactly what you want with the
current implementation. But we will continue improving it, and I believe we
can do so faster if efforts are combined. Of course, I could be wrong, and
whether to use it or not is up to each person :)
Anyway, please feel free to ask me questions or for any help with the feature
if you want.
>
> Do you think it's a good idea to integrate DAMON to provide some
> aggregate stats in a memory controller file? With the DAMOS cgroup
> filter, there can be some kind of interface that a DAMOS action or the
> damo tool could call into. I feel that would be a straightforward and
> integrated way to support cgroups.
DAMON basically exposes its internal information via DAMON sysfs, and the
DAMON user-space tool (damo) uses it. In this case, the per-memcg working set
could also be retrieved that way (directly from DAMON sysfs or indirectly via
damo). But, yes, I think we could make new, optimized ABIs for exposing the
information to user space in a more efficient way depending on the use case,
if needed. DAMON modules such as DAMON_RECLAIM and DAMON_LRU_SORT provide
their own ABIs that are simplified and optimized for their usages.
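(To give a flavor of the sysfs route: reading the per-scheme DAMOS statistics counters might look roughly like the sketch below. The path layout is assumed from the DAMON sysfs ABI and should be checked against the ABI documentation of the running kernel.)

```python
import os

# Assumed location of the stats directory for kdamond 0, context 0,
# scheme 0; verify against Documentation/ABI for your kernel version.
STATS_DIR = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats"

def read_damos_stats(stats_dir=STATS_DIR):
    """Read the per-scheme DAMOS statistics counters as integers."""
    stats = {}
    for name in ("nr_tried", "sz_tried", "nr_applied", "sz_applied"):
        with open(os.path.join(stats_dir, name)) as f:
            stats[name] = int(f.read())
    return stats
```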
[1] https://git.kernel.org/torvalds/c/626ffabe67c2
Thanks,
SJ
[...]
end of thread, other threads:[~2025-01-30 4:11 UTC | newest]
Thread overview: 21+ messages
2024-11-27 2:57 [PATCH v4 0/9] mm: workingset reporting Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 1/9] mm: aggregate workingset information into histograms Yuanchu Xie
2024-11-27 4:21 ` Matthew Wilcox
2024-11-27 17:47 ` Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 2/9] mm: use refresh interval to rate-limit workingset report aggregation Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 3/9] mm: report workingset during memory pressure driven scanning Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 4/9] mm: extend workingset reporting to memcgs Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 5/9] mm: add kernel aging thread for workingset reporting Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 6/9] selftest: test system-wide " Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 8/9] Docs/admin-guide/cgroup-v2: document workingset reporting Yuanchu Xie
2024-11-27 2:57 ` [PATCH v4 9/9] virtio-balloon: add " Yuanchu Xie
2024-11-27 23:14 ` Daniel Verkamp
2024-11-27 23:38 ` Yuanchu Xie
2024-11-27 7:26 ` [PATCH v4 0/9] mm: " Johannes Weiner
2024-11-27 19:40 ` SeongJae Park
2024-11-27 23:33 ` Yu Zhao
2024-12-06 19:57 ` Yuanchu Xie
2024-12-11 19:53 ` SeongJae Park
2025-01-30 2:02 ` Yuanchu Xie
2025-01-30 4:11 ` SeongJae Park