From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Yu Zhao <yuzhao@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
Nhat Pham <nphamcs@gmail.com>, Yuanchu Xie <yuanchu@google.com>,
Suren Baghdasaryan <surenb@google.com>,
"T . J . Mercier" <tjmercier@google.com>,
linux-kernel@vger.kernel.org, Kairui Song <kasong@tencent.com>
Subject: [RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model
Date: Wed, 13 Sep 2023 02:45:07 +0800
Message-ID: <20230912184511.49333-2-ryncsn@gmail.com>
In-Reply-To: <20230912184511.49333-1-ryncsn@gmail.com>
From: Kairui Song <kasong@tencent.com>
This basically removes workingset_activation and reduces the number of
calls to workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance, and it prepares for adapting refault-distance based
re-activation for multi-gen LRU.
Currently, refault distance re-activation is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
the LRU starts from the right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), so it
only left-shifts the pages to its right.
Besides, one page can get activated/deactivated multiple times.
Multi-gen LRU also doesn't fit this model well: pages are aged and
activated constantly as the generation sliding window slides.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with an eviction sequence like before.
Let the `nonresident_age` (NA) still be increased for each eviction, so
we get a "shadow LRU" of evicted pages. For one evicted page:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                          +-memory available to cache-+
                          |                           |
+-------------------------+===============+===========+
| *   shadows   O   O   O |   INACTIVE    |   ACTIVE  |
+-+-----------------------+===============+===========+
  |                       |
  +-----------------------+
  |          SP
fault page           O -> Hole left by previously faulted in pages
                     * -> The page corresponding to SP
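To make the distance above concrete, here is a minimal, illustrative
sketch of how SP relates to the nonresident_age counter. The variable
names here are made up; the real code additionally packs the
eviction-time reading into the shadow entry and masks the subtraction
(see lru_eviction()/lru_refault() in the diff below):

    /*
     * Illustrative only: the eviction-time reading actually lives in
     * the packed shadow entry, truncated to EVICTION_BITS.
     */
    unsigned long na_at_eviction; /* nonresident_age read when the page was evicted */
    unsigned long na_now;         /* nonresident_age read when the page refaults */
    unsigned long sp = na_now - na_at_eviction; /* evictions since the page left memory */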
It can be easily seen that SP stands for how far the current workflow
could push a page out of available memory. Since every evicted page was
once the head of the INACTIVE list, the page could have such an access
distance:
SP + NR_INACTIVE
It *may* get re-activated before getting evicted again if:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth being re-activated to start from the ACTIVE part,
since its access distance is shorter than the total memory needed to make
it stay. The calculation is the same as before, it just drops assumption 1
above.
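As a made-up numeric example: suppose a page was evicted when NA read
10000 and it refaults when NA reads 10600, so SP = 600. With
NR_ACTIVE = 1000 and NR_INACTIVE = 800, the estimated access distance is
600 + 800 = 1400, which is below the 1800 pages of memory available to
cache, i.e. SP < NR_ACTIVE, so re-activating the page is worthwhile.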
Since this is only an estimation, based on several hypotheses, and it
could break the ability of the LRU to distinguish a workingset out of
caches, throttle it by two factors:
1. Notice that previously refaulted-in pages may leave "holes" in the
shadow part of the LRU. That part is left unhandled on purpose to
decrease the re-activation rate for pages that have a large SP value
(the larger the SP value of a page, the more likely it will be
affected by such holes).
2. When the ACTIVE part of the LRU is long enough, challenging ACTIVE
pages by re-activating a previously INACTIVE page that faulted only
once may not be a good idea, so throttle the re-activation when
ACTIVE > INACTIVE by comparing with INACTIVE instead.
Combining all the above, we have:
Upon refault, if any of the following conditions is met, mark the page as
active (see the sketch after this list):
- If ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
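The rule above can be sketched as the following illustrative helper. The
name and signature are made up for this description only; the actual
implementation is lru_refault() in the diff below, which also splits the
counts by anon/file and only considers anon pages when swap is available:

    /*
     * Illustrative sketch only -- the real check is lru_refault() below,
     * which works on lruvec statistics and the packed shadow entry.
     */
    static bool should_reactivate(unsigned long sp,
                                  unsigned long nr_active,
                                  unsigned long nr_inactive)
    {
            /* ACTIVE is already large: challenge it less aggressively. */
            if (nr_active >= nr_inactive)
                    return sp < nr_inactive;
            /* ACTIVE is low: re-activate if the page would fit in ACTIVE. */
            return sp < nr_active;
    }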
The code is almost the same as before, just simpler, since there is no
longer a need to do the lruvec statistics update when activating a page.
A few benchmarks showed a similar or better result. And when combined
with multi-gen LRU (in later commits), it shows a measurable performance
gain for some workloads.
Using the memtier and fio tests from commit ac35a4902374, but scaled down
to fit in my test environment, plus some other test results:
memtier test (with 16G ramdisk as swap and 2G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0700 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--mysql-db=sb --tables=36 --table-size=2000000 --threads=12 --time=1800
Before (average of 6 test runs):
fio: IOPS=5213.7k
fio2: IOPS=7315.3k
memcached: 49493.75 ops/s
mysql: 6237.45 tps
After (average of 6 test runs):
fio: IOPS=5230.5k
fio2: IOPS=7349.3k
memcached: 49912.79 ops/s
mysql: 6240.62 tps
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 2 -
mm/swap.c | 1 -
mm/vmscan.c | 2 -
mm/workingset.c | 215 +++++++++++++++++++++----------------------
4 files changed, 106 insertions(+), 114 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 493487ed7c38..ca51d79842b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,10 +344,8 @@ static inline swp_entry_t page_swap_entry(struct page *page)
/* linux/mm/workingset.c */
bool workingset_test_recent(void *shadow, bool file, bool *workingset);
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
void workingset_refault(struct folio *folio, void *shadow);
-void workingset_activation(struct folio *folio);
/* Only track the nodes of mappings with shadow entries */
void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..685b446fd4f9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,7 +482,6 @@ void folio_mark_accessed(struct folio *folio)
else
__lru_cache_activate_folio(folio);
folio_clear_referenced(folio);
- workingset_activation(folio);
}
if (folio_test_idle(folio))
folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..3f4de75e5186 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,8 +2539,6 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
nr_moved += nr_pages;
- if (folio_test_active(folio))
- workingset_age_nonresident(lruvec, nr_pages);
}
/*
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..babda11601ea 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,9 +180,10 @@
*/
#define WORKINGSET_SHIFT 1
-#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
+#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
WORKINGSET_SHIFT + NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
+#define EVICTION_BITS (BITS_PER_LONG - (EVICTION_SHIFT))
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
/*
@@ -226,8 +227,103 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*workingsetp = workingset;
}
-#ifdef CONFIG_LRU_GEN
+/*
+ * Get the distance reading at eviction time.
+ */
+static inline unsigned long lru_eviction(struct lruvec *lruvec,
+ int bits, int bucket_order)
+{
+ unsigned long eviction = atomic_long_read(&lruvec->nonresident_age);
+
+ eviction >>= bucket_order;
+ eviction &= ~0UL >> (BITS_PER_LONG - bits);
+
+ return eviction;
+}
+
+/*
+ * Calculate and test refault distance
+ */
+static inline bool lru_refault(struct mem_cgroup *memcg,
+ struct lruvec *lruvec,
+ unsigned long eviction, bool file,
+ int bits, int bucket_order)
+{
+ unsigned long refault, distance;
+ unsigned long workingset, active, inactive, inactive_file, inactive_anon = 0;
+
+ eviction <<= bucket_order;
+ refault = atomic_long_read(&lruvec->nonresident_age);
+
+ /*
+ * The unsigned subtraction here gives an accurate distance
+ * across nonresident_age overflows in most cases. There is a
+ * special case: usually, shadow entries have a short lifetime
+ * and are either refaulted or reclaimed along with the inode
+ * before they get too old. But it is not impossible for the
+ * nonresident_age to lap a shadow entry in the field, which
+ * can then result in a false small refault distance, leading
+ * to a false activation should this old entry actually
+ * refault again. However, earlier kernels used to deactivate
+ * unconditionally with *every* reclaim invocation for the
+ * longest time, so the occasional inappropriate activation
+ * leading to pressure on the active list is not a problem.
+ */
+ distance = (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
+ active = lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+ inactive_file = lruvec_page_state(lruvec, NR_INACTIVE_FILE);
+ if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+ active += lruvec_page_state(lruvec, NR_ACTIVE_ANON);
+ inactive_anon = lruvec_page_state(lruvec, NR_INACTIVE_ANON);
+ }
+
+ /*
+ * Compare the distance to the existing workingset size. We
+ * don't activate pages that couldn't stay resident even if
+ * all the memory was available to the workingset. Whether
+ * workingset competition needs to consider anon or not depends
+ * on having free swap space.
+ *
+ * When there are already enough active pages, be less aggressive
+ * on reactivating pages, challenge an already established set of
+ * active pages with one time refaulted page may not be a good idea.
+ */
+ if (active >= (inactive_anon + inactive_file))
+ return distance < inactive_anon + inactive_file;
+ else
+ return distance < active + (file ? inactive_anon : inactive_file);
+}
+
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+static void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
+{
+ /*
+ * Reclaiming a cgroup means reclaiming all its children in a
+ * round-robin fashion. That means that each cgroup has an LRU
+ * order that is composed of the LRU orders of its child
+ * cgroups; and every page has an LRU position not just in the
+ * cgroup that owns it, but in all of that group's ancestors.
+ *
+ * So when the physical inactive list of a leaf cgroup ages,
+ * the virtual inactive lists of all its parents, including
+ * the root cgroup's, age as well.
+ */
+ do {
+ atomic_long_add(nr_pages, &lruvec->nonresident_age);
+ } while ((lruvec = parent_lruvec(lruvec)));
+}
+
+#ifdef CONFIG_LRU_GEN
static void *lru_gen_eviction(struct folio *folio)
{
int hist;
@@ -342,34 +438,6 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
#endif /* CONFIG_LRU_GEN */
-/**
- * workingset_age_nonresident - age non-resident entries as LRU ages
- * @lruvec: the lruvec that was aged
- * @nr_pages: the number of pages to count
- *
- * As in-memory pages are aged, non-resident pages need to be aged as
- * well, in order for the refault distances later on to be comparable
- * to the in-memory dimensions. This function allows reclaim and LRU
- * operations to drive the non-resident aging along in parallel.
- */
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
-{
- /*
- * Reclaiming a cgroup means reclaiming all its children in a
- * round-robin fashion. That means that each cgroup has an LRU
- * order that is composed of the LRU orders of its child
- * cgroups; and every page has an LRU position not just in the
- * cgroup that owns it, but in all of that group's ancestors.
- *
- * So when the physical inactive list of a leaf cgroup ages,
- * the virtual inactive lists of all its parents, including
- * the root cgroup's, age as well.
- */
- do {
- atomic_long_add(nr_pages, &lruvec->nonresident_age);
- } while ((lruvec = parent_lruvec(lruvec)));
-}
-
/**
* workingset_eviction - note the eviction of a folio from memory
* @target_memcg: the cgroup that is causing the reclaim
@@ -396,11 +464,11 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
- eviction = atomic_long_read(&lruvec->nonresident_age);
- eviction >>= bucket_order;
+
+ eviction = lru_eviction(lruvec, EVICTION_BITS, bucket_order);
workingset_age_nonresident(lruvec, folio_nr_pages(folio));
return pack_shadow(memcgid, pgdat, eviction,
- folio_test_workingset(folio));
+ folio_test_workingset(folio));
}
/**
@@ -418,9 +486,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
{
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
- unsigned long refault_distance;
- unsigned long workingset_size;
- unsigned long refault;
int memcgid;
struct pglist_data *pgdat;
unsigned long eviction;
@@ -429,7 +494,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
- eviction <<= bucket_order;
/*
* Look up the memcg associated with the stored ID. It might
@@ -450,50 +514,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
eviction_memcg = mem_cgroup_from_id(memcgid);
if (!mem_cgroup_disabled() && !eviction_memcg)
return false;
-
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
- refault = atomic_long_read(&eviction_lruvec->nonresident_age);
-
- /*
- * Calculate the refault distance
- *
- * The unsigned subtraction here gives an accurate distance
- * across nonresident_age overflows in most cases. There is a
- * special case: usually, shadow entries have a short lifetime
- * and are either refaulted or reclaimed along with the inode
- * before they get too old. But it is not impossible for the
- * nonresident_age to lap a shadow entry in the field, which
- * can then result in a false small refault distance, leading
- * to a false activation should this old entry actually
- * refault again. However, earlier kernels used to deactivate
- * unconditionally with *every* reclaim invocation for the
- * longest time, so the occasional inappropriate activation
- * leading to pressure on the active list is not a problem.
- */
- refault_distance = (refault - eviction) & EVICTION_MASK;
- /*
- * Compare the distance to the existing workingset size. We
- * don't activate pages that couldn't stay resident even if
- * all the memory was available to the workingset. Whether
- * workingset competition needs to consider anon or not depends
- * on having free swap space.
- */
- workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
- if (!file) {
- workingset_size += lruvec_page_state(eviction_lruvec,
- NR_INACTIVE_FILE);
- }
- if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
- workingset_size += lruvec_page_state(eviction_lruvec,
- NR_ACTIVE_ANON);
- if (file) {
- workingset_size += lruvec_page_state(eviction_lruvec,
- NR_INACTIVE_ANON);
- }
- }
-
- return refault_distance <= workingset_size;
+ return lru_refault(eviction_memcg, eviction_lruvec, eviction, file,
+ EVICTION_BITS, bucket_order);
}
/**
@@ -543,7 +567,6 @@ void workingset_refault(struct folio *folio, void *shadow)
goto out;
folio_set_active(folio);
- workingset_age_nonresident(lruvec, nr);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
/* Folio was active prior to eviction */
@@ -560,30 +583,6 @@ void workingset_refault(struct folio *folio, void *shadow)
rcu_read_unlock();
}
-/**
- * workingset_activation - note a page activation
- * @folio: Folio that is being activated.
- */
-void workingset_activation(struct folio *folio)
-{
- struct mem_cgroup *memcg;
-
- rcu_read_lock();
- /*
- * Filter non-memcg pages here, e.g. unmap can call
- * mark_page_accessed() on VDSO pages.
- *
- * XXX: See workingset_refault() - this should return
- * root_mem_cgroup even for !CONFIG_MEMCG.
- */
- memcg = folio_memcg_rcu(folio);
- if (!mem_cgroup_disabled() && !memcg)
- goto out;
- workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
-out:
- rcu_read_unlock();
-}
-
/*
* Shadow entries reflect the share of the working set that does not
* fit into memory, so their number depends on the access pattern of
@@ -778,7 +777,6 @@ static struct lock_class_key shadow_nodes_key;
static int __init workingset_init(void)
{
- unsigned int timestamp_bits;
unsigned int max_order;
int ret;
@@ -790,12 +788,11 @@ static int __init workingset_init(void)
* some more pages at runtime, so keep working with up to
* double the initial memory by using totalram_pages as-is.
*/
- timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
max_order = fls_long(totalram_pages() - 1);
- if (max_order > timestamp_bits)
- bucket_order = max_order - timestamp_bits;
+ if (max_order > EVICTION_BITS)
+ bucket_order = max_order - EVICTION_BITS;
pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
- timestamp_bits, max_order, bucket_order);
+ EVICTION_BITS, max_order, bucket_order);
ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
if (ret)
--
2.41.0
Thread overview:
2023-09-12 18:45 [RFC PATCH v2 0/5] Refault distance checking for MGLRU Kairui Song
2023-09-12 18:45 ` Kairui Song [this message]
2023-09-12 19:47 ` [RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model Johannes Weiner
2023-09-13 9:26 ` Kairui Song
2023-09-12 18:45 ` [RFC PATCH v2 2/5] workingset: update comment in workingset.c Kairui Song
2023-09-12 18:45 ` [RFC PATCH v2 3/5] workingset: simplify lru_gen_test_recent Kairui Song
2023-09-12 18:45 ` [RFC PATCH v2 4/5] lru_gen: convert avg_total and avg_refaulted to atomic Kairui Song
2023-09-12 18:45 ` [RFC PATCH v2 5/5] workingset, lru_gen: apply refault-distance based re-activation Kairui Song