[RFC PATCH v2 0/5] Refault distance checking for MGLRU

From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhao <yuzhao@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
	Nhat Pham <nphamcs@gmail.com>, Yuanchu Xie <yuanchu@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	"T . J . Mercier" <tjmercier@google.com>,
	linux-kernel@vger.kernel.org, Kairui Song <kasong@tencent.com>
Subject: [RFC PATCH v2 0/5] Refault distance checking for MGLRU
Date: Wed, 13 Sep 2023 02:45:06 +0800
Message-ID: <20230912184511.49333-1-ryncsn@gmail.com>

From: Kairui Song <kasong@tencent.com>

Hi, linux-mm

I noticed that MGLRU does not work very well on certain workloads under
heavy memory stress.

After some debugging, I found this is related to refault distance
detection. The problem appears when the file page workingset size
exceeds total memory and the access distance of file pages (the
left-shift time of a page before it gets activated or promoted,
considering the LRU starts from the right) is larger than total memory.
All file pages are then stuck on the oldest generation, getting read in
and evicted repeatedly; few get activated and stay in memory.
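
To illustrate the failure mode described above, here is a toy simulation.
It is not the kernel's algorithm: it models a single plain LRU list with no
generations and no refault tracking, scanned cyclically by a working set
that is slightly larger than the cache. The names simulate_cyclic_lru,
capacity, working_set and rounds are made up for this sketch.

```python
from collections import OrderedDict


def simulate_cyclic_lru(capacity, working_set, rounds):
    """Return (hits, misses) for a cyclic scan over `working_set` pages
    through a plain LRU cache that holds `capacity` pages.

    Toy model only: one LRU list, no activation, no shadow entries.
    """
    cache = OrderedDict()  # keys are page numbers, ordered oldest first
    hits = misses = 0
    for _ in range(rounds):
        for page in range(working_set):
            if page in cache:
                cache.move_to_end(page)  # mark as most recently used
                hits += 1
            else:
                misses += 1
                cache[page] = True  # "read in" the page
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used
    return hits, misses


if __name__ == "__main__":
    # Working set fits: only the first pass misses.
    print("fits:  ", simulate_cyclic_lru(capacity=100, working_set=100, rounds=10))
    # Working set is one page too large: every access misses.
    print("thrash:", simulate_cyclic_lru(capacity=100, working_set=101, rounds=10))
```

In this toy model, one extra page is enough to turn every access into a
miss, because each page is evicted just before it is needed again. Getting
any hits in that situation requires keeping some subset of the pages
resident, which is what activation of refaulted pages is meant to achieve.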

This series tries to fix the problem by reworking the refault distance
based activation to better fit MGLRU, and also tries to use a unified
algorithm for both MGLRU and the Inactive/Active LRU. Performance
improved about 5x in the benchmark below for a workload that did not
work well previously.

Patch 1/5 reworks the refault distance detection model for the
Inactive/Active LRU.

Patch 2/5 updates the comments.

Patches 3/5 and 4/5 are simplification and preparation.

Patch 5/5 applies the modified refault distance detection
to MGLRU.

The following benchmark showed a 5x improvement.
To simulate the workload, I set up a 3-replica MongoDB cluster using
docker, each replica in a standalone cgroup, set to use 5GB of
WiredTiger cache and 10GB of oplog, on a 32G VM. The benchmark was done
using https://github.com/apavlo/py-tpcc.git, modified to run only the
STOCK_LEVEL query, to simulate slow queries and get a stable result.

Before the patch (with 10G swap; the result is the same whether
swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     503             27150226136.4   0.02 txn/s
------------------------------------------------------------------
  TOTAL           503             27150226136.4   0.02 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        6752         379          23       24706       24607
Swap:         10239           0       10239

After the patch (with 10G swap on same disk, similar result using ZRAM):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2575            27094953498.8   0.10 txn/s
------------------------------------------------------------------
  TOTAL           2575            27094953498.8   0.10 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        7376         320           3       24140       24014
Swap:         10239        1662        8577

The performance is about 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
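
As a quick sanity check of the figures above, the following sketch
recomputes the rate, the speedup, and the share of refaulted file pages
that were activated. All numbers are copied from the tpcc.py and
/proc/vmstat output in this message; the helper names txn_rate and
activation_ratio are made up here, and reading "Time (µs)" as the summed
execution time across clients is an assumption.

```python
# Figures copied from the benchmark output in this message.
before = {
    "executed": 503,
    "time_us": 27150226136.4,
    "refault_file": 23856735,
    "activate_file": 23845737,
}
after = {
    "executed": 2575,
    "time_us": 27094953498.8,
    "refault_file": 23001863,
    "activate_file": 6718032,
}


def txn_rate(run):
    """Transactions per second, assuming Time (µs) is summed execution time."""
    return run["executed"] / (run["time_us"] / 1e6)


def activation_ratio(run):
    """Share of refaulted file pages that were activated on refault."""
    return run["activate_file"] / run["refault_file"]


speedup = after["executed"] / before["executed"]

print(f"rate before:       {txn_rate(before):.2f} txn/s")
print(f"rate after:        {txn_rate(after):.2f} txn/s")
print(f"speedup:           {speedup:.2f}x")
print(f"activated before:  {activation_ratio(before):.1%}")
print(f"activated after:   {activation_ratio(after):.1%}")
```

The executed-transaction counts give a speedup of a little over 5x, in
line with the reported rates. The counters also suggest that nearly every
refaulted file page was activated before the patch, against roughly 29%
after it.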

I also checked memtier/memcached, fio, and some other benchmarks; they
look OK so far. The results are in each commit message.

Sending this out as an RFC; I'm still doing more testing on it, since
this changes a frequently used algorithm and I'm not really sure whether
there is any performance regression. It should improve the performance
for file pages in general, since it saves some operations.

Update from V1:
- Removed the fls operations which were previously used in patch 1 for
  protecting active pages by an exponential ratio; simply comparing with
  the number of inactive pages seems good enough.
- Updated some benchmark results; test results that are basically
  identical to before are not updated.

Kairui Song (5):
  workingset: simplify and use a more intuitive model
  workingset: update comment in workingset.c
  workingset: simplify lru_gen_test_recent
  lru_gen: convert avg_total and avg_refaulted to atomic
  workingset, lru_gen: apply refault-distance based re-activation

 include/linux/mmzone.h |   4 +-
 include/linux/swap.h   |   2 -
 mm/swap.c              |   1 -
 mm/vmscan.c            |  18 +-
 mm/workingset.c        | 411 +++++++++++++++++++++--------------------
 5 files changed, 221 insertions(+), 215 deletions(-)

-- 
2.41.0
