From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: Alex Shi <alex.shi@linux.alibaba.com>,
Andrew Morton <akpm@linux-foundation.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Hillf Danton <hdanton@sina.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@suse.com>,
Roman Gushchin <guro@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
Wei Yang <richard.weiyang@linux.alibaba.com>,
Yang Shi <shy828301@gmail.com>,
Ying Huang <ying.huang@intel.com>,
linux-kernel@vger.kernel.org, page-reclaim@google.com,
Yu Zhao <yuzhao@google.com>
Subject: [PATCH v1 14/14] mm: multigenerational lru: documentation
Date: Sat, 13 Mar 2021 00:57:47 -0700 [thread overview]
Message-ID: <20210313075747.3781593-15-yuzhao@google.com> (raw)
In-Reply-To: <20210313075747.3781593-1-yuzhao@google.com>
Add Documentation/vm/multigen_lru.rst.
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
Documentation/vm/index.rst | 1 +
Documentation/vm/multigen_lru.rst | 210 ++++++++++++++++++++++++++++++
2 files changed, 211 insertions(+)
create mode 100644 Documentation/vm/multigen_lru.rst
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
swap_numa
zswap
+ multigen_lru
Kernel developers MM documentation
==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..fea927da2572
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,210 @@
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Options
+-------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
+ a maximum of ``X`` generations.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Options
+---------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+ feature was not turned on by default.
+
+:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
+ to spread pages out across ``N+1`` generations. ``N`` must be less
+ than ``X``. Larger values make the background aging more aggressive.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
+ This file has the following output:
+
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen birth_time anon_size file_size
+ ...
+ max_gen birth_time anon_size file_size
+
+Given a memcg and a node, ``min_gen`` is the oldest generation
+(number) and ``max_gen`` is the youngest. Birth time is in
+milliseconds. Anon and file sizes are in pages.
+
+Recipes
+-------
+:Android on ARMv8.1+: ``X=4``, ``N=0``
+
+:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
+ ``ARM64_HW_AFDBM``
+
+:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
+
+:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
+ to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
+ generation ``max_gen`` and create the next generation ``max_gen+1``.
+ ``gen`` must be equal to ``max_gen`` in order to avoid races. A swap
+ file and a non-zero swappiness value are required to scan anon pages.
+ If swapping is not desired, set ``vm.swappiness`` to ``0`` and
+ overwrite it with a non-zero ``swappiness``.
+
+:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
+ generations less than or equal to ``gen``. ``gen`` must be less than
+ ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active generations
+ and therefore protected from the eviction. ``nr_to_reclaim`` can be
+ used to limit the number of pages to be evicted. Multiple command
+ lines are supported, so does concatenation with delimiters ``,`` and
+ ``;``.
+
+Workflow
+========
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in ``max_seq``
+for both anon and file types as they are aged on an equal footing. The
+oldest generation numbers are stored in ``min_seq[2]`` separately for
+anon and file types as clean file pages can be evicted regardless of
+swap and write-back constraints. Generation numbers are truncated into
+``ilog2(CONFIG_NR_LRU_GENS)+1`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists. Evictable pages are added to the per-zone lists indexed by
+``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
+depending on whether they are being faulted in or read ahead. The
+workflow comprises two conceptually independent functions: the aging
+and the eviction.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+scans page tables for referenced pages of this ``lruvec``. Upon
+finding one, the aging updates its generation number to ``max_seq``.
+After each round of scan, the aging increments ``max_seq``. The aging
+maintains either a system-wide ``mm_struct`` list or per-memcg
+``mm_struct`` lists, and it only scans page tables of processes that
+have been scheduled since the last scan. Since scans are differential
+with respect to referenced pages, the cost is roughly proportional to
+their number.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans the pages on the per-zone lists indexed by either of
+``min_seq[2]``. It selects a type according to the values of
+``min_seq[2]`` and swappiness. During a scan, the eviction either
+sorts or isolates a page, depending on whether the aging has updated
+its generation number. When it finds all the per-zone lists are empty,
+the eviction increments ``min_seq[2]`` indexed by this selected type.
+The eviction triggers the aging when both of ``min_seq[2]`` reaches
+``max_seq-1``, assuming both anon and file types are reclaimable.
+
+Rationale
+=========
+Characteristics of cloud workloads
+----------------------------------
+With cloud storage gone mainstream, the role of local storage has
+diminished. For most of the systems running cloud workloads, anon
+pages account for the majority of memory consumption and page cache
+contains mostly executable pages. Notably, the portion of the unmapped
+is negligible.
+
+As a result, swapping is necessary to achieve substantial memory
+overcommit. And the ``rmap`` is the hottest in the reclaim path
+because its usage is proportional to the number of scanned pages,
+which on average is many times the number of reclaimed pages.
+
+With ``zram``, a typical ``kswapd`` profile on v5.11 looks like:
+
+::
+
+ 31.03% page_vma_mapped_walk
+ 25.59% lzo1x_1_do_compress
+ 4.63% do_raw_spin_lock
+ 3.89% vma_interval_tree_iter_next
+ 3.33% vma_interval_tree_subtree_search
+
+And with real swap, it looks like:
+
+::
+
+ 45.16% page_vma_mapped_walk
+ 7.61% do_raw_spin_lock
+ 5.69% vma_interval_tree_iter_next
+ 4.91% vma_interval_tree_subtree_search
+ 3.71% page_referenced_one
+
+Limitations of the Current Implementation
+-----------------------------------------
+Notion of the Active/Inactive
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+For servers equipped with hundreds of gigabytes of memory, the
+granularity of the active/inactive is too coarse to be useful for job
+scheduling. And false active/inactive rates are relatively high.
+
+For phones and laptops, the eviction is biased toward file pages
+because the selection has to resort to heuristics as direct
+comparisons between anon and file types are infeasible.
+
+For systems with multiple nodes and/or memcgs, it is impossible to
+compare ``lruvec``\s based on the notion of the active/inactive.
+
+Incremental Scans via the ``rmap``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each incremental scan picks up at where the last scan left off and
+stops after it has found a handful of unreferenced pages. For most of
+the systems running cloud workloads, incremental scans lose the
+advantage under sustained memory pressure due to high ratios of the
+number of scanned pages to the number of reclaimed pages. On top of
+that, the ``rmap`` has poor memory locality due to its complex data
+structures. The combined effects typically result in a high amount of
+CPU usage in the reclaim path.
+
+Benefits of the Multigenerational LRU
+-------------------------------------
+Notion of Generation Numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notion of generation numbers introduces a quantitative approach to
+memory overcommit. A larger number of pages can be spread out across
+configurable generations, and thus they have relatively low false
+active/inactive rates. Each generation includes all pages that have
+been referenced since the last generation.
+
+Given an ``lruvec``, scans and the selections between anon and file
+types are all based on generation numbers, which are simple and yet
+effective. For different ``lruvec``\s, comparisons are still possible
+based on birth times of generations.
+
+Differential Scans via Page Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each differential scan discovers all pages that have been referenced
+since the last scan. Specifically, it walks the ``mm_struct`` list
+associated with an ``lruvec`` to scan page tables of processes that
+have been scheduled since the last scan. The cost of each differential
+scan is roughly proportional to the number of referenced pages it
+discovers. Unless address spaces are extremely sparse, page tables
+usually have better memory locality than the ``rmap``. The end result
+is generally a significant reduction in CPU usage, for most of the
+systems running cloud workloads.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
+
+NUMA Optimization
+-----------------
+Add per-node RSS for ``should_skip_mm()``.
+
+Refault Tracking Optimization
+-----------------------------
+Use generation numbers rather than LRU positions in
+``workingset_eviction()`` and ``workingset_refault()``.
--
2.31.0.rc2.261.g7f71774620-goog
next prev parent reply other threads:[~2021-03-13 7:58 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-13 7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
2021-03-13 7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
2021-03-13 15:09 ` Matthew Wilcox
2021-03-14 7:45 ` Yu Zhao
2021-03-13 7:57 ` [PATCH v1 02/14] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
2021-03-13 7:57 ` [PATCH v1 03/14] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
2021-03-13 7:57 ` [PATCH v1 04/14] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
2021-03-13 7:57 ` [PATCH v1 05/14] mm/swap.c: export activate_page() Yu Zhao
2021-03-13 7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
2021-03-14 22:12 ` Zi Yan
2021-03-14 22:51 ` Matthew Wilcox
2021-03-15 0:03 ` Yu Zhao
2021-03-15 0:27 ` Zi Yan
2021-03-15 1:04 ` Yu Zhao
2021-03-14 23:22 ` Dave Hansen
2021-03-15 3:16 ` Yu Zhao
2021-03-13 7:57 ` [PATCH v1 07/14] mm/pagewalk.c: add pud_entry_post() for post-order traversals Yu Zhao
2021-03-13 7:57 ` [PATCH v1 08/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-03-13 7:57 ` [PATCH v1 09/14] mm: multigenerational lru: mm_struct list Yu Zhao
2021-03-15 19:40 ` Rik van Riel
2021-03-16 2:07 ` Huang, Ying
2021-03-16 3:57 ` Yu Zhao
2021-03-16 6:44 ` Huang, Ying
2021-03-16 7:56 ` Yu Zhao
2021-03-17 3:37 ` Huang, Ying
2021-03-17 10:46 ` Yu Zhao
2021-03-22 3:13 ` Huang, Ying
2021-03-22 8:08 ` Yu Zhao
2021-03-24 6:58 ` Huang, Ying
2021-04-10 18:48 ` Yu Zhao
2021-04-13 3:06 ` Huang, Ying
2021-03-13 7:57 ` [PATCH v1 10/14] mm: multigenerational lru: core Yu Zhao
2021-03-15 2:02 ` Andi Kleen
2021-03-15 3:37 ` Yu Zhao
2021-03-13 7:57 ` [PATCH v1 11/14] mm: multigenerational lru: page activation Yu Zhao
2021-03-16 16:34 ` Matthew Wilcox
2021-03-16 21:29 ` Yu Zhao
2021-03-13 7:57 ` [PATCH v1 12/14] mm: multigenerational lru: user space interface Yu Zhao
2021-03-13 12:23 ` kernel test robot
2021-03-13 7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
2021-03-13 12:53 ` kernel test robot
2021-03-13 13:36 ` kernel test robot
2021-03-13 7:57 ` Yu Zhao [this message]
2021-03-19 9:31 ` [PATCH v1 14/14] mm: multigenerational lru: documentation Alex Shi
2021-03-22 6:09 ` Yu Zhao
2021-03-14 22:48 ` [PATCH v1 00/14] Multigenerational LRU Zi Yan
2021-03-15 0:52 ` Yu Zhao
2021-03-15 1:13 ` Hillf Danton
2021-03-15 6:49 ` Yu Zhao
2021-03-15 18:00 ` Dave Hansen
2021-03-16 2:24 ` Yu Zhao
2021-03-16 14:50 ` Dave Hansen
2021-03-16 20:30 ` Yu Zhao
2021-03-16 21:14 ` Dave Hansen
2021-04-10 9:21 ` Yu Zhao
2021-04-13 3:02 ` Huang, Ying
2021-04-13 23:00 ` Yu Zhao
2021-03-15 18:38 ` Yang Shi
2021-03-16 3:38 ` Yu Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210313075747.3781593-15-yuzhao@google.com \
--to=yuzhao@google.com \
--cc=akpm@linux-foundation.org \
--cc=alex.shi@linux.alibaba.com \
--cc=dave.hansen@linux.intel.com \
--cc=guro@fb.com \
--cc=hannes@cmpxchg.org \
--cc=hdanton@sina.com \
--cc=iamjoonsoo.kim@lge.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mhocko@suse.com \
--cc=page-reclaim@google.com \
--cc=richard.weiyang@linux.alibaba.com \
--cc=shy828301@gmail.com \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=ying.huang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox