From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>,
Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>,
Michal Hocko <mhocko@suse.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
Barry Song <21cnbao@gmail.com>, Chris Li <chrisl@kernel.org>,
Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v5 0/6] Swap-out mTHP without splitting
Date: Wed, 27 Mar 2024 14:45:31 +0000
Message-ID: <20240327144537.4165578-1-ryan.roberts@arm.com>
Hi All,

This series adds support for swapping out multi-size THP (mTHP) without needing
to first split the large folio via split_huge_page_to_list_to_order(). It
closely follows the approach already used to swap out PMD-sized THP.

There are a few reasons for swapping out mTHP without splitting:

  - Performance: It is expensive to split a large folio. Under extreme memory
    pressure, some workloads regressed when using 64K mTHP vs 4K small folios
    because of this extra cost in the swap-out path. This series not only
    eliminates the regression but makes it faster to swap out 64K mTHP than
    4K small folios.

  - Memory fragmentation avoidance: If we can avoid splitting a large folio,
    memory is less likely to become fragmented, making it easier to re-allocate
    a large folio in future.

  - Performance: Enables a separate series [5] to swap in whole mTHPs, which
    means we won't lose the TLB-efficiency benefits of mTHP once the memory has
    been through a swap cycle.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is sufficient.
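
For a rough sense of the sizes involved: an order-n folio spans 2^n base pages,
so swapping it out without splitting needs 2^n swap entries rather than one.
The sketch below is illustrative only and assumes 4K base pages; folio_order()
and nr_swap_entries() are hypothetical helper names, not kernel functions.

```python
def folio_order(folio_bytes: int, base_page_bytes: int = 4096) -> int:
    """Return n such that the folio spans 2**n base pages."""
    pages, remainder = divmod(folio_bytes, base_page_bytes)
    # Folio sizes are power-of-two multiples of the base page size.
    if remainder or pages < 1 or pages & (pages - 1):
        raise ValueError("folio size must be a power-of-two number of base pages")
    return pages.bit_length() - 1


def nr_swap_entries(order: int) -> int:
    """One swap entry per base page, so an order-n folio needs 2**n of them."""
    return 1 << order


# Assumed sizes: the three allocation sizes used in the testing section.
for label, size in (("4K", 4 << 10), ("64K", 64 << 10), ("2M", 2 << 20)):
    order = folio_order(size)
    print(f"{label}: order {order}, {nr_swap_entries(order)} swap entries")
```

Under the 4K base page assumption, a 64K mTHP is order 4 (16 entries) and a 2M
THP is order 9 (512 entries).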
Performance Testing
===================
I've run some swap performance tests on an Ampere Altra VM (arm64) with 8 CPUs.
The VM is set up with a 35G block RAM device as the swap device, and the test is
run from inside a memcg limited to 40G of memory. I then ran `usemem` from
vm-scalability with 70 processes, each allocating and writing 1G of memory. I
repeated everything 6 times and took the mean performance improvement relative
to the 4K page baseline:
| alloc size | baseline | + this series |
| | mm-unstable (~v6.9-rc1) | |
|:-----------|------------------------:|------------------------:|
| 4K Page | 0.0% | 1.3% |
| 64K THP | -13.6% | 46.3% |
| 2M THP | 91.4% | 89.6% |
So with this change, the 64K swap performance goes from a 14% regression to a
46% improvement. While 2M shows a small regression, I'm confident that this is
just noise.
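
One plausible way to derive percentages like those in the table is to compare
the mean of the repeated runs against the mean of the 4K baseline runs. The
sketch below assumes a higher-is-better metric (such as throughput) and a
ratio-of-means aggregation; neither is spelled out above, and the sample values
are made up for illustration, not measurements from this series.

```python
from statistics import mean


def improvement_pct(baseline_runs, candidate_runs):
    """Percentage change of the candidate mean relative to the baseline mean."""
    base = mean(baseline_runs)
    return (mean(candidate_runs) - base) / base * 100.0


# Hypothetical throughput samples (arbitrary units), NOT data from this series.
baseline_4k = [200.0, 210.0, 190.0]
faster = [300.0, 310.0, 290.0]
slower = [170.0, 180.0, 190.0]

print(f"{improvement_pct(baseline_4k, faster):+.1f}%")  # prints +50.0%
print(f"{improvement_pct(baseline_4k, slower):+.1f}%")  # prints -10.0%
```

A positive result corresponds to an improvement over the 4K baseline and a
negative result to a regression, matching the sign convention in the table.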
---
The series applies against mm-unstable (4e567abb6482) with the addition of a
small fix for an arm64 build break (reported at [6]).
Changes since v4 [4]
====================
 - patch #3:
   - Added R-B from Huang, Ying - thanks!
 - patch #4:
   - get_swap_pages() now takes order instead of nr_pages (per Huang, Ying)
   - Removed WARN_ON_ONCE() from get_swap_pages()
   - Reworded comment for scan_swap_map_try_ssd_cluster() (per Huang, Ying)
   - Unified VM_WARN_ON()s in scan_swap_map_slots() to scan: (per Huang, Ying)
   - Removed redundant "order == 0" check (per Huang, Ying)
 - patch #5:
   - Marked list_empty() check with data_race() (per David)
   - Added R-B from Barry and David - thanks!
 - patch #6:
   - Implemented mkold_ptes() generic helper (per David)
   - Enhanced folio_pte_batch() to report any_young (per David)
   - madvise_cold_or_pageout_pte_range() sets old in batch (per David)
   - Added R-B from Barry - thanks!
Changes since v3 [3]
====================
 - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
 - Simplified max offset calculation (per Huang, Ying)
 - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
   offset (per Huang, Ying)
 - Removed swap_alloc_large() and merged its functionality into
   scan_swap_map_slots() (per Huang, Ying)
 - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
   by freeing swap entries in batches (see patch 2) (per DavidH)
 - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
 - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
 - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
   since it's not actually a problem for THP as I first thought.
Changes since v2 [2]
====================
 - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
   allocation. This required some refactoring to make everything work nicely
   (new patches 2 and 3).
 - Fix bug where nr_swap_pages would say there are pages available but the
   scanner would not be able to allocate them because they were reserved for the
   per-cpu allocator. We now allow stealing of order-0 entries from the high
   order per-cpu clusters (in addition to existing stealing from order-0
   per-cpu clusters).
Changes since v1 [1]
====================
 - patch 1:
   - Use cluster_set_count() instead of cluster_set_count_flag() in
     swap_alloc_cluster() since we no longer have any flag to set. I was unable
     to kill cluster_set_count_flag() as proposed against v1 as other call
     sites depend on explicitly setting flags to 0.
 - patch 2:
   - Moved large_next[] array into percpu_cluster to make it per-cpu
     (recommended by Huang, Ying).
   - large_next[] array is dynamically allocated because PMD_ORDER is not
     compile-time constant for powerpc (fixes build error).
[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/20240311150058.1122862-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[6] https://lore.kernel.org/all/b9944ac1-3919-4bb2-8b65-f3e5c52bc2aa@arm.com/
Thanks,
Ryan
Ryan Roberts (6):
mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
mm: swap: Simplify struct percpu_cluster
mm: swap: Allow storage of all mTHP orders
mm: vmscan: Avoid split during shrink_folio_list()
mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
include/linux/pgtable.h | 58 ++++++++
include/linux/swap.h | 35 +++--
mm/huge_memory.c | 3 -
mm/internal.h | 60 +++++++-
mm/madvise.c | 100 +++++++------
mm/memory.c | 17 +--
mm/swap_slots.c | 6 +-
mm/swapfile.c | 306 ++++++++++++++++++++++------------------
mm/vmscan.c | 9 +-
9 files changed, 380 insertions(+), 214 deletions(-)
--
2.25.1