* [PATCH v6 0/6] Swap-out mTHP without splitting
@ 2024-04-03 11:40 Ryan Roberts
From: Ryan Roberts @ 2024-04-03 11:40 UTC
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel

Hi All,

This series adds support for swapping out multi-size THP (mTHP) without needing
to first split the large folio via split_huge_page_to_list_to_order(). It
closely follows the approach already used to swap-out PMD-sized THP.

There are a few reasons for swapping out mTHP without splitting:

  - Performance: It is expensive to split a large folio, and under extreme memory
    pressure some workloads regressed when using 64K mTHP vs 4K small folios
    because of this extra cost in the swap-out path. This series not only
    eliminates the regression but makes it faster to swap out 64K mTHP than 4K
    small folios.

  - Memory fragmentation avoidance: If we can avoid splitting a large folio,
    memory is less likely to become fragmented, making it easier to re-allocate
    a large folio in future.

  - Performance: Enables a separate series [6] to swap-in whole mTHPs, which
    means we won't lose the TLB-efficiency benefits of mTHP once the memory has
    been through a swap cycle.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is sufficient.
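
For readers less familiar with the terminology, the folio sizes discussed here
map onto allocation orders as sketched below. This is only an illustration and
assumes 4K base pages (consistent with the 4K small folios mentioned above);
the EXAMPLE_/example_ names are made up for this sketch and are not kernel
identifiers.

```c
#include <stddef.h>

/* Assumption: 4K base pages. Other page-size configurations differ. */
#define EXAMPLE_PAGE_SHIFT 12

/* Order of a power-of-two folio size given in bytes, e.g. 64K -> 4. */
static unsigned int example_size_to_order(size_t bytes)
{
	unsigned int order = 0;

	while (((size_t)1 << (EXAMPLE_PAGE_SHIFT + order)) < bytes)
		order++;
	return order;
}

/* Number of base pages covered by a folio of the given order. */
static unsigned long example_order_to_nr_pages(unsigned int order)
{
	return 1UL << order;
}
```

Under that assumption a 64K mTHP is order 4 (16 pages) and a 2M THP is order 9
(512 pages).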


Performance Testing
===================

I've run some swap performance tests on an Ampere Altra VM (arm64) with 8 CPUs.
The VM is set up with a 35G block ram device as the swap device and the test is
run from inside a memcg limited to 40G memory. I've then run `usemem` from
vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
repeated everything 6 times and taken the mean performance improvement relative
to the 4K page baseline:

| alloc size |                baseline |           + this series |
|            | mm-unstable (~v6.9-rc1) |                         |
|:-----------|------------------------:|------------------------:|
| 4K Page    |                    0.0% |                    1.3% |
| 64K THP    |                  -13.6% |                   46.3% |
| 2M THP     |                   91.4% |                   89.6% |

So with this change, the 64K swap performance goes from a 14% regression to a
46% improvement. While 2M shows a small regression, I'm confident that this is
just noise.
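
As a rough sanity check on the sizing of the test, the amount of data that has
to live in swap can be estimated as below. This is a back-of-the-envelope
sketch only: it assumes the memcg limit is the sole constraint and ignores
page cache, kernel memory and any other overhead. The function name is
hypothetical.

```c
/*
 * All quantities are in the same unit as the test description ("G").
 * Returns how much of the workload cannot fit under the memcg limit.
 */
static unsigned int example_swap_demand_g(unsigned int nr_procs,
					  unsigned int g_per_proc,
					  unsigned int memcg_limit_g)
{
	unsigned int total = nr_procs * g_per_proc;

	return total > memcg_limit_g ? total - memcg_limit_g : 0;
}
```

With 70 processes of 1G each and a 40G limit, roughly 30G has to be swapped
out, which fits in the 35G swap device.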

---
The series applies against mm-unstable (as of 2024-04-03) after dropping v5 of
this series from it. The performance numbers are from v5. Since the delta is
very small I don't anticipate any performance changes. I'm optimistically hoping
this is the final version.


Changes since v5 [5]
====================

  - patch #2
    - Don't bother trying to reclaim swap if none of the entries' refs have gone
      to 0 in free_swap_and_cache_nr() (per Huang, Ying)
  - patch #5
    - Only update THP_SWPOUT_FALLBACK counters for pmd-mappable folios (per
      Barry Song)
  - patch #6
    - Fix bug in madvise_cold_or_pageout_pte_range(): don't continue without ptl
      (reported by Barry [7], syzbot [8])
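
The patch #2 change above can be pictured with a toy userspace model. This is
not the kernel's free_swap_and_cache_nr(); it reduces each swap entry to a
bare reference count, and every toy_ name is hypothetical. It only shows the
shape of the optimisation: drop the references in a batch, and attempt the
expensive reclaim step only if at least one count reached 0.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a swap area is just an array of per-entry ref counts. */
struct toy_swap {
	unsigned char *refs;            /* per-entry reference counts */
	size_t nr_entries;
	unsigned int reclaim_attempts;  /* how often the slow path ran */
};

/* Stand-in for the (expensive) attempt to reclaim swap space. */
static void toy_try_reclaim(struct toy_swap *swap, size_t start, size_t nr)
{
	(void)start;
	(void)nr;
	swap->reclaim_attempts++;
}

/*
 * Drop one reference from each of nr consecutive entries starting at
 * start. Returns true if any entry's count reached 0; only in that
 * case is the reclaim attempt made.
 */
static bool toy_free_swap_nr(struct toy_swap *swap, size_t start, size_t nr)
{
	bool any_zero = false;

	for (size_t i = start; i < start + nr && i < swap->nr_entries; i++) {
		if (swap->refs[i] > 0 && --swap->refs[i] == 0)
			any_zero = true;
	}

	if (any_zero)
		toy_try_reclaim(swap, start, nr);
	return any_zero;
}
```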


Changes since v4 [4]
====================

  - patch #3:
    - Added R-B from Huang, Ying - thanks!
  - patch #4:
    - get_swap_pages() now takes order instead of nr_pages (per Huang, Ying)
    - Removed WARN_ON_ONCE() from get_swap_pages()
    - Reworded comment for scan_swap_map_try_ssd_cluster() (per Huang, Ying)
    - Unified VM_WARN_ON()s in scan_swap_map_slots() to scan: (per Huang, Ying)
    - Removed redundant "order == 0" check (per Huang, Ying)
  - patch #5:
    - Marked list_empty() check with data_race() (per David)
    - Added R-B from Barry and David - thanks!
  - patch #6:
    - Implemented mkold_ptes() generic helper (per David)
    - Enhanced folio_pte_batch() to report any_young (per David)
    - madvise_cold_or_pageout_pte_range() sets old in batch (per David)
    - Added R-B from Barry - thanks!
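
The patch #6 items above (a batch helper that reports any_young, and clearing
the young state for a whole batch) can be illustrated with a simplified
userspace model. It is a sketch under loud assumptions: struct toy_pte and the
toy_ functions are invented here and do not reflect the real pte_t layout or
the actual signatures of folio_pte_batch() and mkold_ptes().

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page table entry: just a frame number and two flags. */
struct toy_pte {
	unsigned long pfn;
	bool present;
	bool young;
};

/*
 * Count how many entries, starting at ptes[0], are present and map
 * consecutive pfns, looking at no more than max_nr entries. If
 * any_young is non-NULL, report whether any entry in the batch was
 * young.
 */
static size_t toy_pte_batch(const struct toy_pte *ptes, size_t max_nr,
			    bool *any_young)
{
	size_t nr;

	if (any_young)
		*any_young = false;

	for (nr = 0; nr < max_nr; nr++) {
		if (!ptes[nr].present || ptes[nr].pfn != ptes[0].pfn + nr)
			break;
		if (any_young && ptes[nr].young)
			*any_young = true;
	}
	return nr;
}

/* Clear the young flag on nr consecutive entries in one pass. */
static void toy_mkold_ptes(struct toy_pte *ptes, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		ptes[i].young = false;
}
```

In this model a caller first asks for the batch size, then ages the whole
batch at once instead of visiting each entry separately.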


Changes since v3 [3]
====================

 - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
 - Simplified max offset calculation (per Huang, Ying)
 - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
   offset (per Huang, Ying)
 - Removed swap_alloc_large() and merged its functionality into
   scan_swap_map_slots() (per Huang, Ying)
 - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
   by freeing swap entries in batches (see patch 2) (per DavidH)
 - vmscan splits folio if it's partially mapped (per Barry Song, DavidH)
 - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
 - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
   since it's not actually a problem for THP as I first thought.
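
To make the per-order `next` offset idea above concrete, here is a toy
userspace sketch. It is not the kernel's struct percpu_cluster: the toy_
names are hypothetical, the sentinel value chosen for TOY_NEXT_INVALID is an
assumption (this letter does not state the value of SWAP_NEXT_INVALID), and
the array is sized at run time purely for illustration.

```c
#include <stdlib.h>

/* Assumed sentinel meaning "no next offset recorded for this order". */
#define TOY_NEXT_INVALID 0u

/* One "next" allocation offset per folio order. */
struct toy_percpu_cluster {
	unsigned int nr_orders;
	unsigned int next[];
};

/* Allocate a cluster tracker with every order marked invalid. */
static struct toy_percpu_cluster *toy_cluster_alloc(unsigned int nr_orders)
{
	struct toy_percpu_cluster *c;

	c = malloc(sizeof(*c) + nr_orders * sizeof(c->next[0]));
	if (!c)
		return NULL;

	c->nr_orders = nr_orders;
	for (unsigned int i = 0; i < nr_orders; i++)
		c->next[i] = TOY_NEXT_INVALID;
	return c;
}
```

Each order then advances its own `next` independently, so an allocation at one
order does not disturb the position recorded for another.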


Changes since v2 [2]
====================

 - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
   allocation. This required some refactoring to make everything work nicely
   (new patches 2 and 3).
 - Fix bug where nr_swap_pages would say there are pages available but the
   scanner would not be able to allocate them because they were reserved for the
   per-cpu allocator. We now allow stealing of order-0 entries from the high
   order per-cpu clusters (in addition to existing stealing from order-0
   per-cpu clusters).


Changes since v1 [1]
====================

 - patch 1:
    - Use cluster_set_count() instead of cluster_set_count_flag() in
      swap_alloc_cluster() since we no longer have any flag to set. I was unable
      to kill cluster_set_count_flag() as proposed against v1 as other call
      sites depend on explicitly setting flags to 0.
 - patch 2:
    - Moved large_next[] array into percpu_cluster to make it per-cpu
      (recommended by Huang, Ying).
    - large_next[] array is dynamically allocated because PMD_ORDER is not
      compile-time constant for powerpc (fixes build error).


[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/
[6] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[7] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/
[8] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/

Thanks,
Ryan

Ryan Roberts (6):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  mm: swap: Simplify struct percpu_cluster
  mm: swap: Allow storage of all mTHP orders
  mm: vmscan: Avoid split during shrink_folio_list()
  mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

 include/linux/pgtable.h |  58 ++++++++
 include/linux/swap.h    |  35 +++--
 mm/huge_memory.c        |   3 -
 mm/internal.h           |  60 +++++++-
 mm/madvise.c            | 100 +++++++------
 mm/memory.c             |  17 ++-
 mm/swap_slots.c         |   6 +-
 mm/swapfile.c           | 314 ++++++++++++++++++++++------------------
 mm/vmscan.c             |  17 ++-
 9 files changed, 396 insertions(+), 214 deletions(-)

--
2.25.1




end of thread, other threads:[~2024-04-08 15:13 UTC | newest]

Thread overview: 30+ messages
2024-04-03 11:40 [PATCH v6 0/6] Swap-out mTHP without splitting Ryan Roberts
2024-04-03 11:40 ` [PATCH v6 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-04-03 22:12   ` Chris Li
2024-04-04  7:06     ` Ryan Roberts
2024-04-04 13:43       ` Chris Li
2024-04-08 11:56         ` Ryan Roberts
2024-04-05  9:25   ` David Hildenbrand
2024-04-03 11:40 ` [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
     [not found]   ` <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>
2024-04-08  9:22     ` Ryan Roberts
2024-04-08  9:43       ` David Hildenbrand
2024-04-08 10:07         ` Ryan Roberts
     [not found]           ` <79c5513b-b3f2-4fbb-a3c7-a09894d54d22@redhat.com>
2024-04-08 10:39             ` Ryan Roberts
2024-04-08 12:07     ` Ryan Roberts
2024-04-08 12:47       ` Ryan Roberts
2024-04-08 13:27         ` Ryan Roberts
2024-04-08 15:13           ` David Hildenbrand
2024-04-03 11:40 ` [PATCH v6 3/6] mm: swap: Simplify struct percpu_cluster Ryan Roberts
2024-04-03 11:40 ` [PATCH v6 4/6] mm: swap: Allow storage of all mTHP orders Ryan Roberts
2024-04-05 10:38   ` David Hildenbrand
2024-04-07  6:02     ` Huang, Ying
2024-04-08  9:24       ` Ryan Roberts
2024-04-08  9:33         ` David Hildenbrand
2024-04-08  9:35           ` Ryan Roberts
2024-04-07  7:38   ` Barry Song
2024-04-08  9:28     ` Ryan Roberts
2024-04-03 11:40 ` [PATCH v6 5/6] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
2024-04-05 10:42   ` David Hildenbrand
2024-04-08  9:31     ` Ryan Roberts
2024-04-03 11:40 ` [PATCH v6 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
2024-04-03 17:17   ` Ryan Roberts
