* [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
@ 2025-11-09 12:49 Youngjun Park
  2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Youngjun Park @ 2025-11-09 12:49 UTC
  To: akpm, linux-mm
  Cc: cgroups, linux-kernel, chrisl, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, youngjun.park, gunho.lee, taejoon.song

Hi all,

In constrained environments, there is a need to improve workload
performance by controlling swap device usage on a per-process or
per-cgroup basis. For example, one might want to direct critical
processes to faster swap devices (like SSDs) while relegating
less critical ones to slower devices (like HDDs or Network Swap).

The initial approach was to introduce a per-cgroup swap priority
mechanism [1]. However, through review and discussion, several
drawbacks were identified:

a. There is a lack of concrete use cases for assigning a fine-grained,
   unique swap priority to each cgroup. 
b. The implementation complexity was high relative to the desired
   level of control.
c. Differing swap priorities between cgroups could lead to LRU
   inversion problems.

To address these concerns, I propose the "swap tiers" concept, 
originally suggested by Chris Li [2] and further developed through 
collaborative discussions. I would like to thank Chris Li and 
He Baoquan for their invaluable contributions in refining this 
approach, and Kairui Song, Nhat Pham, and Michal Koutný for their 
insightful reviews of earlier RFC versions.

Concept
-------
A swap tier is a grouping mechanism that assigns a "named id" to a
range of swap priorities. For example, all swap devices with a
priority of 100 or higher could be grouped into a tier named "SSD",
and all others into a tier named "HDD".

Cgroups can then select which named tiers they are permitted to use for
swapping via a new cgroup interface. This effectively restricts a
cgroup's swap activity to a specific subset of the available swap
devices.
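
As a rough illustration of the grouping (a userspace Python sketch, not
code from this series): the tier names and the 100 threshold come from
the example above, while the -1 lower bound for "HDD", the helper names
tier_of() and allowed_devices(), and the device names are assumptions
made up for the sketch.

```python
# Illustrative userspace model only; not the kernel implementation.
# (name, inclusive lower bound), ordered from highest tier to lowest.
TIERS = [("SSD", 100), ("HDD", -1)]


def tier_of(prio):
    """Return the name of the tier whose priority range contains prio."""
    for name, lower in TIERS:
        if prio >= lower:
            return name
    return None  # prio lies below every defined tier


def allowed_devices(devices, allowed_tiers):
    """Keep only the (device, prio) pairs whose tier the cgroup may use."""
    return [dev for dev, prio in devices if tier_of(prio) in allowed_tiers]
```

With devices = [("nvme-swap", 200), ("disk-swap", 10)], a cgroup limited
to {"SSD"} would only ever see "nvme-swap" as a swap-out target.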

Proposed Interface
------------------
1. Global Tier Definition: /sys/kernel/mm/swap/tiers

This file is used to define the global swap tiers and their associated
minimum priority levels.

- To add tiers:
  Format: + 'tier_name':'prio'[[,|' ']'tier_name2':'prio']...
  Example:
  # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers

  There are several rules for defining tiers:
  - Priority ranges for tiers must not overlap.
  - The combination of all defined tiers must cover the entire valid
    priority range (DEF_SWAP_PRIO to SHRT_MAX) to ensure every swap device
    can be assigned to a tier.
  - A tier's prio value is its inclusive lower bound,
    covering priorities up to the next tier's prio.
    The highest tier extends to SHRT_MAX, and the lowest tier extends
    down to DEF_SWAP_PRIO.
  - If the specified tiers do not cover the entire priority range,
    the lower bound of the tier with the lowest specified prio is
    lowered to DEF_SWAP_PRIO (HDD:2 is shown with PrioStart -1 in the
    example below).
  - The total number of tiers is limited. 

- To remove tiers:
  Format: - 'tier_name'[[,|' ']'tier_name2']...
  Example:
  # echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers

  Note: A tier cannot be removed if it is currently in use by any
  cgroup or if any active swap device is assigned to it. This acts as
  a reference count to prevent disruption.

- To show current tiers:
  Reading the file displays the currently configured tiers, their
  internal index, and the priority range they cover.
  Example:
  # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
  # cat /sys/kernel/mm/swap/tiers
  Name      Idx   PrioStart   PrioEnd
            0
  SSD       1     100         32767
  HDD       2     -1          99

  - `Name`: The name of the tier. The unnamed entry is a default tier.
  - `Idx`: The internal index assigned to the tier.
  - `PrioStart`: The starting priority of the range covered by this tier.
  - `PrioEnd`: The ending priority of the range covered by this tier.
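
The add, remove and show rules above can be modelled roughly as follows.
This is a userspace Python sketch under stated assumptions, not the code
in mm/swap_tier.c: the TierTable class and its methods are hypothetical,
DEF_SWAP_PRIO is taken to be -1 (matching the PrioStart shown for the
lowest tier), redefining an existing tier is simply rejected, a write is
assumed to be all-or-nothing, and the limit on the number of tiers is
not modelled.

```python
SHRT_MAX = 32767
DEF_SWAP_PRIO = -1  # assumed; matches the PrioStart shown for the lowest tier


class TierTable:
    """Toy model of the tiers file; names and behaviour are illustrative."""

    def __init__(self):
        self.lower = {}  # tier name -> requested inclusive lower bound
        self.users = {}  # tier name -> cgroups/devices currently using it

    def add(self, spec):
        """Handle the part after '+', e.g. 'SSD:100,HDD:2'."""
        new = dict(self.lower)  # work on a copy so a bad write changes nothing
        for item in spec.replace(",", " ").split():
            name, sep, prio = item.partition(":")
            if not name or not sep:
                raise ValueError(f"malformed tier spec: {item!r}")
            prio = int(prio)
            if not DEF_SWAP_PRIO <= prio <= SHRT_MAX:
                raise ValueError(f"priority out of range: {prio}")
            if name in new or prio in new.values():
                raise ValueError(f"duplicate tier or lower bound: {item!r}")
            new[name] = prio
        self.lower = new
        for name in new:
            self.users.setdefault(name, 0)

    def remove(self, spec):
        """Handle the part after '-'; refuse if any named tier is in use."""
        names = spec.replace(",", " ").split()
        for name in names:
            if name not in self.lower:
                raise KeyError(name)
            if self.users[name]:
                raise RuntimeError(f"tier {name!r} is in use")
        for name in set(names):
            del self.lower[name]
            del self.users[name]

    def ranges(self):
        """Return (name, start, end) rows, highest tier first."""
        ordered = sorted(self.lower.items(), key=lambda kv: kv[1], reverse=True)
        rows, upper = [], SHRT_MAX
        for i, (name, prio) in enumerate(ordered):
            # The lowest tier is stretched down to cover the rest of the range.
            start = DEF_SWAP_PRIO if i == len(ordered) - 1 else prio
            rows.append((name, start, upper))
            upper = prio - 1
        return rows
```

After add("SSD:100,HDD:2"), ranges() yields the same SSD and HDD rows as
the cat output above; remove("SSD,HDD") only succeeds once neither tier
has any users.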

Two special tiers are predefined:
- "": Represents the default inheritance behavior in cgroups.
- "zswap": Reserved for zswap integration.

2. Cgroup Tier Selection: memory.swap.tiers

This file controls which swap tiers are enabled for a given cgroup.

- Reading the file:
  The first line shows the operation that was written to the file.
  The second line shows the final, effective set of tiers after
  merging with the parent cgroup's configuration.

- Writing to the file:
  Format: [+|-] [+|-][TIER_NAME]...
  - `+TIER_NAME`: Explicitly enables this tier for the cgroup.
  - `-TIER_NAME`: Explicitly disables this tier for the cgroup.
  - If a tier is not specified, its setting is inherited from the
    parent cgroup.
  - A standalone `+` at the beginning resets the configuration: it
    ignores the parent's settings, enables all globally defined tiers,
    and then applies the subsequent operations in the command.
  - A standalone `-` at the beginning also resets: it ignores the
    parent's settings, disables all tiers, and then applies subsequent
    operations.
  - The root cgroup defaults to an implicit `+`, enabling all swap
    devices.

  Example:
  # echo "+ -SSD -HDD" > /sys/fs/cgroup/my_cgroup/memory.swap.tiers
  This command first resets the cgroup's configuration to enable all
  tiers (due to the leading `+`), and then explicitly disables the
  "SSD" and "HDD" tiers.
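
The write semantics above can be sketched as a small userspace Python
model. It is only an illustration, not the parser from this series: the
function name effective_tiers() and the ALL_TIERS set are hypothetical,
and rejecting unknown tier names with an error is an assumption.

```python
ALL_TIERS = frozenset({"SSD", "HDD"})  # globally defined tiers (from the example)


def effective_tiers(written, parent_effective):
    """Apply one memory.swap.tiers write on top of the parent's effective set."""
    tokens = written.split()
    if tokens and tokens[0] == "+":
        result, tokens = set(ALL_TIERS), tokens[1:]  # reset: enable everything
    elif tokens and tokens[0] == "-":
        result, tokens = set(), tokens[1:]           # reset: disable everything
    else:
        result = set(parent_effective)               # unspecified tiers inherit
    for tok in tokens:
        sign, name = tok[0], tok[1:]
        if sign not in "+-" or name not in ALL_TIERS:
            raise ValueError(f"bad token: {tok!r}")
        if sign == "+":
            result.add(name)
        else:
            result.discard(name)
    return result
```

For the example above, effective_tiers("+ -SSD -HDD", ALL_TIERS) is the
empty set. Note that effective_tiers("+SSD", set()) returns {"SSD"}: in
this model a child can enable a tier its parent has disabled, which is
the behavior raised in open question 2 below.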

Further Discussion and Open Questions
-------------------------------------
I seek feedback on this concept and have identified several key
points that require further discussion (though this is not an 
exhaustive list). This topic will also be presented at the upcoming 
Linux Plumbers Conference 2025 [3], and I would appreciate any 
feedback here on the list beforehand, or in person at the conference.

1.  The swap fast path utilizes a percpu cluster cache for efficiency.
    In swap tiers, this has been changed to a per-device per-cpu 
    cluster cache. (See the first patch in this series.)
    An alternative approach would be to cache only the swap_info_struct 
    (si) per-tier per-cpu, avoiding cluster caching entirely while still 
    maintaining fast device acquisition without `swap_avail_lock`.
    Should we pursue this alternative, or is the current per-device 
    per-cpu cluster caching approach preferable?

2.  Consistency with cgroup parent-child semantics: Unlike general
    resource distribution, tier selection may bypass parent
    constraints (e.g., a child can enable a tier disabled by its
    parent). Is this behavior acceptable?

3.  Per-cgroup swap tier limit: Is a `swap.tier.max` needed in
    addition to the existing `swap.max`?

4.  Parent-child tier mismatch: If a zombie memcg (child) uses a tier
    that is not available to its new parent, how should this be
    handled during recharging or reparenting? (This question is raised
    in the context of ongoing work to improve memcg reparenting and
    handle zombie memcgs [4, 5].)

5.  Tier mask calculation: What are the trade-offs between calculating
    the effective tier mask at runtime vs. pre-calculating it when the
    interface is written to?

6.  If a swap tier configuration is applied to a memcg, should pages
    that are already swapped out to devices outside the cgroup's
    allowed tiers be migrated?

7.  Swap tiers could serve as a good abstraction layer. Which extended
    uses of swap tiers are worth discussing?
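
To make the trade-off in question 5 concrete, here is a userspace Python
sketch of the two options. It is not taken from the patches: the Cgroup
class, its fields, and the two-bit ALL_TIERS mask are all assumptions
chosen for illustration.

```python
ALL_TIERS = 0b11  # assumed: two tiers defined globally, one bit each


class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.reset = None  # None: inherit from parent; else base mask after "+"/"-"
        self.enable = 0    # bits explicitly enabled by +TIER
        self.disable = 0   # bits explicitly disabled by -TIER
        self.cached = ALL_TIERS if parent is None else parent.cached
        if parent is not None:
            parent.children.append(self)


def apply_local(cg, inherited):
    """Combine an inherited mask with this cgroup's own settings."""
    base = inherited if cg.reset is None else cg.reset
    return (base | cg.enable) & ~cg.disable


def mask_at_runtime(cg):
    """Option A: walk up the hierarchy on every lookup (cheap writes)."""
    inherited = ALL_TIERS if cg.parent is None else mask_at_runtime(cg.parent)
    return apply_local(cg, inherited)


def refresh_cached(cg):
    """Option B: call after each write; updates cg and all its descendants."""
    inherited = ALL_TIERS if cg.parent is None else cg.parent.cached
    cg.cached = apply_local(cg, inherited)
    for child in cg.children:
        refresh_cached(child)
```

In this model, option A costs a walk proportional to the cgroup depth on
every swap-out decision, while option B makes the lookup a single field
read but has to visit the whole subtree on each write.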

Any feedback on the overall concept, interface, and these specific
points would be greatly appreciated.

Best Regards,
Youngjun Park

References
----------
[1] https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
[2] https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
[3] https://lpc.events/event/19/abstracts/2296/
[4] https://lore.kernel.org/linux-mm/20230720070825.992023-1-yosryahmed@google.com/
[5] https://blogs.oracle.com/linux/post/zombie-memcg-issues

Youngjun Park (3):
  mm, swap: change back to use each swap device's percpu cluster
  mm: swap: introduce swap tier infrastructure
  mm/swap: integrate swap tier infrastructure into swap subsystem

 Documentation/admin-guide/cgroup-v2.rst |  32 ++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   4 +
 include/linux/swap.h                    |  16 +-
 mm/Kconfig                              |  13 +
 mm/Makefile                             |   1 +
 mm/memcontrol.c                         |  69 +++
 mm/page_io.c                            |  21 +-
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  93 ++++
 mm/swap_tier.c                          | 602 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  75 +++
 mm/swapfile.c                           | 169 +++----
 13 files changed, 987 insertions(+), 114 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
-- 
2.34.1




Thread overview: 25+ messages
2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
2025-11-13  6:07   ` Kairui Song
2025-11-13 11:45     ` YoungJun Park
2025-11-14  1:05       ` Baoquan He
2025-11-14 15:52         ` Kairui Song
2025-11-15  9:28           ` YoungJun Park
2025-11-09 12:49 ` [PATCH 2/3] mm: swap: introduce swap tier infrastructure Youngjun Park
2025-11-12 14:20   ` Chris Li
2025-11-13  2:01     ` YoungJun Park
2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
2025-11-10 11:40   ` kernel test robot
2025-11-10 12:12   ` kernel test robot
2025-11-10 13:26   ` kernel test robot
2025-11-12 14:44   ` Chris Li
2025-11-13  4:07     ` YoungJun Park
2025-11-12 13:34 ` [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
2025-11-13  1:33   ` YoungJun Park
2025-11-15  1:22 ` SeongJae Park
2025-11-15  9:44   ` YoungJun Park
2025-11-15 16:56     ` SeongJae Park
2025-11-15 15:13   ` Chris Li
2025-11-15 17:24     ` SeongJae Park
2025-11-17 22:17       ` Chris Li
2025-11-18  1:11         ` SeongJae Park
