linux-mm.kvack.org archive mirror
* RE: [LSF/MM/BPF TOPIC] MGLRU on Android: Real-World Problems and Challenges
@ 2026-02-24  3:17 wangzicheng
  2026-02-24 17:10 ` Suren Baghdasaryan
  2026-02-24 20:23 ` Barry Song
  0 siblings, 2 replies; 4+ messages in thread
From: wangzicheng @ 2026-02-24  3:17 UTC (permalink / raw)
  To: lsf-pc, linux-mm
  Cc: wangxin 00023513, gao xu, wangtao, liulu 00013167, zhouxiaolong,
	linkunli, kasong, 21cnbao, akpm, axelrasmussen, yuanchu, weixugc,
	Randy Dunlap, Liam.Howlett, willy

Hi,

I previously sent a similar email which unfortunately had encoding issues.
I'm resending a cleaned-up version here so it's easier to read and discuss.

MGLRU has been available on Android for about four years, but many
OEM vendors still choose not to enable it in production.
HONOR is a major Android OEM shipping tens of millions of devices
per year, and we run MGLRU on all our devices across multiple kernel
versions (5.15 to 6.12) and RAM configurations (4G to 24G), backed by
large-scale beta and field data. From this deployment, we have identified
four concrete issues (Q1-Q4), along with our current workarounds, and
would like to work with the community to design upstream solutions.
We would also like to discuss MGLRU's future direction on Android.

Below is a short summary of what we see.

Q1: anon/file imbalance and drop in available memory
Android app workloads show a persistent anon/file generational
imbalance under MGLRU:
- anon pages tend to stay in the youngest 2 generations;
- file pages are spread across multiple generations and are
  over-reclaimed.
Tuning swappiness to 200 and ANON_ONLY does not fully fix this.
On a 16G media workload we see:
  MGLRU:  MemAvailable ~ 6060 MB
  legacy: MemAvailable ~ 6982 MB (about 920 MB, i.e. ~0.9G, higher)
Today we mitigate this via explicit memcg aging in Android
userspace [1], which is a vendor-only workaround.
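
One way to quantify the imbalance described in Q1 is to read the per-memcg
generation table that MGLRU exposes through debugfs (/sys/kernel/debug/lru_gen)
and compute how much of the anon and file memory sits in the youngest two
generations. The Python sketch below is illustrative only: it assumes the
plain lru_gen layout (a "memcg" line, a "node" line, then one line per
generation with gen_nr, age_in_ms, nr_anon_pages, nr_file_pages). The memcg
path and all numbers in SAMPLE are made up for the example and are not
measurements from any device.

```python
from collections import namedtuple

Gen = namedtuple("Gen", "gen_nr age_ms nr_anon nr_file")

# Hypothetical sample in the lru_gen debugfs layout. The memcg path and
# the page counts are invented for illustration, not measured data.
SAMPLE = """\
memcg    12 /apps/uid_10123
 node     0
        101      90000          0      42000
        102      60000          0      31000
        103      20000      58000      12000
        104       1000      23000       1500
"""


def parse_lru_gen(text):
    """Return {(memcg_path, node_id): [Gen, ...]} from lru_gen text."""
    result = {}
    memcg_path = None
    key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            # "memcg <id> <path>"; the path may be absent.
            memcg_path = fields[2] if len(fields) > 2 else ""
        elif fields[0] == "node":
            key = (memcg_path, int(fields[1]))
            result[key] = []
        elif key is not None and len(fields) >= 4 and fields[0].isdigit():
            gen_nr, age_ms, nr_anon, nr_file = (int(f) for f in fields[:4])
            result[key].append(Gen(gen_nr, age_ms, nr_anon, nr_file))
    return result


def youngest_share(gens, n=2):
    """Fraction of anon and of file pages held in the youngest n generations."""
    ordered = sorted(gens, key=lambda g: g.gen_nr)
    young = ordered[-n:]  # highest gen_nr == youngest
    total_anon = sum(g.nr_anon for g in ordered)
    total_file = sum(g.nr_file for g in ordered)
    anon = sum(g.nr_anon for g in young) / total_anon if total_anon else 0.0
    file_ = sum(g.nr_file for g in young) / total_file if total_file else 0.0
    return anon, file_


if __name__ == "__main__":
    for (path, node), gens in parse_lru_gen(SAMPLE).items():
        anon, file_ = youngest_share(gens)
        print(f"{path} node {node}: anon {anon:.0%}, file {file_:.0%} "
              "in the youngest 2 generations")
```

On a real device the same functions could be fed the contents of
/sys/kernel/debug/lru_gen (root and a mounted debugfs are required) instead
of SAMPLE.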

Q2: Hard to control reclaim amount and stopping conditions (memcg)
For memcg reclaim it is hard to stop near a target reclaim amount:
- kswapd can continue reclaiming even after watermarks are met
  (e.g. to satisfy higher-order or memcg allocations);
- reclaim via try_to_free_mem_cgroup_pages() lacks clear abort
  semantics and can overshoot the intended reclaim amount.
We currently use OEM hooks [2] to exit reclaim early or bypass it under
some conditions.

Q3: High reclaim cost and long uninterruptible sleep on lower-end
devices
On lower-end devices, reclaim cost and latency are harder to control:
- throttle_direct_reclaim() can make tasks wait for kswapd instead of
  doing direct reclaim;
- sometimes the target generations in many memcgs have very few
  reclaimable pages, so the CPU spends time scanning with little
  progress.
We observe tasks staying in uninterruptible sleep in try_to_free_pages().
We have not found a proper way to fix this.
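
For readers who want to look for the symptom in Q3 on their own devices, a
rough userspace check is to list threads whose state in /proc/<pid>/task/<tid>/stat
is "D" (uninterruptible sleep). The Python sketch below is only an
illustration of that check, not the tooling used for the data above; the
helper names are hypothetical. It only reports the state; confirming that a
thread is actually blocked in try_to_free_pages() needs a kernel stack trace,
for example from /proc/<pid>/stack where the kernel provides it and the
caller is root.

```python
import glob


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(stat_lines):
    """Return [(pid, comm)] for every line whose state is 'D'."""
    parsed = (parse_stat_line(line) for line in stat_lines)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]


def scan_proc():
    """Collect D-state threads from the live /proc (Linux only)."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as f:
                lines.append(f.read())
        except OSError:
            continue  # the thread exited while we were scanning
    return d_state_tasks(lines)


if __name__ == "__main__":
    for tid, comm in scan_proc():
        print(f"{tid}\t{comm}")
```

A single snapshot is noisy, since short D-state waits are normal; sampling
repeatedly and reporting threads that stay in "D" across samples would be
closer to the "long uninterruptible sleep" described above.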

Q4: Lack of global hot/cold + priority view with per-app memcg
Android uses a per-app memcg model and foreground/background levels
for resource control. Root reclaim lacks a cross-memcg hot/cold and
priority view, so foreground app file pages may be reclaimed and
reloaded frequently, causing visible stalls.
We currently use a hook [3] to skip reclaim for foreground apps.

Discussion

- Vendor-only workarounds -> generic mechanisms (Q1-Q4)
Our current fixes (userspace memcg aging [1], OEM reclaim hooks
[2,3]) are Android/vendor-only. Which parts should be turned into
generic MGLRU/kernel mechanisms, and which should be kept as Android
policy? We would appreciate guidance from the community.

- How much control should MGLRU expose to Android? (Q1-Q3)
For Q1/Q2, Android has strong fg/bg and priority semantics that
the kernel does not see. Should MGLRU provide more explicit control
points (e.g. anon-vs-file / generation steering,
"target amount + abort condition" memcg reclaim) so Android can
safely trade complexity and risk for better performance and bounded
reclaim latency (Q3)?

- MGLRU evolution without memcg LRU: global hot/cold & scanning (Q4)
If the memcg LRU is removed [4], how should we maintain a cross-memcg
global hot/cold view and per-app priority on Android?
Given that much of the power benefit seems to come from page-table
scanning while generations are complex, would it be reasonable to
decouple the page-scanning functionality from MGLRU and make it a
separate kernel configuration option?
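
To make the "target amount + abort condition" idea from the second
discussion point concrete, here is a toy Python model of the semantics being
asked for. It is not kernel code and does not model how
try_to_free_mem_cgroup_pages() or MGLRU actually walk memcgs; ToyMemcg,
reclaim_bounded() and the should_abort callback are hypothetical names
introduced only for this sketch. The two properties it shows are that each
step is clamped to the remaining target, so the loop cannot overshoot, and
that a caller-supplied condition is checked before every step, so reclaim can
stop early.

```python
from dataclasses import dataclass


@dataclass
class ToyMemcg:
    name: str
    reclaimable: int  # pages this toy memcg can still give up


def reclaim_bounded(memcgs, target, batch=32, should_abort=None):
    """Reclaim up to `target` pages round-robin across `memcgs`.

    Returns (pages_reclaimed, reason), where reason is one of
    "target met", "aborted" or "nothing left".
    """
    reclaimed = 0
    progress = True
    while reclaimed < target and progress:
        progress = False
        for cg in memcgs:
            # Abort condition is evaluated before every step.
            if should_abort is not None and should_abort(reclaimed):
                return reclaimed, "aborted"
            # Clamp the step so the total never exceeds the target.
            step = min(batch, cg.reclaimable, target - reclaimed)
            if step == 0:
                continue
            cg.reclaimable -= step
            reclaimed += step
            progress = True
            if reclaimed >= target:
                return reclaimed, "target met"
    if reclaimed >= target:
        return reclaimed, "target met"
    return reclaimed, "nothing left"


if __name__ == "__main__":
    cgs = [ToyMemcg("bg_app_a", 100), ToyMemcg("bg_app_b", 10)]
    print(reclaim_bounded(cgs, target=50))
```

In this model the abort callback stands in for whatever signal Android would
want to supply (for example, an app moving to the foreground); which signals
are appropriate is exactly the open question raised above.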

We are happy to share more detailed data and experiments and to help
with PoCs and large-scale validation if there is interest in
pursuing these directions.

References
[1] https://lore.kernel.org/linux-mm/20251128025315.3520689-1-wangzicheng@honor.com/
[2] https://android-review.googlesource.com/c/kernel/common/+/3866554
[3] https://android-review.googlesource.com/c/kernel/common/+/3870920
[4] https://lwn.net/Articles/1051882/

--
Best regards, and wishing you a prosperous Year of the Horse,
Zicheng Wang

end of thread, other threads:[~2026-02-24 20:23 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
2026-02-24  3:17 [LSF/MM/BPF TOPIC] MGLRU on Android: Real-World Problems and Challenges wangzicheng
2026-02-24 17:10 ` Suren Baghdasaryan
2026-02-24 20:23 ` Barry Song
  -- strict thread matches above, loose matches on Subject: below --
2026-02-14 10:06 wangzicheng
