From: Frank van der Linden <fvdl@google.com>
To: akpm@linux-foundation.org, muchun.song@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: hannes@cmpxchg.org, david@redhat.com, roman.gushchin@linux.dev,
Frank van der Linden <fvdl@google.com>
Subject: [RFC PATCH 00/12] CMA balancing
Date: Mon, 15 Sep 2025 19:51:41 +0000
Message-ID: <20250915195153.462039-1-fvdl@google.com>

This is an RFC on a solution to the long-standing problem of OOMs
occurring when the kernel runs out of space for unmovable allocations
in the face of large amounts of CMA.
Introduction
============
When there is a large amount of CMA (e.g. with hugetlb_cma), it is
possible for the kernel to run out of space for unmovable
allocations, since those cannot be satisfied from the CMA area.
If the issue is simply that the CMA area is so large that not
enough space is left outside of it, that can be considered a
misconfigured system. However, there is a scenario that could be
handled better: when the non-CMA area also holds movable
allocations while CMA pageblocks are still available.
The current mitigation for this issue is to start using CMA
pageblocks for movable allocations first once the amount of free
CMA pageblocks exceeds 50% of the total amount of free memory in
a zone. But that does not always work out: the system can easily
run into a scenario where long-lasting movable allocations are
made early, before the 50% mark is reached, so they do not go to
CMA. When the non-CMA area fills up, these allocations get in the
way of the kernel's unmovable allocations, and OOMs might occur.
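
For reference, this mitigation lives in the page allocator's
__rmqueue() fast path in mm/page_alloc.c; paraphrased (not quoted
verbatim), the check amounts to:

	/*
	 * Paraphrase of the current heuristic: prefer CMA pageblocks
	 * for movable allocations once more than half of the zone's
	 * free memory is free CMA.
	 */
	if (alloc_flags & ALLOC_CMA &&
	    zone_page_state(zone, NR_FREE_CMA_PAGES) >
	    zone_page_state(zone, NR_FREE_PAGES) / 2) {
		page = __rmqueue_cma_fallback(zone, order);
		if (page)
			return page;
	}
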
Even always directing movable allocations to CMA first does
not completely fix the issue. Take a scenario where there is a
large amount of CMA reserved through hugetlb_cma, all of which
has been taken up by 1G hugetlb pages. Movable allocations
therefore end up in the non-CMA area. Now the number of hugetlb
pages in the pool is lowered, so some CMA becomes available.
At the same time, increased system activity leads to more unmovable
allocations. Since the movable allocations are still in the non-CMA
area, these kernel allocations might still fail.
Additionally, CMA areas are allocated at the bottom of the zone.
There has been some discussion on this in the past. Originally,
satisfying regular movable allocations from CMA was deemed best
avoided. The arguments were twofold:
1) cma_alloc needs to be quick and should not have to migrate a
lot of pages.
2) Migration might fail, so the fewer pages it has to migrate,
the better.
These arguments are why CMA is avoided (until the 50% limit is hit),
and why CMA areas are placed at the bottom of a zone. But
compaction migrates memory from the bottom to the top of a zone.
That means that compaction will actually end up migrating movable
allocations out of CMA and into non-CMA, making the problem of
OOMing for unmovable allocations worse.
Solution: CMA balancing
=======================
First, this patch set makes the 50% threshold configurable, which
is useful in any case. vm.cma_first_limit is the percentage of a
zone's total free memory that free CMA must exceed before CMA is
used first for movable allocations; 0 means always use CMA first,
100 means never.
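
To illustrate how the tunable generalizes the current 50% check,
here is a minimal sketch; the helper name and its standalone form
are made up for this illustration, they are not the code in this
series:

	/*
	 * Illustrative sketch only. cma_first_limit is a percentage:
	 * use CMA first for movable allocations once free CMA exceeds
	 * that share of the zone's total free memory.
	 */
	static bool use_cma_first(unsigned long free_pages,
				  unsigned long free_cma_pages,
				  unsigned int cma_first_limit)
	{
		if (cma_first_limit == 0)
			return true;	/* always use CMA first */
		if (cma_first_limit == 100)
			return false;	/* never use CMA first */

		return free_cma_pages * 100 > free_pages * cma_first_limit;
	}

A value of 50 corresponds to today's behavior; lowering the value
pushes movable allocations into CMA pageblocks earlier.
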
Then, it creates an interface that allows movable allocations to
be migrated from non-CMA to CMA. CMA areas opt in to taking part
in this through a flag. Also, if the flag is set for a CMA area,
it is placed at the top of a zone instead of the bottom.
Lastly, the hugetlb_cma code was modified to try to migrate
movable allocations from non-CMA to CMA when a hugetlb CMA
page is freed. Only hugetlb CMA areas opt in to CMA balancing;
behavior for all other CMA areas is unchanged.
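
A minimal sketch of that last step, assuming hypothetical helper
names (the actual functions in the series may differ):

	/*
	 * Sketch only; cma_balance_node() is a hypothetical name for
	 * the balancing entry point. When a gigantic page backed by
	 * CMA is returned to the pool, refill the freed CMA space with
	 * movable pages currently sitting in non-CMA pageblocks, so
	 * non-CMA room is preserved for unmovable (kernel) allocations.
	 */
	static void hugetlb_cma_free_and_balance(struct folio *folio)
	{
		int nid = folio_nid(folio);

		cma_free_folio(hugetlb_cma[nid], folio);
		cma_balance_node(nid);
	}
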
Discussion
==========
This approach works when tested with a hugetlb_cma setup in
which a large number of 1G pages is active, but where that number
is sometimes reduced to make room for larger non-hugetlb
overhead.
Arguments against this approach:
* It's kind of heavy-handed. Since there is no easy way to
track the amount of movable allocations residing in non-CMA
pageblocks, it will likely end up scanning too much memory,
as it only knows the upper bound.
* It should be more integrated with watermark handling in the
allocation slow path. Again, this would likely require
tracking the number of movable allocations in non-CMA
pageblocks.
Arguments for this approach:
* Yes, it does more work, but that work is restricted to the context
of a process that decreases the hugetlb pool, and it is not
more work than allocating (e.g. freeing a hugetlb page from
the pool is now as expensive as allocating a new one).
* hugetlb_cma is really the only situation where you have CMA
areas large enough to trigger the OOM scenario, so restricting
it to hugetlb should be good enough.
Comments, thoughts?
Frank van der Linden (12):
mm/cma: add tunable for CMA fallback limit
mm/cma: clean up flag handling a bit
mm/cma: add flags argument to init functions
mm/cma: keep a global sorted list of CMA ranges
mm/cma: add helper functions for CMA balancing
mm/cma: define and act on CMA_BALANCE flag
mm/compaction: optionally use a different isolate function
mm/compaction: simplify isolation order checks a bit
mm/cma: introduce CMA balancing
mm/hugetlb: do explicit CMA balancing
mm/cma: rebalance CMA when changing cma_first_limit
mm/cma: add CMA balance VM event counter
arch/powerpc/kernel/fadump.c | 2 +-
arch/powerpc/kvm/book3s_hv_builtin.c | 2 +-
drivers/s390/char/vmcp.c | 2 +-
include/linux/cma.h | 64 +++++-
include/linux/migrate_mode.h | 1 +
include/linux/mm.h | 4 +
include/linux/vm_event_item.h | 3 +
include/trace/events/migrate.h | 3 +-
kernel/dma/contiguous.c | 10 +-
mm/cma.c | 318 +++++++++++++++++++++++----
mm/cma.h | 13 +-
mm/compaction.c | 199 +++++++++++++++--
mm/hugetlb.c | 14 +-
mm/hugetlb_cma.c | 18 +-
mm/hugetlb_cma.h | 5 +
mm/internal.h | 11 +-
mm/migrate.c | 8 +
mm/page_alloc.c | 104 +++++++--
mm/vmstat.c | 2 +
19 files changed, 676 insertions(+), 107 deletions(-)
--
2.51.0.384.g4c02a37b29-goog