From: David Hildenbrand <david@redhat.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>,
	lsf-pc@lists.linux-foundation.org, SeongJae Park <sj@kernel.org>,
	Minchan Kim <minchan@kernel.org>,
	m.szyprowski@samsung.com, aneesh.kumar@kernel.org,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	mina86@mina86.com, Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <liam.howlett@oracle.com>,
	Michal Hocko <mhocko@kernel.org>, linux-mm <linux-mm@kvack.org>,
	android-kernel-team <android-kernel-team@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Guaranteed CMA
Date: Fri, 10 Oct 2025 17:37:20 +0200	[thread overview]
Message-ID: <5c3fc24c-dfbf-4d59-bda0-8726d2eaa78a@redhat.com> (raw)
In-Reply-To: <CAJuCfpGHZFmSV9ZDXc_CxxACzq-TjVudnCT3thn_Ok6S4nm3eQ@mail.gmail.com>

On 10.10.25 17:07, Suren Baghdasaryan wrote:
> On Fri, Oct 10, 2025 at 6:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 10.10.25 03:30, Suren Baghdasaryan wrote:
>>> On Mon, Sep 1, 2025 at 9:01 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 27.08.25 02:17, Suren Baghdasaryan wrote:
>>>>> On Tue, Aug 26, 2025 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 23.08.25 00:14, Suren Baghdasaryan wrote:
>>>>>>> On Wed, Apr 2, 2025 at 9:35 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, Mar 20, 2025 at 11:06 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Feb 4, 2025 at 8:33 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 4, 2025 at 3:23 AM Alexandru Elisei
>>>>>>>>>> <alexandru.elisei@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 04, 2025 at 09:18:20AM +0100, David Hildenbrand wrote:
>>>>>>>>>>>> On 02.02.25 01:19, Suren Baghdasaryan wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to discuss the Guaranteed Contiguous Memory Allocator
>>>>>>>>>>>>> (GCMA) mechanism that is being used by many Android vendors as an
>>>>>>>>>>>>> out-of-tree feature, collect input on its possible usefulness for
>>>>>>>>>>>>> others, feasibility to upstream and suggestions for possible better
>>>>>>>>>>>>> alternatives.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Problem statement: Some workloads/hardware require physically
>>>>>>>>>>>>> contiguous memory and carving out reserved memory areas for such
>>>>>>>>>>>>> allocations often lead to inefficient usage of those carveouts. CMA
>>>>>>>>>>>>> allocations often leads to inefficient usage of those carveouts. CMA
>>>>>>>>>>>>> allocations to use this reserved memory when it’s otherwise unused.
>>>>>>>>>>>>> When a contiguous memory allocation is requested, CMA finds the
>>>>>>>>>>>>> requested contiguous area, possibly migrating some of the movable
>>>>>>>>>>>>> pages out of that area.
>>>>>>>>>>>>> In latency-sensitive use cases, like face unlock on phones, we need to
>>>>>>>>>>>>> allocate contiguous memory quickly and page migration in CMA takes
>>>>>>>>>>>>> enough time to cause user-perceptible lag. Such allocations can also
>>>>>>>>>>>>> fail if page migration is not possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> GCMA (Guaranteed CMA) is a mechanism previously proposed in [1] which
>>>>>>>>>>>>> was not upstreamed but got adopted later by many Android vendors as an
>>>>>>>>>>>>> out-of-tree feature. It is similar to CMA, but its backing
>>>>>>>>>>>>> memory is a cleancache backend containing only clean
>>>>>>>>>>>>> file-backed pages. Most importantly, the kernel can’t take a
>>>>>>>>>>>>> reference to pages from the cleancache and therefore can’t
>>>>>>>>>>>>> prevent GCMA from quickly dropping them when required. This
>>>>>>>>>>>>> guarantees GCMA low allocation latency and improves allocation
>>>>>>>>>>>>> success rate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We would like to standardize GCMA implementation and upstream it since
>>>>>>>>>>>>> many Android vendors are asking to include it as a generic feature.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note: the removal of cleancache in the 5.17 kernel due to having
>>>>>>>>>>>>> no users (sorry, we didn’t know about this use case at the time)
>>>>>>>>>>>>> might complicate upstreaming.
>>>>>>>>>>>>
>>>>>>>>>>>> we discussed another possible user last year: using MTE tag storage memory
>>>>>>>>>>>> while the storage is not getting used to store MTE tags [1].
>>>>>>>>>>>>
>>>>>>>>>>>> As long as the "ordinary RAM" that maps to a given MTE tag storage area does
>>>>>>>>>>>> not use MTE tagging, we can reuse the MTE tag storage ("almost ordinary RAM,
>>>>>>>>>>>> just that it doesn't support MTE itself") for different purposes.
>>>>>>>>>>>>
>>>>>>>>>>>> We need a guarantee that that memory can be freed up / migrated once the tag
>>>>>>>>>>>> storage gets activated.
>>>>>>>>>>>
>>>>>>>>>>> If I remember correctly, one of the issues with the MTE project that might be
>>>>>>>>>>> relevant to GCMA was that userspace, once it gets hold of a page, can pin it
>>>>>>>>>>> for a very long time without specifying FOLL_LONGTERM.
>>>>>>>>>>>
>>>>>>>>>>> If I remember things correctly, there were two examples given for this; there
>>>>>>>>>>> might be more, or they might have been eliminated since then:
>>>>>>>>>>>
>>>>>>>>>>> * The page is used as a buffer for accesses to a file opened with
>>>>>>>>>>>       O_DIRECT.
>>>>>>>>>>>
>>>>>>>>>>> * 'vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet' - that's
>>>>>>>>>>>       a direct quote from David [1].
>>>>>>>>>>>
>>>>>>>>>>> Depending on your use cases, failing the allocation might be acceptable, but for
>>>>>>>>>>> MTE that wasn't the case.
>>>>>>>>>>>
>>>>>>>>>>> Hope some of this is useful.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/4e7a4054-092c-4e34-ae00-0105d7c9343c@redhat.com/
>>>>>>>>>>
>>>>>>>>>> Thanks for the references! I'll read through these discussions to see
>>>>>>>>>> how much useful information for GCMA I can extract.
>>>>>>>>>
>>>>>>>>> I wanted to get RFC code out ahead of LSF/MM and just finished putting
>>>>>>>>> it together. Sorry for the last minute posting. You can find it here:
>>>>>>>>> https://lore.kernel.org/all/20250320173931.1583800-1-surenb@google.com/
>>>>>>>>
>>>>>>>> Sorry about the delay. Attached are the slides from my GCMA
>>>>>>>> presentation at the conference.
>>>>>>>
>>>>>>> Hi Folks,
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> As I'm getting close to finalizing the GCMA patchset, one question
>>>>>>> keeps bugging me: how do we account for memory allocated from GCMA?
>>>>>>> In the case of CMA, allocations are backed by system memory, so
>>>>>>> accounting is straightforward: allocations contribute to RSS, count
>>>>>>> towards memcg limits, etc. In the case of GCMA, the backing memory is
>>>>>>> reserved memory (a carveout) that is not directly accessible by the
>>>>>>> rest of the system and is not part of total_memory. So, if a process
>>>>>>> allocates a buffer from GCMA, should it be accounted as a normal
>>>>>>> allocation from system memory or as something else entirely? Any
>>>>>>> thoughts?
>>>>>>
>>>>>> You mean, an application allocates the memory and maps it into its page
>>>>>> tables?
>>>>>
>>>>> Allocation will happen via cma_alloc() or a similar interface, so
>>>>> applications would have to use some driver to allocate from GCMA. Once
>>>>> allocated, an application can map that memory if the driver supports
>>>>> mapping.
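
(Just to make the driver-facing side above concrete -- a completely untested
sketch; the "gcma_area" handle and the wrapper names are made up, only
cma_alloc()/cma_release() are the real interface:)

#include <linux/cma.h>
#include <linux/mm.h>

/*
 * Hypothetical driver helpers: grab/return a physically contiguous buffer
 * from a (G)CMA area the driver looked up at probe time.
 */
static struct page *example_alloc_contig(struct cma *gcma_area,
					 unsigned long nr_pages)
{
	/* Returns the first page of a contiguous range, or NULL on failure. */
	return cma_alloc(gcma_area, nr_pages, 0 /* align order */, false /* no_warn */);
}

static void example_free_contig(struct cma *gcma_area, struct page *page,
				unsigned long nr_pages)
{
	/* Hand the range back so it can be re-donated (for GCMA: to the cleancache). */
	cma_release(gcma_area, page, nr_pages);
}
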
>>>>
>>>> Right, and that might happen either through a VM_PFNMAP or !VM_PFNMAP
>>>> (ordinarily ref- and currently map-counted).
>>>>
>>>> In the insert_page() case we do an inc_mm_counter, which increases the RSS.
>>>>
>>>> That could happen with pages from carveouts (memblock allocations)
>>>> already, but I assume we don't run into that in general.
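
(Illustration again, completely untested: the two mapping flavors from a
driver's ->mmap() handler; "buf_pages"/"buf_pfn" are made up. The point is
that the vm_insert_page() variant goes through insert_page() and therefore
bumps the RSS counters, while the remap_pfn_range()/VM_PFNMAP variant does
not:)

#include <linux/mm.h>

/* Variant 1: "normal" mapping; pages are ref-/map-counted and RSS goes up. */
static int example_mmap_pages(struct vm_area_struct *vma,
			      struct page **buf_pages, unsigned long nr_pages)
{
	unsigned long i, addr = vma->vm_start;
	int ret;

	for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE) {
		ret = vm_insert_page(vma, addr, buf_pages[i]);
		if (ret)
			return ret;
	}
	return 0;
}

/* Variant 2: VM_PFNMAP mapping; no struct-page accounting, no RSS impact. */
static int example_mmap_pfnmap(struct vm_area_struct *vma, unsigned long buf_pfn)
{
	return remap_pfn_range(vma, vma->vm_start, buf_pfn,
			       vma->vm_end - vma->vm_start, vma->vm_page_prot);
}
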
>>>>
>>>>>
>>>>>>
>>>>>> Can that memory get reclaimed somehow?
>>>>>
>>>>> Hmm. I assume that once a driver allocates pages from GCMA it won't
>>>>> put them on the system-managed LRU or free them into the buddy
>>>>> allocator for the kernel to use. If it did, then at cma_release() time
>>>>> it couldn't guarantee there are no remaining users of such pages.
>>>>>
>>>>>>
>>>>>> How would we be mapping these pages into processes (VM_PFNMAP or
>>>>>> "normal" mappings)?
>>>>>
>>>>> They would be normal mappings, as the pages do have a `struct page`, but
>>>>> I expect these pages to be managed by the driver that allocated them
>>>>> rather than by the core kernel itself.
>>>>>
>>>>> I was trying to design GCMA to be as close to CMA as possible so that
>>>>> we can use the same cma_alloc()/cma_release() API and reuse CMA's page
>>>>> management code, but the fact that CMA is backed by system memory while
>>>>> GCMA is backed by a carveout makes it a bit difficult.
>>>>
>>>> Makes sense. So I assume memcg already does not apply here -- memcg does
>>>> not apply at the CMA layer IIRC.
>>>>
>>>> The RSS is a bit tricky. We would have to modify things like
>>>> inc_mm_counter() to special-case these pages.
>>>>
>>>> But then, smaps output would still count these pages towards the rss/pss
>>>> (e.g., mss->resident). So that needs care as well ...
>>>
>>> In the end I decided to follow CMA as closely as possible, including
>>> accounting. GCMA and CMA both use a reserved area; the difference is
>>> that CMA donates its memory to the kernel to use for movable allocations
>>> while GCMA donates it to the cleancache. But once that donation is taken
>>> back by CMA/GCMA to satisfy a cma_alloc() request, the memory usage is
>>> pretty much the same, and therefore the accounting should probably be
>>> the same. Anyway, that was the reasoning I eventually arrived at. I
>>> posted the GCMA patchset at [1] and included this reasoning in the cover
>>> letter. Happy to discuss this further in that patchset.
>>
>> Right, probably best to keep it simple. Will these GCMA pages be
>> accounted towards MemTotal like CMA pages would?
> 
> I thought CMA pages are accounted towards CmaTotal, and if that's what
> you mean, then yes, they are added to that metric in the patch [1]; see
> the change in gcma_register_area(). I'm not adding a GcmaTotal metric
> because I think it's simpler to consider GCMA as just a flavor of CMA,
> as both are used via the same API (cma_alloc/cma_release) and serve the
> same purpose. A GCMA area can be distinguished from a CMA area using the
> /sys/kernel/mm/cma/<area>/gcma attribute, but otherwise it should appear
> to users as yet another CMA area. Does that make sense?

I was rather wondering whether these pages will be part of MemTotal in 
/proc/meminfo, i.e., whether totalram_pages_add() is called for them.

For ordinary CMA that happens in cma_activate_area() -> 
init_cma_reserved_pageblock() through adjust_managed_page_count(), where 
we also have the __free_pages() call.
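
From memory, the gist of that activation path is roughly the following 
(paraphrased, not the literal mm/ code, so the details may be off):

/*
 * Rough paraphrase of what init_cma_reserved_pageblock() does for each
 * reserved pageblock when a CMA area gets activated.
 */
static void __init example_activate_cma_pageblock(struct page *page)
{
	/* Mark the pageblock MIGRATE_CMA so only movable allocations land here. */
	set_pageblock_migratetype(page, MIGRATE_CMA);

	/* Hand the pages over to the buddy allocator ... */
	__free_pages(page, pageblock_order);

	/* ... and count them as managed RAM, which is what bumps MemTotal. */
	adjust_managed_page_count(page, pageblock_nr_pages);
}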

If nothing changed on that front, then yes, it would behave just like 
ordinary CMA.

I should probably take a look at your v1; unfortunately, that might have 
to wait until next week, as I'm out of review capacity for today, I'm afraid.

-- 
Cheers

David / dhildenb


