From: Brendan Jackman <jackmanb@google.com>
To: Brendan Jackman <jackmanb@google.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>, Wei Xu <weixugc@google.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<x86@kernel.org>, <rppt@kernel.org>,
Sumit Garg <sumit.garg@oss.qualcomm.com>, <derkling@google.com>,
<reijiw@google.com>, Will Deacon <will@kernel.org>,
<rientjes@google.com>,
"Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
<patrick.roy@linux.dev>,
"Itazuri, Takahiro" <itazur@amazon.co.uk>,
Andy Lutomirski <luto@kernel.org>,
David Kaplan <david.kaplan@amd.com>,
Thomas Gleixner <tglx@kernel.org>,
Yosry Ahmed <yosry.ahmed@linux.dev>
Subject: Re: [PATCH RFC 04/19] x86/mm: introduce the mermap
Date: Fri, 27 Feb 2026 10:47:45 +0000
Message-ID: <DGPONXAKU7ZG.11C48FZY9OCPT@google.com>
In-Reply-To: <20260225-page_alloc-unmapped-v1-4-e8808a03cd66@google.com>
Relaying some code review from an AI that I wasn't able to run before
sending...
(This isn't the AI's verbatim output; I'm filtering and rephrasing it.)
On Wed Feb 25, 2026 at 4:34 PM UTC, Brendan Jackman wrote:
> +
> +/* Call with migration disabled. */
> +static inline struct mermap_alloc *mermap_alloc(struct mm_struct *mm,
> + unsigned long size, bool use_reserve)
> +{
> + int cpu = raw_smp_processor_id();
> + struct mermap_cpu *mc = this_cpu_ptr(mm->mermap.cpu);
> + unsigned long cpu_end = mermap_cpu_end(cpu);
> + struct mermap_alloc *alloc = NULL;
> +
> + /*
> + * This is an extremely stupid allocator, there can only ever be a small
> + * number of allocations so everything just works on linear search.
> + *
> + * Allocations are "in order", i.e. if the whole region is free it
> + * allocates from the beginning. If there are any existing allocations
> + * it allocates from right after the last (highest address) one. Any
> + * free space before that goes unused.
> + *
> + * Once an allocation has been freed, the space it occupied must be flushed
> + * from the TLB before it can be reused.
> + *
> + * Visual example of how this is suppose to behave (A for allocated, T for
> + * TLB-flush-pending):
> + *
> + * _______________ Start with everything free.
> + * AaaA___________ Allocate something.
> + * TttT___________ Free it. (Region needs a TLB flush now).
> + * TttTAaaaaaaaA__ Allocate something else.
> + * TttTAaaaaaaaAAA Allocate the remaining space.
> + * TttTTtttttttTAA Free the allocation before last.
> + * ^^^^^^^^^^^^^ This could all be reused now but for simplicity it
> + * isn't. Another allocation at this point will fail.
> + * TttTTtttttttTTT Free the last allocation.
> + * _______________ Next time we allocate, first flush the TLB
> + * AA_____________ Now we're back at the beginning.
> + */
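To sanity-check the semantics described in that comment, here's a minimal
userspace model of the bump-and-flush scheme (all names here are hypothetical
stand-ins, and the use_reserve address-space holdback is omitted). The key
property it models: freeing never rolls next_addr back; freed space only
becomes usable again after a "TLB flush" resets next_addr to just past the
highest live allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REGION_SIZE 16
#define MAX_ALLOCS  4

struct model_alloc { bool in_use; size_t base, end; };

static struct model_alloc allocs[MAX_ALLOCS];
static size_t next_addr;
static int tlb_flushes;

/* Returns the base address, or (size_t)-1 on failure. */
static size_t model_alloc_range(size_t size)
{
	if (next_addr + size > REGION_SIZE) {
		/*
		 * Out of already-available space. Would a flush help?
		 * Recompute next_addr as the end of the highest live
		 * allocation (region base if there are none).
		 */
		size_t new_next = 0;

		for (int i = 0; i < MAX_ALLOCS; i++)
			if (allocs[i].in_use && allocs[i].end > new_next)
				new_next = allocs[i].end;
		if (size > REGION_SIZE - new_next)
			return (size_t)-1;
		tlb_flushes++;		/* stands in for mermap_flush_tlb() */
		next_addr = new_next;
	}
	for (int i = 0; i < MAX_ALLOCS; i++) {
		if (!allocs[i].in_use) {
			allocs[i].in_use = true;
			allocs[i].base = next_addr;
			allocs[i].end = next_addr + size;
			next_addr += size;
			return allocs[i].base;
		}
	}
	return (size_t)-1;		/* no tracking slot free */
}

static void model_free(size_t base)
{
	for (int i = 0; i < MAX_ALLOCS; i++)
		if (allocs[i].in_use && allocs[i].base == base)
			allocs[i].in_use = false;
	/*
	 * next_addr is NOT rolled back: the space stays unusable until
	 * the next flush, matching the "T" state in the diagram above.
	 */
}
```

Replaying the visual example against this model (alloc 4, free it, alloc 9,
alloc 3, free the middle one, fail a 1-byte alloc, free the last, then
allocate again from the base after exactly one flush) behaves as the
comment describes.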
> +
> + if (use_reserve) {
> + if (WARN_ON_ONCE(size != PAGE_SIZE))
> + return NULL;
> + lockdep_assert_preemption_disabled();
> + } else {
> + cpu_end -= PAGE_SIZE;
> + }
> +
> + if (WARN_ON_ONCE(!in_task()))
> + return NULL;
> + guard(preempt)();
> +
> + /* Out of already-available space? */
> + if (mc->next_addr + size > cpu_end) {
> + unsigned long new_next = mermap_cpu_base(cpu);
> +
> + /* Would we have space after a TLB flush? */
> + for (int i = 0; i < ARRAY_SIZE(mc->allocs); i++) {
> + struct mermap_alloc *alloc = &mc->allocs[i];
> +
> + /*
> + * The space between the uppermost allocated alloc->end
> + * (or the base of the CPU's region if there are no
> + * current allocations) and mc->next_addr has been
> + * unmapped in the pagetables, but not flushed from the
> + * TLB. Set new_next to point to the beginning of that
> + * space.
> + */
> + if (READ_ONCE(alloc->in_use))
> + new_next = max(new_next, alloc->end);
> + }
> + if (size > cpu_end - new_next)
> + return NULL;
> +
> + mermap_flush_tlb(cpu, mc);
> + mc->next_addr = new_next;
> + }
> +
> + /* Find an alloc-tracking structure to use */
> + for (int i = 0; i < ARRAY_SIZE(mc->allocs); i++) {
> + if (!READ_ONCE(mc->allocs[i].in_use)) {
> + alloc = &mc->allocs[i];
> + break;
> + }
> + }
> + if (!alloc)
> + return NULL;
Oops, I forgot to account for @use_reserve here. The alloc-tracking
structures should have a reservation, like the virtual address space does;
otherwise reserve allocations can fail where they aren't supposed to.
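One way to fix that would be to hold back the final tracking slot for
use_reserve callers, mirroring how the address-space path holds back
PAGE_SIZE. A userspace sketch (hypothetical helper, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ALLOCS 4

static bool slot_in_use[NR_ALLOCS];

/*
 * Non-reserve callers may only search slots [0, NR_ALLOCS - 1); the
 * final slot is kept free so a reserve caller can always find one.
 * Returns the slot index, or -1 if none is available.
 */
static int find_slot(bool use_reserve)
{
	int limit = use_reserve ? NR_ALLOCS : NR_ALLOCS - 1;

	for (int i = 0; i < limit; i++) {
		if (!slot_in_use[i]) {
			slot_in_use[i] = true;
			return i;
		}
	}
	return -1;
}
```

With this, filling the table with non-reserve allocations leaves the last
slot free, so a subsequent reserve allocation still succeeds.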
> + alloc->in_use = true;
> + alloc->base = mc->next_addr;
> + alloc->end = alloc->base + size;
> + mc->next_addr += size;
> +
> + return alloc;
> +}
> +
> +struct set_pte_ctx {
> + pgprot_t prot;
> + unsigned long next_pfn;
> +};
> +
> +static inline int do_set_pte(pte_t *pte, unsigned long addr, void *data)
> +{
> + struct set_pte_ctx *ctx = data;
> +
> + set_pte(pte, pfn_pte(ctx->next_pfn, ctx->prot));
> + ctx->next_pfn++;
> +
> + return 0;
> +}
> +
> +static struct mermap_alloc *
> +__mermap_get(struct mm_struct *mm, struct page *page,
> + unsigned long size, pgprot_t prot, bool use_reserve)
> +{
> + struct mermap_alloc *alloc = NULL;
> + struct set_pte_ctx ctx;
> + int err;
> +
> + if (size > MERMAP_CPU_REGION_SIZE || WARN_ON_ONCE(!mm || !mm->mermap.cpu))
> + return NULL;
> + if (WARN_ON_ONCE(!arch_mermap_pgprot_allowed(prot)))
> + return NULL;
> +
> + size = PAGE_ALIGN(size);
> +
> + migrate_disable();
> +
> + alloc = mermap_alloc(mm, size, use_reserve);
> + if (!alloc) {
> + migrate_enable();
> + return NULL;
> + }
> +
> + /* This probably wants to be optimised. */
> + ctx.prot = prot;
> + ctx.next_pfn = page_to_pfn(page);
> + err = apply_to_existing_page_range(mm, alloc->base, size, do_set_pte, &ctx);
This takes a PTE lock, and we may have preemption off here, so this may be
broken on PREEMPT_RT? Haven't checked. (Maybe I can just test this by
running it with PREEMPT_RT and lockdep enabled?)
If that's indeed broken, it's yet another point for the discussion about
requirements for a pagetable library [0]. The lock is not needed at all
here - we need a way to modify pagetables that lets callers take advantage
of higher-level synchronization.
[0] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@google.com/
> + if (err) {
> + WRITE_ONCE(alloc->in_use, false);
> + return NULL;
Forgot migrate_enable().
(Is there a way to prevent this with the lovely new
__attribute__((cleanup)) magic?)
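For reference, a userspace sketch of the __attribute__((cleanup)) pattern
that the kernel's cleanup.h guard() machinery is built on (the migrate_*
helpers below are userspace stand-ins, not the kernel functions):

```c
#include <assert.h>

static int migrate_count;

static void migrate_disable(void) { migrate_count++; }
static void migrate_enable(void)  { migrate_count--; }

/* Runs when the annotated variable goes out of scope, on any path. */
static void migrate_cleanup(int *unused)
{
	(void)unused;
	migrate_enable();
}

static int do_work(int fail_early)
{
	migrate_disable();
	int guard __attribute__((cleanup(migrate_cleanup))) = 0;
	(void)guard;

	if (fail_early)
		return -1;	/* cleanup still runs migrate_enable() here */
	return 0;
}
```

Every return path, including the early-error one, re-enables migration, so
a forgotten migrate_enable() like the one above becomes impossible by
construction.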
> + }
> +
> + return alloc;
> +}
> +
> +/*
> + * Allocate a region of virtual memory, and map the page into it. This tries
> + * pretty hard to be fast but doesn't try very hard at all to actually succeed.
> + *
> + * The returned region is physically local to the current mm. It is _logically_
> + * local to the current CPU but this is not enforced by hardware so it can be
> + * exploited to mitigate CPU vulns. This means the caller must not map memory
> + * here that doesn't belong to the current process. The caller must also perform
> + * a full TLB flush of the region before freeing the pages that have been mapped
> + * here.
> + *
> + * This may only be called from process context, and the caller must arrange to
> + * first call mermap_mm_prepare(). (It would be possible to support this in IRQ,
> + * but it seems unlikely there's a valid usecase given the TLB flushing
> + * requirements). If it succeeds, it disables migration until you call
> + * mermap_put().
> + *
> + * This is guaranteed not to allocate.
This one isn't from AI, but I just realised that's a pretty confusing
thing for an allocator to say. It should say it's guaranteed not to call
the page allocator. (This is important because I want to use it from the
page allocator).
> + *
> + * Use mermap_addr() to get the actual address of the mapped region.
> + */
> +struct mermap_alloc *mermap_get(struct page *page, unsigned long size, pgprot_t prot)
> +{
> + return __mermap_get(current->mm, page, size, prot, false);
> +}
> +EXPORT_SYMBOL(mermap_get);
> +
> +/*
> + * Allocate a single PAGE_SIZE page via mermap_get(), requiring preemption to be
> + * off until it is freed. This always succeeds.
> + */
> +void *mermap_get_reserved(struct page *page, pgprot_t prot)
Oops, that should return mermap_alloc *.
> +/* Clean up mermap stuff on mm teardown. */
> +void mermap_mm_teardown(struct mm_struct *mm)
> +{
> + int cpu;
> +
> + if (!mm->mermap.cpu)
> + return;
> +
> + for_each_possible_cpu(cpu) {
> + struct mermap_cpu *mc = this_cpu_ptr(mm->mermap.cpu);
Oops, this should be per_cpu_ptr(mm->mermap.cpu, cpu).