From: David Hildenbrand <david@redhat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-trace-kernel@vger.kernel.org,
Dave Hansen <dave.hansen@linux.intel.com>,
Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
"H. Peter Anvin" <hpa@zytor.com>,
Jani Nikula <jani.nikula@linux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@linux.intel.com>,
Rodrigo Vivi <rodrigo.vivi@intel.com>,
Tvrtko Ursulin <tursulin@ursulin.net>,
David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
Andrew Morton <akpm@linux-foundation.org>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Pedro Falcato <pfalcato@suse.de>, Peter Xu <peterx@redhat.com>,
Suren Baghdasaryan <surenb@google.com>
Subject: Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Date: Mon, 5 May 2025 15:00:02 +0200 [thread overview]
Message-ID: <7035fb14-c9f6-4281-9f65-b220aaa8f5c3@redhat.com> (raw)
In-Reply-To: <b92e3a0d-b878-43cf-9b68-9f7c228e45fa@lucifer.local>
>>
>> This change implies that we'll keep tracking the original PFN range even
>> after splitting + partially unmapping it: not too bad, because it was
>> not working reliably before. The only thing that kind-of worked before
>> was shrinking such a mapping using mremap(): we managed to adjust the
>> reservation in a hacky way, now we won't adjust the reservation but
>> leave it around until all involved VMAs are gone.
>
> Hm, but what if we shrink a VMA, then map another one, might it be
> incorrectly storing PAT attributes for part of the range that is now mapped
> elsewhere?
Not "incorrectly". We'll simply undo the reservation of the cachemode
for the original PFN range once everything of the original VMA is gone.
AFAIK, one can usually mmap() the "unmapped" part after shrinking again
with the same cachemode, which should be the main use case.
Supporting partial un-tracking will require hooking into vma splitting
code ... not something I am super happy about. :)
>
> Also my god re: the 'kind of working' aspects of PAT, so frustrating.
>
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Generally looking good, afaict, but maybe let's get some input from Suren
> on VMA size.
>
> Are there actually any PAT tests out here? I had a quick glance in
> tools/testing/selftests/x86,mm and couldn't find any, but didn't look
> _that_ card.
Heh, booting a simple VM gets PAT involved. I suspect because /dev/mem
and BIOS/GPU/whatever hacks.
In the cover letter I have
"Briefly tested with some basic /dev/mem test I crafted. I want to
convert them to selftests, but that might or might not require a bit of
more work (e.g., /dev/mem accessibility)."
>
> Thanks in general for tackling this, this is a big improvement!
>
>> ---
>> include/linux/mm_inline.h | 2 +
>> include/linux/mm_types.h | 11 ++++++
>> kernel/fork.c | 54 ++++++++++++++++++++++++--
>> mm/memory.c | 81 +++++++++++++++++++++++++++++++--------
>> mm/mremap.c | 4 --
>> 5 files changed, 128 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index f9157a0c42a5c..89b518ff097e6 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>
>> #endif /* CONFIG_ANON_VMA_NAME */
>>
>> +void pfnmap_track_ctx_release(struct kref *ref);
>> +
>> static inline void init_tlb_flush_pending(struct mm_struct *mm)
>> {
>> atomic_set(&mm->tlb_flush_pending, 0);
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 56d07edd01f91..91124761cfda8 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>> int prev_scan_seq;
>> };
>>
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +struct pfnmap_track_ctx {
>> + struct kref kref;
>> + unsigned long pfn;
>> + unsigned long size;
>
> Again, (super) nitty, but we really should express units. I suppose 'size'
> implies bytes to be honest as you'd unlikely say 'size' for number of pages
> (you'd go with nr_pages or something). But maybe a trailing /* in bytes */
> would help.
>
> Not a big deal though!
"size" in the kernel is usually bytes, never pages ... but I might be wrong.
Anyhow, I can use "/* in bytes*/" here, although I doubt that many will
benefit from this comment :)
>
>> +};
>> +#endif
>> +
>> /*
>> * This struct describes a virtual memory area. There is one of these
>> * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>> struct anon_vma_name *anon_name;
>> #endif
>> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef __HAVE_PFNMAP_TRACKING
>
> An aside, but absolutely hate '__HAVE_PFNMAP_TRACKING' as a name here. But
> you didn't create it, and it's not really sensible to change it in this
> series so. Just a rumble...
I cannot argue with that ... same here.
To be clear: I hate all of this with passion ;) With this series, I hate
it a bit less.
[...]
>
> Obviously my series will break this but should be _fairly_ trivial to
> update.
>
> You will however have to make sure to update tools/testing/vma/* to handle
> the new functions in userland testing (they need to be stubbed otu).
Ah, I was happy it compiled but looks like I'll have to mess with that
as well.
>
> If it makes life easier, you can even send it to me off-list, or just send
> it without changing this in a respin and I can fix it up fairly quick for
> you.
Let me give it a try first, I'll let you know if it takes me too long.
Thanks!
[...]
>> /**
>> * remap_pfn_range - remap kernel memory to userspace
>> * @vma: user vma to map to
>> @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>> *
>> * Return: %0 on success, negative error code otherwise.
>> */
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>> unsigned long pfn, unsigned long size, pgprot_t prot)
>
> OK so to expose some of my lack-of-knowledge of PAT - is this the
> 'entrypoint' to PAT tracking?
Only if you're using remap_pfn_range() ... there is other low-level
tracking/reservation using the memtype_reserve() interface and friends.
>
> So we have some kernel memory we remap to userland as PFN map, the kind
> that very well might be sensible to use PAT the change cache behaviour for,
> and each time this happens, it's mapped as PAT?
Right, anytime someone uses remap_pfn_range() on the full VMA, we track
it (depending on RAM vs. !RAM this "tracking" has different semantics).
For RAM, we seem to only lookup the cachemode. For !RAM, we seem to
reserve the memtype for the PFN range, which will fail if there already
is an incompatible memtype reserved.
It's all ... very weird.
--
Cheers,
David / dhildenb
next prev parent reply other threads:[~2025-05-05 13:00 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-25 8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
2025-04-25 8:17 ` [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
2025-04-28 16:16 ` Lorenzo Stoakes
2025-04-28 16:19 ` David Hildenbrand
2025-04-25 8:17 ` [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot() David Hildenbrand
2025-04-25 19:31 ` Peter Xu
2025-04-25 19:48 ` David Hildenbrand
2025-04-25 23:59 ` Peter Xu
2025-04-28 14:58 ` David Hildenbrand
2025-04-28 16:21 ` Peter Xu
2025-04-28 20:37 ` David Hildenbrand
2025-04-29 13:44 ` Peter Xu
2025-04-29 16:25 ` David Hildenbrand
2025-04-29 16:36 ` Peter Xu
2025-04-25 19:56 ` David Hildenbrand
2025-04-25 8:17 ` [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack() David Hildenbrand
2025-04-28 16:53 ` Lorenzo Stoakes
2025-04-28 17:12 ` David Hildenbrand
2025-04-28 18:58 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack() David Hildenbrand
2025-04-25 20:00 ` Peter Xu
2025-04-25 20:14 ` David Hildenbrand
2025-04-28 16:54 ` Lorenzo Stoakes
2025-04-28 17:07 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
2025-04-25 20:23 ` Peter Xu
2025-04-25 20:36 ` David Hildenbrand
2025-04-28 16:08 ` Peter Xu
2025-04-28 16:16 ` David Hildenbrand
2025-04-28 16:24 ` Peter Xu
2025-04-28 17:23 ` David Hildenbrand
2025-04-28 19:37 ` Lorenzo Stoakes
2025-04-28 19:57 ` Suren Baghdasaryan
2025-04-28 20:23 ` David Hildenbrand
2025-04-28 20:19 ` David Hildenbrand
2025-04-28 19:38 ` Lorenzo Stoakes
2025-04-28 20:00 ` Suren Baghdasaryan
2025-04-28 20:21 ` David Hildenbrand
2025-04-28 20:10 ` Lorenzo Stoakes
2025-05-05 13:00 ` David Hildenbrand [this message]
2025-05-07 13:25 ` David Hildenbrand
2025-05-07 14:27 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
2025-04-28 20:12 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 07/11] mm: remove VM_PAT David Hildenbrand
2025-04-28 20:16 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
2025-04-28 20:18 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
2025-04-28 20:23 ` Lorenzo Stoakes
2025-05-05 12:10 ` David Hildenbrand
2025-05-06 9:30 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
2025-04-28 20:23 ` Lorenzo Stoakes
2025-04-25 8:17 ` [PATCH v1 11/11] mm/io-mapping: " David Hildenbrand
2025-04-28 16:06 ` Lorenzo Stoakes
2025-04-28 16:14 ` David Hildenbrand
2025-04-25 8:54 ` [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Ingo Molnar
2025-04-25 9:27 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7035fb14-c9f6-4281-9f65-b220aaa8f5c3@redhat.com \
--to=david@redhat.com \
--cc=Liam.Howlett@oracle.com \
--cc=airlied@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=hpa@zytor.com \
--cc=intel-gfx@lists.freedesktop.org \
--cc=jani.nikula@linux.intel.com \
--cc=jannh@google.com \
--cc=joonas.lahtinen@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=luto@kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=mingo@redhat.com \
--cc=peterx@redhat.com \
--cc=peterz@infradead.org \
--cc=pfalcato@suse.de \
--cc=rodrigo.vivi@intel.com \
--cc=rostedt@goodmis.org \
--cc=simona@ffwll.ch \
--cc=surenb@google.com \
--cc=tglx@linutronix.de \
--cc=tursulin@ursulin.net \
--cc=vbabka@suse.cz \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox