From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>,
akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
wangkefeng.wang@huawei.com, Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
Date: Wed, 28 May 2025 12:06:07 -0400 [thread overview]
Message-ID: <aDc0b76xSiY4Vrxo@x1.local> (raw)
In-Reply-To: <f605b184-5f8f-4aea-a0e8-840035b81919@redhat.com>
On Wed, May 28, 2025 at 05:29:29PM +0200, David Hildenbrand wrote:
> On 28.05.25 17:25, Peter Xu wrote:
> > On Wed, May 28, 2025 at 05:02:15PM +0200, David Hildenbrand wrote:
> > > On 28.05.25 16:54, Peter Xu wrote:
> > > > [Add Jason]
> > > >
> > > > On Wed, May 28, 2025 at 11:59:56AM +0200, David Hildenbrand wrote:
> > > > > On 28.05.25 10:59, David Hildenbrand wrote:
> > > > > > On 28.05.25 03:56, Jinjiang Tu wrote:
> > > > > > > Syzkaller reports a below BUG:
> > > > > > > ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
> > > > > > > WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > > > > > Modules linked in:
> > > > > > > CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
> > > > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > > > > > RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > > > > > Call Trace:
> > > > > > > <TASK>
> > > > > > > generic_access_phys+0x241/0x480 mm/memory.c:6458
> > > > > > > __access_remote_vm+0x6af/0x970 mm/memory.c:6535
> > > > > > > access_process_vm+0x53/0x80 mm/memory.c:6600
> > > > > > > get_cmdline+0x192/0x380 mm/util.c:1041
> > > > > > > audit_log_proctitle kernel/auditsc.c:1620 [inline]
> > > > > > > audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
> > > > > > > __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
> > > > > > > audit_syscall_exit include/linux/audit.h:356 [inline]
> > > > > > > syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
> > > > > > > __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
> > > > > > > syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
> > > > > > > do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
> > > > > > > entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > > > > > >
> > > > > > > The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
> > > > > > > The special pfn mapping is replaced by anon folios due to COW.
> > > > > > > generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
> > > > > > > thus trigger a WARN_ON.
> > > > > > >
> > > > > > > Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
> > > > > > > COW mappings"). check if the pte is special to reject Cowed anon folios.
> > > > > > >
> > > > > > > Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> > > > > > > ---
> > > > > > > mm/memory.c | 7 +++++++
> > > > > > > 1 file changed, 7 insertions(+)
> > > > > > >
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 49199410805c..e1dac84536ee 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > retry:
> > > > > > > if (follow_pfnmap_start(&args))
> > > > > > > return -EINVAL;
> > > > > > > +
> > > > > > > + /* Never return PFNs of anon folios in COW mappings. */
> > > > > > > + if (!args.special) {
> > > > > > > + follow_pfnmap_end(&args);
> > > > > > > + return -EINVAL;
> > > > > > > + }
> > > > > > > +
> > > > > > > prot = args.pgprot;
> > > > > > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > > > > > writable = args.writable;
> > > > > >
> > > > > > I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
> > > > > >
> > > > > > I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
> > > > > >
> > > > > > In case our arch does not support pte_special, you fix will not catch that case ...
> > > > > >
> > > > > > The following might be better:
> > > > > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 37d8738f5e12e..810adb8d1a53b 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > > > > prot = args.pgprot;
> > > > > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > > > > writable = args.writable;
> > > > > > +
> > > > > > + /* Refuse (refcounted) anonymous pages in CoW mappings. */
> > > > > > + if (is_cow_mapping(vma->vm_flags) &&
> > > > > > + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
> > > > > > + follow_pfnmap_end(&args);
> > > > > > + return -EINVAL;
> > > > > > + }
> > > > > > +
> > > > >
> > > > > Thinking again, we might have a PMD/PUD mapping, so maybe
> > > > > follow_pfnmap_start() should really just refuse any refcounted pages.
> >
> > [1]
> >
> > > >
> > > > We may want to be careful on this.
> > > >
> > > > I feel like we can still potentially break drivers that
> > > > follow_pfnmap_start() used to work on debateable things like RAM page
> > > > injections, unless breaking them is the intention.
> > >
> > > Yes, that all needs a cleanup likely; it's all very confusing and
> > > inconsistent.
> > >
> > > >
> > > > OTOH, I also see at least two in-tree drivers set VM_IO|VM_MIXEDMAP:
> > > >
> > > > *** drivers/gpu/drm/gma500/fbdev.c:
> > > > psb_fbdev_fb_mmap[110] vm_flags_set(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
> > > >
> > > > *** drivers/gpu/drm/omapdrm/omap_gem.c:
> > > > omap_gem_object_mmap[538] vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_IO | VM_MIXEDMAP);
> > > >
> > > > AFAIU, these MIXEDMAP users will still rely on follow_pfnmap_start() to
> > > > work on e.g. RAM pages, because GUP will simply fail them..
> > >
> > > Right.
> > >
> > > VM_IO essentially tells us "don't touch this memory, it might have side
> > > effects", such as MMIO, that's why GUP outright refuses VM_IO VMAs.
> > >
> > > I am not sure why generic_access_phys() should be allowed to ... touch that
> > > memory instead?
> >
> > I'm looking at:
> >
> > commit 28b2ee20c7cba812b6f2ccf6d722cf86d00a84dc
> > Author: Rik van Riel <riel@redhat.com>
> > Date: Wed Jul 23 21:27:05 2008 -0700
> >
> > access_process_vm device memory infrastructure
> >
> > VM_IO is also intentionally mentioned in the doc too:
> >
> > Documentation/filesystems/locking.rst
> >
> > ->access() is called when get_user_pages() fails in
> > access_process_vm(), typically used to debug a process through
> > /proc/pid/mem or ptrace. This function is needed only for
> > VM_IO | VM_PFNMAP VMAs.
> >
> > So it definitely looks like intentional, though I know nothing about PPC
> > Cell SPUs..
>
>
> VM_IO | VM_PFNMAP, I can understand that. It's weird combined with weird.
>
> But the use case for "VM_IO | VM_MIXEDMAP" ?
>
> To be precise, I am questioning if follow_pfnmap_start() should only work on
> ...
>
> PFNMAPs ?
>
> :)
It goes back to the question on whether things will break which used to
work..
I was almost conservative as I know little on driver side, that's also why
I tend to prefer making iov_iter work with ram-mapped VM_PFNMAP vmas too.
After all, AFAIU Linux should try to not break working users; sometimes we
pay for that.
Meanwhile:
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
I'm not confident to blame any driver yet to have those special cases for
VM_PFNMAP, because it only says "managed without struct page", it didn't
say "it must not contain struct page".. Hence it hints the core mm "please
do not manage these mappings with struct page at all". Still sounds fair
contract, even if not ideal.
But yeah, once again I agree that'll be ideal if what you said could happen
some day.
[I wanted to copy Jason but I failed the job; do it this time]
--
Peter Xu
next prev parent reply other threads:[~2025-05-28 16:06 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-28 1:56 Jinjiang Tu
2025-05-28 8:59 ` David Hildenbrand
2025-05-28 9:59 ` David Hildenbrand
2025-05-28 12:14 ` Jinjiang Tu
2025-05-28 14:54 ` Peter Xu
2025-05-28 15:02 ` David Hildenbrand
2025-05-28 15:25 ` Peter Xu
2025-05-28 15:29 ` David Hildenbrand
2025-05-28 16:06 ` Peter Xu [this message]
2025-05-28 16:29 ` Jason Gunthorpe
2025-05-28 17:14 ` Peter Xu
2025-05-28 17:34 ` Jason Gunthorpe
2025-05-28 17:37 ` David Hildenbrand
2025-05-28 17:32 ` David Hildenbrand
2025-05-28 17:47 ` Jason Gunthorpe
2025-05-28 17:59 ` Jason Gunthorpe
2025-05-28 18:03 ` David Hildenbrand
2025-05-28 18:00 ` David Hildenbrand
2025-05-28 18:15 ` Jason Gunthorpe
2025-05-28 18:22 ` David Hildenbrand
2025-05-28 18:29 ` Jason Gunthorpe
2025-05-30 10:04 ` David Hildenbrand
2025-05-28 12:13 ` Jinjiang Tu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aDc0b76xSiY4Vrxo@x1.local \
--to=peterx@redhat.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=jgg@nvidia.com \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhocko@suse.com \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=tujinjiang@huawei.com \
--cc=vbabka@suse.cz \
--cc=wangkefeng.wang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox