* Mapping vmalloc pages to userspace
@ 2024-12-06 16:28 Matthew Wilcox
2024-12-06 21:01 ` Matthew Wilcox
2024-12-10 19:48 ` David Hildenbrand
0 siblings, 2 replies; 3+ messages in thread
From: Matthew Wilcox @ 2024-12-06 16:28 UTC (permalink / raw)
To: linux-mm; +Cc: Uladzislau Rezki, David Hildenbrand, Christoph Hellwig
Today we have a very useful helper, remap_vmalloc_range() (and _partial())
which lets drivers call vmalloc(), then map that memory to userspace.
It does so using vm_insert_page() which ends up calling folio_get() and
folio_add_file_rmap_pte(), so jiggling both the refcount and the mapcount.
As you all know by now, we're looking to eliminate both mapcount and
refcount from struct page. I have four options for consideration, some
of which I like more than others.
1. We could introduce a vmalloc memdesc that has a per-page mapcount and
refcount. This seems like unnecessarily high overhead for a precision
of tracking that is, perhaps, not warranted.
2. We could do no tracking at all of vmalloc pages. Insert the PFNs
of the allocated pages and rely on the driver to track everything
correctly, not freeing the vmalloc allocation until the mmap has been
torn down. This implies not supporting GUP. This option feels risky to
me; we're depending on device driver writers to get this right, and if
they get it wrong, it's quite the UAF hole; letting an attacker get
access to pages which could be allocated to any purpose.
3. Embed a refcount into struct vm_struct. We can support GUP if we want.
Calling GUP bumps the refcount on the entire struct. When the refcount
hits zero, we free the entire allocation. There's no need for a mapcount
or pincount because we don't need to distinguish between temporary and
longterm gups.
4. Introduce an indirection structure between the page and vm_struct which
contains the refcount.
I'm most in favour of #3, but there's probably ramifications I haven't
considered.
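A minimal user-space sketch of option 3 may help make the lifetime rule concrete. It is not kernel code: struct vm_area_model, area_alloc(), area_get() and area_put() are hypothetical names, and a C11 atomic stands in for the kernel's refcount_t. The allocation starts with one reference held by its owner; each mmap or GUP takes another, and whoever drops the last one frees everything.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model of option 3; not the kernel's vm_struct. */
struct vm_area_model {
	atomic_uint refcount;	/* one count for the entire allocation */
	unsigned int nr_pages;
	unsigned char *mem;	/* stand-in for the vmalloc'ed pages */
};

/* Allocate the area with a single reference held by the owner. */
static struct vm_area_model *area_alloc(unsigned int nr_pages)
{
	struct vm_area_model *area = malloc(sizeof(*area));

	if (!area)
		return NULL;
	area->mem = calloc(nr_pages, 4096);
	if (!area->mem) {
		free(area);
		return NULL;
	}
	area->nr_pages = nr_pages;
	atomic_init(&area->refcount, 1);
	return area;
}

/* Taken by a userspace mapping or by GUP: pins the whole area. */
static void area_get(struct vm_area_model *area)
{
	unsigned int old = atomic_fetch_add(&area->refcount, 1);

	assert(old > 0);	/* must not resurrect a freed area */
}

/* Drop one reference; returns true if this call freed the area. */
static bool area_put(struct vm_area_model *area)
{
	if (atomic_fetch_sub(&area->refcount, 1) != 1)
		return false;
	free(area->mem);
	free(area);
	return true;
}
```

In this model the owner's vfree() is just another area_put(), so a driver that frees too early no longer exposes freed pages to a mapping that is still live.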
* Re: Mapping vmalloc pages to userspace
From: Matthew Wilcox @ 2024-12-06 21:01 UTC (permalink / raw)
To: linux-mm; +Cc: Uladzislau Rezki, David Hildenbrand, Christoph Hellwig
On Fri, Dec 06, 2024 at 04:28:17PM +0000, Matthew Wilcox wrote:
> 4. Introduce an indirection structure between the page and vm_struct which
> contains the refcount.
I'm starting to really warm up to this one. There are a number of
places that we allocate "some pages", but want to treat them as a single
object, not just vmalloc. Let's call this a 'scamem', short for
"scattered memory".
But this is going to be challenging. Assuming we want to support GUP,
we need to be able to go from page->scamem [1]. In the skinniest
version of shrinking struct page, we have just 8 bytes per page, and
we need to both store a pointer to the scamem and store information
like node, zone, section for _each_ page. We don't need to worry about
this for folios/slabs/... because all pages in the folio have the same
node/zone/section, so we can store this information once in the folio
and then copy it back to the page on free. We can't do that for scamem
without a (potentially large) allocation. And even if we do something
like:
struct scamem {
	unsigned int nr;
	refcount_t refcount;
	unsigned long flags[];
};
to be able to implement page_to_nid() on a page, we'd have to figure
out which page within the scamem this was. So either we have to give up
on our dream of an 8 byte memdesc, or figure out some other way to do
this.
So what if we store the scamem pointer in vma->vm_file->private_data,
or vma->vm_private_data? That would let us keep the node/section/zone
in the struct page. GUP has the VMA, so this can work.
Yet another possibility would be if we can look up the page's pfn in
some data structure and reconstruct the zone/section/node information at
freeing time. I don't fully understand the meaning of this information,
so I have no idea if this is possible.
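As a rough illustration of the pfn-lookup idea only, the sketch below models "some data structure" as a sorted table of PFN ranges, each owned by one node, searched by binary search. The table layout, struct pfn_range and pfn_range_to_nid() are assumptions made up for this example; whether the kernel can actually reconstruct zone and section information this way is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: PFNs in [start, end) belong to node nid. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
	int nid;
};

/*
 * Look up the node for a PFN in a table sorted by start with
 * non-overlapping ranges. Returns -1 if the PFN is in a hole.
 */
static int pfn_range_to_nid(const struct pfn_range *tbl, size_t n,
			    unsigned long pfn)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < tbl[mid].start)
			hi = mid;
		else if (pfn >= tbl[mid].end)
			lo = mid + 1;
		else
			return tbl[mid].nid;
	}
	return -1;
}
```

The lookup costs O(log n) per page at free time, which is the price of not keeping the node in the per-page 8 bytes.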
My current thought is:
struct scamem {
	unsigned int nr;
	refcount_t refcount;
	struct page *pages[];
};
and changing vm_struct:
-	struct page **pages;
+	struct scamem *scamem;
(I don't think we want to embed it in vm_struct, since we want vm_struct
to have one refcount on scamem, and for the scamem to be freed once its
refcount reaches zero rather than freed as part of vm_struct)
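The lifetime rule in that parenthetical can be sketched in user space as follows. This is a model under stated assumptions, not kernel code: struct page_model and struct vm_struct_model are stand-ins, a C11 atomic replaces refcount_t, and scamem_alloc(), scamem_get(), scamem_put() and vm_struct_release() are hypothetical helper names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct page_model { unsigned long flags; };	/* stand-in for struct page */

struct scamem {
	unsigned int nr;
	atomic_uint refcount;			/* models refcount_t */
	struct page_model *pages[];
};

struct vm_struct_model {			/* stand-in for vm_struct */
	struct scamem *scamem;			/* holds exactly one reference */
};

/* Drop one reference; returns true if this call freed the scamem. */
static bool scamem_put(struct scamem *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) != 1)
		return false;
	for (unsigned int i = 0; i < s->nr; i++)
		free(s->pages[i]);
	free(s);
	return true;
}

/* Allocate nr pages and return the scamem with refcount 1. */
static struct scamem *scamem_alloc(unsigned int nr)
{
	struct scamem *s = malloc(sizeof(*s) + nr * sizeof(s->pages[0]));

	if (!s)
		return NULL;
	atomic_init(&s->refcount, 1);
	for (s->nr = 0; s->nr < nr; s->nr++) {
		s->pages[s->nr] = calloc(1, sizeof(struct page_model));
		if (!s->pages[s->nr]) {
			scamem_put(s);		/* frees the pages set so far */
			return NULL;
		}
	}
	return s;
}

/* Taken by GUP or a userspace mapping. */
static void scamem_get(struct scamem *s)
{
	unsigned int old = atomic_fetch_add(&s->refcount, 1);

	assert(old > 0);
}

/* vfree() analogue: the vm_struct goes away, the scamem may outlive it. */
static bool vm_struct_release(struct vm_struct_model *vm)
{
	struct scamem *s = vm->scamem;

	vm->scamem = NULL;
	return scamem_put(s);
}
```

Because the vm_struct only owns a reference, releasing it while GUP still holds one leaves the pages intact until the final scamem_put().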
* Re: Mapping vmalloc pages to userspace
From: David Hildenbrand @ 2024-12-10 19:48 UTC (permalink / raw)
To: Matthew Wilcox, linux-mm; +Cc: Uladzislau Rezki, Christoph Hellwig
On 06.12.24 17:28, Matthew Wilcox wrote:
Sorry for the late reply, interesting topic.
> Today we have a very useful helper, remap_vmalloc_range() (and _partial())
> which lets drivers call vmalloc(), then map that memory to userspace.
> It does so using vm_insert_page() which ends up calling folio_get() and
> folio_add_file_rmap_pte(), so jiggling both the refcount and the mapcount.
>
> As you all know by now, we're looking to eliminate both mapcount and
> refcount from struct page. I have four options for consideration, some
> of which I like more than others.
>
> 1. We could introduce a vmalloc memdesc that has a per-page mapcount and
> refcount. This seems like unnecessarily high overhead for a precision
> of tracking that is, perhaps, not warranted.
Especially the mapcount is probably of no use at all here. As discussed
with Lorenzo recently, I assume we only perform this in vm_insert_page()
because there is (was) no easy way to distinguish these pages on the zap
path to *not* decrement the refcounts.
With memdescs that would be easy (later: no folio -> no mapcount changes).
>
> 2. We could do no tracking at all of vmalloc pages. Insert the PFNs
> of the allocated pages and rely on the driver to track everything
> correctly, not freeing the vmalloc allocation until the mmap has been
> torn down. This implies not supporting GUP. This option feels risky to
> me; we're depending on device driver writers to get this right, and if
> they get it wrong, it's quite the UAF hole; letting an attacker get
> access to pages which could be allocated to any purpose.
Fully agreed.
>
> 3. Embed a refcount into struct vm_struct. We can support GUP if we want.
> Calling GUP bumps the refcount on the entire struct. When the refcount
> hits zero, we free the entire allocation. There's no need for a mapcount
> or pincount because we don't need to distinguish between temporary and
> longterm gups.
The pincount+mapcount should be specific to folios, agreed.
>
> 4. Introduce an indirection structure between the page and vm_struct which
> contains the refcount.
>
>
> I'm most in favour of #3, but there's probably ramifications I haven't
> considered.
I wonder if #1 with only the refcount would be doable. Maybe not a
vmalloc memdesc, but a more generic kmem memdesc.
Because for "ordinary" pages that a driver allocated I suspect we might
want to do the same.
But #3 sounds interesting as well. In any case, we'll have to teach
vm_normal_page() users that blindly assume that they get a folio, that
they could get something different instead. Using memdescs for that
sounds reasonable.
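To illustrate what "no folio -> no mapcount changes" could look like for a caller, here is a small user-space sketch. It assumes, purely for illustration, that a memdesc is a pointer-sized word whose low 4 bits carry a type tag; the tag values, struct folio_model, struct kmem_model and every function name below are hypothetical and do not describe an existing kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags kept in the low bits of the memdesc word. */
enum memdesc_kind { MEMDESC_FOLIO = 1, MEMDESC_KMEM = 2 };
#define MEMDESC_KIND_MASK ((uintptr_t)0xf)

/* 16-byte alignment keeps the low 4 bits of the address free for the tag. */
struct folio_model { _Alignas(16) int mapcount; };
struct kmem_model  { _Alignas(16) int refcount; };

/* Pack a descriptor pointer and its kind into one word. */
static uintptr_t memdesc_make(void *desc, enum memdesc_kind kind)
{
	assert(((uintptr_t)desc & MEMDESC_KIND_MASK) == 0);
	return (uintptr_t)desc | (uintptr_t)kind;
}

static enum memdesc_kind memdesc_get_kind(uintptr_t memdesc)
{
	return (enum memdesc_kind)(memdesc & MEMDESC_KIND_MASK);
}

/* Returns the folio, or NULL if the page is not part of a folio. */
static struct folio_model *memdesc_folio(uintptr_t memdesc)
{
	if (memdesc_get_kind(memdesc) != MEMDESC_FOLIO)
		return NULL;
	return (struct folio_model *)(memdesc & ~MEMDESC_KIND_MASK);
}

/* Zap-path analogue: only folios have a mapcount to drop. */
static void zap_one(uintptr_t memdesc)
{
	struct folio_model *folio = memdesc_folio(memdesc);

	if (folio)
		folio->mapcount--;
	/* Non-folio memory: nothing to adjust here. */
}
```

A caller that used to assume a folio now has to handle the NULL case explicitly, which is the behavioural change vm_normal_page() users would need to learn.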
--
Cheers,
David / dhildenb