* Avoiding allocation of unused shmem page
@ 2022-10-20 20:14 Matthew Wilcox
2022-10-20 21:10 ` Peter Xu
2022-10-20 22:17 ` Yang Shi
0 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2022-10-20 20:14 UTC (permalink / raw)
To: linux-mm; +Cc: Hugh Dickins, David Hildenbrand
In yesterday's call, David brought up the case where we fallocate a file
in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over
a hole. That currently causes shmem to allocate a page, zero-fill it,
then COW it, resulting in two pages being allocated when only the
COW page really needs to be allocated.
The path we currently take through the MM when we take the page fault
looks like this (correct me if I'm wrong ...):
handle_mm_fault()
 __handle_mm_fault()
  handle_pte_fault()
   do_fault()
    do_cow_fault()
     __do_fault()
      vm_ops->fault()
... which is where we come into shmem_fault(). Apart from the
horrendous hole-punch handling case, shmem_fault() is quite simple:
	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
				  gfp, vma, vmf, &ret);
	if (err)
		return vmf_error(err);
	vmf->page = folio_file_page(folio, vmf->pgoff);
	return ret;
What we could do here is detect this case. Something like:
	enum sgp_type sgp = SGP_CACHE;

	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
		sgp = SGP_READ;
	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, sgp, gfp,
				  vma, vmf, &ret);
	if (err)
		return vmf_error(err);
	if (folio)
		vmf->page = folio_file_page(folio, vmf->pgoff);
	else
		vmf->page = NULL;
	return ret;
and change do_cow_fault() like this:
+++ b/mm/memory.c
@@ -4575,12 +4575,17 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
-	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
+	if (vmf->page)
+		copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
+	else
+		clear_user_highpage(vmf->cow_page, vmf->address);
 	__SetPageUptodate(vmf->cow_page);
 
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
-	put_page(vmf->page);
+	if (vmf->page) {
+		unlock_page(vmf->page);
+		put_page(vmf->page);
+	}
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 	return ret;
... I wrote the code directly in my email client; definitely not
compile-tested. But if this situation is causing a real problem for
someone, this would be a quick fix for them.
Is this a real problem or just intellectual curiosity? Also, does
this need support for THPs being created directly, or is khugepaged
fixing it up afterwards good enough?
^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding allocation of unused shmem page
  2022-10-20 20:14 Avoiding allocation of unused shmem page Matthew Wilcox
@ 2022-10-20 21:10 ` Peter Xu
  2022-10-21  7:23   ` David Hildenbrand
  2022-10-20 22:17 ` Yang Shi
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2022-10-20 21:10 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-mm, Hugh Dickins, David Hildenbrand

On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote:
> In yesterday's call, David brought up the case where we fallocate a file
> in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over
> a hole. That currently causes shmem to allocate a page, zero-fill it,
> then COW it, resulting in two pages being allocated when only the
> COW page really needs to be allocated.

[...]

> What we could do here is detect this case. Something like:
>
> 	enum sgp_type sgp = SGP_CACHE;
>
> 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> 		sgp = SGP_READ;

Yes this will start to save the space, but just to mention this may start
to break anything that will still depend on the pagecache to work.  E.g.,
it'll change behavior if the vma is registered with uffd missing mode;
we'll start to lose MISSING events for these private mappings.  Not sure
whether there're other side effects.

The zero-page approach will not have such issue as long as the pagecache is
still filled with something.

[...]

> ... I wrote the code directly in my email client; definitely not
> compile-tested.  But if this situation is causing a real problem for
> someone, this would be a quick fix for them.
>
> Is this a real problem or just intellectual curiosity?

For me it's pure curiosity when I was asking this question; I don't have a
production environment that can directly benefit from this.

For real users I'd expect private shmem will always be mapped on meaningful
(aka, non-zero) shared pages just to have their own copy, but no better
knowledge than that.

> Also, does this need support for THPs being created directly, or is
> khugepaged fixing it up afterwards good enough?

Thanks,

-- 
Peter Xu
* Re: Avoiding allocation of unused shmem page
  2022-10-20 21:10 ` Peter Xu
@ 2022-10-21  7:23   ` David Hildenbrand
  2022-10-21 14:01     ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-10-21  7:23 UTC (permalink / raw)
To: Peter Xu, Matthew Wilcox; +Cc: linux-mm, Hugh Dickins

On 20.10.22 23:10, Peter Xu wrote:
> On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote:

[...]

>> What we could do here is detect this case. Something like:
>>
>> 	enum sgp_type sgp = SGP_CACHE;
>>
>> 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
>> 		sgp = SGP_READ;
>
> Yes this will start to save the space, but just to mention this may start
> to break anything that will still depend on the pagecache to work.  E.g.,
> it'll change behavior if the vma is registered with uffd missing mode;
> we'll start to lose MISSING events for these private mappings.  Not sure
> whether there're other side effects.

I don't follow, can you elaborate?

hugetlb doesn't perform this kind of unnecessary allocation and should be
fine in regards to uffd. Why should it matter here and how exactly would a
problematic sequence look like?

> The zero-page approach will not have such issue as long as the pagecache is
> still filled with something.

Having the shared zeropage available would be nice. But I understand that
it adds quite some complexity -- on write fault, we have to invalidate all
these zeropage mappings.

[...]

>> Is this a real problem or just intellectual curiosity?
>
> For me it's pure curiosity when I was asking this question; I don't have a
> production environment that can directly benefit from this.
>
> For real users I'd expect private shmem will always be mapped on meaningful
> (aka, non-zero) shared pages just to have their own copy, but no better
> knowledge than that.

There is an easy way to trigger this from QEMU, and we've had customers
running into this:

$ grep -E "(Anon|Shmem)" /proc/meminfo
AnonPages:       4097900 kB
Shmem:           1242364 kB

$ qemu-system-x86_64 -object memory-backend-memfd,id=tmp,share=off,size=4G,prealloc=on -S --nographic &

$ grep -E "(Anon|Shmem)" /proc/meminfo
AnonPages:       8296696 kB
Shmem:           5434800 kB

I recall it's fairly easy to get wrong from Libvirt when starting a VM.

We use an empty memfd and map it private. Each page we touch (especially
write to) ends up allocating shmem memory.

Note that figuring out the write side ("write to hole via private mapping")
is only part of the story. For example, by dumping/migrating the VM
(reading all memory) we can easily read yet unpopulated memory and allocate
a shmem page as well; once the VM would write to it, we'd allocate an
additional private page.

We'd need support for the shared zeropage to handle that better -- which
would implicitly also handle shared mappings of shmem better --
dumping/migrating a VM would not allocate a lot of shmem pages filled with
zeroes.

-- 
Thanks,

David / dhildenb
* Re: Avoiding allocation of unused shmem page
  2022-10-21  7:23   ` David Hildenbrand
@ 2022-10-21 14:01     ` Peter Xu
  2022-10-21 14:10       ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2022-10-21 14:01 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins

On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote:
> On 20.10.22 23:10, Peter Xu wrote:

[...]

>> Yes this will start to save the space, but just to mention this may start
>> to break anything that will still depend on the pagecache to work.  E.g.,
>> it'll change behavior if the vma is registered with uffd missing mode;
>> we'll start to lose MISSING events for these private mappings.  Not sure
>> whether there're other side effects.
>
> I don't follow, can you elaborate?
>
> hugetlb doesn't perform this kind of unnecessary allocation and should be
> fine in regards to uffd. Why should it matter here and how exactly would a
> problematic sequence look like?

Hugetlb is special because hugetlb detects pte first and relies on pte at
least for uffd.  shmem is not.

Feel free to also reference the recent fix which relies on the stable
hugetlb pte with commit 2ea7ff1e39cbe375.

>> The zero-page approach will not have such issue as long as the pagecache is
>> still filled with something.
>
> Having the shared zeropage available would be nice. But I understand that
> it adds quite some complexity -- on write fault, we have to invalidate all
> these zeropage mappings.

Right, I didn't think further than that, if to do so the write will need
pagecache update, and we'd need to properly rmap walk and drop ptes that
referencing to the zero page, just leave the CoWed pages alone.

What I wanted to express is only that the other approach will not suffer
from this specific issue as long as it was still pagecache-based.

[...]

> There is an easy way to trigger this from QEMU, and we've had
> customers running into this:

Can the customer simply set shared=on?

[...]

> We'd need support for the shared zeropage to handle that better -- which
> would implicitly also handle shared mappings of shmem better --
> dumping/migrating a VM would not allocate a lot of shmem pages filled
> with zeroes.

Yes.  I just had a vague memory that zeropage for shmem used to exist for a
while but went away.  But I might have had it wrong.

-- 
Peter Xu
* Re: Avoiding allocation of unused shmem page
  2022-10-21 14:01     ` Peter Xu
@ 2022-10-21 14:10       ` David Hildenbrand
  2022-10-21 14:28         ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-10-21 14:10 UTC (permalink / raw)
To: Peter Xu; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins

On 21.10.22 16:01, Peter Xu wrote:
> On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote:

[...]

>>> Yes this will start to save the space, but just to mention this may start
>>> to break anything that will still depend on the pagecache to work.  E.g.,
>>> it'll change behavior if the vma is registered with uffd missing mode;
>>> we'll start to lose MISSING events for these private mappings.  Not sure
>>> whether there're other side effects.
>>
>> I don't follow, can you elaborate?
>>
>> hugetlb doesn't perform this kind of unnecessary allocation and should be
>> fine in regards to uffd. Why should it matter here and how exactly would a
>> problematic sequence look like?
>
> Hugetlb is special because hugetlb detects pte first and relies on pte at
> least for uffd.  shmem is not.
>
> Feel free to also reference the recent fix which relies on the stable
> hugetlb pte with commit 2ea7ff1e39cbe375.

Sorry to be dense here, but I don't follow how that relates.

Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd
missing events on that mapping.

Assume we get a page fault on a hole. We detect no page is mapped and check
if the page cache has a page mapped -- which is also not the case, because
there is a hole.

So we notify uffd.

Uffd will place a page. It should *not* touch the page cache and only
insert that page into the page table -- otherwise we'd be violating
MAP_PRIVATE semantics.

What am I missing?

[...]

>> There is an easy way to trigger this from QEMU, and we've had
>> customers running into this:
>
> Can the customer simply set shared=on?

Of course they can. It rather comes with a surprise for them, because --
for now -- we're not even warning that this most probably doesn't make too
much sense.

-- 
Thanks,

David / dhildenb
* Re: Avoiding allocation of unused shmem page
  2022-10-21 14:10       ` David Hildenbrand
@ 2022-10-21 14:28         ` Peter Xu
  2022-10-21 14:45           ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2022-10-21 14:28 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins

On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote:

[...]

> Sorry to be dense here, but I don't follow how that relates.
>
> Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd
> missing events on that mapping.
>
> Assume we get a page fault on a hole. We detect no page is mapped and check
> if the page cache has a page mapped -- which is also not the case, because
> there is a hole.
>
> So we notify uffd.
>
> Uffd will place a page. It should *not* touch the page cache and only
> insert that page into the page table -- otherwise we'd be violating
> MAP_PRIVATE semantics.

That's actually exactly what we do right now... we insert into page cache
for the shmem.  See shmem_mfill_atomic_pte().

Why it violates MAP_PRIVATE?  Private pages only guarantee the exclusive
ownership of pages, I don't see why it should restrict uffd behavior.  Uffd
missing mode (afaiu) is defined to resolve page cache missings in this
case.  Hugetlb is special but not shmem IMO comparing to most of the rest
of the file systems.

> What am I missing?

[...]

> Of course they can. It rather comes with a surprise for them, because --
> for now -- we're not even warning that this most probably doesn't make too
> much sense.

Right, some warning message could be helpful from qemu, but still not
really required IMO, also we shouldn't assume that'll always happen because
it's really impl detail of OS.  It's the same as someone wrote a program
that maps private memfd using shmem, we don't throw errors to them either.

It's just a behavior of the OS underneath, and maybe one day it'll stop
consuming twice the required size and it'll be transparent to apps.  When
that (if it will...) happens the error message could be misleading.

-- 
Peter Xu
* Re: Avoiding allocation of unused shmem page
  2022-10-21 14:28         ` Peter Xu
@ 2022-10-21 14:45           ` David Hildenbrand
  2022-10-21 15:08             ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-10-21 14:45 UTC (permalink / raw)
To: Peter Xu; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins

On 21.10.22 16:28, Peter Xu wrote:

[...]

>> Uffd will place a page. It should *not* touch the page cache and only
>> insert that page into the page table -- otherwise we'd be violating
>> MAP_PRIVATE semantics.
>
> That's actually exactly what we do right now... we insert into page cache
> for the shmem.  See shmem_mfill_atomic_pte().
>
> Why it violates MAP_PRIVATE?  Private pages only guarantee the exclusive
> ownership of pages, I don't see why it should restrict uffd behavior.  Uffd
> missing mode (afaiu) is defined to resolve page cache missings in this
> case.  Hugetlb is special but not shmem IMO comparing to most of the rest
> of the file systems.

If a write (or uffd placement) via a MAP_PRIVATE mapping results in other
shared/private mappings observing these modifications, you have a clear
violation of MAP_PRIVATE semantics.

-- 
Thanks,

David / dhildenb
* Re: Avoiding allocation of unused shmem page 2022-10-21 14:45 ` David Hildenbrand @ 2022-10-21 15:08 ` Peter Xu 2022-10-21 15:17 ` David Hildenbrand 0 siblings, 1 reply; 13+ messages in thread From: Peter Xu @ 2022-10-21 15:08 UTC (permalink / raw) To: David Hildenbrand; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins On Fri, Oct 21, 2022 at 04:45:27PM +0200, David Hildenbrand wrote: > On 21.10.22 16:28, Peter Xu wrote: > > On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote: > > > On 21.10.22 16:01, Peter Xu wrote: > > > > On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote: > > > > > On 20.10.22 23:10, Peter Xu wrote: > > > > > > On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote: > > > > > > > In yesterday's call, David brought up the case where we fallocate a file > > > > > > > in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over > > > > > > > a hole. That currently causes shmem to allocate a page, zero-fill it, > > > > > > > then COW it, resulting in two pages being allocated when only the > > > > > > > COW page really needs to be allocated. > > > > > > > > > > > > > > The path we currently take through the MM when we take the page fault > > > > > > > looks like this (correct me if I'm wrong ...): > > > > > > > > > > > > > > handle_mm_fault() > > > > > > > __handle_mm_fault() > > > > > > > handle_pte_fault() > > > > > > > do_fault() > > > > > > > do_cow_fault() > > > > > > > __do_fault() > > > > > > > vm_ops->fault() > > > > > > > > > > > > > > ... which is where we come into shmem_fault(). 
Apart from the > > > > > > > horrendous hole-punch handling case, shmem_fault() is quite simple: > > > > > > > > > > > > > > err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, > > > > > > > gfp, vma, vmf, &ret); > > > > > > > if (err) > > > > > > > return vmf_error(err); > > > > > > > vmf->page = folio_file_page(folio, vmf->pgoff); > > > > > > > return ret; > > > > > > > > > > > > > > What we could do here is detect this case. Something like: > > > > > > > > > > > > > > enum sgp_type sgp = SGP_CACHE; > > > > > > > > > > > > > > if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) > > > > > > > sgp = SGP_READ; > > > > > > > > > > > > Yes this will start to save the space, but just to mention this may start > > > > > > to break anything that will still depend on the pagecache to work. E.g., > > > > > > it'll change behavior if the vma is registered with uffd missing mode; > > > > > > we'll start to lose MISSING events for these private mappings. Not sure > > > > > > whether there're other side effects. > > > > > > > > > > I don't follow, can you elaborate? > > > > > > > > > > hugetlb doesn't perform this kind of unnecessary allocation and should be fine in regards to uffd. Why should it matter here and how exactly would a problematic sequence look like? > > > > > > > > Hugetlb is special because hugetlb detects pte first and relies on pte at > > > > least for uffd. shmem is not. > > > > > > > > Feel free to also reference the recent fix which relies on the stable > > > > hugetlb pte with commit 2ea7ff1e39cbe375. > > > > > > Sorry to be dense here, but I don't follow how that relates. > > > > > > Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd > > > missing events on that mapping. > > > > > > Assume we get a page fault on a hole. We detect no page is mapped and check > > > if the page cache has a page mapped -- which is also not the case, because > > > there is a hole. > > > > > > So we notify uffd. 
> > >
> > > Uffd will place a page. It should *not* touch the page cache and only insert
> > > that page into the page table -- otherwise we'd be violating MAP_PRIVATE
> > > semantics.
> >
> > That's actually exactly what we do right now... we insert into page cache
> > for the shmem. See shmem_mfill_atomic_pte().
> >
> > Why it violates MAP_PRIVATE? Private pages only guarantee the exclusive
> > ownership of pages, I don't see why it should restrict uffd behavior. Uffd
> > missing mode (afaiu) is defined to resolve page cache missings in this
> > case. Hugetlb is special but not shmem IMO comparing to most of the rest
> > of the file systems.
>
> If a write (or uffd placement) via a MAP_PRIVATE mapping results in other
> shared/private mappings from observing these modifications, you have a clear
> violation of MAP_PRIVATE semantics.

I think I understand what you meant, but just to mention again that I think
it's a matter of how we defined the uffd missing semantics when shmem
missing mode was introduced years ago. It does not need to be the same
semantics as writing directly to a private mapping.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Avoiding allocation of unused shmem page 2022-10-21 15:08 ` Peter Xu @ 2022-10-21 15:17 ` David Hildenbrand 2022-10-21 16:01 ` Peter Xu 0 siblings, 1 reply; 13+ messages in thread From: David Hildenbrand @ 2022-10-21 15:17 UTC (permalink / raw) To: Peter Xu; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins On 21.10.22 17:08, Peter Xu wrote: > On Fri, Oct 21, 2022 at 04:45:27PM +0200, David Hildenbrand wrote: >> On 21.10.22 16:28, Peter Xu wrote: >>> On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote: >>>> On 21.10.22 16:01, Peter Xu wrote: >>>>> On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote: >>>>>> On 20.10.22 23:10, Peter Xu wrote: >>>>>>> On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote: >>>>>>>> In yesterday's call, David brought up the case where we fallocate a file >>>>>>>> in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over >>>>>>>> a hole. That currently causes shmem to allocate a page, zero-fill it, >>>>>>>> then COW it, resulting in two pages being allocated when only the >>>>>>>> COW page really needs to be allocated. >>>>>>>> >>>>>>>> The path we currently take through the MM when we take the page fault >>>>>>>> looks like this (correct me if I'm wrong ...): >>>>>>>> >>>>>>>> handle_mm_fault() >>>>>>>> __handle_mm_fault() >>>>>>>> handle_pte_fault() >>>>>>>> do_fault() >>>>>>>> do_cow_fault() >>>>>>>> __do_fault() >>>>>>>> vm_ops->fault() >>>>>>>> >>>>>>>> ... which is where we come into shmem_fault(). Apart from the >>>>>>>> horrendous hole-punch handling case, shmem_fault() is quite simple: >>>>>>>> >>>>>>>> err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, >>>>>>>> gfp, vma, vmf, &ret); >>>>>>>> if (err) >>>>>>>> return vmf_error(err); >>>>>>>> vmf->page = folio_file_page(folio, vmf->pgoff); >>>>>>>> return ret; >>>>>>>> >>>>>>>> What we could do here is detect this case. 
Something like: >>>>>>>> >>>>>>>> enum sgp_type sgp = SGP_CACHE; >>>>>>>> >>>>>>>> if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) >>>>>>>> sgp = SGP_READ; >>>>>>> >>>>>>> Yes this will start to save the space, but just to mention this may start >>>>>>> to break anything that will still depend on the pagecache to work. E.g., >>>>>>> it'll change behavior if the vma is registered with uffd missing mode; >>>>>>> we'll start to lose MISSING events for these private mappings. Not sure >>>>>>> whether there're other side effects. >>>>>> >>>>>> I don't follow, can you elaborate? >>>>>> >>>>>> hugetlb doesn't perform this kind of unnecessary allocation and should be fine in regards to uffd. Why should it matter here and how exactly would a problematic sequence look like? >>>>> >>>>> Hugetlb is special because hugetlb detects pte first and relies on pte at >>>>> least for uffd. shmem is not. >>>>> >>>>> Feel free to also reference the recent fix which relies on the stable >>>>> hugetlb pte with commit 2ea7ff1e39cbe375. >>>> >>>> Sorry to be dense here, but I don't follow how that relates. >>>> >>>> Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd >>>> missing events on that mapping. >>>> >>>> Assume we get a page fault on a hole. We detect no page is mapped and check >>>> if the page cache has a page mapped -- which is also not the case, because >>>> there is a hole. >>>> >>>> So we notify uffd. >>>> >>>> Uffd will place a page. It should *not* touch the page cache and only insert >>>> that page into the page table -- otherwise we'd be violating MAP_PRIVATE >>>> semantics. >>> >>> That's actually exactly what we do right now... we insert into page cache >>> for the shmem. See shmem_mfill_atomic_pte(). >>> >>> Why it violates MAP_PRIVATE? Private pages only guarantee the exclusive >>> ownership of pages, I don't see why it should restrict uffd behavior. 
Uffd
>>> missing mode (afaiu) is defined to resolve page cache missings in this
>>> case. Hugetlb is special but not shmem IMO comparing to most of the rest
>>> of the file systems.
>>
>> If a write (or uffd placement) via a MAP_PRIVATE mapping results in other
>> shared/private mappings from observing these modifications, you have a clear
>> violation of MAP_PRIVATE semantics.
>
> I think I understand what you meant, but just to mention again that I think
> it's a matter of how we defined the uffd missing semantics when shmem
> missing mode was introduced years ago. It does not need to be the same
> semantic as writting directly to a private mapping.
>

I think uffd does exactly the right thing in mfill_atomic_pte:

	/*
	 * The normal page fault path for a shmem will invoke the
	 * fault, fill the hole in the file and COW it right away. The
	 * result generates plain anonymous memory. So when we are
	 * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll
	 * generate anonymous memory directly without actually filling
	 * the hole. For the MAP_PRIVATE case the robustness check
	 * only happens in the pagetable (to verify it's still none)
	 * and not in the radix tree.
	 */
	if (!(dst_vma->vm_flags & VM_SHARED)) {
		if (mode == MCOPY_ATOMIC_NORMAL)
			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
					       dst_addr, src_addr, page,
					       wp_copy);
		else
			err = mfill_zeropage_pte(dst_mm, dst_pmd,
						 dst_vma, dst_addr);
	} else {
		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
					     dst_addr, src_addr,
					     mode != MCOPY_ATOMIC_NORMAL,
					     wp_copy, page);
	}

Unless we have a writable shared mapping, we end up not touching the pagecache.

After what I understand from your last message (maybe I understood it wrong),
I tried exploiting uffd behavior by writing into a hole of a file without
write permissions using uffd. I failed because it does the right thing ;)

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Avoiding allocation of unused shmem page 2022-10-21 15:17 ` David Hildenbrand @ 2022-10-21 16:01 ` Peter Xu 2022-10-21 16:19 ` David Hildenbrand 0 siblings, 1 reply; 13+ messages in thread From: Peter Xu @ 2022-10-21 16:01 UTC (permalink / raw) To: David Hildenbrand; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins On Fri, Oct 21, 2022 at 05:17:08PM +0200, David Hildenbrand wrote: > On 21.10.22 17:08, Peter Xu wrote: > > On Fri, Oct 21, 2022 at 04:45:27PM +0200, David Hildenbrand wrote: > > > On 21.10.22 16:28, Peter Xu wrote: > > > > On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote: > > > > > On 21.10.22 16:01, Peter Xu wrote: > > > > > > On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote: > > > > > > > On 20.10.22 23:10, Peter Xu wrote: > > > > > > > > On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote: > > > > > > > > > In yesterday's call, David brought up the case where we fallocate a file > > > > > > > > > in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over > > > > > > > > > a hole. That currently causes shmem to allocate a page, zero-fill it, > > > > > > > > > then COW it, resulting in two pages being allocated when only the > > > > > > > > > COW page really needs to be allocated. > > > > > > > > > > > > > > > > > > The path we currently take through the MM when we take the page fault > > > > > > > > > looks like this (correct me if I'm wrong ...): > > > > > > > > > > > > > > > > > > handle_mm_fault() > > > > > > > > > __handle_mm_fault() > > > > > > > > > handle_pte_fault() > > > > > > > > > do_fault() > > > > > > > > > do_cow_fault() > > > > > > > > > __do_fault() > > > > > > > > > vm_ops->fault() > > > > > > > > > > > > > > > > > > ... which is where we come into shmem_fault(). 
Apart from the > > > > > > > > > horrendous hole-punch handling case, shmem_fault() is quite simple: > > > > > > > > > > > > > > > > > > err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, > > > > > > > > > gfp, vma, vmf, &ret); > > > > > > > > > if (err) > > > > > > > > > return vmf_error(err); > > > > > > > > > vmf->page = folio_file_page(folio, vmf->pgoff); > > > > > > > > > return ret; > > > > > > > > > > > > > > > > > > What we could do here is detect this case. Something like: > > > > > > > > > > > > > > > > > > enum sgp_type sgp = SGP_CACHE; > > > > > > > > > > > > > > > > > > if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) > > > > > > > > > sgp = SGP_READ; > > > > > > > > > > > > > > > > Yes this will start to save the space, but just to mention this may start > > > > > > > > to break anything that will still depend on the pagecache to work. E.g., > > > > > > > > it'll change behavior if the vma is registered with uffd missing mode; > > > > > > > > we'll start to lose MISSING events for these private mappings. Not sure > > > > > > > > whether there're other side effects. > > > > > > > > > > > > > > I don't follow, can you elaborate? > > > > > > > > > > > > > > hugetlb doesn't perform this kind of unnecessary allocation and should be fine in regards to uffd. Why should it matter here and how exactly would a problematic sequence look like? > > > > > > > > > > > > Hugetlb is special because hugetlb detects pte first and relies on pte at > > > > > > least for uffd. shmem is not. > > > > > > > > > > > > Feel free to also reference the recent fix which relies on the stable > > > > > > hugetlb pte with commit 2ea7ff1e39cbe375. > > > > > > > > > > Sorry to be dense here, but I don't follow how that relates. > > > > > > > > > > Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd > > > > > missing events on that mapping. > > > > > > > > > > Assume we get a page fault on a hole. 
We detect no page is mapped and check > > > > > if the page cache has a page mapped -- which is also not the case, because > > > > > there is a hole. > > > > > > > > > > So we notify uffd. > > > > > > > > > > Uffd will place a page. It should *not* touch the page cache and only insert > > > > > that page into the page table -- otherwise we'd be violating MAP_PRIVATE > > > > > semantics. > > > > > > > > That's actually exactly what we do right now... we insert into page cache > > > > for the shmem. See shmem_mfill_atomic_pte(). > > > > > > > > Why it violates MAP_PRIVATE? Private pages only guarantee the exclusive > > > > ownership of pages, I don't see why it should restrict uffd behavior. Uffd > > > > missing mode (afaiu) is defined to resolve page cache missings in this > > > > case. Hugetlb is special but not shmem IMO comparing to most of the rest > > > > of the file systems. > > > > > > If a write (or uffd placement) via a MAP_PRIVATE mapping results in other > > > shared/private mappings from observing these modifications, you have a clear > > > violation of MAP_PRIVATE semantics. > > > > I think I understand what you meant, but just to mention again that I think > > it's a matter of how we defined the uffd missing semantics when shmem > > missing mode was introduced years ago. It does not need to be the same > > semantic as writting directly to a private mapping. > > > > I think uffd does exactly the right thing in mfill_atomic_pte: > > /* > * The normal page fault path for a shmem will invoke the > * fault, fill the hole in the file and COW it right away. The > * result generates plain anonymous memory. So when we are > * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll > * generate anonymous memory directly without actually filling > * the hole. For the MAP_PRIVATE case the robustness check > * only happens in the pagetable (to verify it's still none) > * and not in the radix tree. 
> */
> 	if (!(dst_vma->vm_flags & VM_SHARED)) {
> 		if (mode == MCOPY_ATOMIC_NORMAL)
> 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> 					       dst_addr, src_addr, page,
> 					       wp_copy);
> 		else
> 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> 						 dst_vma, dst_addr);
> 	} else {
> 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
> 					     dst_addr, src_addr,
> 					     mode != MCOPY_ATOMIC_NORMAL,
> 					     wp_copy, page);
> 	}
>
> Unless we have a writable shared mapping, we end up not touching the pagecache.
>
> After what I understand from your last message (maybe I understood it wrong),
> I tried exploiting uffd behavior by writing into a hole of a file without
> write permissions using uffd. I failed because it does the right thing ;)

Very interesting to learn this, thanks for the pointer, David. :)
Definitely helpful to me in understanding the vma security model better.

Note, though, that it is a different topic if we go back to the original
problem we're discussing - the fake-read approach for shmem will still keep
the hole in the file, which will still prevent MISSING messages from being
generated.

That said, I don't really know whether there can be a real impact on any
uffd users, or on anything else that similarly accesses the file cache.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Avoiding allocation of unused shmem page 2022-10-21 16:01 ` Peter Xu @ 2022-10-21 16:19 ` David Hildenbrand 2022-10-21 16:26 ` David Hildenbrand 0 siblings, 1 reply; 13+ messages in thread From: David Hildenbrand @ 2022-10-21 16:19 UTC (permalink / raw) To: Peter Xu; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins On 21.10.22 18:01, Peter Xu wrote: > On Fri, Oct 21, 2022 at 05:17:08PM +0200, David Hildenbrand wrote: >> On 21.10.22 17:08, Peter Xu wrote: >>> On Fri, Oct 21, 2022 at 04:45:27PM +0200, David Hildenbrand wrote: >>>> On 21.10.22 16:28, Peter Xu wrote: >>>>> On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote: >>>>>> On 21.10.22 16:01, Peter Xu wrote: >>>>>>> On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote: >>>>>>>> On 20.10.22 23:10, Peter Xu wrote: >>>>>>>>> On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote: >>>>>>>>>> In yesterday's call, David brought up the case where we fallocate a file >>>>>>>>>> in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over >>>>>>>>>> a hole. That currently causes shmem to allocate a page, zero-fill it, >>>>>>>>>> then COW it, resulting in two pages being allocated when only the >>>>>>>>>> COW page really needs to be allocated. >>>>>>>>>> >>>>>>>>>> The path we currently take through the MM when we take the page fault >>>>>>>>>> looks like this (correct me if I'm wrong ...): >>>>>>>>>> >>>>>>>>>> handle_mm_fault() >>>>>>>>>> __handle_mm_fault() >>>>>>>>>> handle_pte_fault() >>>>>>>>>> do_fault() >>>>>>>>>> do_cow_fault() >>>>>>>>>> __do_fault() >>>>>>>>>> vm_ops->fault() >>>>>>>>>> >>>>>>>>>> ... which is where we come into shmem_fault(). 
Apart from the >>>>>>>>>> horrendous hole-punch handling case, shmem_fault() is quite simple: >>>>>>>>>> >>>>>>>>>> err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, >>>>>>>>>> gfp, vma, vmf, &ret); >>>>>>>>>> if (err) >>>>>>>>>> return vmf_error(err); >>>>>>>>>> vmf->page = folio_file_page(folio, vmf->pgoff); >>>>>>>>>> return ret; >>>>>>>>>> >>>>>>>>>> What we could do here is detect this case. Something like: >>>>>>>>>> >>>>>>>>>> enum sgp_type sgp = SGP_CACHE; >>>>>>>>>> >>>>>>>>>> if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) >>>>>>>>>> sgp = SGP_READ; >>>>>>>>> >>>>>>>>> Yes this will start to save the space, but just to mention this may start >>>>>>>>> to break anything that will still depend on the pagecache to work. E.g., >>>>>>>>> it'll change behavior if the vma is registered with uffd missing mode; >>>>>>>>> we'll start to lose MISSING events for these private mappings. Not sure >>>>>>>>> whether there're other side effects. >>>>>>>> >>>>>>>> I don't follow, can you elaborate? >>>>>>>> >>>>>>>> hugetlb doesn't perform this kind of unnecessary allocation and should be fine in regards to uffd. Why should it matter here and how exactly would a problematic sequence look like? >>>>>>> >>>>>>> Hugetlb is special because hugetlb detects pte first and relies on pte at >>>>>>> least for uffd. shmem is not. >>>>>>> >>>>>>> Feel free to also reference the recent fix which relies on the stable >>>>>>> hugetlb pte with commit 2ea7ff1e39cbe375. >>>>>> >>>>>> Sorry to be dense here, but I don't follow how that relates. >>>>>> >>>>>> Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd >>>>>> missing events on that mapping. >>>>>> >>>>>> Assume we get a page fault on a hole. We detect no page is mapped and check >>>>>> if the page cache has a page mapped -- which is also not the case, because >>>>>> there is a hole. >>>>>> >>>>>> So we notify uffd. >>>>>> >>>>>> Uffd will place a page. 
It should *not* touch the page cache and only insert >>>>>> that page into the page table -- otherwise we'd be violating MAP_PRIVATE >>>>>> semantics. >>>>> >>>>> That's actually exactly what we do right now... we insert into page cache >>>>> for the shmem. See shmem_mfill_atomic_pte(). >>>>> >>>>> Why it violates MAP_PRIVATE? Private pages only guarantee the exclusive >>>>> ownership of pages, I don't see why it should restrict uffd behavior. Uffd >>>>> missing mode (afaiu) is defined to resolve page cache missings in this >>>>> case. Hugetlb is special but not shmem IMO comparing to most of the rest >>>>> of the file systems. >>>> >>>> If a write (or uffd placement) via a MAP_PRIVATE mapping results in other >>>> shared/private mappings from observing these modifications, you have a clear >>>> violation of MAP_PRIVATE semantics. >>> >>> I think I understand what you meant, but just to mention again that I think >>> it's a matter of how we defined the uffd missing semantics when shmem >>> missing mode was introduced years ago. It does not need to be the same >>> semantic as writting directly to a private mapping. >>> >> >> I think uffd does exactly the right thing in mfill_atomic_pte: >> >> /* >> * The normal page fault path for a shmem will invoke the >> * fault, fill the hole in the file and COW it right away. The >> * result generates plain anonymous memory. So when we are >> * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll >> * generate anonymous memory directly without actually filling >> * the hole. For the MAP_PRIVATE case the robustness check >> * only happens in the pagetable (to verify it's still none) >> * and not in the radix tree. 
>> */
>> 	if (!(dst_vma->vm_flags & VM_SHARED)) {
>> 		if (mode == MCOPY_ATOMIC_NORMAL)
>> 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
>> 					       dst_addr, src_addr, page,
>> 					       wp_copy);
>> 		else
>> 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
>> 						 dst_vma, dst_addr);
>> 	} else {
>> 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
>> 					     dst_addr, src_addr,
>> 					     mode != MCOPY_ATOMIC_NORMAL,
>> 					     wp_copy, page);
>> 	}
>>
>> Unless we have a writable shared mapping, we end up not touching the pagecache.
>>
>> After what I understand from your last message (maybe I understood it wrong),
>> I tried exploiting uffd behavior by writing into a hole of a file without
>> write permissions using uffd. I failed because it does the right thing ;)
>
> Very interesting to learn this, thanks for the pointer, David. :)
> Definitely helpful to me on knowing better on the vma security model.
>
> Though note that it'll be a different topic if we go back to the original
> problem we're discussing - the fake-read approach of shmem will still keep
> the hole in file which will still change the behavior of missing messages
> from generating.
>
> Said that, I don't really know whether there can be a real impact on any
> uffd users, or anything else that similarly access the file cache.

One odd behavior I can think of is a process A that uses uffd on a
MAP_SHARED shmem mapping while some other process B (e.g., with only read
permissions on the file) has a MAP_PRIVATE mapping of the same file.

A read (or a write) from process B through the private mapping would result
in process A not receiving uffd events.

Of course, the same would happen if you have multiple MAP_SHARED
mappings as well ... but it feels a bit weird being able to do that
without write permissions to the file.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Avoiding allocation of unused shmem page 2022-10-21 16:19 ` David Hildenbrand @ 2022-10-21 16:26 ` David Hildenbrand 0 siblings, 0 replies; 13+ messages in thread From: David Hildenbrand @ 2022-10-21 16:26 UTC (permalink / raw) To: Peter Xu; +Cc: Matthew Wilcox, linux-mm, Hugh Dickins On 21.10.22 18:19, David Hildenbrand wrote: > On 21.10.22 18:01, Peter Xu wrote: >> On Fri, Oct 21, 2022 at 05:17:08PM +0200, David Hildenbrand wrote: >>> On 21.10.22 17:08, Peter Xu wrote: >>>> On Fri, Oct 21, 2022 at 04:45:27PM +0200, David Hildenbrand wrote: >>>>> On 21.10.22 16:28, Peter Xu wrote: >>>>>> On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote: >>>>>>> On 21.10.22 16:01, Peter Xu wrote: >>>>>>>> On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote: >>>>>>>>> On 20.10.22 23:10, Peter Xu wrote: >>>>>>>>>> On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote: >>>>>>>>>>> In yesterday's call, David brought up the case where we fallocate a file >>>>>>>>>>> in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over >>>>>>>>>>> a hole. That currently causes shmem to allocate a page, zero-fill it, >>>>>>>>>>> then COW it, resulting in two pages being allocated when only the >>>>>>>>>>> COW page really needs to be allocated. >>>>>>>>>>> >>>>>>>>>>> The path we currently take through the MM when we take the page fault >>>>>>>>>>> looks like this (correct me if I'm wrong ...): >>>>>>>>>>> >>>>>>>>>>> handle_mm_fault() >>>>>>>>>>> __handle_mm_fault() >>>>>>>>>>> handle_pte_fault() >>>>>>>>>>> do_fault() >>>>>>>>>>> do_cow_fault() >>>>>>>>>>> __do_fault() >>>>>>>>>>> vm_ops->fault() >>>>>>>>>>> >>>>>>>>>>> ... which is where we come into shmem_fault(). 
Apart from the >>>>>>>>>>> horrendous hole-punch handling case, shmem_fault() is quite simple: >>>>>>>>>>> >>>>>>>>>>> err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, >>>>>>>>>>> gfp, vma, vmf, &ret); >>>>>>>>>>> if (err) >>>>>>>>>>> return vmf_error(err); >>>>>>>>>>> vmf->page = folio_file_page(folio, vmf->pgoff); >>>>>>>>>>> return ret; >>>>>>>>>>> >>>>>>>>>>> What we could do here is detect this case. Something like: >>>>>>>>>>> >>>>>>>>>>> enum sgp_type sgp = SGP_CACHE; >>>>>>>>>>> >>>>>>>>>>> if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) >>>>>>>>>>> sgp = SGP_READ; >>>>>>>>>> >>>>>>>>>> Yes this will start to save the space, but just to mention this may start >>>>>>>>>> to break anything that will still depend on the pagecache to work. E.g., >>>>>>>>>> it'll change behavior if the vma is registered with uffd missing mode; >>>>>>>>>> we'll start to lose MISSING events for these private mappings. Not sure >>>>>>>>>> whether there're other side effects. >>>>>>>>> >>>>>>>>> I don't follow, can you elaborate? >>>>>>>>> >>>>>>>>> hugetlb doesn't perform this kind of unnecessary allocation and should be fine in regards to uffd. Why should it matter here and how exactly would a problematic sequence look like? >>>>>>>> >>>>>>>> Hugetlb is special because hugetlb detects pte first and relies on pte at >>>>>>>> least for uffd. shmem is not. >>>>>>>> >>>>>>>> Feel free to also reference the recent fix which relies on the stable >>>>>>>> hugetlb pte with commit 2ea7ff1e39cbe375. >>>>>>> >>>>>>> Sorry to be dense here, but I don't follow how that relates. >>>>>>> >>>>>>> Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd >>>>>>> missing events on that mapping. >>>>>>> >>>>>>> Assume we get a page fault on a hole. We detect no page is mapped and check >>>>>>> if the page cache has a page mapped -- which is also not the case, because >>>>>>> there is a hole. >>>>>>> >>>>>>> So we notify uffd. 
>>>>>>> >>>>>>> Uffd will place a page. It should *not* touch the page cache and only insert >>>>>>> that page into the page table -- otherwise we'd be violating MAP_PRIVATE >>>>>>> semantics. >>>>>> >>>>>> That's actually exactly what we do right now... we insert into page cache >>>>>> for the shmem. See shmem_mfill_atomic_pte(). >>>>>> >>>>>> Why it violates MAP_PRIVATE? Private pages only guarantee the exclusive >>>>>> ownership of pages, I don't see why it should restrict uffd behavior. Uffd >>>>>> missing mode (afaiu) is defined to resolve page cache missings in this >>>>>> case. Hugetlb is special but not shmem IMO comparing to most of the rest >>>>>> of the file systems. >>>>> >>>>> If a write (or uffd placement) via a MAP_PRIVATE mapping results in other >>>>> shared/private mappings from observing these modifications, you have a clear >>>>> violation of MAP_PRIVATE semantics. >>>> >>>> I think I understand what you meant, but just to mention again that I think >>>> it's a matter of how we defined the uffd missing semantics when shmem >>>> missing mode was introduced years ago. It does not need to be the same >>>> semantic as writting directly to a private mapping. >>>> >>> >>> I think uffd does exactly the right thing in mfill_atomic_pte: >>> >>> /* >>> * The normal page fault path for a shmem will invoke the >>> * fault, fill the hole in the file and COW it right away. The >>> * result generates plain anonymous memory. So when we are >>> * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll >>> * generate anonymous memory directly without actually filling >>> * the hole. For the MAP_PRIVATE case the robustness check >>> * only happens in the pagetable (to verify it's still none) >>> * and not in the radix tree. 
>>> */ >>> if (!(dst_vma->vm_flags & VM_SHARED)) { >>> if (mode == MCOPY_ATOMIC_NORMAL) >>> err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, >>> dst_addr, src_addr, page, >>> wp_copy); >>> else >>> err = mfill_zeropage_pte(dst_mm, dst_pmd, >>> dst_vma, dst_addr); >>> } else { >>> err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, >>> dst_addr, src_addr, >>> mode != MCOPY_ATOMIC_NORMAL, >>> wp_copy, page); >>> } >>> >>> Unless we have a writable shared mapping, we end up not touching the pagecache. >>> >>> After what I understand from your last message (maybe I understood it wrong), >>> I tried exploiting uffd behavior by writing into a hole of a file without >>> write permissions using uffd. I failed because it does the right thing ;) >> >> Very interesting to learn this, thanks for the pointer, David. :) >> Definitely helpful to me on knowing better on the vma security model. >> >> Though note that it'll be a different topic if we go back to the original >> problem we're discussing - the fake-read approach of shmem will still keep >> the hole in file which will still change the behavior of missing messages >> from generating. >> >> Said that, I don't really know whether there can be a real impact on any >> uffd users, or anything else that similarly access the file cache. > > One odd behavior I could think of is if one would have someone a process > A that uses uffd on a MAP_SHARED shmem and someone other process B > (e.g., with read-only permissions) have a MAP_PRIVATE mapping on the > same file. > > A read (or a write) from process B to the private mapping would result > in process A not receiving uffd events. > > > Of course, the same would happen if you have multiple MAP_SHARED > mappings as well ... but it feels a bit weird being able to do that > without write permissions to the file. 
>

BTW, in a private mapping it would be perfectly fine to always populate
the shared zeropage (when reading) or a fresh zero page (when writing)
into the process' page tables when finding a file hole -- without touching
the file (page cache), to which we might not even have write permissions,
at all.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Avoiding allocation of unused shmem page 2022-10-20 20:14 Avoiding allocation of unused shmem page Matthew Wilcox 2022-10-20 21:10 ` Peter Xu @ 2022-10-20 22:17 ` Yang Shi 1 sibling, 0 replies; 13+ messages in thread From: Yang Shi @ 2022-10-20 22:17 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-mm, Hugh Dickins, David Hildenbrand On Thu, Oct 20, 2022 at 1:14 PM Matthew Wilcox <willy@infradead.org> wrote: > > In yesterday's call, David brought up the case where we fallocate a file > in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over > a hole. That currently causes shmem to allocate a page, zero-fill it, > then COW it, resulting in two pages being allocated when only the > COW page really needs to be allocated. > > The path we currently take through the MM when we take the page fault > looks like this (correct me if I'm wrong ...): > > handle_mm_fault() > __handle_mm_fault() > handle_pte_fault() > do_fault() > do_cow_fault() > __do_fault() > vm_ops->fault() > > ... which is where we come into shmem_fault(). Apart from the > horrendous hole-punch handling case, shmem_fault() is quite simple: > > err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, > gfp, vma, vmf, &ret); > if (err) > return vmf_error(err); > vmf->page = folio_file_page(folio, vmf->pgoff); > return ret; > > What we could do here is detect this case. 
Something like:
>
> 	enum sgp_type sgp = SGP_CACHE;
>
> 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> 		sgp = SGP_READ;
> 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, sgp, gfp,
> 				  vma, vmf, &ret);
> 	if (err)
> 		return vmf_error(err);
> 	if (folio)
> 		vmf->page = folio_file_page(folio, vmf->pgoff);
> 	else
> 		vmf->page = NULL;
> 	return ret;
>
> and change do_cow_fault() like this:
>
> +++ b/mm/memory.c
> @@ -4575,12 +4575,17 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
> 	if (ret & VM_FAULT_DONE_COW)
> 		return ret;
>
> -	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
> +	if (vmf->page)
> +		copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
> +	else
> +		clear_user_highpage(vmf->cow_page, vmf->address);
> 	__SetPageUptodate(vmf->cow_page);
>
> 	ret |= finish_fault(vmf);
> -	unlock_page(vmf->page);
> -	put_page(vmf->page);
> +	if (vmf->page) {
> +		unlock_page(vmf->page);
> +		put_page(vmf->page);
> +	}
> 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> 		goto uncharge_out;
> 	return ret;
>
> ... I wrote the code directly in my email client; definitely not
> compile-tested. But if this situation is causing a real problem for
> someone, this would be a quick fix for them.
>
> Is this a real problem or just intellectual curiosity? Also, does
> this need support for THPs being created directly, or is khugepaged
> fixing it up afterwards good enough?

AFAIK, THP is not supported for private mappings.

>

^ permalink raw reply	[flat|nested] 13+ messages in thread
Thread overview: 13+ messages

2022-10-20 20:14 Avoiding allocation of unused shmem page Matthew Wilcox
2022-10-20 21:10 ` Peter Xu
2022-10-21  7:23   ` David Hildenbrand
2022-10-21 14:01     ` Peter Xu
2022-10-21 14:10       ` David Hildenbrand
2022-10-21 14:28         ` Peter Xu
2022-10-21 14:45           ` David Hildenbrand
2022-10-21 15:08             ` Peter Xu
2022-10-21 15:17               ` David Hildenbrand
2022-10-21 16:01                 ` Peter Xu
2022-10-21 16:19                   ` David Hildenbrand
2022-10-21 16:26                     ` David Hildenbrand
2022-10-20 22:17 ` Yang Shi