From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Suren Baghdasaryan <surenb@google.com>,
Vlastimil Babka <vbabka@suse.cz>,
Lorenzo Stoakes <lstoakes@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
sidhartha.kumar@oracle.com,
"Paul E . McKenney" <paulmck@kernel.org>,
Bert Karwatzki <spasswolf@web.de>, Jiri Olsa <olsajiri@gmail.com>,
linux-kernel@vger.kernel.org, Kees Cook <kees@kernel.org>
Subject: Re: [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region()
Date: Tue, 9 Jul 2024 15:27:25 +0100 [thread overview]
Message-ID: <4c9b4ec9-88b3-4ca8-8358-734463533078@lucifer.local> (raw)
In-Reply-To: <fhshsdxi4axwriv2od7lkqfzhhydccnwnov22caw5lgbfuy7mi@5hf73szoc3uz>
On Mon, Jul 08, 2024 at 03:10:10PM GMT, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:18]:
> > On Thu, Jul 04, 2024 at 02:27:15PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > area be overwritten and then clean up the gathered vmas using
> > > vms_complete_munmap_vmas().
> > >
> > > In the case of a driver mapping over existing vmas, the PTEs are cleared
> > > using the helper vms_complete_pte_clear().
> > >
> > > Temporarily keep track of the number of pages that will be removed and
> > > reduce the charged amount.
> > >
> > > This also drops the validate_mm() call in the vma_expand() function.
> > > It is necessary to drop the validate as it would fail since the mm
> > > map_count would be incorrect during a vma expansion, prior to the
> > > cleanup from vms_complete_munmap_vmas().
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > > mm/internal.h | 1 +
> > > mm/mmap.c | 61 ++++++++++++++++++++++++++++++---------------------
> > > 2 files changed, 37 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 4c9f06669cc4..fae4a1bba732 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> > > unsigned long stack_vm;
> > > unsigned long data_vm;
> > > bool unlock; /* Unlock after the munmap */
> > > + bool cleared_ptes; /* If the PTE are cleared already */
> > > };
> > >
> > > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 5d458c5f080e..0c334eeae8cd 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > > }
> > >
> > > static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > > - unsigned long addr, unsigned long end)
> > > + unsigned long addr, unsigned long end,
> > > + unsigned long *nr_accounted)
> > > {
> > > VMA_ITERATOR(vmi, mm, addr);
> > > struct vm_area_struct *vma;
> > > unsigned long nr_pages = 0;
> > >
> > > + *nr_accounted = 0;
> > > for_each_vma_range(vmi, vma, end) {
> > > unsigned long vm_start = max(addr, vma->vm_start);
> > > unsigned long vm_end = min(end, vma->vm_end);
> > >
> > > nr_pages += PHYS_PFN(vm_end - vm_start);
> > > + if (vma->vm_flags & VM_ACCOUNT)
> > > + *nr_accounted += PHYS_PFN(vm_end - vm_start);
> >
> > We're duplicating the PHYS_PFN(vm_end - vm_start) thing, probably worth
> > adding something like:
> >
> > unsigned long num_pages = PHYS_PFN(vm_end - vm_start);
> >
> > Side-note, but it'd be nice to sort out the inconsistency of PHYS_PFN()
> > vs. (end - start) >> PAGE_SHIFT. This is probably not a huge deal though...
>
> I split this out into another patch for easier reviewing.
Yeah I noticed, inevitably :) the PHYS_PFN(...) duplication persisted, a
small thing obviously but covered in the subsequent commit.
>
> >
> > > }
> > >
> > > return nr_pages;
> > > @@ -522,6 +526,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > > vms->unmap_start = FIRST_USER_ADDRESS;
> > > vms->unmap_end = USER_PGTABLES_CEILING;
> > > + vms->cleared_ptes = false;
> > > }
> > >
> > > /*
> > > @@ -730,7 +735,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > vma_iter_store(vmi, vma);
> > >
> > > vma_complete(&vp, vmi, vma->vm_mm);
> > > - validate_mm(vma->vm_mm);
> >
> > Since we're dropping this here, do we need to re-add this back somehwere
> > where we are confident the state will be consistent?
>
> The vma_expand() function is used in two places - one is in the mmap.c
> file which can no longer validate the mm until the munmap is complete.
> The other is in fs/exec.c which cannot call the validate_mm(). So
> to add this call back, I'd have to add a wrapper to vma_expand() to call
> the validate_mm() function for debug builds.
>
> Really all this code in fs/exec.c doesn't belong there so we don't need
> to do an extra function wrapper just to call validate_mm(). And you have
> a patch to do that which is out for review!
Indeed :) perhaps we should add back to the wrapper?
>
> >
> > > return 0;
> > >
> > > nomem:
> > > @@ -2612,6 +2616,9 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > > {
> > > struct mmu_gather tlb;
> > >
> > > + if (vms->cleared_ptes)
> > > + return;
> > > +
> > > /*
> > > * We can free page tables without write-locking mmap_lock because VMAs
> > > * were isolated before we downgraded mmap_lock.
> > > @@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > > mas_set(mas_detach, 1);
> > > free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> > > tlb_finish_mmu(&tlb);
> > > + vms->cleared_ptes = true;
> > > }
> > >
> > > /*
> > > @@ -2936,24 +2944,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > unsigned long merge_start = addr, merge_end = end;
> > > bool writable_file_mapping = false;
> > > pgoff_t vm_pgoff;
> > > - int error;
> > > + int error = -ENOMEM;
> > > VMA_ITERATOR(vmi, mm, addr);
> > > + unsigned long nr_pages, nr_accounted;
> > >
> > > - /* Check against address space limit. */
> > > - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> > > - unsigned long nr_pages;
> > > -
> > > - /*
> > > - * MAP_FIXED may remove pages of mappings that intersects with
> > > - * requested mapping. Account for the pages it would unmap.
> > > - */
> > > - nr_pages = count_vma_pages_range(mm, addr, end);
> > > -
> > > - if (!may_expand_vm(mm, vm_flags,
> > > - (len >> PAGE_SHIFT) - nr_pages))
> > > - return -ENOMEM;
> > > - }
> > > + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > >
> > > + /* Check against address space limit. */
> > > + /*
> > > + * MAP_FIXED may remove pages of mappings that intersects with requested
> > > + * mapping. Account for the pages it would unmap.
> > > + */
> >
> > Utter pedantry, but could these comments be combined? Bit ugly to have one
> > after another like this.
>
> Since this was mainly a relocation, I didn't want to change it too much
> but since you asked, I'll do it.
Thanks, obviously a highly pedantic nit this one!
>
> >
> > > + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > > + return -ENOMEM;
> > >
> > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > return -EPERM;
> > > @@ -2971,14 +2974,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > > return -ENOMEM;
> > >
> > > - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > > - return -ENOMEM;
> > > -
> > > - vms_complete_munmap_vmas(&vms, &mas_detach);
> > > next = vms.next;
> > > prev = vms.prev;
> > > vma = NULL;
> > > } else {
> > > + /* Minimal setup of vms */
> > > + vms.nr_pages = 0;
> >
> > I'm not a huge fan of having vms be uninitialised other than this field and
> > then to rely on no further code change accidentally using an uninitialised
> > field. This is kind of asking for bugs.
> >
> > Can we not find a way to sensibly initialise it somehow?
>
> Yes, I can switch to the same sort of thing as the maple state and
> initialize things as empty.
Thanks.
>
> >
> > > next = vma_next(&vmi);
> > > prev = vma_prev(&vmi);
> > > if (prev)
> > > @@ -2990,8 +2991,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > */
> > > if (accountable_mapping(file, vm_flags)) {
> > > charged = len >> PAGE_SHIFT;
> > > + charged -= nr_accounted;
> > > if (security_vm_enough_memory_mm(mm, charged))
> > > - return -ENOMEM;
> > > + goto abort_munmap;
> > > + vms.nr_accounted = 0;
> >
> > This is kind of expanding the 'vms possibly unitialised apart from selected
> > fields' pattern, makes me worry.
>
> I'll fix this with an init of the struct that will always be called.
Thanks.
>
> >
> > > vm_flags |= VM_ACCOUNT;
> > > }
> > >
> > > @@ -3040,10 +3043,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > * not unmapped, but the maps are removed from the list.
> > > */
> > > vma = vm_area_alloc(mm);
> > > - if (!vma) {
> > > - error = -ENOMEM;
> > > + if (!vma)
> > > goto unacct_error;
> > > - }
> > >
> > > vma_iter_config(&vmi, addr, end);
> > > vma_set_range(vma, addr, end, pgoff);
> > > @@ -3052,6 +3053,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >
> > > if (file) {
> > > vma->vm_file = get_file(file);
> > > + /* call_mmap() map PTE, so ensure there are no existing PTEs */
> >
> > Typo? Should this be 'call_mmap() maps PTEs, so ensure there are no
> > existing PTEs'? I feel like this could be reworded something like:
> >
> > 'call_map() may map PTEs, so clear any that may be pending unmap ahead of
> > time.'
>
> I had changed this already to 'call_mmap() may map PTE, so ensure there
> are no existing PTEs' That way it's still one line and more descriptive
> than what I had.
That works!
>
> >
> > > + if (vms.nr_pages)
> > > + vms_complete_pte_clear(&vms, &mas_detach, true);
> > > error = call_mmap(file, vma);
> > > if (error)
> > > goto unmap_and_free_vma;
> > > @@ -3142,6 +3146,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > expanded:
> > > perf_event_mmap(vma);
> > >
> > > + if (vms.nr_pages)
> > > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > > +
> >
> > Hang on, if we already did this in the if (file) branch above, might we end
> > up calling this twice? I didn't see vms.nr_pages get set to zero or
> > decremented anywhere (unless I missed it)?
>
> No, we called the new helper vms_complete_pte_clear(), which will avoid
> clearing the ptes by the added flag vms->cleared_ptes in the second
> call.
>
> Above, I modified vms_complete_pte_clear() to check vms->cleared_ptes
> prior to clearing the ptes, so it will only be cleared if it needs
> clearing.
>
> I debated moving this nr_pages check within vms_complete_munmap_vmas(),
> but that would add an unnecessary check to the munmap() path. Avoiding
> both checks seemed too much code (yet another static inline, or such).
> I also wanted to keep the sanity of nr_pages checking to a single
> function - as you highlighted it could be a path to insanity.
>
> Considering I'll switch this ti a VMS_INIT(), I think that I could pass
> it through and do the logic within the static inline at the expense of
> the munmap() having a few extra instructions (but no cache hits, so not
> a really big deal).
Yeah it's a bit confusing that the rest of vms_complete_munmap_vmas() is
potentially run twice even if the vms_complete_pte_clear() exits early due
to vms->cleared_ptes being set.
>
> >
> > > vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > > if (vm_flags & VM_LOCKED) {
> > > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > > @@ -3189,6 +3196,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > unacct_error:
> > > if (charged)
> > > vm_unacct_memory(charged);
> > > +
> > > +abort_munmap:
> > > + if (vms.nr_pages)
> > > + abort_munmap_vmas(&mas_detach);
> > > validate_mm(mm);
> > > return error;
> > > }
> > > --
> > > 2.43.0
> > >
> >
> > In general I like the approach and you've made it very clear how you've
> > altered this behaviour.
> >
> > However I have a few concerns (as well some trivial comments) above. With
> > those cleared up we'll be good to go!
next prev parent reply other threads:[~2024-07-09 14:27 UTC|newest]
Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 01/16] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
2024-07-05 17:02 ` Lorenzo Stoakes
2024-07-05 18:12 ` Liam R. Howlett
2024-07-10 16:06 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
2024-07-05 17:39 ` Lorenzo Stoakes
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
2024-07-05 18:01 ` Lorenzo Stoakes
2024-07-05 18:41 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 16:32 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
2024-07-05 18:39 ` Lorenzo Stoakes
2024-07-05 19:09 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 16:30 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
2024-07-05 19:27 ` Lorenzo Stoakes
2024-07-05 19:59 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 17:29 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
2024-07-05 19:35 ` Lorenzo Stoakes
2024-07-10 16:06 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
2024-07-05 19:39 ` Lorenzo Stoakes
2024-07-05 20:00 ` Liam R. Howlett
2024-07-10 16:15 ` Suren Baghdasaryan
2024-07-10 16:35 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
2024-07-05 20:06 ` Lorenzo Stoakes
2024-07-05 20:30 ` Liam R. Howlett
2024-07-05 20:36 ` Lorenzo Stoakes
2024-07-08 14:49 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
2024-07-05 20:18 ` Lorenzo Stoakes
2024-07-05 20:56 ` Liam R. Howlett
2024-07-08 11:08 ` Lorenzo Stoakes
2024-07-08 16:43 ` Liam R. Howlett
2024-07-10 16:48 ` Suren Baghdasaryan
2024-07-10 17:18 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
2024-07-05 20:27 ` Lorenzo Stoakes
2024-07-08 14:45 ` Liam R. Howlett
2024-07-10 17:14 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
2024-07-05 20:33 ` Lorenzo Stoakes
2024-07-10 17:14 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
2024-07-08 12:18 ` Lorenzo Stoakes
2024-07-08 19:10 ` Liam R. Howlett
2024-07-09 14:27 ` Lorenzo Stoakes [this message]
2024-07-09 18:43 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 14/16] mm/mmap: Use PHYS_PFN " Liam R. Howlett
2024-07-08 12:21 ` Lorenzo Stoakes
2024-07-09 18:35 ` Liam R. Howlett
2024-07-09 18:42 ` Lorenzo Stoakes
2024-07-10 17:32 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 15/16] mm/mmap: Use vms accounted pages " Liam R. Howlett
2024-07-08 12:43 ` Lorenzo Stoakes
2024-07-10 17:43 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
2024-07-08 12:52 ` Lorenzo Stoakes
2024-07-08 20:43 ` Liam R. Howlett
2024-07-09 14:42 ` Liam R. Howlett
2024-07-09 14:51 ` Lorenzo Stoakes
2024-07-09 14:52 ` Liam R. Howlett
2024-07-09 18:13 ` Dave Hansen
2024-07-09 14:45 ` Lorenzo Stoakes
2024-07-10 12:28 ` Michael Ellerman
2024-07-10 12:45 ` Lorenzo Stoakes
2024-07-10 12:59 ` LEROY Christophe
2024-07-10 16:09 ` Liam R. Howlett
2024-07-10 19:27 ` Dmitry Safonov
2024-07-10 21:04 ` LEROY Christophe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4c9b4ec9-88b3-4ca8-8358-734463533078@lucifer.local \
--to=lorenzo.stoakes@oracle.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=kees@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lstoakes@gmail.com \
--cc=olsajiri@gmail.com \
--cc=paulmck@kernel.org \
--cc=sidhartha.kumar@oracle.com \
--cc=spasswolf@web.de \
--cc=surenb@google.com \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox