* [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close
2021-12-08 21:22 [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
@ 2021-12-08 21:22 ` Suren Baghdasaryan
2021-12-09 8:55 ` Michal Hocko
2021-12-08 21:22 ` [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-08 21:22 UTC (permalink / raw)
To: akpm
Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team, surenb
Add comments for vm_operations_struct::close documenting locking
requirements for this callback and its callers.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..b9b88ba7564b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -577,6 +577,10 @@ enum page_entry_size {
*/
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
+ /**
+ * @close: Called when the VMA is being removed from the MM.
+ * Context: User context. May sleep. Caller holds mmap_lock.
+ */
void (*close)(struct vm_area_struct * area);
/* Called any time before splitting to check if it's allowed */
int (*may_split)(struct vm_area_struct *area, unsigned long addr);
--
2.34.1.400.ga245620fadb-goog
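[Editor's note] The documented context ("User context. May sleep. Caller holds mmap_lock.") constrains what a driver's ->close implementation may do: sleeping calls are fine, but taking mmap_lock again would deadlock. A minimal userspace sketch, with stubbed-out kernel types and a hypothetical `demo_` driver (none of these names are from the patch):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal userspace stubs standing in for the kernel types; the real
 * definitions live in include/linux/mm.h. */
struct vm_area_struct {
	void *vm_private_data;
};

struct vm_operations_struct {
	void (*open)(struct vm_area_struct *area);
	void (*close)(struct vm_area_struct *area);
};

/* Hypothetical driver-private state attached to the VMA. */
struct demo_state {
	int refcount;
};

/* Per the documented context: runs in user context with mmap_lock held,
 * so it may sleep (e.g. freeing memory, taking a mutex) but must NOT
 * try to take mmap_lock itself -- that would deadlock against the
 * caller (__split_vma, exit_mmap, ...). */
static void demo_close(struct vm_area_struct *area)
{
	struct demo_state *st = area->vm_private_data;

	if (st && --st->refcount == 0) {
		free(st);			/* sleeping call is fine here */
		area->vm_private_data = NULL;
	}
}

static const struct vm_operations_struct demo_vm_ops = {
	.close = demo_close,
};
```

The sketch only illustrates the calling convention; a real driver would also supply ->open and take its own locks as needed.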
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close
2021-12-08 21:22 ` [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close Suren Baghdasaryan
@ 2021-12-09 8:55 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2021-12-09 8:55 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Wed 08-12-21 13:22:10, Suren Baghdasaryan wrote:
> Add comments for vm_operations_struct::close documenting locking
> requirements for this callback and its callers.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/mm.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7e4a9e7d807..b9b88ba7564b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -577,6 +577,10 @@ enum page_entry_size {
> */
> struct vm_operations_struct {
> void (*open)(struct vm_area_struct * area);
> + /**
> + * @close: Called when the VMA is being removed from the MM.
> + * Context: User context. May sleep. Caller holds mmap_lock.
> + */
> void (*close)(struct vm_area_struct * area);
> /* Called any time before splitting to check if it's allowed */
> int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> --
> 2.34.1.400.ga245620fadb-goog
--
Michal Hocko
SUSE Labs
* [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
2021-12-08 21:22 [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
2021-12-08 21:22 ` [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close Suren Baghdasaryan
@ 2021-12-08 21:22 ` Suren Baghdasaryan
2021-12-09 8:59 ` Michal Hocko
2021-12-09 8:55 ` [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Michal Hocko
2021-12-09 9:12 ` [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap Michal Hocko
3 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-08 21:22 UTC (permalink / raw)
To: akpm
Cc: mhocko, mhocko, rientjes, willy, hannes, guro, riel, minchan,
kirill, aarcange, christian, hch, oleg, david, jannh, shakeelb,
luto, christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team, surenb
With exit_mmap holding mmap_write_lock during free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
is walking the VMA tree. The change prevents process_mrelease from
calling the last mmput, which can lead to waiting for IO completion
in exit_aio.
Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/oom_kill.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..67780386f478 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
goto put_task;
}
- if (mmget_not_zero(p->mm)) {
- mm = p->mm;
- if (task_will_free_mem(p))
- reap = true;
- else {
- /* Error only if the work has not been done already */
- if (!test_bit(MMF_OOM_SKIP, &mm->flags))
- ret = -EINVAL;
- }
+ mm = p->mm;
+ mmgrab(mm);
+
+ if (task_will_free_mem(p))
+ reap = true;
+ else {
+ /* Error only if the work has not been done already */
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+ ret = -EINVAL;
}
task_unlock(p);
@@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
ret = -EINTR;
goto drop_mm;
}
- if (!__oom_reap_task_mm(mm))
+ /*
+ * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
+ * possible change in exit_mmap is seen
+ */
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
ret = -EAGAIN;
mmap_read_unlock(mm);
drop_mm:
- if (mm)
- mmput(mm);
+ mmdrop(mm);
put_task:
put_task_struct(task);
return ret;
--
2.34.1.400.ga245620fadb-goog
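[Editor's note] The mmget_not_zero-to-mmgrab switch rests on the mm's two reference counters: mm_users (mmget/mmput) keeps the address space alive, and its final drop runs the heavy __mmput path (exit_aio, exit_mmap), while mm_count (mmgrab/mmdrop) only pins struct mm_struct itself. A toy single-threaded model of that distinction, assuming the usual counter semantics (all `toy_` names are hypothetical, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the mm's two counters; not the kernel implementation. */
struct toy_mm {
	int mm_users;		/* mmget()/mmput() -- address-space users */
	int mm_count;		/* mmgrab()/mmdrop() -- struct pinning    */
	bool teardown_ran;	/* stands in for the heavy __mmput() path */
	bool freed;		/* stands in for freeing struct mm_struct */
};

static void toy_mmgrab(struct toy_mm *mm) { mm->mm_count++; }

static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->mm_count == 0)
		mm->freed = true;	/* only frees the struct */
}

static void toy_mmput(struct toy_mm *mm)
{
	if (--mm->mm_users == 0) {
		/* Last user: runs the expensive teardown, which can wait
		 * for I/O in exit_aio -- exactly what process_mrelease
		 * must avoid triggering from its own context. */
		mm->teardown_ran = true;
		toy_mmdrop(mm);
	}
}
```

With mmgrab, process_mrelease pins the struct without becoming an mm_user, so the expensive teardown always runs in the exiting task's context, never in the releaser's.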
* Re: [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
2021-12-08 21:22 ` [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
@ 2021-12-09 8:59 ` Michal Hocko
2021-12-09 19:03 ` Suren Baghdasaryan
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-12-09 8:59 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Wed 08-12-21 13:22:11, Suren Baghdasaryan wrote:
> With exit_mmap holding mmap_write_lock during free_pgtables call,
> process_mrelease does not need to elevate mm->mm_users in order to
> prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
> is walking the VMA tree. The change prevents process_mrelease from
> calling the last mmput, which can lead to waiting for IO completion
> in exit_aio.
>
> Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
I am not sure I have brought this up already but I do not think Fixes
tag is a good fit. 337546e83fc7 is a correct way to handle the race. It
is just slightly less optimal than this fix.
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/oom_kill.c | 27 +++++++++++++++------------
> 1 file changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ddabefcfb5a..67780386f478 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> goto put_task;
> }
>
> - if (mmget_not_zero(p->mm)) {
> - mm = p->mm;
> - if (task_will_free_mem(p))
> - reap = true;
> - else {
> - /* Error only if the work has not been done already */
> - if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> - ret = -EINVAL;
> - }
> + mm = p->mm;
> + mmgrab(mm);
> +
> + if (task_will_free_mem(p))
> + reap = true;
> + else {
> + /* Error only if the work has not been done already */
> + if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> + ret = -EINVAL;
> }
> task_unlock(p);
>
> @@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> ret = -EINTR;
> goto drop_mm;
> }
> - if (!__oom_reap_task_mm(mm))
> + /*
> + * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
> + * possible change in exit_mmap is seen
> + */
> + if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
> ret = -EAGAIN;
> mmap_read_unlock(mm);
>
> drop_mm:
> - if (mm)
> - mmput(mm);
> + mmdrop(mm);
> put_task:
> put_task_struct(task);
> return ret;
> --
> 2.34.1.400.ga245620fadb-goog
--
Michal Hocko
SUSE Labs
* Re: [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection
2021-12-09 8:59 ` Michal Hocko
@ 2021-12-09 19:03 ` Suren Baghdasaryan
0 siblings, 0 replies; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 19:03 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 9, 2021 at 12:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 08-12-21 13:22:11, Suren Baghdasaryan wrote:
> > With exit_mmap holding mmap_write_lock during free_pgtables call,
> > process_mrelease does not need to elevate mm->mm_users in order to
> > prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
> > is walking the VMA tree. The change prevents process_mrelease from
> > calling the last mmput, which can lead to waiting for IO completion
> > in exit_aio.
> >
> > Fixes: 337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
>
> I am not sure I have brought this up already but I do not think Fixes
> tag is a good fit. 337546e83fc7 is a correct way to handle the race. It
> is just slightly less optimal than this fix.
Will post v5 without it. Thanks!
>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
> > ---
> > mm/oom_kill.c | 27 +++++++++++++++------------
> > 1 file changed, 15 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 1ddabefcfb5a..67780386f478 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > goto put_task;
> > }
> >
> > - if (mmget_not_zero(p->mm)) {
> > - mm = p->mm;
> > - if (task_will_free_mem(p))
> > - reap = true;
> > - else {
> > - /* Error only if the work has not been done already */
> > - if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> > - ret = -EINVAL;
> > - }
> > + mm = p->mm;
> > + mmgrab(mm);
> > +
> > + if (task_will_free_mem(p))
> > + reap = true;
> > + else {
> > + /* Error only if the work has not been done already */
> > + if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> > + ret = -EINVAL;
> > }
> > task_unlock(p);
> >
> > @@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > ret = -EINTR;
> > goto drop_mm;
> > }
> > - if (!__oom_reap_task_mm(mm))
> > + /*
> > + * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
> > + * possible change in exit_mmap is seen
> > + */
> > + if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
> > ret = -EAGAIN;
> > mmap_read_unlock(mm);
> >
> > drop_mm:
> > - if (mm)
> > - mmput(mm);
> > + mmdrop(mm);
> > put_task:
> > put_task_struct(task);
> > return ret;
> > --
> > 2.34.1.400.ga245620fadb-goog
>
> --
> Michal Hocko
> SUSE Labs
* Re: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
2021-12-08 21:22 [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
2021-12-08 21:22 ` [PATCH v4 2/3] mm: document locking restrictions for vm_operations_struct::close Suren Baghdasaryan
2021-12-08 21:22 ` [PATCH v4 3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection Suren Baghdasaryan
@ 2021-12-09 8:55 ` Michal Hocko
2021-12-09 19:03 ` Suren Baghdasaryan
2021-12-09 9:12 ` [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap Michal Hocko
3 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-12-09 8:55 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Wed 08-12-21 13:22:09, Suren Baghdasaryan wrote:
> oom-reaper and process_mrelease system call should protect against
> races with exit_mmap which can destroy page tables while they
> walk the VMA tree. oom-reaper protects from that race by setting
> MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> before taking and releasing mmap_write_lock. process_mrelease has
> to elevate mm->mm_users to prevent such race. Both oom-reaper and
> process_mrelease hold mmap_read_lock when walking the VMA tree.
> The locking rules and mechanisms could be simpler if exit_mmap takes
> mmap_write_lock while executing destructive operations such as
> free_pgtables.
> Change exit_mmap to hold the mmap_write_lock when calling
> free_pgtables and remove_vma. Operations like unmap_vmas and
> unlock_range are not destructive and could run under mmap_read_lock
> but for simplicity we take one mmap_write_lock during almost the entire
> operation.
unlock_range is not safe to be called under read lock. See 27ae357fa82b
("mm, oom: fix concurrent munlock and oom reaper unmap, v3").
> Note also that because oom-reaper checks VM_LOCKED flag,
> unlock_range() should not be allowed to race with it.
> Before this patch, remove_vma used to be called with no locks held,
> however with fput being executed asynchronously and vm_ops->close
> not being allowed to hold mmap_lock (it is called from __split_vma
> with mmap_sem held for write), changing that should be fine.
> In most cases this lock should be uncontended. Previously, Kirill
> reported ~4% regression caused by a similar change [1]. We reran the
> same test and although the individual results are quite noisy, the
> percentiles show lower regression with 1.6% being the worst case [2].
> The change allows oom-reaper and process_mrelease to execute safely
> under mmap_read_lock without worries that exit_mmap might destroy page
> tables from under them.
>
> [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
> [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The patch looks good otherwise. Btw. when I was trying to do something
similar in the past Hugh has noted that we can get rid of the same
lock&&unlock trick in ksm. Maybe you want to have a look at that as well
;)
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> changes in v4
> - Separated comments describing vm_operations_struct::close locking
> requirements into a separate patch, per Matthew Wilcox
>
> mm/mmap.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bfb0ea164a90..f4e09d390a07 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
> * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> * __oom_reap_task_mm() will not block.
> *
> - * This needs to be done before calling munlock_vma_pages_all(),
> + * This needs to be done before calling unlock_range(),
> * which clears VM_LOCKED, otherwise the oom reaper cannot
> * reliably test it.
> */
> (void)__oom_reap_task_mm(mm);
>
> set_bit(MMF_OOM_SKIP, &mm->flags);
> - mmap_write_lock(mm);
> - mmap_write_unlock(mm);
> }
>
> + mmap_write_lock(mm);
> if (mm->locked_vm)
> unlock_range(mm->mmap, ULONG_MAX);
>
> arch_exit_mmap(mm);
>
> vma = mm->mmap;
> - if (!vma) /* Can happen if dup_mmap() received an OOM */
> + if (!vma) {
> + /* Can happen if dup_mmap() received an OOM */
> + mmap_write_unlock(mm);
> return;
> + }
>
> lru_add_drain();
> flush_cache_mm(mm);
> @@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> tlb_finish_mmu(&tlb);
>
> - /*
> - * Walk the list again, actually closing and freeing it,
> - * with preemption enabled, without holding any MM locks.
> - */
> + /* Walk the list again, actually closing and freeing it. */
> while (vma) {
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> cond_resched();
> }
> + mmap_write_unlock(mm);
> vm_unacct_memory(nr_accounted);
> }
>
> --
> 2.34.1.400.ga245620fadb-goog
--
Michal Hocko
SUSE Labs
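[Editor's note] The net effect of the patch under review is that a single mmap_write_lock now covers exit_mmap's destructive phase instead of the old take-and-immediately-drop dance in the OOM-victim branch. A single-threaded sketch recording the resulting ordering (the helper bodies are stubs; only the sequence is the point):

```c
#include <assert.h>
#include <string.h>

/* Records the order of operations in the patched exit_mmap(); each
 * step is a stub standing in for the real kernel function. */
static const char *trace[8];
static int n;

static void step(const char *what) { trace[n++] = what; }

/* Sketch of exit_mmap() after the patch: the write lock now covers
 * everything from unlock_range() through remove_vma(). */
static void exit_mmap_sketch(void)
{
	step("mmap_write_lock");
	step("unlock_range");	/* clears VM_LOCKED */
	step("unmap_vmas");
	step("free_pgtables");	/* destructive: needs exclusive lock */
	step("remove_vma");	/* ->close callbacks now run under the lock */
	step("mmap_write_unlock");
}
```

Any reader (oom-reaper, process_mrelease) holding mmap_read_lock is therefore excluded for the whole window in which page tables can disappear.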
* Re: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
2021-12-09 8:55 ` [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Michal Hocko
@ 2021-12-09 19:03 ` Suren Baghdasaryan
2021-12-10 9:20 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 19:03 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 9, 2021 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 08-12-21 13:22:09, Suren Baghdasaryan wrote:
> > oom-reaper and process_mrelease system call should protect against
> > races with exit_mmap which can destroy page tables while they
> > walk the VMA tree. oom-reaper protects from that race by setting
> > MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> > before taking and releasing mmap_write_lock. process_mrelease has
> > to elevate mm->mm_users to prevent such race. Both oom-reaper and
> > process_mrelease hold mmap_read_lock when walking the VMA tree.
> > The locking rules and mechanisms could be simpler if exit_mmap takes
> > mmap_write_lock while executing destructive operations such as
> > free_pgtables.
> > Change exit_mmap to hold the mmap_write_lock when calling
> > free_pgtables and remove_vma. Operations like unmap_vmas and
> > unlock_range are not destructive and could run under mmap_read_lock
> > but for simplicity we take one mmap_write_lock during almost the entire
> > operation.
>
> unlock_range is not safe to be called under read lock. See 27ae357fa82b
> ("mm, oom: fix concurrent munlock and oom reaper unmap, v3").
Ok, I'll remove the sentence above.
Is my understanding correct that it is unsafe only because oom-reaper
can't deal with VM_LOCKED, otherwise it would be fine?
>
> > Note also that because oom-reaper checks VM_LOCKED flag,
> > unlock_range() should not be allowed to race with it.
> > Before this patch, remove_vma used to be called with no locks held,
> > however with fput being executed asynchronously and vm_ops->close
> > not being allowed to hold mmap_lock (it is called from __split_vma
> > with mmap_sem held for write), changing that should be fine.
> > In most cases this lock should be uncontended. Previously, Kirill
> > reported ~4% regression caused by a similar change [1]. We reran the
> > same test and although the individual results are quite noisy, the
> > percentiles show lower regression with 1.6% being the worst case [2].
> > The change allows oom-reaper and process_mrelease to execute safely
> > under mmap_read_lock without worries that exit_mmap might destroy page
> > tables from under them.
> >
> > [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
> > [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> The patch looks good otherwise. Btw. when I was trying to do something
> similar in the past Hugh has noted that we can get rid of the same
> lock&&unlock trick in ksm. Maybe you want to have a look at that as well
> ;)
I'll take a look after we cleanup this path completely (oom pieces included).
>
> Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
>
> Thanks!
>
> > ---
> > changes in v4
> > - Separated comments describing vm_operations_struct::close locking
> > requirements into a separate patch, per Matthew Wilcox
> >
> > mm/mmap.c | 16 ++++++++--------
> > 1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index bfb0ea164a90..f4e09d390a07 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
> > * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > * __oom_reap_task_mm() will not block.
> > *
> > - * This needs to be done before calling munlock_vma_pages_all(),
> > + * This needs to be done before calling unlock_range(),
> > * which clears VM_LOCKED, otherwise the oom reaper cannot
> > * reliably test it.
> > */
> > (void)__oom_reap_task_mm(mm);
> >
> > set_bit(MMF_OOM_SKIP, &mm->flags);
> > - mmap_write_lock(mm);
> > - mmap_write_unlock(mm);
> > }
> >
> > + mmap_write_lock(mm);
> > if (mm->locked_vm)
> > unlock_range(mm->mmap, ULONG_MAX);
> >
> > arch_exit_mmap(mm);
> >
> > vma = mm->mmap;
> > - if (!vma) /* Can happen if dup_mmap() received an OOM */
> > + if (!vma) {
> > + /* Can happen if dup_mmap() received an OOM */
> > + mmap_write_unlock(mm);
> > return;
> > + }
> >
> > lru_add_drain();
> > flush_cache_mm(mm);
> > @@ -3171,16 +3173,14 @@ void exit_mmap(struct mm_struct *mm)
> > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > tlb_finish_mmu(&tlb);
> >
> > - /*
> > - * Walk the list again, actually closing and freeing it,
> > - * with preemption enabled, without holding any MM locks.
> > - */
> > + /* Walk the list again, actually closing and freeing it. */
> > while (vma) {
> > if (vma->vm_flags & VM_ACCOUNT)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > cond_resched();
> > }
> > + mmap_write_unlock(mm);
> > vm_unacct_memory(nr_accounted);
> > }
> >
> > --
> > 2.34.1.400.ga245620fadb-goog
>
> --
> Michal Hocko
> SUSE Labs
* Re: [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap
2021-12-09 19:03 ` Suren Baghdasaryan
@ 2021-12-10 9:20 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2021-12-10 9:20 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu 09-12-21 11:03:11, Suren Baghdasaryan wrote:
> On Thu, Dec 9, 2021 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 08-12-21 13:22:09, Suren Baghdasaryan wrote:
> > > oom-reaper and process_mrelease system call should protect against
> > > races with exit_mmap which can destroy page tables while they
> > > walk the VMA tree. oom-reaper protects from that race by setting
> > > MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> > > before taking and releasing mmap_write_lock. process_mrelease has
> > > to elevate mm->mm_users to prevent such race. Both oom-reaper and
> > > process_mrelease hold mmap_read_lock when walking the VMA tree.
> > > The locking rules and mechanisms could be simpler if exit_mmap takes
> > > mmap_write_lock while executing destructive operations such as
> > > free_pgtables.
> > > Change exit_mmap to hold the mmap_write_lock when calling
> > > free_pgtables and remove_vma. Operations like unmap_vmas and
> > > unlock_range are not destructive and could run under mmap_read_lock
> > > but for simplicity we take one mmap_write_lock during almost the entire
> > > operation.
> >
> > unlock_range is not safe to be called under read lock. See 27ae357fa82b
> > ("mm, oom: fix concurrent munlock and oom reaper unmap, v3").
>
> Ok, I'll remove the sentence above.
> Is my understanding correct that it is unsafe only because oom-reaper
> can't deal with VM_LOCKED, otherwise it would be fine?
The commit message (27ae357fa82b) goes into details that I have forgotten already.
--
Michal Hocko
SUSE Labs
* [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap
2021-12-08 21:22 [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Suren Baghdasaryan
` (2 preceding siblings ...)
2021-12-09 8:55 ` [PATCH v4 1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap Michal Hocko
@ 2021-12-09 9:12 ` Michal Hocko
2021-12-09 16:24 ` Suren Baghdasaryan
3 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-12-09 9:12 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
Do we want this on top?
----
From 58b04ae6dc97b0105ea2651daca55cf2386f69b4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 9 Dec 2021 10:07:51 +0100
Subject: [PATCH] mm: drop MMF_OOM_SKIP from exit_mmap
MMF_OOM_SKIP used to play a synchronization role between exit_mmap and
oom reaper in the past. Since the exclusive mmap_sem is held in
exit_mmap to cover all destructive operations the flag synchronization
is not needed anymore and we can safely drop it. Just make sure that
mm->mmap is set to NULL so that nobody will access the freed vma list.
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
mm/mmap.c | 23 +----------------------
1 file changed, 1 insertion(+), 22 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index f4e09d390a07..0d6af9d89aa8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3129,28 +3129,6 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
- if (unlikely(mm_is_oom_victim(mm))) {
- /*
- * Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_lock for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_lock is
- * dropped.
- *
- * Nothing can be holding mm->mmap_lock here and the above call
- * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
- * __oom_reap_task_mm() will not block.
- *
- * This needs to be done before calling unlock_range(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
- * reliably test it.
- */
- (void)__oom_reap_task_mm(mm);
-
- set_bit(MMF_OOM_SKIP, &mm->flags);
- }
-
mmap_write_lock(mm);
if (mm->locked_vm)
unlock_range(mm->mmap, ULONG_MAX);
@@ -3180,6 +3158,7 @@ void exit_mmap(struct mm_struct *mm)
vma = remove_vma(vma);
cond_resched();
}
+ mm->mmap = NULL;
mmap_write_unlock(mm);
vm_unacct_memory(nr_accounted);
}
--
2.30.2
--
Michal Hocko
SUSE Labs
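[Editor's note] The follow-up patch replaces the MMF_OOM_SKIP handshake with a simpler invariant: exit_mmap clears mm->mmap under the write lock, so any later walker taking mmap_read_lock simply finds an empty VMA list. A minimal sketch of that invariant, assuming the linked-list VMA layout of this kernel version (`toy_` names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vma { struct toy_vma *next; };

struct toy_mm {
	struct toy_vma *mmap;	/* head of the VMA list */
};

/* Stands in for the tail of the patched exit_mmap(): the list head is
 * cleared before the write lock is released, so no flag is needed to
 * tell walkers that the VMAs are gone. */
static void toy_exit_mmap(struct toy_mm *mm)
{
	/* mmap_write_lock(mm); */
	mm->mmap = NULL;	/* published before the lock drops */
	/* mmap_write_unlock(mm); */
}

/* Stands in for a reaper walking VMAs under mmap_read_lock(): an empty
 * list means exit_mmap already ran and there is nothing left to reap. */
static bool toy_reap(struct toy_mm *mm)
{
	int reaped = 0;

	for (struct toy_vma *vma = mm->mmap; vma; vma = vma->next)
		reaped++;
	return reaped > 0;
}
```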
* Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap
2021-12-09 9:12 ` [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap Michal Hocko
@ 2021-12-09 16:24 ` Suren Baghdasaryan
2021-12-09 16:47 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 16:24 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Do we want this on top?
As we discussed in this thread
https://lore.kernel.org/all/YY4snVzZZZYhbigV@dhcp22.suse.cz,
__oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
unmap pages in parallel with exit_mmap without blocking each other.
Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
and has a negative impact on performance. So the conclusion of that
thread I thought was to keep that part. My understanding is that we
also wanted to remove MMF_OOM_SKIP as a follow-up patch but
__oom_reap_task_mm would stay.
> ----
> From 58b04ae6dc97b0105ea2651daca55cf2386f69b4 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 9 Dec 2021 10:07:51 +0100
> Subject: [PATCH] mm: drop MMF_OOM_SKIP from exit_mmap
>
> MMF_OOM_SKIP used to play a synchronization role between exit_mmap and
> oom reaper in the past. Since the exclusive mmap_sem is held in
> exit_mmap to cover all destructive operations the flag synchronization
> is not needed anymore and we can safely drop it. Just make sure that
> mm->mmap is set to NULL so that nobody will access the freed vma list.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/mmap.c | 23 +----------------------
> 1 file changed, 1 insertion(+), 22 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f4e09d390a07..0d6af9d89aa8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3129,28 +3129,6 @@ void exit_mmap(struct mm_struct *mm)
> /* mm's last user has gone, and its about to be pulled down */
> mmu_notifier_release(mm);
>
> - if (unlikely(mm_is_oom_victim(mm))) {
> - /*
> - * Manually reap the mm to free as much memory as possible.
> - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> - * this mm from further consideration. Taking mm->mmap_lock for
> - * write after setting MMF_OOM_SKIP will guarantee that the oom
> - * reaper will not run on this mm again after mmap_lock is
> - * dropped.
> - *
> - * Nothing can be holding mm->mmap_lock here and the above call
> - * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> - * __oom_reap_task_mm() will not block.
> - *
> - * This needs to be done before calling unlock_range(),
> - * which clears VM_LOCKED, otherwise the oom reaper cannot
> - * reliably test it.
> - */
> - (void)__oom_reap_task_mm(mm);
> -
> - set_bit(MMF_OOM_SKIP, &mm->flags);
> - }
> -
> mmap_write_lock(mm);
> if (mm->locked_vm)
> unlock_range(mm->mmap, ULONG_MAX);
> @@ -3180,6 +3158,7 @@ void exit_mmap(struct mm_struct *mm)
> vma = remove_vma(vma);
> cond_resched();
> }
> + mm->mmap = NULL;
> mmap_write_unlock(mm);
> vm_unacct_memory(nr_accounted);
> }
> --
> 2.30.2
>
> --
> Michal Hocko
> SUSE Labs
* Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap
2021-12-09 16:24 ` Suren Baghdasaryan
@ 2021-12-09 16:47 ` Michal Hocko
2021-12-09 17:06 ` Suren Baghdasaryan
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-12-09 16:47 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Do we want this on top?
>
> As we discussed in this thread
> https://lore.kernel.org/all/YY4snVzZZZYhbigV@dhcp22.suse.cz,
> __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> unmap pages in parallel with exit_mmap without blocking each other.
> Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> and has a negative impact on performance. So the conclusion of that
> thread I thought was to keep that part. My understanding is that we
> also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> __oom_reap_task_mm would stay.
OK, then we were talking past each other, I am afraid. I really wanted
to get rid of this oom-specific stuff from exit_mmap. It was there out
of necessity. With proper locking we can finally get rid of the crud.
As I've said previously, oom reaping has never been a hot path.
If we really want to optimize this path then I would much rather see a
generic solution which would allow moving the write lock down after
unmap_vmas. That would require the oom reaper to be able to handle
mlocked memory.
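For illustration only, the ordering being suggested could be modelled
roughly like this. This is a user-space sketch, not kernel code:
mm_model, lock_state and the step functions are invented names; only
the "unmap under the read lock, tear down page tables under the write
lock" ordering reflects the actual proposal.

```c
#include <assert.h>

enum lock_state { UNLOCKED, READ_LOCKED, WRITE_LOCKED };

struct mm_model {
	enum lock_state lock;
	int vmas_unmapped;	/* unmap_vmas() equivalent ran */
	int pgtables_freed;	/* free_pgtables() equivalent ran */
};

static void unmap_vmas_step(struct mm_model *mm)
{
	/* Unmapping proceeds under the read lock, so a parallel reaper
	 * (also a read-lock holder) is not excluded. */
	assert(mm->lock == READ_LOCKED);
	mm->vmas_unmapped = 1;
}

static void free_pgtables_step(struct mm_model *mm)
{
	/* Page-table teardown needs exclusivity and must follow the
	 * unmapping. */
	assert(mm->lock == WRITE_LOCKED);
	assert(mm->vmas_unmapped);
	mm->pgtables_freed = 1;
}

static void exit_mmap_model(struct mm_model *mm)
{
	mm->lock = READ_LOCKED;
	unmap_vmas_step(mm);
	mm->lock = UNLOCKED;

	/* The write lock is taken only after the bulk of the work. */
	mm->lock = WRITE_LOCKED;
	free_pgtables_step(mm);
	mm->lock = UNLOCKED;
}
```

In the real kernel the read-to-write transition would of course be an
mmap_read_unlock() followed by mmap_write_lock(), since rw_semaphores
have no in-place upgrade.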
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap
2021-12-09 16:47 ` Michal Hocko
@ 2021-12-09 17:06 ` Suren Baghdasaryan
2021-12-16 2:26 ` Suren Baghdasaryan
0 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-09 17:06 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > Do we want this on top?
> >
> > As we discussed in this thread
> > https://lore.kernel.org/all/YY4snVzZZZYhbigV@dhcp22.suse.cz,
> > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > unmap pages in parallel with exit_mmap without blocking each other.
> > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > and has a negative impact on performance. So the conclusion of that
> > thread I thought was to keep that part. My understanding is that we
> > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > __oom_reap_task_mm would stay.
>
> OK, then we were talking past each other, I am afraid. I really wanted
> to get rid of this oom-specific stuff from exit_mmap. It was there out
> of necessity. With proper locking we can finally get rid of the crud.
> As I've said previously, oom reaping has never been a hot path.
>
> If we really want to optimize this path then I would much rather see a
> generic solution which would allow moving the write lock down after
> unmap_vmas. That would require the oom reaper to be able to handle
> mlocked memory.
Ok, let's work on that and when that's done we can get rid of the oom
stuff in exit_mmap. I'll look into this over the weekend and will
likely be back with questions.
Thanks!
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap
2021-12-09 17:06 ` Suren Baghdasaryan
@ 2021-12-16 2:26 ` Suren Baghdasaryan
[not found] ` <Ybsn2hJZXRofwuv+@cmpxchg.org>
0 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2021-12-16 2:26 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, rientjes, willy, hannes, guro, riel, minchan, kirill,
aarcange, christian, hch, oleg, david, jannh, shakeelb, luto,
christian.brauner, fweimer, jengelh, timmurray, linux-mm,
linux-kernel, kernel-team
On Thu, Dec 9, 2021 at 9:06 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Dec 9, 2021 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 09-12-21 08:24:04, Suren Baghdasaryan wrote:
> > > On Thu, Dec 9, 2021 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > Do we want this on top?
> > >
> > > As we discussed in this thread
> > > https://lore.kernel.org/all/YY4snVzZZZYhbigV@dhcp22.suse.cz,
> > > __oom_reap_task_mm in exit_mmap allows oom-reaper/process_mrelease to
> > > unmap pages in parallel with exit_mmap without blocking each other.
> > > Removal of __oom_reap_task_mm from exit_mmap prevents this parallelism
> > > and has a negative impact on performance. So the conclusion of that
> > > thread I thought was to keep that part. My understanding is that we
> > > also wanted to remove MMF_OOM_SKIP as a follow-up patch but
> > > __oom_reap_task_mm would stay.
> >
> > OK, then we were talking past each other, I am afraid. I really wanted
> > to get rid of this oom-specific stuff from exit_mmap. It was there out
> > of necessity. With proper locking we can finally get rid of the crud.
> > As I've said previously, oom reaping has never been a hot path.
> >
> > If we really want to optimize this path then I would much rather see a
> > generic solution which would allow moving the write lock down after
> > unmap_vmas. That would require the oom reaper to be able to handle
> > mlocked memory.
>
> Ok, let's work on that and when that's done we can get rid of the oom
> stuff in exit_mmap. I'll look into this over the weekend and will
> likely be back with questions.
As promised, I have a question:
Is there any particular reason why munlock_vma_pages_range clears
VM_LOCKED before munlocking the pages rather than after (see:
https://elixir.bootlin.com/linux/latest/source/mm/mlock.c#L424)? It
seems to me that if VM_LOCKED were cleared at the end (with proper
ordering), then __oom_reap_task_mm would correctly skip VM_LOCKED vmas.
https://lore.kernel.org/lkml/20180514064824.534798031@linuxfoundation.org/
has this explanation:
"Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
vm_flags before actually doing the munlock to determine if any other
vmas are locking the same memory, the check for VM_LOCKED in the oom
reaper is racy."
but "to determine if any other vmas are locking the same memory"
explanation eludes me... Any insights?
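The window in question can be shown with a toy user-space model.
struct vma_model and the helper names below are invented for
illustration; only the flag-before-pages ordering mirrors what
mm/mlock.c actually does:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_LOCKED 0x00002000UL	/* value borrowed from the kernel */

struct vma_model {
	unsigned long vm_flags;
	bool pages_mlocked;
};

/* Mirrors the ordering in munlock_vma_pages_range(): the VM_LOCKED
 * flag is cleared first... */
static void clear_vm_locked(struct vma_model *vma)
{
	vma->vm_flags &= ~VM_LOCKED;
}

/* ...and the pages are actually munlocked only afterwards. */
static void munlock_pages(struct vma_model *vma)
{
	vma->pages_mlocked = false;
}

/* The oom reaper skips any VMA whose VM_LOCKED flag is still set. */
static bool reaper_skips(const struct vma_model *vma)
{
	return (vma->vm_flags & VM_LOCKED) != 0;
}
```

Between clear_vm_locked() and munlock_pages(), reaper_skips() already
returns false even though the pages are still mlocked, which is the
race the quoted changelog describes.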
Thanks,
Suren.
> Thanks!
>
> > --
> > Michal Hocko
> > SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread