From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: mhocko@kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
kirill.shutemov@linux.intel.com, riel@redhat.com
Subject: Re: [PATCH] mm: Introduce i_mmap_lock_write_killable().
Date: Wed, 28 Mar 2018 21:26:43 +0900 [thread overview]
Message-ID: <201803282126.GBC56799.tOFVFSOHJLFOQM@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20180328110513.GH9275@dhcp22.suse.cz>
Michal Hocko wrote:
> > > I am not saying this is wrong, I would have to think about that much
> > > more because mmap_sem tends to be used on many surprising places and the
> > > write lock just hide them all.
> >
> > Then, an alternative approach which interrupts without downgrading is shown
> > below. But I'm not sure.
>
> Failing the whole dup_mmap might be quite reasonable, yes. I haven't
> checked your particular patch because this code path needs much more
> time than I can give this, though.
I think that interrupting at
diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..851c675 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -514,6 +514,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
+ if (!retval && fatal_signal_pending(current))
+ retval = -EINTR;
+
if (retval)
goto out;
}
--
is safe because there is no difference (except the error code) between above
change and hitting "goto fail_nomem;" path after "mpnt = mpnt->vm_next;".
Therefore, I think that interrupting at
diff --git a/kernel/fork.c b/kernel/fork.c
index 851c675..2706acc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -508,7 +508,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
rb_parent = &tmp->vm_rb;
mm->map_count++;
- if (!(tmp->vm_flags & VM_WIPEONFORK))
+ if (fatal_signal_pending(current))
+ retval = -EINTR;
+ else if (!(tmp->vm_flags & VM_WIPEONFORK))
retval = copy_page_range(mm, oldmm, mpnt);
if (tmp->vm_ops && tmp->vm_ops->open)
--
is also safe because handling of copy_page_range() failure is already
scheduled by mmput().
Thus, I think that there are locations where it is known to be safely and consistently
interruptible inside "for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next)" loop.
> On Wed 28-03-18 19:23:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 27-03-18 20:19:30, Tetsuo Handa wrote:
> > > > If the OOM victim is holding mm->mmap_sem held for write, and if the OOM
> > > > victim can interrupt operations which need mm->mmap_sem held for write,
> > > > we can downgrade mm->mmap_sem upon SIGKILL and the OOM reaper will be
> > > > able to reap the OOM victim's memory.
> > >
> > > This really begs for much better explanation. Why is it safe?
> >
> > Basic idea is
> >
> > bool downgraded = false;
> > down_write(mmap_sem);
> > for (something_1_that_might_depend_mmap_sem_held_for_write;
> > something_2_that_might_depend_mmap_sem_held_for_write;
> > something_3_that_might_depend_mmap_sem_held_for_write) {
> > something_4_that_might_depend_mmap_sem_held_for_write();
> > if (fatal_signal_pending(current)) {
> > downgrade_write(mmap_sem);
> > downgraded = true;
> > break;
> > }
> > something_5_that_might_depend_mmap_sem_held_for_write();
> > }
> > if (!downgraded)
> > up_write(mmap_sem);
> > else
> > up_read(mmap_sem);
> >
> > . That is, try to interrupt critical sections at locations where it is
> > known to be safe and consistent.
>
> Please explain why those places are safe to interrupt.
Because (regarding the downgrade_write() approach), as far as I know,
the current thread does not access memory which needs to be protected with
mmap_sem held for write.
>
> > > Are you
> > > assuming that the killed task will not perform any changes on the
> > > address space?
> >
> > If somebody drops mmap_sem held for write is not safe, how can the OOM
> > reaper work safely?
> >
> > The OOM reaper is assuming that the thread who got mmap_sem held for write
> > is responsible to complete critical sections before dropping mmap_sem held
> > for write, isn't it?
> >
> > Then, how an attempt to perform changes on the address space can become a
> > problem given that the thread who got mmap_sem held for write is responsible
> > to complete critical sections before dropping mmap_sem held for write?
>
> ENOPARSE. How does this have anything to do with oom_reaper.
The oom_reaper can work safely as long as mmap_sem held for write is released
at safely and consistently interruptible locations.
> Sure you
> want to _help_ the oom_reaper to do its job but you are dropping the
> lock in the downgrading the lock in the middle of dup_mmap and that is
> what we are dicussing here.
Yes.
> So please explain why it is safe. It is
> really not straightforward.
So, please explain why it is not safe. How can releasing mmap_sem held for
write at safely and consistently interruptible locations be not safe?
>
> > > What about ongoing page faults or other operations deeper
> > > in the call chain.
> >
> > Even if there are ongoing page faults or other operations deeper in the call
> > chain, there should be no problem as long as the thread who got mmap_sem
> > held for write is responsible to complete critical sections before dropping
> > mmap_sem held for write.
> >
> > > Why they are safe to change things for the child
> > > during the copy?
> >
> > In this patch, the current thread who got mmap_sem held for write (which is
> > likely an OOM victim thread) downgrades mmap_sem, with an assumption that
> > current thread no longer accesses memory which might depend on mmap_sem held
> > for write.
> >
> > dup_mmap() duplicates current->mm and to-be-duplicated mm is not visible yet.
> > If dup_mmap() failed, to-be-duplicated incomplete mm is discarded via mmput()
> > in dup_mm() rather than assigned to the child. Thus, this patch should not
> > change things which are visible to the child during the copy.
> >
> > What we need to be careful is making changes to current->mm.
> > I'm assuming that current->mm->mmap_sem held for read is enough for
> > i_mmap_lock_write()/flush_dcache_mmap_lock()/vma_interval_tree_insert_after()/
> > flush_dcache_mmap_unlock()/i_mmap_unlock_write()/is_vm_hugetlb_page()/
> > reset_vma_resv_huge_pages()/__vma_link_rb(). But I'm not sure.
>
> But as soon as you downgrade the lock then all other threads can
> interfere and perform page faults or update respecive mappings. Does
> this matter? If not then why?
>
Why does this matter?
I don't know what "update respecive mappings" means.
Is that about mmap()/munmap() which need mmap_sem held for write?
Since mmap_sem is still held for read, operations which needs
mmap_sem held for write cannot happen.
Anyway, as long as I downgrade the mmap_sem at safely and consistently
interruptible locations, there cannot be a problem.
next prev parent reply other threads:[~2018-03-28 12:26 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-27 11:19 Tetsuo Handa
2018-03-27 14:52 ` Michal Hocko
2018-03-28 10:23 ` Tetsuo Handa
2018-03-28 11:05 ` Michal Hocko
2018-03-28 12:26 ` Tetsuo Handa [this message]
2018-03-28 12:39 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=201803282126.GBC56799.tOFVFSOHJLFOQM@I-love.SAKURA.ne.jp \
--to=penguin-kernel@i-love.sakura.ne.jp \
--cc=akpm@linux-foundation.org \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=riel@redhat.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox