Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

From: Mauricio Faria de Oliveira <mfo@canonical.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>, Yu Zhao <yuzhao@google.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-block@vger.kernel.org,
	 Miaohe Lin <linmiaohe@huawei.com>,
	Yang Shi <shy828301@gmail.com>
Subject: Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Date: Thu, 13 Jan 2022 11:54:36 -0300
Message-ID: <CAO9xwp1QbmLNA6her4HveuBOZSwcNY5jZqtc00XQ0=V=HEV6Aw@mail.gmail.com>
In-Reply-To: <87ilunu8au.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jan 13, 2022 at 9:30 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> "Huang, Ying" <ying.huang@intel.com> writes:
>
> > Minchan Kim <minchan@kernel.org> writes:
> >
> >> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> >>> Hi Minchan Kim,
> >>>
> >>> Thanks for handling the hard questions! :)
> >>>
> >>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> >>> >
> >>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> >>> > > Yu Zhao <yuzhao@google.com> writes:
> >>> > >
> >>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> >>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > > >> index 163ac4e6bcee..8671de473c25 100644
> >>> > > >> --- a/mm/rmap.c
> >>> > > >> +++ b/mm/rmap.c
> >>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> > > >>
> >>> > > >>                    /* MADV_FREE page check */
> >>> > > >>                    if (!PageSwapBacked(page)) {
> >>> > > >> -                          if (!PageDirty(page)) {
> >>> > > >> +                          int ref_count = page_ref_count(page);
> >>> > > >> +                          int map_count = page_mapcount(page);
> >>> > > >> +
> >>> > > >> +                          /*
> >>> > > >> +                           * The only page refs must be from the isolation
> >>> > > >> +                           * (checked by the caller shrink_page_list() too)
> >>> > > >> +                           * and one or more rmap's (dropped by discard:).
> >>> > > >> +                           *
> >>> > > >> +                           * Check the reference count before dirty flag
> >>> > > >> +                           * with memory barrier; see __remove_mapping().
> >>> > > >> +                           */
> >>> > > >> +                          smp_rmb();
> >>> > > >> +                          if ((ref_count - 1 == map_count) &&
> >>> > > >> +                              !PageDirty(page)) {
> >>> > > >>                                    /* Invalidate as we cleared the pte */
> >>> > > >>                                    mmu_notifier_invalidate_range(mm,
> >>> > > >>                                            address, address + PAGE_SIZE);
> >>> > > >
> >>> > > > Out of curiosity, how does it work with COW in terms of reordering?
> >>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> >>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> >>> > > > is seen first, and direct io is holding a refcnt, this check can still
> >>> > > > pass?
> >>> > >
> >>> > > I think that you are correct.
> >>> > >
> >>> > > After more thought, it appears very tricky to compare page count and
> >>> > > map count.  Even if we add smp_rmb() between page_ref_count() and
> >>> > > page_mapcount(), an interrupt may happen between them.  During the
> >>> > > interrupt, the page count and map count may be changed, for example,
> >>> > > by an unmap or by do_swap_page().
> >>> >
> >>> > Yeah, it happens, but what specific problem from the count changing
> >>> > under a race are you concerned about? The fork case Yu pointed out is
> >>> > already known to break DIO, so users should take care not to fork
> >>> > under DIO (please see the O_DIRECT section in man 2 open). If you
> >>> > could give a specific example, it would be great to think over the issue.
> >>> >
> >>> > I agree it's a little tricky, but it seems to be the way other places
> >>> > have done it for a long time (please look at write_protect_page() in ksm.c).
> >>>
> >>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
> >>>
> >>> > So, what we are missing here is a TLB flush before the check.
> >>>
> >>> That shouldn't be required for this particular issue/case, IIUC.
> >>> One of the things we checked early on was disabling deferred TLB flush
> >>> (similarly to what you've done), and it didn't help with the issue; also, the
> >>> issue happens in uniprocessor mode too (thus no remote CPU involved).
> >>
> >> I guess you didn't try it with page_mapcount + 1 == page_count at that
> >> time?  Anyway, I agree we don't need a TLB flush here, unlike KSM.
> >> I think the reason KSM does a TLB flush before the check is to
> >> make sure a write from a user process on another core triggers a trap.
> >> However, in this MADV_FREE case, the HW already guarantees the trap.
> >> Please see below.
> >>
> >>>
> >>>
> >>> >
> >>> > Something like this.
> >>> >
> >>> > diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > index b0fd9dc19eba..b4ad9faa17b2 100644
> >>> > --- a/mm/rmap.c
> >>> > +++ b/mm/rmap.c
> >>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> >
> >>> >                         /* MADV_FREE page check */
> >>> >                         if (!PageSwapBacked(page)) {
> >>> > -                               int refcount = page_ref_count(page);
> >>> > -
> >>> > -                               /*
> >>> > -                                * The only page refs must be from the isolation
> >>> > -                                * (checked by the caller shrink_page_list() too)
> >>> > -                                * and the (single) rmap (dropped by discard:).
> >>> > -                                *
> >>> > -                                * Check the reference count before dirty flag
> >>> > -                                * with memory barrier; see __remove_mapping().
> >>> > -                                */
> >>> > -                               smp_rmb();
> >>> > -                               if (refcount == 2 && !PageDirty(page)) {
> >>> > +                               if (!PageDirty(page) &&
> >>> > +                                       page_mapcount(page) + 1 == page_count(page)) {
> >>>
> >>> In the interest of avoiding a different race/bug, it seemed worth following the
> >>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> >>> after the page's reference count, with a memory barrier in between.
> >>
> >> True, so your patch as-is looks good to me.
> >
> > If my understanding is correct, a shared anonymous page will be mapped
> > read-only.  If so, can SetPageDirty() be called on a private anonymous
> > page concurrently, after the direct IO case has been dealt with
> > via comparing page_count()/page_mapcount()?
>
> Sorry, I found that I am not quite right here.  When direct IO read
> completes, it will call SetPageDirty() and put_page() finally.  And
> MADV_FREE in try_to_unmap_one() needs to deal with that too.
>
> Checking direct IO code, it appears that set_page_dirty_lock() is used
> to set page dirty, which will use lock_page().
>
>   dio_bio_complete
>     bio_check_pages_dirty
>       bio_dirty_fn  /* through workqueue */
>         bio_release_pages
>           set_page_dirty_lock
>     bio_release_pages
>       set_page_dirty_lock
>
> So in theory, for direct IO, the memory barrier may be unnecessary.  But
> I don't think it's a good idea to depend on this specific behavior of
> direct IO.  The original code with memory barrier looks more generic and
> I don't think it will introduce visible overhead.
>

Thanks for all the considerations/thought process with potential corner cases!

Regarding the overhead, agreed; and this is in memory reclaim, which isn't a
fast path (and even if it's under direct reclaim, things have slowed down
already), so that would seem to be fine.
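
For reference, the ordering argument (check the reference count first, then
the dirty flag, with a read barrier in between) can be modeled in userspace.
This is only a sketch with C11 atomics, not kernel code: struct model_page
and the model_*() helpers are hypothetical names, and the acquire fence merely
approximates smp_rmb(). The completion side is assumed to set the dirty flag
before dropping its reference, as in the call chain quoted above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel type. */
struct model_page {
	atomic_int ref_count;	/* models page_ref_count() */
	atomic_int map_count;	/* models page_mapcount() */
	atomic_bool dirty;	/* models PageDirty() */
};

static void model_page_init(struct model_page *p, int ref_count,
			    int map_count, bool dirty)
{
	/* Every mapping holds a reference, so refs cannot be fewer. */
	assert(ref_count >= map_count);
	atomic_init(&p->ref_count, ref_count);
	atomic_init(&p->map_count, map_count);
	atomic_init(&p->dirty, dirty);
}

/*
 * Models direct IO read completion: mark the page dirty first, then
 * drop the extra reference.  The release ordering on the decrement
 * keeps the dirty store from being reordered after it.
 */
static void model_dio_complete(struct model_page *p)
{
	atomic_store_explicit(&p->dirty, true, memory_order_relaxed);
	atomic_fetch_sub_explicit(&p->ref_count, 1, memory_order_release);
}

/*
 * Models the MADV_FREE check: read the counts, then a barrier, then
 * the dirty flag.  If the reader observes the reference already
 * dropped, it also observes the dirty flag set before the drop.
 */
static bool model_can_discard(struct model_page *p)
{
	int ref_count = atomic_load_explicit(&p->ref_count,
					     memory_order_relaxed);
	int map_count = atomic_load_explicit(&p->map_count,
					     memory_order_relaxed);

	/* Approximates smp_rmb(): count loads before the dirty load. */
	atomic_thread_fence(memory_order_acquire);

	/* Only the isolation ref plus one ref per mapping may remain. */
	return (ref_count - 1 == map_count) &&
	       !atomic_load_explicit(&p->dirty, memory_order_relaxed);
}
```

In this model, a page with an in-flight direct IO reference fails the count
comparison, and once the IO completes it fails the dirty check instead, so it
is not discarded in either state.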

cheers,

> Best Regards,
> Huang, Ying



-- 
Mauricio Faria de Oliveira

Thread overview: 27+ messages
2022-01-05 23:34 Mauricio Faria de Oliveira
2022-01-06 23:15 ` Minchan Kim
2022-01-07  0:11   ` Yang Shi
2022-01-07  1:08     ` Yang Shi
2022-01-11  1:34   ` Huang, Ying
2022-01-11  6:48 ` Yu Zhao
2022-01-11 18:54   ` Minchan Kim
2022-01-11 19:29     ` John Hubbard
2022-01-11 20:20       ` Minchan Kim
2022-01-11 20:21         ` Minchan Kim
2022-01-11 21:59           ` Minchan Kim
2022-01-11 23:38             ` John Hubbard
2022-01-12  0:01               ` Minchan Kim
2022-01-12  1:46   ` Huang, Ying
2022-01-12 17:33     ` Minchan Kim
2022-01-12 21:53       ` Mauricio Faria de Oliveira
2022-01-12 22:37         ` Minchan Kim
2022-01-13  8:54           ` Huang, Ying
2022-01-13 12:30             ` Huang, Ying
2022-01-13 14:54               ` Mauricio Faria de Oliveira [this message]
2022-01-13 14:30           ` Mauricio Faria de Oliveira
2022-01-13  7:29         ` Yu Zhao
2022-01-14  0:35           ` Minchan Kim
2022-01-31 23:10             ` Mauricio Faria de Oliveira
2022-01-13  5:47       ` Huang, Ying
2022-01-13  6:37         ` Miaohe Lin
2022-01-13  8:04           ` Huang, Ying