From: "Huang, Ying" <ying.huang@intel.com>
To: Minchan Kim <minchan@kernel.org>
Cc: Mauricio Faria de Oliveira <mfo@canonical.com>,
Yu Zhao <yuzhao@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-block@vger.kernel.org,
Miaohe Lin <linmiaohe@huawei.com>,
Yang Shi <shy828301@gmail.com>
Subject: Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Date: Thu, 13 Jan 2022 20:30:49 +0800
Message-ID: <87ilunu8au.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87zgo0t3qz.fsf@yhuang6-desk2.ccr.corp.intel.com> (Ying Huang's message of "Thu, 13 Jan 2022 16:54:28 +0800")
"Huang, Ying" <ying.huang@intel.com> writes:
> Minchan Kim <minchan@kernel.org> writes:
>
>> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
>>> Hi Minchan Kim,
>>>
>>> Thanks for handling the hard questions! :)
>>>
>>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
>>> >
>>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>> > > Yu Zhao <yuzhao@google.com> writes:
>>> > >
>>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
>>> > > >> index 163ac4e6bcee..8671de473c25 100644
>>> > > >> --- a/mm/rmap.c
>>> > > >> +++ b/mm/rmap.c
>>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> > > >>
>>> > > >>          /* MADV_FREE page check */
>>> > > >>          if (!PageSwapBacked(page)) {
>>> > > >> -                if (!PageDirty(page)) {
>>> > > >> +                int ref_count = page_ref_count(page);
>>> > > >> +                int map_count = page_mapcount(page);
>>> > > >> +
>>> > > >> +                /*
>>> > > >> +                 * The only page refs must be from the isolation
>>> > > >> +                 * (checked by the caller shrink_page_list() too)
>>> > > >> +                 * and one or more rmap's (dropped by discard:).
>>> > > >> +                 *
>>> > > >> +                 * Check the reference count before dirty flag
>>> > > >> +                 * with memory barrier; see __remove_mapping().
>>> > > >> +                 */
>>> > > >> +                smp_rmb();
>>> > > >> +                if ((ref_count - 1 == map_count) &&
>>> > > >> +                    !PageDirty(page)) {
>>> > > >>                          /* Invalidate as we cleared the pte */
>>> > > >>                          mmu_notifier_invalidate_range(mm,
>>> > > >>                                  address, address + PAGE_SIZE);
>>> > > >
>>> > > > Out of curiosity, how does it work with COW in terms of reordering?
>>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
>>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
>>> > > > is seen first, and direct io is holding a refcnt, this check can still
>>> > > > pass?
>>> > >
>>> > > I think that you are correct.
>>> > >
>>> > > After more thoughts, it appears very tricky to compare page count and
>>> > > map count. Even if we have added smp_rmb() between page_ref_count() and
>>> > > page_mapcount(), an interrupt may happen between them. During the
>>> > > interrupt, the page count and map count may be changed, for example,
>>> > > unmapped, or do_swap_page().
>>> >
>>> > Yeah, it happens, but what specific problem are you concerned about from
>>> > the count change under the race? The fork case Yu pointed out was already
>>> > known to break DIO, so the user should take care not to fork under DIO
>>> > (please look at the O_DIRECT section in man 2 open). If you could give a
>>> > specific example, it would be great to think over the issue.
>>> >
>>> > I agree it's a little tricky, but it seems to be the way other places have
>>> > done it for a long time (please look at write_protect_page() in ksm.c).
>>>
>>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
>>>
>>> > So, here what we missing is tlb flush before the checking.
>>>
>>> That shouldn't be required for this particular issue/case, IIUIC.
>>> One of the things we checked early on was disabling deferred TLB flush
>>> (similarly to what you've done), and it didn't help with the issue; also, the
>>> issue happens on uniprocessor mode too (thus no remote CPU involved.)
>>
>> I guess you didn't try it with page_mapcount + 1 == page_count at that
>> time? Anyway, I agree we don't need a TLB flush here, unlike KSM.
>> I think the reason KSM does a TLB flush before the check is to
>> make sure a trap is triggered on a write from a user process on another
>> core. However, in this MADV_FREE case, the HW already guarantees the trap.
>> Please see below.
>>
>>>
>>>
>>> >
>>> > Something like this.
>>> >
>>> > diff --git a/mm/rmap.c b/mm/rmap.c
>>> > index b0fd9dc19eba..b4ad9faa17b2 100644
>>> > --- a/mm/rmap.c
>>> > +++ b/mm/rmap.c
>>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> >
>>> >          /* MADV_FREE page check */
>>> >          if (!PageSwapBacked(page)) {
>>> > -                int refcount = page_ref_count(page);
>>> > -
>>> > -                /*
>>> > -                 * The only page refs must be from the isolation
>>> > -                 * (checked by the caller shrink_page_list() too)
>>> > -                 * and the (single) rmap (dropped by discard:).
>>> > -                 *
>>> > -                 * Check the reference count before dirty flag
>>> > -                 * with memory barrier; see __remove_mapping().
>>> > -                 */
>>> > -                smp_rmb();
>>> > -                if (refcount == 2 && !PageDirty(page)) {
>>> > +                if (!PageDirty(page) &&
>>> > +                    page_mapcount(page) + 1 == page_count(page)) {
>>>
>>> In the interest of avoiding a different race/bug, it seemed worth following the
>>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
>>> after the page's reference count, with a memory barrier in between.
>>
>> True so it means your patch as-is is good for me.
>
> If my understanding is correct, a shared anonymous page will be mapped
> read-only. If so, can SetPageDirty() be called concurrently on a private
> anonymous page after the direct IO case has been dealt with by
> comparing page_count()/page_mapcount()?

Sorry, I found that I was not quite right here. When a direct IO read
completes, it finally calls SetPageDirty() and put_page(), and the
MADV_FREE handling in try_to_unmap_one() needs to deal with that too.

Checking the direct IO code, it appears that set_page_dirty_lock() is used
to set the page dirty, and it takes the page lock via lock_page():

  dio_bio_complete
    bio_check_pages_dirty
      bio_dirty_fn  /* through workqueue */
        bio_release_pages
          set_page_dirty_lock
    bio_release_pages
      set_page_dirty_lock

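To make the locking argument concrete, here is a minimal userspace sketch in C11. It is only a model under stated assumptions: struct locked_page and the model_*() helpers are hypothetical names, not kernel APIs, and an atomic_flag stands in for the page lock. It also assumes, as I understand the reclaim path, that try_to_unmap_one() runs with the page already locked. Unlike the real set_page_dirty_lock(), which sleeps in lock_page(), the model uses a non-blocking trylock so the effect can be observed in a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; NOT the kernel's struct page. */
struct locked_page {
	atomic_flag lock;	/* models the page lock (PG_locked) */
	bool dirty;		/* models PageDirty(); guarded by lock here */
};

/* Try to take the modelled page lock; returns true on success. */
static bool model_trylock_page(struct locked_page *p)
{
	return !atomic_flag_test_and_set_explicit(&p->lock,
						  memory_order_acquire);
}

/* Release the modelled page lock. */
static void model_unlock_page(struct locked_page *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Non-blocking model of set_page_dirty_lock(): the dirty flag is only
 * set while the lock is held. Returns false if someone else (e.g. the
 * modelled reclaim path) currently holds the lock. The real function
 * would sleep instead of failing.
 */
static bool model_try_set_page_dirty_lock(struct locked_page *p)
{
	if (!model_trylock_page(p))
		return false;
	p->dirty = true;
	model_unlock_page(p);
	return true;
}
```

In this model, while the "reclaim" side holds the lock, the dirty flag cannot change underneath it, which is the property the call chain above suggests for direct IO.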
So in theory, for direct IO, the memory barrier may be unnecessary. But
I don't think it's a good idea to depend on this specific behavior of
direct IO. The original code with the memory barrier looks more generic,
and I don't think it will introduce visible overhead.

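For illustration, the generic ordering (read the reference count first, then the dirty flag, with a read barrier in between) can be modelled in userspace C11 roughly as below. This is a sketch under assumptions, not kernel code: struct fake_page and the fake_*() helpers are hypothetical, an acquire fence stands in for smp_rmb(), and a release decrement stands in for the put_page() on the completion side.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; NOT the kernel's struct page. */
struct fake_page {
	atomic_int refcount;	/* models page_ref_count() */
	atomic_int mapcount;	/* models page_mapcount() */
	atomic_bool dirty;	/* models PageDirty() */
};

static void fake_page_init(struct fake_page *p, int refs, int maps)
{
	atomic_init(&p->refcount, refs);
	atomic_init(&p->mapcount, maps);
	atomic_init(&p->dirty, false);
}

/*
 * Writer side, modelling direct IO read completion: mark the page
 * dirty first, then drop the IO reference with release semantics.
 */
static void fake_dio_complete(struct fake_page *p)
{
	atomic_store_explicit(&p->dirty, true, memory_order_relaxed);
	atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release);
}

/*
 * Reader side, modelling the MADV_FREE check: sample the counts, issue
 * a barrier, and only then look at the dirty flag. If the reader sees
 * the dropped reference, it is also guaranteed to see the dirty flag.
 */
static bool fake_can_discard(struct fake_page *p)
{
	int ref = atomic_load_explicit(&p->refcount, memory_order_relaxed);
	int map = atomic_load_explicit(&p->mapcount, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire); /* stand-in for smp_rmb() */
	return (ref - 1 == map) &&
	       !atomic_load_explicit(&p->dirty, memory_order_relaxed);
}
```

With an extra reference held (3 refs, 1 mapping) the check refuses to discard, and after fake_dio_complete() it still refuses because the page is now dirty. Only a clean page whose references are exactly the isolation plus the mappings passes.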
Best Regards,
Huang, Ying