From: "Huang, Ying" <ying.huang@intel.com>
To: Miaohe Lin <linmiaohe@huawei.com>
Cc: Yu Zhao <yuzhao@google.com>,
Mauricio Faria de Oliveira <mfo@canonical.com>,
Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
<linux-block@vger.kernel.org>, Yang Shi <shy828301@gmail.com>,
Minchan Kim <minchan@kernel.org>
Subject: Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Date: Thu, 13 Jan 2022 16:04:10 +0800
Message-ID: <87ee5cukn9.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <c7c7c839-f8b9-4864-33ea-37b95d935fe8@huawei.com> (Miaohe Lin's message of "Thu, 13 Jan 2022 14:37:15 +0800")
Miaohe Lin <linmiaohe@huawei.com> writes:
> On 2022/1/13 13:47, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>>> Yu Zhao <yuzhao@google.com> writes:
>>>>
>>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 163ac4e6bcee..8671de473c25 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>>
>>>>>> /* MADV_FREE page check */
>>>>>> if (!PageSwapBacked(page)) {
>>>>>> - if (!PageDirty(page)) {
>>>>>> + int ref_count = page_ref_count(page);
>>>>>> + int map_count = page_mapcount(page);
>>>>>> +
>>>>>> + /*
>>>>>> + * The only page refs must be from the isolation
>>>>>> + * (checked by the caller shrink_page_list() too)
>>>>>> + * and one or more rmap's (dropped by discard:).
>>>>>> + *
>>>>>> + * Check the reference count before dirty flag
>>>>>> + * with memory barrier; see __remove_mapping().
>>>>>> + */
>>>>>> + smp_rmb();
>>>>>> + if ((ref_count - 1 == map_count) &&
>>>>>> + !PageDirty(page)) {
>>>>>> /* Invalidate as we cleared the pte */
>>>>>> mmu_notifier_invalidate_range(mm,
>>>>>> address, address + PAGE_SIZE);
>>>>>
>>>>> Out of curiosity, how does it work with COW in terms of reordering?
>>>>> Specifically, it seems to me get_page() and page_dup_rmap() in
>>>>> copy_present_pte() can happen in any order, and if page_dup_rmap()
>>>>> is seen first, and direct io is holding a refcnt, this check can still
>>>>> pass?
>>>>
>>>> I think that you are correct.
>>>>
>>>> After more thoughts, it appears very tricky to compare page count and
>>>> map count. Even if we have added smp_rmb() between page_ref_count() and
>>>> page_mapcount(), an interrupt may happen between them. During the
>>>> interrupt, the page count and map count may be changed, for example,
>>>> unmapped, or do_swap_page().
>>>
>>> Yeah, it happens but what specific problem are you concerning from the
>>> count change under race? The fork case Yu pointed out was already known
>>> for breaking DIO so user should take care not to fork under DIO(Please
>>> look at O_DIRECT section in man 2 open). If you could give a specific
>>> example, it would be great to think over the issue.
>>
>> Whether is the following race possible?
>>
>> CPU0/Process A                  CPU1/Process B
>> --------------                  --------------
>> try_to_unmap_one
>>   page_mapcount()
>>                                 zap_pte_range()
>>                                   page_remove_rmap()
>>                                     atomic_add_negative(-1, &page->_mapcount)
>>                                   tlb_flush_mmu()
>>                                     ...
>>                                       put_page_testzero()
>>   page_count()
>>
>
> It seems they're under the same page table Lock.

The race above is for a page shared by two processes (Process A and B
above). But you reminded me that an anonymous page cannot be shared
between multiple processes after direct IO: the direct IO read un-shares
the page, fork() isn't allowed, and I guess KSM isn't allowed either.

So the above race isn't possible. Sorry for the confusion.
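
For reference, here is a small userspace model of the hypothetical
interleaving quoted above, only to show why the read order would matter
if a page could be shared while direct IO holds a reference. It is a
sketch under assumptions, not kernel code: struct model_page,
model_unmap() and the two check functions are made-up names, C11 atomics
stand in for the page counters, and the acquire fence only approximates
smp_rmb(). The hook argument replays Process B's unmap between the two
reads.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the fields used here. */
struct model_page {
	atomic_int refcount;	/* models page_count() */
	atomic_int mapcount;	/* models page_mapcount() */
	bool dirty;		/* models PageDirty() */
};

typedef void (*race_hook)(struct model_page *);

/* Models Process B: one mapping goes away together with its reference. */
static void model_unmap(struct model_page *p)
{
	atomic_fetch_sub(&p->mapcount, 1);
	atomic_fetch_sub(&p->refcount, 1);
}

/* Page mapped by two processes, isolated by reclaim, pinned by direct IO. */
static void init_shared_pinned_page(struct model_page *p)
{
	atomic_init(&p->refcount, 4);	/* 2 rmaps + 1 isolation + 1 DIO pin */
	atomic_init(&p->mapcount, 2);
	p->dirty = false;
}

/* Order from the diagram: mapcount is read first, so it can go stale. */
static bool check_mapcount_first(struct model_page *p, race_hook hook)
{
	int mapcount = atomic_load(&p->mapcount);

	if (hook)
		hook(p);	/* the other CPU runs between the two reads */

	int count = atomic_load(&p->refcount);

	return !p->dirty && mapcount + 1 == count;
}

/* Suggested order: refcount first, barrier, then mapcount. */
static bool check_count_first(struct model_page *p, race_hook hook)
{
	int count = atomic_load_explicit(&p->refcount, memory_order_relaxed);

	/* Approximates smp_rmb(): the refcount load stays before later loads. */
	atomic_thread_fence(memory_order_acquire);

	if (hook)
		hook(p);	/* the other CPU runs between the two reads */

	int mapcount = atomic_load_explicit(&p->mapcount, memory_order_relaxed);

	return !p->dirty && mapcount + 1 == count;
}
```

In this model, check_mapcount_first(&page, model_unmap) sees a stale
mapcount of 2 and a count of 3, so the check passes although the direct
IO reference is still held. check_count_first(&page, model_unmap) sees a
count of 4 and a mapcount of 1, so the check fails and the page is kept.
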
Best Regards,
Huang, Ying
> Thanks.
>
>> Previously I thought that there's similar race in do_swap_page(). But
>> after more thoughts, I found that the page is locked in do_swap_page().
>> So do_swap_page() is safe. Per my understanding, except during fork()
>> as Yu pointed out, the anonymous page must be locked before increasing
>> its mapcount.
>>
>> So, if the above race is possible, we need to guarantee to read
>> page_count() before page_mapcount(). That is, something as follows,
>>
>>         count = page_count();
>>         smp_rmb();
>>         mapcount = page_mapcount();
>>         if (!PageDirty(page) && mapcount + 1 == count) {
>>                 ...
>>         }
>>
>> Best Regards,
>> Huang, Ying
>>
>>> I agree it's little tricky but it seems to be way other place has used
>>> for a long time(Please look at write_protect_page in ksm.c).
>>> So, here what we missing is tlb flush before the checking.
>>>
>>> Something like this.
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index b0fd9dc19eba..b4ad9faa17b2 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>
>>> /* MADV_FREE page check */
>>> if (!PageSwapBacked(page)) {
>>> - int refcount = page_ref_count(page);
>>> -
>>> - /*
>>> - * The only page refs must be from the isolation
>>> - * (checked by the caller shrink_page_list() too)
>>> - * and the (single) rmap (dropped by discard:).
>>> - *
>>> - * Check the reference count before dirty flag
>>> - * with memory barrier; see __remove_mapping().
>>> - */
>>> - smp_rmb();
>>> - if (refcount == 2 && !PageDirty(page)) {
>>> + if (!PageDirty(page) &&
>>> + page_mapcount(page) + 1 == page_count(page)) {
>>> /* Invalidate as we cleared the pte */
>>> mmu_notifier_invalidate_range(mm,
>>> address, address + PAGE_SIZE);
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index f3162a5724de..6454ff5c576f 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>>> enum ttu_flags flags = TTU_BATCH_FLUSH;
>>> bool was_swapbacked = PageSwapBacked(page);
>>>
>>> + if (!was_swapbacked && PageAnon(page))
>>> + flags &= ~TTU_BATCH_FLUSH;
>>> +
>>> if (unlikely(PageTransHuge(page)))
>>> flags |= TTU_SPLIT_HUGE_PMD;
>> .
>>