From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: kirill.shutemov@linux.intel.com, hughd@google.com,
aarcange@redhat.com, akpm@linux-foundation.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: khugepaged: fix potential page state corruption
Date: Thu, 19 Mar 2020 09:57:47 -0700 [thread overview]
Message-ID: <e716c8c6-898e-5199-019c-161ea3ec06c3@linux.alibaba.com> (raw)
In-Reply-To: <20200319104938.vphyajoyz6ob6jtl@box>
On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>
>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>
>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>> be freed
>>>>> via pagevec or free_page_and_swap_cache(). But, the anonymous page may
>>>>> be added back to LRU, then it might result in the below race:
>>>>>
>>>>> CPU A CPU B
>>>>> khugepaged:
>>>>> unlock page
>>>>> putback_lru_page
>>>>> add to lru
>>>>> page reclaim:
>>>>> isolate this page
>>>>> try_to_unmap
>>>>> page_remove_rmap <-- corrupt _mapcount
>>>>>
>>>>> It looks nothing would prevent the pages from isolating by reclaimer.
>>>> Hm. Why should it?
>>>>
>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>> reclaim as it's not part of usual page table tree. Basically
>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>> khugepaged side.
>>> I don't quite get. What does "not part of usual page table tree" means?
>>>
>>> How's about try_to_unmap() acquires ptl before khugepaged?
> The page table we are dealing with was detached from the process' page
> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> pte.
>
> try_to_unmap() can only reach the ptl if split ptl is disabled
> (mm->page_table_lock is used), but it still will not be able to reach pte.
Aha, got it. Thanks for explaining. I definitely missed this point. Yes,
pmdp_collapse_flush() would clear the pmd, then others won't see the
page table.
However, it looks the vmscan would not stop at try_to_unmap() at all,
try_to_unmap() would just return true since pmd_present() should return
false in pvmw. Then it would go all the way down to __remove_mapping(),
but freezing the page would fail since try_to_unmap() doesn't actually
drop the refcount from the pte map.
It would not result in any critical problem AFAICT, but suboptimal and
it may causes some unnecessary I/O due to swap.
>
>>>> I don't see the issue right away.
>>>>
>>>>> The other problem is the page's active or unevictable flag might be
>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>> So what?
>>> The flags may leak to page free path then kernel may complain if
>>> DEBUG_VM is set.
> Could you elaborate on what codepath you are talking about?
__put_page ->
__put_single_page ->
free_unref_page ->
put_unref_page_prepare ->
free_pcp_prepare ->
free_pages_prepare ->
free_pages_check
This check would just be run when DEBUG_VM is enabled.
>
>>>>> The putback_lru_page() would not clear those two flags if the pages are
>>>>> released via pagevec, it sounds nothing prevents from isolating active
>> Sorry, this is a typo. If the page is freed via pagevec, active and
>> unevictable flag would get cleared before freeing by page_off_lru().
>>
>> But, if the page is freed by free_page_and_swap_cache(), these two flags are
>> not cleared. But, it seems this path is hit rare, the pages are freed by
>> pagevec for the most cases.
>>
>>>>> or unevictable pages.
>>>> Again, why should it? vmscan is equipped to deal with this.
>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>> unevictable pages since it just simply walks page table.
> Why it is wrong? lru_cache_add() only complains if both flags set, it
> shouldn't happen.
Noting wrong about isolating active or unevictable pages. I just mean it
seems possible active or unevictable flag may be there if the page is
freed via free_page_add_swap_cache() path.
>
>>>>> However I didn't really run into these problems, just in theory
>>>>> by visual
>>>>> inspection.
>>>>>
>>>>> And, it also seems unnecessary to have the pages add back to LRU
>>>>> again since
>>>>> they are about to be freed when reaching this point. So,
>>>>> clearing active
>>>>> and unevictable flags, unlocking and dropping refcount from isolate
>>>>> instead of calling putback_lru_page() as what page cache collapse does.
>>>> Hm? But we do call putback_lru_page() on the way out. I do not follow.
>>> It just calls putback_lru_page() at error path, not success path.
>>> Putting pages back to lru on error path definitely makes sense. Here it
>>> is the success path.
> I agree that putting the apage on LRU just before free the page is
> suboptimal, but I don't see it as a critical issue.
Yes, given the code analysis above, I agree. If you thought the patch is
a fine micro-optimization, I would like to re-submit it with rectified
commit log. Thank you for your time.
>
>
next prev parent reply other threads:[~2020-03-19 16:57 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-18 23:19 Yang Shi
2020-03-18 23:40 ` Kirill A. Shutemov
2020-03-18 23:47 ` Yang Shi
2020-03-19 0:12 ` Kirill A. Shutemov
2020-03-19 0:55 ` Yang Shi
2020-03-19 5:39 ` Yang Shi
2020-03-19 10:49 ` Kirill A. Shutemov
2020-03-19 16:57 ` Yang Shi [this message]
2020-03-19 17:22 ` Yang Shi
2020-03-20 11:45 ` Kirill A. Shutemov
2020-03-20 16:34 ` Yang Shi
2020-03-24 17:17 ` Yang Shi
2020-03-25 11:26 ` Kirill A. Shutemov
2020-03-25 18:42 ` Yang Shi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e716c8c6-898e-5199-019c-161ea3ec06c3@linux.alibaba.com \
--to=yang.shi@linux.alibaba.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hughd@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox