Re: [PATCH] mm: khugepaged: fix potential page state corruption

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: kirill.shutemov@linux.intel.com, hughd@google.com,
	aarcange@redhat.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: khugepaged: fix potential page state corruption
Date: Thu, 19 Mar 2020 09:57:47 -0700	[thread overview]
Message-ID: <e716c8c6-898e-5199-019c-161ea3ec06c3@linux.alibaba.com> (raw)
In-Reply-To: <20200319104938.vphyajoyz6ob6jtl@box>



On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>
>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>
>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>> be freed
>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>>>> be added back to LRU, then it might result in the below race:
>>>>>
>>>>>      CPU A                CPU B
>>>>> khugepaged:
>>>>>     unlock page
>>>>>     putback_lru_page
>>>>>       add to lru
>>>>>                  page reclaim:
>>>>>                    isolate this page
>>>>>                    try_to_unmap
>>>>>     page_remove_rmap <-- corrupt _mapcount
>>>>>
>>>>> It looks nothing would prevent the pages from isolating by reclaimer.
>>>> Hm. Why should it?
>>>>
>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>> reclaim as it's not part of usual page table tree. Basically
>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>> khugepaged side.
>>> I don't quite get. What does "not part of usual page table tree" means?
>>>
>>> How's about try_to_unmap() acquires ptl before khugepaged?
> The page table we are dealing with was detached from the process' page
> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> pte.
>
> try_to_unmap() can only reach the ptl if split ptl is disabled
> (mm->page_table_lock is used), but it still will not be able to reach pte.

Aha, got it. Thanks for explaining. I definitely missed this point. Yes, 
pmdp_collapse_flush() would clear the pmd, then others won't see the 
page table.

However, it looks the vmscan would not stop at try_to_unmap() at all, 
try_to_unmap() would just return true since pmd_present() should return 
false in pvmw. Then it would go all the way down to __remove_mapping(), 
but freezing the page would fail since try_to_unmap() doesn't actually 
drop the refcount from the pte map.

It would not result in any critical problem AFAICT, but suboptimal and 
it may causes some unnecessary I/O due to swap.

>
>>>> I don't see the issue right away.
>>>>
>>>>> The other problem is the page's active or unevictable flag might be
>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>> So what?
>>> The flags may leak to page free path then kernel may complain if
>>> DEBUG_VM is set.
> Could you elaborate on what codepath you are talking about?

__put_page ->
     __put_single_page ->
         free_unref_page ->
             put_unref_page_prepare ->
                 free_pcp_prepare ->
                     free_pages_prepare ->
                         free_pages_check

This check would just be run when DEBUG_VM is enabled.

>
>>>>> The putback_lru_page() would not clear those two flags if the pages are
>>>>> released via pagevec, it sounds nothing prevents from isolating active
>> Sorry, this is a typo. If the page is freed via pagevec, active and
>> unevictable flag would get cleared before freeing by page_off_lru().
>>
>> But, if the page is freed by free_page_and_swap_cache(), these two flags are
>> not cleared. But, it seems this path is hit rare, the pages are freed by
>> pagevec for the most cases.
>>
>>>>> or unevictable pages.
>>>> Again, why should it? vmscan is equipped to deal with this.
>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>> unevictable pages since it just simply walks page table.
> Why it is wrong? lru_cache_add() only complains if both flags set, it
> shouldn't happen.

Noting wrong about isolating active or unevictable pages. I just mean it 
seems possible active or unevictable flag may be there if the page is 
freed via free_page_add_swap_cache() path.

>
>>>>> However I didn't really run into these problems, just in theory
>>>>> by visual
>>>>> inspection.
>>>>>
>>>>> And, it also seems unnecessary to have the pages add back to LRU
>>>>> again since
>>>>> they are about to be freed when reaching this point.  So,
>>>>> clearing active
>>>>> and unevictable flags, unlocking and dropping refcount from isolate
>>>>> instead of calling putback_lru_page() as what page cache collapse does.
>>>> Hm? But we do call putback_lru_page() on the way out. I do not follow.
>>> It just calls putback_lru_page() at error path, not success path.
>>> Putting pages back to lru on error path definitely makes sense. Here it
>>> is the success path.
> I agree that putting the apage on LRU just before free the page is
> suboptimal, but I don't see it as a critical issue.

Yes, given the code analysis above, I agree. If you thought the patch is 
a fine micro-optimization, I would like to re-submit it with rectified 
commit log. Thank you for your time.

>
>

next prev parent reply	other threads:[~2020-03-19 16:57 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-18 23:19 Yang Shi
2020-03-18 23:40 ` Kirill A. Shutemov
2020-03-18 23:47   ` Yang Shi
2020-03-19  0:12 ` Kirill A. Shutemov
2020-03-19  0:55   ` Yang Shi
2020-03-19  5:39     ` Yang Shi
2020-03-19 10:49       ` Kirill A. Shutemov
2020-03-19 16:57         ` Yang Shi [this message]
2020-03-19 17:22           ` Yang Shi
2020-03-20 11:45           ` Kirill A. Shutemov
2020-03-20 16:34             ` Yang Shi
2020-03-24 17:17         ` Yang Shi
2020-03-25 11:26           ` Kirill A. Shutemov
2020-03-25 18:42             ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e716c8c6-898e-5199-019c-161ea3ec06c3@linux.alibaba.com \
    --to=yang.shi@linux.alibaba.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox