From: Vlastimil Babka <vbabka@suse.cz>
To: Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Rik van Riel <riel@redhat.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Oleg Nesterov <oleg@redhat.com>,
Sasha Levin <sasha.levin@oracle.com>,
Andrey Konovalov <andreyknvl@google.com>,
Dmitry Vyukov <dvyukov@google.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
linux-mm@kvack.org
Subject: Re: [PATCH 2/12] mm: rmap use pte lock not mmap_sem to set PageMlocked
Date: Mon, 19 Oct 2015 08:23:23 +0200 [thread overview]
Message-ID: <56248C5B.3040505@suse.cz> (raw)
In-Reply-To: <alpine.LSU.2.11.1510182148040.2481@eggly.anvils>
On 10/19/2015 06:50 AM, Hugh Dickins wrote:
> KernelThreadSanitizer (ktsan) has shown that the down_read_trylock()
> of mmap_sem in try_to_unmap_one() (when going to set PageMlocked on
> a page found mapped in a VM_LOCKED vma) is ineffective against races
> with exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not
> held when tearing down an mm.
>
> But that's okay, those races are benign; and although we've believed
But didn't Kirill show that it's not so benign, and can leak memory?
- http://marc.info/?l=linux-mm&m=144196800325498&w=2
Although, as I noted, it probably doesn't leak completely. But a page
will remain unevictable until its last user unmaps it, which is again
not completely benign?
- http://marc.info/?l=linux-mm&m=144198536831589&w=2
> for years in that ugly down_read_trylock(), it's unsuitable for the job,
> and frustrates the good intention of setting PageMlocked when it fails.
>
> It just doesn't matter if here we read vm_flags an instant before or
> after a racing mlock() or munlock() or exit_mmap() sets or clears
> VM_LOCKED: the syscalls (or exit) work their way up the address space
> (taking pt locks after updating vm_flags) to establish the final state.
>
> We do still need to be careful never to mark a page Mlocked (hence
> unevictable) by any race that will not be corrected shortly after.
And waiting for the last user to unmap the page is not necessarily
shortly after :)
Anyway, the pte lock looks like it could work, but I'll need to think
about it some more, because...
> The page lock protects from many of the races, but not all (a page
> is not necessarily locked when it's unmapped). But the pte lock we
> just dropped is good to cover the rest (and serializes even with
> munlock_vma_pages_all(),
Note how munlock_vma_pages_range() via __munlock_pagevec() does
TestClearPageMlocked() without (or "between") pte or page lock. But the
pte lock is being taken after clearing VM_LOCKED, so perhaps it's safe
against try_to_unmap_one...
> so no special barriers required): now hold
> on to the pte lock while calling mlock_vma_page(). Is that lock
> ordering safe? Yes, that's how follow_page_pte() calls it, and
> how page_remove_rmap() calls the complementary clear_page_mlock().
>
> This fixes the following case (though not a case which anyone has
> complained of), which mmap_sem did not: truncation's preliminary
> unmap_mapping_range() is supposed to remove even the anonymous COWs
> of filecache pages, and that might race with try_to_unmap_one() on a
> VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
> zap_pte_range() unmaps the page, causing "Bad page state (mlocked)"
> when freed. The pte lock protects against this.
>
> You could say that it also protects against the more ordinary case,
> racing with the preliminary unmapping of a filecache page itself: but
> in our current tree, that's independently protected by i_mmap_rwsem;
> and that race would be why "Bad page state (mlocked)" was seen before
> commit 48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem").
>
> While we're here, make a related optimization in try_to_unmap_one():
> if it's doing TTU_MUNLOCK, then there's no point at all in descending
> the page tables and getting the pt lock, unless the vma is VM_LOCKED.
> Yes, that can change racily, but it can change racily even without the
> optimization: it's not critical. Far better not to waste time here.
>
> Stopped short of separating try_to_munlock_one() from try_to_unmap_one()
> on this occasion, but that's probably the sensible next step - with a
> rename, given that try_to_munlock()'s business is to try to set Mlocked.
>
> Updated the unevictable-lru Documentation, to remove its reference to
> mmap semaphore, but found a few more updates needed in just that area.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> I've cc'ed a lot of people on this one, since it originated in the
> "Multiple potential races on vma->vm_flags" discussion. But didn't
> cc everyone on the not-very-interesting 1/12 "mm Documentation: undoc
> non-linear vmas" - so if you apply just this to a 4.3-rc6 or mmotm tree,
> yes, there will be rejects on unevictable-lru.txt, that's expected.
>
> Documentation/vm/unevictable-lru.txt | 65 ++++++-------------------
> mm/rmap.c | 36 ++++---------
> 2 files changed, 29 insertions(+), 72 deletions(-)
>
> --- migrat.orig/Documentation/vm/unevictable-lru.txt 2015-10-18 17:53:03.716313651 -0700
> +++ migrat/Documentation/vm/unevictable-lru.txt 2015-10-18 17:53:07.098317501 -0700
> @@ -531,37 +531,20 @@ map.
>
> try_to_unmap() is always called, by either vmscan for reclaim or for page
> migration, with the argument page locked and isolated from the LRU. Separate
> -functions handle anonymous and mapped file pages, as these types of pages have
> -different reverse map mechanisms.
> -
> - (*) try_to_unmap_anon()
> -
> - To unmap anonymous pages, each VMA in the list anchored in the anon_vma
> - must be visited - at least until a VM_LOCKED VMA is encountered. If the
> - page is being unmapped for migration, VM_LOCKED VMAs do not stop the
> - process because mlocked pages are migratable. However, for reclaim, if
> - the page is mapped into a VM_LOCKED VMA, the scan stops.
> -
> - try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
> - the mm_struct to which the VMA belongs. If this is successful, it will
> - mlock the page via mlock_vma_page() - we wouldn't have gotten to
> - try_to_unmap_anon() if the page were already mlocked - and will return
> - SWAP_MLOCK, indicating that the page is unevictable.
> -
> - If the mmap semaphore cannot be acquired, we are not sure whether the page
> - is really unevictable or not. In this case, try_to_unmap_anon() will
> - return SWAP_AGAIN.
> -
> - (*) try_to_unmap_file()
> -
> - Unmapping of a mapped file page works the same as for anonymous mappings,
> - except that the scan visits all VMAs that map the page's index/page offset
> - in the page's mapping's reverse map interval search tree.
> -
> - As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
> - page, try_to_unmap_file() will attempt to acquire the associated
> - mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
> - is successful, and SWAP_AGAIN, if not.
> +functions handle anonymous and mapped file and KSM pages, as these types of
> +pages have different reverse map lookup mechanisms, with different locking.
> +In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
> +it will call try_to_unmap_one() for every VMA which might contain the page.
> +
> +When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
> +VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
> +and return SWAP_MLOCK to indicate that the page is unevictable: and the scan
> +stops there.
> +
> +mlock_vma_page() is called while holding the page table's lock (in addition
> +to the page lock, and the rmap lock): to serialize against concurrent mlock or
> +munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
> +holepunching, and truncation of file pages and their anonymous COWed pages.
>
>
> try_to_munlock() REVERSE MAP SCAN
> @@ -577,22 +560,15 @@ all PTEs from the page. For this purpos
> introduced a variant of try_to_unmap() called try_to_munlock().
>
> try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
> -mapped file pages with an additional argument specifying unlock versus unmap
> +mapped file and KSM pages with a flag argument specifying unlock versus unmap
> processing. Again, these functions walk the respective reverse maps looking
> for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case,
> -the functions attempt to acquire the associated mmap semaphore, mlock the page
> -via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
> -pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> -
> -If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
> -semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
> -recycle the page on the inactive list and hope that it has better luck with the
> -page next time.
> +the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This
> +undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
>
> Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
> reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
> -However, the scan can terminate when it encounters a VM_LOCKED VMA and can
> -successfully acquire the VMA's mmap semaphore for read and mlock the page.
> +However, the scan can terminate when it encounters a VM_LOCKED VMA.
> Although try_to_munlock() might be called a great many times when munlocking a
> large region or tearing down a large address space that has been mlocked via
> mlockall(), overall this is a fairly rare event.
> @@ -620,11 +596,6 @@ Some examples of these unevictable pages
> (3) mlocked pages that could not be isolated from the LRU and moved to the
> unevictable list in mlock_vma_page().
>
> - (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
> - acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
> - munlock_vma_page() was forced to let the page back on to the normal LRU
> - list for vmscan to handle.
> -
> shrink_inactive_list() also diverts any unevictable pages that it finds on the
> inactive lists to the appropriate zone's unevictable list.
>
> --- migrat.orig/mm/rmap.c 2015-09-12 18:30:20.857039763 -0700
> +++ migrat/mm/rmap.c 2015-10-18 17:53:07.099317502 -0700
> @@ -1304,6 +1304,10 @@ static int try_to_unmap_one(struct page
> int ret = SWAP_AGAIN;
> enum ttu_flags flags = (enum ttu_flags)arg;
>
> + /* munlock has nothing to gain from examining un-locked vmas */
> + if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> + goto out;
> +
> pte = page_check_address(page, mm, address, &ptl, 0);
> if (!pte)
> goto out;
> @@ -1314,9 +1318,12 @@ static int try_to_unmap_one(struct page
> * skipped over this mm) then we should reactivate it.
> */
> if (!(flags & TTU_IGNORE_MLOCK)) {
> - if (vma->vm_flags & VM_LOCKED)
> - goto out_mlock;
> -
> + if (vma->vm_flags & VM_LOCKED) {
> + /* Holding pte lock, we do *not* need mmap_sem here */
> + mlock_vma_page(page);
> + ret = SWAP_MLOCK;
> + goto out_unmap;
> + }
> if (flags & TTU_MUNLOCK)
> goto out_unmap;
> }
> @@ -1419,31 +1426,10 @@ static int try_to_unmap_one(struct page
>
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> - if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> + if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
> mmu_notifier_invalidate_page(mm, address);
> out:
> return ret;
> -
> -out_mlock:
> - pte_unmap_unlock(pte, ptl);
> -
> -
> - /*
> - * We need mmap_sem locking, Otherwise VM_LOCKED check makes
> - * unstable result and race. Plus, We can't wait here because
> - * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
> - * if trylock failed, the page remain in evictable lru and later
> - * vmscan could retry to move the page to unevictable lru if the
> - * page is actually mlocked.
> - */
> - if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
> - if (vma->vm_flags & VM_LOCKED) {
> - mlock_vma_page(page);
> - ret = SWAP_MLOCK;
> - }
> - up_read(&vma->vm_mm->mmap_sem);
> - }
> - return ret;
> }
>
> bool is_vma_temporary_stack(struct vm_area_struct *vma)
>