From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Andres Lagar-Cavilla <andreslc@google.com>,
Yang Shi <yang.shi@linaro.org>, Ning Qu <quning@gmail.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 25/31] huge tmpfs recovery: shmem_recovery_remap & remap_team_by_pmd
Date: Tue, 5 Apr 2016 14:56:23 -0700 (PDT)
Message-ID: <alpine.LSU.2.11.1604051455010.5965@eggly.anvils>
In-Reply-To: <alpine.LSU.2.11.1604051403210.5965@eggly.anvils>

And once we have a fully populated huge page, replace the pte mappings
(by now already pointing into this huge page, as page migration has
arranged) by a huge pmd mapping - not just in the mm which prompted
this work, but in any other mm which might benefit from it.

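The pte-to-pmd replacement that remap_team_by_pmd() performs below can be pictured with a toy userspace model (purely illustrative: plain C, no kernel types, and names like collapse_ptes() are invented here). Populated pte slots are cleared and counted, and the "huge pmd" is installed only when at least one pte was actually replaced:

```c
#include <stdbool.h>

#define NPTES 8	/* stands in for HPAGE_PMD_NR */

/*
 * Toy model: a non-zero slot is a pte mapping one small page of the
 * team; zero is pte_none().  Clear every populated slot, counting them
 * in rss, and install the "huge pmd" only when something was replaced,
 * mirroring the back-out case when no ptes were found.
 */
static int collapse_ptes(int ptes[NPTES], bool *huge_pmd)
{
	int rss = 0;

	for (int i = 0; i < NPTES; i++) {
		if (ptes[i] == 0)
			continue;	/* pte_none(): nothing to clear */
		ptes[i] = 0;		/* pte_clear() + rmap/refcount drop */
		rss++;
	}
	*huge_pmd = rss != 0;		/* set_pmd_at() only if rss */
	return rss;
}
```

The point of the model is only the ordering: the small mappings are torn down first, rss records how many existed, and the single large mapping is committed afterwards, or not at all.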
However, the transition from pte mappings to huge pmd mapping is a
new one, which may surprise code elsewhere - pte_offset_map() and
pte_offset_map_lock() in particular. See the earlier discussion in
"huge tmpfs: avoid premature exposure of new pagetable", but now we
are forced to go beyond its solution.

The answer will be to put *pmd checking inside them, and examine
whether a pagetable page could ever be recycled for another purpose
before the pte lock is taken: the deposit/withdraw protocol, and
mmap_sem conventions, work nicely against that danger; but special
attention will have to be paid to MADV_DONTNEED's zap_huge_pmd()
pte_free under down_read of mmap_sem.

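The eventual fix sketched above, revalidating *pmd once the lock is held, has the same shape as the defensive check already made in remap_team_by_pmd() below; a minimal userspace illustration (invented names, no kernel API):

```c
#include <stdbool.h>

/* Toy pmd states: a racing collapse or zap can switch TABLE to HUGE or NONE */
enum pmd_state { PMD_NONE, PMD_TABLE, PMD_HUGE };

/*
 * Model of the revalidation pattern: the caller sampled the pmd before
 * locking; once the lock is held, re-read the slot and proceed only if
 * it still points to a page table.  In the kernel this corresponds to
 * the "if (pmd_trans_huge(pmdval) || pmd_none(pmdval)) goto out" check
 * made under pmd_lock().
 */
static bool pmd_still_page_table(const enum pmd_state *slot)
{
	enum pmd_state val = *slot;	/* re-read under the lock */

	return val == PMD_TABLE;	/* bail on PMD_NONE or PMD_HUGE */
}
```

Putting this check inside pte_offset_map(_lock)() itself is what would let callers tolerate a page table being replaced by a huge pmd at any instant.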
Avoid those complications for now: just use a rather unwelcome
down_write or down_write_trylock of mmap_sem here in
shmem_recovery_remap(), to exclude msyscalls or faults or ptrace or
GUP or NUMA work or /proc access. rmap access is already excluded
by our holding i_mmap_rwsem. Fast GUP on x86 is made safe by the
TLB flush in remap_team_by_pmd()'s pmdp_collapse_flush(), its IPIs
as usual blocked by fast GUP's local_irq_disable(). Fast GUP on
powerpc is made safe as usual by its RCU freeing of page tables
(though zap_huge_pmd()'s pte_free appears to violate that, but
if so it's an issue for anon THP too: investigate further later).

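The x86 fast-GUP interlock relied on here reduces to a one-flag model (purely illustrative and single-threaded; real fast GUP involves per-CPU state and actual IPIs): while a walker has interrupts disabled, an IPI-based flush cannot be acknowledged, so the page table being walked cannot yet be freed or repurposed:

```c
#include <stdbool.h>

/* One pretend CPU: fast GUP runs on it with interrupts off */
static bool cpu_irqs_disabled;

static void model_local_irq_disable(void) { cpu_irqs_disabled = true; }
static void model_local_irq_enable(void)  { cpu_irqs_disabled = false; }

/*
 * pmdp_collapse_flush() flushes the TLB with IPIs; a CPU that has
 * interrupts disabled cannot acknowledge the IPI, so the flush -- and
 * any page-table freeing ordered after it -- must wait until the
 * fast-GUP walker re-enables interrupts.
 */
static bool model_ipi_flush_would_complete(void)
{
	return !cpu_irqs_disabled;
}
```

This is why the TLB flush in remap_team_by_pmd() is enough to keep x86 fast GUP safe, without taking any lock that fast GUP could see.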
Does remap_team_by_pmd() really need its mmu_notifier_invalidate_range
pair? The manner of mapping changes, but nothing is actually unmapped.
Of course, the same question can be asked of remap_team_by_ptes().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pageteam.h |    2 
 mm/huge_memory.c         |   87 +++++++++++++++++++++++++++++++++++++
 mm/shmem.c               |   76 ++++++++++++++++++++++++++++++++
 3 files changed, 165 insertions(+)

--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -313,6 +313,8 @@ void unmap_team_by_pmd(struct vm_area_st
 			unsigned long addr, pmd_t *pmd, struct page *page);
 void remap_team_by_ptes(struct vm_area_struct *vma,
 			unsigned long addr, pmd_t *pmd);
+void remap_team_by_pmd(struct vm_area_struct *vma,
+			unsigned long addr, pmd_t *pmd, struct page *page);
 #else
 static inline int map_team_by_pmd(struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t *pmd, struct page *page)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3706,3 +3706,90 @@ raced:
 	spin_unlock(pml);
 	mmu_notifier_invalidate_range_end(mm, addr, end);
 }
+
+void remap_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+		       pmd_t *pmd, struct page *head)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page = head;
+	pgtable_t pgtable;
+	unsigned long end;
+	spinlock_t *pml;
+	spinlock_t *ptl;
+	pmd_t pmdval;
+	pte_t *pte;
+	int rss = 0;
+
+	VM_BUG_ON_PAGE(!PageTeam(head), head);
+	VM_BUG_ON_PAGE(!PageLocked(head), head);
+	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
+	end = addr + HPAGE_PMD_SIZE;
+
+	mmu_notifier_invalidate_range_start(mm, addr, end);
+	pml = pmd_lock(mm, pmd);
+	pmdval = *pmd;
+	/* I don't see how this can happen now, but be defensive */
+	if (pmd_trans_huge(pmdval) || pmd_none(pmdval))
+		goto out;
+
+	ptl = pte_lockptr(mm, pmd);
+	if (ptl != pml)
+		spin_lock(ptl);
+
+	pgtable = pmd_pgtable(pmdval);
+	pmdval = mk_pmd(head, vma->vm_page_prot);
+	pmdval = pmd_mkhuge(pmd_mkdirty(pmdval));
+
+	/* Perhaps wise to mark head as mapped before removing pte rmaps */
+	page_add_file_rmap(head);
+
+	/*
+	 * Just as remap_team_by_ptes() would prefer to fill the page table
+	 * earlier, remap_team_by_pmd() would prefer to empty it later; but
+	 * ppc64's variant of the deposit/withdraw protocol prevents that.
+	 */
+	pte = pte_offset_map(pmd, addr);
+	do {
+		if (pte_none(*pte))
+			continue;
+
+		VM_BUG_ON(!pte_present(*pte));
+		VM_BUG_ON(pte_page(*pte) != page);
+
+		pte_clear(mm, addr, pte);
+		page_remove_rmap(page, false);
+		put_page(page);
+		rss++;
+	} while (pte++, page++, addr += PAGE_SIZE, addr != end);
+
+	pte -= HPAGE_PMD_NR;
+	addr -= HPAGE_PMD_SIZE;
+
+	if (rss) {
+		pmdp_collapse_flush(vma, addr, pmd);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, addr, pmd, pmdval);
+		update_mmu_cache_pmd(vma, addr, pmd);
+		get_page(head);
+		page_add_team_rmap(head);
+		add_mm_counter(mm, MM_SHMEMPAGES, HPAGE_PMD_NR - rss);
+	} else {
+		/*
+		 * Hmm. We might have caught this vma in between unmap_vmas()
+		 * and free_pgtables(), which is a surprising time to insert a
+		 * huge page. Before our caller checked mm_users, I sometimes
+		 * saw a "bad pmd" report, and pgtable_pmd_page_dtor() BUG on
+		 * pmd_huge_pte, when killing off tests. But checking mm_users
+		 * is not enough to protect against munmap(): so for safety,
+		 * back out if we found no ptes to replace.
+		 */
+		page_remove_rmap(head, false);
+	}
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	pte_unmap(pte);
+out:
+	spin_unlock(pml);
+	mmu_notifier_invalidate_range_end(mm, addr, end);
+}
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1097,6 +1097,82 @@ unlock:
 
 static void shmem_recovery_remap(struct recovery *recovery, struct page *head)
 {
+	struct mm_struct *mm = recovery->mm;
+	struct address_space *mapping = head->mapping;
+	pgoff_t pgoff = head->index;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+	pmd_t *pmd;
+	bool try_other_mms = false;
+
+	/*
+	 * XXX: This use of mmap_sem is regrettable. It is needed for one
+	 * reason only: because callers of pte_offset_map(_lock)() are not
+	 * prepared for a huge pmd to appear in place of a page table at any
+	 * instant. That can be fixed in pte_offset_map(_lock)() and callers,
+	 * but that is a more invasive change, so just do it this way for now.
+	 */
+	down_write(&mm->mmap_sem);
+	lock_page(head);
+	if (!PageTeam(head)) {
+		unlock_page(head);
+		up_write(&mm->mmap_sem);
+		return;
+	}
+	VM_BUG_ON_PAGE(!PageChecked(head), head);
+	i_mmap_lock_write(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		/* XXX: Use anon_vma as over-strict hint of COWed pages */
+		if (vma->anon_vma)
+			continue;
+		addr = vma_address(head, vma);
+		if (addr & (HPAGE_PMD_SIZE-1))
+			continue;
+		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
+			continue;
+		if (!atomic_read(&vma->vm_mm->mm_users))
+			continue;
+		if (vma->vm_mm != mm) {
+			try_other_mms = true;
+			continue;
+		}
+		/* Only replace existing ptes: empty pmd can fault for itself */
+		pmd = mm_find_pmd(vma->vm_mm, addr);
+		if (!pmd)
+			continue;
+		remap_team_by_pmd(vma, addr, pmd, head);
+		shr_stats(remap_faulter);
+	}
+	up_write(&mm->mmap_sem);
+	if (!try_other_mms)
+		goto out;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		if (vma->vm_mm == mm)
+			continue;
+		/* XXX: Use anon_vma as over-strict hint of COWed pages */
+		if (vma->anon_vma)
+			continue;
+		addr = vma_address(head, vma);
+		if (addr & (HPAGE_PMD_SIZE-1))
+			continue;
+		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
+			continue;
+		if (!atomic_read(&vma->vm_mm->mm_users))
+			continue;
+		/* Only replace existing ptes: empty pmd can fault for itself */
+		pmd = mm_find_pmd(vma->vm_mm, addr);
+		if (!pmd)
+			continue;
+		if (down_write_trylock(&vma->vm_mm->mmap_sem)) {
+			remap_team_by_pmd(vma, addr, pmd, head);
+			shr_stats(remap_another);
+			up_write(&vma->vm_mm->mmap_sem);
+		} else
+			shr_stats(remap_untried);
+	}
+out:
+	i_mmap_unlock_write(mapping);
+	unlock_page(head);
 }
 
 static void shmem_recovery_work(struct work_struct *work)
--