From: Dave Hansen <dave@sr71.net>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
Dave Hansen <dave@sr71.net>
Subject: [PATCH 02/10] mm: pagewalk: always skip hugetlbfs except when explicitly handled
Date: Mon, 02 Jun 2014 14:36:47 -0700
Message-ID: <20140602213647.E5C5D134@viggo.jf.intel.com>
In-Reply-To: <20140602213644.925A26D0@viggo.jf.intel.com>
From: Dave Hansen <dave.hansen@linux.intel.com>

The short story:

The walk_page_range() code is fragile for hugetlbfs VMAs.  Each
walker instance must either exclude hugetlbfs from being walked,
or add a ->hugetlb_entry handler.  If this is not done, the code
will go off the rails and start clearing huge page table entries.

This patch removes that requirement on the walkers.  They can
merrily call walk_page_range() on hugetlbfs areas, and those
areas will simply be skipped inside the page walker code if the
walker has not set up a handler.

This makes the code more robust and shorter, and makes writing a
page table walker more intuitive.  Yay.
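
For illustration, here is roughly what a new walker can look like
once this is in place -- a minimal, hypothetical example (the
function names and the counting logic here are made up; the
caller is assumed to hold mmap_sem):

	/*
	 * Count present ptes in [start, end).  Note: no hugetlbfs
	 * check and no ->hugetlb_entry; hugetlbfs VMAs are now
	 * silently skipped by the core walker.
	 */
	static int count_pte_entry(pte_t *pte, unsigned long addr,
				   unsigned long next, struct mm_walk *walk)
	{
		unsigned long *count = walk->private;

		if (pte_present(*pte))
			(*count)++;
		return 0;
	}

	static unsigned long count_present_ptes(struct mm_struct *mm,
			unsigned long start, unsigned long end)
	{
		unsigned long count = 0;
		struct mm_walk count_walk = {
			.pte_entry	= count_pte_entry,
			.mm		= mm,
			.private	= &count,
		};

		walk_page_range(start, end, &count_walk);
		return count;
	}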

Long story:

I was looking at the page walker code and thought I found a bug.
If the walker hits a hugetlbfs VMA where walk->hugetlb_entry was
not set, it would fall past the hugetlb if() down to
pgd_none_or_clear_bad(), which would clear out the pgd thinking
it was bad.

This essentially means that *EVERY* page walker has to *KNOW* to
either exclude hugetlbfs VMAs or set a ->hugetlb_entry handler.
The good news is that all 9 users of walk_page_range() do this,
implicitly or explicitly.  The bad news is that it took me an
hour to convince myself of this, and that future
walk_page_range() users are vulnerable to making this mistake.
I think the madvise() use was probably just lucky (details
below).

Here's the code, trimmed down.  Note what happens if we have an
is_vm_hugetlb_page() VMA, !walk->hugetlb_entry, and a huge page
entry in 'pgd' (or any of the lower levels):
int walk_page_range(unsigned long addr, unsigned long end, ...
{
	...
	vma = find_vma(walk->mm, addr);
	if (vma) {
		if (walk->hugetlb_entry && is_vm_hugetlb_page(vma)) {
			walk_hugetlb_range(vma, addr, next, walk);
			...
			continue;
		}
	}

	if (pgd_none_or_clear_bad(pgd)) {
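
For reference, the generic pgd_none_or_clear_bad() looks roughly
like this (paraphrased from include/asm-generic/pgtable.h) -- a
huge page entry can trip the pgd_bad() sanity check and get
cleared:

	static inline int pgd_none_or_clear_bad(pgd_t *pgd)
	{
		if (pgd_none(*pgd))
			return 1;
		if (unlikely(pgd_bad(*pgd))) {
			/* a huge page entry looks "bad" here */
			pgd_clear_bad(pgd);
			return 1;
		}
		return 0;
	}
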
There are currently 9 users of walk_page_range(). They handle
hugetlbfs pages in 5 ways:
/proc/$pid/smaps:
/proc/$pid/clear_refs:
cgroup precharge:
cgroup move charge:
	check the VMA explicitly for hugetlbfs and skip it; they do
	not set ->hugetlb_entry (this patch removes the
	now-unnecessary hugetlbfs checks for these)
openrisc dma alloc:
	works on kernel memory, so no hugetlbfs; the arch does not
	even support hugetlbfs
powerpc subpage protection:
	uses an arch-specific is_hugepage_only_range() check
/proc/$pid/pagemap:
/proc/$pid/numa_map:
	set ->hugetlb_entry (see the handler sketch after this list)
	(these are unaffected by this patch)
MADV_WILLNEED:
	does not set ->hugetlb_entry
	only called via:

	madvise_willneed() {
		if (!vma->vm_file)
			force_swapin_readahead(...) {
				walk_page_range(...)
			}
	}

	That !vma->vm_file check just _happens_ to cover hugetlbfs
	VMAs since they are always file-backed (or at least have
	vma->vm_file set, as far as I can tell)
	(this case is unaffected by this patch)
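
For reference, opting in to hugetlbfs walking looks like this -- a
minimal, hypothetical ->hugetlb_entry handler in the style of the
pagemap/numa_map users (the function and variable names are made
up for illustration):

	/* called once per huge page; 'hmask' is the huge page mask */
	static int count_hugetlb_entry(pte_t *pte, unsigned long hmask,
				       unsigned long addr, unsigned long end,
				       struct mm_walk *walk)
	{
		unsigned long *count = walk->private;

		if (!huge_pte_none(huge_ptep_get(pte)))
			(*count)++;
		return 0;
	}

	static struct mm_walk hugetlb_count_walk = {
		.hugetlb_entry	= count_hugetlb_entry,
		/* .mm and .private set by the caller */
	};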
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/fs/proc/task_mmu.c | 4 +---
b/mm/memcontrol.c | 4 ----
b/mm/pagewalk.c | 5 ++++-
3 files changed, 5 insertions(+), 8 deletions(-)
diff -puN fs/proc/task_mmu.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 2014-06-02 14:20:19.210803615 -0700
+++ b/fs/proc/task_mmu.c 2014-06-02 14:20:19.218803974 -0700
@@ -590,7 +590,7 @@ static int show_smap(struct seq_file *m,
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
 	/* mmap_sem is held in m_start */
-	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
+	if (vma->vm_mm)
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
 	show_map_vma(m, vma, is_pid);
@@ -829,8 +829,6 @@ static ssize_t clear_refs_write(struct f
 	mmu_notifier_invalidate_range_start(mm, 0, -1);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		cp.vma = vma;
-		if (is_vm_hugetlb_page(vma))
-			continue;
 		/*
 		 * Writing 1 to /proc/pid/clear_refs affects all pages.
 		 *
diff -puN mm/memcontrol.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 mm/memcontrol.c
--- a/mm/memcontrol.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 2014-06-02 14:20:19.212803706 -0700
+++ b/mm/memcontrol.c 2014-06-02 14:20:19.220804064 -0700
@@ -6821,8 +6821,6 @@ static unsigned long mem_cgroup_count_pr
 			.mm = mm,
 			.private = vma,
 		};
-		if (is_vm_hugetlb_page(vma))
-			continue;
 		walk_page_range(vma->vm_start, vma->vm_end,
 					&mem_cgroup_count_precharge_walk);
 	}
@@ -7087,8 +7085,6 @@ retry:
 			.mm = mm,
 			.private = vma,
 		};
-		if (is_vm_hugetlb_page(vma))
-			continue;
 		ret = walk_page_range(vma->vm_start, vma->vm_end,
 					&mem_cgroup_move_charge_walk);
 		if (ret)
diff -puN mm/pagewalk.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 mm/pagewalk.c
--- a/mm/pagewalk.c~pagewalk-always-skip-hugetlbfs-except-when-explicitly-handled-1 2014-06-02 14:20:19.214803794 -0700
+++ b/mm/pagewalk.c 2014-06-02 14:20:19.220804064 -0700
@@ -115,6 +115,9 @@ static int walk_hugetlb_range(struct vm_
 	pte_t *pte;
 	int err = 0;
 
+	if (!walk->hugetlb_entry)
+		return 0;
+
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask);
@@ -208,7 +211,7 @@ int walk_page_range(unsigned long addr,
 		 * architecture and we can't handled it in the same
 		 * manner as non-huge pages.
 		 */
-		if (walk->hugetlb_entry && is_vm_hugetlb_page(vma)) {
+		if (is_vm_hugetlb_page(vma)) {
 			if (vma->vm_end < next)
 				next = vma->vm_end;
 			/*
_