* [PATCH 0/8] hugepage migration fixes (v5)
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
Hi everyone,
This is the v5 patchset fixing hugepage migration's race problems.
In v4, Hugh enlightened me about a problem around pmd_huge(): it returned
false for migration/hwpoison entries, so we treated them as normal pages.
IOW, we didn't handle the !pmd_present case properly.
So I added a new separate patch for this problem as patch 2/8, and changed
the patch "mm/hugetlb: take page table lock in follow_huge_pmd()" (3/8 in
this series) to handle the non-present hugetlb case in follow_huge_pmd().
Other than that, the changes in this version are minor ones like comment
fixes.
May I ask for your comments and reviews again?
Thanks,
Naoya Horiguchi
---
Tree: git@github.com:Naoya-Horiguchi/linux.git
Branch: mmotm-2014-11-26-15-45/fix_hugetlbfs_follow_page.v5
v2: http://thread.gmane.org/gmane.linux.kernel/1761065
v3: http://thread.gmane.org/gmane.linux.kernel/1776585
v4: http://thread.gmane.org/gmane.linux.kernel/1788215
---
Summary:
Naoya Horiguchi (8):
mm/hugetlb: reduce arch dependent code around follow_huge_*
mm/hugetlb: pmd_huge() returns true for non-present hugepage
mm/hugetlb: take page table lock in follow_huge_pmd()
mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
mm/hugetlb: add migration entry check in __unmap_hugepage_range
mm/hugetlb: fix suboptimal migration/hwpoisoned entry check
mm/hugetlb: cleanup and rename is_hugetlb_entry_(migration|hwpoisoned)()
arch/arm/mm/hugetlbpage.c | 6 --
arch/arm64/mm/hugetlbpage.c | 6 --
arch/ia64/mm/hugetlbpage.c | 6 --
arch/metag/mm/hugetlbpage.c | 6 --
arch/mips/mm/hugetlbpage.c | 18 ----
arch/powerpc/mm/hugetlbpage.c | 8 ++
arch/s390/mm/hugetlbpage.c | 20 -----
arch/sh/mm/hugetlbpage.c | 12 ---
arch/sparc/mm/hugetlbpage.c | 12 ---
arch/tile/mm/hugetlbpage.c | 28 ------
arch/x86/mm/gup.c | 2 +-
arch/x86/mm/hugetlbpage.c | 20 ++---
include/linux/hugetlb.h | 8 +-
include/linux/swapops.h | 4 +
mm/gup.c | 25 ++----
mm/hugetlb.c | 196 ++++++++++++++++++++++++++----------------
mm/migrate.c | 5 +-
17 files changed, 156 insertions(+), 226 deletions(-)
* [PATCH v5 2/8] mm/hugetlb: pmd_huge() returns true for non-present hugepage
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
Migrating hugepages and hwpoisoned hugepages are considered non-present
hugepages: they are referenced via migration entries and hwpoison entries
in their page table slots.
This causes a race condition, because pmd_huge() doesn't distinguish
non-huge pages from migrating/hwpoisoned hugepages. follow_page_mask()
is one example, where the kernel would call follow_page_pte() for such a
hugepage although that function is supposed to handle only normal pages.
To avoid this, this patch makes pmd_huge() return true for any non-empty
pmd entry whose present bit is clear, i.e. when pmd_none() is false *and*
pmd_present() is false. We don't have to worry about confusing a
non-present pmd entry with a normal pmd (one pointing to a leaf-level pte
page), because pmd_present() is true for a normal pmd.
The same race condition could happen in the (x86-specific) gup_pmd_range().
There this patch simply adds a pmd_present() check next to pmd_large()
instead of switching to pmd_huge(), because gup_pmd_range() is a fast path:
if we see a non-present hugepage there, we go into gup_huge_pmd(), return
0 at the flag mask check, and fall back to the slow path.
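As an aside, the new predicate is easy to sanity-check in userspace.
Below is a minimal sketch, not kernel code: pmd_t and the _PAGE_* values
are simplified stand-ins for the real x86 definitions:

#include <stdio.h>

/* simplified stand-ins for the x86 definitions, for illustration only */
#define _PAGE_PRESENT 0x001UL
#define _PAGE_PSE     0x080UL

typedef struct { unsigned long val; } pmd_t;

static int pmd_none(pmd_t pmd) { return pmd.val == 0; }

/* true for any non-empty entry that is either huge or non-present */
static int pmd_huge(pmd_t pmd)
{
	return !pmd_none(pmd) &&
		(pmd.val & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
}

int main(void)
{
	pmd_t none      = { 0 };                         /* empty slot */
	pmd_t normal    = { _PAGE_PRESENT };             /* points to a pte page */
	pmd_t huge      = { _PAGE_PRESENT | _PAGE_PSE }; /* present hugepage */
	pmd_t migration = { 0x1230 };                    /* non-present swap-format entry */

	printf("%d %d %d %d\n", pmd_huge(none), pmd_huge(normal),
	       pmd_huge(huge), pmd_huge(migration)); /* prints: 0 0 1 1 */
	return 0;
}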
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [2.6.36+]
---
arch/x86/mm/gup.c | 2 +-
arch/x86/mm/hugetlbpage.c | 8 +++++++-
mm/hugetlb.c | 2 ++
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/arch/x86/mm/gup.c mmotm-2014-11-26-15-45/arch/x86/mm/gup.c
index 207d9aef662d..448ee8912d9b 100644
--- mmotm-2014-11-26-15-45.orig/arch/x86/mm/gup.c
+++ mmotm-2014-11-26-15-45/arch/x86/mm/gup.c
@@ -172,7 +172,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
*/
if (pmd_none(pmd) || pmd_trans_splitting(pmd))
return 0;
- if (unlikely(pmd_large(pmd))) {
+ if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
/*
* NUMA hinting faults need to be handled in the GUP
* slowpath for accounting purposes and so that they
diff --git mmotm-2014-11-26-15-45.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/x86/mm/hugetlbpage.c
index 03b8a7c11817..9161f764121e 100644
--- mmotm-2014-11-26-15-45.orig/arch/x86/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/x86/mm/hugetlbpage.c
@@ -54,9 +54,15 @@ int pud_huge(pud_t pud)
#else
+/*
+ * pmd_huge() returns 1 if @pmd is hugetlb related entry, that is normal
+ * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
+ * Otherwise, returns 0.
+ */
int pmd_huge(pmd_t pmd)
{
- return !!(pmd_val(pmd) & _PAGE_PSE);
+ return !pmd_none(pmd) &&
+ (pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
}
int pud_huge(pud_t pud)
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index 6be4a690e554..dd42878549d5 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -3679,6 +3679,8 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
{
struct page *page;
+ if (!pmd_present(*pmd))
+ return NULL;
page = pte_page(*(pte_t *)pmd);
if (page)
page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 3/8] mm/hugetlb: take page table lock in follow_huge_pmd()
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and tries
to get their refcount without preventing concurrent freeing. This race
crashes the kernel, so this patch fixes it by moving the FOLL_GET code for
hugepages into follow_huge_pmd(), which now takes the page table lock.
This patch intentionally removes the page==NULL check after pte_page().
This is justified because pte_page() never returns NULL on any architecture
or configuration.
This patch changes the behavior of follow_huge_pmd() for tail pages: tail
pages can now be pinned and returned, so the caller must be changed to
handle returned tail pages properly.
We could add similar locking to follow_huge_(addr|pud) for consistency,
but that's not necessary, because these functions currently don't support
the FOLL_GET flag; let's leave it for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <err.h>
#include <sys/types.h>
#include <numa.h>
#include <numaif.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000

int main(int argc, char *argv[]) {
	int i;
	int nr_hp = strtol(argv[1], NULL, 0);
	int nr_p  = nr_hp * HPS / PS;
	int ret;
	void **addrs;
	int *status;
	int *nodes;
	pid_t pid;

	pid = strtol(argv[2], NULL, 0);
	addrs  = malloc(sizeof(void *) * nr_p);
	status = malloc(sizeof(int) * nr_p);
	nodes  = malloc(sizeof(int) * nr_p);

	while (1) {
		for (i = 0; i < nr_p; i++) {
			addrs[i] = (void *)ADDR_INPUT + i * PS;
			nodes[i] = 1;
			status[i] = 0;
		}
		ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
				      MPOL_MF_MOVE_ALL);
		if (ret == -1)
			err(1, "move_pages");

		for (i = 0; i < nr_p; i++) {
			addrs[i] = (void *)ADDR_INPUT + i * PS;
			nodes[i] = 0;
			status[i] = 0;
		}
		ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
				      MPOL_MF_MOVE_ALL);
		if (ret == -1)
			err(1, "move_pages");
	}
	return 0;
}
$ cat hugepage.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000

int main(int argc, char *argv[]) {
	int nr_hp = strtol(argv[1], NULL, 0);
	char *p;

	while (1) {
		p = mmap((void *)ADDR_INPUT, nr_hp * HPS,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p != (void *)ADDR_INPUT) {
			perror("mmap");
			break;
		}
		memset(p, 0, nr_hp * HPS);
		munmap(p, nr_hp * HPS);
	}
	return 0;
}
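For reference, the reproducers can be built as follows; -lnuma is needed
because numa_move_pages() comes from libnuma:
$ gcc -o movepages movepages.c -lnuma
$ gcc -o hugepage hugepage.c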
$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [3.12+]
---
ChangeLog v5:
- make follow_huge_pmd() wait and retry if pmd is a migration entry,
for that purpose __migration_entry_wait() is exported outside mm/migrate.c
- make follow_huge_pmd() return NULL if pmd is hwpoison entry
- return no_page_table() instead of NULL if follow_huge_pmd/pud returns NULL
ChangeLog v4:
- remove changes related to taking ptl from follow_huge_(addr|pud)(),
which is not necessary because these functions don't support FOLL_GET
at least for now
- add justification of removing page==NULL check to patch description
- stop changing parameter mm to vma in follow_huge_(pud|pmd)()
- use pmd_lockptr() instead of huge_pte_lock() in follow_huge_pmd()
- use get_page() instead of get_page_unless_zero() in follow_huge_pmd()
- use Fixes: tag and move changelog under '---'
ChangeLog v3:
- remove unnecessary if (page) check
- check (pmd|pud)_huge again after holding ptl
- do the same change also on follow_huge_pud()
- take page table lock also in follow_huge_addr()
ChangeLog v2:
- introduce follow_huge_pmd_lock() to do locking in arch-independent code.
---
include/linux/hugetlb.h | 8 ++++----
include/linux/swapops.h | 4 ++++
mm/gup.c | 25 ++++++++-----------------
mm/hugetlb.c | 48 ++++++++++++++++++++++++++++++++++--------------
mm/migrate.c | 5 +++--
5 files changed, 53 insertions(+), 37 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/include/linux/hugetlb.h mmotm-2014-11-26-15-45/include/linux/hugetlb.h
index cdd149ca5cc0..65b73169f511 100644
--- mmotm-2014-11-26-15-45.orig/include/linux/hugetlb.h
+++ mmotm-2014-11-26-15-45/include/linux/hugetlb.h
@@ -99,9 +99,9 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write);
+ pmd_t *pmd, int flags);
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
- pud_t *pud, int write);
+ pud_t *pud, int flags);
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pmd);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
static inline void hugetlb_show_meminfo(void)
{
}
-#define follow_huge_pmd(mm, addr, pmd, write) NULL
-#define follow_huge_pud(mm, addr, pud, write) NULL
+#define follow_huge_pmd(mm, addr, pmd, flags) NULL
+#define follow_huge_pud(mm, addr, pud, flags) NULL
#define prepare_hugepage_range(file, addr, len) (-EINVAL)
#define pmd_huge(x) 0
#define pud_huge(x) 0
diff --git mmotm-2014-11-26-15-45.orig/include/linux/swapops.h mmotm-2014-11-26-15-45/include/linux/swapops.h
index 6adfb7bfbf44..e288d5c016a7 100644
--- mmotm-2014-11-26-15-45.orig/include/linux/swapops.h
+++ mmotm-2014-11-26-15-45/include/linux/swapops.h
@@ -137,6 +137,8 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
}
+extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+ spinlock_t *ptl);
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
extern void migration_entry_wait_huge(struct vm_area_struct *vma,
@@ -150,6 +152,8 @@ static inline int is_migration_entry(swp_entry_t swp)
}
#define migration_entry_to_page(swp) NULL
static inline void make_migration_entry_read(swp_entry_t *entryp) { }
+static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+ spinlock_t *ptl) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
diff --git mmotm-2014-11-26-15-45.orig/mm/gup.c mmotm-2014-11-26-15-45/mm/gup.c
index 3af2f08578ce..a9b260c5e2c8 100644
--- mmotm-2014-11-26-15-45.orig/mm/gup.c
+++ mmotm-2014-11-26-15-45/mm/gup.c
@@ -167,10 +167,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (pud_none(*pud))
return no_page_table(vma, flags);
if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
- if (flags & FOLL_GET)
- return NULL;
- page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
- return page;
+ page = follow_huge_pud(mm, address, pud, flags);
+ if (page)
+ return page;
+ return no_page_table(vma, flags);
}
if (unlikely(pud_bad(*pud)))
return no_page_table(vma, flags);
@@ -179,19 +179,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (pmd_none(*pmd))
return no_page_table(vma, flags);
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
- page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
- if (flags & FOLL_GET) {
- /*
- * Refcount on tail pages are not well-defined and
- * shouldn't be taken. The caller should handle a NULL
- * return when trying to follow tail pages.
- */
- if (PageHead(page))
- get_page(page);
- else
- page = NULL;
- }
- return page;
+ page = follow_huge_pmd(mm, address, pmd, flags);
+ if (page)
+ return page;
+ return no_page_table(vma, flags);
}
if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
return no_page_table(vma, flags);
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index dd42878549d5..adafced1aa17 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -3675,28 +3675,48 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
struct page * __weak
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write)
+ pmd_t *pmd, int flags)
{
- struct page *page;
-
- if (!pmd_present(*pmd))
- return NULL;
- page = pte_page(*(pte_t *)pmd);
- if (page)
- page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ struct page *page = NULL;
+ spinlock_t *ptl;
+retry:
+ ptl = pmd_lockptr(mm, pmd);
+ spin_lock(ptl);
+ /*
+ * make sure that the address range covered by this pmd is not
+ * unmapped from other threads.
+ */
+ if (!pmd_huge(*pmd))
+ goto out;
+ if (pmd_present(*pmd)) {
+ page = pte_page(*(pte_t *)pmd) +
+ ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ if (flags & FOLL_GET)
+ get_page(page);
+ } else {
+ if (is_hugetlb_entry_migration(huge_ptep_get((pte_t *)pmd))) {
+ spin_unlock(ptl);
+ __migration_entry_wait(mm, (pte_t *)pmd, ptl);
+ goto retry;
+ }
+ /*
+ * hwpoisoned entry is treated as no_page_table in
+ * follow_page_mask().
+ */
+ }
+out:
+ spin_unlock(ptl);
return page;
}
struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
- pud_t *pud, int write)
+ pud_t *pud, int flags)
{
- struct page *page;
+ if (flags & FOLL_GET)
+ return NULL;
- page = pte_page(*(pte_t *)pud);
- if (page)
- page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
- return page;
+ return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
}
#ifdef CONFIG_MEMORY_FAILURE
diff --git mmotm-2014-11-26-15-45.orig/mm/migrate.c mmotm-2014-11-26-15-45/mm/migrate.c
index 41945cb0ca38..a4b2a4103d38 100644
--- mmotm-2014-11-26-15-45.orig/mm/migrate.c
+++ mmotm-2014-11-26-15-45/mm/migrate.c
@@ -229,7 +229,7 @@ static void remove_migration_ptes(struct page *old, struct page *new)
* get to the page and wait until migration is finished.
* When we return from this function the fault will be retried.
*/
-static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl)
{
pte_t pte;
@@ -1260,7 +1260,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto put_and_set;
if (PageHuge(page)) {
- isolate_huge_page(page, &pagelist);
+ if (PageHead(page))
+ isolate_huge_page(page, &pagelist);
goto put_and_set;
}
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 1/8] mm/hugetlb: reduce arch dependent code around follow_huge_*
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
Currently we have many duplicates in the definitions of follow_huge_addr(),
follow_huge_pmd(), and follow_huge_pud(), so this patch removes them.
The basic idea is to put the default implementations of these functions in
mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB),
and to implement arch-specific code only where an arch needs it.
For follow_huge_addr(), only powerpc and ia64 have their own implementations;
on all other architectures this function just returns ERR_PTR(-EINVAL).
So this patch makes returning ERR_PTR(-EINVAL) the default.
As for follow_huge_(pmd|pud)(), if an architecture implements (pmd|pud)_huge()
to always return 0 (as ia64 and sparc do), follow_huge_(pmd|pud)() is never
called (the call site is optimized away), no matter how it is implemented.
So such architectures need no arch-specific implementation.
On some architectures (mips, s390, and tile), the current arch-specific
follow_huge_(pmd|pud)() is effectively identical to the common code,
so this patch lets these architectures use the common code.
One exception is metag, where pmd_huge() can return non-zero but
follow_huge_pmd() is expected to always return NULL. This means we need an
arch-specific implementation that returns NULL. This behavior looks strange
to me (because non-zero pmd_huge() implies that the architecture supports
PMD-based hugepages, so follow_huge_pmd() can/should return some relevant
value), but that's beyond this cleanup patch, so let's keep it for now.
Justification of non-trivial changes:
- On s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE first, and this patch
removes that check. This is OK because we can assume MACHINE_HAS_HPAGE is
true whenever follow_huge_pmd() can be called (note that pmd_huge() has the
same check and always returns 0 for !MACHINE_HAS_HPAGE).
- s390 and mips use HPAGE_MASK instead of the PMD_MASK used in the common
code. This patch forces these archs to use PMD_MASK, which is OK because
the two masks are identical on both archs:
In s390, HPAGE_SHIFT and PMD_SHIFT are both 20.
In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
PTE_ORDER is always 0, so the two are identical.
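As a side note for readers unfamiliar with weak symbols, here is a minimal
two-file userspace sketch of the mechanism this patch relies on. Spelling
__weak as the raw GCC attribute is an assumption for illustration (the
kernel defines it in its compiler headers):

$ cat generic.c
#include <stdio.h>

#define __weak __attribute__((weak))	/* illustrative stand-in */

/* default implementation, like the ones this patch adds to mm/hugetlb.c */
int __weak follow_thing(void)
{
	return -22;	/* -EINVAL */
}

int main(void)
{
	printf("%d\n", follow_thing());
	return 0;
}

$ cat arch.c
/* strong, "arch-specific" definition; it silently wins at link time */
int follow_thing(void)
{
	return 42;
}

$ gcc generic.c arch.c && ./a.out    # prints 42
$ gcc generic.c && ./a.out           # prints -22 (weak default used)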
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
---
arch/arm/mm/hugetlbpage.c | 6 ------
arch/arm64/mm/hugetlbpage.c | 6 ------
arch/ia64/mm/hugetlbpage.c | 6 ------
arch/metag/mm/hugetlbpage.c | 6 ------
arch/mips/mm/hugetlbpage.c | 18 ------------------
arch/powerpc/mm/hugetlbpage.c | 8 ++++++++
arch/s390/mm/hugetlbpage.c | 20 --------------------
arch/sh/mm/hugetlbpage.c | 12 ------------
arch/sparc/mm/hugetlbpage.c | 12 ------------
arch/tile/mm/hugetlbpage.c | 28 ----------------------------
arch/x86/mm/hugetlbpage.c | 12 ------------
mm/hugetlb.c | 30 +++++++++++++++---------------
12 files changed, 23 insertions(+), 141 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/arch/arm/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/arm/mm/hugetlbpage.c
index 66781bf34077..c72412415093 100644
--- mmotm-2014-11-26-15-45.orig/arch/arm/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/arm/mm/hugetlbpage.c
@@ -36,12 +36,6 @@
* of type casting from pmd_t * to pte_t *.
*/
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
- int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pud_huge(pud_t pud)
{
return 0;
diff --git mmotm-2014-11-26-15-45.orig/arch/arm64/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/arm64/mm/hugetlbpage.c
index 023747bf4dd7..2de9d2e59d96 100644
--- mmotm-2014-11-26-15-45.orig/arch/arm64/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/arm64/mm/hugetlbpage.c
@@ -38,12 +38,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
}
#endif
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
- int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return !(pmd_val(pmd) & PMD_TABLE_BIT);
diff --git mmotm-2014-11-26-15-45.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/ia64/mm/hugetlbpage.c
index 76069c18ee42..52b7604b5215 100644
--- mmotm-2014-11-26-15-45.orig/arch/ia64/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/ia64/mm/hugetlbpage.c
@@ -114,12 +114,6 @@ int pud_huge(pud_t pud)
return 0;
}
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
-{
- return NULL;
-}
-
void hugetlb_free_pgd_range(struct mmu_gather *tlb,
unsigned long addr, unsigned long end,
unsigned long floor, unsigned long ceiling)
diff --git mmotm-2014-11-26-15-45.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/metag/mm/hugetlbpage.c
index 3c32075d2945..7ca80ac42ed5 100644
--- mmotm-2014-11-26-15-45.orig/arch/metag/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/metag/mm/hugetlbpage.c
@@ -94,12 +94,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
return 0;
}
-struct page *follow_huge_addr(struct mm_struct *mm,
- unsigned long address, int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return pmd_page_shift(pmd) > PAGE_SHIFT;
diff --git mmotm-2014-11-26-15-45.orig/arch/mips/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/mips/mm/hugetlbpage.c
index 4ec8ee10d371..06e0f421b41b 100644
--- mmotm-2014-11-26-15-45.orig/arch/mips/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/mips/mm/hugetlbpage.c
@@ -68,12 +68,6 @@ int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
return 0;
}
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return (pmd_val(pmd) & _PAGE_HUGE) != 0;
@@ -83,15 +77,3 @@ int pud_huge(pud_t pud)
{
return (pud_val(pud) & _PAGE_HUGE) != 0;
}
-
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write)
-{
- struct page *page;
-
- page = pte_page(*(pte_t *)pmd);
- if (page)
- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
- return page;
-}
diff --git mmotm-2014-11-26-15-45.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/powerpc/mm/hugetlbpage.c
index 7e70ae968e5f..9517a93a315c 100644
--- mmotm-2014-11-26-15-45.orig/arch/powerpc/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/powerpc/mm/hugetlbpage.c
@@ -706,6 +706,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
return NULL;
}
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ BUG();
+ return NULL;
+}
+
static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
unsigned long sz)
{
diff --git mmotm-2014-11-26-15-45.orig/arch/s390/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/s390/mm/hugetlbpage.c
index 389bc17934b7..348b3b9b6b59 100644
--- mmotm-2014-11-26-15-45.orig/arch/s390/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/s390/mm/hugetlbpage.c
@@ -192,12 +192,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
return 0;
}
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
- int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
if (!MACHINE_HAS_HPAGE)
@@ -210,17 +204,3 @@ int pud_huge(pud_t pud)
{
return 0;
}
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmdp, int write)
-{
- struct page *page;
-
- if (!MACHINE_HAS_HPAGE)
- return NULL;
-
- page = pmd_page(*pmdp);
- if (page)
- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
- return page;
-}
diff --git mmotm-2014-11-26-15-45.orig/arch/sh/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/sh/mm/hugetlbpage.c
index d7762349ea48..534bc978af8a 100644
--- mmotm-2014-11-26-15-45.orig/arch/sh/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/sh/mm/hugetlbpage.c
@@ -67,12 +67,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
return 0;
}
-struct page *follow_huge_addr(struct mm_struct *mm,
- unsigned long address, int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return 0;
@@ -82,9 +76,3 @@ int pud_huge(pud_t pud)
{
return 0;
}
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write)
-{
- return NULL;
-}
diff --git mmotm-2014-11-26-15-45.orig/arch/sparc/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/sparc/mm/hugetlbpage.c
index d329537739c6..4242eab12e10 100644
--- mmotm-2014-11-26-15-45.orig/arch/sparc/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/sparc/mm/hugetlbpage.c
@@ -215,12 +215,6 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
return entry;
}
-struct page *follow_huge_addr(struct mm_struct *mm,
- unsigned long address, int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return 0;
@@ -230,9 +224,3 @@ int pud_huge(pud_t pud)
{
return 0;
}
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write)
-{
- return NULL;
-}
diff --git mmotm-2014-11-26-15-45.orig/arch/tile/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/tile/mm/hugetlbpage.c
index e514899e1100..8a00c7b7b862 100644
--- mmotm-2014-11-26-15-45.orig/arch/tile/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/tile/mm/hugetlbpage.c
@@ -150,12 +150,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
return NULL;
}
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
- int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return !!(pmd_val(pmd) & _PAGE_HUGE_PAGE);
@@ -166,28 +160,6 @@ int pud_huge(pud_t pud)
return !!(pud_val(pud) & _PAGE_HUGE_PAGE);
}
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write)
-{
- struct page *page;
-
- page = pte_page(*(pte_t *)pmd);
- if (page)
- page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
- return page;
-}
-
-struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
- pud_t *pud, int write)
-{
- struct page *page;
-
- page = pte_page(*(pte_t *)pud);
- if (page)
- page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
- return page;
-}
-
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
return 0;
diff --git mmotm-2014-11-26-15-45.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-11-26-15-45/arch/x86/mm/hugetlbpage.c
index 8b977ebf9388..03b8a7c11817 100644
--- mmotm-2014-11-26-15-45.orig/arch/x86/mm/hugetlbpage.c
+++ mmotm-2014-11-26-15-45/arch/x86/mm/hugetlbpage.c
@@ -52,20 +52,8 @@ int pud_huge(pud_t pud)
return 0;
}
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int write)
-{
- return NULL;
-}
#else
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
- return ERR_PTR(-EINVAL);
-}
-
int pmd_huge(pmd_t pmd)
{
return !!(pmd_val(pmd) & _PAGE_PSE);
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index 85032de5e20f..6be4a690e554 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -3660,7 +3660,20 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
return (pte_t *) pmd;
}
-struct page *
+#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
+
+/*
+ * These functions are overwritable if your architecture needs its own
+ * behavior.
+ */
+struct page * __weak
+follow_huge_addr(struct mm_struct *mm, unsigned long address,
+ int write)
+{
+ return ERR_PTR(-EINVAL);
+}
+
+struct page * __weak
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
@@ -3672,7 +3685,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
return page;
}
-struct page *
+struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int write)
{
@@ -3684,19 +3697,6 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
return page;
}
-#else /* !CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-
-/* Can be overriden by architectures */
-struct page * __weak
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
- pud_t *pud, int write)
-{
- BUG();
- return NULL;
-}
-
-#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-
#ifdef CONFIG_MEMORY_FAILURE
/* Should be called in hugetlb_lock */
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 4/8] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
When running the test that triggers the race shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
This race happens when the pte turns into a migration entry just after the
first is_hugetlb_entry_migration() check in hugetlb_fault() has returned
false. To fix this, we need to check pte_present() again after
huge_ptep_get().
This patch also reorders taking ptl and doing pte_page(), because
pte_page() should be done under ptl. Due to this reordering, we need to
use trylock_page() in the page != pagecache_page case to respect the
locking order.
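The trylock is needed because the canonical order is page lock first, then
ptl, while at this point we already hold ptl. Below is a minimal pthread
sketch of that back-off pattern; the names are made up and this is only an
analogy, not the kernel API (build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER; /* lock_page() */
static pthread_mutex_t ptl       = PTHREAD_MUTEX_INITIALIZER; /* page table lock */

/* canonical order is page_lock -> ptl; we hold ptl, so only *try* page_lock */
static int fault_path(void)
{
	int need_wait_lock = 0;

	pthread_mutex_lock(&ptl);
	if (pthread_mutex_trylock(&page_lock)) {
		need_wait_lock = 1;	/* back off instead of deadlocking */
	} else {
		/* ... work needing both locks (hugetlb_cow() etc.) ... */
		pthread_mutex_unlock(&page_lock);
	}
	pthread_mutex_unlock(&ptl);

	if (need_wait_lock) {
		/* like wait_on_page_locked(): wait for the holder, then refault */
		pthread_mutex_lock(&page_lock);
		pthread_mutex_unlock(&page_lock);
	}
	return need_wait_lock;
}

int main(void)
{
	printf("retried: %d\n", fault_path());	/* 0: the lock was free */
	return 0;
}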
Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [3.2+]
---
ChangeLog v5:
- add comment to justify calling wait_on_page_locked without taking refcount
- remove stale comment about lock order
ChangeLog v4:
- move !pte_present(entry) (i.e. migration/hwpoison) check before
taking page table lock
- call wait_on_page_locked() if trylock_page() returns false
- remove unused label out_unlock_page
- fix the order of put_page() and unlock_page() after out_put_page label
- move changelog under '---'
Hugh advised me for v3 that we can call migration_entry_wait_huge() when
the !pte_present(entry) check returns true, to avoid busy faulting. But it
seems that in that case only one additional page fault happens instead of
busy faulting, because is_hugetlb_entry_migration() in the second call of
hugetlb_fault() should return true, and then migration_entry_wait_huge()
is called. We could avoid this additional page fault by adding another
migration_entry_wait_huge() call, but then we would have to split the
pte_present() check into an is_hugetlb_entry_migration() path and an
is_hugetlb_entry_hwpoisoned() path, which complicates the code. So let me
take the simpler approach for the stable trees.
It was also suggested that we can clean up is_hugetlb_entry_migration()
and is_hugetlb_entry_hwpoisoned(). This will be done in separate work, and
the migration_entry_wait_huge() problem above will be revisited there.
ChangeLog v3:
- doing pte_page() and taking refcount under page table lock
- check pte_present after taking ptl, which makes it unnecessary to use
get_page_unless_zero()
- use trylock_page in page != pagecache_page case
- fixed target stable version
---
mm/hugetlb.c | 52 ++++++++++++++++++++++++++++++++++++----------------
1 file changed, 36 insertions(+), 16 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index adafced1aa17..dfc1527e8f4e 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -3134,6 +3134,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *pagecache_page = NULL;
struct hstate *h = hstate_vma(vma);
struct address_space *mapping;
+ int need_wait_lock = 0;
address &= huge_page_mask(h);
@@ -3172,6 +3173,16 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = 0;
/*
+ * entry could be a migration/hwpoison entry at this point, so this
+ * check prevents the kernel from going below assuming that we have
+ * a active hugepage in pagecache. This goto expects the 2nd page fault,
+ * and is_hugetlb_entry_(migration|hwpoisoned) check will properly
+ * handle it.
+ */
+ if (!pte_present(entry))
+ goto out_mutex;
+
+ /*
* If we are going to COW the mapping later, we examine the pending
* reservations for this page now. This will ensure that any
* allocations necessary to record that reservation occur outside the
@@ -3190,30 +3201,31 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
vma, address);
}
+ ptl = huge_pte_lock(h, mm, ptep);
+
+ /* Check for a racing update before calling hugetlb_cow */
+ if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+ goto out_ptl;
+
/*
* hugetlb_cow() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
* when page != pagecache_page or !pagecache_page.
- * Note that locking order is always pagecache_page -> page,
- * so no worry about deadlock.
*/
page = pte_page(entry);
- get_page(page);
if (page != pagecache_page)
- lock_page(page);
-
- ptl = huge_pte_lockptr(h, mm, ptep);
- spin_lock(ptl);
- /* Check for a racing update before calling hugetlb_cow */
- if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
- goto out_ptl;
+ if (!trylock_page(page)) {
+ need_wait_lock = 1;
+ goto out_ptl;
+ }
+ get_page(page);
if (flags & FAULT_FLAG_WRITE) {
if (!huge_pte_write(entry)) {
ret = hugetlb_cow(mm, vma, address, ptep, entry,
pagecache_page, ptl);
- goto out_ptl;
+ goto out_put_page;
}
entry = huge_pte_mkdirty(entry);
}
@@ -3221,7 +3233,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (huge_ptep_set_access_flags(vma, address, ptep, entry,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, address, ptep);
-
+out_put_page:
+ if (page != pagecache_page)
+ unlock_page(page);
+ put_page(page);
out_ptl:
spin_unlock(ptl);
@@ -3229,12 +3244,17 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(pagecache_page);
put_page(pagecache_page);
}
- if (page != pagecache_page)
- unlock_page(page);
- put_page(page);
-
out_mutex:
mutex_unlock(&htlb_fault_mutex_table[hash]);
+ /*
+ * Generally it's safe to hold refcount during waiting page lock. But
+ * here we just wait to defer the next page fault to avoid busy loop and
+ * the page is not used after unlocked before returning from the current
+ * page fault. So we are safe from accessing freed page, even if we wait
+ * here without taking refcount.
+ */
+ if (need_wait_lock)
+ wait_on_page_locked(page);
return ret;
}
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 5/8] mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
There is a race condition between hugepage migration and
change_protection(): hugetlb_change_protection() doesn't care about
migration entries and wrongly overwrites them. That causes unexpected
results like a kernel crash. HWPoison entries can cause the same problem.
This patch adds is_hugetlb_entry_(migration|hwpoisoned) checks to this
function so that it takes the proper action for each: hwpoisoned entries
are skipped, and writable migration entries are converted to read-only
ones.
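As an aside, the conversion done for writable migration entries can be
sketched in userspace as follows; swp_entry_t and the SWP_* values here
are simplified stand-ins for the real kernel encoding, for illustration
only:

#include <stdio.h>

#define MAX_SWAPFILES       30	/* made-up value for illustration */
#define SWP_MIGRATION_READ  (MAX_SWAPFILES + 0)
#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1)

typedef struct { unsigned type; unsigned long offset; } swp_entry_t;

static int is_write_migration_entry(swp_entry_t e)
{
	return e.type == SWP_MIGRATION_WRITE;
}

static void make_migration_entry_read(swp_entry_t *e)
{
	e->type = SWP_MIGRATION_READ;	/* keep the target page, drop write */
}

int main(void)
{
	swp_entry_t entry = { SWP_MIGRATION_WRITE, 0x1234 };

	if (is_write_migration_entry(entry))
		make_migration_entry_read(&entry);
	printf("type=%u offset=%#lx\n", entry.type, entry.offset);
	/* the entry is now read-only but still refers to the same page */
	return 0;
}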
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [2.6.36+]
---
ChangeLog v4:
- s/set_pte_at/set_huge_pte_at/
ChangeLog v3:
- handle migration entry correctly (instead of just skipping)
---
mm/hugetlb.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index dfc1527e8f4e..2807be3f260d 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -3384,7 +3384,26 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
spin_unlock(ptl);
continue;
}
- if (!huge_pte_none(huge_ptep_get(ptep))) {
+ pte = huge_ptep_get(ptep);
+ if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
+ spin_unlock(ptl);
+ continue;
+ }
+ if (unlikely(is_hugetlb_entry_migration(pte))) {
+ swp_entry_t entry = pte_to_swp_entry(pte);
+
+ if (is_write_migration_entry(entry)) {
+ pte_t newpte;
+
+ make_migration_entry_read(&entry);
+ newpte = swp_entry_to_pte(entry);
+ set_huge_pte_at(mm, address, ptep, newpte);
+ pages++;
+ }
+ spin_unlock(ptl);
+ continue;
+ }
+ if (!huge_pte_none(pte)) {
pte = huge_ptep_get_and_clear(mm, address, ptep);
pte = pte_mkhuge(huge_pte_modify(pte, newprot));
pte = arch_make_huge_pte(pte, vma, NULL, 0);
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 6/8] mm/hugetlb: add migration entry check in __unmap_hugepage_range
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
If __unmap_hugepage_range() tries to unmap an address range over which
hugepage migration is in progress, it gets the wrong page, because
pte_page() doesn't work for migration entries. This patch simply clears
the pte for migration entries, just as we already do for hwpoison entries.
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [2.6.36+]
---
ChangeLog v4:
- check hwpoisoned hugepage and migrating hugepage with the same check
instead of doing separately
---
mm/hugetlb.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index 2807be3f260d..a2bfd02e289f 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -2657,9 +2657,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
goto unlock;
/*
- * HWPoisoned hugepage is already unmapped and dropped reference
+ * Migrating hugepage or HWPoisoned hugepage is already
+ * unmapped and its refcount is dropped, so just clear pte here.
*/
- if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
+ if (unlikely(!pte_present(pte))) {
huge_pte_clear(mm, address, ptep);
goto unlock;
}
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 7/8] mm/hugetlb: fix suboptimal migration/hwpoisoned entry check
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
Currently hugetlb_fault() checks first whether the pte of the faulted
address is a migration or hwpoisoned entry, which means that we call
huge_ptep_get() twice in a single hugetlb_fault() call. This is suboptimal.
The reason for this approach was that, without the early check,
huge_pte_alloc() could trigger a BUG_ON() because pmd_huge() returned
false for non-present hugetlb entries.
With patch 2/8 of this series, pmd_huge() now returns true for non-present
entries, so we no longer need this dirty workaround.
Let's move the checking code to the proper place.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
mm/hugetlb.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index a2bfd02e289f..6c38f9ad3d56 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -3136,20 +3136,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct address_space *mapping;
int need_wait_lock = 0;
+ int need_wait_migration = 0;
address &= huge_page_mask(h);
- ptep = huge_pte_offset(mm, address);
- if (ptep) {
- entry = huge_ptep_get(ptep);
- if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait_huge(vma, mm, ptep);
- return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
- return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- }
-
ptep = huge_pte_alloc(mm, address, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
@@ -3176,12 +3166,16 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* entry could be a migration/hwpoison entry at this point, so this
* check prevents the kernel from going below assuming that we have
- * a active hugepage in pagecache. This goto expects the 2nd page fault,
- * and is_hugetlb_entry_(migration|hwpoisoned) check will properly
- * handle it.
+ * a active hugepage in pagecache.
*/
- if (!pte_present(entry))
+ if (!pte_present(entry)) {
+ if (is_hugetlb_entry_migration(entry))
+ need_wait_migration = 1;
+ else if (is_hugetlb_entry_hwpoisoned(entry))
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto out_mutex;
+ }
/*
* If we are going to COW the mapping later, we examine the pending
@@ -3247,6 +3241,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
out_mutex:
mutex_unlock(&htlb_fault_mutex_table[hash]);
+ if (need_wait_migration)
+ migration_entry_wait_huge(vma, mm, ptep);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
--
2.2.0.rc0.2.gf745acb
* [PATCH v5 8/8] mm/hugetlb: cleanup and rename is_hugetlb_entry_(migration|hwpoisoned)()
From: Naoya Horiguchi @ 2014-12-02 8:26 UTC
To: Andrew Morton, Hugh Dickins
Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi
non_swap_entry() returns true if a given swp_entry_t is a migration entry
or an hwpoisoned entry. So non_swap_entry() && is_migration_entry() is
identical to just is_migration_entry(), and by dropping the
non_swap_entry() check we can write is_hugetlb_entry_(migration|hwpoisoned)()
more simply.
Also, the names is_hugetlb_entry_(migration|hwpoisoned) are lengthy and
not predictable from the naming convention of the pte_* family.
Just pte_migration() would look better, but these functions contain the
hugetlb-specific (and therefore architecture-dependent) huge_pte_none()
check, so let's rename them huge_pte_(migration|hwpoisoned).
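To illustrate why dropping non_swap_entry() is safe, here is a small
userspace sketch; the types and SWP_* values are simplified stand-ins,
not the real encoding (which packs type and offset into one word, with
the special types sitting above MAX_SWAPFILES):

#include <stdio.h>

#define MAX_SWAPFILES       30	/* made-up value for illustration */
#define SWP_MIGRATION_READ  (MAX_SWAPFILES + 0)
#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1)

typedef struct { unsigned type; unsigned long offset; } swp_entry_t;

/* anything with a special (>= MAX_SWAPFILES) type is not a real swap entry */
static int non_swap_entry(swp_entry_t e) { return e.type >= MAX_SWAPFILES; }

static int is_migration_entry(swp_entry_t e)
{
	return e.type == SWP_MIGRATION_READ || e.type == SWP_MIGRATION_WRITE;
}

int main(void)
{
	swp_entry_t mig  = { SWP_MIGRATION_READ, 1 };
	swp_entry_t swap = { 3, 1 };	/* an ordinary swap entry */

	/* is_migration_entry() already implies non_swap_entry() */
	printf("%d %d\n",
	       non_swap_entry(mig) && is_migration_entry(mig),	  /* 1 */
	       non_swap_entry(swap) && is_migration_entry(swap)); /* 0 */
	return 0;
}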
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
mm/hugetlb.c | 38 +++++++++++++-------------------------
1 file changed, 13 insertions(+), 25 deletions(-)
diff --git mmotm-2014-11-26-15-45.orig/mm/hugetlb.c mmotm-2014-11-26-15-45/mm/hugetlb.c
index 6c38f9ad3d56..bc9cbdb4f58f 100644
--- mmotm-2014-11-26-15-45.orig/mm/hugetlb.c
+++ mmotm-2014-11-26-15-45/mm/hugetlb.c
@@ -2516,30 +2516,18 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
update_mmu_cache(vma, address, ptep);
}
-static int is_hugetlb_entry_migration(pte_t pte)
+static inline int huge_pte_migration(pte_t pte)
{
- swp_entry_t swp;
-
if (huge_pte_none(pte) || pte_present(pte))
return 0;
- swp = pte_to_swp_entry(pte);
- if (non_swap_entry(swp) && is_migration_entry(swp))
- return 1;
- else
- return 0;
+ return is_migration_entry(pte_to_swp_entry(pte));
}
-static int is_hugetlb_entry_hwpoisoned(pte_t pte)
+static inline int huge_pte_hwpoisoned(pte_t pte)
{
- swp_entry_t swp;
-
if (huge_pte_none(pte) || pte_present(pte))
return 0;
- swp = pte_to_swp_entry(pte);
- if (non_swap_entry(swp) && is_hwpoison_entry(swp))
- return 1;
- else
- return 0;
+ return is_hwpoison_entry(pte_to_swp_entry(pte));
}
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
@@ -2583,8 +2571,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
entry = huge_ptep_get(src_pte);
if (huge_pte_none(entry)) { /* skip none entry */
;
- } else if (unlikely(is_hugetlb_entry_migration(entry) ||
- is_hugetlb_entry_hwpoisoned(entry))) {
+ } else if (unlikely(huge_pte_migration(entry) ||
+ huge_pte_hwpoisoned(entry))) {
swp_entry_t swp_entry = pte_to_swp_entry(entry);
if (is_write_migration_entry(swp_entry) && cow) {
@@ -3169,9 +3157,9 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* a active hugepage in pagecache.
*/
if (!pte_present(entry)) {
- if (is_hugetlb_entry_migration(entry))
+ if (huge_pte_migration(entry))
need_wait_migration = 1;
- else if (is_hugetlb_entry_hwpoisoned(entry))
+ else if (huge_pte_hwpoisoned(entry))
ret = VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
goto out_mutex;
@@ -3303,8 +3291,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* (in which case hugetlb_fault waits for the migration,) and
* hwpoisoned hugepages (in which case we need to prevent the
* caller from accessing to them.) In order to do this, we use
- * here is_swap_pte instead of is_hugetlb_entry_migration and
- * is_hugetlb_entry_hwpoisoned. This is because it simply covers
+ * here is_swap_pte instead of huge_pte_migration and
+ * huge_pte_hwpoisoned. This is because it simply covers
* both cases, and because we can't follow correct pages
* directly from any kind of swap entries.
*/
@@ -3382,11 +3370,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
continue;
}
pte = huge_ptep_get(ptep);
- if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
+ if (unlikely(huge_pte_hwpoisoned(pte))) {
spin_unlock(ptl);
continue;
}
- if (unlikely(is_hugetlb_entry_migration(pte))) {
+ if (unlikely(huge_pte_migration(pte))) {
swp_entry_t entry = pte_to_swp_entry(pte);
if (is_write_migration_entry(entry)) {
@@ -3730,7 +3718,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
if (flags & FOLL_GET)
get_page(page);
} else {
- if (is_hugetlb_entry_migration(huge_ptep_get((pte_t *)pmd))) {
+ if (huge_pte_migration(huge_ptep_get((pte_t *)pmd))) {
spin_unlock(ptl);
__migration_entry_wait(mm, (pte_t *)pmd, ptl);
goto retry;
--
2.2.0.rc0.2.gf745acb