From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand <david@redhat.com>,
Mike Kravetz <mike.kravetz@oracle.com>
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page
Date: Thu, 25 Aug 2022 18:54:07 +0800
Message-ID: <9b5b9b7a-3cba-37a1-411a-5031a67208fe@linux.alibaba.com>
In-Reply-To: <887ca2e2-a7c5-93a7-46cb-185daccd4444@redhat.com>
On 8/25/2022 3:25 PM, David Hildenbrand wrote:
>> Is the primary concern the locking? If so, I am not sure we have an issue.
>> As mentioned in your commit message, current code will use
>> pte_offset_map_lock(). pte_offset_map_lock uses pte_lockptr, and pte_lockptr
>> will either be the mm wide lock or pmd_page lock. To me, it seems that
>> either would provide correct synchronization for CONT-PTE entries. Am I
>> missing something or misreading the code?
>>
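
(Side note on the locking point above: a simplified sketch of how
pte_lockptr() -- which pte_offset_map_lock() uses -- picks the lock. This
is not the exact kernel source, just the gist, and the helper name
pte_lockptr_sketch is made up for illustration:)

/* Simplified: which spinlock protects the PTE table under this pmd? */
static inline spinlock_t *pte_lockptr_sketch(struct mm_struct *mm, pmd_t *pmd)
{
#if USE_SPLIT_PTE_PTLOCKS
	/* Split ptlocks: the lock lives with the PTE-table page. */
	return ptlock_ptr(pmd_page(*pmd));
#else
	/* Otherwise everything serializes on the single mm-wide lock. */
	return &mm->page_table_lock;
#endif
}

So a CONT-PTE size hugetlb entry ends up serialized either on the mm-wide
lock or on the PTE-table page lock, which matches the description above.
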
>> I started looking at code cleanup suggested by David. Here is a quick
>> patch (not tested and likely containing errors) to see if this is a step
>> in the right direction.
>>
>> I like it because we get rid of/combine all those follow_huge_p*d
>> routines.
>>
>
> Yes, see comments below.
>
>> From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>> Date: Wed, 24 Aug 2022 15:59:15 -0700
>> Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in
>> follow_page_mask
>>
>> At the beginning of follow_page_mask, there is currently a call to
>> follow_huge_addr which 'may' handle hugetlb pages. ia64 is the only
>> architecture which (incorrectly) provides a follow_huge_addr routine
>> that does not simply return an error. Instead, at each level of the
>> page table a check is made for a hugetlb entry. If a hugetlb entry is
>> found, a call to a routine associated with that page table level such
>> as follow_huge_pmd is made.
>>
>> All the follow_huge_p*d routines are basically the same. In addition,
>> the huge page size can be derived from the vma, so we know where in the
>> page table a huge page would reside. So, replace follow_huge_addr with
>> a new architecture-independent routine which will provide the same
>> functionality as the follow_huge_p*d routines. We can then eliminate
>> the p*d_huge checks in follow_page_mask page table walking as well as
>> the follow_huge_p*d routines themselves.
>>
>> follow_page_mask still has is_hugepd hugetlb checks during page table
>> walking. This is due to these checks and follow_huge_pd being
>> architecture specific. These can be eliminated if
>> hugetlb_follow_page_mask can be overridden by architectures (powerpc)
>> that need to do follow_huge_pd processing.
>
> But won't the
>
>> + /* hugetlb is special */
>> + if (is_vm_hugetlb_page(vma))
>> + return hugetlb_follow_page_mask(vma, address, flags);
>
> code route everything via hugetlb_follow_page_mask() and all these
> (beloved) hugepd checks would essentially be unreachable?
>
> At least my understanding is that hugepd only applies to hugetlb.
>
> Can't we move the hugepd handling code into hugetlb_follow_page_mask()
> as well?
>
> I mean, doesn't follow_hugetlb_page() also have to handle that hugepd
> stuff already ... ?
Yes, I also thought about this, and I made a simple patch (untested)
based on Mike's patch to make it cleaner:
diff --git a/mm/gup.c b/mm/gup.c
index d3239ea63159..1003c03dcf78 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -626,14 +626,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
pmdval = READ_ONCE(*pmd);
if (pmd_none(pmdval))
return no_page_table(vma, flags);
- if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
- page = follow_huge_pd(vma, address,
- __hugepd(pmd_val(pmdval)), flags,
- PMD_SHIFT);
- if (page)
- return page;
- return no_page_table(vma, flags);
- }
+
retry:
if (!pmd_present(pmdval)) {
/*
@@ -723,14 +716,6 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
pud = pud_offset(p4dp, address);
if (pud_none(*pud))
return no_page_table(vma, flags);
- if (is_hugepd(__hugepd(pud_val(*pud)))) {
- page = follow_huge_pd(vma, address,
- __hugepd(pud_val(*pud)), flags,
- PUD_SHIFT);
- if (page)
- return page;
- return no_page_table(vma, flags);
- }
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags,
&ctx->pgmap);
@@ -759,14 +744,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
if (unlikely(p4d_bad(*p4d)))
return no_page_table(vma, flags);
- if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
- page = follow_huge_pd(vma, address,
- __hugepd(p4d_val(*p4d)), flags,
- P4D_SHIFT);
- if (page)
- return page;
- return no_page_table(vma, flags);
- }
return follow_pud_mask(vma, address, p4d, flags, ctx);
}
@@ -813,15 +790,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);
- if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
- page = follow_huge_pd(vma, address,
- __hugepd(pgd_val(*pgd)), flags,
- PGDIR_SHIFT);
- if (page)
- return page;
- return no_page_table(vma, flags);
- }
-
return follow_p4d_mask(vma, address, pgd, flags, ctx);
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c107e7ebd66..848b4fb7a05d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6118,6 +6118,81 @@ static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
return false;
}
+static struct page *hugetlb_follow_hugepd(struct vm_area_struct *vma,
+ unsigned long address,
+ unsigned int flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_none(*pgd) || pgd_bad(*pgd))
+ return ERR_PTR(-EFAULT);
+
+ if (pgd_huge(*pgd))
+ return NULL;
+
+ if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
+ page = follow_huge_pd(vma, address,
+ __hugepd(pgd_val(*pgd)), flags,
+ PGDIR_SHIFT);
+ if (page)
+ return page;
+ return ERR_PTR(-EFAULT);
+ }
+
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(*p4d) || p4d_bad(*p4d))
+ return ERR_PTR(-EFAULT);
+
+ if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
+ page = follow_huge_pd(vma, address,
+ __hugepd(p4d_val(*p4d)), flags,
+ P4D_SHIFT);
+ if (page)
+ return page;
+ return ERR_PTR(-EFAULT);
+ }
+
+ pud = pud_offset(p4d, address);
+ if (pud_none(*pud) || pud_bad(*pud))
+ return ERR_PTR(-EFAULT);
+
+ if (pud_huge(*pud))
+ return NULL;
+
+ if (is_hugepd(__hugepd(pud_val(*pud)))) {
+ page = follow_huge_pd(vma, address,
+ __hugepd(pud_val(*pud)), flags,
+ PUD_SHIFT);
+ if (page)
+ return page;
+ return ERR_PTR(-EFAULT);
+ }
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || pmd_bad(*pmd))
+ return ERR_PTR(-EFAULT);
+
+ if (pmd_huge(*pmd))
+ return NULL;
+
+ if (is_hugepd(__hugepd(pmd_val(*pmd)))) {
+ page = follow_huge_pd(vma, address,
+ __hugepd(pmd_val(*pmd)), flags,
+ PMD_SHIFT);
+ if (page)
+ return page;
+ return ERR_PTR(-EFAULT);
+ }
+
+ return NULL;
+}
+
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
@@ -6135,6 +6210,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;
+ page = hugetlb_follow_hugepd(vma, address, flags);
+ if (page)
+ return page;
+
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
return NULL;
>>
>> +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>> + unsigned long address, unsigned int flags)
>> +{
>> + struct hstate *h = hstate_vma(vma);
>> + struct mm_struct *mm = vma->vm_mm;
>> + unsigned long haddr = address & huge_page_mask(h);
>> + struct page *page = NULL;
>> + spinlock_t *ptl;
>> + pte_t *pte, entry;
>> +
>> + /*
>> + * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
>> + * follow_hugetlb_page().
>> + */
>> + if (WARN_ON_ONCE(flags & FOLL_PIN))
>> + return NULL;
>> +
>> + pte = huge_pte_offset(mm, haddr, huge_page_size(h));
>> + if (!pte)
>> + return NULL;
>> +
>> +retry:
>> + ptl = huge_pte_lock(h, mm, pte);
>> + entry = huge_ptep_get(pte);
>> + if (pte_present(entry)) {
>> + page = pte_page(entry);
>> + /*
>> + * try_grab_page() should always succeed here, because we hold
>> + * the ptl lock and have verified pte_present().
>> + */
>> + if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
>> + page = NULL;
>> + goto out;
>> + }
>> + } else {
>> + if (is_hugetlb_entry_migration(entry)) {
>> + spin_unlock(ptl);
>> + __migration_entry_wait_huge(pte, ptl);
>> + goto retry;
>> + }
>> + /*
>> + * hwpoisoned entry is treated as no_page_table in
>> + * follow_page_mask().
>> + */
>> + }
>> +out:
>> + spin_unlock(ptl);
>> + return page;
>
>
> This is neat and clean enough to not reuse follow_hugetlb_page(). I
> wonder if we want to add a comment to the function on how it differs
> from follow_hugetlb_page().
>
> ... or do we maybe want to rename follow_hugetlb_page() to something like
> __hugetlb_get_user_pages() to make it clearer in which context it will
> get called?
Sounds reasonable to me.
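
For the comment David suggests, maybe something along these lines (just a
rough sketch of the wording, grounded in the code above, to be adjusted):

/*
 * hugetlb_follow_page_mask() is the lookup-only path used by
 * follow_page_mask() for a single hugetlb address: it never faults and
 * does not support FOLL_PIN.  Ordinary GUP of hugetlb ranges (including
 * faulting) still goes through follow_hugetlb_page().
 */
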
> I guess it might be feasible in the future to eliminate
> follow_hugetlb_page() and centralize the faulting code. For now, this
> certainly improves the situation.
>