From: Yin Tirui <yintirui@huawei.com>
To: "Matthew Wilcox" <willy@infradead.org>, "Jürgen Groß" <jgross@suse.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<x86@kernel.org>, <linux-arm-kernel@lists.infradead.org>,
	<david@kernel.org>, <catalin.marinas@arm.com>, <will@kernel.org>,
	<tglx@kernel.org>, <mingo@redhat.com>, <bp@alien8.de>,
	<dave.hansen@linux.intel.com>, <hpa@zytor.com>, <luto@kernel.org>,
	<peterz@infradead.org>, <akpm@linux-foundation.org>,
	<lorenzo.stoakes@oracle.com>, <ziy@nvidia.com>,
	<baolin.wang@linux.alibaba.com>, <Liam.Howlett@oracle.com>,
	<npache@redhat.com>, <ryan.roberts@arm.com>, <dev.jain@arm.com>,
	<baohua@kernel.org>, <lance.yang@linux.dev>, <vbabka@suse.cz>,
	<rppt@kernel.org>, <surenb@google.com>, <mhocko@suse.com>,
	<anshuman.khandual@arm.com>, <rmclure@linux.ibm.com>,
	<kevin.brodsky@arm.com>, <apopple@nvidia.com>,
	<ajd@linux.ibm.com>, <pasha.tatashin@soleen.com>,
	<bhe@redhat.com>, <thuth@redhat.com>, <coxu@redhat.com>,
	<dan.j.williams@intel.com>, <yu-cheng.yu@intel.com>,
	<baolu.lu@linux.intel.com>, <conor.dooley@microchip.com>,
	<Jonathan.Cameron@huawei.com>, <riel@surriel.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>, <chenjun102@huawei.com>
Subject: Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
Date: Thu, 5 Mar 2026 17:38:46 +0800	[thread overview]
Message-ID: <e8d5ba16-3071-4b4a-b6a1-d492cc46e8c2@huawei.com> (raw)
In-Reply-To: <5eaf3846-01db-471e-9903-b0b239d7838d@suse.com>


On 3/4/2026 3:52 PM, Jürgen Groß wrote:
> On 28.02.26 08:09, Yin Tirui wrote:
>> A fundamental principle of page table type safety is that `pte_t`
>> represents the lowest-level page table entry and should never carry
>> huge page attributes.
>>
>> Currently, passing a pgprot with huge page bits (e.g., extracted via
>> pmd_pgprot()) into pfn_pte() creates a malformed PTE that retains the
>> huge attribute, leading to the necessity of the ugly `pte_clrhuge()`
>> anti-pattern.
>>
>> Enforce type safety by making `pfn_pte()` inherently filter out huge page
>> attributes:
>> - On x86: Strip the `_PAGE_PSE` bit.
>> - On ARM64: Mask out the block descriptor bits in `PTE_TYPE_MASK` and
>>    enforce the `PTE_TYPE_PAGE` format.
>> - On RISC-V: No changes required, as RISC-V leaf PMDs and PTEs share the
>>    exact same hardware format and do not use a distinct huge bit.
>>
>> Signed-off-by: Yin Tirui <yintirui@huawei.com>
>> ---
>>   arch/arm64/include/asm/pgtable.h | 4 +++-
>>   arch/x86/include/asm/pgtable.h   | 4 ++++
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b3e58735c49b..f2a7a40106d2 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -141,7 +141,9 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>   #define pte_pfn(pte)        (__pte_to_phys(pte) >> PAGE_SHIFT)
>>   #define pfn_pte(pfn,prot)    \
>> -    __pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
>> +    __pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | \
>> +        ((pgprot_val(prot) & ~(PTE_TYPE_MASK & ~PTE_VALID)) | \
>> +        (PTE_TYPE_PAGE & ~PTE_VALID)))
>>   #define pte_none(pte)        (!pte_val(pte))
>>   #define pte_page(pte)        (pfn_to_page(pte_pfn(pte)))
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 1662c5a8f445..a4dbd81d42bf 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -738,6 +738,10 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
>>   static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
>>   {
>>       phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
>> +
>> +    /* Filter out _PAGE_PSE to ensure PTEs never carry the huge page bit */
>> +    pgprot = __pgprot(pgprot_val(pgprot) & ~_PAGE_PSE);
> 
> Is it really a good idea to silently drop the bit?
> 
> Today it can either be used for a large page (which should be a pmd,
> of course), or - much worse - you'd strip the _PAGE_PAT bit, which is
> at the same position in PTEs.
> 
> So basically you are removing the ability to use some cache modes.
> 
> NACK!
> 
> 
> Juergen

Hi Willy and Jürgen,

Following up on the x86 _PAGE_PSE and _PAGE_PAT aliasing issue.

To achieve the goal of keeping pfn_pte() pure and completely eradicating 
the pte_clrhuge() anti-pattern, we need a way to ensure pfn_pte() never 
receives a pgprot with the huge bit set.

@Jürgen:
Just to be absolutely certain: is there any safe way to filter out the 
huge page attributes directly inside x86's pfn_pte() without breaking 
PAT? Or does the hardware bit-aliasing make this strictly impossible at 
the pfn_pte() level?

@Willy @Jürgen:
Assuming it is impossible to filter this safely inside pfn_pte() on x86, 
we must translate the pgprot before passing it down. To maintain strict 
type-safety and still drop pte_clrhuge(), I plan to introduce two 
arch-neutral wrappers:

x86:
/*
 * Translates large prot to 4K: shifts PAT back to bit 7, inherently
 * clearing _PAGE_PSE.
 */
#define pgprot_huge_to_pte(prot)	pgprot_large_2_4k(prot)
/*
 * Translates 4K prot to large: shifts PAT to bit 12, strictly sets
 * _PAGE_PSE.
 */
#define pgprot_pte_to_huge(prot) \
	__pgprot(pgprot_val(pgprot_4k_2_large(prot)) | _PAGE_PSE)

arm64:
/*
 * Drops Block marker, enforces Page marker.
 * Strictly preserves the PTE_VALID bit to avoid validating PROT_NONE
 * pages.
 */
#define pgprot_huge_to_pte(prot) \
	__pgprot((pgprot_val(prot) & ~(PMD_TYPE_MASK & ~PTE_VALID)) | \
		 (PTE_TYPE_PAGE & ~PTE_VALID))
/*
 * Drops Page marker, sets Block marker.
 * Strictly preserves the PTE_VALID bit.
 */
#define pgprot_pte_to_huge(prot) \
	__pgprot((pgprot_val(prot) & ~(PTE_TYPE_MASK & ~PTE_VALID)) | \
		 (PMD_TYPE_SECT & ~PTE_VALID))

Usage:
1. Creating a huge pfnmap (remap_try_huge_pmd)
pgprot_t huge_prot = pgprot_pte_to_huge(prot);

/* No need for pmd_mkhuge() */
pmd_t entry = pmd_mkspecial(pfn_pmd(pfn, huge_prot));
set_pmd_at(mm, addr, pmd, entry);

2. Splitting a huge pfnmap (__split_huge_pmd_locked)
pgprot_t small_prot = pgprot_huge_to_pte(pmd_pgprot(old_pmd));

/* No need for pte_clrhuge() */
pte_t entry = pfn_pte(pmd_pfn(old_pmd), small_prot);
set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);


Willy, is there a better architectural approach to handle this and 
satisfy the type-safety requirement given the x86 hardware constraints?

-- 
Thanks,
Yin Tirui




Thread overview: 9+ messages
2026-02-28  7:09 [PATCH RFC v3 0/4] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 1/4] x86/mm: Use proper page table helpers for huge page generation Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes Yin Tirui
2026-03-04  7:52   ` Jürgen Groß
2026-03-04 10:08     ` Yin Tirui
2026-03-05  9:38     ` Yin Tirui [this message]
2026-03-05 10:05       ` Jürgen Groß
2026-02-28  7:09 ` [PATCH RFC v3 3/4] x86/mm: Remove pte_clrhuge() and clean up init_64.c Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
