From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand <david@redhat.com>,
akpm@linux-foundation.org, hughd@google.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, baohua@kernel.org, vbabka@suse.cz,
rppt@kernel.org, surenb@google.com, mhocko@suse.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: support large mapping building for tmpfs
Date: Fri, 4 Jul 2025 10:04:39 +0800
Message-ID: <9771e4ac-4f25-4822-9882-d8a94813e7c0@linux.alibaba.com>
In-Reply-To: <7b17af10-b052-4719-bbce-ffad2d74006a@redhat.com>

On 2025/7/2 19:38, David Hildenbrand wrote:
>
>>> So by mapping more in a single page fault, you end up increasing "RSS".
>>> But I wouldn't
>>> call that "expected". I rather suspect that nobody will really care :)
>>
>> But tmpfs is a little special here. It uses the 'huge=' option to
>> control large folio allocation. So, I think users should know they want
>> to use large folios and build the whole mapping for the large folios.
>> That is why I call it 'expected'.
>
> Well, if your distribution decides to set huge= on /tmp or something
> like that, your application might have very little say in that,
> right? :)
>
> Again, I assume it's fine, but we might find surprises on the way.
>
>>>
>>> The thing is, when you *allocate* a new folio, it must adhere at
>>> least to
>>> pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) --
>>
>> Yes, agree.
>>
>>> that is what
>>> thp_vma_suitable_order() checks. Otherwise you cannot add it to the
>>> pagecache.
>>
>> But this alignment is not done by thp_vma_suitable_order().
>>
>> For tmpfs, it will check the alignment in shmem_suitable_orders() via:
>> "
>> if (!xa_find(&mapping->i_pages, &aligned_index,
>> aligned_index + pages - 1, XA_PRESENT))
>> "
>
> That's not really alignment check, that's just checking whether a
> suitable folio order spans already-present entries, no?
It is implicit: 'aligned_index' has already been rounded down with
round_down() before the xa_find() check, so the range being checked
always starts at an order-aligned index. I would still consider that an
implicit alignment check.
"
pages = 1UL << order;
aligned_index = round_down(index, pages);
"
> Finding suitable orders is still up to other code IIUC.
>
>>
>> For other fs systems, it will check the alignment in
>> __filemap_get_folio() via:
>> "
>> /* If we're not aligned, allocate a smaller folio */
>> if (index & ((1UL << order) - 1))
>> order = __ffs(index);
>> "
>>
>>> But once you *obtain* a folio from the pagecache and are supposed to
>>> map it
>>> into the page tables, that must already hold true.
>>>
>>> So you should be able to just blindly map whatever is given to you here
>>> AFAIKS.
>>>
>>> If you would get a pagecache folio that violates the linear page offset
>>> requirement
>>> at that point, something else would have messed up the pagecache.
>>
>> Yes. But the comments from thp_vma_suitable_order() is not about the
>> pagecache alignment, it says "the order-aligned addresses in the VMA map
>> to order-aligned offsets within the file",
>
> Let's dig, it's confusing.
>
> The code in question is:
>
> if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
> hpage_size >> PAGE_SHIFT))
>
> So yes, I think this tells us: if we would have a PMD THP in the
> pagecache, would we be able to map it with a PMD. If not, then don't
> bother with allocating a PMD THP.
>
> Of course, this also applies to other orders, but for PMD THPs it's
> probably most relevant: if we cannot even map it through a PMD, then
> probably it could be a wasted THP.
>
> So yes, I agree: if we are both not missing something, then this is
> primarily relevant for the PMD case.
>
> And it's more about "optimization" than "correctness" I guess?
>
> But when mapping a folio that is already in the pagecache, I assume this
> is not required.
>
> Assume we have a 2 MiB THP in the pagecache.
>
> If someone were to map it at virtual addr 1MiB, we could still map 1MiB
> worth of PTEs into a single page table in one go, and not fallback to
> individual PTEs.
That's how I understand the code as well; I just wasn't very sure
before. Thanks for the explanation, which cleared up my doubts.

I will drop the check in the next version.
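
For reference, a rough userspace sketch of the two points above, under
stated assumptions: 4 KiB pages and 512 PTEs per page table (as on
x86-64), with the macros redefined locally rather than taken from kernel
headers. vma_offset_aligned() mirrors the IS_ALIGNED() expression quoted
from thp_vma_suitable_order(); ptes_in_one_table() is a hypothetical
helper, not an existing kernel function.

```c
#include <stdbool.h>

/* Assumed values for illustration only (4 KiB pages, 512 PTEs per table). */
#define PAGE_SHIFT	12UL
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PTRS_PER_PTE	512UL
#define PMD_SIZE	(PTRS_PER_PTE * PAGE_SIZE)

/* Local stand-in for the kernel macro; 'a' must be a power of two. */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * Mirrors the quoted check: do hpage_size-aligned addresses in the VMA
 * correspond to hpage_size-aligned offsets in the file?
 */
static bool vma_offset_aligned(unsigned long vm_start, unsigned long vm_pgoff,
			       unsigned long hpage_size)
{
	return IS_ALIGNED((vm_start >> PAGE_SHIFT) - vm_pgoff,
			  hpage_size >> PAGE_SHIFT);
}

/*
 * Hypothetical helper: how many of 'nr_pages' consecutive pages starting
 * at 'addr' fit into the single page table that covers 'addr'.
 */
static unsigned long ptes_in_one_table(unsigned long addr,
				       unsigned long nr_pages)
{
	unsigned long pte_idx = (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
	unsigned long room = PTRS_PER_PTE - pte_idx;

	return nr_pages < room ? nr_pages : room;
}
```

With a 2 MiB folio at file offset 0 and vm_start at 1 MiB,
vma_offset_aligned(1UL << 20, 0, PMD_SIZE) is false, so a PMD mapping is
not possible, yet ptes_in_one_table(1UL << 20, 512) is 256, i.e. 1 MiB
worth of PTEs can still be set up in one go.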