From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, hughd@google.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, baohua@kernel.org, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: support large mapping building for tmpfs
Date: Wed, 2 Jul 2025 17:44:11 +0800
Message-ID: <67c79f65-ca6d-43be-a4ec-decd08bbce0a@linux.alibaba.com>
In-Reply-To: <ec5d4e52-658b-4fdc-b7f9-f844ab29665c@redhat.com>



On 2025/7/2 16:45, David Hildenbrand wrote:
>>> Hm, are we sure about that?
>>
>> IMO, referring to the definition of RSS:
>> "resident set size (RSS) is the portion of memory (measured in
>> kilobytes) occupied by a process that is held in main memory (RAM)."
>>
>> It seems we should report the whole large folio already in the file to
>> users. Moreover, the tmpfs mount already adds the 'huge=always (or
>> within_size)' option to allocate large folios, so the increase in RSS
>> also seems expected?
> 
> Well, traditionally we only account what is actually mapped. If you
> MADV_DONTNEED part of the large folio, or only mmap() parts of it,
> the RSS would never cover the whole folio -- only what is mapped.
> 
> I discuss part of that in:
> 
> commit 749492229e3bd6222dda7267b8244135229d1fd8
> Author: David Hildenbrand <david@redhat.com>
> Date:   Mon Mar 3 17:30:13 2025 +0100
> 
>      mm: stop maintaining the per-page mapcount of large folios
>      (CONFIG_NO_PAGE_MAPCOUNT)
> 
> and how my changes there affect some system stats (e.g., "AnonPages",
> "Mapped"). But the RSS stays unchanged and corresponds to what is
> actually mapped into the process.
> 
> Doing something similar for the RSS would be extremely hard (single
> page mapped into process -> account whole folio to RSS), because it's
> per-folio-per-process information, not per-folio information.

Thanks. Good to know this.
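
As a quick illustration of that per-mapping accounting (a minimal
userspace sketch; the file path and sizes are made up, error handling
omitted):
"
#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
	/* hypothetical tmpfs file >= 2MB, from a mount with huge=always */
	int fd = open("/mnt/tmpfs/file", O_RDWR);
	char *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	p[0] = 1;	/* fault in: RSS grows by whatever gets mapped */

	/*
	 * Zap the PTEs covering the first 1MB. The folio stays in the
	 * pagecache, but RSS drops, because RSS only counts what is
	 * currently mapped into this process.
	 */
	madvise(p, 1UL << 20, MADV_DONTNEED);
	return 0;
}
"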

> So by mapping more in a single page fault, you end up increasing "RSS". 
> But I wouldn't
> call that "expected". I rather suspect that nobody will really care :)

But tmpfs is a little special here: it uses the 'huge=' mount option to
control large folio allocation. Users who set that option have
explicitly asked for large folios, so building the whole mapping for a
large folio follows from that choice. That is why I call it 'expected'.
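
For reference, that opt-in happens at mount time (illustrative
invocation; the mount point is made up):
"
	mount -t tmpfs -o huge=within_size tmpfs /mnt/tmpfs
"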

>>> Also, how does fault_around_bytes interact here?
>>
>> 'fault_around' is a bit tricky. Currently, fault-around only applies
>> to read faults (via do_read_fault()) and does not control shared
>> write faults (via do_shared_fault()). Additionally, in
>> do_shared_fault(), PMD-sized large folios are also not controlled by
>> 'fault_around', so I just follow the handling of PMD-sized large
>> folios.
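
To make the asymmetry concrete, the fault dispatch looks roughly like
this (simplified sketch of do_fault() in mm/memory.c, not the literal
code):
"
	if (!(vmf->flags & FAULT_FLAG_WRITE))
		ret = do_read_fault(vmf);	/* may use fault-around */
	else if (!(vma->vm_flags & VM_SHARED))
		ret = do_cow_fault(vmf);
	else
		ret = do_shared_fault(vmf);	/* no fault-around here */
"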
>>
>>>> In order to support large mappings for tmpfs, besides checking VMA
>>>> limits and PMD pagetable limits, it is also necessary to check if
>>>> the linear page offset of the VMA is order-aligned within the file.
>>>
>>> Why?
>>>
>>> This only applies to PMD mappings. See below.
>>
>> I previously had the same question, but I saw the comment for the
>> thp_vma_suitable_order() function, so I added the check here. If the
>> check is not necessary for non-PMD-sized large folios, should we
>> update the comment for thp_vma_suitable_order()?
> 
> I was not quite clear about PMD vs. !PMD.
> 
> The thing is, when you *allocate* a new folio, it must adhere at least to
> pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) -- 

Yes, agree.

> that is what
> thp_vma_suitable_order() checks. Otherwise you cannot add it to the 
> pagecache.

But this alignment check is not done by thp_vma_suitable_order().

For tmpfs, the alignment is checked in shmem_suitable_orders() via:
"
	if (!xa_find(&mapping->i_pages, &aligned_index,
			aligned_index + pages - 1, XA_PRESENT))
"

For other filesystems, the alignment is checked in __filemap_get_folio()
via:
"
	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);
"

> But once you *obtain* a folio from the pagecache and are supposed to map it
> into the page tables, that must already hold true.
> 
> So you should be able to just blindly map whatever is given to you here
> AFAIKS.
> 
> If you would get a pagecache folio that violates the linear page offset 
> requirement
> at that point, something else would have messed up the pagecache.

Yes. But the comment in thp_vma_suitable_order() is not about pagecache
alignment; it says "the order-aligned addresses in the VMA map to
order-aligned offsets within the file", which was originally about
alignment for PMD mappings. So I wonder if we need this restriction for
non-PMD-sized large folios?

"
  *   - For file vma, check if the linear page offset of vma is
  *     order-aligned within the file.  The hugepage is
  *     guaranteed to be order-aligned within the file, but we must
  *     check that the order-aligned addresses in the VMA map to
  *     order-aligned offsets within the file, else the hugepage will
  *     not be mappable.
"


