From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed, 2 Jul 2025 17:44:11 +0800
Message-ID: <67c79f65-ca6d-43be-a4ec-decd08bbce0a@linux.alibaba.com>
Subject: Re: [PATCH] mm: support large mapping building for tmpfs
To: David Hildenbrand, akpm@linux-foundation.org, hughd@google.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
 baohua@kernel.org, vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
 mhocko@suse.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 2025/7/2 16:45, David Hildenbrand wrote:
>>> Hm, are we sure about that?
>>
>> IMO, referring to the definition of RSS:
>> "resident set size (RSS) is the portion of memory (measured in
>> kilobytes) occupied by a process that is held in main memory (RAM)."
>>
>> It seems we should report the whole large folio already in the file
>> to users. Moreover, the tmpfs mount already adds the 'huge=always
>> (or within)' option to allocate large folios, so the increase in RSS
>> also seems expected?
>
> Well, traditionally we only account what is actually mapped. If you
> MADV_DONTNEED part of the large folio, or only mmap() parts of it,
> the RSS would never cover the whole folio -- only what is mapped.
>
> I discuss part of that in:
>
> commit 749492229e3bd6222dda7267b8244135229d1fd8
> Author: David Hildenbrand
> Date:   Mon Mar 3 17:30:13 2025 +0100
>
>     mm: stop maintaining the per-page mapcount of large folios
>     (CONFIG_NO_PAGE_MAPCOUNT)
>
> and how my changes there affect some system stats (e.g., "AnonPages",
> "Mapped"). But the RSS stays unchanged and corresponds to what is
> actually mapped into the process.
>
> Doing something similar for the RSS would be extremely hard (single
> page mapped into process -> account whole folio to RSS), because it's
> per-folio-per-process information, not per-folio information.

Thanks. Good to know this.

> So by mapping more in a single page fault, you end up increasing
> "RSS". But I wouldn't call that "expected". I rather suspect that
> nobody will really care :)

But tmpfs is a little special here. It uses the 'huge=' option to
control large folio allocation. So, I think users already know that
they want to use large folios and build the whole mapping for them.
That is why I call it 'expected'.
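As a rough illustration of the accounting behavior David describes --
RSS only counts what is currently mapped -- here is a minimal
userspace sketch (not from this patch series; the /dev/shm file name
and the 2 MiB length are assumptions). Touching the whole mapping
raises VmRSS by the full range, and MADV_DONTNEED on half of it drops
VmRSS again, regardless of how large the underlying pagecache folios
are:

"
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)	/* 2 MiB, i.e. one PMD-sized range */

/* Parse VmRSS (in kB) out of /proc/self/status. */
static long vmrss_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	while (f && fgets(line, sizeof(line), f))
		if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
			break;
	if (f)
		fclose(f);
	return kb;
}

int main(void)
{
	int fd = open("/dev/shm/rss-demo", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, LEN))
		return 1;

	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0x5a, LEN);		/* write-fault the whole range */
	printf("after touching all:  VmRSS = %ld kB\n", vmrss_kb());

	/* Zap the PTEs of the first half; the pagecache keeps the data. */
	madvise(p, LEN / 2, MADV_DONTNEED);
	printf("after MADV_DONTNEED: VmRSS = %ld kB\n", vmrss_kb());

	munmap(p, LEN);
	close(fd);
	unlink("/dev/shm/rss-demo");
	return 0;
}
"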
>>> Also, how does fault_around_bytes interact here?
>>
>> 'fault_around' is a bit tricky. Currently, 'fault_around' only
>> applies to read faults (via do_read_fault()) and does not control
>> shared write faults (via do_shared_fault()). Additionally, in
>> do_shared_fault(), PMD-sized large folios are also not controlled by
>> 'fault_around', so I just follow the handling of PMD-sized large
>> folios.
>>
>>>> In order to support large mappings for tmpfs, besides checking VMA
>>>> limits and PMD pagetable limits, it is also necessary to check
>>>> whether the linear page offset of the VMA is order-aligned within
>>>> the file.
>>>
>>> Why?
>>>
>>> This only applies to PMD mappings. See below.
>>
>> I previously had the same question, but I saw the comments for the
>> thp_vma_suitable_order() function, so I added the check here. If it
>> is not necessary to check non-PMD-sized large folios, should we
>> update the comments for thp_vma_suitable_order()?
>
> I was not quite clear about PMD vs. !PMD.
>
> The thing is, when you *allocate* a new folio, it must adhere at
> least to pagecache alignment (e.g., you cannot place an order-2
> folio at pgoff 1) --

Yes, agree.

> that is what thp_vma_suitable_order() checks. Otherwise you cannot
> add it to the pagecache.

But this alignment is not done by thp_vma_suitable_order(). For tmpfs,
the alignment is checked in shmem_suitable_orders() via:

"
if (!xa_find(&mapping->i_pages, &aligned_index,
	     aligned_index + pages - 1, XA_PRESENT))
"

For other filesystems, the alignment is checked in
__filemap_get_folio() via:

"
/* If we're not aligned, allocate a smaller folio */
if (index & ((1UL << order) - 1))
	order = __ffs(index);
"

> But once you *obtain* a folio from the pagecache and are supposed to
> map it into the page tables, that must already hold true.
>
> So you should be able to just blindly map whatever is given to you
> here AFAIKS.
>
> If you would get a pagecache folio that violates the linear page
> offset requirement at that point, something else would have messed
> up the pagecache.

Yes. But the comment for thp_vma_suitable_order() is not about
pagecache alignment; it says "the order-aligned addresses in the VMA
map to order-aligned offsets within the file", which was originally
meant to align PMD mappings. So I wonder whether we need this
restriction for non-PMD-sized large folios?

"
 * - For file vma, check if the linear page offset of vma is
 *   order-aligned within the file. The hugepage is
 *   guaranteed to be order-aligned within the file, but we must
 *   check that the order-aligned addresses in the VMA map to
 *   order-aligned offsets within the file, else the hugepage will
 *   not be mappable.
"
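For reference, the condition that the quoted comment describes can be
spelled out directly: an address addr in a file VMA maps to file page
offset vm_pgoff + ((addr - vm_start) >> PAGE_SHIFT), so order-aligned
addresses land on order-aligned file offsets exactly when
(vm_start >> PAGE_SHIFT) - vm_pgoff is a multiple of (1 << order).
Below is a simplified userspace sketch of just that check (my own
illustration; the helper name and example values are made up, only the
arithmetic mirrors the check being discussed):

"
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12	/* assume 4 KiB base pages */

/*
 * Do order-aligned addresses in a VMA starting at vm_start, backed
 * from file page offset vm_pgoff, map to order-aligned page offsets
 * within the file?
 */
static bool offset_suitable_order(unsigned long vm_start,
				  unsigned long vm_pgoff,
				  unsigned int order)
{
	unsigned long nr_pages = 1UL << order;

	return (((vm_start >> PAGE_SHIFT) - vm_pgoff) &
		(nr_pages - 1)) == 0;
}

int main(void)
{
	/* VMA at 2 MiB backed from file offset 0: fine for order 9. */
	printf("%d\n", offset_suitable_order(0x200000, 0, 9));	/* 1 */

	/* Same VMA backed from file page offset 1: not suitable. */
	printf("%d\n", offset_suitable_order(0x200000, 1, 9));	/* 0 */
	return 0;
}
"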