From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed, 2 Jul 2025 17:44:11 +0800
Message-ID: <67c79f65-ca6d-43be-a4ec-decd08bbce0a@linux.alibaba.com>
Subject: Re: [PATCH] mm: support large mapping building for tmpfs
To: David Hildenbrand, akpm@linux-foundation.org, hughd@google.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
 baohua@kernel.org, vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
 mhocko@suse.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 2025/7/2 16:45, David Hildenbrand wrote:
>>> Hm, are we sure about that?
>>
>> IMO, referring to the definition of RSS:
>> "resident set size (RSS) is the portion of memory (measured in
>> kilobytes) occupied by a process that is held in main memory (RAM)."
>>
>> It seems we should report the whole large folio already in the file
>> to users. Moreover, the tmpfs mount already adds the 'huge=always
>> (or within)' option to allocate large folios, so the increase in RSS
>> also seems expected?
>
> Well, traditionally we only account what is actually mapped. If you
> MADV_DONTNEED part of the large folio, or only mmap() parts of it,
> the RSS would never cover the whole folio -- only what is mapped.
>
> I discuss part of that in:
>
> commit 749492229e3bd6222dda7267b8244135229d1fd8
> Author: David Hildenbrand
> Date:   Mon Mar 3 17:30:13 2025 +0100
>
>     mm: stop maintaining the per-page mapcount of large folios
>     (CONFIG_NO_PAGE_MAPCOUNT)
>
> and how my changes there affect some system stats (e.g., "AnonPages",
> "Mapped"). But the RSS stays unchanged and corresponds to what is
> actually mapped into the process.
>
> Doing something similar for the RSS would be extremely hard (single
> page mapped into process -> account whole folio to RSS), because it's
> per-folio-per-process information, not per-folio information.

Thanks. Good to know this.

> So by mapping more in a single page fault, you end up increasing
> "RSS". But I wouldn't call that "expected". I rather suspect that
> nobody will really care :)

But tmpfs is a little special here. It uses the 'huge=' option to
control large folio allocation. So, I think users already know that
they want to use large folios and build the whole mapping for them.
That is why I call it 'expected'.
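As a rough illustration of the accounting behavior David describes --
RSS only counts what is currently mapped -- here is a minimal
userspace sketch (not from this patch series; the /dev/shm file name
and the 2 MiB length are assumptions). Touching the whole mapping
raises VmRSS by the full range, and MADV_DONTNEED on half of it drops
VmRSS again, regardless of how large the underlying pagecache folios
are:

"
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (2UL * 1024 * 1024)	/* 2 MiB, i.e. one PMD-sized range */

/* Parse VmRSS (in kB) out of /proc/self/status. */
static long vmrss_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	while (f && fgets(line, sizeof(line), f))
		if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
			break;
	if (f)
		fclose(f);
	return kb;
}

int main(void)
{
	int fd = open("/dev/shm/rss-demo", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, LEN))
		return 1;

	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0x5a, LEN);		/* write-fault the whole range */
	printf("after touching all:  VmRSS = %ld kB\n", vmrss_kb());

	/* Zap the PTEs of the first half; the pagecache keeps the data. */
	madvise(p, LEN / 2, MADV_DONTNEED);
	printf("after MADV_DONTNEED: VmRSS = %ld kB\n", vmrss_kb());

	munmap(p, LEN);
	close(fd);
	unlink("/dev/shm/rss-demo");
	return 0;
}
"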
>>> Also, how does fault_around_bytes interact here?
>>
>> 'fault_around' is a bit tricky. Currently, 'fault_around' only
>> applies to read faults (via do_read_fault()) and does not control
>> shared write faults (via do_shared_fault()). Additionally, in
>> do_shared_fault(), PMD-sized large folios are also not controlled by
>> 'fault_around', so I just follow the handling of PMD-sized large
>> folios.
>>
>>>> In order to support large mappings for tmpfs, besides checking VMA
>>>> limits and PMD pagetable limits, it is also necessary to check
>>>> whether the linear page offset of the VMA is order-aligned within
>>>> the file.
>>>
>>> Why?
>>>
>>> This only applies to PMD mappings. See below.
>>
>> I previously had the same question, but I saw the comments for the
>> thp_vma_suitable_order() function, so I added the check here. If it
>> is not necessary to check non-PMD-sized large folios, should we
>> update the comments for thp_vma_suitable_order()?
>
> I was not quite clear about PMD vs. !PMD.
>
> The thing is, when you *allocate* a new folio, it must adhere at
> least to pagecache alignment (e.g., you cannot place an order-2
> folio at pgoff 1) --

Yes, agree.

> that is what thp_vma_suitable_order() checks. Otherwise you cannot
> add it to the pagecache.

But this alignment is not done by thp_vma_suitable_order(). For tmpfs,
the alignment is checked in shmem_suitable_orders() via:

"
if (!xa_find(&mapping->i_pages, &aligned_index,
	     aligned_index + pages - 1, XA_PRESENT))
"

For other filesystems, the alignment is checked in
__filemap_get_folio() via:

"
/* If we're not aligned, allocate a smaller folio */
if (index & ((1UL << order) - 1))
	order = __ffs(index);
"

> But once you *obtain* a folio from the pagecache and are supposed to
> map it into the page tables, that must already hold true.
>
> So you should be able to just blindly map whatever is given to you
> here AFAIKS.
>
> If you would get a pagecache folio that violates the linear page
> offset requirement at that point, something else would have messed
> up the pagecache.

Yes. But the comment for thp_vma_suitable_order() is not about
pagecache alignment; it says "the order-aligned addresses in the VMA
map to order-aligned offsets within the file", which was originally
meant to align PMD mappings. So I wonder whether we need this
restriction for non-PMD-sized large folios?

"
 * - For file vma, check if the linear page offset of vma is
 *   order-aligned within the file. The hugepage is
 *   guaranteed to be order-aligned within the file, but we must
 *   check that the order-aligned addresses in the VMA map to
 *   order-aligned offsets within the file, else the hugepage will
 *   not be mappable.
"
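For reference, the condition that the quoted comment describes can be
spelled out directly: an address addr in a file VMA maps to file page
offset vm_pgoff + ((addr - vm_start) >> PAGE_SHIFT), so order-aligned
addresses land on order-aligned file offsets exactly when
(vm_start >> PAGE_SHIFT) - vm_pgoff is a multiple of (1 << order).
Below is a simplified userspace sketch of just that check (my own
illustration; the helper name and example values are made up, only the
arithmetic mirrors the check being discussed):

"
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12	/* assume 4 KiB base pages */

/*
 * Do order-aligned addresses in a VMA starting at vm_start, backed
 * from file page offset vm_pgoff, map to order-aligned page offsets
 * within the file?
 */
static bool offset_suitable_order(unsigned long vm_start,
				  unsigned long vm_pgoff,
				  unsigned int order)
{
	unsigned long nr_pages = 1UL << order;

	return (((vm_start >> PAGE_SHIFT) - vm_pgoff) &
		(nr_pages - 1)) == 0;
}

int main(void)
{
	/* VMA at 2 MiB backed from file offset 0: fine for order 9. */
	printf("%d\n", offset_suitable_order(0x200000, 0, 9));	/* 1 */

	/* Same VMA backed from file page offset 1: not suitable. */
	printf("%d\n", offset_suitable_order(0x200000, 1, 9));	/* 0 */
	return 0;
}
"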