From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
Andrey Ryabinin <ryabinin.a.a@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Mark Rutland <mark.rutland@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>, Zi Yan <ziy@nvidia.com>,
Barry Song <21cnbao@gmail.com>,
Alistair Popple <apopple@nvidia.com>,
Yang Shi <shy828301@gmail.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: linux-arm-kernel@lists.infradead.org, x86@kernel.org,
linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings
Date: Tue, 25 Jun 2024 20:37:39 +0800
Message-ID: <43a5986a-52ea-4090-9333-90af137a4735@linux.alibaba.com>
In-Reply-To: <b6b485ee-7af0-42b8-b0ca-5a75f76a69e2@arm.com>
On 2024/6/25 19:40, Ryan Roberts wrote:
> On 25/06/2024 08:23, Baolin Wang wrote:
>>
>>
>> On 2024/6/25 11:16, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/6/24 23:56, Ryan Roberts wrote:
>>>> + Baolin Wang and Yin Fengwei, who may be able to help with this.
>>>>
>>>>
>>>> Hi Kefeng,
>>>>
>>>> Thanks for the report!
>>>>
>>>>
>>>> On 24/06/2024 15:30, Kefeng Wang wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> We see a big regression in the page_fault3 ("Separate file shared mapping
>>>>> page fault") testcase from will-it-scale on arm64; there is no issue on x86:
>>>>>
>>>>> ./page_fault3_processes -t 128 -s 5
>>>>
>>>> I see that this program is mkstemp'ing a file at "/tmp/willitscale.XXXXXX". Based
>>>> on your description, I'm inferring that /tmp is backed by ext4 with your large
>>>> folio patches enabled?
>>>
>>> Yes, /tmp is mounted as ext4; sorry, I forgot to mention that.
>>>
>>>>
>>>>>
>>>>> 1) large folio disabled on ext4:                          92378735
>>>>> 2) large folio enabled on ext4 + CONTPTE enabled:         16164943
>>>>> 3) large folio enabled on ext4 + CONTPTE disabled:        80364074
>>>>> 4) large folio enabled on ext4 + CONTPTE enabled +
>>>>>    large folio mapping enabled in finish_fault() [2]:    299656874
>>>>>
>>>>> We found that *contpte_convert* consumes a lot of CPU time (76%) in case 2),
>>>>
>>>> contpte_convert() is expensive and to be avoided; in this case I expect it is
>>>> repainting the PTEs with the PTE_CONT bit added in, and to do that it needs to
>>>> invalidate the TLB for the virtual range. The code is there to mop up user space
>>>> patterns where each page in a range is temporarily made RO, then later changed
>>>> back. In this case, we want to re-fold the contpte range once all pages have
>>>> been serviced in RO mode.
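(To double-check my understanding of the cost: the fold is a
break-before-make sequence, roughly like the simplified sketch below.
This is not the exact contpte.c code; flush_tlb_contpte_range() here is
just a stand-in name for the real range invalidation.)

/* Simplified sketch of the fold; the real helper names and details
 * are elided. */
static void contpte_convert_sketch(struct mm_struct *mm, unsigned long addr,
                                   pte_t *ptep, pte_t pte)
{
        unsigned long start_addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
        pte_t *start_ptep = PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(*ptep));
        int i;

        /* Break: clear every PTE in the block, folding any
         * hardware-set dirty/young bits into the template entry. */
        for (i = 0; i < CONT_PTES; i++) {
                pte_t old = __ptep_get_and_clear(mm,
                                start_addr + i * PAGE_SIZE, start_ptep + i);

                if (pte_dirty(old))
                        pte = pte_mkdirty(pte);
                if (pte_young(old))
                        pte = pte_mkyoung(pte);
        }

        /* Invalidate: the TLBI over the whole block is the expensive
         * part that dominates the profile in case 2). */
        flush_tlb_contpte_range(mm, start_addr, start_addr + CONT_PTE_SIZE);

        /* Make: rewrite the block in one call; the range is now
         * naturally aligned and uniform, so PTE_CONT can be set. */
        __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
}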
>>>>
>>>> Of course this path is only intended as a fallback, and the more optimal
>>>> approach is to set_ptes() the whole folio in one go where possible - kind of
>>>> what you are doing below.
>>>>
>>>>> and it disappeared
>>>>> with the following change [2]. It is easy to understand the difference between
>>>>> case 2) and case 4): case 2) always maps one page at a
>>>>> time, but always tries to fold contpte mappings, which spends a lot of
>>>>> time. Case 4) is a workaround; is there any better suggestion?
>>>>
>>>> See below.
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>> [1] https://github.com/antonblanchard/will-it-scale
>>>>> [2] enable large folio mapping in finish_fault()
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 00728ea95583..5623a8ce3a1e 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4880,7 +4880,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>> * approach also applies to non-anonymous-shmem faults to avoid
>>>>> * inflating the RSS of the process.
>>>>> */
>>>>> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>>>>> + if (unlikely(userfaultfd_armed(vma))) {
>>>>
>>>> The change to make finish_fault() handle multiple pages in one go is new, added
>>>> by Baolin Wang at [1]. That extra conditional that you have removed is there to
>>>> prevent RSS reporting bloat. See the discussion that starts at [2].
>>>>
>>>> Anyway, it was my vague understanding that the fault around mechanism
>>>> (do_fault_around()) would ensure that (by default) 64K worth of pages get mapped
>>>> together in a single set_ptes() call, via filemap_map_pages() ->
>>>> filemap_map_folio_range(). Looking at the code, I guess fault around only
>>>> applies to read faults. This test is doing a write fault.
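(For reference, my understanding of the read-fault batching described
above, as a simplified sketch; the real path is filemap_map_pages() ->
filemap_map_folio_range(), and the locking, refcounting and cursor
handling are all elided:)

/* Simplified sketch only, not the real filemap_map_folio_range(). */
static void map_folio_range_sketch(struct vm_fault *vmf, struct folio *folio,
                                   unsigned int start, unsigned long addr,
                                   unsigned int nr)
{
        struct page *page = folio_page(folio, start);
        pte_t entry = mk_pte(page, vmf->vma->vm_page_prot);

        /* One set_ptes() call maps all nr pages at once, so arm64 can
         * set PTE_CONT up front and never needs to fold afterwards via
         * contpte_convert(). */
        set_ptes(vmf->vma->vm_mm, addr, vmf->pte, entry, nr);
        folio_ref_add(folio, nr);
}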
>>>>
>>>> I guess we need to make a change a bit like what you have done, but also take
>>>> the fault_around configuration into account?
>>
>> For the writable mmap() of tmpfs, we will use the mTHP interface to control the
>> size of the folio to allocate, as discussed in a previous meeting [1], so I don't
>> think the fault_around configuration will be helpful for tmpfs.
>
> Yes agreed. But we are talking about ext4 here.
>
>>
>> For other filesystems, like ext4, I did not find the logic that determines what
>> size of folio to allocate in the writable mmap() path
>
> Yes, I'd be keen to understand this too. When I was doing contpte, the page cache
> would only allocate large folios for readahead. So that's why I wouldn't have
You mean non-large folios, right?
> seen this.
>
>> (Kefeng, please correct me if
>> I missed something). If there is a control like mTHP, could we rely on that
>> instead of 'fault_around'?
>
> Page cache doesn't currently expose any controls for folio allocation size.
> Personally, I'd like to see some in future because I suspect it will be
> necessary to limit physical fragmentation. But that is another conversation...
Yes, agreed. If the writable mmap() path wants to allocate large folios, we
should rely on some controls or hints (maybe mTHP?).
>> [1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
>>
>>> Yes, the current change is not enough; I hit some issues and am still debugging.
>>> So our direction is to try to map large folios in do_shared_fault(), right?
>
> We just need to make sure that if finish_fault() has a (non-shmem) large folio,
> it never maps more than fault_around_pages, and it does it in a way that is
> naturally aligned in virtual space (like do_fault_around() does).
> do_fault_around() actually tries to get other folios from the page cache to map.
> We don't want to do that; we just want to make sure that we don't inflate a
> process's RSS by mapping unbounded large folios.
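Something like the untested sketch below, perhaps? (This is only to
illustrate the clamp plus the natural alignment; apart from
folio_nr_pages(), fault_around_bytes and the vmf fields, the names and
placement are approximate.)

/* Untested sketch, not a real patch: bound how much of a large folio
 * finish_fault() maps, mirroring do_fault_around()'s limits. */
unsigned int nr_pages = folio_nr_pages(folio);
unsigned long limit = fault_around_bytes >> PAGE_SHIFT;
unsigned long addr;

if (nr_pages > limit)
        nr_pages = limit;

/* Keep the window naturally aligned in virtual space so that, e.g., a
 * 64K window lands on a 64K boundary and set_ptes() can use PTE_CONT
 * directly; the result still needs clamping to the VMA and folio
 * bounds. */
addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);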
>
> Another (orthogonal, longer term) strategy would be to optimize
> contpte_convert(). arm64 has a feature called "BBM level 2"; we could
> potentially elide the TLBIs for systems that support this. But ultimately it's
> best to avoid the need for folding in the first place.
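(Purely speculative sketch of that idea; system_supports_bbml2() is a
hypothetical feature check, not an existing helper:)

/* Speculative: with FEAT_BBM level 2 the CPU tolerates a live change
 * of translation size, so contpte_convert() might skip the break +
 * TLBI steps and simply rewrite the block with PTE_CONT set. */
if (system_supports_bbml2()) {
        __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
        return;
}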
>
> Thanks,
> Ryan
>
>>
>> I think this is the right direction. I added the '!vma_is_anon_shmem(vma)'
>> condition to gradually implement support for building large folio mappings,
>> especially for writable mmap() support in tmpfs.
>>
[snip]