From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4F128C07CA9 for ; Tue, 28 Nov 2023 12:08:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B12976B013A; Tue, 28 Nov 2023 07:08:38 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id AC2D26B02FF; Tue, 28 Nov 2023 07:08:38 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9B2C06B0302; Tue, 28 Nov 2023 07:08:38 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 8B86A6B013A for ; Tue, 28 Nov 2023 07:08:38 -0500 (EST) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id DE6F81A01A4 for ; Tue, 28 Nov 2023 12:08:37 +0000 (UTC) X-FDA: 81507241074.07.8618F53 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf07.hostedemail.com (Postfix) with ESMTP id 5023440011 for ; Tue, 28 Nov 2023 12:08:34 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf07.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1701173314; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IEwlxplYwN2aCGa6zzkHAY2DbujQ8C5QGnrOqoDUc50=; b=dM6V/+b6iexrtAwhPkBO0E+qP56Lk1q+29hONdBarVPi+Pcykyu+ifeUb4m1QYv6Bsxhby SiZlFe+NNL6CSh+wBzVKVecq++4xo5TSO1GKQCDvRNfgpFRMg1xfol1+cyLelaMNp4ZziM AKxm2sTgAI2uUtQzf3QGgzHeAHaDiCs= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf07.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1701173314; a=rsa-sha256; cv=none; b=qkCD3JGwgvNdJvIgOJDxhckN342Nh+LVcl7LmeunHEy2Wjv0UWkfLGrAW8GHBMjGxkyBw+ /wXf5yNGEHqp/xNzsI+yiHM4DlB4UuRRLpAzH3zmEr6F2DBMyQB+v73jd2ryzUTJNDGxpE QIA5qKQB2cHbm6mEurGVAOaye0sFAoQ= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id ACCCFC15; Tue, 28 Nov 2023 04:09:19 -0800 (PST) Received: from [10.1.33.188] (XHFQ2J9959.cambridge.arm.com [10.1.33.188]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 031113F73F; Tue, 28 Nov 2023 04:08:24 -0800 (PST) Message-ID: <3e61d181-5e8d-4103-8dee-e18e493bc125@arm.com> Date: Tue, 28 Nov 2023 12:08:22 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings Content-Language: en-GB To: Barry Song <21cnbao@gmail.com> Cc: akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com, david@redhat.com, dvyukov@google.com, glider@google.com, james.morse@arm.com, jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com, vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com References: <20231115163018.1303287-1-ryan.roberts@arm.com> <20231127031813.5576-1-v-songbaohua@oppo.com> <234021ba-73c2-474a-82f9-91e1604d5bb5@arm.com> From: Ryan Roberts In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Stat-Signature: cmwazjh1guaru59triwyzba1neg6dnz5 X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 5023440011 X-HE-Tag: 1701173314-260637 X-HE-Meta: U2FsdGVkX19n/M07fE+9yW6RriG7RCM3Z1V5YGd+HFn1D0FvJ86AW8amm9aTkl3gNjkMYY+T2cTFK5dLIG8IdIcYPbiDeRC1vPVlBj1H5+Mx8Yk3utAzE2yU1JaTqQaaRMgL/82qR99Tvo3ri/+DEATEDsEunNC9Ter/osciGWtVLVFS7f5MP66tIY1vP6DnTdMF18/7Du63rjPSx5bgceZ12vGjrEQHjvli2IKNg6rxB8bNrWFVn+iNfQhkMjGXIWYMsF/S6xN9uc4cjHd8igQ+w8+nwsGQEBhNGSUsUb5gFA/h3ycUcdB6Q3QQtO2F4YVvuToKYZoPrwD4T9X1rVvenF6QTYR1OFSxcTHx9cqP6fBcq86TwMU9RW2A3zl7ssDqE1nFyMy3F//X42B19z4XFspnskX3ILCNzZBN65f2b4YvUgJxrf+mcRMEtO673H2dH/N6AIKJp5/85WrFijisJMDuQ0q55HFurMkgEGvv8DSPVlgKiHcJOZC2vasfxnM6Wn/oddwhlxfz3VkZ5/HNlgSkasxlttgRdai6cnzypscGSk/wXFLVsZb6LQnj4MU3i7KSE+Ssrq8w2KJuWhktgYbaYTVQBPMKKZTnWmljzttbbGj+WrUQtnJiLrCcsqhpjxn3MSbte30xEWfP1BfKZxcsisSQhAIzPcMipbrMNbu+uD/gP+H6qZ8/3dui1/8fSKWo6sNfuVl1AVaFt+TYwXCeASjPLRsvSsJRHZYcunDVCUEW2xWEMydN2BGCYRAqpG9LeDNYiB2j1HKVTYeTD+yqjgDnxvmgbtLaAJhvrfctp4ZBUDY6BT07Mby5qI4E4Tqu23jmvmVkN4EmrY66hRnKjn/EN7B3pdXDfZfC4Y77s1AeGXzH0DC0hIS3nm1V020Ruc+BuZVG7oUYz/YncTu1gApmBUhxaYAiiUgRQ0NvE8kRPzlUkDXh+49UMMLgY3NGXwnHqy1OcHv Kc3UF4yD 4oeXe5zk0w6quaGRj7v/2s+Eet8l81jQE+8rA+sWaOapVH+YZSCiiuqY0zEJ5udz1CJ0MTX+MoJkYWkBV6xl5oCBeiJ33bKlIbqry1qg78tnxA1prXKkddetBYhE2iK0aH+im6jZb6nLWTwHRWqsFmXTil8plErpxcbING6m0JQybgO7RHztFpZiaDyRyt0DUoHZHnyCV9NfT4XycfoyJ7QxwoZqdsnPYxx9Ptwvw7jf/YRhOlyHj930avWtIy15p6MJ51KoPmyxYFtTApaieF1xduA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 28/11/2023 05:49, Barry Song wrote: > On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts wrote: >> >> On 27/11/2023 03:18, Barry Song wrote: >>>> Ryan Roberts (14): >>>> mm: Batch-copy PTE ranges during fork() >>>> arm64/mm: set_pte(): New layer to manage contig bit >>>> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit >>>> arm64/mm: pte_clear(): New layer to manage contig bit >>>> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit >>>> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit >>>> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit >>>> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit >>>> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit >>>> arm64/mm: ptep_get(): New layer to manage contig bit >>>> arm64/mm: Split __flush_tlb_range() to elide trailing DSB >>>> arm64/mm: Wire up PTE_CONT for user mappings >>>> arm64/mm: Implement ptep_set_wrprotects() to optimize fork() >>>> arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown >>> >>> Hi Ryan, >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES >>> in the below cases >> >> The general idea is that the core-mm sets the individual ptes (one at a time if >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them >> (ptep_clear(), etc); This is exactly the same interface as previously. >> >> BUT, the arm64 implementation of those interfaces will now detect when a set of >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K >> base pages) are all appropriate for having the CONT_PTE bit set; in this case >> the block is "folded". And it will detect when the first PTE in the block >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the >> requirements for folding a contpte block is that all the pages must belong to >> the *same* folio (that means its safe to only track access/dirty for thecontpte >> block as a whole rather than for each individual pte). >> >> (there are a couple of optimizations that make the reality slightly more >> complicated than what I've just explained, but you get the idea). >> >> On that basis, I believe all the specific cases you describe below are all >> covered and safe - please let me know if you think there is a hole here! >> >>> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio >> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit >> is removed from the whole contpte block. If there is then a subsequent >> set_pte_at() to set a swap entry, the implementation will see that its not >> appropriate to re-fold, so the range will remain unfolded. >> >>> >>> 2. vma split in a large folio due to various reasons such as mprotect, >>> munmap, mlock etc. >> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I >> suspect not, so if the VMA is split in the middle of a currently folded contpte >> block, it will remain folded. But this is safe and continues to work correctly. >> The VMA arrangement is not important; it is just important that a single folio >> is mapped contiguously across the whole block. >> >>> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one >>> rather than being as a whole. >> >> Yes, as per 1; the arm64 implementation will notice when the first entry is >> cleared and unfold the contpte block. >> >>> >>> In hardware, we need to make sure CONTPTE follow the rule - always 16 >>> contiguous physical address with CONTPTE set. if one of them run away >>> from the 16 ptes group and PTEs become unconsistent, some terrible >>> errors/faults can happen in HW. for example >> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and >> contpte_try_unfold(). the fold/unfold operation is only done when all >> requirements are met, and we perform it in a manner that is conformant to the >> architecture requirements (see contpte_fold() - being renamed to >> contpte_convert() in the next version). > > Hi Ryan, > > sorry for too many comments, I remembered another case > > 4. mremap > > a CONTPTE might be remapped to another address which might not be > aligned with 16*basepage. thus, in move_ptes(), we are copying CONPTEs > from src to dst. > static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, > unsigned long old_addr, unsigned long old_end, > struct vm_area_struct *new_vma, pmd_t *new_pmd, > unsigned long new_addr, bool need_rmap_locks) > { > struct mm_struct *mm = vma->vm_mm; > pte_t *old_pte, *new_pte, pte; > ... > > /* > * We don't have to worry about the ordering of src and dst > * pte locks because exclusive mmap_lock prevents deadlock. > */ > old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl); > if (!old_pte) { > err = -EAGAIN; > goto out; > } > new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl); > if (!new_pte) { > pte_unmap_unlock(old_pte, old_ptl); > err = -EAGAIN; > goto out; > } > if (new_ptl != old_ptl) > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); > flush_tlb_batched_pending(vma->vm_mm); > arch_enter_lazy_mmu_mode(); > > for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, > new_pte++, new_addr += PAGE_SIZE) { > if (pte_none(ptep_get(old_pte))) > continue; > > pte = ptep_get_and_clear(mm, old_addr, old_pte); > .... > } > > This has two possibilities > 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE; > 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE > while copying. > > does your code also handle this properly? Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and folds/unfolds as neccessary. Admittedly this may be non-optimal because we are iterating a single PTE at a time. When we clear the first pte of a contpte block in the source, the block will be unfolded. When we set the last pte of the contpte block in the dest, the block will be folded. If we had a batching mechanism, we could just clear the whole source contpte block in one hit (no need to unfold first) and we could just set the dest contpte block in one hit (no need to fold at the end). I haven't personally seen this as a hotspot though; I don't know if you have any data to the contrary? I've followed this type of batching technique for the fork case (patch 13). We could do a similar thing in theory, but its a bit more complex because of the ptep_get_and_clear() return value; you would need to return all ptes for the cleared range, or somehow collapse the actual info that the caller requires (presumably access/dirty info). > >> >> Thanks for the review! >> >> Thanks, >> Ryan >> >>> >>> case0: >>> addr0 PTE - has no CONTPE >>> addr0+4kb PTE - has CONTPTE >>> .... >>> addr0+60kb PTE - has CONTPTE >>> >>> case 1: >>> addr0 PTE - has no CONTPE >>> addr0+4kb PTE - has CONTPTE >>> .... >>> addr0+60kb PTE - has swap >>> >>> Unconsistent 16 PTEs will lead to crash even in the firmware based on >>> our observation. >>> >>> Thanks >>> Barry > > Thanks > Barry