Message-ID: <15c68452-cc8e-45a5-bcaf-79b040afc746@arm.com>
Date: Tue, 28 Nov 2023 11:58:25 +0000
Subject: Re: [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings
From: Ryan Roberts
To: Yang Shi
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org,
 andreyknvl@gmail.com, anshuman.khandual@arm.com, ardb@kernel.org,
 catalin.marinas@arm.com, david@redhat.com, dvyukov@google.com,
 glider@google.com, james.morse@arm.com, jhubbard@nvidia.com,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org,
 oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com,
 vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org,
 willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
References: <20231115163018.1303287-1-ryan.roberts@arm.com>
 <20231127031813.5576-1-v-songbaohua@oppo.com>
 <234021ba-73c2-474a-82f9-91e1604d5bb5@arm.com>

On 28/11/2023 03:13, Yang Shi wrote:
> On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>>   mm: Batch-copy PTE ranges during fork()
>>>>   arm64/mm: set_pte(): New layer to manage contig bit
>>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>   arm64/mm: Wire up PTE_CONT for user mappings
>>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies their
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means it's safe to only track access/dirty for the
>> contpte block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that it's not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>
> Even with different permissions, for example, read-only vs read-write?
> The mprotect() may change the permission. It should be misprogramming
> per ARM ARM.

If the permissions are changed, then mprotect() must have called the pgtable
helpers to modify the page table (e.g. ptep_set_wrprotect(),
ptep_set_access_flags() or whatever). These functions will notice that the
contpte block is currently folded and unfold it before applying the permissions
change. The unfolding process is done in a way that intentionally avoids
misprogramming as defined by the Arm ARM. See contpte_fold() in contpte.c.

>
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become inconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). The fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPTE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPTE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Inconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
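
For anyone following the thread, the fold/unfold rules discussed above can be
sketched as a small user-space model. This is only an illustration and not the
code from the series: struct sim_pte and every sim_* name are hypothetical, the
"write" flag stands in for the full set of permission bits, and the model
leaves out the sequencing that the real conversion performs to stay conformant
with the architecture. It assumes 4K base pages, so 16 entries per block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_CONT_PTES 16 /* entries per contpte block with 4K base pages */

/* Hypothetical, simplified PTE; this is not the arm64 pte_t layout. */
struct sim_pte {
	bool     valid;
	bool     cont;     /* stands in for the CONT_PTE bit */
	bool     write;    /* stands in for the permission bits */
	uint64_t pfn;
	int      folio_id; /* which folio the mapped page belongs to */
};

/* Map one folio contiguously across the block, writable, not yet folded. */
static void sim_map_block(struct sim_pte *blk, uint64_t base_pfn, int folio_id)
{
	for (int i = 0; i < SIM_CONT_PTES; i++) {
		blk[i].valid = true;
		blk[i].cont = false;
		blk[i].write = true;
		blk[i].pfn = base_pfn + (uint64_t)i;
		blk[i].folio_id = folio_id;
	}
}

/* Unfold: clear the cont bit on every entry of the block. */
static void sim_unfold(struct sim_pte *blk)
{
	for (int i = 0; i < SIM_CONT_PTES; i++)
		blk[i].cont = false;
}

/*
 * Fold: set the cont bit on all entries, but only if every entry is valid,
 * the pfns are contiguous and naturally aligned, all pages belong to the
 * same folio and all permissions match. Returns true if the block folded.
 */
static bool sim_try_fold(struct sim_pte *blk)
{
	if (!blk[0].valid || blk[0].pfn % SIM_CONT_PTES != 0)
		return false;
	for (int i = 1; i < SIM_CONT_PTES; i++) {
		if (!blk[i].valid ||
		    blk[i].pfn != blk[0].pfn + (uint64_t)i ||
		    blk[i].folio_id != blk[0].folio_id ||
		    blk[i].write != blk[0].write)
			return false;
	}
	for (int i = 0; i < SIM_CONT_PTES; i++)
		blk[i].cont = true;
	return true;
}

/* Clear one entry (cases 1 and 3): unfold the whole block first. */
static void sim_pte_clear(struct sim_pte *blk, int idx)
{
	assert(idx >= 0 && idx < SIM_CONT_PTES);
	if (blk[idx].cont)
		sim_unfold(blk);
	blk[idx].valid = false;
}

/* Write-protect one entry (the mprotect question): unfold, then change. */
static void sim_pte_wrprotect(struct sim_pte *blk, int idx)
{
	assert(idx >= 0 && idx < SIM_CONT_PTES);
	if (blk[idx].cont)
		sim_unfold(blk);
	blk[idx].write = false;
}
```

In this model a block mapped with sim_map_block() at an aligned pfn folds
successfully; after sim_pte_clear() or sim_pte_wrprotect() on any single entry,
no entry in the block has the cont bit set and a later sim_try_fold() refuses
to fold, which is the property Barry's case0 and case 1 are asking about.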