Date: Thu, 26 Jun 2025 09:47:24 +0100
Subject: Re: [PATCH v3 1/2] arm64: pageattr: Use pagewalk API to change memory permissions
To: Yang Shi, Mike Rapoport, Dev Jain
Cc: akpm@linux-foundation.org, david@redhat.com, catalin.marinas@arm.com,
 will@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 vbabka@suse.cz, surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, suzuki.poulose@arm.com, steven.price@arm.com,
 gshan@redhat.com,
 linux-arm-kernel@lists.infradead.org, anshuman.khandual@arm.com
From: Ryan Roberts <ryan.roberts@arm.com>
References: <20250613134352.65994-1-dev.jain@arm.com>
 <20250613134352.65994-2-dev.jain@arm.com>
 <956f6ebe-606f-4575-a0a5-7841c95b5371@arm.com>

On 25/06/2025 21:40, Yang Shi wrote:
>
>
> On 6/25/25 4:04 AM, Ryan Roberts wrote:
>> On 15/06/2025 08:32, Mike Rapoport wrote:
>>> On Fri, Jun 13, 2025 at 07:13:51PM +0530, Dev Jain wrote:
>>>> -/*
>>>> - * This function assumes that the range is mapped with PAGE_SIZE pages.
>>>> - */
>>>> -static int __change_memory_common(unsigned long start, unsigned long size,
>>>> +static int ___change_memory_common(unsigned long start, unsigned long size,
>>>>                   pgprot_t set_mask, pgprot_t clear_mask)
>>>>   {
>>>>       struct page_change_data data;
>>>> @@ -61,9 +140,28 @@ static int __change_memory_common(unsigned long start,
>>>> unsigned long size,
>>>>       data.set_mask = set_mask;
>>>>       data.clear_mask = clear_mask;
>>>>
>>>> -    ret = apply_to_page_range(&init_mm, start, size, change_page_range,
>>>> -                    &data);
>>>> +    arch_enter_lazy_mmu_mode();
>>>> +
>>>> +    /*
>>>> +     * The caller must ensure that the range we are operating on does not
>>>> +     * partially overlap a block mapping. Any such case should either not
>>>> +     * exist, or must be eliminated by splitting the mapping - which for
>>>> +     * kernel mappings can be done only on BBML2 systems.
>>>> +     *
>>>> +     */
>>>> +    ret = walk_kernel_page_table_range_lockless(start, start + size,
>>>> +                            &pageattr_ops, NULL, &data);
>>> x86 has a cpa_lock for set_memory/set_direct_map to ensure that there's no
>>> concurrency in kernel page table updates. I think arm64 has to have such a
>>> lock as well.
>> We don't have a lock today, using apply_to_page_range(); we are expecting
>> that the caller has exclusive ownership of the portion of virtual memory -
>> i.e. the vmalloc region or linear map. So I don't think this patch changes
>> that requirement?
>>
>> Where it does get a bit more hairy is when we introduce the support for
>> splitting. In that case, 2 non-overlapping areas of virtual memory may share
>> a large leaf mapping that needs to be split. But I've been discussing that
>> with Yang Shi at [1] and I think we can handle that locklessly too.
>
> If the split is serialized by a lock, changing permission can be lockless.
> But if the split is lockless, changing permission may be a little bit tricky,
> particularly for CONT mappings. The implementation in my split patch assumes
> the whole range has the cont bit cleared if the first PTE in the range has
> the cont bit cleared, because the lock guarantees two concurrent splits are
> serialized.
>
> But a lockless split may trigger the below race:
>
> CPU A is splitting the page table, CPU B is changing the permission for one
> PTE entry in the same table. Clearing the cont bit is RMW and changing the
> permission is RMW too, but neither of them is atomic.
>
>               CPU A                                 CPU B
> read the PTE                            read the PTE
> clear the cont bit for the PTE
>                                         change the PTE permission from RW to RO
>                                         store the new PTE
>
> store the new PTE <- it will overwrite the PTE value stored by CPU B and
> result in misprogrammed cont PTEs

Ahh yes, good point! I missed that.

When I was thinking about this, I had assumed that *both* CPUs racing to split
would (non-atomically) RMW to remove the cont bit on the whole block. That is
safe as long as nothing else in the PTE changes. But of course you're right
that the first one to complete the split may then go on to modify the
permissions in its portion of the now-split VA space. So there is definitely a
problem.

>
> Off the top of my head, we need to do one of the following to avoid the race:
>
> 1. Serialize the split with a lock

I guess this is certainly the simplest, as per your original proposal.

> 2. Make the page table RMW atomic in both the split and the permission change

I don't think we would need an atomic RMW for the permission change - we would
only need it for removing the cont bit? My reasoning is that by the time a
thread is doing the permission change it must have already finished splitting
the cont block. The permission change will only be for PTEs that we know we
have exclusive access to. The other CPU may still be "splitting" the cont
block, but since we already won, it will just be reading the PTEs and noticing
that cont is already clear?

I guess split_contpte()/split_contpmd() becomes a loop doing READ_ONCE() to
test if the bit is set, followed by an atomic bit clear if it was set (avoid
the atomic where we can)? See the sketch further down.

> 3. Check whether the PTE is cont or not for every PTE in the range, instead
> of just the first PTE, before clearing the cont bit if it is set

Ahh, perhaps this is what I'm actually describing above?

> 4. Retry in the permission change if the cont bit is not cleared; but we need
> to distinguish this from changing the permission for the whole CONT PTE
> range, because we keep the cont bit in that case

I'd prefer to keep the splitting decoupled from the permission change if we
can.

Personally, I'd prefer to take the lockless approach. I think it has the least
chance of contention issues.
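To make the lockless idea concrete, here is a very rough, untested sketch of
the loop described above. It only illustrates the read-then-atomic-clear idea:
the function name, the PTR_ALIGN_DOWN() step and the cmpxchg_relaxed() on the
raw pte value are placeholders rather than the existing arm64 helpers, and it
ignores the contpmd case and any TLB/BBML2 handling.

	static void split_contpte(pte_t *ptep)
	{
		/* Start from the first entry of the contiguous block. */
		pte_t *first = PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(pte_t));
		int i;

		for (i = 0; i < CONT_PTES; i++, first++) {
			pteval_t old = pte_val(READ_ONCE(*first));

			/* Avoid the atomic if PTE_CONT is already clear. */
			while (old & PTE_CONT) {
				pteval_t prev;

				/*
				 * Only clear PTE_CONT. If some other field
				 * changed under us (e.g. a concurrent
				 * permission change), retry with the value we
				 * actually observed, so we never write back a
				 * stale copy of those bits.
				 */
				prev = cmpxchg_relaxed(&pte_val(*first), old,
						       old & ~PTE_CONT);
				if (prev == old)
					break;
				old = prev;
			}
		}
	}

The loser of either race re-reads and retries, so the "store the new PTE
overwrites CPU B's store" problem in the example above can't happen.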
But if you prefer to use a lock, then I'm ok with that as a starting point. I'd
prefer to use a new separate lock though (like x86 does) rather than risking
extra contention with the init_mm PTL.

Thanks,
Ryan

>
> Thanks,
> Yang
>
>>
>> Perhaps I'm misunderstanding something?
>>
>> [1] https://lore.kernel.org/all/f036acea-1bd1-48a7-8600-75ddd504b8db@arm.com/
>>
>> Thanks,
>> Ryan
>>
>>>> +    arch_leave_lazy_mmu_mode();
>>>> +
>>>> +    return ret;
>>>> +}
>