Message-ID: <3b4f6bff-6322-4394-9efb-9c3b9ef52010@arm.com>
Date: Thu, 23 Nov 2023 16:01:25 +0000
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
To: Alistair Popple
Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
 Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
 Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
 Vincenzo Frascino, Andrew Morton, Anshuman Khandual, Matthew Wilcox,
 Yu Zhao, Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
 Zi Yan, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20231115163018.1303287-1-ryan.roberts@arm.com>
 <20231115163018.1303287-15-ryan.roberts@arm.com>
 <87fs0xxd5g.fsf@nvdebian.thelocal>
From: Ryan Roberts
In-Reply-To:
 <87fs0xxd5g.fsf@nvdebian.thelocal>

On 23/11/2023 05:13, Alistair Popple wrote:
>
> Ryan Roberts writes:
>
>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>> a full address space teardown is in progress. We use this information to
>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>> contiguous neighbours to keep their contig bit set, because we know we
>> are about to clear the rest too.
>>
>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>> 32x what it was before PTE_CONT support was wired up, when compiling the
>> kernel. With this optimization in place, we are back down to the
>> original cost.
>>
>> This approach is not perfect though, as for the duration between
>> returning from the first call to ptep_get_and_clear_full() and making
>> the final call, the contpte block is in an intermediate state, where some
>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>> other APIs are called for the ptes in the contpte block during that
>> time, we have to be very careful. The core code currently interleaves
>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>> must be careful to ignore the cleared entries when accumulating the
>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>> other calls we might reasonably expect are to set markers in the
>> previously cleared ptes. (We shouldn't see valid entries being set until
>> after the tlbi, at which point we are no longer in the intermediate
>> state). Since markers are not valid, this is safe; set_ptes() will see
>> the old, invalid entry and will not attempt to unfold. And the new pte
>> is also invalid so it won't attempt to fold. We shouldn't see this for
>> the 'full' case anyway.
>>
>> The last remaining issue is returning the access/dirty bits. That info
>> could be present in any of the ptes in the contpte block. ptep_get()
>> will gather those bits from across the contpte block. We don't bother
>> doing that here, because we know that the information is used by the
>> core-mm to mark the underlying folio as accessed/dirty. And since the
>> same folio must be underpinning the whole block (that was a requirement
>> for folding in the first place), that information will make it to the
>> folio eventually once all the ptes have been cleared. This approach
>> means we don't have to play games with accumulating and storing the
>> bits. It does mean that any interleaved calls to ptep_get() may lack
>> correct access/dirty information if we have already cleared the pte that
>> happened to store it. The core code does not rely on this though.
>
> Does not *currently* rely on this. I can't help but think it is
> potentially something that could change in the future though which would
> lead to some subtle bugs.

Yes, there is a risk, although IMHO, it's very small.

>
> Would there be any way of avoiding this? Half baked thought but could
> you for example copy the access/dirty information to the last (or
> perhaps first, most likely invalid) PTE?

I spent a long time thinking about this and came up with a number of
possibilities, none of them ideal. In the end, I went for the simplest one
(which works but suffers from the problem that it depends on the way it is
called not changing).

1) copy the access/dirty flags into all the remaining uncleared ptes within the
contpte block. This is how I did it in v1, although it was racy. I think this
could be implemented correctly but it's extremely complex.

2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
can clear 1 or more full contpte blocks in a single call - the ptes are never in
an intermediate state. This is difficult because ptep_get_and_clear_full()
returns the pte that was cleared so it's difficult to scale that up to multiple
ptes.

3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
the other functions that ptep_get_and_clear_full() is not on-going when they are
called. So we would get a clear sign that usage patterns have changed. But there
is no easy place to store that state (other than scanning a contpte block
looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.

4) The simple approach I ended up taking; I thought it would be best to keep it
simple and see if anyone was concerned before doing something more drastic.

What do you think? If we really need to solve this, then option 1 is my
preferred route, but it would take some time to figure out and reason about a
race-free scheme.

Thanks,
Ryan
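
P.S. To make the hazard with the simple approach (option 4) concrete, here is a
minimal user-space sketch. It is a toy model and not the kernel code: every
toy_* name, the bit positions and the 16-entry block size are assumptions made
up for illustration, and nothing here models locking, TLB maintenance or the
real arm64 pte layout. It only shows the two behaviours discussed above: the
getter skips cleared entries when gathering access/dirty bits, and the
clear-without-unfold path returns only the bits stored in the entry it cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these bit positions are NOT the arm64 pte layout. */
typedef uint64_t toy_pte_t;

#define TOY_PTE_VALID (1ULL << 0)
#define TOY_PTE_CONT  (1ULL << 1)
#define TOY_PTE_YOUNG (1ULL << 2)
#define TOY_PTE_DIRTY (1ULL << 3)
#define TOY_CONT_PTES 16 /* entries per block in this model (assumption) */

/* Fill a block with valid, contiguous entries; mark one of them dirty. */
static void toy_init_block(toy_pte_t *block, size_t dirty_idx)
{
	assert(dirty_idx < TOY_CONT_PTES);
	for (size_t i = 0; i < TOY_CONT_PTES; i++)
		block[i] = TOY_PTE_VALID | TOY_PTE_CONT;
	block[dirty_idx] |= TOY_PTE_DIRTY;
}

/*
 * Clear one entry without unfolding: the neighbours keep their CONT bit.
 * Only the access/dirty bits stored in this entry are returned.
 */
static toy_pte_t toy_get_and_clear_full(toy_pte_t *block, size_t idx)
{
	assert(idx < TOY_CONT_PTES);
	toy_pte_t old = block[idx];

	block[idx] = 0;
	return old;
}

/*
 * Read one entry, gathering access/dirty bits from the rest of the block.
 * Entries that were already cleared are skipped.
 */
static toy_pte_t toy_get(const toy_pte_t *block, size_t idx)
{
	assert(idx < TOY_CONT_PTES);
	toy_pte_t pte = block[idx];

	if (!(pte & TOY_PTE_VALID) || !(pte & TOY_PTE_CONT))
		return pte;

	for (size_t i = 0; i < TOY_CONT_PTES; i++) {
		if (!(block[i] & TOY_PTE_VALID))
			continue; /* cleared during teardown: ignore */
		pte |= block[i] & (TOY_PTE_YOUNG | TOY_PTE_DIRTY);
	}
	return pte;
}

/*
 * Full teardown: the caller ORs together what each call returns, so the
 * dirty state still reaches the (single) underlying folio by the end.
 */
static bool toy_teardown_collects_dirty(toy_pte_t *block)
{
	bool dirty = false;

	for (size_t i = 0; i < TOY_CONT_PTES; i++)
		dirty |= (toy_get_and_clear_full(block, i) & TOY_PTE_DIRTY) != 0;
	return dirty;
}
```

In this model, if entry 0 holds the dirty bit, toy_get() on entry 5 reports
dirty before entry 0 is cleared and no longer reports it afterwards, which is
the information loss an interleaved ptep_get() could see. A complete pass with
toy_teardown_collects_dirty() still observes the dirty bit, which is why the
simple approach is sufficient as long as callers only care about the folio.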