From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4FE01CD5BB8 for ; Thu, 5 Sep 2024 20:08:15 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B3D6F6B008C; Thu, 5 Sep 2024 16:08:14 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AED236B0092; Thu, 5 Sep 2024 16:08:14 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 98E9E6B0093; Thu, 5 Sep 2024 16:08:14 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 731916B008C for ; Thu, 5 Sep 2024 16:08:14 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 00529A9B10 for ; Thu, 5 Sep 2024 20:08:13 +0000 (UTC) X-FDA: 82531771266.05.07FEF70 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by imf07.hostedemail.com (Postfix) with ESMTP id 1445740026 for ; Thu, 5 Sep 2024 20:08:11 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=linutronix.de header.s=2020 header.b=3bzdZQsf; dkim=pass header.d=linutronix.de header.s=2020e header.b=+53uSW5J; dmarc=pass (policy=none) header.from=linutronix.de; spf=pass (imf07.hostedemail.com: domain of namcao@linutronix.de designates 193.142.43.55 as permitted sender) smtp.mailfrom=namcao@linutronix.de ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1725566784; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=FgM0WkMalmx+pbbJgmLSbP7B3h75J4NY0XJJincjeU8=; b=nE7m5yzFeQC4LgRpxxJRSVceVRyyd5frF5QDwlsUkUkUprQsj9NB6lya/Zail38sS/tVnK oxsp7S5ac2vwfy/SXZKR5UqVqY15rGrqAHRc6+9frhtJotv78F9iBGcvdihghMTxNvchFV UpAdF0lD6pGSVovrnQHDHVGbl3CWhD0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725566784; a=rsa-sha256; cv=none; b=ILKv1MHXW+IFeTWzBi054BPSoBJacyNltEvmvwKKdlorrIX6DGNz8pORSdRtbUdXDfDWHe ZhHozzSWvCmXEGqzdYs7TiSh97TUq5e34fO0IeVgyOfEp6BttgtP1FxxG1gd4HyzCyA9bU bfLuANyHOuC7XuDiCql8UzcwH7DH+0k= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=linutronix.de header.s=2020 header.b=3bzdZQsf; dkim=pass header.d=linutronix.de header.s=2020e header.b=+53uSW5J; dmarc=pass (policy=none) header.from=linutronix.de; spf=pass (imf07.hostedemail.com: domain of namcao@linutronix.de designates 193.142.43.55 as permitted sender) smtp.mailfrom=namcao@linutronix.de Date: Thu, 5 Sep 2024 22:08:02 +0200 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1725566889; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=FgM0WkMalmx+pbbJgmLSbP7B3h75J4NY0XJJincjeU8=; b=3bzdZQsfWKXJZV5XYukBMdzAMaPxJtA9IO10loRsad8Finp7Ojgt2NfW81LNWnGtUHx1vh fJCAxAYSyxoJ8QVfdgcmKB4Xbf+zoU6o4MFStxQw10lmU7PdpXs9eznvdjABuszNDTXrqI oruAiJpieYs1kLOCM4S8pZef2QsVT5pAonGp+sKm8FCbMlhGCPHjmFq/ZbiLVnBybyoCJ0 kwsNS5Nbw9IxLH1gdu8Cq8pHeqLEaysahitah6PCpbeGE8UkZiAkjaakA5RsxwJ1lNiAhT L884rsN2+Dt9Ng7eI0oeQEdiZh7ipWrLT22iL1eQUvk8ppjMCw8H6Ilkes7ZMg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1725566889; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=FgM0WkMalmx+pbbJgmLSbP7B3h75J4NY0XJJincjeU8=; b=+53uSW5JfHSYC6EiCJmvorMX9fBbcW9K3gvvTEBFuQHwvtgPPk7DWDzwG7iD2/MybArZuS xg2EClXNphAORRCA== From: Nam Cao To: "Liam R. Howlett" , Dave Hansen , Andy Lutomirski , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H. Peter Anvin" , Andrew Morton , Vlastimil Babka , Lorenzo Stoakes , linux-kernel@vger.kernel.org, linux-mm@kvack.org, bigeasy@linutronix.de Subject: Re: [PATCH] x86/mm/pat: Support splitting of virtual memory areas Message-ID: <20240905200802.2vY0SstV@linutronix.de> References: <20240825152403.3171682-1-namcao@linutronix.de> <5jrd43vusvcchpk2x6mouighkfhamjpaya5fu2cvikzaieg5pq@wqccwmjs4ian> <20240827075841.BO2qAOzq@linutronix.de> <20240903103616.i0GrRAfD@linutronix.de> <20240904075937.xh2rLk3q@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 1445740026 X-Stat-Signature: mmk3mf7zzggasrzqygzbt5y4yaoomnj6 X-Rspam-User: X-HE-Tag: 1725566891-461632 X-HE-Meta: U2FsdGVkX1/BZId/dF728sZf5+hlL+UspckezV0zXA9hhVlWXAvcM5CM4UIz6coBwXnRmml3UZLQGBjV+B1L7IvBjRuzvMoxLNdAnrKiJjfnawojWvAceXnJYqXdKYk/i31RIQwylTW7iyQ7g3DL6QJ5bdQEgw9L4GK1OX1Aaf1Qgo1rlTmIlJqWFMmPFYvgpODz8RqRAEnGuRWTQOW8Ut4Tr8diNouC5Noh9jYfosm64hK3HIzM3OEdSUm8mWSGoyHobQSTf8S7tMjhJIeBK3gZJsYCQIIApY2fdx/dtulK3uM9YfFht+t4ZaWp8bBTt8vpyyNmdqby0ngW4mSEZd5YKzQZdP1r/OpNaiondzfV/YNT8a1VrmQ68patRIF0PPO3GBJxR+A4ZUeNSLPEmqgZ1PcyA14alu6dBUd1s0OOppcbyHLyyaMVyTps3jKyhz4Qg30RwHuoXYtgMKISTx4x4N26hryoRUEDHS2UGtVnNcQkPjTnMnUDWNRv6P8/XvdRJtC53day183VKSHtaBhGizlrM9uBuY5xJtYNRyUd+IYFBHIQeDqaj8FYT92dayQWW0jNVz7MJXg40U4qLKw6MJTpk7kF5yUewuJLSpFQkPn6hj2YMzZodh8goC9b03qNPZy/2O8INwO31jEJFME1Ka95IRtA4FHjCRMCMpcMkhoGswrVrsChAnKpe+flHH/t8cYcSBeUZrEzAy2zzz2N/NIZMdFvYWYq3jYIXX7RbNwFv2OBot4ZNGs8GfTGPi2VCiiKGAvn2z5ashaH9ZETShnY7yZcF2Zqzg9rQWDgx/o6mpc0G3UhmjULKoXBsCCekeU/H2KPf3fqdO3MaHi9R623nrGnsGBSwOjglZpgQfG9U68YIDaKK82G5eC8iD91fDSqBjrGMcdNozj+06Xna2qjLpEgH3mh/X2eTyU8VSfY3Khe2BO9GDvjA8jLi5mG67I02qqcGL1WFEB QHo4oyEF pO2yYbopNARYEzEad5m3ay3wY/g81cK8WQ87r0R1t3r5LVo1CN6s6W3txwhoUaUwSH8Hkn5Ez0TVq2XZjbEXalD5t9RrJ1dGDY2QfdguWlhlH12FpkjWV1lJCJjGwEZLwHI+pgprzWHzs+/ZsqFhR3/y1ktRFcRO/gF7c X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Sep 04, 2024 at 02:40:34PM -0400, Liam R. Howlett wrote: > * Nam Cao [240904 03:59]: > > On Tue, Sep 03, 2024 at 11:56:57AM -0400, Liam R. Howlett wrote: > > > * Nam Cao [240903 06:36]: > > ... > > > > On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote: > > > > > * Nam Cao [240827 03:59]: > > > > > > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote: > > > > > > > * Nam Cao [240825 11:29]: > > > > > So the interval split should occur when the PAT changes and needs to be > > > > > tracked differently. This does not happen when the vma is split - it > > > > > happens when a vma is removed or when the PAT is changed. > > > > > > > > > > And, indeed, for the mremap() shrinking case, you already support > > > > > finding a range by just the end and have an abstraction layer. The > > > > > problem here is that you don't check by the start - but you could. You > > > > > could make the change to memtype_erase() to search for the exact, end, > > > > > or start and do what is necessary to shrink off the front of a region as > > > > > well. > > > > > > > > I thought about this solution initially, but since the interval tree allow > > > > overlapping ranges, it can be tricky to determine the "best match" out > > > > of the overlapping ranges. But I agree that this approach (if possible) > > > > would be better than the current patch. > > > > > > > > Let me think about this some more, and I will come back later. > > > > > > Reading this some more, I believe you can detect the correct address by > > > matching the start address with the smallest end address (the smallest > > > interval has to be the entry created by the vma mapping). > > > > I don't think that would cover all cases. For example, if the tree has 2 > > intervals: [0x0000-0x2000] and [0x1000-0x3000]. Now, the mm subsystem tells > > us that the interval [0x1000-0x2000] needs to be removed (e.g. user does > > munmap()), your proposal would match this to the second interval. After the > > removal, the tree has [0-0x2000] and [0x2000-0x3000] > > > > Then, mm subsystem says [0x1000-0x3000] should be removed, and that doesn't > > match anything. Turns out, the first removal was meant for the first > > interval, but we didn't have enough information at the time to determine > > that. > > > > Bottom line is, it is not possible to correctly match [0x1000-0x2000] to > > [0x0000-0x2000] and [0x1000-0x3000]: both matches can be valid. > > But those ranges won't exist. What appears to be happening in this code > is that there are higher levels of non-overlapping ranges with > memory (cache) types (or none are defined) , which are tracked on page > granularity. So we can't have a page that has two memory type. > > The overlapping happens later, when the vmas are mapped. And we are > ensuring that the mapping of the vmas match the higher, larger areas. > The vmas are inserted with memtype_check_insert() which calls > memtype_check_conflict() that ensures any overlapping areas have the > same type as the one being added, so either there is no match or the > interval(s) with this page is set to a specific type. I suspect there > can only really be one range. > > So I don't think overlapping areas like above could exist. The vma > cache type has to be the same throughout. It has to be the same type as > all overlapping areas. Dave agreed with you, so I am likely the confused one, but I still think the overlapping areas as I described do exist. For example, this userspace code: #include #include #include #include #include #define PCI_BAR "/sys/devices/pci0000:00/0000:00:02.0/resource0" int main(void) { void *p1, *p2; int fd, ret; fd = open(PCI_BAR, O_RDWR); // track 0xfd000000-0xfd001fff p1 = mmap(0, 0x2000, PROT_READ, MAP_SHARED, fd, 0); // track 0xfd001000-0xfd002fff p2 = mmap(0, 0x2000, PROT_READ, MAP_SHARED, fd, 0x1000); // untrack 0xfd001000-0xfd001fff munmap(p2, 0x1000); // untrack 0xfd000000-0xfd001fff munmap(p1, 0x2000); // untrack 0xfd002000-0xfd002fff munmap(p2 + 0x1000, 0x1000); } If I pause this program right after the two mmap(), before any munmap(), then: $cat /sys/kernel/debug/x86/pat_memtype_list PAT memtype list: PAT: [mem 0x00000000bffe0000-0x00000000bffe2000] write-back PAT: [mem 0x00000000bffe1000-0x00000000bffe2000] write-back PAT: [mem 0x00000000fd000000-0x00000000fd002000] uncached-minus <-- what I described PAT: [mem 0x00000000fd001000-0x00000000fd003000] uncached-minus <-- what I described PAT: [mem 0x00000000febc0000-0x00000000febe0000] uncached-minus PAT: [mem 0x00000000fed00000-0x00000000fed01000] uncached-minus PAT: [mem 0x00000000fed00000-0x00000000fed01000] uncached-minus The 2 mmap() call would create the overlapping intervals as I described. Then, I let the C program run to completion, see what happen in dmesg: x86/PAT: memtype_reserve added [mem 0xfd000000-0xfd001fff], track uncached-minus, req uncached-minus, ret uncached-minus x86/PAT: Overlap at 0xfd000000-0xfd002000 x86/PAT: memtype_reserve added [mem 0xfd001000-0xfd002fff], track uncached-minus, req uncached-minus, ret uncached-minus x86/PAT: memtype_free request [mem 0xfd001000-0xfd001fff] x86/PAT: test:178 freeing invalid memtype [mem 0xfd000000-0xfd001fff] x86/PAT: memtype_free request [mem 0xfd002000-0xfd002fff] The problem I am raising is the first munmap() call: [0xfd001000-0xfd001fff] would be untracked, but there is no way to tell for sure which interval it belongs to. The current implementation matches it to the first range, but it actually belongs to the second range. This incorrect matching results in the "freeing invalid memtype" later on. Hopefully I'm not being an idiot and wasting everyone's time.. > > Also, your ranges are inclusive while the ranges passed in seem to be > exclusive on the end address, so your example would look more like: > [0x0000-0x2000) [0x2000-0x3000). > > You can see this documented in memtype_reserve() where sanitize_phys() > is called. > > So we could have a VMA of [0x1000-0x2000), but this vma would have to be > in the first range. [0x0000-0x0FFF) would also be in the first range. > > I think that searching for the smallest area containing the entry will > yield the desired entry in the interval tree. > > Note that there is debugging support in the Documentation so you can go > look at what is in there with debugfs. > > ... > > > One solution I can think of: stop allowing overlapping intervals. Instead, > > the overlapping portions would be split into new intervals with some > > reference counting. memtype_erase() would need to be modified to: > > - assemble the potentially split intervals > > - split the intervals if needed > > The point is, there wouldn't be any confusion with matching overlapping > > intervals. > > > > I will give it a try when I have some time, unless someone sees a problem > > with it or has a better idea. > > I don't think this will work at all. It is dependent of overlapping > ranges to ensure the vmas match what is allowed in certain areas. We can ensure that the cache type is the same, before splitting, so I think it can work? But let's clear up the other disagreement first. Best regards, Nam