Date: Thu, 30 Jan 2025 14:19:50 +0100
From: Simona Vetter <simona.vetter@ffwll.ch>
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
	nouveau@lists.freedesktop.org, Andrew Morton, Jérôme Glisse,
	Jonathan Corbet, Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, "Liam R. Howlett",
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pasha Tatashin,
	Peter Xu, Alistair Popple, Jason Gunthorpe
Subject: Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for device-exclusive entries
References: <20250129115411.2077152-1-david@redhat.com>
	<20250129115411.2077152-13-david@redhat.com>
	<887df26d-b8bb-48df-af2f-21b220ef22e6@redhat.com>
In-Reply-To: <887df26d-b8bb-48df-af2f-21b220ef22e6@redhat.com>

On Thu, Jan 30, 2025 at 12:42:26PM +0100, David Hildenbrand wrote:
> On 30.01.25 11:37, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
> > > Now that conversion to device-exclusive no longer performs an
> > > rmap walk and the main page_vma_mapped_walk() users were taught to
> > > properly handle nonswap entries, let's treat device-exclusive
> > > entries just as if they were present, similar to how we handle
> > > device-private entries already.
> >
> > So the reason for handling device-private entries in rmap is so that
> > drivers can rely on try_to_migrate and related code to invalidate all
> > the various ptes even for device-private memory. Otherwise no one
> > should hit this path, at least if my understanding is correct.
>
> Right, device-private probably only happens to be seen on the
> migration path so far.
>
> > So I'm very much worried about opening a can of worms here because I
> > think this adds a genuine new case to all the various callers.
>
> To be clear: it can all already happen.
>
> Assume you have a THP (or any mTHP today). You can easily trigger the
> scenario that folio_mapcount() != 0 with active device-exclusive
> entries, and you start doing rmap walks and stumble over these
> device-exclusive entries and *not* handle them properly. Note that
> more and more systems are configured to just give you THP unless you
> explicitly opted out using MADV_NOHUGEPAGE early.
>
> Note that b756a3b5e7ea added that hunk that still walks these
> device-exclusive entries in rmap code, but didn't actually update the
> rmap walkers:
>
> @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>
>  	/* Handle un-addressable ZONE_DEVICE memory */
>  	entry = pte_to_swp_entry(*pvmw->pte);
> -	if (!is_device_private_entry(entry))
> +	if (!is_device_private_entry(entry) &&
> +	    !is_device_exclusive_entry(entry))
>  		return false;
>
>  	pfn = swp_offset(entry);
>
> That was the right thing to do, because they resemble PROT_NONE
> entries and not migration entries or anything else that doesn't hold a
> folio reference.

Yeah I got that part. What I meant is that doubling down on this needs
a full audit and cannot rely on "we already have device-private entries
going through these paths for much longer", which was the impression I
got.
I guess it worked, thanks for doing that below :-) And at least from my
very rough understanding of mm, at least around all this gpu stuff,
tracking device-exclusive mappings like real cpu mappings makes sense:
they do indeed act like PROT_NONE with some magic to restore access on
fault. I do wonder a bit though what else is not properly tracked
because it should behave like PROT_NONE but doesn't. I guess we'll find
those as we hit them :-/

> Fortunately, it's only the page_vma_mapped_walk() callers that need
> care.
>
> mm/rmap.c is handled with this series.
>
> mm/page_vma_mapped.c should work already.
>
> mm/migrate.c: does not apply
>
> mm/page_idle.c: likely should just skip !pte_present().
>
> mm/ksm.c might be fine, but likely we should just reject
> !pte_present().
>
> kernel/events/uprobes.c likely should reject !pte_present().
>
> mm/damon/paddr.c likely should reject !pte_present().
>
> I briefly thought about a flag to indicate whether a
> page_vma_mapped_walk() caller supports these non-present entries, but
> likely just fixing them up is easier+cleaner.
>
> Now that I looked at them all, I might just write patches for them.
>
> > > This fixes swapout/migration of folios with device-exclusive
> > > entries.
> > >
> > > Likely there are still some page_vma_mapped_walk() callers that
> > > are not fully prepared for these entries, and where we simply want
> > > to refuse !pte_present() entries. They have to be fixed
> > > independently; the ones in mm/rmap.c are prepared.
> >
> > The other worry is that maybe breaking migration is a feature, at
> > least in parts.
>
> Maybe breaking swap and migration is a feature in some reality, in
> this reality it's a BUG :)

Oh yeah I agree on those :-)

> > If thp constantly reassembles a pmd entry because hey all the
> > memory is contig and userspace allocated a chunk of memory to place
> > atomics that alternate between cpu and gpu nicely separated by 4k
> > pages, then we'll thrash around invalidating ptes to no end. So
> > might be more fallout here.
>
> khugepaged will back off once it sees an exclusive entry, so
> collapsing could only happen once everything is non-exclusive. See
> __collapse_huge_page_isolate() as an example.

Ah ok. I think it might be good to add that to the commit message, so
that people who don't understand mm deeply (like me) aren't worried
when they stumble over this change again in the future while digging
around.

> It's really only page_vma_mapped_walk() callers that are affected by
> this change, not any other page table walkers.

I guess my mm understanding is just not up to that, but I couldn't
figure out why looking only at page_vma_mapped_walk() callers is good
enough?

> It's unfortunate that we now have to fix it all up, that original
> series should have never been merged that way.

Yeah, looks like a rather big miss.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch