Date: Wed, 29 Jan 2025 15:05:41 +0100
From: Simona Vetter
To: David Hildenbrand
Cc: Alistair Popple, "linux-mm@kvack.org", John Hubbard, nouveau@lists.freedesktop.org, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich
Subject: Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice?
References: <8c6f3838-f194-4a42-845d-10011192a234@redhat.com>
On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote:
> On 29.01.25 12:28, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote:
> > > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote:
> > > > On 28.01.25 21:14, Simona Vetter wrote:
> > > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote:
> > > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
> > > > > > > > > > On integrated the gpu is tied into the coherency
> > > > > > > > > > fabric, so there it's not needed.
> > > > > > > > > >
> > > > > > > > > > I think the more fundamental question with both this function here and
> > > > > > > > > > with forced migration to device memory is that there's no guarantee it
> > > > > > > > > > will work out.
> > > > > > > > >
> > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP
> > > > > > > > > and is only limited to anonymous memory. I have patches to at least make it
> > > > > > > > > work reliably with THP.
> > > > > > > >
> > > > > > > > I should have crawled through the implementation first before replying.
> > > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at
> > > > > > > > least in theory work reliably on anon memory, and not be impacted by
> > > > > > > > elevated refcounts due to migration/ksm/thp/whatever.
> > > > > > >
> > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the
> > > > > > > folio lock. That's different than page migration.
> > > > > >
> > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we
> > > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same
> > > > > > way page migration can (otherwise we could have just migrated to a device page,
> > > > > > even if that may have added unwanted overhead).
> > > > >
> > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks
> > > > > when locking multiple folios, but I think at least on the first one we
> > > > > need an unconditional folio_lock to guarantee forward progress.
> > > >
> > > > At least on the hmm path I was able to trigger the EBUSY a couple of times
> > > > due to concurrent swapout. But the hmm-tests selftest fails immediately
> > > > instead of retrying.
> > >
> > > My worries with just retrying is that it's very hard to assess whether
> > > there's a livelock or whether the retry has a good chance of success. As
> > > an example the ->migrate_to_ram path has some trylocks, and the window
> > > where all other threads got halfway and then fail the trylock is big
> > > enough that once you pile up enough threads that spin through there,
> > > you're stuck forever. Which isn't great.
> > >
> > > So if we could convert at least the first folio_trylock into a plain lock
> > > then forward progress is obviously assured and there's no need to crawl
> > > through large chunks of mm/ code to hunt for corner cases where we could
> > > be too unlucky to ever win the race.
> > >
> > > > > Since atomics can't cross 4k boundaries (or the hw is just really broken) this
> > > > > should be enough to avoid being stuck in a livelock. I'm also not seeing
> > > > > any other reason why a folio_lock shouldn't work here, but then my
> > > > > understanding of mm/ stuff is really just scratching the surface.
> > > > >
> > > > > I did crawl through all the other code and it looks like everything else
> > > > > is unconditional locks.
> > > > > So looks all good and I didn't spot anything else that seemed problematic.
> > > > >
> > > > > Somewhat aside, I do wonder whether we really want to require callers to
> > > > > hold the mmap lock, or whether with all the work towards lockless fastpath
> > > > > that shouldn't instead just be an implementation detail.
> > > >
> > > > We might be able to use the VMA lock in the future, but that will require
> > > > GUP support and a bunch more. Until then, the mm_lock in read mode is
> > > > required.
> > >
> > > Yup. I also don't think we should try to improve before benchmarks show an
> > > actual need. It's more about future proofing and making sure mmap_lock
> > > doesn't leak into driver data structures that I'm worried about. Because
> > > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
> > > keep everything correct on the driver side, which is not a clean design.
> > >
> > > > I was not able to convince myself that we'll really need the folio lock, but
> > > > that's also a separate discussion.
> > >
> > > This is way above my pay understanding of mm/ unfortunately.
> >
> > I pondered this some more, and I think it's to make sure we get a stable
> > reading of folio_mapcount() and are not racing with new rmaps being
> > established. But I also got lost a few times in the maze ...
>
> That mapcount handling is all messed up and I'll remove that along with
> the rmap walk. Also, the folio lock does not stabilize the mapcount at all ...

Hm ... also rethinking this all, we don't need a lot of guarantees here.
Anything userspace does that re-elevates the mapcount or otherwise could
starve out make_device_exclusive is I think a case of "don't do that". I
think the only guarantee we need is that make_device_exclusive succeeds
against other kernel stuff like thp/migration/ksm and all those things, at
least reliably when you retry.
And maybe also that concurrent make_device_exclusive calls don't starve
each other but eventually get the job done (but only if it's the same
owner).

> Here is my understanding:
>
> commit e2dca6db09186534c7e6082b77be6e17d8920f10
> Author: David Hildenbrand
> Date:   Tue Jan 28 15:25:37 2025 +0100
>
>     mm/memory: document restore_exclusive_pte()
>
>     Let's document how this function is to be used, and why the requirement
>     for the folio lock might maybe be dropped in the future.
>
>     Signed-off-by: David Hildenbrand
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 46956994aaff..caaae8df11a9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  }
>  #endif
>
> +/**
> + * restore_exclusive_pte - Restore a device-exclusive entry
> + * @vma: VMA covering @address
> + * @folio: the mapped folio
> + * @page: the mapped folio page
> + * @address: the virtual address
> + * @ptep: PTE pointer into the locked page table mapping the folio page
> + * @orig_pte: PTE value at @ptep
> + *
> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> + *
> + * The folio and the page table must be locked, and MMU notifiers must have
> + * been called to invalidate any (exclusive) device mappings. In case of
> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> + *
> + * Locking the folio makes sure that anybody who just converted the PTE to
> + * a device-private entry can map it into the device, before unlocking it; so
> + * the folio lock prevents concurrent conversion to device-exclusive.
> + *
> + * TODO: the folio lock does not protect against all cases of concurrent
> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> + * must already use MMU notifiers to sync against any concurrent changes.
> + * Maybe the requirement for the folio lock can be dropped in the future.
Hm yeah I was a bit confused why this would work at first. But since we
break cow with FOLL_WRITE there shouldn't be any other mappings around.
Therefore relying on the mmu notifier for that mm_struct is enough, and we
don't need to hold the folio_lock in callers.

I think pushing both the folio_unlock and the mmap_read_lock/unlock into
make_device_exclusive would be a good clarification of what's going on
here.

Cheers, Sima

> + */
>
> --
> Cheers,
>
> David / dhildenb
>

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch