From mboxrd@z Thu Jan  1 00:00:00 1970
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
To: intel-xe@lists.freedesktop.org
Cc: Thomas Hellström, Matthew Brost, Christian König, David Hildenbrand,
 Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport,
 Suren Baghdasaryan, Michal Hocko, Jason Gunthorpe, Andrew Morton,
 Simona Vetter, Dave Airlie, Alistair Popple,
 dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH v4 1/4] mm/mmu_notifier: Allow two-pass struct mmu_interval_notifiers
Date: Thu, 5 Mar 2026 10:39:06 +0100
Message-ID: <20260305093909.43623-2-thomas.hellstrom@linux.intel.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260305093909.43623-1-thomas.hellstrom@linux.intel.com>
References: <20260305093909.43623-1-thomas.hellstrom@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
GPU use-cases for mmu_interval_notifiers with HMM often involve starting a
GPU operation and then waiting for it to complete.
These operations are typically context preemption or TLB flushing. With
single-pass notifiers per GPU, this doesn't scale in multi-GPU scenarios:
there we'd want to first start preemption or TLB flushing on all GPUs, and
as a second pass wait for them all to complete. One could do this on a
per-driver basis by multiplexing per-driver notifiers, but that would mean
sharing the notifier "user" lock across all GPUs, which doesn't scale well
either, so adding multi-pass support to the core appears to be the right
choice.

Implement two-pass capability in the mmu_interval_notifier. Use a linked
list for the final passes to minimize the impact for use-cases that don't
need the multi-pass functionality, by avoiding a second interval tree walk,
and to be able to easily pass data between the two passes.

v1:
- Restrict to two passes (Jason Gunthorpe)
- Improve on documentation (Jason Gunthorpe)
- Improve on function naming (Alistair Popple)
v2:
- Include the invalidate_finish() callback in the
  struct mmu_interval_notifier_ops.
- Update documentation (GitHub Copilot:claude-sonnet-4.6)
- Use a lockless list for list management.
v3:
- Update the kerneldoc for the struct mmu_interval_notifier_finish::list
  member (Matthew Brost)
- Add a WARN_ON_ONCE() checking for a NULL invalidate_finish() op if
  invalidate_start() is non-NULL. (Matthew Brost)
v4:
- Addressed documentation review comments by David Hildenbrand.

Cc: Matthew Brost
Cc: Christian König
Cc: David Hildenbrand
Cc: Lorenzo Stoakes
Cc: Liam R. Howlett
Cc: Vlastimil Babka
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Michal Hocko
Cc: Jason Gunthorpe
Cc: Andrew Morton
Cc: Simona Vetter
Cc: Dave Airlie
Cc: Alistair Popple
Cc:
Cc:
Cc:
Assisted-by: GitHub Copilot:claude-sonnet-4.6 # Documentation only.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 include/linux/mmu_notifier.h | 42 +++++++++++++++++++++++
 mm/mmu_notifier.c            | 65 +++++++++++++++++++++++++++++++-----
 2 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 07a2bbaf86e9..dcdfdf1e0b39 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -233,16 +233,58 @@ struct mmu_notifier {
 	unsigned int users;
 };
 
+/**
+ * struct mmu_interval_notifier_finish - mmu_interval_notifier two-pass abstraction
+ * @link: Lockless list link for the notifier's pending pass list
+ * @notifier: The mmu_interval_notifier for which the finish pass is called.
+ *
+ * Allocate, typically using GFP_NOWAIT in the interval notifier's start pass.
+ * Note that with a large number of notifiers implementing two passes,
+ * allocation with GFP_NOWAIT will become increasingly likely to fail, so
+ * consider implementing a small pool instead of using kmalloc() allocations.
+ *
+ * If the implementation needs to pass data between the start and the finish
+ * passes, the recommended way is to embed struct mmu_interval_notifier_finish
+ * into a larger structure that also contains the data needed to be shared.
+ * Keep in mind that a notifier callback can be invoked in parallel, and each
+ * invocation needs its own struct mmu_interval_notifier_finish.
+ *
+ * If allocation fails, then the &mmu_interval_notifier_ops->invalidate_start
+ * op needs to implement the full notifier functionality. Please refer to its
+ * documentation.
+ */
+struct mmu_interval_notifier_finish {
+	struct llist_node link;
+	struct mmu_interval_notifier *notifier;
+};
+
 /**
  * struct mmu_interval_notifier_ops
  * @invalidate: Upon return the caller must stop using any SPTEs within this
  *              range. This function can sleep. Return false only if sleeping
  *              was required but mmu_notifier_range_blockable(range) is false.
+ * @invalidate_start: Similar to @invalidate, but intended for two-pass notifier
+ *                    callbacks where the call to @invalidate_start is the first
+ *                    pass and any struct mmu_interval_notifier_finish pointer
+ *                    returned in the @finish parameter describes the finish
+ *                    pass. If *@finish is %NULL on return, then no final pass
+ *                    will be called, and @invalidate_start needs to implement
+ *                    the full notifier, behaving like @invalidate. The value
+ *                    of *@finish is guaranteed to be %NULL at function entry.
+ * @invalidate_finish: Called as the second pass for any notifier that returned
+ *                     a non-NULL *@finish from @invalidate_start. The @finish
+ *                     pointer passed here is the same one returned by
+ *                     @invalidate_start.
  */
 struct mmu_interval_notifier_ops {
 	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
 			   const struct mmu_notifier_range *range,
 			   unsigned long cur_seq);
+	bool (*invalidate_start)(struct mmu_interval_notifier *interval_sub,
+				 const struct mmu_notifier_range *range,
+				 unsigned long cur_seq,
+				 struct mmu_interval_notifier_finish **finish);
+	void (*invalidate_finish)(struct mmu_interval_notifier_finish *finish);
 };
 
 struct mmu_interval_notifier {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a6cdf3674bdc..4d8a64ce8eda 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -260,6 +260,15 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 }
 EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
 
+static void mn_itree_finish_pass(struct llist_head *finish_passes)
+{
+	struct llist_node *first = llist_reverse_order(__llist_del_all(finish_passes));
+	struct mmu_interval_notifier_finish *f, *next;
+
+	llist_for_each_entry_safe(f, next, first, link)
+		f->notifier->ops->invalidate_finish(f);
+}
+
 static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
 			     struct mm_struct *mm)
 {
@@ -271,6 +280,7 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
 		.end = ULONG_MAX,
 	};
 	struct mmu_interval_notifier *interval_sub;
+	LLIST_HEAD(finish_passes);
 	unsigned long cur_seq;
 	bool ret;
 
@@ -278,11 +288,27 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
 	     mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
 	     interval_sub;
 	     interval_sub = mn_itree_inv_next(interval_sub, &range)) {
-		ret = interval_sub->ops->invalidate(interval_sub, &range,
-						    cur_seq);
+		if (interval_sub->ops->invalidate_start) {
+			struct mmu_interval_notifier_finish *finish = NULL;
+
+			ret = interval_sub->ops->invalidate_start(interval_sub,
+								  &range,
+								  cur_seq,
+								  &finish);
+			if (ret && finish) {
+				finish->notifier = interval_sub;
+				__llist_add(&finish->link, &finish_passes);
+			}
+
+		} else {
+			ret = interval_sub->ops->invalidate(interval_sub,
+							    &range,
+							    cur_seq);
+		}
 		WARN_ON(!ret);
 	}
 
+	mn_itree_finish_pass(&finish_passes);
 	mn_itree_inv_end(subscriptions);
 }
 
@@ -430,7 +456,9 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
 			       const struct mmu_notifier_range *range)
 {
 	struct mmu_interval_notifier *interval_sub;
+	LLIST_HEAD(finish_passes);
 	unsigned long cur_seq;
+	int err = 0;
 
 	for (interval_sub =
 		     mn_itree_inv_start_range(subscriptions, range, &cur_seq);
@@ -438,23 +466,41 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
 	     interval_sub = mn_itree_inv_next(interval_sub, range)) {
 		bool ret;
 
-		ret = interval_sub->ops->invalidate(interval_sub, range,
-						    cur_seq);
+		if (interval_sub->ops->invalidate_start) {
+			struct mmu_interval_notifier_finish *finish = NULL;
+
+			ret = interval_sub->ops->invalidate_start(interval_sub,
+								  range,
+								  cur_seq,
+								  &finish);
+			if (ret && finish) {
+				finish->notifier = interval_sub;
+				__llist_add(&finish->link, &finish_passes);
+			}
+
+		} else {
+			ret = interval_sub->ops->invalidate(interval_sub,
+							    range,
+							    cur_seq);
+		}
 		if (!ret) {
 			if (WARN_ON(mmu_notifier_range_blockable(range)))
 				continue;
-			goto out_would_block;
+			err = -EAGAIN;
+			break;
 		}
 	}
-	return 0;
 
-out_would_block:
+	mn_itree_finish_pass(&finish_passes);
+
 	/*
 	 * On -EAGAIN the non-blocking caller is not allowed to call
 	 * invalidate_range_end()
 	 */
-	mn_itree_inv_end(subscriptions);
-	return -EAGAIN;
+	if (err)
+		mn_itree_inv_end(subscriptions);
+
+	return err;
 }
 
 static int mn_hlist_invalidate_range_start(
@@ -976,6 +1022,7 @@ int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
 	struct mmu_notifier_subscriptions *subscriptions;
 	int ret;
 
+	WARN_ON_ONCE(ops->invalidate_start && !ops->invalidate_finish);
 	might_lock(&mm->mmap_lock);
 
 	subscriptions = smp_load_acquire(&mm->notifier_subscriptions);
-- 
2.53.0