From: Jason Gunthorpe
To: linux-mm@kvack.org, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling@amd.com
Cc: linux-rdma@vger.kernel.org, dri-devel@lists.freedesktop.org,
	amd-gfx@lists.freedesktop.org, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko, Petr Cvek,
	Stefano Stabellini, nouveau@lists.freedesktop.org,
	xen-devel@lists.xenproject.org, Christoph Hellwig, Jason Gunthorpe
Subject: [PATCH v2 05/15] RDMA/odp: Use mmu_range_notifier_insert()
Date: Mon, 28 Oct 2019 17:10:22 -0300
Message-Id: <20191028201032.6352-6-jgg@ziepe.ca>
In-Reply-To: <20191028201032.6352-1-jgg@ziepe.ca>
References: <20191028201032.6352-1-jgg@ziepe.ca>

From: Jason Gunthorpe

Replace the internal interval tree based mmu notifier with the new common
mmu_range_notifier_insert() API. This removes a lot of code and fixes a
deadlock that can be triggered in ODP:

  zap_page_range()
    mmu_notifier_invalidate_range_start()
      [..]
      ib_umem_notifier_invalidate_range_start()
        down_read(&per_mm->umem_rwsem)
    unmap_single_vma()
      [..]
      __split_huge_page_pmd()
        mmu_notifier_invalidate_range_start()
          [..]
          ib_umem_notifier_invalidate_range_start()
            down_read(&per_mm->umem_rwsem)   // DEADLOCK
        mmu_notifier_invalidate_range_end()
          up_read(&per_mm->umem_rwsem)
    mmu_notifier_invalidate_range_end()
      up_read(&per_mm->umem_rwsem)

The umem_rwsem is held across the range_start/end as the ODP algorithm for
invalidate_range_end cannot tolerate changes to the interval tree. However,
due to the nested invalidation regions the second down_read() can deadlock
if there are competing writers. The new core code provides an alternative
scheme to solve this problem.
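For readers converting other drivers, here is a minimal sketch (not part of
the patch) of the driver-side retry pattern that replaces the old
notifiers_count/notifiers_seq accounting. It follows the pagefault_real_mr()
change below; the helper name odp_fault_and_map() and the retry loop are
illustrative only, while the mmu_range_*() calls and
ib_umem_odp_map_dma_pages() are the interfaces this series actually uses:

	/* Hypothetical helper, simplified; error handling trimmed */
	static int odp_fault_and_map(struct ib_umem_odp *odp, u64 user_va,
				     size_t bcnt, u64 access_mask)
	{
		unsigned long current_seq;
		int np;

	again:
		/* Snapshot the invalidation sequence before faulting pages */
		current_seq = mmu_range_read_begin(&odp->notifier);

		np = ib_umem_odp_map_dma_pages(odp, user_va, bcnt, access_mask,
					       current_seq);
		if (np < 0)
			return np;

		mutex_lock(&odp->umem_mutex);
		if (mmu_range_read_retry(&odp->notifier, current_seq)) {
			/* A racing invalidation ran; the pages may be stale */
			mutex_unlock(&odp->umem_mutex);
			goto again;
		}
		/* Still current: install the HW mappings under umem_mutex */
		mutex_unlock(&odp->umem_mutex);
		return np;
	}
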
Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count") Signed-off-by: Jason Gunthorpe --- drivers/infiniband/core/device.c | 1 - drivers/infiniband/core/umem_odp.c | 288 +++------------------------ drivers/infiniband/hw/mlx5/mlx5_ib.h | 7 +- drivers/infiniband/hw/mlx5/mr.c | 3 +- drivers/infiniband/hw/mlx5/odp.c | 50 +++-- include/rdma/ib_umem_odp.h | 65 ++---- include/rdma/ib_verbs.h | 2 - 7 files changed, 69 insertions(+), 347 deletions(-) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/d= evice.c index 2dd2cfe9b56136..ac7924b3c73abe 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -2617,7 +2617,6 @@ void ib_set_device_ops(struct ib_device *dev, const= struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, get_vf_config); SET_DEVICE_OP(dev_ops, get_vf_stats); SET_DEVICE_OP(dev_ops, init_port); - SET_DEVICE_OP(dev_ops, invalidate_range); SET_DEVICE_OP(dev_ops, iw_accept); SET_DEVICE_OP(dev_ops, iw_add_ref); SET_DEVICE_OP(dev_ops, iw_connect); diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core= /umem_odp.c index d7d5fadf0899ad..6132b8127e8435 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -48,197 +48,32 @@ =20 #include "uverbs.h" =20 -static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp) -{ - mutex_lock(&umem_odp->umem_mutex); - if (umem_odp->notifiers_count++ =3D=3D 0) - /* - * Initialize the completion object for waiting on - * notifiers. Since notifier_count is zero, no one should be - * waiting right now. - */ - reinit_completion(&umem_odp->notifier_completion); - mutex_unlock(&umem_odp->umem_mutex); -} - -static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp) -{ - mutex_lock(&umem_odp->umem_mutex); - /* - * This sequence increase will notify the QP page fault that the page - * that is going to be mapped in the spte could have been freed. - */ - ++umem_odp->notifiers_seq; - if (--umem_odp->notifiers_count =3D=3D 0) - complete_all(&umem_odp->notifier_completion); - mutex_unlock(&umem_odp->umem_mutex); -} - -static void ib_umem_notifier_release(struct mmu_notifier *mn, - struct mm_struct *mm) -{ - struct ib_ucontext_per_mm *per_mm =3D - container_of(mn, struct ib_ucontext_per_mm, mn); - struct rb_node *node; - - down_read(&per_mm->umem_rwsem); - if (!per_mm->mn.users) - goto out; - - for (node =3D rb_first_cached(&per_mm->umem_tree); node; - node =3D rb_next(node)) { - struct ib_umem_odp *umem_odp =3D - rb_entry(node, struct ib_umem_odp, interval_tree.rb); - - /* - * Increase the number of notifiers running, to prevent any - * further fault handling on this MR. 
- */ - ib_umem_notifier_start_account(umem_odp); - complete_all(&umem_odp->notifier_completion); - umem_odp->umem.ibdev->ops.invalidate_range( - umem_odp, ib_umem_start(umem_odp), - ib_umem_end(umem_odp)); - } - -out: - up_read(&per_mm->umem_rwsem); -} - -static int invalidate_range_start_trampoline(struct ib_umem_odp *item, - u64 start, u64 end, void *cookie) -{ - ib_umem_notifier_start_account(item); - item->umem.ibdev->ops.invalidate_range(item, start, end); - return 0; -} - -static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *= mn, - const struct mmu_notifier_range *range) -{ - struct ib_ucontext_per_mm *per_mm =3D - container_of(mn, struct ib_ucontext_per_mm, mn); - int rc; - - if (mmu_notifier_range_blockable(range)) - down_read(&per_mm->umem_rwsem); - else if (!down_read_trylock(&per_mm->umem_rwsem)) - return -EAGAIN; - - if (!per_mm->mn.users) { - up_read(&per_mm->umem_rwsem); - /* - * At this point users is permanently zero and visible to this - * CPU without a lock, that fact is relied on to skip the unlock - * in range_end. - */ - return 0; - } - - rc =3D rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start, - range->end, - invalidate_range_start_trampoline, - mmu_notifier_range_blockable(range), - NULL); - if (rc) - up_read(&per_mm->umem_rwsem); - return rc; -} - -static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64= start, - u64 end, void *cookie) -{ - ib_umem_notifier_end_account(item); - return 0; -} - -static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *m= n, - const struct mmu_notifier_range *range) -{ - struct ib_ucontext_per_mm *per_mm =3D - container_of(mn, struct ib_ucontext_per_mm, mn); - - if (unlikely(!per_mm->mn.users)) - return; - - rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start, - range->end, - invalidate_range_end_trampoline, true, NULL); - up_read(&per_mm->umem_rwsem); -} - -static struct mmu_notifier *ib_umem_alloc_notifier(struct mm_struct *mm) -{ - struct ib_ucontext_per_mm *per_mm; - - per_mm =3D kzalloc(sizeof(*per_mm), GFP_KERNEL); - if (!per_mm) - return ERR_PTR(-ENOMEM); - - per_mm->umem_tree =3D RB_ROOT_CACHED; - init_rwsem(&per_mm->umem_rwsem); - - WARN_ON(mm !=3D current->mm); - rcu_read_lock(); - per_mm->tgid =3D get_task_pid(current->group_leader, PIDTYPE_PID); - rcu_read_unlock(); - return &per_mm->mn; -} - -static void ib_umem_free_notifier(struct mmu_notifier *mn) -{ - struct ib_ucontext_per_mm *per_mm =3D - container_of(mn, struct ib_ucontext_per_mm, mn); - - WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root)); - - put_pid(per_mm->tgid); - kfree(per_mm); -} - -static const struct mmu_notifier_ops ib_umem_notifiers =3D { - .release =3D ib_umem_notifier_release, - .invalidate_range_start =3D ib_umem_notifier_invalidate_range_start= , - .invalidate_range_end =3D ib_umem_notifier_invalidate_range_end, - .alloc_notifier =3D ib_umem_alloc_notifier, - .free_notifier =3D ib_umem_free_notifier, -}; - static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp) { - struct ib_ucontext_per_mm *per_mm; - struct mmu_notifier *mn; int ret; =20 umem_odp->umem.is_odp =3D 1; + mutex_init(&umem_odp->umem_mutex); + if (!umem_odp->is_implicit_odp) { size_t page_size =3D 1UL << umem_odp->page_shift; + unsigned long start; + unsigned long end; size_t pages; =20 - umem_odp->interval_tree.start =3D - ALIGN_DOWN(umem_odp->umem.address, page_size); + start =3D ALIGN_DOWN(umem_odp->umem.address, page_size); if (check_add_overflow(umem_odp->umem.address, (unsigned 
long)umem_odp->umem.length, - &umem_odp->interval_tree.last)) + &end)) return -EOVERFLOW; - umem_odp->interval_tree.last =3D - ALIGN(umem_odp->interval_tree.last, page_size); - if (unlikely(umem_odp->interval_tree.last < page_size)) + end =3D ALIGN(end, page_size); + if (unlikely(end < page_size)) return -EOVERFLOW; =20 - pages =3D (umem_odp->interval_tree.last - - umem_odp->interval_tree.start) >> - umem_odp->page_shift; + pages =3D (end - start) >> umem_odp->page_shift; if (!pages) return -EINVAL; =20 - /* - * Note that the representation of the intervals in the - * interval tree considers the ending point as contained in - * the interval. - */ - umem_odp->interval_tree.last--; - umem_odp->page_list =3D kvcalloc( pages, sizeof(*umem_odp->page_list), GFP_KERNEL); if (!umem_odp->page_list) @@ -250,26 +85,15 @@ static inline int ib_init_umem_odp(struct ib_umem_od= p *umem_odp) ret =3D -ENOMEM; goto out_page_list; } - } =20 - mn =3D mmu_notifier_get(&ib_umem_notifiers, umem_odp->umem.owning_mm); - if (IS_ERR(mn)) { - ret =3D PTR_ERR(mn); - goto out_dma_list; - } - umem_odp->per_mm =3D per_mm =3D - container_of(mn, struct ib_ucontext_per_mm, mn); - - mutex_init(&umem_odp->umem_mutex); - init_completion(&umem_odp->notifier_completion); + ret =3D mmu_range_notifier_insert(&umem_odp->notifier, start, + end - start, current->mm); + if (ret) + goto out_dma_list; =20 - if (!umem_odp->is_implicit_odp) { - down_write(&per_mm->umem_rwsem); - interval_tree_insert(&umem_odp->interval_tree, - &per_mm->umem_tree); - up_write(&per_mm->umem_rwsem); + umem_odp->tgid =3D + get_task_pid(current->group_leader, PIDTYPE_PID); } - mmgrab(umem_odp->umem.owning_mm); =20 return 0; =20 @@ -290,8 +114,8 @@ static inline int ib_init_umem_odp(struct ib_umem_odp= *umem_odp) * @udata: udata from the syscall being used to create the umem * @access: ib_reg_mr access flags */ -struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_udata *udata, - int access) +struct ib_umem_odp * +ib_umem_odp_alloc_implicit(struct ib_udata *udata, int access) { struct ib_ucontext *context =3D container_of(udata, struct uverbs_attr_bundle, driver_udata) @@ -305,8 +129,6 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct= ib_udata *udata, =20 if (!context) return ERR_PTR(-EIO); - if (WARN_ON_ONCE(!context->device->ops.invalidate_range)) - return ERR_PTR(-EINVAL); =20 umem_odp =3D kzalloc(sizeof(*umem_odp), GFP_KERNEL); if (!umem_odp) @@ -336,8 +158,9 @@ EXPORT_SYMBOL(ib_umem_odp_alloc_implicit); * @addr: The starting userspace VA * @size: The length of the userspace VA */ -struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root, - unsigned long addr, size_t size) +struct ib_umem_odp * +ib_umem_odp_alloc_child(struct ib_umem_odp *root, unsigned long addr, + size_t size, const struct mmu_range_notifier_ops *ops) { /* * Caller must ensure that root cannot be freed during the call to @@ -360,6 +183,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib= _umem_odp *root, umem->writable =3D root->umem.writable; umem->owning_mm =3D root->umem.owning_mm; odp_data->page_shift =3D PAGE_SHIFT; + odp_data->notifier.ops =3D ops; =20 ret =3D ib_init_umem_odp(odp_data); if (ret) { @@ -383,7 +207,8 @@ EXPORT_SYMBOL(ib_umem_odp_alloc_child); * conjunction with MMU notifiers. 
*/ struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned lon= g addr, - size_t size, int access) + size_t size, int access, + const struct mmu_range_notifier_ops *ops) { struct ib_umem_odp *umem_odp; struct ib_ucontext *context; @@ -398,8 +223,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *= udata, unsigned long addr, if (!context) return ERR_PTR(-EIO); =20 - if (WARN_ON_ONCE(!(access & IB_ACCESS_ON_DEMAND)) || - WARN_ON_ONCE(!context->device->ops.invalidate_range)) + if (WARN_ON_ONCE(!(access & IB_ACCESS_ON_DEMAND))) return ERR_PTR(-EINVAL); =20 umem_odp =3D kzalloc(sizeof(struct ib_umem_odp), GFP_KERNEL); @@ -411,6 +235,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *= udata, unsigned long addr, umem_odp->umem.address =3D addr; umem_odp->umem.writable =3D ib_access_writable(access); umem_odp->umem.owning_mm =3D mm =3D current->mm; + umem_odp->notifier.ops =3D ops; =20 umem_odp->page_shift =3D PAGE_SHIFT; if (access & IB_ACCESS_HUGETLB) { @@ -442,8 +267,6 @@ EXPORT_SYMBOL(ib_umem_odp_get); =20 void ib_umem_odp_release(struct ib_umem_odp *umem_odp) { - struct ib_ucontext_per_mm *per_mm =3D umem_odp->per_mm; - /* * Ensure that no more pages are mapped in the umem. * @@ -455,28 +278,11 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_o= dp) ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp), ib_umem_end(umem_odp)); mutex_unlock(&umem_odp->umem_mutex); + mmu_range_notifier_remove(&umem_odp->notifier); kvfree(umem_odp->dma_list); kvfree(umem_odp->page_list); + put_pid(umem_odp->tgid); } - - down_write(&per_mm->umem_rwsem); - if (!umem_odp->is_implicit_odp) { - interval_tree_remove(&umem_odp->interval_tree, - &per_mm->umem_tree); - complete_all(&umem_odp->notifier_completion); - } - /* - * NOTE! mmu_notifier_unregister() can happen between a start/end - * callback, resulting in a missing end, and thus an unbalanced - * lock. This doesn't really matter to us since we are about to kfree - * the memory that holds the lock, however LOCKDEP doesn't like this. - * Thus we call the mmu_notifier_put under the rwsem and test the - * internal users count to reliably see if we are past this point. - */ - mmu_notifier_put(&per_mm->mn); - up_write(&per_mm->umem_rwsem); - - mmdrop(umem_odp->umem.owning_mm); kfree(umem_odp); } EXPORT_SYMBOL(ib_umem_odp_release); @@ -501,7 +307,7 @@ EXPORT_SYMBOL(ib_umem_odp_release); */ static int ib_umem_odp_map_dma_single_page( struct ib_umem_odp *umem_odp, - int page_index, + unsigned int page_index, struct page *page, u64 access_mask, unsigned long current_seq) @@ -510,12 +316,7 @@ static int ib_umem_odp_map_dma_single_page( dma_addr_t dma_addr; int ret =3D 0; =20 - /* - * Note: we avoid writing if seq is different from the initial seq, to - * handle case of a racing notifier. This check also allows us to bail - * early if we have a notifier running in parallel with us. - */ - if (ib_umem_mmu_notifier_retry(umem_odp, current_seq)) { + if (mmu_range_check_retry(&umem_odp->notifier, current_seq)) { ret =3D -EAGAIN; goto out; } @@ -618,7 +419,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *ume= m_odp, u64 user_virt, * existing beyond the lifetime of the originating process.. Presumably * mmget_not_zero will fail in this case. 
*/ - owning_process =3D get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID); + owning_process =3D get_pid_task(umem_odp->tgid, PIDTYPE_PID); if (!owning_process || !mmget_not_zero(owning_mm)) { ret =3D -EINVAL; goto out_put_task; @@ -762,32 +563,3 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp = *umem_odp, u64 virt, } } EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages); - -/* @last is not a part of the interval. See comment for function - * node_last. - */ -int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root, - u64 start, u64 last, - umem_call_back cb, - bool blockable, - void *cookie) -{ - int ret_val =3D 0; - struct interval_tree_node *node, *next; - struct ib_umem_odp *umem; - - if (unlikely(start =3D=3D last)) - return ret_val; - - for (node =3D interval_tree_iter_first(root, start, last - 1); - node; node =3D next) { - /* TODO move the blockable decision up to the callback */ - if (!blockable) - return -EAGAIN; - next =3D interval_tree_iter_next(node, start, last - 1); - umem =3D container_of(node, struct ib_umem_odp, interval_tree); - ret_val =3D cb(umem, start, last, cookie) || ret_val; - } - - return ret_val; -} diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw= /mlx5/mlx5_ib.h index f61d4005c6c379..c719f08b351670 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -1263,8 +1263,6 @@ int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev)= ; void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev); int __init mlx5_ib_odp_init(void); void mlx5_ib_odp_cleanup(void); -void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned lon= g start, - unsigned long end); void mlx5_odp_init_mr_cache_entry(struct mlx5_cache_ent *ent); void mlx5_odp_populate_klm(struct mlx5_klm *pklm, size_t offset, size_t nentries, struct mlx5_ib_mr *mr, int flags); @@ -1294,11 +1292,10 @@ mlx5_ib_advise_mr_prefetch(struct ib_pd *pd, { return -EOPNOTSUPP; } -static inline void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp= , - unsigned long start, - unsigned long end){}; #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ =20 +extern const struct mmu_range_notifier_ops mlx5_mn_ops; + /* Needed for rep profile */ void __mlx5_ib_remove(struct mlx5_ib_dev *dev, const struct mlx5_ib_profile *profile, diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5= /mr.c index 199f7959aaa510..fbe31830b22807 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -743,7 +743,8 @@ static int mr_umem_get(struct mlx5_ib_dev *dev, struc= t ib_udata *udata, if (access_flags & IB_ACCESS_ON_DEMAND) { struct ib_umem_odp *odp; =20 - odp =3D ib_umem_odp_get(udata, start, length, access_flags); + odp =3D ib_umem_odp_get(udata, start, length, access_flags, + &mlx5_mn_ops); if (IS_ERR(odp)) { mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(odp)); diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx= 5/odp.c index bcfc098466977e..f713eb82eeead4 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -241,17 +241,26 @@ static void destroy_unused_implicit_child_mr(struct= mlx5_ib_mr *mr) xa_unlock(&imr->implicit_children); } =20 -void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned lon= g start, - unsigned long end) +static bool mlx5_ib_invalidate_range(struct mmu_range_notifier *mrn, + const struct mmu_notifier_range *range, + unsigned long cur_seq) { + struct ib_umem_odp *umem_odp =3D + container_of(mrn, struct ib_umem_odp, notifier); 
struct mlx5_ib_mr *mr; const u64 umr_block_mask =3D (MLX5_UMR_MTT_ALIGNMENT / sizeof(struct mlx5_mtt)) - 1; u64 idx =3D 0, blk_start_idx =3D 0; + unsigned long start; + unsigned long end; int in_block =3D 0; u64 addr; =20 + if (!mmu_notifier_range_blockable(range)) + return false; + mutex_lock(&umem_odp->umem_mutex); + mmu_range_set_seq(mrn, cur_seq); /* * If npages is zero then umem_odp->private may not be setup yet. This * does not complete until after the first page is mapped for DMA. @@ -260,8 +269,8 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *ume= m_odp, unsigned long start, goto out; mr =3D umem_odp->private; =20 - start =3D max_t(u64, ib_umem_start(umem_odp), start); - end =3D min_t(u64, ib_umem_end(umem_odp), end); + start =3D max_t(u64, ib_umem_start(umem_odp), range->start); + end =3D min_t(u64, ib_umem_end(umem_odp), range->end); =20 /* * Iteration one - zap the HW's MTTs. The notifiers_count ensures that @@ -312,8 +321,13 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *um= em_odp, unsigned long start, destroy_unused_implicit_child_mr(mr); out: mutex_unlock(&umem_odp->umem_mutex); + return true; } =20 +const struct mmu_range_notifier_ops mlx5_mn_ops =3D { + .invalidate =3D mlx5_ib_invalidate_range, +}; + void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev) { struct ib_odp_caps *caps =3D &dev->odp_caps; @@ -414,7 +428,7 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struc= t mlx5_ib_mr *imr, =20 odp =3D ib_umem_odp_alloc_child(to_ib_umem_odp(imr->umem), idx * MLX5_IMR_MTT_SIZE, - MLX5_IMR_MTT_SIZE); + MLX5_IMR_MTT_SIZE, &mlx5_mn_ops); if (IS_ERR(odp)) return ERR_CAST(odp); =20 @@ -600,8 +614,9 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, s= truct ib_umem_odp *odp, u64 user_va, size_t bcnt, u32 *bytes_mapped, u32 flags) { - int current_seq, page_shift, ret, np; + int page_shift, ret, np; bool downgrade =3D flags & MLX5_PF_FLAGS_DOWNGRADE; + unsigned long current_seq; u64 access_mask; u64 start_idx, page_mask; =20 @@ -613,12 +628,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, = struct ib_umem_odp *odp, if (odp->umem.writable && !downgrade) access_mask |=3D ODP_WRITE_ALLOWED_BIT; =20 - current_seq =3D READ_ONCE(odp->notifiers_seq); - /* - * Ensure the sequence number is valid for some time before we call - * gup. - */ - smp_rmb(); + current_seq =3D mmu_range_read_begin(&odp->notifier); =20 np =3D ib_umem_odp_map_dma_pages(odp, user_va, bcnt, access_mask, current_seq); @@ -626,7 +636,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, s= truct ib_umem_odp *odp, return np; =20 mutex_lock(&odp->umem_mutex); - if (!ib_umem_mmu_notifier_retry(odp, current_seq)) { + if (!mmu_range_read_retry(&odp->notifier, current_seq)) { /* * No need to check whether the MTTs really belong to * this MR, since ib_umem_odp_map_dma_pages already @@ -656,19 +666,6 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, = struct ib_umem_odp *odp, return np << (page_shift - PAGE_SHIFT); =20 out: - if (ret =3D=3D -EAGAIN) { - unsigned long timeout =3D msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT); - - if (!wait_for_completion_timeout(&odp->notifier_completion, - timeout)) { - mlx5_ib_warn( - mr->dev, - "timeout waiting for mmu notifier. seq %d against %d. 
notifiers_coun= t=3D%d\n", - current_seq, odp->notifiers_seq, - odp->notifiers_count); - } - } - return ret; } =20 @@ -1609,7 +1606,6 @@ void mlx5_odp_init_mr_cache_entry(struct mlx5_cache= _ent *ent) =20 static const struct ib_device_ops mlx5_ib_dev_odp_ops =3D { .advise_mr =3D mlx5_ib_advise_mr, - .invalidate_range =3D mlx5_ib_invalidate_range, }; =20 int mlx5_ib_odp_init_one(struct mlx5_ib_dev *dev) diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h index 09b0e4494986a9..98ed5435afccd9 100644 --- a/include/rdma/ib_umem_odp.h +++ b/include/rdma/ib_umem_odp.h @@ -35,11 +35,11 @@ =20 #include #include -#include =20 struct ib_umem_odp { struct ib_umem umem; - struct ib_ucontext_per_mm *per_mm; + struct mmu_range_notifier notifier; + struct pid *tgid; =20 /* * An array of the pages included in the on-demand paging umem. @@ -62,13 +62,8 @@ struct ib_umem_odp { struct mutex umem_mutex; void *private; /* for the HW driver to use. */ =20 - int notifiers_seq; - int notifiers_count; int npages; =20 - /* Tree tracking */ - struct interval_tree_node interval_tree; - /* * An implicit odp umem cannot be DMA mapped, has 0 length, and serves * only as an anchor for the driver to hold onto the per_mm. FIXME: @@ -77,7 +72,6 @@ struct ib_umem_odp { */ bool is_implicit_odp; =20 - struct completion notifier_completion; unsigned int page_shift; }; =20 @@ -89,13 +83,13 @@ static inline struct ib_umem_odp *to_ib_umem_odp(stru= ct ib_umem *umem) /* Returns the first page of an ODP umem. */ static inline unsigned long ib_umem_start(struct ib_umem_odp *umem_odp) { - return umem_odp->interval_tree.start; + return umem_odp->notifier.interval_tree.start; } =20 /* Returns the address of the page after the last one of an ODP umem. */ static inline unsigned long ib_umem_end(struct ib_umem_odp *umem_odp) { - return umem_odp->interval_tree.last + 1; + return umem_odp->notifier.interval_tree.last + 1; } =20 static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp) @@ -119,21 +113,14 @@ static inline size_t ib_umem_odp_num_pages(struct i= b_umem_odp *umem_odp) =20 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING =20 -struct ib_ucontext_per_mm { - struct mmu_notifier mn; - struct pid *tgid; - - struct rb_root_cached umem_tree; - /* Protects umem_tree */ - struct rw_semaphore umem_rwsem; -}; - struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned lon= g addr, - size_t size, int access); + size_t size, int access, + const struct mmu_range_notifier_ops *ops); struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_udata *udata, int access); -struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root_ume= m, - unsigned long addr, size_t size); +struct ib_umem_odp * +ib_umem_odp_alloc_child(struct ib_umem_odp *root_umem, unsigned long add= r, + size_t size, const struct mmu_range_notifier_ops *ops); void ib_umem_odp_release(struct ib_umem_odp *umem_odp); =20 int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 start_of= fset, @@ -143,39 +130,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *u= mem_odp, u64 start_offset, void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 start= _offset, u64 bound); =20 -typedef int (*umem_call_back)(struct ib_umem_odp *item, u64 start, u64 e= nd, - void *cookie); -/* - * Call the callback on each ib_umem in the range. Returns the logical o= r of - * the return values of the functions called. 
- */ -int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root, - u64 start, u64 end, - umem_call_back cb, - bool blockable, void *cookie); - -static inline int ib_umem_mmu_notifier_retry(struct ib_umem_odp *umem_od= p, - unsigned long mmu_seq) -{ - /* - * This code is strongly based on the KVM code from - * mmu_notifier_retry. Should be called with - * the relevant locks taken (umem_odp->umem_mutex - * and the ucontext umem_mutex semaphore locked for read). - */ - - if (unlikely(umem_odp->notifiers_count)) - return 1; - if (umem_odp->notifiers_seq !=3D mmu_seq) - return 1; - return 0; -} - #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ =20 -static inline struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata= , - unsigned long addr, - size_t size, int access) +static inline struct ib_umem_odp * +ib_umem_odp_get(struct ib_udata *udata, unsigned long addr, size_t size, + int access, const struct mmu_range_notifier_ops *ops) { return ERR_PTR(-EINVAL); } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 6a47ba85c54c11..2c30c859ae0d13 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -2422,8 +2422,6 @@ struct ib_device_ops { u64 iova); int (*unmap_fmr)(struct list_head *fmr_list); int (*dealloc_fmr)(struct ib_fmr *fmr); - void (*invalidate_range)(struct ib_umem_odp *umem_odp, - unsigned long start, unsigned long end); int (*attach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid); int (*detach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid); struct ib_xrcd *(*alloc_xrcd)(struct ib_device *device, --=20 2.23.0
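
(Note, not part of the patch: for drivers doing the same conversion, the
per-driver hook mirrors the mlx5 hunk above. The "my_drv" names below are
placeholders; mmu_range_notifier_ops, mmu_range_set_seq() and
mmu_notifier_range_blockable() are the interfaces this series relies on.)

	static bool my_drv_invalidate(struct mmu_range_notifier *mrn,
				      const struct mmu_notifier_range *range,
				      unsigned long cur_seq)
	{
		struct ib_umem_odp *umem_odp =
			container_of(mrn, struct ib_umem_odp, notifier);

		if (!mmu_notifier_range_blockable(range))
			return false;

		mutex_lock(&umem_odp->umem_mutex);
		/* Publish the new seq under the lock the fault path also takes */
		mmu_range_set_seq(mrn, cur_seq);
		/* ... zap HW mappings covering range->start .. range->end ... */
		mutex_unlock(&umem_odp->umem_mutex);
		return true;
	}

	static const struct mmu_range_notifier_ops my_drv_mn_ops = {
		.invalidate = my_drv_invalidate,
	};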