From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ming Mao <maoming.maoming@huawei.com>
To: <linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<linux-mm@kvack.org>, <alex.williamson@redhat.com>,
	<akpm@linux-foundation.org>
CC: <cohuck@redhat.com>, <jianjay.zhou@huawei.com>,
	<weidong.huang@huawei.com>, <peterx@redhat.com>, <aarcange@redhat.com>,
	<wangyunjian@huawei.com>, Ming Mao <maoming.maoming@huawei.com>
Subject: [PATCH V4 1/2] vfio dma_map/unmap: optimized for hugetlbfs pages
Date: Tue, 8 Sep 2020 21:32:03 +0800
Message-ID: <20200908133204.1338-2-maoming.maoming@huawei.com>
X-Mailer: git-send-email 2.26.2.windows.1
In-Reply-To: <20200908133204.1338-1-maoming.maoming@huawei.com>
References: <20200908133204.1338-1-maoming.maoming@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable
List-ID: <linux-mm.kvack.org>

In the original dma_map/unmap path for VFIO devices, pages are checked
one by one to make sure they are contiguous. As a result, dma_map/unmap
can take a long time.
Hugetlb pages avoid this problem: all PAGE_SIZE pages inside a hugetlb
page are physically contiguous, and a hugetlb page is never split.
So the per-page for loops can be skipped for hugetlb pages.

As suggested by Peter Xu, we should use unpin_user_pages_dirty_lock()
to unpin hugetlb pages.
That API still unpins the pages one by one, so it should be optimized
as well.
This patch does not optimize the unpinning path; that will be done in
a separate patch.

Signed-off-by: Ming Mao <maoming.maoming@huawei.com>
---
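Note for readers (not part of the commit message): the residual-page
arithmetic used by hugetlb_get_residual_pages() in the diff below can be
tried in user space. The following is a minimal sketch, assuming 4 KiB
base pages. SKETCH_PAGE_SHIFT and sketch_residual_pages() are names made
up for this example and are not kernel symbols.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12UL	/* assumption: 4 KiB base pages */

/*
 * Number of base pages from `address` to the end of the huge page that
 * contains it, counting the page at `address` itself.
 * `order` is log2 of the huge page size in base pages (9 for 2 MiB).
 */
static long sketch_residual_pages(unsigned long address, unsigned int order)
{
	unsigned long npage, mask;

	/* Precondition of this sketch: a real huge page order. */
	assert(order > 0 && order < 8 * sizeof(unsigned long));

	npage = 1UL << order;
	mask = npage - 1;

	/* Index of the base page inside the huge page, then the remainder. */
	return (long)(npage - ((address >> SKETCH_PAGE_SHIFT) & mask));
}
```

With order 9, an address at the start of a 2 MiB huge page yields 512,
and an address in its last base page yields 1, so the result is always
at least 1.
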
 drivers/vfio/vfio_iommu_type1.c | 289 +++++++++++++++++++++++++++++++-
 1 file changed, 281 insertions(+), 8 deletions(-)
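
As a rough illustration of why stepping by hugetlb page shortens the
unmap walk, the user-space sketch below counts address lookups for the
per-page loop and for a per-hugepage loop. Everything here is an
assumption of the example: sketch_iova_to_phys() is a hypothetical
stand-in for iommu_iova_to_phys() that models a linear mapping, base
pages are 4 KiB, huge pages are 2 MiB, and the remainder of the huge
page is derived from the physical address.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT  12UL			/* assumption: 4 KiB base pages */
#define SKETCH_PAGE_SIZE   (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_HUGE_NPAGE  (1UL << 9)		/* assumption: 2 MiB huge pages */
#define SKETCH_PHYS_OFFSET 0x40000000UL		/* assumption: huge-page aligned */

static unsigned long sketch_lookups;	/* number of translations performed */

/* Hypothetical stand-in for iommu_iova_to_phys(): a linear mapping. */
static unsigned long sketch_iova_to_phys(unsigned long iova)
{
	sketch_lookups++;
	return iova + SKETCH_PHYS_OFFSET;
}

/* One lookup per base page, like the loop this patch moves out. */
static size_t sketch_walk_per_page(unsigned long start, unsigned long end,
				   unsigned long phys_base)
{
	size_t len;

	assert(start < end);
	for (len = SKETCH_PAGE_SIZE; start + len < end; len += SKETCH_PAGE_SIZE)
		if (sketch_iova_to_phys(start + len) != phys_base + len)
			break;
	return len;
}

/* One lookup per huge page: skip to the end of the huge page each time. */
static size_t sketch_walk_per_hugepage(unsigned long start, unsigned long end,
				       unsigned long phys_base)
{
	size_t len = SKETCH_PAGE_SIZE;

	assert(start < end);
	while (start + len < end) {
		unsigned long next = sketch_iova_to_phys(start + len);
		unsigned long residual, limit;

		if (next != phys_base + len)
			break;
		/* Base pages left in the huge page that contains `next`. */
		residual = SKETCH_HUGE_NPAGE -
			   ((next >> SKETCH_PAGE_SHIFT) & (SKETCH_HUGE_NPAGE - 1));
		/* Never step past the end of the requested range. */
		limit = (end - start - len) >> SKETCH_PAGE_SHIFT;
		len += (residual < limit ? residual : limit) * SKETCH_PAGE_SIZE;
	}
	return len;
}
```

For a 4 MiB range backed by two 2 MiB huge pages, both walks return the
full 4 MiB, but the per-page walk does 1023 lookups and the per-hugepage
walk does 2.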

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_ty=
pe1.c
index 5e556ac91..8c1dc5136 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -479,6 +479,222 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsi=
gned long vaddr,
 	return ret;
 }
=20
+static bool is_hugetlb_page(unsigned long pfn)
+{
+	struct page *page;
+
+	if (!pfn_valid(pfn))
+		return false;
+
+	page =3D pfn_to_page(pfn);
+	/* only check for hugetlb pages */
+	return page && PageHuge(page);
+}
+
+static bool vaddr_is_hugetlb_page(unsigned long vaddr, int prot)
+{
+	unsigned long pfn;
+	int ret;
+	bool result;
+
+	if (!current->mm)
+		return false;
+
+	ret =3D vaddr_get_pfn(current->mm, vaddr, prot, &pfn);
+	if (ret)
+		return false;
+
+	result =3D is_hugetlb_page(pfn);
+
+	put_pfn(pfn, prot);
+
+	return result;
+}
+
+/*
+ * Get the number of residual PAGE_SIZE pages in a hugetlb page,
+ * including the page that this address points to.
+ * @address: count residual pages from this address to the end of
+ * the hugetlb page
+ * @order: the order of that hugetlb page
+ */
+static long
+hugetlb_get_residual_pages(unsigned long address, unsigned int order)
+{
+	unsigned long hugetlb_npage;
+	unsigned long hugetlb_mask;
+
+	if (!order)
+		return -EINVAL;
+
+	hugetlb_npage =3D 1UL << order;
+	hugetlb_mask =3D hugetlb_npage - 1;
+	address =3D address >> PAGE_SHIFT;
+
+	/*
+	 * Since we count the page pointed by this address, the number of
+	 * residual PAGE_SIZE-pages is greater than or equal to 1.
+	 */
+	return hugetlb_npage - (address & hugetlb_mask);
+}
+
+static unsigned int
+hugetlb_page_get_externally_pinned_num(struct vfio_dma *dma,
+				unsigned long start,
+				unsigned long npage)
+{
+	struct vfio_pfn *vpfn;
+	struct rb_node *node;
+	unsigned long end;
+	unsigned int num =3D 0;
+
+	if (!dma || !npage)
+		return 0;
+
+	end =3D start + npage - 1;
+	/* If we find a page in dma->pfn_list, this page has been pinned extern=
ally */
+	for (node =3D rb_first(&dma->pfn_list); node; node =3D rb_next(node)) {
+		vpfn =3D rb_entry(node, struct vfio_pfn, node);
+		if ((vpfn->pfn >=3D start) && (vpfn->pfn <=3D end))
+			num++;
+	}
+
+	return num;
+}
+
+static long hugetlb_page_vaddr_get_pfn(struct mm_struct *mm, unsigned lo=
ng vaddr,
+					int prot, long npage, unsigned long pfn)
+{
+	long hugetlb_residual_npage;
+	struct page *head;
+	int ret =3D 0;
+	unsigned int contiguous_npage;
+	struct page **pages =3D NULL;
+	unsigned int flags =3D 0;
+
+	if ((npage < 0) || !pfn_valid(pfn))
+		return -EINVAL;
+
+	/* all pages are done? */
+	if (!npage)
+		goto out;
+	/*
+	 * Since pfn is valid,
+	 * hugetlb_residual_npage is greater than or equal to 1.
+	 */
+	head =3D compound_head(pfn_to_page(pfn));
+	hugetlb_residual_npage =3D hugetlb_get_residual_pages(vaddr,
+						compound_order(head));
+	/* The page at vaddr was already pinned by vaddr_get_pfn() */
+	contiguous_npage =3D min_t(long, (hugetlb_residual_npage - 1), npage);
+	/* Only the page at vaddr is left in this hugetlb page. */
+	if (!contiguous_npage)
+		goto out;
+
+	pages =3D kvmalloc_array(contiguous_npage, sizeof(struct page *), GFP_K=
ERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	if (prot & IOMMU_WRITE)
+		flags |=3D FOLL_WRITE;
+
+	mmap_read_lock(mm);
+	/* The number of pages pinned may be less than contiguous_npage */
+	ret =3D pin_user_pages_remote(NULL, mm, vaddr + PAGE_SIZE, contiguous_n=
page,
+				flags | FOLL_LONGTERM, pages, NULL, NULL);
+	mmap_read_unlock(mm);
+out:
+	if (pages)
+		kvfree(pages);
+	return ret;
+}
+
+static long vfio_pin_hugetlb_pages_remote(struct vfio_dma *dma, unsigned=
 long vaddr,
+				  long npage, unsigned long *pfn_base,
+				  unsigned long limit)
+{
+	unsigned long pfn =3D 0;
+	long ret, pinned =3D 0, lock_acct =3D 0;
+	dma_addr_t iova =3D vaddr - dma->vaddr + dma->iova;
+	long pinned_loop, i;
+
+	/* This code path is only user initiated */
+	if (!current->mm)
+		return -ENODEV;
+
+	ret =3D vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+	if (ret)
+		return ret;
+
+	pinned++;
+	/*
+	 * Since PG_reserved is not relevant for compound pages
+	 * and the pfn of every PAGE_SIZE page in a hugetlb page is valid,
+	 * it is not necessary to check rsvd for hugetlb pages.
+	 */
+	if (!vfio_find_vpfn(dma, iova)) {
+		if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
+			put_pfn(*pfn_base, dma->prot);
+			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+				limit << PAGE_SHIFT);
+			return -ENOMEM;
+		}
+		lock_acct++;
+	}
+
+	/* Lock all the consecutive pages from pfn_base */
+	for (vaddr +=3D PAGE_SIZE, iova +=3D PAGE_SIZE; pinned < npage;
+	     pinned +=3D pinned_loop, vaddr +=3D pinned_loop * PAGE_SIZE,
+	     iova +=3D pinned_loop * PAGE_SIZE) {
+		ret =3D vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+		if (ret)
+			break;
+
+		if (pfn !=3D *pfn_base + pinned ||
+		    !is_hugetlb_page(pfn)) {
+			put_pfn(pfn, dma->prot);
+			break;
+		}
+
+		pinned_loop =3D 1;
+		/*
+		 * It is possible that the page of vaddr is the last PAGE_SIZE-page.
+		 * In this case, vaddr + PAGE_SIZE might be another hugetlb page.
+		 */
+		ret =3D hugetlb_page_vaddr_get_pfn(current->mm, vaddr, dma->prot,
+						npage - pinned - pinned_loop, pfn);
+		if (ret < 0) {
+			put_pfn(pfn, dma->prot);
+			break;
+		}
+
+		pinned_loop +=3D ret;
+		lock_acct +=3D pinned_loop - hugetlb_page_get_externally_pinned_num(dm=
a,
+			pfn, pinned_loop);
+
+		if (!dma->lock_cap &&
+		    current->mm->locked_vm + lock_acct > limit) {
+			for (i =3D 0; i < pinned_loop; i++)
+				put_pfn(pfn++, dma->prot);
+			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+				__func__, limit << PAGE_SHIFT);
+			ret =3D -ENOMEM;
+			goto unpin_out;
+		}
+	}
+
+	ret =3D vfio_lock_acct(dma, lock_acct, false);
+
+unpin_out:
+	if (ret) {
+		for (pfn =3D *pfn_base ; pinned ; pfn++, pinned--)
+			put_pfn(pfn, dma->prot);
+		return ret;
+	}
+
+	return pinned;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -858,6 +1074,57 @@ static size_t unmap_unpin_slow(struct vfio_domain *=
domain,
 	return unmapped;
 }
=20
+static size_t get_contiguous_pages(struct vfio_domain *domain, dma_addr_=
t start,
+				dma_addr_t end, phys_addr_t phys_base)
+{
+	size_t len;
+	phys_addr_t next;
+
+	if (!domain)
+		return 0;
+
+	for (len =3D PAGE_SIZE;
+	     !domain->fgsp && start + len < end; len +=3D PAGE_SIZE) {
+		next =3D iommu_iova_to_phys(domain->domain, start + len);
+		if (next !=3D phys_base + len)
+			break;
+	}
+
+	return len;
+}
+
+static size_t hugetlb_get_contiguous_pages(struct vfio_domain *domain, d=
ma_addr_t start,
+				dma_addr_t end, phys_addr_t phys_base)
+{
+	size_t len;
+	phys_addr_t next;
+	unsigned long contiguous_npage;
+	dma_addr_t max_len;
+	unsigned long hugetlb_residual_npage;
+	struct page *head;
+	unsigned long limit;
+
+	if (!domain)
+		return 0;
+
+	max_len =3D end - start;
+	for (len =3D PAGE_SIZE;
+	     !domain->fgsp && start + len < end; len +=3D contiguous_npage * PA=
GE_SIZE) {
+		next =3D iommu_iova_to_phys(domain->domain, start + len);
+		if ((next !=3D phys_base + len) ||
+		    !is_hugetlb_page(next >> PAGE_SHIFT))
+			break;
+
+		head =3D compound_head(pfn_to_page(next >> PAGE_SHIFT));
+		hugetlb_residual_npage =3D hugetlb_get_residual_pages(start + len,
+								compound_order(head));
+		limit =3D ALIGN((max_len - len), PAGE_SIZE) >> PAGE_SHIFT;
+		contiguous_npage =3D min_t(unsigned long, hugetlb_residual_npage, limi=
t);
+	}
+
+	return len;
+}
+
 static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *=
dma,
 			     bool do_accounting)
 {
@@ -892,7 +1159,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iomm=
u, struct vfio_dma *dma,
 	iommu_iotlb_gather_init(&iotlb_gather);
 	while (iova < end) {
 		size_t unmapped, len;
-		phys_addr_t phys, next;
+		phys_addr_t phys;
=20
 		phys =3D iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
@@ -905,12 +1172,10 @@ static long vfio_unmap_unpin(struct vfio_iommu *io=
mmu, struct vfio_dma *dma,
 		 * may require hardware cache flushing, try to find the
 		 * largest contiguous physical memory chunk to unmap.
 		 */
-		for (len =3D PAGE_SIZE;
-		     !domain->fgsp && iova + len < end; len +=3D PAGE_SIZE) {
-			next =3D iommu_iova_to_phys(domain->domain, iova + len);
-			if (next !=3D phys + len)
-				break;
-		}
+		if (is_hugetlb_page(phys >> PAGE_SHIFT))
+			len =3D hugetlb_get_contiguous_pages(domain, iova, end, phys);
+		else
+			len =3D get_contiguous_pages(domain, iova, end, phys);
=20
 		/*
 		 * First, try to use fast unmap/unpin. In case of failure,
@@ -1243,7 +1508,15 @@ static int vfio_pin_map_dma(struct vfio_iommu *iom=
mu, struct vfio_dma *dma,
=20
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage =3D vfio_pin_pages_remote(dma, vaddr + dma->size,
+		if (vaddr_is_hugetlb_page(vaddr + dma->size, dma->prot)) {
+			npage =3D vfio_pin_hugetlb_pages_remote(dma, vaddr + dma->size,
+					      size >> PAGE_SHIFT, &pfn, limit);
+			/* fall back to the normal page path on failure */
+			if (npage <=3D 0)
+				npage =3D vfio_pin_pages_remote(dma, vaddr + dma->size,
+					      size >> PAGE_SHIFT, &pfn, limit);
+		} else
+			npage =3D vfio_pin_pages_remote(dma, vaddr + dma->size,
 					      size >> PAGE_SHIFT, &pfn, limit);
 		if (npage <=3D 0) {
 			WARN_ON(!npage);
--=20
2.23.0