From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 21 Oct 2022 16:36:34 +0000
Subject: [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM
Message-ID: <20221021163703.3218176-19-jthoughton@google.com>
In-Reply-To: <20221021163703.3218176-1-jthoughton@google.com>
References: <20221021163703.3218176-1-jthoughton@google.com>
From: James Houghton <jthoughton@google.com>
To: Mike Kravetz, Muchun Song, Peter Xu
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
    "Zach O'Keefe", Manish Mishra, Naoya Horiguchi,
    "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)", Vlastimil Babka,
    Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton

This enables high-granularity mapping support in GUP.

One important change here is that, before, we never needed to grab the
VMA lock, but now, to prevent someone from collapsing the page tables
out from under us, we grab it for reading when doing high-granularity
page table walks.

In case it is confusing: pfn_offset is the offset (in PAGE_SIZE units)
of vaddr within the region mapped by hpte, i.e. within the subpage that
hpte points to.
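To make that pfn_offset arithmetic concrete, here is a small standalone
userspace sketch (illustration only, not part of the patch): a plain
"shift" argument stands in for the kernel-side hugetlb_pte_mask(), and
4K base pages are assumed.

/*
 * Illustration only (not kernel code): how pfn_offset is derived from
 * the size of the region that the hugetlb_pte maps.  "shift" stands in
 * for the mapping level (e.g. 21 for a 2M-level mapping within a 1G
 * hugepage); PAGE_SHIFT is assumed to be 12 (4K base pages).
 */
#include <stdio.h>

#define PAGE_SHIFT 12UL

static unsigned long pfn_offset(unsigned long vaddr, unsigned long shift)
{
        /* Mirrors (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT. */
        unsigned long mask = ~((1UL << shift) - 1);

        return (vaddr & ~mask) >> PAGE_SHIFT;
}

int main(void)
{
        /* A vaddr one 4K page into the 2M region that the hpte maps... */
        printf("%lu\n", pfn_offset(0x40201000UL, 21)); /* prints 1 */
        /* ...so nth_page(subpage, 1) is the exact base page for vaddr. */
        return 0;
}

With HGM, the hugetlb_pte may map a region smaller than
huge_page_size(h), which is why the patch below derives pfn_offset and
pages_per_hpte from the hugetlb_pte helpers instead of from the hstate.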
Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 23 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d096cef53cd..d76ab32fb6d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6382,11 +6382,9 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
         }
 }
 
-static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t pteval,
                                                bool *unshare)
 {
-        pte_t pteval = huge_ptep_get(pte);
-
         *unshare = false;
         if (is_swap_pte(pteval))
                 return true;
@@ -6478,12 +6476,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
         struct hstate *h = hstate_vma(vma);
         int err = -EFAULT, refs;
 
+        /*
+         * Grab the VMA lock for reading now so no one can collapse the page
+         * table from under us.
+         */
+        hugetlb_vma_lock_read(vma);
+
         while (vaddr < vma->vm_end && remainder) {
-                pte_t *pte;
+                pte_t *ptep, pte;
                 spinlock_t *ptl = NULL;
                 bool unshare = false;
                 int absent;
-                struct page *page;
+                unsigned long pages_per_hpte;
+                struct page *page, *subpage;
+                struct hugetlb_pte hpte;
 
                 /*
                  * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6499,13 +6505,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                  * each hugepage.  We have to make sure we get the
                  * first, for the page indexing below to work.
                  *
-                 * Note that page table lock is not held when pte is null.
+                 * Note that page table lock is not held when ptep is null.
                  */
-                pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
-                                      huge_page_size(h));
-                if (pte)
-                        ptl = huge_pte_lock(h, mm, pte);
-                absent = !pte || huge_pte_none(huge_ptep_get(pte));
+                ptep = huge_pte_offset(mm, vaddr & huge_page_mask(h),
+                                       huge_page_size(h));
+                if (ptep) {
+                        hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+                                        hpage_size_to_level(huge_page_size(h)));
+                        hugetlb_hgm_walk(mm, vma, &hpte, vaddr,
+                                        PAGE_SIZE,
+                                        /*stop_at_none=*/true);
+                        ptl = hugetlb_pte_lock(mm, &hpte);
+                        ptep = hpte.ptep;
+                        pte = huge_ptep_get(ptep);
+                }
+
+                absent = !ptep || huge_pte_none(pte);
 
                 /*
                  * When coredumping, it suits get_dump_page if we just return
@@ -6516,12 +6531,19 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                  */
                 if (absent && (flags & FOLL_DUMP) &&
                     !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-                        if (pte)
+                        if (ptep)
                                 spin_unlock(ptl);
                         remainder = 0;
                         break;
                 }
 
+                if (!absent && pte_present(pte) &&
+                                !hugetlb_pte_present_leaf(&hpte, pte)) {
+                        /* We raced with someone splitting the PTE, so retry. */
+                        spin_unlock(ptl);
+                        continue;
+                }
+
                 /*
                  * We need call hugetlb_fault for both hugepages under migration
                  * (in which case hugetlb_fault waits for the migration,) and
@@ -6537,7 +6559,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                         vm_fault_t ret;
                         unsigned int fault_flags = 0;
 
-                        if (pte)
+                        /* Drop the lock before entering hugetlb_fault. */
+                        hugetlb_vma_unlock_read(vma);
+
+                        if (ptep)
                                 spin_unlock(ptl);
                         if (flags & FOLL_WRITE)
                                 fault_flags |= FAULT_FLAG_WRITE;
@@ -6560,7 +6585,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                         if (ret & VM_FAULT_ERROR) {
                                 err = vm_fault_to_errno(ret, flags);
                                 remainder = 0;
-                                break;
+                                goto out;
                         }
                         if (ret & VM_FAULT_RETRY) {
                                 if (locked &&
@@ -6578,11 +6603,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                                  */
                                 return i;
                         }
+                        hugetlb_vma_lock_read(vma);
                         continue;
                 }
 
-                pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-                page = pte_page(huge_ptep_get(pte));
+                pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+                subpage = pte_page(pte);
+                pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+                page = compound_head(subpage);
 
                 VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
                                !PageAnonExclusive(page), page);
@@ -6592,21 +6620,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                  * and skip the same_page loop below.
                  */
                 if (!pages && !vmas && !pfn_offset &&
-                    (vaddr + huge_page_size(h) < vma->vm_end) &&
-                    (remainder >= pages_per_huge_page(h))) {
-                        vaddr += huge_page_size(h);
-                        remainder -= pages_per_huge_page(h);
-                        i += pages_per_huge_page(h);
+                    (vaddr + pages_per_hpte < vma->vm_end) &&
+                    (remainder >= pages_per_hpte)) {
+                        vaddr += pages_per_hpte;
+                        remainder -= pages_per_hpte;
+                        i += pages_per_hpte;
                         spin_unlock(ptl);
                         continue;
                 }
 
                 /* vaddr may not be aligned to PAGE_SIZE */
-                refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+                refs = min3(pages_per_hpte - pfn_offset, remainder,
                     (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
                 if (pages || vmas)
-                        record_subpages_vmas(nth_page(page, pfn_offset),
+                        record_subpages_vmas(nth_page(subpage, pfn_offset),
                                              vma, refs,
                                              likely(pages) ? pages + i : NULL,
                                              vmas ? vmas + i : NULL);
@@ -6637,6 +6665,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
                 spin_unlock(ptl);
         }
+        hugetlb_vma_unlock_read(vma);
+out:
         *nr_pages = remainder;
         /*
          * setting position is actually required only if remainder is
-- 
2.38.0.135.g90850a2211-goog