From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 21 Oct 2022 16:36:34 +0000
Subject: [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM
Message-ID: <20221021163703.3218176-19-jthoughton@google.com>
In-Reply-To: <20221021163703.3218176-1-jthoughton@google.com>
References: <20221021163703.3218176-1-jthoughton@google.com>
From: James Houghton <jthoughton@google.com>
To: Mike Kravetz, Muchun Song, Peter Xu
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
    "Zach O'Keefe", Manish Mishra, Naoya Horiguchi,
    "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)", Vlastimil Babka,
    Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton

This enables high-granularity mapping support in GUP.

One important change here is that, before, we never needed to grab the
VMA lock, but now, to prevent someone from collapsing the page tables
out from under us, we grab it for reading when doing high-granularity
page table walks.

In case it is confusing: pfn_offset is the offset (in PAGE_SIZE units)
of vaddr within the region mapped by hpte, i.e. within the subpage that
hpte points to.
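To make that pfn_offset arithmetic concrete, here is a small standalone
userspace sketch (illustration only, not part of the patch): a plain
"shift" argument stands in for the kernel-side hugetlb_pte_mask(), and
4K base pages are assumed.

/*
 * Illustration only (not kernel code): how pfn_offset is derived from
 * the size of the region that the hugetlb_pte maps.  "shift" stands in
 * for the mapping level (e.g. 21 for a 2M-level mapping within a 1G
 * hugepage); PAGE_SHIFT is assumed to be 12 (4K base pages).
 */
#include <stdio.h>

#define PAGE_SHIFT 12UL

static unsigned long pfn_offset(unsigned long vaddr, unsigned long shift)
{
        /* Mirrors (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT. */
        unsigned long mask = ~((1UL << shift) - 1);

        return (vaddr & ~mask) >> PAGE_SHIFT;
}

int main(void)
{
        /* A vaddr one 4K page into the 2M region that the hpte maps... */
        printf("%lu\n", pfn_offset(0x40201000UL, 21)); /* prints 1 */
        /* ...so nth_page(subpage, 1) is the exact base page for vaddr. */
        return 0;
}

With HGM, the hugetlb_pte may map a region smaller than
huge_page_size(h), which is why the patch below derives pfn_offset and
pages_per_hpte from the hugetlb_pte helpers instead of from the hstate.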
Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 23 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d096cef53cd..d76ab32fb6d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6382,11 +6382,9 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
         }
 }
 
-static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t pteval,
                                                bool *unshare)
 {
-        pte_t pteval = huge_ptep_get(pte);
-
         *unshare = false;
         if (is_swap_pte(pteval))
                 return true;
@@ -6478,12 +6476,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
         struct hstate *h = hstate_vma(vma);
         int err = -EFAULT, refs;
 
+        /*
+         * Grab the VMA lock for reading now so no one can collapse the page
+         * table from under us.
+         */
+        hugetlb_vma_lock_read(vma);
+
         while (vaddr < vma->vm_end && remainder) {
-                pte_t *pte;
+                pte_t *ptep, pte;
                 spinlock_t *ptl = NULL;
                 bool unshare = false;
                 int absent;
-                struct page *page;
+                unsigned long pages_per_hpte;
+                struct page *page, *subpage;
+                struct hugetlb_pte hpte;
 
                 /*
                  * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6499,13 +6505,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                  * each hugepage.  We have to make sure we get the
                  * first, for the page indexing below to work.
                  *
-                 * Note that page table lock is not held when pte is null.
+                 * Note that page table lock is not held when ptep is null.
                  */
-                pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
-                                      huge_page_size(h));
-                if (pte)
-                        ptl = huge_pte_lock(h, mm, pte);
-                absent = !pte || huge_pte_none(huge_ptep_get(pte));
+                ptep = huge_pte_offset(mm, vaddr & huge_page_mask(h),
+                                       huge_page_size(h));
+                if (ptep) {
+                        hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+                                        hpage_size_to_level(huge_page_size(h)));
+                        hugetlb_hgm_walk(mm, vma, &hpte, vaddr,
+                                        PAGE_SIZE,
+                                        /*stop_at_none=*/true);
+                        ptl = hugetlb_pte_lock(mm, &hpte);
+                        ptep = hpte.ptep;
+                        pte = huge_ptep_get(ptep);
+                }
+
+                absent = !ptep || huge_pte_none(pte);
 
                 /*
                  * When coredumping, it suits get_dump_page if we just return
@@ -6516,12 +6531,19 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                  */
                 if (absent && (flags & FOLL_DUMP) &&
                     !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-                        if (pte)
+                        if (ptep)
                                 spin_unlock(ptl);
                         remainder = 0;
                         break;
                 }
 
+                if (!absent && pte_present(pte) &&
+                                !hugetlb_pte_present_leaf(&hpte, pte)) {
+                        /* We raced with someone splitting the PTE, so retry. */
+                        spin_unlock(ptl);
+                        continue;
+                }
+
                 /*
                  * We need call hugetlb_fault for both hugepages under migration
                  * (in which case hugetlb_fault waits for the migration,) and
@@ -6537,7 +6559,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                         vm_fault_t ret;
                         unsigned int fault_flags = 0;
 
-                        if (pte)
+                        /* Drop the lock before entering hugetlb_fault. */
+                        hugetlb_vma_unlock_read(vma);
+
+                        if (ptep)
                                 spin_unlock(ptl);
                         if (flags & FOLL_WRITE)
                                 fault_flags |= FAULT_FLAG_WRITE;
@@ -6560,7 +6585,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                         if (ret & VM_FAULT_ERROR) {
                                 err = vm_fault_to_errno(ret, flags);
                                 remainder = 0;
-                                break;
+                                goto out;
                         }
                         if (ret & VM_FAULT_RETRY) {
                                 if (locked &&
@@ -6578,11 +6603,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                                  */
                                 return i;
                         }
+                        hugetlb_vma_lock_read(vma);
                         continue;
                 }
 
-                pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-                page = pte_page(huge_ptep_get(pte));
+                pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+                subpage = pte_page(pte);
+                pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+                page = compound_head(subpage);
 
                 VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
                                !PageAnonExclusive(page), page);
@@ -6592,21 +6620,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                  * and skip the same_page loop below.
                  */
                 if (!pages && !vmas && !pfn_offset &&
-                    (vaddr + huge_page_size(h) < vma->vm_end) &&
-                    (remainder >= pages_per_huge_page(h))) {
-                        vaddr += huge_page_size(h);
-                        remainder -= pages_per_huge_page(h);
-                        i += pages_per_huge_page(h);
+                    (vaddr + pages_per_hpte < vma->vm_end) &&
+                    (remainder >= pages_per_hpte)) {
+                        vaddr += pages_per_hpte;
+                        remainder -= pages_per_hpte;
+                        i += pages_per_hpte;
                         spin_unlock(ptl);
                         continue;
                 }
 
                 /* vaddr may not be aligned to PAGE_SIZE */
-                refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+                refs = min3(pages_per_hpte - pfn_offset, remainder,
                     (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
                 if (pages || vmas)
-                        record_subpages_vmas(nth_page(page, pfn_offset),
+                        record_subpages_vmas(nth_page(subpage, pfn_offset),
                                              vma, refs,
                                              likely(pages) ? pages + i : NULL,
                                              vmas ? vmas + i : NULL);
@@ -6637,6 +6665,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
                 spin_unlock(ptl);
         }
+        hugetlb_vma_unlock_read(vma);
+out:
         *nr_pages = remainder;
         /*
          * setting position is actually required only if remainder is
-- 
2.38.0.135.g90850a2211-goog