Subject: Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas
To: Joao Martins
Cc: Dan Williams, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton
References: <20201208172901.17384-1-joao.m.martins@oracle.com>
 <20201208172901.17384-11-joao.m.martins@oracle.com>
From: John Hubbard
Date: Tue, 8 Dec 2020 21:23:48 -0800
In-Reply-To: <20201208172901.17384-11-joao.m.martins@oracle.com>

On 12/8/20 9:29 AM, Joao Martins wrote:
> Similar to follow_hugetlb_page(), add a follow_devmap_page() which, rather
> than calling follow_page() per 4K page in a PMD/PUD, does so for the
> entire PMD/PUD: we lock the pmd/pud, get all pages, unlock.
>
> While doing so, we only change the refcount once when PGMAP_COMPOUND is
> passed in.
>
> This lets us improve {pin,get}_user_pages{,_longterm}() considerably:
>
> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
>
> ()                                 [before]  -> [after]
> (get_user_pages 2M pages)          ~150k us  -> ~8.9k us
> (pin_user_pages 2M pages)          ~192k us  -> ~9k us
> (pin_user_pages_longterm 2M pages) ~200k us  -> ~19k us
>

Yes, this is a massive improvement.

> Signed-off-by: Joao Martins
> ---
> I've special-cased this to device-dax vmas given its similar page size
> guarantees as hugetlbfs, but I feel this is a bit wrong. I am
> replicating follow_hugetlb_page() as RFC ought to seek feedback whether
> this should be generalized if no fundamental issues exist. In such case,
> should I be changing follow_page_mask() to take either an array of pages
> or a function pointer and opaque arguments which would let caller pick
> its structure?

I don't know which approach is better because I haven't yet attempted to
find the common elements in these routines. But if there is *any way* to
avoid this copy-paste creation of yet more following of pages, then it's
*very* good to do.
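
Just to make the function-pointer option a bit more concrete: this is purely a
sketch on my part, untested, and none of these names exist in the tree. I'd
picture __get_user_pages() handing each contiguous pte/pmd/pud worth of pages
to a caller-supplied callback, roughly:

    /*
     * Hypothetical interface only: the caller decides how to record and
     * refcount each contiguous run of pages, instead of gup filling in a
     * struct page *pages[] array one 4K entry at a time.
     */
    struct gup_record_ctx {
            void *private;
            int (*record)(struct page *head, unsigned long pfn_offset,
                          unsigned long nr_pages, unsigned int gup_flags,
                          void *private);
    };

    long get_user_pages_cb(struct mm_struct *mm, unsigned long start,
                           unsigned long nr_pages, unsigned int gup_flags,
                           struct gup_record_ctx *ctx, int *locked);

With something along those lines, the devdax (and hugetlbfs) cases could
supply a callback that does a single compound-head grab per huge page, and
today's array-filling behavior would just be the default callback.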

thanks,
--
John Hubbard
NVIDIA

> ---
>  include/linux/huge_mm.h |   4 +
>  include/linux/mm.h      |   2 +
>  mm/gup.c                |  22 ++++-
>  mm/huge_memory.c        | 202 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 227 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 0365aa97f8e7..da87ecea19e6 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -293,6 +293,10 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>                 pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>                 pud_t *pud, int flags, struct dev_pagemap **pgmap);
> +long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +               struct page **pages, struct vm_area_struct **vmas,
> +               unsigned long *position, unsigned long *nr_pages,
> +               long i, unsigned int flags, int *locked);
>
>  extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8b0155441835..466c88679628 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1164,6 +1164,8 @@ static inline void get_page(struct page *page)
>         page_ref_inc(page);
>  }
>
> +__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
> +               unsigned int flags);
>  bool __must_check try_grab_page(struct page *page, unsigned int flags);
>
>  static inline __must_check bool try_get_page(struct page *page)
> diff --git a/mm/gup.c b/mm/gup.c
> index 3a9a7229f418..50effb9cc349 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -78,7 +78,7 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
>   * considered failure, and furthermore, a likely bug in the caller, so a warning
>   * is also emitted.
>   */
> -static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> +__maybe_unused struct page *try_grab_compound_head(struct page *page,
>                                                            int refs,
>                                                            unsigned int flags)
>  {
> @@ -880,8 +880,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>   * does not include FOLL_NOWAIT, the mmap_lock may be released. If it
>   * is, *@locked will be set to 0 and -EBUSY returned.
>   */
> -static int faultin_page(struct vm_area_struct *vma,
> -               unsigned long address, unsigned int *flags, int *locked)
> +int faultin_page(struct vm_area_struct *vma,
> +               unsigned long address, unsigned int *flags, int *locked)
>  {
>         unsigned int fault_flags = 0;
>         vm_fault_t ret;
> @@ -1103,6 +1103,22 @@ static long __get_user_pages(struct mm_struct *mm,
>                         }
>                         continue;
>                 }
> +               if (vma_is_dax(vma)) {
> +                       i = follow_devmap_page(mm, vma, pages, vmas,
> +                                              &start, &nr_pages, i,
> +                                              gup_flags, locked);
> +                       if (locked && *locked == 0) {
> +                               /*
> +                                * We've got a VM_FAULT_RETRY
> +                                * and we've lost mmap_lock.
> +                                * We must stop here.
> +                                */
> +                               BUG_ON(gup_flags & FOLL_NOWAIT);
> +                               BUG_ON(ret != 0);
> +                               goto out;
> +                       }
> +                       continue;
> +               }
>         }
>  retry:
>         /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ec2bb93f7431..20bfbf211dc3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1168,6 +1168,208 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>         return page;
>  }
>
> +long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +               struct page **pages, struct vm_area_struct **vmas,
> +               unsigned long *position, unsigned long *nr_pages,
> +               long i, unsigned int flags, int *locked)
> +{
> +       unsigned long pfn_offset;
> +       unsigned long vaddr = *position;
> +       unsigned long remainder = *nr_pages;
> +       unsigned long align = vma_kernel_pagesize(vma);
> +       unsigned long align_nr_pages = align >> PAGE_SHIFT;
> +       unsigned long mask = ~(align-1);
> +       unsigned long nr_pages_hpage = 0;
> +       struct dev_pagemap *pgmap = NULL;
> +       int err = -EFAULT;
> +
> +       if (align == PAGE_SIZE)
> +               return i;
> +
> +       while (vaddr < vma->vm_end && remainder) {
> +               pte_t *pte;
> +               spinlock_t *ptl = NULL;
> +               int absent;
> +               struct page *page;
> +
> +               /*
> +                * If we have a pending SIGKILL, don't keep faulting pages and
> +                * potentially allocating memory.
> +                */
> +               if (fatal_signal_pending(current)) {
> +                       remainder = 0;
> +                       break;
> +               }
> +
> +               /*
> +                * Some archs (sparc64, sh*) have multiple pte_ts to
> +                * each hugepage. We have to make sure we get the
> +                * first, for the page indexing below to work.
> +                *
> +                * Note that page table lock is not held when pte is null.
> +                */
> +               pte = huge_pte_offset(mm, vaddr & mask, align);
> +               if (pte) {
> +                       if (align == PMD_SIZE)
> +                               ptl = pmd_lockptr(mm, (pmd_t *) pte);
> +                       else if (align == PUD_SIZE)
> +                               ptl = pud_lockptr(mm, (pud_t *) pte);
> +                       spin_lock(ptl);
> +               }
> +               absent = !pte || pte_none(ptep_get(pte));
> +
> +               if (absent && (flags & FOLL_DUMP)) {
> +                       if (pte)
> +                               spin_unlock(ptl);
> +                       remainder = 0;
> +                       break;
> +               }
> +
> +               if (absent ||
> +                   ((flags & FOLL_WRITE) &&
> +                    !pte_write(ptep_get(pte)))) {
> +                       vm_fault_t ret;
> +                       unsigned int fault_flags = 0;
> +
> +                       if (pte)
> +                               spin_unlock(ptl);
> +                       if (flags & FOLL_WRITE)
> +                               fault_flags |= FAULT_FLAG_WRITE;
> +                       if (locked)
> +                               fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +                                       FAULT_FLAG_KILLABLE;
> +                       if (flags & FOLL_NOWAIT)
> +                               fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +                                       FAULT_FLAG_RETRY_NOWAIT;
> +                       if (flags & FOLL_TRIED) {
> +                               /*
> +                                * Note: FAULT_FLAG_ALLOW_RETRY and
> +                                * FAULT_FLAG_TRIED can co-exist
> +                                */
> +                               fault_flags |= FAULT_FLAG_TRIED;
> +                       }
> +                       ret = handle_mm_fault(vma, vaddr, flags, NULL);
> +                       if (ret & VM_FAULT_ERROR) {
> +                               err = vm_fault_to_errno(ret, flags);
> +                               remainder = 0;
> +                               break;
> +                       }
> +                       if (ret & VM_FAULT_RETRY) {
> +                               if (locked &&
> +                                   !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
> +                                       *locked = 0;
> +                               *nr_pages = 0;
> +                               /*
> +                                * VM_FAULT_RETRY must not return an
> +                                * error, it will return zero
> +                                * instead.
> +                                *
> +                                * No need to update "position" as the
> +                                * caller will not check it after
> +                                * *nr_pages is set to 0.
> +                                */
> +                               return i;
> +                       }
> +                       continue;
> +               }
> +
> +               pfn_offset = (vaddr & ~mask) >> PAGE_SHIFT;
> +               page = pte_page(ptep_get(pte));
> +
> +               pgmap = get_dev_pagemap(page_to_pfn(page), pgmap);
> +               if (!pgmap) {
> +                       spin_unlock(ptl);
> +                       remainder = 0;
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               /*
> +                * If subpage information not requested, update counters
> +                * and skip the same_page loop below.
> +                */
> +               if (!pages && !vmas && !pfn_offset &&
> +                   (vaddr + align < vma->vm_end) &&
> +                   (remainder >= (align_nr_pages))) {
> +                       vaddr += align;
> +                       remainder -= align_nr_pages;
> +                       i += align_nr_pages;
> +                       spin_unlock(ptl);
> +                       continue;
> +               }
> +
> +               nr_pages_hpage = 0;
> +
> +same_page:
> +               if (pages) {
> +                       pages[i] = mem_map_offset(page, pfn_offset);
> +
> +                       /*
> +                        * try_grab_page() should always succeed here, because:
> +                        * a) we hold the ptl lock, and b) we've just checked
> +                        * that the huge page is present in the page tables.
> +                        */
> +                       if (!(pgmap->flags & PGMAP_COMPOUND) &&
> +                           WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
> +                               spin_unlock(ptl);
> +                               remainder = 0;
> +                               err = -ENOMEM;
> +                               break;
> +                       }
> +
> +               }
> +
> +               if (vmas)
> +                       vmas[i] = vma;
> +
> +               vaddr += PAGE_SIZE;
> +               ++pfn_offset;
> +               --remainder;
> +               ++i;
> +               nr_pages_hpage++;
> +               if (vaddr < vma->vm_end && remainder &&
> +                   pfn_offset < align_nr_pages) {
> +                       /*
> +                        * We use pfn_offset to avoid touching the pageframes
> +                        * of this compound page.
> +                        */
> +                       goto same_page;
> +               } else {
> +                       /*
> +                        * try_grab_compound_head() should always succeed here,
> +                        * because: a) we hold the ptl lock, and b) we've just
> +                        * checked that the huge page is present in the page
> +                        * tables. If the huge page is present, then the tail
> +                        * pages must also be present. The ptl prevents the
> +                        * head page and tail pages from being rearranged in
> +                        * any way. So this page must be available at this
> +                        * point, unless the page refcount overflowed:
> +                        */
> +                       if ((pgmap->flags & PGMAP_COMPOUND) &&
> +                           WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
> +                                                                nr_pages_hpage,
> +                                                                flags))) {
> +                               put_dev_pagemap(pgmap);
> +                               spin_unlock(ptl);
> +                               remainder = 0;
> +                               err = -ENOMEM;
> +                               break;
> +                       }
> +                       put_dev_pagemap(pgmap);
> +               }
> +               spin_unlock(ptl);
> +       }
> +       *nr_pages = remainder;
> +       /*
> +        * setting position is actually required only if remainder is
> +        * not zero but it's faster not to add a "if (remainder)"
> +        * branch.
> +        */
> +       *position = vaddr;
> +
> +       return i ? i : err;
> +}
> +
>  int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>                   pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
>                   struct vm_area_struct *vma)
>
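
Side note, mostly for other readers of the thread: the refcounting difference
that buys the speedup above, restated in isolation. This is my paraphrase
rather than code lifted from the patch, and it simply assumes that
try_grab_page() and try_grab_compound_head() behave as declared in mm/gup.c:

    /* Without PGMAP_COMPOUND: one atomic refcount update per 4K subpage. */
    for (j = 0; j < nr_subpages; j++)
            if (WARN_ON_ONCE(!try_grab_page(pages[j], gup_flags)))
                    return -ENOMEM;

    /*
     * With PGMAP_COMPOUND: a single atomic update on the compound head
     * covers all nr_subpages pages of the PMD/PUD mapping.
     */
    if (WARN_ON_ONCE(!try_grab_compound_head(compound_head(pages[0]),
                                             nr_subpages, gup_flags)))
            return -ENOMEM;

In other words, one atomic op per 2M (or 1G) mapping instead of 512 (or
262144) of them, which lines up with the gup_benchmark numbers quoted above.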