From: James Houghton <jthoughton@google.com>
Date: Fri, 24 Feb 2023 09:39:12 -0800
Subject: Re: [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka,
	Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20230218002819.1486479-23-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
	<20230218002819.1486479-23-jthoughton@google.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, Feb 17, 2023 at 4:29 PM James Houghton <jthoughton@google.com> wrote:
>
> This allows fork() to work with high-granularity mappings.
> The page table structure is copied such that partially mapped regions
> will remain partially mapped in the same way for the new process.
>
> A page's reference count is incremented for *each* portion of it that
> is mapped in the page table. For example, if you have a PMD-mapped 1G
> page, the reference count will be incremented by 512.
>
> mapcount is handled similar to THPs: if you're completely mapping a
> hugepage, then the compound_mapcount is incremented. If you're mapping
> a part of it, the subpages that are getting mapped will have their
> mapcounts incremented.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1a1a71868dfd..2fe1eb6897d4 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -162,6 +162,8 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
>
>  void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
>  			 struct hstate *h, struct vm_area_struct *vma);
> +void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
> +			   struct hstate *h, struct vm_area_struct *vma);
>
>  void hugetlb_dup_vma_private(struct vm_area_struct *vma);
>  void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 693332b7e186..210c6f2b16a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -141,6 +141,37 @@ void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
>  		page_remove_rmap(subpage, vma, false);
>  	}
>  }
> +/*
> + * hugetlb_add_file_rmap() - increment the mapcounts for file-backed hugetlb
> + * pages appropriately.
> + *
> + * For pages that are being mapped with their hstate-level PTE (e.g., a 1G page
> + * being mapped with a 1G PUD), then we increment the compound_mapcount for the
> + * head page.
> + *
> + * For pages that are being mapped with high-granularity, we increment the
> + * mapcounts for the individual subpages that are getting mapped.
> + */
> +void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
> +			   struct hstate *h, struct vm_area_struct *vma)
> +{
> +	struct page *hpage = compound_head(subpage);
> +
> +	if (shift == huge_page_shift(h)) {
> +		VM_BUG_ON_PAGE(subpage != hpage, subpage);
> +		page_add_file_rmap(hpage, vma, true);
> +	} else {
> +		unsigned long nr_subpages = 1UL << (shift - PAGE_SHIFT);
> +		struct page *final_page = &subpage[nr_subpages];
> +
> +		VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage);
> +		/*
> +		 * Increment the mapcount on each page that is getting mapped.
> +		 */
> +		for (; subpage < final_page; ++subpage)
> +			page_add_file_rmap(subpage, vma, false);
> +	}
> +}
>
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  {
> @@ -5210,7 +5241,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			    struct vm_area_struct *src_vma)
>  {
>  	pte_t *src_pte, *dst_pte, entry;
> -	struct page *ptepage;
> +	struct hugetlb_pte src_hpte, dst_hpte;
> +	struct page *ptepage, *hpage;
>  	unsigned long addr;
>  	bool cow = is_cow_mapping(src_vma->vm_flags);
>  	struct hstate *h = hstate_vma(src_vma);
> @@ -5238,18 +5270,24 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	}
>
>  	last_addr_mask = hugetlb_mask_last_page(h);
> -	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
> +	addr = src_vma->vm_start;
> +	while (addr < src_vma->vm_end) {
>  		spinlock_t *src_ptl, *dst_ptl;
> -		src_pte = hugetlb_walk(src_vma, addr, sz);
> -		if (!src_pte) {
> -			addr |= last_addr_mask;
> +		unsigned long hpte_sz;
> +
> +		if (hugetlb_full_walk(&src_hpte, src_vma, addr)) {
> +			addr = (addr | last_addr_mask) + sz;
>  			continue;
>  		}
> -		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
> -		if (!dst_pte) {
> -			ret = -ENOMEM;
> +		ret = hugetlb_full_walk_alloc(&dst_hpte, dst_vma, addr,
> +					      hugetlb_pte_size(&src_hpte));
> +		if (ret)
>  			break;
> -		}
> +
> +		src_pte = src_hpte.ptep;
> +		dst_pte = dst_hpte.ptep;
> +
> +		hpte_sz = hugetlb_pte_size(&src_hpte);
>
>  		/*
>  		 * If the pagetables are shared don't copy or take references.
> @@ -5259,13 +5297,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		 * another vma. So page_count of ptep page is checked instead
>  		 * to reliably determine whether pte is shared.
>  		 */
> -		if (page_count(virt_to_page(dst_pte)) > 1) {
> -			addr |= last_addr_mask;
> +		if (hugetlb_pte_size(&dst_hpte) == sz &&
> +		    page_count(virt_to_page(dst_pte)) > 1) {
> +			addr = (addr | last_addr_mask) + sz;
>  			continue;
>  		}
>
> -		dst_ptl = huge_pte_lock(h, dst, dst_pte);
> -		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
> +		dst_ptl = hugetlb_pte_lock(&dst_hpte);
> +		src_ptl = hugetlb_pte_lockptr(&src_hpte);
>  		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>  		entry = huge_ptep_get(src_pte);
> again:
> @@ -5309,10 +5348,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			 */
>  			if (userfaultfd_wp(dst_vma))
>  				set_huge_pte_at(dst, addr, dst_pte, entry);
> +		} else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
> +			/* Retry the walk. */
> +			spin_unlock(src_ptl);
> +			spin_unlock(dst_ptl);
> +			continue;
>  		} else {
> -			entry = huge_ptep_get(src_pte);
>  			ptepage = pte_page(entry);
> -			get_page(ptepage);
> +			hpage = compound_head(ptepage);
> +			get_page(hpage);
>
>  			/*
>  			 * Failing to duplicate the anon rmap is a rare case
> @@ -5324,13 +5368,34 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			 * need to be without the pgtable locks since we could
>  			 * sleep during the process.
>  			 */
> -			if (!PageAnon(ptepage)) {
> -				page_add_file_rmap(ptepage, src_vma, true);
> -			} else if (page_try_dup_anon_rmap(ptepage, true,
> +			if (!PageAnon(hpage)) {
> +				hugetlb_add_file_rmap(ptepage,
> +						src_hpte.shift, h, src_vma);
> +			}
> +			/*
> +			 * It is currently impossible to get anonymous HugeTLB
> +			 * high-granularity mappings, so we use 'hpage' here.
> +			 *
> +			 * This will need to be changed when HGM support for
> +			 * anon mappings is added.
> +			 */
> +			else if (page_try_dup_anon_rmap(hpage, true,
>  					src_vma)) {
>  				pte_t src_pte_old = entry;
>  				struct folio *new_folio;
>
> +				/*
> +				 * If we are mapped at high granularity, we
> +				 * may end up allocating lots and lots of
> +				 * hugepages when we only need one. Bail out
> +				 * now.
> +				 */
> +				if (hugetlb_pte_size(&src_hpte) != sz) {
> +					put_page(hpage);
> +					ret = -EINVAL;
> +					break;
> +				}
> +

Although this block never executes, it should come after the following
spin_unlocks().

>  				spin_unlock(src_ptl);
>  				spin_unlock(dst_ptl);
>  				/* Do not use reserve as it's private owned */
> @@ -5342,7 +5407,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  				}
>  				copy_user_huge_page(&new_folio->page, ptepage, addr, dst_vma,
>  						    npages);
> -				put_page(ptepage);
> +				put_page(hpage);
>
>  				/* Install the new hugetlb folio if src pte stable */
>  				dst_ptl = huge_pte_lock(h, dst, dst_pte);
> @@ -5360,6 +5425,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  					hugetlb_install_folio(dst_vma, dst_pte, addr, new_folio);
>  				spin_unlock(src_ptl);
>  				spin_unlock(dst_ptl);
> +				addr += hugetlb_pte_size(&src_hpte);
>  				continue;
>  			}
>
> @@ -5376,10 +5442,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		}
>
>  		set_huge_pte_at(dst, addr, dst_pte, entry);
> -		hugetlb_count_add(npages, dst);
> +		hugetlb_count_add(
> +				hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
> +				dst);
>  		}
>  		spin_unlock(src_ptl);
>  		spin_unlock(dst_ptl);
> +		addr += hugetlb_pte_size(&src_hpte);
>  	}
>
>  	if (cow) {
> --
> 2.39.2.637.g21b0678d19-goog
>