From: Mina Almasry
Date: Wed, 26 May 2021 19:58:43 -0700
Subject: Re: [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map
To: Mike Kravetz
Cc: Linux-MM, open list, Axel Rasmussen, Peter Xu, Andrew Morton
In-Reply-To: <20210525233134.246444-3-mike.kravetz@oracle.com>
References: <78359cf0-6e28-2aaa-d17e-6519b117b3db@oracle.com> <20210525233134.246444-1-mike.kravetz@oracle.com> <20210525233134.246444-3-mike.kravetz@oracle.com>

On Tue, May 25, 2021 at 4:31 PM Mike Kravetz wrote:
>
> When a hugetlb page is allocated via alloc_huge_page, the reserve map
> as well as the global reservation count may be modified. In case of error
> after allocation, the count and map should be restored to their previous
> state if possible. The flag HPageRestoreRsvCnt indicates the global
> count was modified. Add a new flag HPG_restore_rsv_map to indicate the
> reserve map was modified. Note that during hugetlb page allocation the
> global count and reserve map could be modified independently.
> Therefore, two specific flags are needed.
>
> The routine restore_reserve_on_error is called to restore reserve data
> on error paths. Modify the routine to check for the HPG_restore_rsv_map
> flag and adjust the reserve map accordingly.
>

Should there be an equivalent function that fixes the reservation on
page freeing? restore_reserve_on_put_page() or something? I'm confused
that we need to restore the reservation on error, yet there seems to be
no function that restores the reservation on a normal page free.

> Add missing calls to restore_reserve_on_error to error paths of code
> calling alloc_huge_page.
>

Would it be a good idea to add a comment above alloc_huge_page() noting
that to unroll it, callers need to put_page() *and* call
restore_reserve_on_error()?
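To make both suggestions concrete, something like the sketch below; the
helper name, its placement, and the comment wording are purely
illustrative, not a claim about what the final form should be:

/*
 * alloc_huge_page() may modify the global reserve count and/or the
 * reserve map.  Callers unrolling a successful allocation on an error
 * path must call restore_reserve_on_error() before put_page(),
 * otherwise the reserve data is left inconsistent.
 */

/* Hypothetical wrapper for the common unroll sequence: */
static inline void restore_reserve_on_put_page(struct hstate *h,
                        struct vm_area_struct *vma, unsigned long address,
                        struct page *page)
{
        restore_reserve_on_error(h, vma, address, page);
        put_page(page);
}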
> Signed-off-by: Mike Kravetz
> ---
>  fs/hugetlbfs/inode.c    |  1 +
>  include/linux/hugetlb.h | 11 ++++++
>  mm/hugetlb.c            | 82 +++++++++++++++++++++++++++++++----------
>  3 files changed, 75 insertions(+), 19 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index bb4de5dcd652..9d846a2edc4b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>                 __SetPageUptodate(page);
>                 error = huge_add_to_page_cache(page, mapping, index);
>                 if (unlikely(error)) {
> +                       restore_reserve_on_error(h, &pseudo_vma, addr, page);
>                         put_page(page);
>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>                         goto out;
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e5e363fa5d02..da2251b0c609 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -517,6 +517,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   * Synchronization: Examined or modified by code that knows it has
>   *     the only reference to page. i.e. After allocation but before use
>   *     or when the page is being freed.
> + * HPG_restore_rsv_map - Set when a hugetlb page allocation results in adding
> + *     an entry to the reserve map. This can happen without adjustment of
> + *     the global reserve count. Cleared when page is fully instantiated.
> + *     Error paths (restore_reserve_on_error) check this flag to make
> + *     adjustments to the reserve map.
> + *     Synchronization: Examined or modified by code that knows it has
> + *     the only reference to page. i.e. After allocation but before use.
>   * HPG_migratable - Set after a newly allocated page is added to the page
>   *     cache and/or page tables. Indicates the page is a candidate for
>   *     migration.
> @@ -536,6 +543,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   */
>  enum hugetlb_page_flags {
>         HPG_restore_rsv_cnt = 0,
> +       HPG_restore_rsv_map,
>         HPG_migratable,
>         HPG_temporary,
>         HPG_freed,
> @@ -582,6 +590,7 @@ static inline void ClearHPage##uname(struct page *page)        \
>   * Create functions associated with hugetlb page flags
>   */
>  HPAGEFLAG(RestoreRsvCnt, restore_rsv_cnt)
> +HPAGEFLAG(RestoreRsvMap, restore_rsv_map)
>  HPAGEFLAG(Migratable, migratable)
>  HPAGEFLAG(Temporary, temporary)
>  HPAGEFLAG(Freed, freed)
> @@ -633,6 +642,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>                                 unsigned long address);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>                         pgoff_t idx);
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                               unsigned long address, struct page *page);
>
>  /* arch callback */
>  int __init __alloc_bootmem_huge_page(struct hstate *h);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2a8cea253388..1c3a68d70ab5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1551,6 +1551,7 @@ void free_huge_page(struct page *page)
>         page->mapping = NULL;
>         restore_reserve = HPageRestoreRsvCnt(page);
>         ClearHPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvMap(page);
>
>         /*
>          * If HPageRestoreRsvCnt was set on page, page allocation consumed a
> @@ -2360,24 +2361,26 @@ static long vma_add_reservation(struct hstate *h,
>  }
>
>  /*
> - * This routine is called to restore a reservation on error paths.  In the
> - * specific error paths, a huge page was allocated (via alloc_huge_page)
> - * and is about to be freed.  If a reservation for the page existed,
> - * alloc_huge_page would have consumed the reservation and set
> - * HPageRestoreRsvCnt in the newly allocated page.  When the page is freed
> - * via free_huge_page, the global reservation count will be incremented if
> - * HPageRestoreRsvCnt is set.  However, free_huge_page can not adjust the
> - * reserve map.  Adjust the reserve map here to be consistent with global
> - * reserve count adjustments to be made by free_huge_page.
> + * This routine is called to restore reservation data on error paths.
> + * It handles two specific cases for pages allocated via alloc_huge_page:
> + * 1) A reservation was in place and the page consumed the reservation.
> + *    HPageRestoreRsvCnt is set in the page.
> + * 2) No reservation was in place for the page, so HPageRestoreRsvCnt is
> + *    not set. However, the reserve map was updated.
> + * In case 1, free_huge_page will increment the global reserve count. But,
> + * free_huge_page does not have enough context to adjust the reservation map.
> + * This case deals primarily with private mappings. Adjust the reserve map
> + * here to be consistent with global reserve count adjustments to be made
> + * by free_huge_page.
> + * In case 2, simply undo any reserve map modifications done by
> + * alloc_huge_page.
>   */
> -static void restore_reserve_on_error(struct hstate *h,
> -                       struct vm_area_struct *vma, unsigned long address,
> -                       struct page *page)
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                               unsigned long address, struct page *page)
>  {
>         if (unlikely(HPageRestoreRsvCnt(page))) {
>                 long rc = vma_needs_reservation(h, vma, address);
>
> -               if (unlikely(rc < 0)) {
> +               if (unlikely(rc < 0))
>                         /*
>                          * Rare out of memory condition in reserve map
>                          * manipulation.  Clear HPageRestoreRsvCnt so that
> @@ -2390,16 +2393,47 @@ static void restore_reserve_on_error(struct hstate *h,
>                          * accounting of reserve counts.
>                          */
>                         ClearHPageRestoreRsvCnt(page);
> -               } else if (rc) {
> -                       rc = vma_add_reservation(h, vma, address);
> -                       if (unlikely(rc < 0))
> +               else if (rc)
> +                       vma_add_reservation(h, vma, address);
> +               else
> +                       vma_end_reservation(h, vma, address);
> +       } else if (unlikely(HPageRestoreRsvMap(page))) {
> +               struct resv_map *resv = vma_resv_map(vma);
> +               pgoff_t idx = vma_hugecache_offset(h, vma, address);
> +               long rc;
> +
> +               /*
> +                * This handles the specific case where the reserve count
> +                * was not updated during the page allocation process, but
> +                * the reserve map was updated.  We need to undo the reserve
> +                * map update.
> +                *
> +                * The presence of an entry in the reserve map has opposite
> +                * meanings for shared and private mappings.
> +                */
> +               if (vma->vm_flags & VM_MAYSHARE) {
> +                       rc = region_del(resv, idx, idx + 1);
> +                       if (rc < 0)
> +                               /*
> +                                * Rare out of memory condition.  Since we can
> +                                * not delete the reserve entry, set
> +                                * HPageRestoreRsvCnt so that the global count
> +                                * will be consistent with the reserve map.
> +                                */
> +                               SetHPageRestoreRsvCnt(page);
> +               } else {
> +                       rc = vma_needs_reservation(h, vma, address);
> +                       if (rc < 0)
>                                 /*
>                                  * See above comment about rare out of
>                                  * memory condition.
>                                  */
> -                               ClearHPageRestoreRsvCnt(page);
> -               } else
> -                       vma_end_reservation(h, vma, address);
> +                               SetHPageRestoreRsvCnt(page);
> +                       else if (rc)
> +                               vma_add_reservation(h, vma, address);
> +                       else
> +                               vma_end_reservation(h, vma, address);
> +               }

As I mentioned in the other email, this call sequence does not result
in the region_del() call that we really need here. Calling region_del()
directly would be one fix; another would be to call
vma_end_reservation() even if !rc. Not sure which is more semantically
correct. hugetlb_unreserve_pages() calls region_del() indiscriminately.
A sketch of the first option follows.
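For what it's worth, the direct region_del() option in the private
mapping branch would look something like this (untested sketch, reusing
rc, resv, and idx from the quoted hunk, only meant to illustrate the
first of the two fixes mentioned above):

        } else {
                /*
                 * Undo the entry alloc_huge_page added to the private
                 * mapping's reserve map by deleting it outright instead
                 * of going through the vma_*_reservation() helpers.
                 */
                rc = region_del(resv, idx, idx + 1);
                if (rc < 0)
                        /* Same rare OOM fallback as the VM_MAYSHARE case. */
                        SetHPageRestoreRsvCnt(page);
        }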
>         }
>  }
>
> @@ -2641,6 +2675,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                 hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>                                 pages_per_huge_page(h), page);
>         }
> +       if (map_commit)
> +               SetHPageRestoreRsvMap(page);
>         return page;
>
>  out_uncharge_cgroup:
> @@ -4053,6 +4089,7 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
>         hugepage_add_new_anon_rmap(new_page, vma, addr);
>         hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
>         ClearHPageRestoreRsvCnt(new_page);
> +       ClearHPageRestoreRsvMap(new_page);
>         SetHPageMigratable(new_page);
>  }
>
> @@ -4174,6 +4211,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                         entry = huge_ptep_get(src_pte);
>                         if (!pte_same(src_pte_old, entry)) {
> +                               restore_reserve_on_error(h, vma, addr,
> +                                                               new);
>                                 put_page(new);
>                                 /* dst_entry won't change as in child */
>                                 goto again;
> @@ -4526,6 +4565,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>         ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>         if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
>                 ClearHPageRestoreRsvCnt(new_page);
> +               ClearHPageRestoreRsvMap(new_page);
>
>                 /* Break COW */
>                 huge_ptep_clear_flush(vma, haddr, ptep);
> @@ -4593,6 +4633,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>         if (err)
>                 return err;
>         ClearHPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvMap(page);
>
>         /*
>          * set page dirty so that it will not be removed from cache/file
> @@ -4776,6 +4817,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>
>         if (anon_rmap) {
>                 ClearHPageRestoreRsvCnt(page);
> +               ClearHPageRestoreRsvMap(page);
>                 hugepage_add_new_anon_rmap(page, vma, haddr);
>         } else
>                 page_dup_rmap(page, true);
> @@ -5097,6 +5139,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                 page_dup_rmap(page, true);
>         } else {
>                 ClearHPageRestoreRsvCnt(page);
> +               ClearHPageRestoreRsvMap(page);
>                 hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
>         }
>
> @@ -5133,6 +5176,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         if (vm_shared || is_continue)
>                 unlock_page(page);
>  out_release_nounlock:
> +       restore_reserve_on_error(h, dst_vma, dst_addr, page);
>         put_page(page);
>         goto out;
>  }
> --
> 2.31.1
>