From: Mina Almasry
Date: Wed, 26 May 2021 19:58:43 -0700
Subject: Re: [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map
To: Mike Kravetz
Cc: Linux-MM, open list, Axel Rasmussen, Peter Xu, Andrew Morton
In-Reply-To: <20210525233134.246444-3-mike.kravetz@oracle.com>
References: <78359cf0-6e28-2aaa-d17e-6519b117b3db@oracle.com> <20210525233134.246444-1-mike.kravetz@oracle.com> <20210525233134.246444-3-mike.kravetz@oracle.com>

On Tue, May 25, 2021 at 4:31 PM Mike Kravetz wrote:
>
> When a hugetlb page is allocated via alloc_huge_page, the reserve map
> as well as the global reservation count may be modified. In case of error
> after allocation, the count and map should be restored to their previous
> state if possible. The flag HPageRestoreRsvCnt indicates the global
> count was modified. Add a new flag HPG_restore_rsv_map to indicate the
> reserve map was modified. Note that during hugetlb page allocation the
> global count and reserve map could be modified independently.
> Therefore, two specific flags are needed.
>
> The routine restore_reserve_on_error is called to restore reserve data
> on error paths. Modify the routine to check for the HPG_restore_rsv_map
> flag and adjust the reserve map accordingly.
>

Should there be an equivalent function that fixes the reservation on
page freeing? restore_reserve_on_put_page() or something? I'm confused
that we need to restore the reservation on error, yet there seems to be
no function that restores the reservation on a normal page free.

> Add missing calls to restore_reserve_on_error to error paths of code
> calling alloc_huge_page.
>

Would it be a good idea to add a comment above alloc_huge_page() noting
that to unroll it, callers need to put_page() *and* call
restore_reserve_on_error()?
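To make both suggestions concrete, something like the sketch below; the
helper name, its placement, and the comment wording are purely
illustrative, not a claim about what the final form should be:

/*
 * alloc_huge_page() may modify the global reserve count and/or the
 * reserve map.  Callers unrolling a successful allocation on an error
 * path must call restore_reserve_on_error() before put_page(),
 * otherwise the reserve data is left inconsistent.
 */

/* Hypothetical wrapper for the common unroll sequence: */
static inline void restore_reserve_on_put_page(struct hstate *h,
                        struct vm_area_struct *vma, unsigned long address,
                        struct page *page)
{
        restore_reserve_on_error(h, vma, address, page);
        put_page(page);
}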
> Signed-off-by: Mike Kravetz
> ---
>  fs/hugetlbfs/inode.c    |  1 +
>  include/linux/hugetlb.h | 11 ++++++
>  mm/hugetlb.c            | 82 +++++++++++++++++++++++++++++++----------
>  3 files changed, 75 insertions(+), 19 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index bb4de5dcd652..9d846a2edc4b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>                 __SetPageUptodate(page);
>                 error = huge_add_to_page_cache(page, mapping, index);
>                 if (unlikely(error)) {
> +                       restore_reserve_on_error(h, &pseudo_vma, addr, page);
>                         put_page(page);
>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>                         goto out;
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e5e363fa5d02..da2251b0c609 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -517,6 +517,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   * Synchronization: Examined or modified by code that knows it has
>   *     the only reference to page. i.e. After allocation but before use
>   *     or when the page is being freed.
> + * HPG_restore_rsv_map - Set when a hugetlb page allocation results in adding
> + *     an entry to the reserve map. This can happen without adjustment of
> + *     the global reserve count. Cleared when page is fully instantiated.
> + *     Error paths (restore_reserve_on_error) check this flag to make
> + *     adjustments to the reserve map.
> + *     Synchronization: Examined or modified by code that knows it has
> + *     the only reference to page. i.e. After allocation but before use.
>   * HPG_migratable - Set after a newly allocated page is added to the page
>   *     cache and/or page tables. Indicates the page is a candidate for
>   *     migration.
> @@ -536,6 +543,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   */
>  enum hugetlb_page_flags {
>         HPG_restore_rsv_cnt = 0,
> +       HPG_restore_rsv_map,
>         HPG_migratable,
>         HPG_temporary,
>         HPG_freed,
> @@ -582,6 +590,7 @@ static inline void ClearHPage##uname(struct page *page)        \
>   * Create functions associated with hugetlb page flags
>   */
>  HPAGEFLAG(RestoreRsvCnt, restore_rsv_cnt)
> +HPAGEFLAG(RestoreRsvMap, restore_rsv_map)
>  HPAGEFLAG(Migratable, migratable)
>  HPAGEFLAG(Temporary, temporary)
>  HPAGEFLAG(Freed, freed)
> @@ -633,6 +642,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>                                 unsigned long address);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>                         pgoff_t idx);
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                               unsigned long address, struct page *page);
>
>  /* arch callback */
>  int __init __alloc_bootmem_huge_page(struct hstate *h);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2a8cea253388..1c3a68d70ab5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1551,6 +1551,7 @@ void free_huge_page(struct page *page)
>         page->mapping = NULL;
>         restore_reserve = HPageRestoreRsvCnt(page);
>         ClearHPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvMap(page);
>
>         /*
>          * If HPageRestoreRsvCnt was set on page, page allocation consumed a
> @@ -2360,24 +2361,26 @@ static long vma_add_reservation(struct hstate *h,
>  }
>
>  /*
> - * This routine is called to restore a reservation on error paths.  In the
> - * specific error paths, a huge page was allocated (via alloc_huge_page)
> - * and is about to be freed.  If a reservation for the page existed,
> - * alloc_huge_page would have consumed the reservation and set
> - * HPageRestoreRsvCnt in the newly allocated page.  When the page is freed
> - * via free_huge_page, the global reservation count will be incremented if
> - * HPageRestoreRsvCnt is set.  However, free_huge_page can not adjust the
> - * reserve map.  Adjust the reserve map here to be consistent with global
> - * reserve count adjustments to be made by free_huge_page.
> + * This routine is called to restore reservation data on error paths.
> + * It handles two specific cases for pages allocated via alloc_huge_page:
> + * 1) A reservation was in place and the page consumed the reservation.
> + *    HPageRestoreRsvCnt is set in the page.
> + * 2) No reservation was in place for the page, so HPageRestoreRsvCnt is
> + *    not set. However, the reserve map was updated.
> + * In case 1, free_huge_page will increment the global reserve count. But,
> + * free_huge_page does not have enough context to adjust the reservation map.
> + * This case deals primarily with private mappings. Adjust the reserve map
> + * here to be consistent with global reserve count adjustments to be made
> + * by free_huge_page.
> + * In case 2, simply undo any reserve map modifications done by
> + * alloc_huge_page.
>   */
> -static void restore_reserve_on_error(struct hstate *h,
> -                       struct vm_area_struct *vma, unsigned long address,
> -                       struct page *page)
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                               unsigned long address, struct page *page)
>  {
>         if (unlikely(HPageRestoreRsvCnt(page))) {
>                 long rc = vma_needs_reservation(h, vma, address);
>
> -               if (unlikely(rc < 0)) {
> +               if (unlikely(rc < 0))
>                         /*
>                          * Rare out of memory condition in reserve map
>                          * manipulation.  Clear HPageRestoreRsvCnt so that
> @@ -2390,16 +2393,47 @@ static void restore_reserve_on_error(struct hstate *h,
>                          * accounting of reserve counts.
>                          */
>                         ClearHPageRestoreRsvCnt(page);
> -               } else if (rc) {
> -                       rc = vma_add_reservation(h, vma, address);
> -                       if (unlikely(rc < 0))
> +               else if (rc)
> +                       vma_add_reservation(h, vma, address);
> +               else
> +                       vma_end_reservation(h, vma, address);
> +       } else if (unlikely(HPageRestoreRsvMap(page))) {
> +               struct resv_map *resv = vma_resv_map(vma);
> +               pgoff_t idx = vma_hugecache_offset(h, vma, address);
> +               long rc;
> +
> +               /*
> +                * This handles the specific case where the reserve count
> +                * was not updated during the page allocation process, but
> +                * the reserve map was updated.  We need to undo the reserve
> +                * map update.
> +                *
> +                * The presence of an entry in the reserve map has opposite
> +                * meanings for shared and private mappings.
> +                */
> +               if (vma->vm_flags & VM_MAYSHARE) {
> +                       rc = region_del(resv, idx, idx + 1);
> +                       if (rc < 0)
> +                               /*
> +                                * Rare out of memory condition.  Since we can
> +                                * not delete the reserve entry, set
> +                                * HPageRestoreRsvCnt so that the global count
> +                                * will be consistent with the reserve map.
> +                                */
> +                               SetHPageRestoreRsvCnt(page);
> +               } else {
> +                       rc = vma_needs_reservation(h, vma, address);
> +                       if (rc < 0)
>                                 /*
>                                  * See above comment about rare out of
>                                  * memory condition.
>                                  */
> -                               ClearHPageRestoreRsvCnt(page);
> -               } else
> -                       vma_end_reservation(h, vma, address);
> +                               SetHPageRestoreRsvCnt(page);
> +                       else if (rc)
> +                               vma_add_reservation(h, vma, address);
> +                       else
> +                               vma_end_reservation(h, vma, address);
> +               }

As I mentioned in the other email, this call sequence does not result
in the region_del() call that we really need here. Calling region_del()
directly would be one fix; another would be to call
vma_end_reservation() even if !rc. Not sure which is more semantically
correct. hugetlb_unreserve_pages() calls region_del() indiscriminately.
A sketch of the first option follows.
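For what it's worth, the direct region_del() option in the private
mapping branch would look something like this (untested sketch, reusing
rc, resv, and idx from the quoted hunk, only meant to illustrate the
first of the two fixes mentioned above):

        } else {
                /*
                 * Undo the entry alloc_huge_page added to the private
                 * mapping's reserve map by deleting it outright instead
                 * of going through the vma_*_reservation() helpers.
                 */
                rc = region_del(resv, idx, idx + 1);
                if (rc < 0)
                        /* Same rare OOM fallback as the VM_MAYSHARE case. */
                        SetHPageRestoreRsvCnt(page);
        }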
>         }
>  }
>
> @@ -2641,6 +2675,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                 hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>                                 pages_per_huge_page(h), page);
>         }
> +       if (map_commit)
> +               SetHPageRestoreRsvMap(page);
>         return page;
>
>  out_uncharge_cgroup:
> @@ -4053,6 +4089,7 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
>         hugepage_add_new_anon_rmap(new_page, vma, addr);
>         hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
>         ClearHPageRestoreRsvCnt(new_page);
> +       ClearHPageRestoreRsvMap(new_page);
>         SetHPageMigratable(new_page);
>  }
>
> @@ -4174,6 +4211,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                         entry = huge_ptep_get(src_pte);
>                         if (!pte_same(src_pte_old, entry)) {
> +                               restore_reserve_on_error(h, vma, addr,
> +                                                               new);
>                                 put_page(new);
>                                 /* dst_entry won't change as in child */
>                                 goto again;
> @@ -4526,6 +4565,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>         ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>         if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
>                 ClearHPageRestoreRsvCnt(new_page);
> +               ClearHPageRestoreRsvMap(new_page);
>
>                 /* Break COW */
>                 huge_ptep_clear_flush(vma, haddr, ptep);
> @@ -4593,6 +4633,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>         if (err)
>                 return err;
>         ClearHPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvMap(page);
>
>         /*
>          * set page dirty so that it will not be removed from cache/file
> @@ -4776,6 +4817,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>
>         if (anon_rmap) {
>                 ClearHPageRestoreRsvCnt(page);
> +               ClearHPageRestoreRsvMap(page);
>                 hugepage_add_new_anon_rmap(page, vma, haddr);
>         } else
>                 page_dup_rmap(page, true);
> @@ -5097,6 +5139,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                 page_dup_rmap(page, true);
>         } else {
>                 ClearHPageRestoreRsvCnt(page);
> +               ClearHPageRestoreRsvMap(page);
>                 hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
>         }
>
> @@ -5133,6 +5176,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         if (vm_shared || is_continue)
>                 unlock_page(page);
>  out_release_nounlock:
> +       restore_reserve_on_error(h, dst_vma, dst_addr, page);
>         put_page(page);
>         goto out;
>  }
> --
> 2.31.1
>