linux-mm.kvack.org archive mirror
From: Frank van der Linden <fvdl@google.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: akpm@linux-foundation.org, riel@surriel.com, hannes@cmpxchg.org,
	 mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeelb@google.com,  muchun.song@linux.dev, tj@kernel.org,
	lizefan.x@bytedance.com,  shuah@kernel.org,
	mike.kravetz@oracle.com, yosryahmed@google.com,
	 linux-mm@kvack.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org,  cgroups@vger.kernel.org
Subject: Re: [PATCH v2 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller
Date: Thu, 28 Sep 2023 15:59:33 -0700	[thread overview]
Message-ID: <CAPTztWY4YnyFF3fVFZt-EbkUM3SSJ1rMgrgtjZGe5W-+v3tVGQ@mail.gmail.com> (raw)
In-Reply-To: <20230928005723.1709119-2-nphamcs@gmail.com>

On Wed, Sep 27, 2023 at 5:57 PM Nhat Pham <nphamcs@gmail.com> wrote:

> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is allocated, and uncharging when the folio is freed (analogous to
> the hugetlb controller).
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>  fs/hugetlbfs/inode.c                    |  2 +-
>  include/linux/cgroup-defs.h             |  5 +++
>  include/linux/hugetlb.h                 |  6 ++--
>  include/linux/memcontrol.h              |  8 +++++
>  kernel/cgroup/cgroup.c                  | 15 ++++++++-
>  mm/hugetlb.c                            | 23 ++++++++++----
>  mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>  8 files changed, 99 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 622a7f28db1f..e6267b8cbd1d 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -210,6 +210,15 @@ cgroup v2 currently supports the following mount options.
>          relying on the original semantics (e.g. specifying bogusly
>          high 'bypass' protection values at higher tree levels).
>
> +  memory_hugetlb_accounting
> +        Count hugetlb memory usage towards the cgroup's overall
> +        memory usage for the memory controller. This is a new behavior
> +        that could regress existing setups, so it must be explicitly
> +        opted in with this mount option. Note that hugetlb pages
> +        allocated while this option is not selected will not be
> +        tracked by the memory controller (even if cgroup v2 is
> +        remounted later on).
> +
>
>  Organizing Processes and Threads
>  --------------------------------
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 60fce26ff937..034967319955 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -902,7 +902,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>                  * to keep reservation accounting consistent.
>                  */
>                 hugetlb_set_vma_policy(&pseudo_vma, inode, index);
> -               folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
> +               folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0, true);
>                 hugetlb_drop_vma_policy(&pseudo_vma);
>                 if (IS_ERR(folio)) {
>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f1b3151ac30b..8641f4320c98 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -115,6 +115,11 @@ enum {
>          * Enable recursive subtree protection
>          */
>         CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
> +
> +       /*
> +        * Enable hugetlb accounting for the memory controller.
> +        */
> +        CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
>  };
>
>  /* cftype->flags */
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a30686e649f7..9b73db1605a2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -713,7 +713,8 @@ struct huge_bootmem_page {
>
>  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -                               unsigned long addr, int avoid_reserve);
> +                               unsigned long addr, int avoid_reserve,
> +                               bool restore_reserve_on_memcg_failure);
>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>                                 nodemask_t *nmask, gfp_t gfp_mask);
>  struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
> @@ -1016,7 +1017,8 @@ static inline int isolate_or_dissolve_huge_page(struct page *page,
>
>  static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>                                            unsigned long addr,
> -                                          int avoid_reserve)
> +                                          int avoid_reserve,
> +                                          bool restore_reserve_on_memcg_failure)
>  {
>         return NULL;
>  }
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e0cfab58ab71..8094679c99dd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -677,6 +677,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>         return __mem_cgroup_charge(folio, mm, gfp);
>  }
>
> +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp);
> +
>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>                                   gfp_t gfp, swp_entry_t entry);
>  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> @@ -1251,6 +1253,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
>         return 0;
>  }
>
> +static inline int mem_cgroup_hugetlb_charge_folio(struct folio *folio,
> +               gfp_t gfp)
> +{
> +       return 0;
> +}
> +
>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>                         struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>  {
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1fb7f562289d..f11488b18ceb 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1902,6 +1902,7 @@ enum cgroup2_param {
>         Opt_favordynmods,
>         Opt_memory_localevents,
>         Opt_memory_recursiveprot,
> +       Opt_memory_hugetlb_accounting,
>         nr__cgroup2_params
>  };
>
> @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
>         fsparam_flag("favordynmods",            Opt_favordynmods),
>         fsparam_flag("memory_localevents",      Opt_memory_localevents),
>         fsparam_flag("memory_recursiveprot",    Opt_memory_recursiveprot),
> +       fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
>         {}
>  };
>
> @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
>         case Opt_memory_recursiveprot:
>                 ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>                 return 0;
> +       case Opt_memory_hugetlb_accounting:
> +               ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +               return 0;
>         }
>         return -EINVAL;
>  }
> @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>                         cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>                 else
>                         cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> +
> +               if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +                       cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +               else
> +                       cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
>         }
>  }
>
> @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>                 seq_puts(seq, ",memory_localevents");
>         if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
>                 seq_puts(seq, ",memory_recursiveprot");
> +       if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +               seq_puts(seq, ",memory_hugetlb_accounting");
>         return 0;
>  }
>
> @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
>                         "nsdelegate\n"
>                         "favordynmods\n"
>                         "memory_localevents\n"
> -                       "memory_recursiveprot\n");
> +                       "memory_recursiveprot\n"
> +                       "memory_hugetlb_accounting\n");
>  }
>  static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..ff88ea4df11a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
>                                      pages_per_huge_page(h), folio);
>         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>                                           pages_per_huge_page(h), folio);
> +       mem_cgroup_uncharge(folio);
>         if (restore_reserve)
>                 h->resv_huge_pages++;
>
> @@ -3004,7 +3005,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>  }
>
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -                                   unsigned long addr, int avoid_reserve)
> +                                       unsigned long addr, int avoid_reserve,
> +                                       bool restore_reserve_on_memcg_failure)
>  {
>         struct hugepage_subpool *spool = subpool_vma(vma);
>         struct hstate *h = hstate_vma(vma);
> @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>                         hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>                                         pages_per_huge_page(h), folio);
>         }
> +
> +       /* undo allocation if memory controller disallows it. */
> +       if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {
> +               if (restore_reserve_on_memcg_failure)
> +                       restore_reserve_on_error(h, vma, addr, folio);
> +               folio_put(folio);
> +               return ERR_PTR(-ENOMEM);
> +       }
> +
>         return folio;
>
>  out_uncharge_cgroup:
> @@ -5179,7 +5190,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                                 spin_unlock(src_ptl);
>                                 spin_unlock(dst_ptl);
>                                 /* Do not use reserve as it's private owned */
> -                               new_folio = alloc_hugetlb_folio(dst_vma, addr, 1);
> +                               new_folio = alloc_hugetlb_folio(dst_vma, addr, 1, false);
>                                 if (IS_ERR(new_folio)) {
>                                         folio_put(pte_folio);
>                                         ret = PTR_ERR(new_folio);
> @@ -5656,7 +5667,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
>          * be acquired again before returning to the caller, as expected.
>          */
>         spin_unlock(ptl);
> -       new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve);
> +       new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve, true);
>
>         if (IS_ERR(new_folio)) {
>                 /*
> @@ -5930,7 +5941,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>                                                         VM_UFFD_MISSING);
>                 }
>
> -               folio = alloc_hugetlb_folio(vma, haddr, 0);
> +               folio = alloc_hugetlb_folio(vma, haddr, 0, true);
>                 if (IS_ERR(folio)) {
>                         /*
>                          * Returning error will result in faulting task being
> @@ -6352,7 +6363,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>                 }
>
> -               folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
> +               folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, true);
>                 if (IS_ERR(folio)) {
>                         ret = -ENOMEM;
>                         goto out;
> @@ -6394,7 +6405,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>                         goto out;
>                 }
>
> -               folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
> +               folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, false);
>                 if (IS_ERR(folio)) {
>                         folio_put(*foliop);
>                         ret = -ENOMEM;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d1a322a75172..d5dfc9b36acb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7050,6 +7050,47 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
>         return ret;
>  }
>
> +static struct mem_cgroup *get_mem_cgroup_from_current(void)
> +{
> +       struct mem_cgroup *memcg;
> +
> +again:
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (!css_tryget(&memcg->css)) {
> +               rcu_read_unlock();
> +               goto again;
> +       }
> +       rcu_read_unlock();
> +       return memcg;
> +}
> +
> +/**
> + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio.
> + * @folio: folio to charge.
> + * @gfp: reclaim mode
> + *
> + * This function charges an allocated hugetlb folio to the memcg of the
> + * current task.
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp)
> +{
> +       struct mem_cgroup *memcg;
> +       int ret;
> +
> +       if (mem_cgroup_disabled() ||
> +               !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> +               return 0;
> +
> +       memcg = get_mem_cgroup_from_current();
> +       ret = charge_memcg(folio, memcg, gfp);
> +       mem_cgroup_put(memcg);
> +
> +       return ret;
> +}
> +
>  /**
>   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
>   * @folio: folio to charge.
> --
> 2.34.1
>
>
With the mount option added, I'm fine with this. There are reasons to want
and reasons not to want this, so everybody's happy!

Out of curiosity: is anyone aware of any code that may behave badly when
folio_memcg(hugetlb_folio) != NULL, not expecting it?

- Frank
