From: Frank van der Linden <fvdl@google.com>
Date: Thu, 28 Sep 2023 15:59:33 -0700
Subject: Re: [PATCH v2 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller
To: Nhat Pham <nphamcs@gmail.com>
Cc: akpm@linux-foundation.org, riel@surriel.com, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
	muchun.song@linux.dev, tj@kernel.org, lizefan.x@bytedance.com,
	shuah@kernel.org, mike.kravetz@oracle.com, yosryahmed@google.com,
	linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
In-Reply-To: <20230928005723.1709119-2-nphamcs@gmail.com>
On Wed, Sep 27, 2023 at 5:57 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is allocated, and uncharging when the folio is freed (analogous to
> the hugetlb controller).
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>  fs/hugetlbfs/inode.c                    |  2 +-
>  include/linux/cgroup-defs.h             |  5 +++
>  include/linux/hugetlb.h                 |  6 ++--
>  include/linux/memcontrol.h              |  8 +++++
>  kernel/cgroup/cgroup.c                  | 15 ++++++++-
>  mm/hugetlb.c                            | 23 ++++++++++----
>  mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>  8 files changed, 99 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 622a7f28db1f..e6267b8cbd1d 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -210,6 +210,15 @@ cgroup v2 currently supports the following mount options.
>  	  relying on the original semantics (e.g. specifying bogusly
>  	  high 'bypass' protection values at higher tree levels).
>
> +  memory_hugetlb_accounting
> +	Count hugetlb memory usage towards the cgroup's overall
> +	memory usage for the memory controller. This is a new behavior
> +	that could regress existing setups, so it must be explicitly
> +	opted in with this mount option. Note that hugetlb pages
> +	allocated while this option is not selected will not be
> +	tracked by the memory controller (even if cgroup v2 is
> +	remounted later on).
> +
>
>  Organizing Processes and Threads
>  --------------------------------
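A note for anyone trying this out: since it is a cgroup2 mount option,
it can also be flipped on a live system with a remount. A minimal
sketch via mount(2), assuming the usual /sys/fs/cgroup mount point and
CAP_SYS_ADMIN, and keeping in mind the caveat above that only hugetlb
pages allocated after the remount are charged:

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* Remount the cgroup2 hierarchy with hugetlb accounting
		 * enabled; source and fstype are ignored for MS_REMOUNT. */
		if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT,
			  "memory_hugetlb_accounting")) {
			perror("mount");
			return 1;
		}
		return 0;
	}
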
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 60fce26ff937..034967319955 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -902,7 +902,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		 * to keep reservation accounting consistent.
>  		 */
>  		hugetlb_set_vma_policy(&pseudo_vma, inode, index);
> -		folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
> +		folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0, true);
>  		hugetlb_drop_vma_policy(&pseudo_vma);
>  		if (IS_ERR(folio)) {
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f1b3151ac30b..8641f4320c98 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -115,6 +115,11 @@ enum {
>  	 * Enable recursive subtree protection
>  	 */
>  	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
> +
> +	/*
> +	 * Enable hugetlb accounting for the memory controller.
> +	 */
> +	CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
>  };
>
>  /* cftype->flags */
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a30686e649f7..9b73db1605a2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -713,7 +713,8 @@ struct huge_bootmem_page {
>
>  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -				unsigned long addr, int avoid_reserve);
> +				unsigned long addr, int avoid_reserve,
> +				bool restore_reserve_on_memcg_failure);
>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>  				nodemask_t *nmask, gfp_t gfp_mask);
>  struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
> @@ -1016,7 +1017,8 @@ static inline int isolate_or_dissolve_huge_page(struct page *page,
>
>  static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  					unsigned long addr,
> -					int avoid_reserve)
> +					int avoid_reserve,
> +					bool restore_reserve_on_memcg_failure)
>  {
>  	return NULL;
>  }
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e0cfab58ab71..8094679c99dd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -677,6 +677,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>  	return __mem_cgroup_charge(folio, mm, gfp);
>  }
>
> +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp);
> +
>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>  				  gfp_t gfp, swp_entry_t entry);
>  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> @@ -1251,6 +1253,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
>  	return 0;
>  }
>
> +static inline int mem_cgroup_hugetlb_charge_folio(struct folio *folio,
> +		gfp_t gfp)
> +{
> +	return 0;
> +}
> +
>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>  			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>  {
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1fb7f562289d..f11488b18ceb 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1902,6 +1902,7 @@ enum cgroup2_param {
>  	Opt_favordynmods,
>  	Opt_memory_localevents,
>  	Opt_memory_recursiveprot,
> +	Opt_memory_hugetlb_accounting,
>  	nr__cgroup2_params
>  };
>
> @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
>  	fsparam_flag("favordynmods",		Opt_favordynmods),
>  	fsparam_flag("memory_localevents",	Opt_memory_localevents),
>  	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
> +	fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
>  	{}
>  };
>
> @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
>  	case Opt_memory_recursiveprot:
>  		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>  		return 0;
> +	case Opt_memory_hugetlb_accounting:
> +		ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +		return 0;
>  	}
>  	return -EINVAL;
>  }
> @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> +
> +		if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
>  	}
>  }
>
> @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  		seq_puts(seq, ",memory_localevents");
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
>  		seq_puts(seq, ",memory_recursiveprot");
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +		seq_puts(seq, ",memory_hugetlb_accounting");
>  	return 0;
>  }
>
> @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
>  			"nsdelegate\n"
>  			"favordynmods\n"
>  			"memory_localevents\n"
> -			"memory_recursiveprot\n");
> +			"memory_recursiveprot\n"
> +			"memory_hugetlb_accounting\n");
>  }
>  static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
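Good to see the features file updated as well -- that lets userspace
probe for the option before attempting a mount. A rough sketch of such
a probe, assuming the file sits at its usual /sys/kernel/cgroup/features
location:

	#include <stdio.h>
	#include <string.h>

	/* Return 1 if the kernel advertises memory_hugetlb_accounting. */
	static int have_hugetlb_accounting(void)
	{
		char line[64];
		int found = 0;
		FILE *f = fopen("/sys/kernel/cgroup/features", "r");

		if (!f)
			return 0;
		while (fgets(line, sizeof(line), f))
			if (!strcmp(line, "memory_hugetlb_accounting\n"))
				found = 1;
		fclose(f);
		return found;
	}
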
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..ff88ea4df11a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
>  				     pages_per_huge_page(h), folio);
>  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					  pages_per_huge_page(h), folio);
> +	mem_cgroup_uncharge(folio);
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>
> @@ -3004,7 +3005,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>  }
>
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -				  unsigned long addr, int avoid_reserve)
> +				  unsigned long addr, int avoid_reserve,
> +				  bool restore_reserve_on_memcg_failure)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
> @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  				pages_per_huge_page(h), folio);
>  	}
> +
> +	/* undo allocation if memory controller disallows it. */
> +	if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {
> +		if (restore_reserve_on_memcg_failure)
> +			restore_reserve_on_error(h, vma, addr, folio);
> +		folio_put(folio);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
>  	return folio;
>
> out_uncharge_cgroup:
> @@ -5179,7 +5190,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  				spin_unlock(src_ptl);
>  				spin_unlock(dst_ptl);
>  				/* Do not use reserve as it's private owned */
> -				new_folio = alloc_hugetlb_folio(dst_vma, addr, 1);
> +				new_folio = alloc_hugetlb_folio(dst_vma, addr, 1, false);
>  				if (IS_ERR(new_folio)) {
>  					folio_put(pte_folio);
>  					ret = PTR_ERR(new_folio);
> @@ -5656,7 +5667,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * be acquired again before returning to the caller, as expected.
>  	 */
>  	spin_unlock(ptl);
> -	new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve);
> +	new_folio = alloc_hugetlb_folio(vma, haddr, outside_reserve, true);
>
>  	if (IS_ERR(new_folio)) {
>  		/*
> @@ -5930,7 +5941,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  							VM_UFFD_MISSING);
>  		}
>
> -		folio = alloc_hugetlb_folio(vma, haddr, 0);
> +		folio = alloc_hugetlb_folio(vma, haddr, 0, true);
>  		if (IS_ERR(folio)) {
>  			/*
>  			 * Returning error will result in faulting task being
> @@ -6352,7 +6363,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  			goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
> +		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, true);
>  		if (IS_ERR(folio)) {
>  			ret = -ENOMEM;
>  			goto out;
> @@ -6394,7 +6405,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  			goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
> +		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0, false);
>  		if (IS_ERR(folio)) {
>  			folio_put(*foliop);
>  			ret = -ENOMEM;
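For readers following the call-site changes above: a memcg charge
failure now comes back from alloc_hugetlb_folio() as ERR_PTR(-ENOMEM),
and the reserve map entry is restored only when the caller asked for
it. A hypothetical caller, purely for illustration (not code from the
patch):

	static vm_fault_t example_fault_path(struct vm_area_struct *vma,
					     unsigned long haddr)
	{
		struct folio *folio;

		/* true: put the reserve map entry back if the memcg
		 * charge fails, mirroring the fault paths above. */
		folio = alloc_hugetlb_folio(vma, haddr, 0 /* avoid_reserve */,
					    true);
		if (IS_ERR(folio))
			return VM_FAULT_OOM; /* covers the memcg-disallowed case */

		/* ... map the folio ... */
		return 0;
	}
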
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d1a322a75172..d5dfc9b36acb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7050,6 +7050,47 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
>  	return ret;
>  }
>
> +static struct mem_cgroup *get_mem_cgroup_from_current(void)
> +{
> +	struct mem_cgroup *memcg;
> +
> +again:
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (!css_tryget(&memcg->css)) {
> +		rcu_read_unlock();
> +		goto again;
> +	}
> +	rcu_read_unlock();
> +	return memcg;
> +}
> +
> +/**
> + * mem_cgroup_hugetlb_charge_folio - Charge a newly allocated hugetlb folio.
> + * @folio: folio to charge.
> + * @gfp: reclaim mode
> + *
> + * This function charges an allocated hugetlb folio to the memcg of the
> + * current task.
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_hugetlb_charge_folio(struct folio *folio, gfp_t gfp)
> +{
> +	struct mem_cgroup *memcg;
> +	int ret;
> +
> +	if (mem_cgroup_disabled() ||
> +	    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> +		return 0;
> +
> +	memcg = get_mem_cgroup_from_current();
> +	ret = charge_memcg(folio, memcg, gfp);
> +	mem_cgroup_put(memcg);
> +
> +	return ret;
> +}
> +
>  /**
>   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
>   * @folio: folio to charge.
> --
> 2.34.1
>

With the mount option added, I'm fine with this. There are reasons to
want this and reasons not to want it, so everybody's happy!

Out of curiosity: is anyone aware of any code that might behave badly
when folio_memcg(hugetlb_folio) != NULL, not expecting it?
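To make the question concrete, the kind of pattern I have in mind -- a
hypothetical sketch, not pointing at any actual caller:

	/*
	 * Hypothetical example only: code that uses a NULL
	 * folio_memcg() to mean "this folio is not accounted to a
	 * memcg". That held for every hugetlb folio before this
	 * patch; with memory_hugetlb_accounting it no longer does.
	 */
	static bool folio_is_unaccounted(struct folio *folio)
	{
		return folio_memcg(folio) == NULL;
	}

- Frank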