From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Nhat Pham <nphamcs@gmail.com>,
akpm@linux-foundation.org, riel@surriel.com,
roman.gushchin@linux.dev, shakeelb@google.com,
muchun.song@linux.dev, tj@kernel.org, lizefan.x@bytedance.com,
shuah@kernel.org, mike.kravetz@oracle.com, yosryahmed@google.com,
linux-mm@kvack.org, kernel-team@meta.com,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH 0/2] hugetlb memcg accounting
Date: Wed, 27 Sep 2023 14:47:38 -0400
Message-ID: <20230927184738.GC365513@cmpxchg.org>
In-Reply-To: <ZRQQMABiVIcXXcrg@dhcp22.suse.cz>

On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
>
> The primary reason why hugetlb is living outside of memcg (and the core
> MM as well) is that it doesn't really fit the whole scheme. In several
> aspects. First and the foremost it is an independently managed resource
> with its own pool management, use and lifetime.

Honestly, the simpler explanation is that few people have used hugetlb
in regular, containerized non-HPC workloads.

Hugetlb has historically been much more special, and it retains a
specialness that warrants e.g. the hugetlb cgroup controller. But it
has also made strides with hugetlb_cma, migratability, madvise support
etc. that allow much more on-demand use. It's no longer the case that
you just put a static pool of memory aside during boot and only a few
blessed applications are using it.

For example, we're using hugetlb_cma very broadly with generic
containers. The CMA region is fully usable by movable non-huge stuff
until huge pages are allocated in it. With the hugetlb controller you
can define a maximum number of hugetlb pages that can be used per
container. But what if that container isn't using any? Why shouldn't
it be allowed to use its overall memory allowance for anon and cache
instead?

With hugetlb being more dynamic, it becomes the same problem that we
had with dedicated tcp and kmem pools. It didn't make sense to fail a
random slab allocation when you still have memory headroom or can
reclaim some cache. Nowadays, the same problem applies to hugetlb.

> There is no notion of memory reclaim and this makes a huge difference
> for the pool that might consume considerable amount of memory. While
> this is the case for many kernel allocations as well they usually do not
> consume considerable portions of the accounted memory. This makes it
> really tricky to handle limit enforcement gracefully.

I don't think that's true. For some workloads, network buffers can
absolutely dominate. And they work just fine with cgroup limits. It
isn't a problem that they aren't reclaimable themselves, it's just
important that they put pressure on stuff that is.

So that if you use 80% hugetlb, the other memory is forced to stay in
the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
group is still allowed to use the full 100% of its host memory
allowance, without requiring some outside agent continuously
monitoring and adjusting the container limits.

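As a rough illustration of the 80%/20% argument, here is a toy model of a
single shared limit. It is only a sketch and not kernel code: the class
ToyMemcg, its charge() method, and the abstract "units" are all made up for
this example, and it assumes that cache is the only reclaimable memory (no
swap).

```python
class ToyMemcg:
    """Toy model of one shared memory limit (hypothetical, not kernel code)."""

    def __init__(self, limit):
        self.limit = limit
        self.hugetlb = 0  # not reclaimable
        self.anon = 0     # not reclaimable in this toy (assumes no swap)
        self.cache = 0    # the only reclaimable memory in this toy

    def usage(self):
        return self.hugetlb + self.anon + self.cache

    def charge(self, kind, amount):
        """Charge `amount` units of `kind` against the shared limit.

        If the charge would exceed the limit, reclaim cache first.
        Returns True on success, False if reclaim cannot make enough
        room (the toy equivalent of an OOM); state is unchanged then.
        """
        over = self.usage() + amount - self.limit
        if over > 0:
            if self.cache < over:
                return False
            self.cache -= over  # pressure lands on the reclaimable part
        setattr(self, kind, getattr(self, kind) + amount)
        return True


# A group that uses hugetlb: the rest is squeezed into what remains.
g = ToyMemcg(limit=100)
g.charge("cache", 60)
g.charge("hugetlb", 80)        # succeeds by reclaiming 40 units of cache
oom = not g.charge("anon", 30)  # only 20 units of cache left to reclaim

# A group that uses no hugetlb: the full allowance goes to anon and cache.
h = ToyMemcg(limit=100)
h.charge("anon", 70)
h.charge("cache", 30)
```

In the first group the hugetlb charge is not reclaimable itself, but it
pushes cache out; in the second group nothing has to be adjusted from the
outside for anon and cache to fill the whole limit.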
> Another important aspect comes from the lifetime semantics when a proper
> reservations accounting and managing needs to handle mmap time rather
> than the usual allocation path. While pages are allocated they do not
> belong to anybody and only later at the #PF time (or read for the fs
> backed mapping) the ownership is established. That makes it really hard
> to manage memory as whole under the memcg anyway as a large part of
> that pool sits without an ownership yet it cannot be used for any other
> purpose.
>
> These and more reasons were behind the earlier decision to have a
> dedicated hugetlb controller.

Yeah, there is still a need for an actual hugetlb controller for the
static use cases (and even for dynamic access to hugetlb_cma).

But you need memcg coverage as well for the more dynamic cases to work
as expected. And having that doesn't really interfere with the static
use cases.

> I will also Nack involving hugetlb pages being accounted by
> default. This would break any setups which mix normal and hugetlb memory
> with memcg limits applied.

Yes, no disagreement there. I think we're all on the same page that
this needs to be opt-in, say with a cgroup mount option.