Date: Tue, 3 Oct 2023 11:59:31 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko
Cc: Nhat Pham, akpm@linux-foundation.org, riel@surriel.com,
    roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev,
    tj@kernel.org, lizefan.x@bytedance.com, shuah@kernel.org,
    mike.kravetz@oracle.com, yosryahmed@google.com, fvdl@google.com,
    linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller
Message-ID: <20231003155931.GF17012@cmpxchg.org>
References: <20231003001828.2554080-1-nphamcs@gmail.com>
 <20231003001828.2554080-3-nphamcs@gmail.com>
On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups
> > with hugetlb-backed memory. This has been observed in our production
> > system.
> >
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic pages, depending on
> > the workload within it. The rest is anon, cache, slab, etc. We can
> > set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb
> > fairness. But it is very difficult to configure memory.max to keep
> > overall consumption, including anon, cache, slab, etc., fair.
> >
> > What we have had to resort to is constantly polling hugetlb usage
> > and readjusting memory.max. A similar procedure is applied to other
> > memory limits (memory.low, for example). However, this is rather
> > cumbersome and buggy.
>
> Could you expand some more on how this _helps_ memory.low? The hugetlb
> memory is not reclaimable, so whatever portion of the memcg consumption
> it makes up will be "protected from the reclaim". Consider this
>
>                parent
>               /      \
>              A        B
>         low=50%      low=0
>     current=40%      current=60%
>
> We have external memory pressure and the reclaim should prefer B, as A
> is under its low limit, correct? But now consider that the predominant
> consumption of B is hugetlb, which means memory reclaim cannot do much
> for B, and so A's protection might be breached.
>
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure, mlocked memory would behave the
> same, but mlock rarely consumes huge amounts of memory in my
> experience). Without the accounting there might not be any external
> pressure in the first place.
>
> All that being said, I do not see how adding hugetlb into the
> accounting makes low/min limits management any easier.

It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about the
workload running inside. In that case, memory.low and overcommit aren't
really safe to begin with, due to unknown unreclaimable memory.

The other common usecase is the datacenter where you run your own
applications. You understand their workingset and requirements, and
configure and overcommit the containers such that jobs always meet
their SLAs. E.g. if multiple containers spike, memory.low is set such
that interactive workloads are prioritized over batch jobs, and both
have priority over routine system management tasks.

This is arguably the only case where it's safe to use memory.low. You
have to know what's reclaimable and what isn't, otherwise you cannot
know that memory.low will even do anything, and isolation breaks down.
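To make that prioritization concrete, here is a rough cgroup2 sketch of
the kind of setup I mean. The paths and sizes are made-up examples, and
it assumes the memory controller is already enabled in
cgroup.subtree_control; memory.low and memory.max are the standard
cgroup2 interface files:

	# Interactive job: protect its declared workingset (hugetlb
	# included once it is accounted), say 20G of a 32G container.
	mkdir -p /sys/fs/cgroup/workload.slice/interactive
	echo 20G > /sys/fs/cgroup/workload.slice/interactive/memory.low
	echo 32G > /sys/fs/cgroup/workload.slice/interactive/memory.max

	# Batch job: little protection, so it gets reclaimed first when
	# both containers spike at the same time.
	mkdir -p /sys/fs/cgroup/workload.slice/batch
	echo 2G > /sys/fs/cgroup/workload.slice/batch/memory.low
	echo 32G > /sys/fs/cgroup/workload.slice/batch/memory.max

	# Routine system management tasks: no protection at all.
	echo 0 > /sys/fs/cgroup/system.slice/memory.low

The low protections encode the declared workingsets, so under pressure
reclaim falls on the batch and system cgroups first.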
So we already have that knowledge: mlocked sections, how much anon is
without swap space, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service team
understands its workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find out
whether the services are using hugetlb and tweak the cgroups
accordingly. Bugs and untimeliness in that tweaking have caused
multiple production incidents already. And both teams are regularly
confused when there are large parts of the workload that don't show up
in memory.current, which both sides monitor.

Keep in mind that these systems are already pretty complex, with
multiple overcommitted containers and system-level activity. The
current hugetlb quirk can heavily distort what a given container is
doing on the host.

With this patch, the service can declare its workingset, the container
team can configure the container, and memory.current makes sense to
everybody.

The workload parameters are pretty reliable, but if the service team
gets them wrong and we underprotect the workload, and/or its
unreclaimable memory exceeds what was declared, the infra team gets
alarms on elevated LOW breaching events and investigates whether it's
an infra problem or a service spec problem that needs escalation.

So the case you describe above only happens when mistakes are made, and
we detect and rectify them. In the common case, hugetlb is part of the
recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.
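Just to illustrate the kind of alarm I mentioned above: the breach
signal is the "low" counter in the cgroup2 memory.events file. A
monitoring sketch can be as trivial as the following (the cgroup path
and the interval are made-up values; real tooling would obviously be
more elaborate, but the signal is the same):

	#!/bin/sh
	# Sketch: flag when reclaim has breached this cgroup's memory.low
	# protection, by watching the "low" counter in memory.events.
	CG=/sys/fs/cgroup/workload.slice/interactive
	PREV=0
	while sleep 60; do
		LOW=$(awk '$1 == "low" { print $2 }' "$CG/memory.events")
		if [ "$LOW" -gt "$PREV" ]; then
			echo "memory.low breached $((LOW - PREV)) times on $CG"
		fi
		PREV=$LOW
	done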