Date: Tue, 3 Oct 2023 11:59:31 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko
Cc: Nhat Pham, akpm@linux-foundation.org, riel@surriel.com,
    roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev,
    tj@kernel.org, lizefan.x@bytedance.com, shuah@kernel.org,
    mike.kravetz@oracle.com, yosryahmed@google.com, fvdl@google.com,
    linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller
Message-ID: <20231003155931.GF17012@cmpxchg.org>
References: <20231003001828.2554080-1-nphamcs@gmail.com>
 <20231003001828.2554080-3-nphamcs@gmail.com>
On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups
> > with hugetlb-backed memory. This has been observed in our production
> > system.
> >
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic pages, depending on
> > the workload within it. The rest is anon, cache, slab, etc. We can
> > set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb
> > fairness. But it is very difficult to configure memory.max to keep
> > overall consumption, including anon, cache, slab, etc., fair.
> >
> > What we have had to resort to is constantly polling hugetlb usage
> > and readjusting memory.max. A similar procedure is applied to other
> > memory limits (memory.low, for example). However, this is rather
> > cumbersome and buggy.
>
> Could you expand some more on how this _helps_ memory.low? The hugetlb
> memory is not reclaimable, so whatever portion of the memcg consumption
> it makes up will be "protected from the reclaim". Consider this
>
>                parent
>               /      \
>              A        B
>         low=50%      low=0
>     current=40%      current=60%
>
> We have external memory pressure and the reclaim should prefer B, as A
> is under its low limit, correct? But now consider that the predominant
> consumption of B is hugetlb, which means memory reclaim cannot do much
> for B, and so A's protection might be breached.
>
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure, mlocked memory would behave the
> same, but mlock rarely consumes huge amounts of memory in my
> experience). Without the accounting there might not be any external
> pressure in the first place.
>
> All that being said, I do not see how adding hugetlb into the
> accounting makes low/min limits management any easier.

It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about the
workload running inside. In that case, memory.low and overcommit aren't
really safe to begin with, due to unknown unreclaimable memory.

The other common usecase is the datacenter where you run your own
applications. You understand their workingset and requirements, and
configure and overcommit the containers such that jobs always meet
their SLAs. E.g. if multiple containers spike, memory.low is set such
that interactive workloads are prioritized over batch jobs, and both
have priority over routine system management tasks.

This is arguably the only case where it's safe to use memory.low. You
have to know what's reclaimable and what isn't, otherwise you cannot
know that memory.low will even do anything, and isolation breaks down.
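To make that prioritization concrete, here is a rough cgroup2 sketch of
the kind of setup I mean. The paths and sizes are made-up examples, and
it assumes the memory controller is already enabled in
cgroup.subtree_control; memory.low and memory.max are the standard
cgroup2 interface files:

	# Interactive job: protect its declared workingset (hugetlb
	# included once it is accounted), say 20G of a 32G container.
	mkdir -p /sys/fs/cgroup/workload.slice/interactive
	echo 20G > /sys/fs/cgroup/workload.slice/interactive/memory.low
	echo 32G > /sys/fs/cgroup/workload.slice/interactive/memory.max

	# Batch job: little protection, so it gets reclaimed first when
	# both containers spike at the same time.
	mkdir -p /sys/fs/cgroup/workload.slice/batch
	echo 2G > /sys/fs/cgroup/workload.slice/batch/memory.low
	echo 32G > /sys/fs/cgroup/workload.slice/batch/memory.max

	# Routine system management tasks: no protection at all.
	echo 0 > /sys/fs/cgroup/system.slice/memory.low

The low protections encode the declared workingsets, so under pressure
reclaim falls on the batch and system cgroups first.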
So we already have that knowledge: mlocked sections, how much anon is
without swap space, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service team
understands its workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find out
whether the services are using hugetlb and tweak the cgroups
accordingly. Bugs and untimeliness in that tweaking have caused
multiple production incidents already. And both teams are regularly
confused when there are large parts of the workload that don't show up
in memory.current, which both sides monitor.

Keep in mind that these systems are already pretty complex, with
multiple overcommitted containers and system-level activity. The
current hugetlb quirk can heavily distort what a given container is
doing on the host.

With this patch, the service can declare its workingset, the container
team can configure the container, and memory.current makes sense to
everybody.

The workload parameters are pretty reliable, but if the service team
gets them wrong and we underprotect the workload, and/or its
unreclaimable memory exceeds what was declared, the infra team gets
alarms on elevated LOW breaching events and investigates whether it's
an infra problem or a service spec problem that needs escalation.

So the case you describe above only happens when mistakes are made, and
we detect and rectify them. In the common case, hugetlb is part of the
recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.
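Just to illustrate the kind of alarm I mentioned above: the breach
signal is the "low" counter in the cgroup2 memory.events file. A
monitoring sketch can be as trivial as the following (the cgroup path
and the interval are made-up values; real tooling would obviously be
more elaborate, but the signal is the same):

	#!/bin/sh
	# Sketch: flag when reclaim has breached this cgroup's memory.low
	# protection, by watching the "low" counter in memory.events.
	CG=/sys/fs/cgroup/workload.slice/interactive
	PREV=0
	while sleep 60; do
		LOW=$(awk '$1 == "low" { print $2 }' "$CG/memory.events")
		if [ "$LOW" -gt "$PREV" ]; then
			echo "memory.low breached $((LOW - PREV)) times on $CG"
		fi
		PREV=$LOW
	done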