From: Nhat Pham
Date: Wed, 27 Sep 2023 10:22:54 -0700
Subject: Re: [PATCH 0/2] hugetlb memcg accounting
To: Johannes Weiner
Cc: Michal Hocko, Frank van der Linden, akpm@linux-foundation.org,
 riel@surriel.com, roman.gushchin@linux.dev, shakeelb@google.com,
 muchun.song@linux.dev, tj@kernel.org, lizefan.x@bytedance.com,
 shuah@kernel.org, mike.kravetz@oracle.com, yosryahmed@google.com,
 linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org
References: <20230926194949.2637078-1-nphamcs@gmail.com>
 <20230926221414.GD348484@cmpxchg.org>
 <20230927164430.GB365513@cmpxchg.org>
In-Reply-To: <20230927164430.GB365513@cmpxchg.org>

On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent.
> > > It has been quite confusing to our service owners and complicating
> > > things for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > is bringing throughout its existence I am afraid.
> >
> > As noted in other reply though I am not sure hugetlb pool can be
> > reasonably incorporated with a sane semantic. Neither of the regular
> > allocation nor the hugetlb reservation/actual use can fallback to the
> > pool of the other. This makes them 2 different things each hitting their
> > own failure cases that require a dedicated handling.
> >
> > Just from top of my head these are cases I do not see easy way out from:
> >   - hugetlb charge failure has two failure modes - pool empty
> >     or memcg limit reached. The former is not recoverable and
> >     should fail without any further intervention the latter might
> >     benefit from reclaiming.
> >   - !hugetlb memory charge failure cannot consider any hugetlb
> >     pages - they are implicit memory.min protection so it is
> >     impossible to manage reclaim protection without having a
> >     knowledge of the hugetlb use.
> >   - there is no way to control the hugetlb pool distribution by
> >     memcg limits. How do we distinguish reservations from actual
> >     use?
> >   - pre-allocated pool is consuming memory without any actual
> >     owner until it is actually used and even that has two stages
> >     (reserved and really used). This makes it really hard to
> >     manage memory as whole when there is a considerable amount of
> >     hugetlb memory preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation, for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic pages, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc. fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!

Yep, and I'd like to add that this can cause, and has caused, issues in
our production system when there is a delay in correcting the memory
limits (low or max).
The userspace agent in charge of correcting these limits only runs
periodically, and between consecutive runs the system can be left in
an over- or underprotected state. An instantaneous charge towards the
memory controller would close this gap.

I think we need both a HugeTLB controller and memory controller
accounting.
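
To make the gap concrete, here is a rough toy model in Python. It is
only a sketch under stated assumptions, not our actual agent: the
function names, the tick-based polling, and the hugetlb usage trace are
all hypothetical. It reuses the 32G container from the example above and
assumes hugetlb is not charged to memcg, so the agent has to subtract
the polled hugetlb usage from memory.max.

```python
GIB = 1 << 30  # bytes in one GiB


def adjusted_memory_max(budget, hugetlb_used):
    """memory.max the agent must write when hugetlb is NOT charged to memcg."""
    return max(budget - hugetlb_used, 0)


def effective_ceiling(budget, hugetlb_trace, period):
    """Per-tick cap on total usage (memory.max + hugetlb currently in use).

    The hypothetical agent polls hugetlb usage and rewrites memory.max
    only on ticks that are a multiple of `period`; in between, the last
    written value stays in effect.
    """
    ceilings = []
    memory_max = budget
    for tick, used in enumerate(hugetlb_trace):
        if tick % period == 0:
            memory_max = adjusted_memory_max(budget, used)
        ceilings.append(memory_max + used)
    return ceilings


# Made-up trace: the workload grabs 3G of gigantic pages right after a
# poll, and releases them right after the next one.
budget = 32 * GIB
trace = [0, 3 * GIB, 3 * GIB, 3 * GIB, 0, 0]

for tick, cap in enumerate(effective_ceiling(budget, trace, period=3)):
    print(f"tick {tick}: ceiling {cap // GIB}G (budget {budget // GIB}G)")
```

In this trace the effective ceiling rises to 35G while the agent has not
yet noticed the new hugetlb usage, and drops to 29G after the pages are
released but before the next correction. If used hugetlb were charged to
the memory controller, memory.max could stay at 32G and the ceiling
would match the budget on every tick.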