Date: Wed, 27 Sep 2023 12:44:30 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko
Cc: Frank van der Linden, Nhat Pham, akpm@linux-foundation.org,
	riel@surriel.com, roman.gushchin@linux.dev, shakeelb@google.com,
	muchun.song@linux.dev, tj@kernel.org, lizefan.x@bytedance.com,
	shuah@kernel.org, mike.kravetz@oracle.com, yosryahmed@google.com,
	linux-mm@kvack.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH 0/2] hugetlb memcg accounting
Message-ID: <20230927164430.GB365513@cmpxchg.org>
References: <20230926194949.2637078-1-nphamcs@gmail.com>
	<20230926221414.GD348484@cmpxchg.org>
On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> [...]
> > The fact that memory consumed by hugetlb is currently not considered
> > inside memcg (host memory accounting and control) is inconsistent. It
> > has been quite confusing to our service owners and complicating things
> > for our containers team.
>
> I do understand how that is confusing and inconsistent as well. Hugetlb
> has been bringing confusion throughout its existence, I am afraid.
>
> As noted in the other reply, though, I am not sure the hugetlb pool can
> be reasonably incorporated with sane semantics. Neither the regular
> allocations nor the hugetlb reservations/actual use can fall back to
> the pool of the other. This makes them two different things, each
> hitting its own failure cases that require dedicated handling.
>
> Just off the top of my head, these are cases I do not see an easy way
> out from:
>	- hugetlb charge failure has two failure modes - pool empty
>	  or memcg limit reached. The former is not recoverable and
>	  should fail without any further intervention; the latter
>	  might benefit from reclaim.
>	- !hugetlb memory charge failures cannot consider any hugetlb
>	  pages - they act as implicit memory.min protection, so it is
>	  impossible to manage reclaim protection without knowledge of
>	  the hugetlb use.
>	- there is no way to control the hugetlb pool distribution by
>	  memcg limits. How do we distinguish reservations from actual
>	  use?
>	- a pre-allocated pool consumes memory without any actual
>	  owner until it is actually used, and even that has two
>	  stages (reserved and really used). This makes it really hard
>	  to manage memory as a whole when a considerable amount of
>	  hugetlb memory is preallocated.

It's important to distinguish hugetlb access policy from memory use
policy. This patch isn't about hugetlb access; it's about general
memory use.

Hugetlb access policy is a separate domain with separate answers.
Preallocating is a privileged operation; for access control there is
the hugetlb cgroup controller etc.

What's missing is that once you get past the access restrictions and
legitimately get your hands on huge pages, that memory use gets
reflected in memory.current and exerts pressure on *other* memory
inside the group, such as anon or optimistic cache allocations.

Note that hugetlb *can* be allocated on demand. It's unexpected that
when an application optimistically allocates a couple of 2M hugetlb
pages (see the sketch below), those aren't reflected in its
memory.current. The same is true for hugetlb_cma. If the gigantic
pages aren't currently allocated to a cgroup, that CMA memory can be
used for movable memory elsewhere.

The points you and Frank raise are reasons and scenarios where
additional hugetlb access control is necessary - preallocation,
limited availability of 1G pages etc. But they're not reasons against
charging faulted-in hugetlb to the memcg *as well*. My point is we
need both. One to manage competition over hugetlb, because it has
unique limitations. The other to manage competition over host memory,
which hugetlb is a part of.
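For concreteness, here is a minimal userspace sketch of that on-demand
case (my illustration, not part of the series; the sizes and the
fallback policy are made up). MAP_HUGETLB draws from the preallocated
2M pool; the question in this thread is whether those pages, once
faulted in, show up in the allocating cgroup's memory.current:

/*
 * Optimistically allocate a couple of 2M hugetlb pages, falling back
 * to regular pages when the pool is empty.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2 * (2UL << 20);	/* two 2M huge pages */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)		/* pool empty: fall back */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 0, len);		/* fault the memory in */
	return 0;
}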
Here is a usecase from our fleet. Imagine a configuration with two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic pages, depending on the
workload inside it. The rest is anon, cache, slab, etc. You set each
cgroup's hugetlb limit to 3G to enforce hugetlb fairness. But how do
you configure memory.max to keep *overall* consumption, including
anon, cache, slab etc., fair?

If used hugetlb is charged, you can just set memory.max=32G regardless
of the workload inside. Without it, you'd have to constantly poll
hugetlb usage and readjust memory.max!
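A sketch of how that setup could be expressed against the cgroup2
interface (the /sys/fs/cgroup/ct1 and ct2 paths are hypothetical;
hugetlb.1GB.max and memory.max are the stock cgroup2 files, and to my
knowledge both accept memparse-style suffixes):

/*
 * Sketch of the two-container setup above. With used hugetlb charged
 * to memcg, memory.max is one static value per container, no matter
 * how the workload splits between hugetlb and other memory.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int cg_write(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* hugetlb fairness: at most 3 gigantic pages per container */
	cg_write("/sys/fs/cgroup/ct1/hugetlb.1GB.max", "3G");
	cg_write("/sys/fs/cgroup/ct2/hugetlb.1GB.max", "3G");

	/* overall fairness: one static limit that covers hugetlb too */
	cg_write("/sys/fs/cgroup/ct1/memory.max", "32G");
	cg_write("/sys/fs/cgroup/ct2/memory.max", "32G");
	return 0;
}

In practice the container runtime would write these values, but the
point stands: memory.max never has to be readjusted for hugetlb.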