Date: Wed, 2 Aug 2023 15:31:07 +0800
From: Xueshi Hu <xueshi.hu@smartx.com>
To: Mike Kravetz
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, muchun.song@linux.dev
Subject: Re: [PATCH 1/3] mm/hugetlb: fix the inconsistency of /proc/sys/vm/nr_huge_pages
References: <20230730125156.207301-1-xueshi.hu@smartx.com>
 <20230730125156.207301-2-xueshi.hu@smartx.com>
 <20230731221725.GA3351@monkey>
 <20230801184942.GA6544@monkey>
In-Reply-To: <20230801184942.GA6544@monkey>

On Tue, Aug 01, 2023 at 11:49:42AM -0700, Mike Kravetz wrote:
> On 08/01/23 20:22, Xueshi Hu wrote:
> > On Tue, Aug 1, 2023 at 6:17 AM Mike Kravetz wrote:
> > >
> > > On 07/30/23 20:51, Xueshi Hu wrote:
> > > > When writing to /proc/sys/vm/nr_huge_pages, it indicates the global
> > > > number of huge pages of the default hstate. But when reading from it,
> > > > it indicates the current number of "persistent" huge pages in the
> > > > kernel's huge page pool.
> > > >
> > > > There are currently four interfaces used to export the number of huge
> > > > pages:
> > > > - /proc/meminfo
> > > > - /proc/sys/vm/*hugepages*
> > > > - /sys/devices/system/node/node0/hugepages/hugepages-2048kB/*
> > > > - /sys/kernel/mm/hugepages/hugepages-2048kB/*
> > > >
> > > > But only /proc/sys/vm/nr_huge_pages provides the 'persistent'
> > > > semantics when reading from it. This inconsistency is very subtle
> > > > and can easily be misunderstood.
> > >
> > > Thanks for looking into this.
> > >
> > > The hugetlb documentation (./admin-guide/mm/hugetlbpage.rst) mentions
> > > the term 'persistent hugetlb pages', but never provides a definition.
> > >
> > > We can get the definition from the code as:
> > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > >
> > > Further, the documentation says:
> > >
> > > "The ``/proc/meminfo`` file provides information about the total number
> > > of persistent hugetlb pages in the kernel's huge page pool."
> > >
> > > "``/proc/sys/vm/nr_hugepages`` indicates the current number of
> > > "persistent" huge pages in the kernel's huge page pool."
> > >
> > > "The administrator may shrink the pool of persistent huge pages for
> > > the default huge page size by setting the ``nr_hugepages`` sysctl to a
> > > smaller value."
> > >
> > > So, the documentation implies that these interfaces should display the
> > > number of persistent hugetlb pages. As you have discovered, all but
> > > the sysctl interface (/proc/sys/vm/nr_hugepages) display the total
> > > number of hugetlb pages rather than the number of persistent hugetlb
> > > pages.
> > >
> > > If we wanted to match the documentation, it seems we should change all
> > > the "show" interfaces to display persistent huge pages. However, I am
> > > a bit concerned about how this may impact end users.
> > >
> > > There are two types of inconsistencies in these interfaces:
> > > 1) As this patch points out, not all "show" interfaces provide the
> > >    same information. The sysctl (/proc/sys/vm/nr_hugepages) displays
> > >    the number of persistent hugetlb pages, while the others display
> > >    the total number of hugetlb pages.
> > > 2) The show/read interfaces generally provide the total number of
> > >    hugetlb pages, while the update/write interfaces update the number
> > >    of persistent hugetlb pages.
> > >
> > > Both of these situations can lead to user confusion. My 'guess' is
> > > that this has not been a widespread issue, as most hugetlb users do
> > > not configure overcommit/surplus hugetlb pages, and thus the total
> > > number of hugetlb pages is the same as the number of persistent
> > > hugetlb pages.
> > >
> > > Right now, I would suggest making all these interfaces display/take
> > > the number of persistent hugetlb pages for consistency. This also
> > > matches the documentation.
> > >
> > > Thoughts?
> >
> > I am concerned that modifying it this way may result in weaker control
> > over hugetlb pages. Administrators will no longer be able to increase
> > surplus pages through the nr_hugepages interface.
> >
> > Since surplus pages depend on the state of programs in the entire
> > system, adjusting nr_hugepages may lead to an unexpected number of
> > hugetlb pages being allocated, which may lead to OOM.
>
> Sorry, I am not sure I understand your concerns.

I'm wrong, just ignore what I've said.

> Currently, the interfaces to set/update the number of hugetlb pages use
> the supplied count as the number of requested persistent pages. I am
> not suggesting any changes there (except the bug in node specific code
> you discovered). Rather, I am suggesting that we update the interfaces
> which show the number of hugepages (nr_hugepages) to display the number
> of persistent pages, to be consistent with the set/update interfaces.

I agree with you.
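
To make the read/write divergence concrete, here is a tiny userspace
model of the pool accounting. It is an illustration only: the field and
helper names mirror mm/hugetlb.c, but real allocation and freeing are
reduced to counter updates.

#include <stdio.h>

/*
 * Minimal userspace model of hugetlb pool accounting, for illustration
 * only. Names mirror mm/hugetlb.c, but this is not kernel code.
 */
struct hstate {
	unsigned long nr_huge_pages;		/* total pages in the pool */
	unsigned long surplus_huge_pages;	/* overcommitted pages */
};

/* The kernel's definition, quoted above. */
#define persistent_huge_pages(h) \
	((h)->nr_huge_pages - (h)->surplus_huge_pages)

/* What most "show" interfaces report today: the total. */
static unsigned long nr_hugepages_show(const struct hstate *h)
{
	return h->nr_huge_pages;
}

/* What a write to nr_hugepages means: a target for the persistent pool. */
static void set_max_huge_pages(struct hstate *h, unsigned long count)
{
	while (count > persistent_huge_pages(h))
		h->nr_huge_pages++;	/* model: allocate a fresh page */
	while (count < persistent_huge_pages(h))
		h->nr_huge_pages--;	/* model: free an unused page; the
					 * kernel would convert in-use pages
					 * to surplus instead */
}

int main(void)
{
	/* 100 pages total, 40 of them surplus -> 60 persistent. */
	struct hstate h = { .nr_huge_pages = 100, .surplus_huge_pages = 40 };

	set_max_huge_pages(&h, 60);	/* request 60 persistent pages */

	printf("persistent=%lu shown=%lu\n",
	       persistent_huge_pages(&h), nr_hugepages_show(&h));
	return 0;
}

Running it prints persistent=60 shown=100: the write is already
satisfied, yet every "show" interface except the sysctl would report
100.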

> > About the definition of /proc/sys/vm/nr_huge_pages and the meaning of
> > "persistent", the documentation is kind of ambiguous.
> >
> > The documentation says:
> >
> > "The ``/proc/meminfo`` file provides information about the total number
> > of persistent hugetlb pages in the kernel's huge page pool."
> >
> > "Caveat: Shrinking the persistent huge page pool via ``nr_hugepages``
> > such that it becomes less than the number of huge pages in use will
> > convert the balance of the in-use huge pages to surplus huge pages."
> >
> > "The ``/proc`` interfaces discussed above have been retained for
> > backwards compatibility."
> >
> > The ambiguities are:
> > 1. HugePages_Total in /proc/meminfo is actually the total number of
> >    hugetlb pages.
>
> Correct. Although the documentation states it is the number of
> persistent hugetlb pages. meminfo also contains the number of surplus
> huge pages. So, it is possible that one could see
>
> HugePages_Total:       0
> HugePages_Surp:      100

It's easy to fix.
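
For illustration, the direction could be as small as this hypothetical
sketch, paraphrasing the kernel's hugetlb_report_meminfo(); it is not
the actual patch, and it omits the other fields the real function
prints:

/*
 * Hypothetical sketch, paraphrasing hugetlb_report_meminfo(). It is
 * assumed to live in mm/hugetlb.c, where persistent_huge_pages() and
 * default_hstate are visible; not the actual patch.
 */
#include <linux/seq_file.h>

void hugetlb_report_meminfo(struct seq_file *m)
{
	struct hstate *h = &default_hstate;

	seq_printf(m,
		   "HugePages_Total:   %5lu\n"	/* was: h->nr_huge_pages */
		   "HugePages_Surp:    %5lu\n",
		   persistent_huge_pages(h),
		   h->surplus_huge_pages);
}

Reporting persistent_huge_pages() would make HugePages_Total match what
a write to nr_hugepages actually controls.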

> Ideally, one would want to know the value for overcommit hugepages as
> well.

It will be straightforward to achieve this.

> The sysfs directories /sys/kernel/mm/hugepages/hugepages-*/ contain
> both the surplus and overcommit counts.
>
> Node specific sysfs directories only contain surplus counts.

Node specific sysfs directories don't contain resv_hugepages either.
After resolving this issue, I will attempt to assess the feasibility of
implementing node-specific reservations and overcommitment.

> > 2. If nr_hugepages means persistent hugetlb pages, converting in-use
> >    huge pages to surplus huge pages is impossible.
>
> I am not sure I understand. When writing to nr_hugepages today, it does
> mean persistent hugetlb pages. Are you suggesting we change it to mean
> total hugetlb pages when writing/updating? I do not think that is the
> case, as none of your proposed changes do this.

Still, I'm wrong.

> > 3. As you know, backward compatibility is not retained.
> >
> > Given that the document needs to be modified anyway, why not make the
> > interface more user-friendly?
>
> In any case, I agree the document should be updated to match the code.
> It should also define persistent hugetlb pages.

Yes, I'll add it in the v2 patch.

> Thank you,
> --
> Mike Kravetz