From: David Hildenbrand
Organization: Red Hat
To: Mike Kravetz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi, David Rientjes, "Aneesh Kumar K . V", Andrew Morton
Subject: Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
Date: Fri, 24 Sep 2021 11:28:54 +0200
Message-ID: <4fcf5b61-8c01-0fa9-7541-afa755a81039@redhat.com>
In-Reply-To: <20210923175347.10727-2-mike.kravetz@oracle.com>
References: <20210923175347.10727-1-mike.kravetz@oracle.com> <20210923175347.10727-2-mike.kravetz@oracle.com>

On 23.09.21 19:53, Mike Kravetz wrote:
> Two new sysfs files are added to demote hugetlb pages.
> These files are both per-hugetlb page size and per node.  Files are:
>   demote_size - The size in Kb that pages are demoted to. (read-write)
>   demote - The number of huge pages to demote. (write-only)
>
> By default, demote_size is the next smallest huge page size.  Valid huge
> page sizes less than huge page size may be written to this file.  When
> huge pages are demoted, they are demoted to this size.
>
> Writing a value to demote will result in an attempt to demote that
> number of hugetlb pages to an appropriate number of demote_size pages.
>
> NOTE: Demote interfaces are only provided for huge page sizes if there
> is a smaller target demote huge page size.  For example, on x86 1GB huge
> pages will have demote interfaces.  2MB huge pages will not have demote
> interfaces.
>
> This patch does not provide full demote functionality.  It only provides
> the sysfs interfaces.
>
> It also provides documentation for the new interfaces.
>
> Signed-off-by: Mike Kravetz
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst |  30 +++-
>  include/linux/hugetlb.h                      |   1 +
>  mm/hugetlb.c                                 | 155 ++++++++++++++++++-
>  3 files changed, 183 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144e44..0e123a347e1e 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -234,8 +234,12 @@ will exist, of the form::
>
>  	hugepages-${size}kB
>
> -Inside each of these directories, the same set of files will exist::
> +Inside each of these directories, the set of files contained in ``/proc``
> +will exist.
> +In addition, two additional interfaces for demoting huge
> +pages may exist::
>
> +	demote
> +	demote_size
>  	nr_hugepages
>  	nr_hugepages_mempolicy
>  	nr_overcommit_hugepages
> @@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
>  	resv_hugepages
>  	surplus_hugepages
>
> -which function as described above for the default huge page-sized case.
> +The demote interfaces provide the ability to split a huge page into
> +smaller huge pages.  For example, the x86 architecture supports both
> +1GB and 2MB huge page sizes.  A 1GB huge page can be split into 512
> +2MB huge pages.  Demote interfaces are not available for the smallest
> +huge page size.  The demote interfaces are:
> +
> +demote_size
> +	is the size of demoted pages.  When a page is demoted a corresponding
> +	number of huge pages of demote_size will be created.  By default,
> +	demote_size is set to the next smaller huge page size.  If there are
> +	multiple smaller huge page sizes, demote_size can be set to any of
> +	these smaller sizes.  Only huge page sizes less than the current huge
> +	page size are allowed.
> +
> +demote
> +	is used to demote a number of huge pages.  A user with root privileges
> +	can write to this file.  It may not be possible to demote the
> +	requested number of huge pages.  To determine how many pages were
> +	actually demoted, compare the value of nr_hugepages before and after
> +	writing to the demote interface.  demote is a write only interface.
> +
> +The interfaces which are the same as in ``/proc`` (all except demote and
> +demote_size) function as described above for the default huge page-sized case.
>
>  .. _mem_policy_and_hp_alloc:
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..f2c3979efd69 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -596,6 +596,7 @@ struct hstate {
>  	int next_nid_to_alloc;
>  	int next_nid_to_free;
>  	unsigned int order;
> +	unsigned int demote_order;
>  	unsigned long mask;
>  	unsigned long max_huge_pages;
>  	unsigned long nr_huge_pages;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6378c1066459..c76ee0bd6374 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>
>  static void __init hugetlb_init_hstates(void)
>  {
> -	struct hstate *h;
> +	struct hstate *h, *h2;
>
>  	for_each_hstate(h) {
>  		if (minimum_order > huge_page_order(h))
> @@ -2995,6 +2995,17 @@ static void __init hugetlb_init_hstates(void)
>  		/* oversize hugepages were init'ed in early boot */
>  		if (!hstate_is_gigantic(h))
>  			hugetlb_hstate_alloc_pages(h);
> +
> +		/*
> +		 * Set demote order for each hstate.  Note that
> +		 * h->demote_order is initially 0.
> +		 */
> +		for_each_hstate(h2) {
> +			if (h2 == h)
> +				continue;
> +			if (h2->order < h->order && h2->order > h->demote_order)
> +				h->demote_order = h2->order;
> +		}
>  	}
>  	VM_BUG_ON(minimum_order == UINT_MAX);
>  }
> @@ -3235,9 +3246,29 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  	return 0;
>  }
>
> +static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> +	__must_hold(&hugetlb_lock)
> +{
> +	int rc = 0;
> +
> +	lockdep_assert_held(&hugetlb_lock);
> +
> +	/* We should never get here if no demote order */
> +	if (!h->demote_order)
> +		return rc;
> +
> +	/*
> +	 * TODO - demote functionality will be added in subsequent patch
> +	 */
> +	return rc;
> +}
> +
>  #define HSTATE_ATTR_RO(_name) \
>  	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>
> +#define HSTATE_ATTR_WO(_name) \
> +	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
> +
>  #define HSTATE_ATTR(_name) \
>  	static struct kobj_attribute _name##_attr = \
>  		__ATTR(_name, 0644, _name##_show, _name##_store)
> @@ -3433,6 +3464,112 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
>  }
>  HSTATE_ATTR_RO(surplus_hugepages);
>
> +static ssize_t demote_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	unsigned long nr_demote;
> +	unsigned long nr_available;
> +	nodemask_t nodes_allowed, *n_mask;
> +	struct hstate *h;
> +	int err;
> +	int nid;
> +
> +	err = kstrtoul(buf, 10, &nr_demote);
> +	if (err)
> +		return err;
> +	h = kobj_to_hstate(kobj, &nid);
> +
> +	/* Synchronize with other sysfs operations modifying huge pages */
> +	mutex_lock(&h->resize_lock);
> +
> +	spin_lock_irq(&hugetlb_lock);
> +	if (nid != NUMA_NO_NODE) {
> +		nr_available = h->free_huge_pages_node[nid];
> +		init_nodemask_of_node(&nodes_allowed, nid);
> +		n_mask = &nodes_allowed;
> +	} else {
> +		nr_available = h->free_huge_pages;
> +		n_mask = &node_states[N_MEMORY];
> +	}
> +	nr_available -= h->resv_huge_pages;
> +	if (nr_available <= 0)
> +		goto out;
> +	nr_demote = min(nr_available, nr_demote);
> +
> +	while (nr_demote) {
> +		if (!demote_pool_huge_page(h, n_mask))
> +			break;
> +
> +		/*
> +		 * We may have dropped the lock in the routines to
> +		 * demote/free a page.  Recompute nr_demote as counts could
> +		 * have changed and we want to make sure we do not demote
> +		 * a reserved huge page.
> +		 */
> +		nr_demote--;
> +		if (nid != NUMA_NO_NODE)
> +			nr_available = h->free_huge_pages_node[nid];
> +		else
> +			nr_available = h->free_huge_pages;
> +		nr_available -= h->resv_huge_pages;
> +		if (nr_available <= 0)
> +			nr_demote = 0;
> +		else
> +			nr_demote = min(nr_available, nr_demote);
> +	}
>

Wonder if you could compress that quite a bit:

...
spin_lock_irq(&hugetlb_lock);
if (nid != NUMA_NO_NODE) {
	init_nodemask_of_node(&nodes_allowed, nid);
	n_mask = &nodes_allowed;
} else {
	n_mask = &node_states[N_MEMORY];
}

while (nr_demote) {
	/*
	 * Update after each iteration because we might have temporarily
	 * dropped the lock and our counters changed.
	 */
	if (nid != NUMA_NO_NODE)
		nr_available = h->free_huge_pages_node[nid];
	else
		nr_available = h->free_huge_pages;
	nr_available -= h->resv_huge_pages;
	if (nr_available <= 0)
		break;
	if (!demote_pool_huge_page(h, n_mask))
		break;
	nr_demote--;
}
spin_unlock_irq(&hugetlb_lock);

Not sure if that "nr_demote = min(nr_available, nr_demote);" logic is
really required. Once nr_available hits <= 0 we'll just stop demoting.

-- 
Thanks,

David / dhildenb
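
As a footnote to the suggestion above, here is a minimal userspace sketch of the compressed loop shape, in plain C so it can be compiled outside the kernel. Everything in it is hypothetical scaffolding rather than code from the patch: struct mock_pool, mock_demote_one() and demote_up_to() are made-up names, and the per-node case and all locking are left out. The convention that a nonzero return means one page was demoted is only inferred from how the loop tests demote_pool_huge_page(). The sketch also keeps the available count in a signed long so that the "<= 0" test can see a negative value; the quoted patch declares nr_available as unsigned long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hstate counters the loop looks at. */
struct mock_pool {
	long free_huge_pages;	/* free pages currently in the pool */
	long resv_huge_pages;	/* reserved pages, which must not be demoted */
};

/*
 * Hypothetical stand-in for demote_pool_huge_page(): removes one free
 * page from the pool. Returns 1 if a page was demoted, 0 otherwise
 * (assumed convention, see the note above).
 */
static int mock_demote_one(struct mock_pool *pool)
{
	if (pool->free_huge_pages <= 0)
		return 0;
	pool->free_huge_pages--;
	return 1;
}

/*
 * Compressed loop shape: recompute the available count at the top of
 * every pass and stop as soon as nothing unreserved is left. There is
 * no up-front min() clamp on nr_demote.
 * Returns the number of pages actually demoted.
 */
static unsigned long demote_up_to(struct mock_pool *pool,
				  unsigned long nr_demote)
{
	unsigned long demoted = 0;

	assert(pool != NULL);

	while (nr_demote) {
		/* Signed on purpose so a deficit shows up as negative. */
		long nr_available = pool->free_huge_pages -
				    pool->resv_huge_pages;

		if (nr_available <= 0)
			break;
		if (!mock_demote_one(pool))
			break;
		nr_demote--;
		demoted++;
	}
	return demoted;
}
```

With 5 free pages of which 2 are reserved, a request for 10 stops after 3 demotions, because the recomputed count reaches 0 on the fourth pass. A request for 2 against the same pool demotes exactly 2.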