Date: Fri, 13 Jun 2025 15:07:33 -0700
In-Reply-To: <683ba0fe64dd5_13031529421@iweiny-mobl.notmuch> (message from Ira Weiny on Sat, 31 May 2025 19:38:22 -0500)
Subject: Re: [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio()
From: Ackerley Tng
To: Ira Weiny
Cc: kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	x86@kernel.org, linux-fsdevel@vger.kernel.org, aik@amd.com,
	ajones@ventanamicro.com, akpm@linux-foundation.org, amoorthy@google.com,
	anthony.yznaga@oracle.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	bfoster@redhat.com, binbin.wu@linux.intel.com, brauner@kernel.org,
	catalin.marinas@arm.com, chao.p.peng@intel.com, chenhuacai@kernel.org,
	dave.hansen@intel.com, david@redhat.com, dmatlack@google.com,
	dwmw@amazon.co.uk, erdemaktas@google.com, fan.du@intel.com,
	fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
	jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com,
	jhubbard@nvidia.com, jroedel@suse.de, jthoughton@google.com,
	jun.miao@intel.com, kai.huang@intel.com, keirf@google.com,
	kent.overstreet@linux.dev, kirill.shutemov@intel.com,
	liam.merwick@oracle.com, maciej.wieczor-retman@intel.com,
	mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
	michael.roth@amd.com, mpe@ellerman.id.au, muchun.song@linux.dev,
	nikunj@amd.com, nsaenz@amazon.es, oliver.upton@linux.dev,
	palmer@dabbelt.com, pankaj.gupta@amd.com, paul.walmsley@sifive.com,
	pbonzini@redhat.com, pdurrant@amazon.co.uk, peterx@redhat.com,
	pgonda@google.com, pvorel@suse.cz, qperret@google.com,
	quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
	quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com,
	rick.p.edgecombe@intel.com,
	rientjes@google.com, roypat@amazon.co.uk, rppt@kernel.org,
	seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
	steven.sistare@oracle.com, suzuki.poulose@arm.com, tabba@google.com,
	thomas.lendacky@amd.com, usama.arif@bytedance.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
	willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
	yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com

Ira Weiny writes:

> Ackerley Tng wrote:
>> Refactor out hugetlb_alloc_folio() from alloc_hugetlb_folio(), which
>> handles allocation of a folio and cgroup charging.
>>
>> Besides flags to control charging in the allocation process,
>> hugetlb_alloc_folio() also takes parameters for memory policy.
>>
>> This refactoring as a whole decouples hugetlb page allocation from
>> hugetlbfs, where (1) the subpool is stored at the fs mount, (2)
>> reservations are made during mmap and stored in the vma, (3) the mpol
>> must be stored at vma->vm_policy, and (4) a vma must be used for
>> allocation even if the pages are not meant to be used by the host
>> process.
>>
>> This decoupling will allow hugetlb_alloc_folio() to be used by
>> guest_memfd in later patches. In guest_memfd, (1) a subpool is
>> created per-fd and is stored on the inode, (2) no vma-related
>> reservations are used, and (3) the mpol may not be associated with a
>> vma, since (4) for private pages, the pages will not be mappable to
>> userspace and hence have no associated vmas.
>>
>> This could hopefully also open hugetlb up as a more generic source of
>> hugetlb pages that are not bound to hugetlbfs, with the complexities
>> of userspace/mmap/vma-related reservations confined to hugetlbfs.
>>
>> Signed-off-by: Ackerley Tng
>> Change-Id: I60528f246341268acbf0ed5de7752ae2cacbef93
>> ---
>>  include/linux/hugetlb.h |  12 +++
>>  mm/hugetlb.c            | 192 ++++++++++++++++++++++------------
>>  2 files changed, 118 insertions(+), 86 deletions(-)
>>
>
> [snip]
>
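For context (not part of this patch): once allocation is decoupled from
hugetlbfs like this, a vma-less caller such as guest_memfd could use the
new helper roughly as sketched below. The gmem_alloc_hugetlb() wrapper
is made up for illustration, and the flag choices are illustrative
only; a real caller would pick them based on whether its subpool
already holds a reservation for this allocation.

	struct folio *gmem_alloc_hugetlb(struct hstate *h,
					 struct mempolicy *mpol, pgoff_t ilx)
	{
		/*
		 * No VMA exists, so mpol/ilx come from the caller
		 * (e.g. computed from the inode) rather than from
		 * vma->vm_policy. Flag choices below are illustrative.
		 */
		return hugetlb_alloc_folio(h, mpol, ilx,
					   /*charge_cgroup_rsvd=*/true,
					   /*use_existing_reservation=*/false);
	}
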
>> +/**
>> + * hugetlb_alloc_folio() - Allocates a hugetlb folio.
>> + *
>> + * @h: struct hstate to allocate from.
>> + * @mpol: struct mempolicy to apply for this folio allocation.
>> + * @ilx: Interleave index for interpretation of @mpol.
>> + * @charge_cgroup_rsvd: Set to true to charge cgroup reservation.
>> + * @use_existing_reservation: Set to true if this allocation should use an
>> + *                           existing hstate reservation.
>> + *
>> + * This function handles cgroup and global hstate reservations. VMA-related
>> + * reservations and subpool debiting must be handled by the caller if necessary.
>> + *
>> + * Return: folio on success or negated error otherwise.
>> + */
>> +struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
>> +				  pgoff_t ilx, bool charge_cgroup_rsvd,
>> +				  bool use_existing_reservation)
>> +{
>> +	unsigned int nr_pages = pages_per_huge_page(h);
>> +	struct hugetlb_cgroup *h_cg = NULL;
>> +	struct folio *folio = NULL;
>> +	nodemask_t *nodemask;
>> +	gfp_t gfp_mask;
>> +	int nid;
>> +	int idx;
>> +	int ret;
>> +
>> +	idx = hstate_index(h);
>> +
>> +	if (charge_cgroup_rsvd) {
>> +		if (hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg))
>> +			goto out;
>
> Why not just return here?
>
> 	return ERR_PTR(-ENOSPC);
>

I wanted to consistently exit the function on errors at the same place,
and also make this refactoring look like I just took the middle of
alloc_hugetlb_folio() out as much as possible.
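To be concrete, the shape being kept is the usual single-exit unwind
idiom; a generic sketch (helper names made up, not code from this
patch):

	struct foo *alloc_foo(void)
	{
		struct foo *f = NULL;

		if (charge_a())
			goto out;		/* nothing to unwind yet */
		if (charge_b())
			goto out_uncharge_a;	/* unwind first charge */

		f = do_alloc();
		if (!f)
			goto out_uncharge_b;	/* unwind both charges */

		return f;

	out_uncharge_b:
		uncharge_b();
	out_uncharge_a:
		uncharge_a();
	out:
		return ERR_PTR(-ENOSPC);	/* single error exit */
	}

Early returns at each failure site would mean duplicating the uncharge
calls instead of falling through the labels in reverse order.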
>> +	}
>> +
>> +	if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg))
>> +		goto out_uncharge_cgroup_reservation;
>> +
>> +	gfp_mask = htlb_alloc_mask(h);
>> +	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
>> +
>> +	spin_lock_irq(&hugetlb_lock);
>> +
>> +	if (use_existing_reservation || available_huge_pages(h))
>> +		folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
>> +
>> +	if (!folio) {
>> +		spin_unlock_irq(&hugetlb_lock);
>> +		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
>> +		if (!folio)
>> +			goto out_uncharge_cgroup;
>> +		spin_lock_irq(&hugetlb_lock);
>> +		list_add(&folio->lru, &h->hugepage_activelist);
>> +		folio_ref_unfreeze(folio, 1);
>> +		/* Fall through */
>> +	}
>> +
>> +	if (use_existing_reservation) {
>> +		folio_set_hugetlb_restore_reserve(folio);
>> +		h->resv_huge_pages--;
>> +	}
>> +
>> +	hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio);
>> +
>> +	if (charge_cgroup_rsvd)
>> +		hugetlb_cgroup_commit_charge_rsvd(idx, nr_pages, h_cg, folio);
>> +
>> +	spin_unlock_irq(&hugetlb_lock);
>> +
>> +	gfp_mask = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
>> +	ret = mem_cgroup_charge_hugetlb(folio, gfp_mask);
>> +	/*
>> +	 * Unconditionally increment NR_HUGETLB here. If it turns out that
>> +	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
>> +	 * decrement NR_HUGETLB.
>> +	 */
>> +	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
>> +
>> +	if (ret == -ENOMEM) {
>> +		free_huge_folio(folio);
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>> +
>> +	return folio;
>> +
>> +out_uncharge_cgroup:
>> +	hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
>> +out_uncharge_cgroup_reservation:
>> +	if (charge_cgroup_rsvd)
>> +		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg);
>
> I find the direct copy of the unwind logic from alloc_hugetlb_folio()
> cumbersome and it seems like a good opportunity to clean it up.
>

I really wanted to make this refactoring look like I just took the
middle of alloc_hugetlb_folio() out as much as possible, to make it
obvious and understandable. I think the cleanup can be a separate patch
(series?)

>> +out:
>> +	folio = ERR_PTR(-ENOSPC);
>> +	goto out;
>
> Endless loop?
>

Thanks, this should have been

	return folio;

>
> Ira
>
> [snip]