Date: Thu, 13 Feb 2025 07:52:43 +0000
In-Reply-To: (message from Peter Xu on Sun, 1 Dec 2024 12:55:36 -0500)
Subject: Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
From: Ackerley Tng
To: Peter Xu
Cc: tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk,
	jgg@nvidia.com, david@redhat.com, rientjes@google.com, fvdl@google.com,
	jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com,
	zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com,
	isaku.yamahata@intel.com, muchun.song@linux.dev, mike.kravetz@oracle.com,
	erdemaktas@google.com, vannapurve@google.com, qperret@google.com,
	jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org,
	brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev,
	pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com,
	anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com,
	vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com,
	oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-fsdevel@kvack.org

Peter Xu writes:

> On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
>> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
>> +						   struct hugepage_subpool *spool)
>> +{
>> +	bool memcg_charge_was_prepared;
>> +	struct mem_cgroup *memcg;
>> +	struct mempolicy *mpol;
>> +	nodemask_t *nodemask;
>> +	struct folio *folio;
>> +	gfp_t gfp_mask;
>> +	int ret;
>> +	int nid;
>> +
>> +	gfp_mask = htlb_alloc_mask(h);
>> +
>> +	memcg = get_mem_cgroup_from_current();
>> +	ret = mem_cgroup_hugetlb_try_charge(memcg,
>> +					    gfp_mask | __GFP_RETRY_MAYFAIL,
>> +					    pages_per_huge_page(h));
>> +	if (ret == -ENOMEM)
>> +		goto err;
>> +
>> +	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
>> +
>> +	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
>> +	if (hugepage_subpool_get_pages(spool, 1))
>> +		goto err_cancel_charge;
>> +
>> +	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
>> +					      &nodemask);
>> +	/*
>> +	 * charge_cgroup_reservation is false because we didn't make any cgroup
>> +	 * reservations when creating the guest_memfd subpool.
>
> Hmm.. isn't this the exact reason to set charge_cgroup_reservation==true
> instead?
>
> IIUC gmem hugetlb pages should participate in the hugetlb cgroup resv
> charge as well. It is already involved in the rest cgroup charges, and I
> wonder whether it's intended that the patch treated the resv accounting
> specially.
>
> Thanks,
>

Thank you for your careful reviews!

I misunderstood charging a cgroup for hugetlb reservations when I was
working on this patch.

Before this, I thought hugetlb_cgroup_charge_cgroup_rsvd() was only for
resv_map reservations, so I set charge_cgroup_reservation to false since
guest_memfd didn't use resv_map, but I understand better now. Please
help me check my understanding:

+ All reservations are made at the hstate
+ In addition, every reservation is associated with a subpool (through
  spool->rsv_hpages) or recorded in a resv_map
+ Reservations are either in a subpool or in a resv_map but not both
+ hugetlb_cgroup_charge_cgroup_rsvd() is for any reservation

Regarding the time that a cgroup is charged for reservations:

+ If a reservation is made during subpool creation, the cgroup is not
  charged during the reservation by the subpool, probably by design
  since the process doing the mount may not be the process using the
  pages
+ Charging a cgroup for the reservation happens in
  hugetlb_reserve_pages(), which is called at mmap() time.

For guest_memfd, I see two options:

Option 1: Charge cgroup for reservations at fault time

Pros:

+ Similar in behavior to an fd on a hugetlbfs mount, where the cgroup of
  the process calling fallocate() is charged for the reservation.
+ Symmetric approach, since uncharging happens when the hugetlb folio is
  freed.

Cons:

+ Room for allocation failure after guest_memfd creation. Even though
  this guest_memfd was created with a subpool and pages have been
  reserved, there is a chance of hitting the cgroup's hugetlb
  reservation cap and failing to allocate a page.

Option 2 (preferred): Charge cgroup for reservations at guest_memfd
creation time

Pros:

+ Once the guest_memfd file is created, a page is guaranteed at fault
  time.
+ Simplifies/doesn't carry over the complexities of the hugetlb(fs)
  reservation system

Cons:

+ The cgroup being charged is the cgroup of the process creating
  guest_memfd, which might be an issue if users expect the process
  faulting the page to be charged.

Implementation:

+ At guest_memfd creation time, when creating the subpool, charge the
  cgroups for everything:
   + for hugetlb usage
   + hugetlb reservation usage and
   + hugetlb usage by page count (as in mem_cgroup_charge_hugetlb(),
     which is new since [1])
+ Refactoring in [1] would be focused on just dequeueing a folio or,
  failing that, allocating a surplus folio.
+ After allocation, don't set cgroup on the folio so that the freeing
  process doesn't uncharge anything
+ Uncharge when the file is closed

Please let me know if anyone has any thoughts/suggestions!

>> +	 *
>> +	 * use_hstate_resv is true because we reserved from global hstate when
>> +	 * creating the guest_memfd subpool.
>> +	 */
>> +	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
>> +	mpol_cond_put(mpol);
>> +
>> +	if (!folio)
>> +		goto err_put_pages;
>> +
>> +	hugetlb_set_folio_subpool(folio, spool);
>> +
>> +	if (memcg_charge_was_prepared)
>> +		mem_cgroup_commit_charge(folio, memcg);
>> +
>> +out:
>> +	mem_cgroup_put(memcg);
>> +
>> +	return folio;
>> +
>> +err_put_pages:
>> +	hugepage_subpool_put_pages(spool, 1);
>> +
>> +err_cancel_charge:
>> +	if (memcg_charge_was_prepared)
>> +		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
>> +
>> +err:
>> +	folio = ERR_PTR(-ENOMEM);
>> +	goto out;
>> +}

[1] https://lore.kernel.org/all/7348091f4c539ed207d9bb0f3744d0f0efb7f2b3.1726009989.git.ackerleytng@google.com/
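
To make the difference between the two options more concrete, below is
a toy userspace model of the accounting only. It is a sketch under
stated assumptions, not kernel code: every toy_* name is hypothetical,
the cgroup is reduced to a single reservation counter with a cap, and
"close" stands in for freeing all folios. It just shows where a charge
can fail: at fault time under Option 1, at creation time under
Option 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup's hugetlb reservation counter. */
struct toy_cgroup {
	size_t rsvd_usage;	/* pages currently charged */
	size_t rsvd_limit;	/* cap on charged pages */
};

/* Hypothetical stand-in for a guest_memfd file with its subpool. */
struct toy_gmem {
	struct toy_cgroup *cg;
	size_t reserved;	/* pages reserved in the subpool at creation */
	size_t allocated;	/* pages handed out at fault time so far */
	bool charged_at_create;	/* true models Option 2, false Option 1 */
};

/* Charge @pages to @cg, failing if that would exceed the cap. */
static bool toy_charge(struct toy_cgroup *cg, size_t pages)
{
	if (cg->rsvd_limit - cg->rsvd_usage < pages)
		return false;
	cg->rsvd_usage += pages;
	return true;
}

static void toy_uncharge(struct toy_cgroup *cg, size_t pages)
{
	assert(cg->rsvd_usage >= pages);
	cg->rsvd_usage -= pages;
}

/*
 * Creation always reserves @pages in the subpool. Under Option 2 the
 * creator's cgroup is also charged for all of them up front, so
 * creation itself can fail on the cap.
 */
static bool toy_gmem_create(struct toy_gmem *g, struct toy_cgroup *cg,
			    size_t pages, bool charge_at_create)
{
	if (charge_at_create && !toy_charge(cg, pages))
		return false;

	g->cg = cg;
	g->reserved = pages;
	g->allocated = 0;
	g->charged_at_create = charge_at_create;
	return true;
}

/*
 * Fault takes one page from the subpool. Under Option 1 the charge
 * happens here and may fail even though the page was reserved.
 */
static bool toy_gmem_fault(struct toy_gmem *g)
{
	if (g->allocated == g->reserved)
		return false;
	if (!g->charged_at_create && !toy_charge(g->cg, 1))
		return false;

	g->allocated++;
	return true;
}

/*
 * Close drops everything. Option 2 uncharges what was charged at
 * creation; Option 1 uncharges one page per allocated folio.
 */
static void toy_gmem_close(struct toy_gmem *g)
{
	toy_uncharge(g->cg, g->charged_at_create ? g->reserved : g->allocated);
	g->reserved = 0;
	g->allocated = 0;
}
```

With a cap of two pages and two files of two pages each, the model lets
both files be created under Option 1 but fails a fault on the second
file once the first has used the cap. Under Option 2 the second
creation fails instead, and every fault on the first file succeeds.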