From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 0E19AC021A4
	for <linux-mm@archiver.kernel.org>; Thu, 13 Feb 2025 16:49:08 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 788716B0082; Thu, 13 Feb 2025 11:49:07 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 738486B0083; Thu, 13 Feb 2025 11:49:07 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 5D91D6B0088; Thu, 13 Feb 2025 11:49:07 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14])
	by kanga.kvack.org (Postfix) with ESMTP id 436AA6B0082
	for <linux-mm@kvack.org>; Thu, 13 Feb 2025 11:49:07 -0500 (EST)
Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id BC19B12133E
	for <linux-mm@kvack.org>; Thu, 13 Feb 2025 16:49:06 +0000 (UTC)
X-FDA: 83115506292.11.97B37E0
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by imf01.hostedemail.com (Postfix) with ESMTP id 6D7634000F
	for <linux-mm@kvack.org>; Thu, 13 Feb 2025 16:49:04 +0000 (UTC)
Authentication-Results: imf01.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=ZL3vGrU6;
	spf=pass (imf01.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1739465344; a=rsa-sha256;
	cv=none;
	b=P9jMdD1hVvWAlmIcwis11zDxlQysdwbwsfHZGa9hMadGyYsKSLapcq5tPEIFRq2vaNtGJD
	mhg/XuyQNB1KGC+eKgS680Nh4NP4E+5x4B13TBnOk2FEPD1+2LvQVOxNFwx2KefTG1DCfM
	yrt5EHujxfsiyXJYdJMbbDuAiE+U7ak=
ARC-Authentication-Results: i=1;
	imf01.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=ZL3vGrU6;
	spf=pass (imf01.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1739465344;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=JROJFeKJG/7QhErL5KNFi8pxb9777JdDet+vnroXIfk=;
	b=K713Dr7N5T9mn62dfq3jaHSti6yLNFhM1f9yF7cM3ZHaXioPAg6YCnYkx1uMlCJatE70XB
	+RhOZf7jwwJS2WVsNEqJNcToAJJkOyz+9L3JMXd+EzccrXOj0dXX7qxGRXg6NV+p3UJdYI
	GW+ngl1sWZIBY5F/zIaRRAWUpvB3deY=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1739465343;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=JROJFeKJG/7QhErL5KNFi8pxb9777JdDet+vnroXIfk=;
	b=ZL3vGrU6c9ZoM0y2gJutAFcYnrNZDGKizWA/zq3bss1xcrT4ErR1YhgI6k7tVizcnkavxT
	4vqv4ZaqJxa+anGIpsOYF5kyBtB1vu8GQKSmFAUQ64MF+agg2SyIB5pLTUqbT3YYSB17rK
	WqMK9zo7pt6i4bgqYWEiZXU8C8jSVAE=
Received: from mail-qt1-f198.google.com (mail-qt1-f198.google.com
 [209.85.160.198]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-557-_mT0GeBrOa246jwJ73cZpQ-1; Thu, 13 Feb 2025 11:49:02 -0500
X-MC-Unique: _mT0GeBrOa246jwJ73cZpQ-1
X-Mimecast-MFC-AGG-ID: _mT0GeBrOa246jwJ73cZpQ
Received: by mail-qt1-f198.google.com with SMTP id d75a77b69052e-471b696f5f7so28716901cf.1
        for <linux-mm@kvack.org>; Thu, 13 Feb 2025 08:49:02 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1739465342; x=1740070142;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date
         :message-id:reply-to;
        bh=JROJFeKJG/7QhErL5KNFi8pxb9777JdDet+vnroXIfk=;
        b=e9Jh+KT3ONtTAiuIeLHvOkJkfnvjkJQL2aDKcTu+tjxq6knq4pJiCaSHCTHv64c+F1
         7AbAXQ9HN0uHnPvfMkKxkQHnzCGcJOasMnRHODBqmDbSAbi2LpTsDLEMBaSCA1IxQ/XE
         jJ2udyZxgIKSQ29qsvJfA1veCKyeX7pXCArst5PA/O/zjncdvWWGrOnBtW0y/yXFqymj
         NE1dSMsMfJmG1GwxbBr3SZporTv3Nipttq/qu+tbZGg0vVWKzAo4RM7CmvP9z5vXZJLG
         +GVAUVGHlvcZa3PEFHrZA/QRZWWqD7CWddZk4OzlRCuXhxinXBhezQqWFTDTbZ3SmT77
         S5EA==
X-Forwarded-Encrypted: i=1; AJvYcCUFqE//Qd8wSnlX7AiS7Z0EAy6dzop4AtyYbyj7opPoyLllRKsFRvAEQCBeJZb+9oX3fhX098l3jw==@kvack.org
X-Gm-Message-State: AOJu0YxcuoLH10peNgVCj4Zw75rloex7m/rgizCQyUgzPSWH2PpGBQY0
	GpWUkIdhoggdNeA4tb3eVpCCqG3LJvESV6sgbqkgpSGgmGcew9Gcx2wb2pDf6zkK8MKOI2Pf8fY
	6KdgGdtOXX57j7wghuAIfEG0xf2WziVFB8eUxRkKMyowA5CYf
X-Gm-Gg: ASbGnctO4GR8hOzlHFX5M1NBYMpqwoQ954DpTvC/k3VSq0+W6gDqrTwqfYu4u/kI2ui
	qjUyhwUtzEFXR4g053xlLaMxlP/QFuFMuvtArjridPfNevamYKMBcskunlOOycCw8zmCl6gU8gl
	PLOO6Kx7uAE8VGauBlqA/2LzElftfw3Qo7iwK1qQDlkmA+WX36rXPuTYbDpPvA4fxoJIWVrdXrM
	txnYrP7uwHCweyShobqqoKYmtK0xbP7o9PxPydcQ7toDtgww0w/BAjv5oM=
X-Received: by 2002:a05:622a:19a7:b0:471:97ef:28c2 with SMTP id d75a77b69052e-471bee88bf9mr76398291cf.50.1739465341887;
        Thu, 13 Feb 2025 08:49:01 -0800 (PST)
X-Google-Smtp-Source: AGHT+IGH90wp0Io4FsbD9Y9r1xlzUmOh6kR8o9MWqYgVTnqP+eNEwiXXiw94ZJu2zck2E6XolQdT0w==
X-Received: by 2002:a05:622a:19a7:b0:471:97ef:28c2 with SMTP id d75a77b69052e-471bee88bf9mr76397931cf.50.1739465341491;
        Thu, 13 Feb 2025 08:49:01 -0800 (PST)
Received: from x1.local ([2604:7a40:2041:2b00::1000])
        by smtp.gmail.com with ESMTPSA id d75a77b69052e-471c2af36f2sm9540111cf.54.2025.02.13.08.48.59
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 13 Feb 2025 08:49:00 -0800 (PST)
Date: Thu, 13 Feb 2025 11:48:57 -0500
From: Peter Xu <peterx@redhat.com>
To: Ackerley Tng <ackerleytng@google.com>
Cc: tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk,
	jgg@nvidia.com, david@redhat.com, rientjes@google.com,
	fvdl@google.com, jthoughton@google.com, seanjc@google.com,
	pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com,
	jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev,
	mike.kravetz@oracle.com, erdemaktas@google.com,
	vannapurve@google.com, qperret@google.com, jhubbard@nvidia.com,
	willy@infradead.org, shuah@kernel.org, brauner@kernel.org,
	bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz,
	rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org,
	haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com,
	maciej.wieczor-retman@intel.com, pgonda@google.com,
	oliver.upton@linux.dev, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org
Subject: Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and
 truncate from hugetlb
Message-ID: <Z64ieVfqTL2Wtqa5@x1.local>
References: <Z0yjGA25b8TfLMnd@x1n>
 <diqz8qqa9t84.fsf@ackerleytng-ctop.c.googlers.com>
MIME-Version: 1.0
In-Reply-To: <diqz8qqa9t84.fsf@ackerleytng-ctop.c.googlers.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: 5FQdk8gZM9-d-5PV6xatFgMouh6YnJP8OpaFw48lQuA_1739465342
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
X-Rspamd-Queue-Id: 6D7634000F
X-Stat-Signature: c7ak6mqujcdkmkznc3tse17gnhawq3yx
X-Rspam-User: 
X-Rspamd-Server: rspam10
X-HE-Tag: 1739465344-835850
X-HE-Meta: U2FsdGVkX1+1S2IUQsjiv7aea0ZQNTdHk8te80A5k57yXpJBBNGk3O/4wthLf9d22E10dQBA9fT2v0JSdE6RmOiDQTuN4ihtY97LeLQCEPlkA0pY1s+Pzo54ls3YJI5rZkP7DVyIrimb5JEN3bhq4//Kb6tOvDWC//NE4xA+lZ4pxZLneVwEY/xdHcTqfVLPiEMpxnu83ujRu4VfMXHIZjZhSJzjEE0jkCnUBvEUnzKY+6+8ZDGKgN2T0tSLeRUNdoIqGZH3Npv7gddDQczCjwZWxXVs6iaeppQQmyASoE50szh9hmv1jwop7gaG64W/o7IOuxuoelE9nENU54vSJ9OayGchjhjGPTYv/FHjhcEO6pgjj77Kh3iHtG7y0rOeoXqCHOCu5eslqMLO/ISg4YdH/XUEZ/NRs0PoChkuRRiJd65jCsr4SiLlu5Cv6eqMYnGJIf4qX8VIyocEDKiVhEB5uFnz6ZG4nlHmkh4OgWZp/7LUG1xwWQI7zwziAV/cTA9YeQuP4STIhlML2K6oJPfSK9hkhkpXU0CRHkkO79l/h4C0uEI207WUQEIO+NTGz58tHD9Ya7a0zfAqwWSdhoj5CU8oa8L1pWP7kuJn/V/kaV+RdR6rlQgipNWhClO7PrBzSAUIc1xslwK1/jbvW7nHaNMAOD0YPat+tBrTfBMpPkx6IALib4bveOdzv2375NnOOOpCjssHM6XX75brT6xAZdlWUtTk/xswbHr3k2tPfyrE8aCzsY9aT8r4AM6UzPKKZPfom38dLPwuDIBjdz/tEH9fvOXpf+LP4D1YmiPW2HZntcfjJ5jHfyQPi5l+Z2PEcMyY3/nG5nGKsIb65ePosBPGMsUKlusm4brDhl6Iy8hb/+72R0PW/bq886Fn4Tz9wSOPSceuSF1P+FqZO9w57WZenqfVj3p3a5S8BtNXLziA8AE2EaG94JAvagcmQk2/UfAHHiV6Y/urh9w
 0ys+OctF
 FQIRehLy+V7nfnAnyGu+mRvuUrs0Cj15KLBdRpsluj7AhuOlj6Q0pqt8Y+4naZ1fAPANYRVl3pE2CGIYBbxTgfzOcYzyiRvqfpO+qaahHGkL3J0OeL/7WdRNAupFR8wY/zVrMKwqtHXsHyB+GOjMkg7xhdOoCJ6v6kV7EDxeHITWHjtMXNCnLP8P50SDVBW5hu6PvTaLYnpYjXmpwyrOUNY//45NaS7Q0R8gNbwZacKtq3VLdg02YOepTZ7/z29Z2uRzmLcL58gBqnAlWqIWSb+KKSQLd6XPt1OVzTKuRcbsDd1OYfeRyMPUqAWTA2EvxbyHrwIWRo7Ym72EQCAb/XWwBbw73TPg8nBGt1e1uRnwIPc65kVWjDZwrE6+gfE3zMuZWJYONZ3oejZJa2BzPb9PCtJRg29iyF4F7V9smXHUTAarHanVXVBQJV2Xp+KJ5UwXczEab+D8Q5dbTTX6nM8vR+HDcOcua+ie4PxqqfTPy/Q6t1URzYXzEi4YJoAfCKUvzao3F9ZZD8q173uKitm56CA==
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Feb 13, 2025 at 07:52:43AM +0000, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
> >> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
> >> +						  struct hugepage_subpool *spool)
> >> +{
> >> +	bool memcg_charge_was_prepared;
> >> +	struct mem_cgroup *memcg;
> >> +	struct mempolicy *mpol;
> >> +	nodemask_t *nodemask;
> >> +	struct folio *folio;
> >> +	gfp_t gfp_mask;
> >> +	int ret;
> >> +	int nid;
> >> +
> >> +	gfp_mask = htlb_alloc_mask(h);
> >> +
> >> +	memcg = get_mem_cgroup_from_current();
> >> +	ret = mem_cgroup_hugetlb_try_charge(memcg,
> >> +					    gfp_mask | __GFP_RETRY_MAYFAIL,
> >> +					    pages_per_huge_page(h));
> >> +	if (ret == -ENOMEM)
> >> +		goto err;
> >> +
> >> +	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
> >> +
> >> +	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
> >> +	if (hugepage_subpool_get_pages(spool, 1))
> >> +		goto err_cancel_charge;
> >> +
> >> +	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
> >> +					      &nodemask);
> >> +	/*
> >> +	 * charge_cgroup_reservation is false because we didn't make any cgroup
> >> +	 * reservations when creating the guest_memfd subpool.
> >
> > Hmm.. isn't this the exact reason to set charge_cgroup_reservation==true
> > instead?
> >
> > IIUC gmem hugetlb pages should participate in the hugetlb cgroup resv
> > charge as well.  It is already involved in the rest cgroup charges, and I
> > wonder whether it's intended that the patch treated the resv accounting
> > specially.
> >
> > Thanks,
> >
> 
> Thank you for your careful reviews!
> 
> I misunderstood charging a cgroup for hugetlb reservations when I was
> working on this patch.
> 
> Before this, I thought hugetlb_cgroup_charge_cgroup_rsvd() was only for
> resv_map reservations, so I set charge_cgroup_reservation to false since
> guest_memfd didn't use resv_map, but I understand better now. Please
> help me check my understanding:
> 
> + All reservations are made at the hstate
> + In addition, every reservation is associated with a subpool (through
>   spool->rsv_hpages) or recorded in a resv_map
>     + Reservations are either in a subpool or in a resv_map but not both
> + hugetlb_cgroup_charge_cgroup_rsvd() is for any reservation
> 
> Regarding the time that a cgroup is charged for reservations:
> 
> + If a reservation is made during subpool creation, the cgroup is not
>   charged during the reservation by the subpool, probably by design
>   since the process doing the mount may not be the process using the
>   pages

Exactly.

> + Charging a cgroup for the reservation happens in
>   hugetlb_reserve_pages(), which is called at mmap() time.

Yes, or if it's not charged in hugetlb_reserve_pages() it needs to be
charged at folio allocation as of now.

> 
> For guest_memfd, I see two options:
> 
> Option 1: Charge cgroup for reservations at fault time
> 
> Pros:
> 
> + Similar in behavior to a fd on a hugetlbfs mount, where the cgroup of
>   the process calling fallocate() is charged for the reservation.
> + Symmetric approach, since uncharging happens when the hugetlb folio is
>   freed.
> 
> Cons:
> 
> + Room for allocation failure after guest_memfd creation. Even though
>   this guest_memfd had been created with a subpool and pages have been
>   reserved, there is a chance of hitting the cgroup's hugetlb
>   reservation cap and failing to allocate a page.
> 
> Option 2 (preferred): Charge cgroup for reservations at guest_memfd
> creation time
> 
> Pros:
> 
> + Once guest_memfd file is created, a page is guaranteed at fault time.

This would definitely be nice, that whatever that can block the guest from
using the memory should be a fault upfront when a VM boots if ever possible
(e.g. this is not a mmap() interface, so user yet doesn't allow NORESERVE).

It'll be slightly different from the spool use case of mount points, but I
think it's a new use case anyway, so IIUC we can define its behavior to
best suite the use case.

> + Simplifies/doesn't carry over the complexities of the hugetlb(fs)
>   reservation system
> 
> Cons:
> 
> + The cgroup being charged is the cgroup of the process creating
>   guest_memfd, which might be an issue if users expect the process
>   faulting the page to be charged.

Right, though I can't picture such use case yet.

I'm guessing multiple processes use of guest-memfd is still very far away.
When it happens, I would expect these tasks be put into the same cgroup..
Maybe kubevirt already have some of such use, I can go and have a check.

If they're not in the same cgroup, it's still more reasonable to always
charge that at the VM instance, rather than whatever other process that may
operate on the guest memory.

So it could be that we don't see major cons in solution 2.  In general, I
agree with your preference.

> 
> Implementation:
> 
> + At guest_memfd creation time, when creating the subpool, charge the
>   cgroups for everything:
>    + for hugetlb usage

I suppose here you meant the global reservation?  If so, I agree.

IIUC the new code shouldn't need to worry on this if the subpool is created
by the API, as that API does the global charging, like we discussed
elsewhere.

If you meant hugetlb_cgroup_commit_charge(),IMHO it should still be left
done until allocation.  In guest-memfd case, when fallocate().  AFAICT,
that's the only reason why we need two of such anyway..

>    + hugetlb reservation usage and

Agree on this one.

>    + hugetlb usage by page count (as in mem_cgroup_charge_hugetlb(),
>      which is new since [1])

This one should, IMHO, also be done only during allocation.

Thanks,

> + Refactoring in [1] would be focused on just dequeueing a folio or
>   failing which, allocating a surplus folio.
>    + After allocation, don't set cgroup on the folio so that the freeing
>      process doesn't uncharge anything
> + Uncharge when the file is closed
> 
> Please let me know if anyone has any thoughts/suggestions!
> 
> >> +	 *
> >> +	 * use_hstate_resv is true because we reserved from global hstate when
> >> +	 * creating the guest_memfd subpool.
> >> +	 */
> >> +	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
> >> +	mpol_cond_put(mpol);
> >> +
> >> +	if (!folio)
> >> +		goto err_put_pages;
> >> +
> >> +	hugetlb_set_folio_subpool(folio, spool);
> >> +
> >> +	if (memcg_charge_was_prepared)
> >> +		mem_cgroup_commit_charge(folio, memcg);
> >> +
> >> +out:
> >> +	mem_cgroup_put(memcg);
> >> +
> >> +	return folio;
> >> +
> >> +err_put_pages:
> >> +	hugepage_subpool_put_pages(spool, 1);
> >> +
> >> +err_cancel_charge:
> >> +	if (memcg_charge_was_prepared)
> >> +		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
> >> +
> >> +err:
> >> +	folio = ERR_PTR(-ENOMEM);
> >> +	goto out;
> >> +}
> 
> [1] https://lore.kernel.org/all/7348091f4c539ed207d9bb0f3744d0f0efb7f2b3.1726009989.git.ackerleytng@google.com/
> 

-- 
Peter Xu