From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <63b72487-6220-7c44-70fd-822681746737@bytedance.com>
Date: Fri, 8 Sep 2023 21:48:06 +0100
From: Usama Arif <usama.arif@bytedance.com>
Subject: Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
To: Mike Kravetz, Muchun Song
Cc: Linux-MM, "Mike Rapoport (IBM)", LKML, Muchun Song,
 fam.zheng@bytedance.com, liangma@liangbit.com, punit.agrawal@bytedance.com
In-Reply-To: <20230908182950.GA6564@monkey>
References: <20230906112605.2286994-1-usama.arif@bytedance.com>
 <20230906112605.2286994-5-usama.arif@bytedance.com>
 <20230907183712.GB3640@monkey>
 <20230908182950.GA6564@monkey>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 08/09/2023 19:29, Mike Kravetz wrote:
> On 09/08/23 10:39, Muchun Song wrote:
>>
>>
>>> On Sep 8, 2023, at 02:37, Mike Kravetz wrote:
>>>
>>> On 09/06/23 12:26, Usama Arif wrote:
>>>> The new boot flow when it comes to initialization of gigantic pages
>>>> is as follows:
>>>> - At boot time, for a gigantic page during __alloc_bootmem_huge_page,
>>>>   the region after the first struct page is marked as noinit.
>>>> - This results in only the first struct page being initialized in
>>>>   reserve_bootmem_region. As the tail struct pages are not initialized
>>>>   at this point, there can be a significant saving in boot time if HVO
>>>>   succeeds later on.
>>>> - Later on in the boot, the head page is prepped and the first
>>>>   HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct
>>>>   pages are initialized.
>>>> - HVO is attempted. If it is not successful, then the rest of the
>>>>   tail struct pages are initialized. If it is successful, no more
>>>>   tail struct pages need to be initialized, saving significant boot
>>>>   time.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>>>> ---
>>>>  mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
>>>>  mm/hugetlb_vmemmap.c |  2 +-
>>>>  mm/hugetlb_vmemmap.h |  9 ++++---
>>>>  mm/internal.h        |  3 +++
>>>>  mm/mm_init.c         |  2 +-
>>>>  5 files changed, 62 insertions(+), 15 deletions(-)
>>>
>>> As mentioned, in general this looks good. One small point below.
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index c32ca241df4b..540e0386514e 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>>>  	}
>>>>
>>>>  found:
>>>> +
>>>> +	/*
>>>> +	 * Only initialize the head struct page in memmap_init_reserved_pages;
>>>> +	 * the rest of the struct pages will be initialized by the HugeTLB
>>>> +	 * subsystem itself. The head struct page is used by the HugeTLB
>>>> +	 * subsystem to get folio information such as zone id and node id.
>>>> +	 */
>>>> +	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
>>>> +				      huge_page_size(h) - PAGE_SIZE);
>>>>  	/* Put them into a private list first because mem_map is not up yet */
>>>>  	INIT_LIST_HEAD(&m->list);
>>>>  	list_add(&m->list, &huge_boot_pages);
>>>> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>>>  	return 1;
>>>>  }
>>>>
>>>> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
>>>> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
>>>> +					unsigned long start_page_number,
>>>> +					unsigned long end_page_number)
>>>> +{
>>>> +	enum zone_type zone = zone_idx(folio_zone(folio));
>>>> +	int nid = folio_nid(folio);
>>>> +	unsigned long head_pfn = folio_pfn(folio);
>>>> +	unsigned long pfn, end_pfn = head_pfn + end_page_number;
>>>> +
>>>> +	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
>>>> +		struct page *page = pfn_to_page(pfn);
>>>> +
>>>> +		__init_single_page(page, pfn, zone, nid);
>>>> +		prep_compound_tail((struct page *)folio, pfn - head_pfn);
>>>> +		set_page_count(page, 0);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
>>>> +					      unsigned long nr_pages)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	/* Prepare folio head */
>>>> +	__folio_clear_reserved(folio);
>>>> +	__folio_set_head(folio);
>>>> +	ret = page_ref_freeze(&folio->page, 1);
>>>> +	VM_BUG_ON(!ret);
>>>
>>> In the current code, we print a warning and free the associated pages
>>> to buddy if we ever experience an increased ref count. The routine
>>> hugetlb_folio_init_tail_vmemmap does not check for this.
>>>
>>> I do not believe speculative/temporary ref counts this early in the
>>> boot process are possible. It would be great to get input from someone
>>> else.
>>
>> Yes, it is a very early stage and the other tail struct pages haven't
>> been initialized yet, so no one should reference them. It is the same
>> case as when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
>>
>>>
>>> When I wrote the existing code, it was fairly easy to WARN and continue
>>> if we encountered an increased ref count. Things would be a bit more
>>
>> In your case, I think it is not in the boot process, right?
>
> They were calls in the same routine: gather_bootmem_prealloc().
>
>>> complicated here. So, it may not be worth the effort.
>>
>> Agree. Note that the tail struct pages are not initialized here; if we
>> want to handle the head page, how do we handle the tail pages? It
>> really cannot be resolved. We should make the same assumption as
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT: that no one should reference those
>> pages.
>
> Agree that speculative refs should not happen this early. How about
> making the following changes?
> - Instead of set_page_count() in hugetlb_folio_init_tail_vmemmap, do a
>   page_ref_freeze and VM_BUG_ON if ref_count != 1.
> - In the commit message, mention 'The WARN_ON for increased ref count in
>   gather_bootmem_prealloc was changed to a VM_BUG_ON. This is OK as
>   there should be no speculative references this early in the boot
>   process. The VM_BUG_ONs are there just in case such code is
>   introduced.'

Sounds good, although it's not possible for the refcnt to not be 1, as
there isn't anything that happens between __init_single_page and
setting/freezing the refcnt to 0. I will include the diff below in the
next revision, with the explanation in the commit message as suggested.
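
To spell out why the VM_BUG_ON in the diff below cannot trip:
page_ref_freeze() is essentially a cmpxchg of the refcount from the
expected value to zero (the real helper lives in
include/linux/page_ref.h), so it can only fail if something took an
extra reference in the window. A rough userspace model of the two
helpers, purely illustrative -- the struct page and main() here are
stand-ins, not kernel code:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <assert.h>

	/* Toy stand-in for the kernel's struct page refcount field. */
	struct page { atomic_int _refcount; };

	/* set_page_count(): blindly store the new count. */
	static void set_page_count(struct page *page, int v)
	{
		atomic_store(&page->_refcount, v);
	}

	/* page_ref_freeze(): drop to 0 only if the count is exactly 'count'. */
	static bool page_ref_freeze(struct page *page, int count)
	{
		int expected = count;

		return atomic_compare_exchange_strong(&page->_refcount,
						      &expected, 0);
	}

	int main(void)
	{
		struct page page;

		set_page_count(&page, 1); /* what __init_single_page() leaves behind */

		/*
		 * Succeeds unless somebody took a speculative reference in
		 * between -- impossible this early in boot, hence
		 * VM_BUG_ON(!ret) rather than a WARN-and-recover path.
		 */
		bool ret = page_ref_freeze(&page, 1);
		assert(ret);
		return 0;
	}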
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 540e0386514e..ed37c6e4e952 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3194,13 +3194,15 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	int nid = folio_nid(folio);
 	unsigned long head_pfn = folio_pfn(folio);
 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+	int ret;
 
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_single_page(page, pfn, zone, nid);
 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
-		set_page_count(page, 0);
+		ret = page_ref_freeze(page, 1);
+		VM_BUG_ON(!ret);
 	}
 }
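
For reference, a back-of-envelope sketch of the per-gigantic-page saving
the deferred initialization buys. The values are assumptions for
illustration only (4 KiB base pages, 64-byte struct page, a 1 GiB
gigantic page, and HVO keeping one vmemmap page per hugepage), not
numbers taken from the patch:

	#include <stdio.h>

	int main(void)
	{
		const unsigned long page_size = 4096;           /* assumed PAGE_SIZE */
		const unsigned long struct_page_size = 64;      /* assumed sizeof(struct page) */
		const unsigned long huge_page_size = 1UL << 30; /* 1 GiB gigantic page */
		const unsigned long reserve_size = page_size;   /* HUGETLB_VMEMMAP_RESERVE_SIZE */

		/* One struct page per base page backing the hugepage. */
		unsigned long total = huge_page_size / page_size;
		/* Head + first tails initialized eagerly before HVO is attempted. */
		unsigned long eager = reserve_size / struct_page_size;

		printf("struct pages per 1G page: %lu\n", total);         /* 262144 */
		printf("initialized before HVO:   %lu\n", eager);         /* 64 */
		printf("skipped if HVO succeeds:  %lu\n", total - eager); /* 262080 */
		return 0;
	}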