From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <63b72487-6220-7c44-70fd-822681746737@bytedance.com>
Date: Fri, 8 Sep 2023 21:48:06 +0100
From: Usama Arif <usama.arif@bytedance.com>
Subject: Re: [External] Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
To: Mike Kravetz, Muchun Song
Cc: Linux-MM, "Mike Rapoport (IBM)", LKML, Muchun Song,
 fam.zheng@bytedance.com, liangma@liangbit.com, punit.agrawal@bytedance.com
In-Reply-To: <20230908182950.GA6564@monkey>
References: <20230906112605.2286994-1-usama.arif@bytedance.com>
 <20230906112605.2286994-5-usama.arif@bytedance.com>
 <20230907183712.GB3640@monkey>
 <20230908182950.GA6564@monkey>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 08/09/2023 19:29, Mike Kravetz wrote:
> On 09/08/23 10:39, Muchun Song wrote:
>>
>>
>>> On Sep 8, 2023, at 02:37, Mike Kravetz wrote:
>>>
>>> On 09/06/23 12:26, Usama Arif wrote:
>>>> The new boot flow when it comes to initialization of gigantic pages
>>>> is as follows:
>>>> - At boot time, for a gigantic page during __alloc_bootmem_huge_page,
>>>>   the region after the first struct page is marked as noinit.
>>>> - This results in only the first struct page being initialized in
>>>>   reserve_bootmem_region. As the tail struct pages are not initialized
>>>>   at this point, there can be a significant saving in boot time if HVO
>>>>   succeeds later on.
>>>> - Later on in the boot, the head page is prepped and the first
>>>>   HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct
>>>>   pages are initialized.
>>>> - HVO is attempted. If it is not successful, then the rest of the
>>>>   tail struct pages are initialized. If it is successful, no more
>>>>   tail struct pages need to be initialized, saving significant boot
>>>>   time.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>>>> ---
>>>>  mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
>>>>  mm/hugetlb_vmemmap.c |  2 +-
>>>>  mm/hugetlb_vmemmap.h |  9 ++++---
>>>>  mm/internal.h        |  3 +++
>>>>  mm/mm_init.c         |  2 +-
>>>>  5 files changed, 62 insertions(+), 15 deletions(-)
>>>
>>> As mentioned, in general this looks good. One small point below.
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index c32ca241df4b..540e0386514e 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>>>  	}
>>>>
>>>>  found:
>>>> +
>>>> +	/*
>>>> +	 * Only initialize the head struct page in memmap_init_reserved_pages;
>>>> +	 * the rest of the struct pages will be initialized by the HugeTLB
>>>> +	 * subsystem itself. The head struct page is used by the HugeTLB
>>>> +	 * subsystem to get folio information such as zone id and node id.
>>>> +	 */
>>>> +	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
>>>> +				      huge_page_size(h) - PAGE_SIZE);
>>>>  	/* Put them into a private list first because mem_map is not up yet */
>>>>  	INIT_LIST_HEAD(&m->list);
>>>>  	list_add(&m->list, &huge_boot_pages);
>>>> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>>>  	return 1;
>>>>  }
>>>>
>>>> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
>>>> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
>>>> +					unsigned long start_page_number,
>>>> +					unsigned long end_page_number)
>>>> +{
>>>> +	enum zone_type zone = zone_idx(folio_zone(folio));
>>>> +	int nid = folio_nid(folio);
>>>> +	unsigned long head_pfn = folio_pfn(folio);
>>>> +	unsigned long pfn, end_pfn = head_pfn + end_page_number;
>>>> +
>>>> +	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
>>>> +		struct page *page = pfn_to_page(pfn);
>>>> +
>>>> +		__init_single_page(page, pfn, zone, nid);
>>>> +		prep_compound_tail((struct page *)folio, pfn - head_pfn);
>>>> +		set_page_count(page, 0);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, struct hstate *h,
>>>> +					      unsigned long nr_pages)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	/* Prepare folio head */
>>>> +	__folio_clear_reserved(folio);
>>>> +	__folio_set_head(folio);
>>>> +	ret = page_ref_freeze(&folio->page, 1);
>>>> +	VM_BUG_ON(!ret);
>>>
>>> In the current code, we print a warning and free the associated pages
>>> to buddy if we ever experience an increased ref count. The routine
>>> hugetlb_folio_init_tail_vmemmap does not check for this.
>>>
>>> I do not believe speculative/temporary ref counts this early in the
>>> boot process are possible. It would be great to get input from someone
>>> else.
>>
>> Yes, it is a very early stage and the other tail struct pages haven't
>> been initialized yet, so no one should reference them. It is the same
>> case as when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
>>
>>>
>>> When I wrote the existing code, it was fairly easy to WARN and continue
>>> if we encountered an increased ref count. Things would be a bit more
>>
>> In your case, I think it is not in the boot process, right?
>
> They were calls in the same routine: gather_bootmem_prealloc().
>
>>> complicated here. So, it may not be worth the effort.
>>
>> Agree. Note that the tail struct pages are not initialized here; if we
>> want to handle the head page, how do we handle the tail pages? It
>> really cannot be resolved. We should make the same assumption as
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT: that no one should reference those
>> pages.
>
> Agree that speculative refs should not happen this early. How about
> making the following changes?
> - Instead of set_page_count() in hugetlb_folio_init_tail_vmemmap, do a
>   page_ref_freeze and VM_BUG_ON if ref_count != 1.
> - In the commit message, mention 'The WARN_ON for increased ref count in
>   gather_bootmem_prealloc was changed to a VM_BUG_ON. This is OK as
>   there should be no speculative references this early in the boot
>   process. The VM_BUG_ONs are there just in case such code is
>   introduced.'

Sounds good, although it's not possible for the refcnt to not be 1, as
there isn't anything that happens between __init_single_page and
setting/freezing the refcnt to 0. I will include the diff below in the
next revision, with the explanation in the commit message as suggested.
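
To spell out why the VM_BUG_ON in the diff below cannot trip:
page_ref_freeze() is essentially a cmpxchg of the refcount from the
expected value to zero (the real helper lives in
include/linux/page_ref.h), so it can only fail if something took an
extra reference in the window. A rough userspace model of the two
helpers, purely illustrative -- the struct page and main() here are
stand-ins, not kernel code:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <assert.h>

	/* Toy stand-in for the kernel's struct page refcount field. */
	struct page { atomic_int _refcount; };

	/* set_page_count(): blindly store the new count. */
	static void set_page_count(struct page *page, int v)
	{
		atomic_store(&page->_refcount, v);
	}

	/* page_ref_freeze(): drop to 0 only if the count is exactly 'count'. */
	static bool page_ref_freeze(struct page *page, int count)
	{
		int expected = count;

		return atomic_compare_exchange_strong(&page->_refcount,
						      &expected, 0);
	}

	int main(void)
	{
		struct page page;

		set_page_count(&page, 1); /* what __init_single_page() leaves behind */

		/*
		 * Succeeds unless somebody took a speculative reference in
		 * between -- impossible this early in boot, hence
		 * VM_BUG_ON(!ret) rather than a WARN-and-recover path.
		 */
		bool ret = page_ref_freeze(&page, 1);
		assert(ret);
		return 0;
	}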
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 540e0386514e..ed37c6e4e952 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3194,13 +3194,15 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	int nid = folio_nid(folio);
 	unsigned long head_pfn = folio_pfn(folio);
 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+	int ret;
 
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_single_page(page, pfn, zone, nid);
 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
-		set_page_count(page, 0);
+		ret = page_ref_freeze(page, 1);
+		VM_BUG_ON(!ret);
 	}
 }
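
For reference, a back-of-envelope sketch of the per-gigantic-page saving
the deferred initialization buys. The values are assumptions for
illustration only (4 KiB base pages, 64-byte struct page, a 1 GiB
gigantic page, and HVO keeping one vmemmap page per hugepage), not
numbers taken from the patch:

	#include <stdio.h>

	int main(void)
	{
		const unsigned long page_size = 4096;           /* assumed PAGE_SIZE */
		const unsigned long struct_page_size = 64;      /* assumed sizeof(struct page) */
		const unsigned long huge_page_size = 1UL << 30; /* 1 GiB gigantic page */
		const unsigned long reserve_size = page_size;   /* HUGETLB_VMEMMAP_RESERVE_SIZE */

		/* One struct page per base page backing the hugepage. */
		unsigned long total = huge_page_size / page_size;
		/* Head + first tails initialized eagerly before HVO is attempted. */
		unsigned long eager = reserve_size / struct_page_size;

		printf("struct pages per 1G page: %lu\n", total);         /* 262144 */
		printf("initialized before HVO:   %lu\n", eager);         /* 64 */
		printf("skipped if HVO succeeds:  %lu\n", total - eager); /* 262080 */
		return 0;
	}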