From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BB278C3F2C6 for ; Wed, 11 Mar 2020 12:05:05 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 4753721655 for ; Wed, 11 Mar 2020 12:05:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4753721655 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=virtuozzo.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id C80A26B0006; Wed, 11 Mar 2020 08:05:04 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C0A5C6B0007; Wed, 11 Mar 2020 08:05:04 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AE07D6B0008; Wed, 11 Mar 2020 08:05:04 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0240.hostedemail.com [216.40.44.240]) by kanga.kvack.org (Postfix) with ESMTP id 8BA636B0006 for ; Wed, 11 Mar 2020 08:05:04 -0400 (EDT) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 4E9CA81CF for ; Wed, 11 Mar 2020 12:05:04 +0000 (UTC) X-FDA: 76582950528.20.verse38_2c297949e545a X-HE-Tag: verse38_2c297949e545a X-Filterd-Recvd-Size: 18724 Received: from relay.sw.ru (relay.sw.ru [185.231.240.75]) by imf02.hostedemail.com (Postfix) with ESMTP for ; Wed, 11 Mar 2020 12:05:03 +0000 (UTC) Received: from dhcp-172-16-24-104.sw.ru ([172.16.24.104]) by relay.sw.ru with esmtp (Exim 4.92.3) (envelope-from ) id 1jC06U-0006MJ-FO; Wed, 11 Mar 2020 15:04:50 +0300 Subject: Re: [PATCH v2 1/1] mm: fix interrupt disabled long time inside deferred_init_memmap() To: Shile Zhang , Andrew Morton , Pavel Tatashin Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Vlastimil Babka , Michal Hocko References: <20200303161551.132263-1-shile.zhang@linux.alibaba.com> <20200303161551.132263-2-shile.zhang@linux.alibaba.com> <386d7d5f-a57d-f5b1-acee-131ce23d35ec@linux.alibaba.com> <2d4defb7-8816-3447-3d65-f5d80067a9fd@virtuozzo.com> <1856c956-858f-82d4-f3b3-05b2d0e5641c@linux.alibaba.com> <094fcf02-37bd-c9b9-7156-77b9b34955e0@virtuozzo.com> <59c7f8c7-d2ef-8b59-637e-88a353d4daf0@linux.alibaba.com> From: Kirill Tkhai Message-ID: Date: Wed, 11 Mar 2020 15:04:50 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.5.0 MIME-Version: 1.0 In-Reply-To: <59c7f8c7-d2ef-8b59-637e-88a353d4daf0@linux.alibaba.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 11.03.2020 15:00, Shile Zhang wrote: >=20 >=20 > On 2020/3/11 19:42, Kirill Tkhai wrote: >> On 11.03.2020 04:44, Shile Zhang wrote: >>> Hi Kirill, >>> >>> Sorry for late to reply! >>> I'm not fully understood the whole thing about deferred page init, so= I >>> just force on the jiffies update issue itself. >>> >>> Maybe I'm in wrong path, it seems make no sense that deferred page in= it in 1 CPU system, >>> it cannot be initialize memory parallel. >> Yes, it can't be initialized in parallel. But scheduler interprets def= erred thread as a separate >> task, so CPU time will be shared between that thread and the rest of t= asks. This still should >> make boot faster. >=20 > Thanks for your quickly reply! >=20 > Sorry, I don't think the CPU time can be shared to the rest of tasks si= nce the CPU be blocked until > the deferred pages are all initialized, as following code: > 8<------ > #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT >=20 > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* There will be num_node_st= ate(N_MEMORY) threads */ > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 atomic_set(&pgdat_init_n_und= one, num_node_state(N_MEMORY)); > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for_each_node_state(nid, N_M= EMORY) { > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgd= atinit%d", nid); > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >=20 > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Block until all are initi= alised */ > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 wait_for_completion(&pgdat_i= nit_all_done_comp); > 8<------- Yes, sure. Than, there won't be any profit in runtime. Anyway, we get maintaince profit in case of all code is written in generi= c way. No new corner cases are welcomed. =20 > Maybe I misunderstood this point. >=20 > @Pavel, > Could you please give any advice on this point? Thanks! >=20 >>> It might be better to disable deferred page init in 'deferred_init' i= n case of 1 CPU >>> (or only one memory node). >>> >>> In other word, seems the better way to solve this issue is do not bin= d 'pgdatinit' thread >>> on boot CPU. >>> >>> I also refactor the patch based on your comment, please help to check= , thanks! >>> >>> >>> On 2020/3/4 18:47, Kirill Tkhai wrote: >>>> On 04.03.2020 05:34, Shile Zhang wrote: >>>>> Hi Kirill, >>>>> >>>>> Thanks for your quickly reply! >>>>> >>>>> On 2020/3/4 00:52, Kirill Tkhai wrote: >>>>>> On 03.03.2020 19:15, Shile Zhang wrote: >>>>>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthre= ad will >>>>>>> initialise the deferred pages with local interrupts disabled. It = is >>>>>>> introduced by commit 3a2d7fa8a3d5 ("mm: disable interrupts while >>>>>>> initializing deferred pages"). >>>>>>> >>>>>>> The local interrupt will be disabled long time inside >>>>>>> deferred_init_memmap(), depends on memory size. >>>>>>> On machine with NCPUS <=3D 2, the 'pgdatinit' kthread could be pi= ned on >>>>>>> boot CPU, then the tick timer will stuck long time, which caused = the >>>>>>> system wall time inaccuracy. >>>>>>> >>>>>>> For example, the dmesg shown that: >>>>>>> >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0 [=C2=A0=C2=A0=C2=A0 0.197975] node 0 ini= tialised, 32170688 pages in 1ms >>>>>>> >>>>>>> Obviously, 1ms is unreasonable. >>>>>>> Now, fix it by restore in the pending interrupts inside the while= loop. >>>>>>> The reasonable demsg shown likes: >>>>>>> >>>>>>> [=C2=A0=C2=A0=C2=A0 1.069306] node 0 initialised, 32203456 pages = in 894ms >>>>>> The way I understand the original problem, that Pavel fixed: >>>>>> >>>>>> we need disable irqs in deferred_init_memmap() since this function= may be called >>>>>> in parallel with deferred_grow_zone() called from interrupt handle= r. So, Pavel >>>>>> added lock to fix the race. >>>>>> >>>>>> In case of we temporary unlock the lock, interrupt still be possib= le, >>>>>> so my previous proposition returns the problem back. >>>>>> >>>>>> Now thought again, I think we have to just add: >>>>>> >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0pgdat_resize_unlock(); >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0pgdat_resize_lock(); >>>>>> >>>>>> instead of releasing interrupts, since in case of we just release = them with lock held, >>>>>> a call of interrupt->deferred_grow_zone() bring us to a deadlock. >>>>>> >>>>>> So, unlock the lock is must. >>>>> Yes, you're right! I missed this point. >>>>> Thanks for your comment! >>>>> >>>>>>> Signed-off-by: Shile Zhang >>>>>>> --- >>>>>>> =C2=A0=C2=A0=C2=A0 mm/page_alloc.c | 6 +++++- >>>>>>> =C2=A0=C2=A0=C2=A0 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>> index 3c4eb750a199..d3f337f2e089 100644 >>>>>>> --- a/mm/page_alloc.c >>>>>>> +++ b/mm/page_alloc.c >>>>>>> @@ -1809,8 +1809,12 @@ static int __init deferred_init_memmap(voi= d *data) >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * that we can av= oid introducing any issues with the buddy >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * allocator. >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>>>>> -=C2=A0=C2=A0=C2=A0 while (spfn < epfn) >>>>>>> +=C2=A0=C2=A0=C2=A0 while (spfn < epfn) { >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= nr_pages +=3D deferred_init_maxorder(&i, zone, &spfn, &epfn); >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* let in any pending= interrupts */ >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 local_irq_restore(fla= gs); >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 local_irq_save(flags)= ; >>>>>>> +=C2=A0=C2=A0=C2=A0 } >>>>>>> =C2=A0=C2=A0=C2=A0 zone_empty: >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize_unlock(pg= dat, &flags); >>>>>> I think we need here something like below (untested): >>>>>> >>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>> index 79e950d76ffc..323afa9a4db5 100644 >>>>>> --- a/mm/page_alloc.c >>>>>> +++ b/mm/page_alloc.c >>>>>> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void = *data) >>>>>> =C2=A0=C2=A0=C2=A0 { >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pg_data_t *pgdat =3D da= ta; >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 const struct cpumask *c= pumask =3D cpumask_of_node(pgdat->node_id); >>>>>> -=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epfn =3D 0, nr_pages= =3D 0; >>>>>> +=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epfn =3D 0, nr_pages= =3D 0, prev_nr_pages =3D 0; >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long first_ini= t_pfn, flags; >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long start =3D= jiffies; >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct zone *zone; >>>>>> @@ -1869,8 +1869,18 @@ static int __init deferred_init_memmap(void= *data) >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * that we can avo= id introducing any issues with the buddy >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * allocator. >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>>>> -=C2=A0=C2=A0=C2=A0 while (spfn < epfn) >>>>>> +=C2=A0=C2=A0=C2=A0 while (spfn < epfn) { >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= nr_pages +=3D deferred_init_maxorder(&i, zone, &spfn, &epfn); >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Release interr= upts every 1Gb to give a possibility >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * a timer to adv= ance jiffies. >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (nr_pages - prev_nr= _pages > (1UL << (30 - PAGE_SHIFT))) { >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= prev_nr_pages =3D nr_pages; >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= pgdat_resize_unlock(pgdat, &flags); >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= pgdat_resize_lock(pgdat, &flags); >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>> +=C2=A0=C2=A0=C2=A0 } >>>>>> =C2=A0=C2=A0=C2=A0 zone_empty: >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize_unlock(pgd= at, &flags); >>>>>> =C2=A0=C2=A0 (I believe the comment may be improved more). >>>>> Yeah, your patch is better! >>>>> I test your code and it works! >>>>> But it seems that 1G is still hold the interrupts too long, about 4= 0ms in my env >>>>> with Intel(R) Xeon(R) 2.5GHz). I tried other size, it is OK to use = 1024 pages (4MB), >>>>> which suggested by Andrew's before. >>>>> >>>>> Could you please help to review it again? >>>>> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>> index 3c4eb750a199..5def66d3ffcd 100644 >>>>> --- a/mm/page_alloc.c >>>>> +++ b/mm/page_alloc.c >>>>> @@ -1768,7 +1768,7 @@ static int __init deferred_init_memmap(void *= data) >>>>> =C2=A0=C2=A0=C2=A0{ >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pg_data_t *p= gdat =3D data; >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 const struct= cpumask *cpumask =3D cpumask_of_node(pgdat->node_id); >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epf= n =3D 0, nr_pages =3D 0; >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epf= n =3D 0, nr_pages =3D 0, prev_nr_pages =3D 0; >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned lon= g first_init_pfn, flags; >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned lon= g start =3D jiffies; >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct zone = *zone; >>>>> @@ -1809,8 +1809,17 @@ static int __init deferred_init_memmap(void = *data) >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * that= we can avoid introducing any issues with the buddy >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * allo= cator. >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 while (spfn < epfn) >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 while (spfn < epfn) { >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 nr_pages +=3D deferred_init_maxorder(&i= , zone, &spfn, &epfn); >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 /* >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 * Restore pending interrupts every 1024 pages to= give >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 * the chance tick timer to advance jiffies. >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 */ >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 if (nr_pages - prev_nr_pages > 1024) { >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_= resize_unlock(&flags); >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_= resize_lock(&flags); >>>> Here is problem: prev_nr_pages must be updated. >>>> >>>> Anyway, releasing every 4M looks wrong for me, since you removes the= fix that Pavel introduced. >>>> He protected against big allocations made from interrupt content. Bu= t in case of we unlock >>>> the lock after 4Mb, only 4Mb will be available for allocations from = interrupts. pgdat->first_deferred_pfn >>>> is updated at the start of function, so interrupt allocations won't = be able to initialize >>>> mode for themselve. >>> Yes, you're right. I missed this point since I'm not fully understood= the code before. >>> Thanks for your advice! >>>> In case of you want unlock interrupts very often, you should make so= me creativity with first_deferred_pfn. >>>> We should update it sequentially. Something like below (untested): >>> I got your point now, thanks! >>>> --- >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>> index 79e950d76ffc..be09d158baeb 100644 >>>> --- a/mm/page_alloc.c >>>> +++ b/mm/page_alloc.c >>>> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *d= ata) >>>> =C2=A0=C2=A0 { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pg_data_t *pgdat =3D data; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 const struct cpumask *cpumask =3D= cpumask_of_node(pgdat->node_id); >>>> -=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epfn =3D 0, nr_pages =3D= 0; >>>> +=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epfn =3D 0, nr_pages; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long first_init_pfn, f= lags; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long start =3D jiffies= ; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct zone *zone; >>>> @@ -1838,7 +1838,7 @@ static int __init deferred_init_memmap(void *d= ata) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Bind memory initialisation t= hread to a local node if possible */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!cpumask_empty(cpumask)) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 set_cpu= s_allowed_ptr(current, cpumask); >>>> - >>>> +again: >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize_lock(pgdat, &flags= ); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 first_init_pfn =3D pgdat->first= _deferred_pfn; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (first_init_pfn =3D=3D ULONG= _MAX) { >>>> @@ -1850,7 +1850,6 @@ static int __init deferred_init_memmap(void *d= ata) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Sanity check boundaries */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BUG_ON(pgdat->first_deferred_pf= n < pgdat->node_start_pfn); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BUG_ON(pgdat->first_deferred_pf= n > pgdat_end_pfn(pgdat)); >>>> -=C2=A0=C2=A0=C2=A0 pgdat->first_deferred_pfn =3D ULONG_MAX; >>>> =C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Only the highest zone= is deferred so find it */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for (zid =3D 0; zid < MAX_NR_ZO= NES; zid++) { >>>> @@ -1864,14 +1863,30 @@ static int __init deferred_init_memmap(void = *data) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 first_init_pfn)) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 goto zo= ne_empty; >>>> =C2=A0=C2=A0 +=C2=A0=C2=A0=C2=A0 nr_pages =3D 0; >>> 'nr_pages' used to mark the total init pages before, so it cannot be = zerolized each round. >>> seems we need one more to count the pages init each round. >>> >>>> + >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Initialize and free pag= es in MAX_ORDER sized increments so >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * that we can avoid intro= ducing any issues with the buddy >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * allocator. >>>> +=C2=A0=C2=A0=C2=A0=C2=A0 * Final iteration marker is: spfn=3DULONG_= MAX and epfn=3D0. >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>> -=C2=A0=C2=A0=C2=A0 while (spfn < epfn) >>>> +=C2=A0=C2=A0=C2=A0 while (spfn < epfn) { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 nr_page= s +=3D deferred_init_maxorder(&i, zone, &spfn, &epfn); >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!epfn) >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = break; >>> Seems 'epfn' never goes to 0 since it is "end page frame number", rig= ht? >>> So this is needless. >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat->first_deferred_pf= n =3D epfn; >>> I think first_deferred_pfn update wrong value here, it seems should b= e the spfn, the start pfn right? >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Restore pending = interrupts every 128Mb to give >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * the chance tick = timer to advance jiffies. >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (nr_pages > (1UL << 2= 7 - PAGE_SHIFT)) { >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = pgdat_resize_unlock(pgdat, &flags); >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = goto again; >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>> +=C2=A0=C2=A0=C2=A0 } >>>> =C2=A0=C2=A0 zone_empty: >>>> +=C2=A0=C2=A0=C2=A0 pgdat->first_deferred_pfn =3D ULONG_MAX; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize_unlock(pgdat, &fla= gs); >>>> =C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Sanity check that the= next zone really is unpopulated */ >>>> >>>> >>> I update the patch based on your comment, it passed the test. >>> Could you please help to review it again? Thanks! >> The patch is OK for me. It may be sent into ml with an appropriate des= cription. >> >> Feel free to not forget to add my: >> >> Co-developed-by: Kirill Tkhai >=20 > Yeah, sure! > Many thanks for your kindly help on this issue! > I'll send out v3 later for further review. >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index 3c4eb750a199..841c902d4509 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -1763,12 +1763,17 @@ deferred_init_maxorder(u64 *i, struct zone *z= one, unsigned long *start_pfn, >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return nr_pages; >>> =C2=A0=C2=A0} >>> >>> +/* >>> + * Release the tick timer interrupts for every TICK_PAGE_COUNT pages= . >>> + */ >>> +#define TICK_PAGE_COUNT=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (3= 2 * 1024) >>> + >>> =C2=A0=C2=A0/* Initialise remaining memory on a node */ >>> =C2=A0=C2=A0static int __init deferred_init_memmap(void *data) >>> =C2=A0=C2=A0{ >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pg_data_t *pgdat =3D= data; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 const struct cpumask= *cpumask =3D cpumask_of_node(pgdat->node_id); >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epfn = =3D 0, nr_pages =3D 0; >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long spfn =3D 0, epfn = =3D 0, nr_pages =3D 0, prev_nr_pages =3D 0; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long first_= init_pfn, flags; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long start = =3D jiffies; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct zone *zone; >>> @@ -1779,6 +1784,7 @@ static int __init deferred_init_memmap(void *da= ta) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!cpumask_empty(c= pumask)) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 set_cpus_allowed_ptr(current, cpumask); >>> >>> +again: >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize_lock(pg= dat, &flags); >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 first_init_pfn =3D p= gdat->first_deferred_pfn; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (first_init_pfn =3D= =3D ULONG_MAX) { >>> @@ -1790,7 +1796,6 @@ static int __init deferred_init_memmap(void *da= ta) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Sanity check boun= daries */ >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BUG_ON(pgdat->first_= deferred_pfn < pgdat->node_start_pfn); >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BUG_ON(pgdat->first_= deferred_pfn > pgdat_end_pfn(pgdat)); >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat->first_deferred_pfn =3D U= LONG_MAX; >>> >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Only the highest = zone is deferred so find it */ >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for (zid =3D 0; zid = < MAX_NR_ZONES; zid++) { >>> @@ -1809,9 +1814,23 @@ static int __init deferred_init_memmap(void *d= ata) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * that we can = avoid introducing any issues with the buddy >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * allocator. >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 while (spfn < epfn) >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 while (spfn < epfn) { >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 nr_pages +=3D deferred_init_maxorder(&i, zone= , &spfn, &epfn); >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 /* >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 * Release the interrupts for every TICK_PAGE_COUNT = pages >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 * (128MB) to give the chance that tick timer to adv= ance >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 * the jiffies. >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 */ >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 if ((nr_pages - prev_nr_pages) > TICK_PAGE_COUNT) { >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 prev_nr_p= ages =3D nr_pages; >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat->fi= rst_deferred_pfn =3D spfn; >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_res= ize_unlock(pgdat, &flags); >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 goto agai= n; >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 } >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>> + >>> =C2=A0=C2=A0zone_empty: >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat->first_deferred_pfn =3D U= LONG_MAX; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize_unlock(= pgdat, &flags); >>> >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Sanity check that= the next zone really is unpopulated */ >>> >=20