Subject: Re: [PATCH v2 1/1] mm: fix interrupt disabled long time inside deferred_init_memmap()
To: Shile Zhang, Andrew Morton, Pavel Tatashin
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Vlastimil Babka, Michal Hocko
References: <20200303161551.132263-1-shile.zhang@linux.alibaba.com> <20200303161551.132263-2-shile.zhang@linux.alibaba.com> <386d7d5f-a57d-f5b1-acee-131ce23d35ec@linux.alibaba.com>
From: Kirill Tkhai
Message-ID: <2d4defb7-8816-3447-3d65-f5d80067a9fd@virtuozzo.com>
Date: Wed, 4 Mar 2020 13:47:51 +0300
In-Reply-To: <386d7d5f-a57d-f5b1-acee-131ce23d35ec@linux.alibaba.com>

On 04.03.2020 05:34, Shile Zhang wrote:
> Hi Kirill,
>
> Thanks for your quick reply!
>
> On 2020/3/4 00:52, Kirill Tkhai wrote:
>> On 03.03.2020 19:15, Shile Zhang wrote:
>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, the 'pgdatinit' kthread
>>> initialises the deferred pages with local interrupts disabled. This was
>>> introduced by commit 3a2d7fa8a3d5 ("mm: disable interrupts while
>>> initializing deferred pages").
>>>
>>> Local interrupts can stay disabled for a long time inside
>>> deferred_init_memmap(), depending on the memory size.
>>> On machines with NCPUS <= 2, the 'pgdatinit' kthread can be pinned to the
>>> boot CPU; the tick timer then stalls for a long time, which makes the
>>> system wall time inaccurate.
>>>
>>> For example, dmesg showed:
>>>
>>>    [    0.197975] node 0 initialised, 32170688 pages in 1ms
>>>
>>> Obviously, 1ms is unreasonable.
>>> Fix it by letting pending interrupts in inside the while loop.
>>> With the fix, dmesg shows a reasonable value:
>>>
>>> [    1.069306] node 0 initialised, 32203456 pages in 894ms
>> The way I understand the original problem that Pavel fixed:
>>
>> we need to disable irqs in deferred_init_memmap(), since this function may be
>> called in parallel with deferred_grow_zone() called from an interrupt handler.
>> So Pavel added the lock to fix the race.
>>
>> If we temporarily unlock the lock, an interrupt is still possible,
>> so my previous proposition brings the problem back.
>>
>> Having thought about it again, I think we just have to add:
>>
>>     pgdat_resize_unlock();
>>     pgdat_resize_lock();
>>
>> instead of re-enabling interrupts, since if we just re-enable them with the
>> lock held, a call of interrupt->deferred_grow_zone() brings us to a deadlock.
>>
>> So unlocking the lock is a must.
>
> Yes, you're right! I missed this point.
> Thanks for your comment!
>
>>
>>> Signed-off-by: Shile Zhang
>>> ---
>>>  mm/page_alloc.c | 6 +++++-
>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3c4eb750a199..d3f337f2e089 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1809,8 +1809,12 @@ static int __init deferred_init_memmap(void *data)
>>>       * that we can avoid introducing any issues with the buddy
>>>       * allocator.
>>>       */
>>> -    while (spfn < epfn)
>>> +    while (spfn < epfn) {
>>>          nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>> +        /* let in any pending interrupts */
>>> +        local_irq_restore(flags);
>>> +        local_irq_save(flags);
>>> +    }
>>>  zone_empty:
>>>      pgdat_resize_unlock(pgdat, &flags);
>> I think we need here something like below (untested):
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 79e950d76ffc..323afa9a4db5 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
>>  {
>>      pg_data_t *pgdat = data;
>>      const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> -    unsigned long spfn = 0, epfn = 0, nr_pages = 0;
>> +    unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
>>      unsigned long first_init_pfn, flags;
>>      unsigned long start = jiffies;
>>      struct zone *zone;
>> @@ -1869,8 +1869,18 @@ static int __init deferred_init_memmap(void *data)
>>       * that we can avoid introducing any issues with the buddy
>>       * allocator.
>>       */
>> -    while (spfn < epfn)
>> +    while (spfn < epfn) {
>>          nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>> +        /*
>> +         * Release interrupts every 1Gb to give a timer
>> +         * the possibility to advance jiffies.
>> +         */
>> +        if (nr_pages - prev_nr_pages > (1UL << (30 - PAGE_SHIFT))) {
>> +            prev_nr_pages = nr_pages;
>> +            pgdat_resize_unlock(pgdat, &flags);
>> +            pgdat_resize_lock(pgdat, &flags);
>> +        }
>> +    }
>>  zone_empty:
>>      pgdat_resize_unlock(pgdat, &flags);
>>
>> (I believe the comment may be improved more.)
>
> Yeah, your patch is better!
> I tested your code and it works!
> But it seems that 1G still holds interrupts too long: about 40ms in my env
> (Intel(R) Xeon(R) 2.5GHz). I tried other sizes; 1024 pages (4MB) works fine,
> as suggested by Andrew before.
>
> Could you please help to review it again?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..5def66d3ffcd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1768,7 +1768,7 @@ static int __init deferred_init_memmap(void *data)
>  {
>      pg_data_t *pgdat = data;
>      const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> -    unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> +    unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
>      unsigned long first_init_pfn, flags;
>      unsigned long start = jiffies;
>      struct zone *zone;
> @@ -1809,8 +1809,17 @@ static int __init deferred_init_memmap(void *data)
>       * that we can avoid introducing any issues with the buddy
>       * allocator.
>       */
> -    while (spfn < epfn)
> +    while (spfn < epfn) {
>          nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +        /*
> +         * Restore pending interrupts every 1024 pages to give
> +         * the tick timer a chance to advance jiffies.
> +         */
> +        if (nr_pages - prev_nr_pages > 1024) {
> +            pgdat_resize_unlock(pgdat, &flags);
> +            pgdat_resize_lock(pgdat, &flags);

Here is the problem: prev_nr_pages must be updated.

Anyway, releasing every 4MB looks wrong to me, since it removes the fix that
Pavel introduced. He protected against big allocations made from interrupt
context. But if we unlock the lock every 4MB, only 4MB will be available for
allocations from interrupts: pgdat->first_deferred_pfn is updated at the start
of the function, so allocations from interrupts won't be able to initialize
more for themselves.

If you want to let interrupts in very often, you have to be creative with
first_deferred_pfn: we should update it sequentially.
Something like below (untested):

---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79e950d76ffc..be09d158baeb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
 {
     pg_data_t *pgdat = data;
     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-    unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+    unsigned long spfn = 0, epfn = 0, nr_pages;
     unsigned long first_init_pfn, flags;
     unsigned long start = jiffies;
     struct zone *zone;
@@ -1838,7 +1838,7 @@ static int __init deferred_init_memmap(void *data)
     /* Bind memory initialisation thread to a local node if possible */
     if (!cpumask_empty(cpumask))
         set_cpus_allowed_ptr(current, cpumask);
-
+again:
     pgdat_resize_lock(pgdat, &flags);
     first_init_pfn = pgdat->first_deferred_pfn;
     if (first_init_pfn == ULONG_MAX) {
@@ -1850,7 +1850,6 @@ static int __init deferred_init_memmap(void *data)
     /* Sanity check boundaries */
     BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
     BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
-    pgdat->first_deferred_pfn = ULONG_MAX;

     /* Only the highest zone is deferred so find it */
     for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -1864,14 +1863,30 @@ static int __init deferred_init_memmap(void *data)
                  first_init_pfn))
         goto zone_empty;

+    nr_pages = 0;
+
     /*
      * Initialize and free pages in MAX_ORDER sized increments so
      * that we can avoid introducing any issues with the buddy
      * allocator.
+     * Final iteration marker is: spfn=ULONG_MAX and epfn=0.
      */
-    while (spfn < epfn)
+    while (spfn < epfn) {
         nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+        if (!epfn)
+            break;
+        pgdat->first_deferred_pfn = epfn;
+        /*
+         * Let in pending interrupts every 128Mb to give
+         * the tick timer a chance to advance jiffies.
+         */
+        if (nr_pages > (1UL << (27 - PAGE_SHIFT))) {
+            pgdat_resize_unlock(pgdat, &flags);
+            goto again;
+        }
+    }
 zone_empty:
+    pgdat->first_deferred_pfn = ULONG_MAX;
     pgdat_resize_unlock(pgdat, &flags);

     /* Sanity check that the next zone really is unpopulated */