From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 60514C33CAF for ; Mon, 13 Jan 2020 08:11:25 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D606A2075B for ; Mon, 13 Jan 2020 08:11:24 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D606A2075B Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=virtuozzo.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 4945E8E0005; Mon, 13 Jan 2020 03:11:24 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 41E9F8E0001; Mon, 13 Jan 2020 03:11:24 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 30C9B8E0005; Mon, 13 Jan 2020 03:11:24 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0082.hostedemail.com [216.40.44.82]) by kanga.kvack.org (Postfix) with ESMTP id 14B058E0001 for ; Mon, 13 Jan 2020 03:11:24 -0500 (EST) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id A9A45180AD80F for ; Mon, 13 Jan 2020 08:11:23 +0000 (UTC) X-FDA: 76371891246.20.hill65_c0cbe6785124 X-HE-Tag: hill65_c0cbe6785124 X-Filterd-Recvd-Size: 6248 Received: from relay.sw.ru (relay.sw.ru [185.231.240.75]) by imf39.hostedemail.com (Postfix) with ESMTP for ; Mon, 13 Jan 2020 08:11:22 +0000 (UTC) Received: from dhcp-172-16-24-104.sw.ru ([172.16.24.104]) by relay.sw.ru with esmtp (Exim 4.92.3) (envelope-from ) id 1iquoY-00005J-0F; Mon, 13 Jan 2020 11:11:10 +0300 Subject: Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock To: Shile Zhang , Andrew Morton , Pavel Tatashin Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20200110082510.172517-2-shile.zhang@linux.alibaba.com> <20200110093053.34777-1-shile.zhang@linux.alibaba.com> <1ee6088c-9e72-8824-3a9a-fc099d196faf@virtuozzo.com> From: Kirill Tkhai Message-ID: <9420eab3-5e5e-150f-53c9-6cd40bacf859@virtuozzo.com> Date: Mon, 13 Jan 2020 11:11:09 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.3.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 13.01.2020 03:54, Shile Zhang wrote: >=20 >=20 > On 2020/1/10 19:42, Kirill Tkhai wrote: >> On 10.01.2020 12:30, Shile Zhang wrote: >>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock' >>> will be called inside 'pgdatinit' kthread to initialise the deferred >>> pages with local interrupts disabled. Which is introduced by >>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing defer= red >>> pages"). >>> >>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 b= y >>> default), especially in small system with NRCPUS <=3D 2. In this case= , the >>> interrupts are disabled on boot CPU during memory initialising, which >>> caused the tick_sched timer be blocked, leading to wall clock stuck. >>> >>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializin= g >>> deferred pages") >>> >>> Signed-off-by: Shile Zhang >>> --- >>> =C2=A0 include/linux/memory_hotplug.h | 16 ++++++++++++++-- >>> =C2=A0 1 file changed, 14 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_ho= tplug.h >>> index ba0dca6aac6e..be69a6dc4fee 100644 >>> --- a/include/linux/memory_hotplug.h >>> +++ b/include/linux/memory_hotplug.h >>> @@ -6,6 +6,8 @@ >>> =C2=A0 #include >>> =C2=A0 #include >>> =C2=A0 #include >>> +#include >>> +#include >>> =C2=A0 =C2=A0 struct page; >>> =C2=A0 struct zone; >>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void= ) >>> =C2=A0 static inline >>> =C2=A0 void pgdat_resize_lock(struct pglist_data *pgdat, unsigned lon= g *flags) >>> =C2=A0 { >>> -=C2=A0=C2=A0=C2=A0 spin_lock_irqsave(&pgdat->node_size_lock, *flags)= ; >>> +=C2=A0=C2=A0=C2=A0 /* >>> +=C2=A0=C2=A0=C2=A0=C2=A0 * Disable local interrupts on boot CPU will= stop the tick_sched >>> +=C2=A0=C2=A0=C2=A0=C2=A0 * timer, which will block jiffies(wall cloc= k) update. >>> +=C2=A0=C2=A0=C2=A0=C2=A0 */ >>> +=C2=A0=C2=A0=C2=A0 if (current->cpu !=3D get_boot_cpu_id()) >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_lock_irqsave(&pgdat-= >node_size_lock, *flags); >>> +=C2=A0=C2=A0=C2=A0 else >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_lock(&pgdat->node_si= ze_lock); >>> =C2=A0 } >>> =C2=A0 static inline >>> =C2=A0 void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned l= ong *flags) >>> =C2=A0 { >>> -=C2=A0=C2=A0=C2=A0 spin_unlock_irqrestore(&pgdat->node_size_lock, *f= lags); >>> +=C2=A0=C2=A0=C2=A0 if (current->cpu !=3D get_boot_cpu_id()) >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_unlock_irqrestore(&p= gdat->node_size_lock, *flags); >>> +=C2=A0=C2=A0=C2=A0 else >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_unlock(&pgdat->node_= size_lock); >>> =C2=A0 } >>> =C2=A0 static inline >>> =C2=A0 void pgdat_resize_init(struct pglist_data *pgdat) >> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option = even may preempt >> *kernel* code in the middle of function. When you are executing a code= containing >> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate= to another cpu >> between them. >> >> bool cpu=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 another cpu >> ---------------------------------- >> pgdat_resize_lock() >> =C2=A0=C2=A0 spin_lock() >> =C2=A0=C2=A0 --> migrate to another cpu >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgdat_resize= _unlock() >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_unlock_= irqrestore() >> >> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after = spin_lock() call). >> >> This looks like a bad helpers, and we should not introduce such the de= sign. >=20 > Hi Kirill, >=20 > Thanks for your comments! > Sorry for I'm not very clear about this lock/unlock, but I encountered = this issue > with "CONFIG_PREEMPT is not set". The thing is we simply shouldn't introduce such the primitives since the = thread may migrate to another cpu, while you own the lock. This looks like a bug= gy design. >> 2)I think there is no the problem this patch solves. Do we really this= statistics? >> Can't we simple remove print message from deferred_init_memmap() and s= olve this? >=20 > Sorry for I've not put this issue very clearly. It's *not* just one sta= tistics log > with wrong time calculate, but the wall clock is stuck. > So the 'systemd-analyze' command also give a wrong time as I mentioned = in the cover > letter. I don't think is OK just remove the log, it cannot solve the wa= ll clock latency. Have you tried temporary enabling interrupts in the middle of cycle after= a huge enough memory block is initialized? Something like: deferred_init_memmap() { while (spfn < epfn) { nr_pages +=3D deferred_init_maxorder(&i, zone, &spfn, &epfn); local_irq_enable(); local_irq_disable(); } } Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxor= der(). >> Also, you may try to check that sched_clock() gives better results wit= h interrupts >> disabled (on x86 it uses rdtsc, when it's possible. But it also may fa= llback to >> jiffies-based clock in some hardware cases, and they also won't go wit= h interrupts >> disabled).