Date: Wed, 16 Sep 2020 08:53:06 +0200
From: Michal Hocko
To: Vijay Balakrishna
Cc: Andrew Morton, "Kirill A. Shutemov", Oleg Nesterov, Song Liu,
	Andrea Arcangeli, Pavel Tatashin, Allen Pais,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
Message-ID: <20200916065306.GB18998@dhcp22.suse.cz>
References: <1599770859-14826-1-git-send-email-vijayb@linux.microsoft.com>
	<20200914143312.GU16999@dhcp22.suse.cz>
	<20200915081832.GA4649@dhcp22.suse.cz>
	<53dd1e2c-f07e-ee5b-51a1-0ef8adb53926@linux.microsoft.com>
In-Reply-To: <53dd1e2c-f07e-ee5b-51a1-0ef8adb53926@linux.microsoft.com>

On Tue 15-09-20 08:48:08, Vijay Balakrishna wrote:
>
>
> On 9/15/2020 1:18 AM, Michal Hocko wrote:
> > On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
> > >
> > >
> > > On 9/14/2020 7:33 AM, Michal Hocko wrote:
> > > > On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> > > > > When memory is hotplug added or removed the min_free_kbytes must be
> > > > > recalculated based on what is expected by khugepaged.  Currently
> > > > > after hotplug, min_free_kbytes will be set to a lower default and higher
> > > > > default set when THP enabled is lost.
> > > > > This leaves the system with small
> > > > > min_free_kbytes which isn't suitable for systems especially with network
> > > > > intensive loads.  Typical failure symptoms include HW WATCHDOG reset,
> > > > > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > > > > kills.
> > > >
> > > > Care to explain some more please? The whole point of increasing
> > > > min_free_kbytes for THP is to get a larger free memory with a hope that
> > > > huge pages will be more likely to appear. While this might help for
> > > > other users that need a high order pages it is definitely not the
> > > > primary reason behind it. Could you provide an example with some more
> > > > data?
> > >
> > > Thanks Michal.  I haven't looked into THP as part of my investigation, so I
> > > cannot comment.
> > >
> > > In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
> > > during normal reboot/shutdown.  This memory is hotplug hot-added as movable
> > > type via systemd late service during start-of-day.
> > >
> > > In our stress test first we ran into HW WATCHDOG recovery, on enabling
> > > kernel watchdog we started seeing soft lockup hung task notices, failure
> > > symptoms varied, where stack trace of hung tasks sometimes trying to
> > > allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
> > > timeouts, OOM process kills etc.,  During investigation we reran stress test
> > > without hotplug use case.  Surprisingly this run didn't encounter the said
> > > problems.  This led to comparing what is different between the two runs,
> > > while looking at various globals, studying hotplug code I uncovered the
> > > issue of failing to restore min_free_kbytes.  In particular on our 8GB SoC
> > > min_free_kbytes went down to 8703 from 22528 after hotplug add.
> >
> > Did you try to increase min_free_kbytes manually after hot remove? Btw.
>
> No, in our use case memory hot remove done during shutdown.

I do not follow. If you are hotremoving during shutdown then how come
the value of min_free_kbytes matters at all?

> > I would consider oom killer invocation due to min_free_kbytes really
> > weird behavior. If anything the higher value would cause more memory
> > reclaim and potentially oom rather than smaller one.
>
> Yes, we wondered about it too.  One panic stack trace (after many OOM kills)
>
> [330321.174240] Out of memory and no killable processes...
> [330321.179658] Kernel panic - not syncing: System is deadlocked on memory
> [330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G O
> 5.4.51-xxx #1
> [330321.196900] Hardware name: Overlake (DT)
> [330321.201038] Call trace:
> [330321.203660] dump_backtrace+0x0/0x1d0
> [330321.207533] show_stack+0x20/0x2c
> [330321.211048] dump_stack+0xe8/0x150
> [330321.214656] panic+0x18c/0x3b4
> [330321.217901] out_of_memory+0x4c0/0x6e4
> [330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
> [330321.226722] alloc_pages_current+0x21c/0x2b0
> [330321.231220] alloc_slab_page+0x1e0/0x7d8
> [330321.235361] new_slab+0x2e8/0x2f8
> [330321.238874] ___slab_alloc+0x45c/0x59c
> [330321.242835] kmem_cache_alloc+0x2d4/0x360
> [330321.247065] getname_flags+0x6c/0x2a8
> [330321.250938] user_path_at_empty+0x3c/0x68
> [330321.255168] do_readlinkat+0x7c/0x17c
> [330321.259039] __arm64_sys_readlinkat+0x5c/0x70
> [330321.263627] el0_svc_handler+0x1b8/0x32c
> [330321.267767] el0_svc+0x10/0x14
> [330321.271026] SMP: stopping secondary CPUs
> [330321.275382] Starting crashdump kernel...
> [330321.279526] Bye!

Do you have the full oom splat? The fact that previous oom killer
invocations haven't helped and all the eligible tasks have been killed
and you still hit the oom would suggest there is a lot of memory
allocated without a direct relation to tasks. I fail to see how
min_free_kbytes would be related.

> Then while searching I came across documented warning below.  In above
> instance panic after OOM kills happened after 3+ days of stress run (a
> mixture of ttcp, cpuloadgen and fio).
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
>
> Warning
>
> Extreme values can damage your system. Setting min_free_kbytes to an
> extremely low value prevents the system from reclaiming memory, which can
> result in system hangs and OOM-killing processes. However, setting
> min_free_kbytes too high (for example, to 5–10% of total system memory)
> causes the system to enter an out-of-memory state immediately, resulting in
> the system spending too much time reclaiming memory.

The auto tuned value should never reach such a low value to cause
problems.
-- 
Michal Hocko
SUSE Labs
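
[Editorial note, not part of the original message] The two values quoted
in the thread (22528 and 8703) are consistent with the two sizing
formulas involved: the khugepaged recommendation and the generic
square-root default. The sketch below is a rough Python model of that
arithmetic, not kernel code. The function names are hypothetical. It
assumes 4 KiB pages, 512-page pageblocks, a single populated zone and
the 65536 kB upper clamp; none of these are stated in the thread. The
lowmem figure of 4734000 kB is a made-up input chosen only because it
lands on 8703; the real value on the reporter's system is unknown.

```python
from math import isqrt

MIGRATE_PCPTYPES = 3  # unmovable, movable, reclaimable


def default_min_free_kbytes(lowmem_kbytes, cap_kbytes=65536):
    """Generic default: int_sqrt(lowmem_kbytes * 16), clamped to [128, cap].

    cap_kbytes=65536 is an assumption about the kernel in use.
    """
    return max(128, min(isqrt(lowmem_kbytes * 16), cap_kbytes))


def khugepaged_min_free_kbytes(nr_zones, lowmem_pages,
                               pageblock_nr_pages=512, page_shift=12):
    """Model of the value khugepaged recommends when THP is enabled."""
    # Two free pageblocks per zone to help fragmentation avoidance.
    pages = pageblock_nr_pages * nr_zones * 2
    # Extra headroom so migratetype fallbacks stay rare.
    pages += pageblock_nr_pages * nr_zones * MIGRATE_PCPTYPES * MIGRATE_PCPTYPES
    # Never reserve more than 5% of lowmem.
    pages = min(pages, lowmem_pages // 20)
    # Convert pages to kB.
    return pages << (page_shift - 10)


if __name__ == "__main__":
    # 8 GiB of 4 KiB pages, one populated zone (assumed).
    print(khugepaged_min_free_kbytes(nr_zones=1, lowmem_pages=2**21))  # 22528
    # Hypothetical lowmem size picked to reproduce the reported value.
    print(default_min_free_kbytes(4_734_000))  # 8703
```

Under these assumptions the drop from 22528 to 8703 is what one would
see if only the generic default were recomputed after hotplug and the
khugepaged recommendation were not reapplied, which is the behaviour
the patch description complains about.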