From: Hillf Danton <hdanton@sina.com>
To: Ivan Babrou
Cc: Mel Gorman, Vlastimil Babka, Andrew Morton, mm, linux-kernel, kernel-team
Subject: Re: Reclaim regression after 1c30844d2dfe
Date: Sat, 8 Feb 2020 19:11:33 +0800
Message-Id: <20200208111133.16808-1-hdanton@sina.com>

On Fri, 7 Feb 2020 14:54:43 -0800 Ivan Babrou wrote:
> This change from the 5.5 timeframe:
>
> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>
> > mm: reclaim small amounts of memory when an external fragmentation event occurs
>
> introduced undesired effects in our environment:
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before the upgrade, free memory hovered at around 1.4GB with no spikes.
> After the upgrade, some machines decided that they needed a lot more than
> that, with frequent spikes above 10GB, often on a single NUMA node only.
>
> We can see kswapd quite active in balance_pgdat (it didn't look like
> it slept at all):
>
> $ ps uax | fgrep kswapd
> root      1850 23.0  0.0      0     0 ?        R    Jan30 1902:24 [kswapd0]
> root      1851  1.8  0.0      0     0 ?        S    Jan30  152:16 [kswapd1]
>
> This in turn massively increased pressure on the page cache, which did
> not go well for services that depend on a quick response from a local
> cache backed by solid storage.
>
> Here's how it looked when I zeroed vm.watermark_boost_factor:
>
> * https://imgur.com/a/6IZWicU
>
> IO subsided from 100% busy populating the page cache at 300MB/s on a
> single SATA drive down to under 100MB/s.
>
> This sort of regression doesn't seem like a good thing.

Here are two small diffs :P

[1] cleanup: stop reclaiming pages once balanced.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3641,6 +3641,9 @@ restart:
 		 * re-evaluate if boosting is required when kswapd next wakes.
 		 */
 		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (balanced)
+			break;
+
 		if (!balanced && nr_boost_reclaim) {
 			nr_boost_reclaim = 0;
 			goto restart;
--

[2] restore the old behavior by ignoring boost before falling in hot water.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3572,7 +3572,7 @@ static int balance_pgdat(pg_data_t *pgda
 	unsigned long pflags;
 	unsigned long nr_boost_reclaim;
 	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
-	bool boosted;
+	bool boosted = false;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -3591,18 +3591,22 @@ static int balance_pgdat(pg_data_t *pgda
 	 * place so that parallel allocations that are near the watermark will
 	 * stall or direct reclaim until kswapd is finished.
 	 */
+restart:
 	nr_boost_reclaim = 0;
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
 		if (!managed_zone(zone))
 			continue;
 
+		if (boosted) {
+			zone->watermark_boost = 0;
+			continue;
+		}
 		nr_boost_reclaim += zone->watermark_boost;
 		zone_boosts[i] = zone->watermark_boost;
 	}
 	boosted = nr_boost_reclaim;
 
-restart:
 	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
@@ -3644,10 +3648,9 @@ restart:
 		if (balanced)
 			break;
 
-		if (!balanced && nr_boost_reclaim) {
-			nr_boost_reclaim = 0;
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
 			goto restart;
-		}
 
 		/*
 		 * If boosting is not active then only reclaim if there are no
--
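To illustrate the intent of diff [2]: after the restart, each zone's
watermark_boost is cleared before it is accounted, so the retry runs with
nr_boost_reclaim == 0, i.e. as plain reclaim with no boosting. A toy model
in plain Python (not kernel code; the zone values are invented for
illustration) of just that accounting loop:

```python
def account_boost(zones, boosted):
    """One pass over the eligible zones, as in the restart path of diff [2].

    First pass (boosted=False): sum up each zone's watermark_boost.
    After a restart (boosted=True): clear every zone's boost and account
    nothing, so the retry ignores boosting entirely.
    """
    nr_boost_reclaim = 0
    for zone in zones:
        if boosted:
            zone["watermark_boost"] = 0  # mirrors zone->watermark_boost = 0
            continue
        nr_boost_reclaim += zone["watermark_boost"]
    return nr_boost_reclaim

zones = [{"watermark_boost": 4}, {"watermark_boost": 0}, {"watermark_boost": 8}]
first = account_boost(zones, boosted=False)  # 12: boost reclaim is active
retry = account_boost(zones, boosted=True)   # 0: boosts cleared, plain reclaim
print(first, retry)  # → 12 0
```

Whether clearing the boosts this early also loses the deferred
watermark-boost accounting that the original code does after the main loop
is something the real patch would need to answer; the model only shows the
accounting, not the reclaim itself.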