Date: Fri, 23 Apr 2021 14:23:38 -0600
From: Yu Zhao
To: Xing Zhengjun
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    ying.huang@intel.com, tim.c.chen@linux.intel.com, Shakeel Butt,
    Michal Hocko, wfg@mail.ustc.edu.cn
Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by too_many_isolated()
References: <20210416023536.168632-1-zhengjun.xing@linux.intel.com>
 <7b7a1c09-3d16-e199-15d2-ccea906d4a66@linux.intel.com>
 <7a0fecab-f9e1-ad39-d55e-01e574a35484@linux.intel.com>
In-Reply-To: <7a0fecab-f9e1-ad39-d55e-01e574a35484@linux.intel.com>

On Fri, Apr 23, 2021 at 02:57:07PM +0800, Xing Zhengjun wrote:
> On 4/23/2021 1:13 AM, Yu Zhao wrote:
> > On Thu, Apr 22, 2021 at 04:36:19PM +0800, Xing Zhengjun wrote:
> > > Hi,
> > >
> > > On a system with very few file pages (nr_active_file + nr_inactive_file
> > > < 100), it is easy to reproduce "nr_isolated_file > nr_inactive_file";
> > > too_many_isolated() then returns true, shrink_inactive_list() enters
> > > "msleep(100)", and long latency follows.
> > >
> > > The test case to reproduce it is very simple: allocate many huge pages
> > > (near the DRAM size), then free them, and repeat the same operation many
> > > times. For this test case, on a system with very few file pages
> > > (nr_active_file + nr_inactive_file < 100), I dumped the numbers of
> > > active/inactive/isolated file pages during the whole test (see the
> > > attachments). In shrink_inactive_list(), "too_many_isolated" very easily
> > > returns true and we enter "msleep(100)". In "too_many_isolated",
> > > sc->gfp_mask is 0x342cca ("__GFP_IO" and "__GFP_FS" are set in the mask),
> > > so it is also very easy to take the "inactive >>= 3" path, after which
> > > "isolated > inactive" becomes true.
> > >
> > > So my proposal is to set a threshold for the total number of file pages:
> > > on systems with very few file pages, skip the check and bypass the 100 ms
> > > sleep. It is hard to pick a perfect threshold, so I just give "256" as an
> > > example.
> > >
> > > I would appreciate your suggestions/comments. Thanks.
> >
> > Hi Zhengjun,
> >
> > It seems to me that using the number of isolated pages to keep a lid on
> > direct reclaimers is not a good solution. We shouldn't keep going in
> > that direction if we really want to fix the problem, because migration
> > can isolate many pages too, which in turn blocks page reclaim.
> >
> > Here is something that works a lot better. Please give it a try. Thanks.
>
> Thanks, I will try it with my test cases.

Thanks. I took care of my sloppiness from yesterday and tested the
following. It should apply cleanly and work well. Please let me know.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..48bb2b77389e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -832,6 +832,7 @@ typedef struct pglist_data {
 #endif
 
 	/* Fields commonly accessed by the page reclaim scanner */
+	atomic_t nr_reclaimers;
 
 	/*
 	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..3fcdfbee89c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1775,43 +1775,6 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
-/*
- * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
- * then get rescheduled. When there are massive number of tasks doing page
- * allocation, such sleeping direct reclaimers may keep piling up on each CPU,
- * the LRU list will go small and be scanned faster than necessary, leading to
- * unnecessary swapping, thrashing and OOM.
- */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
-		struct scan_control *sc)
-{
-	unsigned long inactive, isolated;
-
-	if (current_is_kswapd())
-		return 0;
-
-	if (!writeback_throttling_sane(sc))
-		return 0;
-
-	if (file) {
-		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
-		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
-	} else {
-		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
-		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
-	}
-
-	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
-	 */
-	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
-		inactive >>= 3;
-
-	return isolated > inactive;
-}
-
 /*
  * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
  * On return, @list is reused as a list of pages to be freed by the caller.
@@ -1911,20 +1874,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	bool file = is_file_lru(lru);
 	enum vm_event_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-	bool stalled = false;
-
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		if (stalled)
-			return 0;
-
-		/* wait a bit for the reclaimer. */
-		msleep(100);
-		stalled = true;
-
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
-	}
 
 	lru_add_drain();
 
@@ -2903,6 +2852,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	pg_data_t *last_pgdat = NULL;
+	bool should_retry = false;
+	int nr_cpus = num_online_cpus();
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2914,9 +2865,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
 	}
-
+retry:
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					sc->reclaim_idx, sc->nodemask) {
+		/*
+		 * Shrink each node in the zonelist once. If the zonelist is
+		 * ordered by zone (not the default) then a node may be shrunk
+		 * multiple times but in that case the user prefers lower zones
+		 * being preserved.
+		 */
+		if (zone->zone_pgdat == last_pgdat)
+			continue;
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
@@ -2941,16 +2901,28 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				sc->compaction_ready = true;
 				continue;
 			}
+		}
 
-			/*
-			 * Shrink each node in the zonelist once. If the
-			 * zonelist is ordered by zone (not the default) then a
-			 * node may be shrunk multiple times but in that case
-			 * the user prefers lower zones being preserved.
-			 */
-			if (zone->zone_pgdat == last_pgdat)
-				continue;
+		/*
+		 * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from
+		 * the LRU list and then get rescheduled. When there are massive
+		 * number of tasks doing page allocation, such sleeping direct
+		 * reclaimers may keep piling up on each CPU, the LRU list will
+		 * go small and be scanned faster than necessary, leading to
+		 * unnecessary swapping, thrashing and OOM.
+		 */
+		VM_BUG_ON(current_is_kswapd());
 
+		if (!atomic_add_unless(&zone->zone_pgdat->nr_reclaimers, 1, nr_cpus)) {
+			should_retry = true;
+			continue;
+		}
+
+		if (last_pgdat)
+			atomic_dec(&last_pgdat->nr_reclaimers);
+		last_pgdat = zone->zone_pgdat;
+
+		if (!cgroup_reclaim(sc)) {
 			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
@@ -2966,13 +2938,20 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		/* See comment about same check for global reclaim above */
-		if (zone->zone_pgdat == last_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
 	}
 
+	if (last_pgdat)
+		atomic_dec(&last_pgdat->nr_reclaimers);
+	else if (should_retry) {
+		/* wait a bit for the reclaimer. */
+		if (!schedule_timeout_killable(HZ / 10))
+			goto retry;
+
+		/* We are about to die and free our memory. Return now. */
+		sc->nr_reclaimed += SWAP_CLUSTER_MAX;
+	}
+
 	/*
 	 * Restore to original mask to avoid the impact on the caller if we
 	 * promoted it to __GFP_HIGHMEM.
@@ -4189,6 +4168,15 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	set_task_reclaim_state(p, &sc.reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+		int nr_cpus = num_online_cpus();
+
+		VM_BUG_ON(current_is_kswapd());
+
+		if (!atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
+			schedule_timeout_killable(HZ / 10);
+			goto out;
+		}
+
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4196,8 +4184,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		do {
 			shrink_node(pgdat, &sc);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
-	}
 
+		atomic_dec(&pgdat->nr_reclaimers);
+	}
+out:
 	set_task_reclaim_state(p, NULL);
 	current->flags &= ~PF_SWAPWRITE;
 	memalloc_noreclaim_restore(noreclaim_flag);
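
For readers who want the gist without walking through the whole diff: instead
of throttling direct reclaim on isolated-page counts, the patch caps the number
of concurrent direct reclaimers per node at num_online_cpus() with an atomic
counter in pglist_data; a task that cannot take a slot backs off briefly and
tries again. Below is a minimal userspace sketch (C11 atomics) of that gating
pattern, not the kernel code itself; the names (node_gate, try_enter_reclaim,
leave_reclaim) and the fixed 100 ms back-off are illustrative assumptions.

/*
 * Userspace model of the per-node gate the patch adds: at most "limit"
 * reclaimers run concurrently; everyone else sleeps briefly and retries.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct node_gate {
	atomic_int nr_reclaimers;	/* models pgdat->nr_reclaimers */
	int limit;			/* models num_online_cpus() */
};

/* Rough equivalent of atomic_add_unless(&counter, 1, limit): take a slot
 * only if the counter is currently below the limit. */
static bool try_enter_reclaim(struct node_gate *gate)
{
	int cur = atomic_load(&gate->nr_reclaimers);

	while (cur < gate->limit) {
		if (atomic_compare_exchange_weak(&gate->nr_reclaimers,
						 &cur, cur + 1))
			return true;	/* got a reclaim slot */
		/* on failure, cur holds the latest value; recheck the limit */
	}
	return false;			/* node already saturated */
}

static void leave_reclaim(struct node_gate *gate)
{
	atomic_fetch_sub(&gate->nr_reclaimers, 1);	/* like atomic_dec() */
}

static void direct_reclaim(struct node_gate *gate)
{
	while (!try_enter_reclaim(gate))
		usleep(100 * 1000);	/* stand-in for schedule_timeout_killable(HZ / 10) */

	/* ... the real code does its shrink_node() work here ... */

	leave_reclaim(gate);
}

int main(void)
{
	struct node_gate gate = { .limit = 4 };

	atomic_init(&gate.nr_reclaimers, 0);
	direct_reclaim(&gate);
	printf("reclaimers now: %d\n", atomic_load(&gate.nr_reclaimers));
	return 0;
}

One difference from this simple loop: the patch's back-off uses
schedule_timeout_killable(), so a task with a fatal signal pending stops
waiting, claims SWAP_CLUSTER_MAX as reclaimed and lets the allocator move on.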