Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by too_many_isolated()
From: Xing Zhengjun <zhengjun.xing@linux.intel.com>
To: Yu Zhao
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, ying.huang@intel.com, tim.c.chen@linux.intel.com, Shakeel Butt, Michal Hocko, wfg@mail.ustc.edu.cn
Date: Fri, 30 Apr 2021 13:57:33 +0800
References: <20210416023536.168632-1-zhengjun.xing@linux.intel.com> <7b7a1c09-3d16-e199-15d2-ccea906d4a66@linux.intel.com> <7a0fecab-f9e1-ad39-d55e-01e574a35484@linux.intel.com>
Hi Yu,

On 4/24/2021 4:23 AM, Yu Zhao wrote:
> On Fri, Apr 23, 2021 at 02:57:07PM +0800, Xing Zhengjun wrote:
>> On 4/23/2021 1:13 AM, Yu Zhao wrote:
>>> On Thu, Apr 22, 2021 at 04:36:19PM +0800, Xing Zhengjun wrote:
>>>> Hi,
>>>>
>>>> On a system with very few file pages (nr_active_file + nr_inactive_file
>>>> < 100), it is easy to reproduce "nr_isolated_file > nr_inactive_file",
>>>> which makes too_many_isolated() return true; shrink_inactive_list() then
>>>> enters "msleep(100)" and long latency follows.
>>>>
>>>> The test case to reproduce it is very simple: allocate many huge pages
>>>> (near the DRAM size), then free them, and repeat the same operation many
>>>> times. Running this on a system with very few file pages (nr_active_file +
>>>> nr_inactive_file < 100), I dumped the numbers of active/inactive/isolated
>>>> file pages during the whole test (see the attachments). In
>>>> shrink_inactive_list(), too_many_isolated() very easily returns true, so
>>>> we enter "msleep(100)". In too_many_isolated(), sc->gfp_mask is 0x342cca
>>>> ("__GFP_IO" and "__GFP_FS" are set), so "inactive >>= 3" is applied, and
>>>> "isolated > inactive" easily becomes true.
>>>>
>>>> So my proposal is to set a threshold on the total number of file pages,
>>>> skip the check on systems with very few file pages, and thereby bypass
>>>> the 100ms sleep. It is hard to pick a perfect threshold, so I just give
>>>> "256" as an example.
>>>>
>>>> I would appreciate your suggestions/comments. Thanks.
>>>
>>> Hi Zhengjun,
>>>
>>> It seems to me using the number of isolated pages to keep a lid on
>>> direct reclaimers is not a good solution. We shouldn't keep going in
>>> that direction if we really want to fix the problem, because migration
>>> can isolate many pages too, which in turn blocks page reclaim.
>>>
>>> Here is something that works a lot better. Please give it a try. Thanks.
>>
>> Thanks, I will try it with my test cases.
>
> Thanks. I took care of my sloppiness from yesterday and tested the
> following. It should apply cleanly and work well. Please let me know.
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 47946cec7584..48bb2b77389e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -832,6 +832,7 @@ typedef struct pglist_data {
>  #endif
>
>  	/* Fields commonly accessed by the page reclaim scanner */
> +	atomic_t nr_reclaimers;
>
>  	/*
>  	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 562e87cbd7a1..3fcdfbee89c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1775,43 +1775,6 @@ int isolate_lru_page(struct page *page)
>  	return ret;
>  }
>
> -/*
> - * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
> - * then get rescheduled. When there are massive number of tasks doing page
> - * allocation, such sleeping direct reclaimers may keep piling up on each CPU,
> - * the LRU list will go small and be scanned faster than necessary, leading to
> - * unnecessary swapping, thrashing and OOM.
> - */
> -static int too_many_isolated(struct pglist_data *pgdat, int file,
> -			     struct scan_control *sc)
> -{
> -	unsigned long inactive, isolated;
> -
> -	if (current_is_kswapd())
> -		return 0;
> -
> -	if (!writeback_throttling_sane(sc))
> -		return 0;
> -
> -	if (file) {
> -		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> -		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> -	} else {
> -		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> -		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
> -	}
> -
> -	/*
> -	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
> -	 * won't get blocked by normal direct-reclaimers, forming a circular
> -	 * deadlock.
> -	 */
> -	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
> -		inactive >>= 3;
> -
> -	return isolated > inactive;
> -}
> -
>  /*
>   * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
>   * On return, @list is reused as a list of pages to be freed by the caller.
> @@ -1911,20 +1874,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	bool file = is_file_lru(lru);
>  	enum vm_event_item item;
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> -	bool stalled = false;
> -
> -	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		if (stalled)
> -			return 0;
> -
> -		/* wait a bit for the reclaimer. */
> -		msleep(100);
> -		stalled = true;
> -
> -		/* We are about to die and free our memory. Return now.
> -		 */
> -		if (fatal_signal_pending(current))
> -			return SWAP_CLUSTER_MAX;
> -	}
>
>  	lru_add_drain();
>
> @@ -2903,6 +2852,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	unsigned long nr_soft_scanned;
>  	gfp_t orig_mask;
>  	pg_data_t *last_pgdat = NULL;
> +	bool should_retry = false;
> +	int nr_cpus = num_online_cpus();
>
>  	/*
>  	 * If the number of buffer_heads in the machine exceeds the maximum
> @@ -2914,9 +2865,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		sc->gfp_mask |= __GFP_HIGHMEM;
>  		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
>  	}
> -
> +retry:
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  					sc->reclaim_idx, sc->nodemask) {
> +		/*
> +		 * Shrink each node in the zonelist once. If the zonelist is
> +		 * ordered by zone (not the default) then a node may be shrunk
> +		 * multiple times but in that case the user prefers lower zones
> +		 * being preserved.
> +		 */
> +		if (zone->zone_pgdat == last_pgdat)
> +			continue;
> +
>  		/*
>  		 * Take care memory controller reclaiming has small influence
>  		 * to global LRU.
> @@ -2941,16 +2901,28 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  				sc->compaction_ready = true;
>  				continue;
>  			}
> +		}
>
> -		/*
> -		 * Shrink each node in the zonelist once. If the
> -		 * zonelist is ordered by zone (not the default) then a
> -		 * node may be shrunk multiple times but in that case
> -		 * the user prefers lower zones being preserved.
> -		 */
> -		if (zone->zone_pgdat == last_pgdat)
> -			continue;
> +		/*
> +		 * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from
> +		 * the LRU list and then get rescheduled. When there are massive
> +		 * number of tasks doing page allocation, such sleeping direct
> +		 * reclaimers may keep piling up on each CPU, the LRU list will
> +		 * go small and be scanned faster than necessary, leading to
> +		 * unnecessary swapping, thrashing and OOM.
> +		 */
> +		VM_BUG_ON(current_is_kswapd());
>
> +		if (!atomic_add_unless(&zone->zone_pgdat->nr_reclaimers, 1, nr_cpus)) {
> +			should_retry = true;
> +			continue;
> +		}
> +
> +		if (last_pgdat)
> +			atomic_dec(&last_pgdat->nr_reclaimers);
> +		last_pgdat = zone->zone_pgdat;
> +
> +		if (!cgroup_reclaim(sc)) {
>  			/*
>  			 * This steals pages from memory cgroups over softlimit
>  			 * and returns the number of reclaimed pages and
> @@ -2966,13 +2938,20 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			/* need some check for avoid more shrink_zone() */
>  		}
>
> -		/* See comment about same check for global reclaim above */
> -		if (zone->zone_pgdat == last_pgdat)
> -			continue;
> -		last_pgdat = zone->zone_pgdat;
>  		shrink_node(zone->zone_pgdat, sc);
>  	}
>
> +	if (last_pgdat)
> +		atomic_dec(&last_pgdat->nr_reclaimers);
> +	else if (should_retry) {
> +		/* wait a bit for the reclaimer. */
> +		if (!schedule_timeout_killable(HZ / 10))
> +			goto retry;
> +
> +		/* We are about to die and free our memory. Return now. */
> +		sc->nr_reclaimed += SWAP_CLUSTER_MAX;
> +	}
> +
>  	/*
>  	 * Restore to original mask to avoid the impact on the caller if we
>  	 * promoted it to __GFP_HIGHMEM.
> @@ -4189,6 +4168,15 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	set_task_reclaim_state(p, &sc.reclaim_state);
>
>  	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
> +		int nr_cpus = num_online_cpus();
> +
> +		VM_BUG_ON(current_is_kswapd());
> +
> +		if (!atomic_add_unless(&pgdat->nr_reclaimers, 1, nr_cpus)) {
> +			schedule_timeout_killable(HZ / 10);
> +			goto out;
> +		}
> +
>  		/*
>  		 * Free memory by calling shrink node with increasing
>  		 * priorities until we have enough memory freed.
> @@ -4196,8 +4184,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  		do {
>  			shrink_node(pgdat, &sc);
>  		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
> -	}
>
> +		atomic_dec(&pgdat->nr_reclaimers);
> +	}
> +out:
>  	set_task_reclaim_state(p, NULL);
>  	current->flags &= ~PF_SWAPWRITE;
>  	memalloc_noreclaim_restore(noreclaim_flag);
>

I tested it with my compaction test case more than 30 times and could not
reproduce the 100ms sleep. I find that with the patch applied, the direct
reclaim path latency is much reduced, but the direct compact path latency
doubles compared with before.

 24)               |  __alloc_pages_direct_compact() {
 24)               |    try_to_compact_pages() {
 24)   0.131 us    |      __next_zones_zonelist();
 24) @ 184008.2 us |      compact_zone_order();
 24)   0.189 us    |      __next_zones_zonelist();
 24)   0.547 us    |      compact_zone_order();
 24)   0.225 us    |      __next_zones_zonelist();
 24)   0.592 us    |      compact_zone_order();
 24)   0.146 us    |      __next_zones_zonelist();
 24) @ 184012.3 us |    }
 24)               |    get_page_from_freelist() {
 24)   0.160 us    |      __zone_watermark_ok();
 24)   0.140 us    |      __next_zones_zonelist();
 24)   0.141 us    |      __zone_watermark_ok();
 24)   0.134 us    |      __next_zones_zonelist();
 24)   0.121 us    |      __zone_watermark_ok();
 24)   0.123 us    |      __next_zones_zonelist();
 24)   1.688 us    |    }
 24)   0.130 us    |    ___might_sleep();
 24)               |    __cond_resched() {
 24)   0.123 us    |      rcu_all_qs();
 24)   0.370 us    |    }
 24) @ 184015.2 us |  }
 24)               |  /* mm_page_alloc: page=0000000000000000 pfn=0 order=9 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE */
 24)               |  /* memlatency: lat=184716 order=9 gfp_flags=342cca (GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE|0x812a3c6000000000) migratetype=1 */

(The "memlatency" event measures the latency of "__alloc_pages_nodemask".)

-- 
Zhengjun Xing