From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 19 Feb 2020 19:37:31 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, Roman Gushchin, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
Message-ID: <20200219183731.GC11847@dhcp22.suse.cz>
References: <20200219181219.54356-1-hannes@cmpxchg.org>
In-Reply-To: <20200219181219.54356-1-hannes@cmpxchg.org>

On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> We have received regression reports from users whose workloads moved
> into containers and subsequently encountered new latencies. For some
> users these were a nuisance, but for some it meant missing their SLA
> response times. We tracked those delays down to cgroup limits, which
> inject direct reclaim stalls into the workload where previously all
> reclaim was handled by kswapd.

I am curious why this is unexpected when the high limit is explicitly
documented as a throttling mechanism.

> This patch adds asynchronous reclaim to the memory.high cgroup limit
> while keeping direct reclaim as a fallback. In our testing, this
> eliminated all direct reclaim from the affected workload.

Who is all this work accounted to? Unless I am missing something, it
just gets hidden in the system activity, and that might hurt the
isolation. I do see how moving the work to a different context is
desirable, but the work has to be accounted properly if it is going to
become a normal mode of operation (rather than a rare exception like
the existing irq-context handling).

> memory.high has a grace buffer of about 4% between when it becomes
> exceeded and when allocating threads get throttled. We can use the
> same buffer for the async reclaimer to operate in. If the worker
> cannot keep up and the grace buffer is exceeded, allocating threads
> will fall back to direct reclaim before getting throttled.
>
> For irq-context, there's already async memory.high enforcement. Re-use
> that work item for all allocating contexts, but switch it to the
> unbound workqueue so reclaim work doesn't compete with the workload.
> The work item is per cgroup, which means the workqueue infrastructure
> will create at maximum one worker thread per reclaiming cgroup.
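A self-contained sketch of the workqueue semantics that last point
relies on; the struct and function names below are illustrative, not
from the patch. queue_work() is a no-op while the item is still
pending, and a given work item never runs concurrently with itself,
which is what bounds reclaim to at most one worker per cgroup:

#include <linux/module.h>
#include <linux/workqueue.h>

/* Illustrative object embedding its own work item, like memcg->high_work. */
struct demo_obj {
	struct work_struct work;
};

static struct demo_obj obj;

static void demo_work_func(struct work_struct *work)
{
	struct demo_obj *o = container_of(work, struct demo_obj, work);

	/*
	 * Deferred work runs here, on an unbound worker thread; the
	 * workqueue guarantees it never runs concurrently with itself.
	 */
	(void)o;
}

static int __init demo_init(void)
{
	INIT_WORK(&obj.work, demo_work_func);

	/*
	 * The second queue_work() returns false and does nothing because
	 * the item is still pending: at most one instance is in flight.
	 */
	queue_work(system_unbound_wq, &obj.work);
	queue_work(system_unbound_wq, &obj.work);
	return 0;
}

static void __exit demo_exit(void)
{
	flush_work(&obj.work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
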
> 
> Signed-off-by: Johannes Weiner
> ---
>  mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
>  mm/vmscan.c     | 10 +++++++--
>  2 files changed, 54 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cf02e3ef3ed9..bad838d9c2bb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1446,6 +1446,10 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	seq_buf_printf(&s, "pgsteal %lu\n",
>  		       memcg_events(memcg, PGSTEAL_KSWAPD) +
>  		       memcg_events(memcg, PGSTEAL_DIRECT));
> +	seq_buf_printf(&s, "pgscan_direct %lu\n",
> +		       memcg_events(memcg, PGSCAN_DIRECT));
> +	seq_buf_printf(&s, "pgsteal_direct %lu\n",
> +		       memcg_events(memcg, PGSTEAL_DIRECT));
>  	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE),
>  		       memcg_events(memcg, PGACTIVATE));
>  	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE),
> @@ -2235,10 +2239,19 @@ static void reclaim_high(struct mem_cgroup *memcg,
> 
>  static void high_work_func(struct work_struct *work)
>  {
> +	unsigned long high, usage;
>  	struct mem_cgroup *memcg;
> 
>  	memcg = container_of(work, struct mem_cgroup, high_work);
> -	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
> +
> +	high = READ_ONCE(memcg->high);
> +	usage = page_counter_read(&memcg->memory);
> +
> +	if (usage <= high)
> +		return;
> +
> +	set_worker_desc("cswapd/%llx", cgroup_id(memcg->css.cgroup));
> +	reclaim_high(memcg, usage - high, GFP_KERNEL);
>  }
> 
>  /*
> @@ -2304,15 +2317,22 @@ void mem_cgroup_handle_over_high(void)
>  	unsigned long pflags;
>  	unsigned long penalty_jiffies, overage;
>  	unsigned int nr_pages = current->memcg_nr_pages_over_high;
> +	bool tried_direct_reclaim = false;
>  	struct mem_cgroup *memcg;
> 
>  	if (likely(!nr_pages))
>  		return;
> 
> -	memcg = get_mem_cgroup_from_mm(current->mm);
> -	reclaim_high(memcg, nr_pages, GFP_KERNEL);
>  	current->memcg_nr_pages_over_high = 0;
> 
> +	memcg = get_mem_cgroup_from_mm(current->mm);
> +	high = READ_ONCE(memcg->high);
> +recheck:
> +	usage = page_counter_read(&memcg->memory);
> +
> +	if (usage <= high)
> +		goto out;
> +
>  	/*
>  	 * memory.high is breached and reclaim is unable to keep up. Throttle
>  	 * allocators proactively to slow down excessive growth.
> @@ -2325,12 +2345,6 @@ void mem_cgroup_handle_over_high(void)
>  	 * overage amount.
>  	 */
> 
> -	usage = page_counter_read(&memcg->memory);
> -	high = READ_ONCE(memcg->high);
> -
> -	if (usage <= high)
> -		goto out;
> -
>  	/*
>  	 * Prevent division by 0 in overage calculation by acting as if it was a
>  	 * threshold of 1 page
> @@ -2369,6 +2383,16 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
> 
> +	/*
> +	 * It's possible async reclaim just isn't able to keep
> +	 * up. Before we go to sleep, try direct reclaim.
> +	 */
> +	if (!tried_direct_reclaim) {
> +		reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +		tried_direct_reclaim = true;
> +		goto recheck;
> +	}
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> @@ -2544,13 +2568,21 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	do {
>  		if (page_counter_read(&memcg->memory) > memcg->high) {
> +			/*
> +			 * Kick off the async reclaimer, which should
> +			 * be doing most of the work to avoid latency
> +			 * in the workload. But also check in on its
> +			 * progress before resuming to userspace, in
> +			 * case we need to do direct reclaim, or even
> +			 * throttle the allocating thread if reclaim
> +			 * cannot keep up with allocation demand.
> +			 */
> +			queue_work(system_unbound_wq, &memcg->high_work);
>  			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> -				schedule_work(&memcg->high_work);
> -				break;
> +			if (!in_interrupt()) {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
>  			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 74e8edce83ca..d6085115c7f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1947,7 +1947,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>  	reclaim_stat->recent_scanned[file] += nr_taken;
> 
> -	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> +	if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> +		item = PGSCAN_KSWAPD;
> +	else
> +		item = PGSCAN_DIRECT;
>  	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_scanned);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> @@ -1961,7 +1964,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 
>  	spin_lock_irq(&pgdat->lru_lock);
> 
> -	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> +	if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> +		item = PGSTEAL_KSWAPD;
> +	else
> +		item = PGSTEAL_DIRECT;
>  	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> -- 
> 2.24.1
> 

-- 
Michal Hocko
SUSE Labs
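
As a closing illustration, a minimal userspace sketch for watching the
pgscan_direct/pgsteal_direct counters the patch adds to memory.stat. It
assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup; the group name
"workload" is hypothetical:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* memory.stat is the per-cgroup stat file the patch extends */
	FILE *f = fopen("/sys/fs/cgroup/workload/memory.stat", "r");
	char line[256];

	if (!f) {
		perror("fopen memory.stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* nonzero direct counts mean the async worker fell behind */
		if (!strncmp(line, "pgscan_direct", 13) ||
		    !strncmp(line, "pgsteal_direct", 14))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}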