From: Shakeel Butt
Date: Thu, 28 May 2020 11:02:35 -0700
Subject: Re: [PATCH] mm, memcg: reclaim more aggressively before high allocator throttling
To: Chris Down
Cc: Andrew Morton, Johannes Weiner, Tejun Heo, Michal Hocko, Linux MM, Cgroups, LKML, Kernel Team
In-Reply-To: <20200520143712.GA749486@chrisdown.name>

I haven't gone through the whole email chain, so I might be asking some
repetitive questions; I will go through the chain later.

On Wed, May 20, 2020 at 7:37 AM Chris Down wrote:
>
> In Facebook production, we've seen cases where cgroups have been put
> into allocator throttling even when they appear to have a lot of slack
> file caches which should be trivially reclaimable.
>
> Looking more closely, the problem is that we only try a single cgroup
> reclaim walk for each return to usermode before calculating whether or
> not we should throttle. This single attempt doesn't produce enough
> pressure to shrink for cgroups with a rapidly growing amount of file
> caches prior to entering allocator throttling.

In my experience it is usually shrink_slab that needs to be hit
multiple times before it actually reclaims memory.
>
> As an example, we see that threads in an affected cgroup are stuck in
> allocator throttling:
>
> # for i in $(cat cgroup.threads); do
> >     grep over_high "/proc/$i/stack"
> > done
> [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>
> ...however, there is no I/O pressure reported by PSI, despite a lot of
> slack file pages:
>
> # cat memory.pressure
> some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> # cat io.pressure
> some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> # grep _file memory.stat
> inactive_file 1370939392
> active_file 661635072
>
> This patch changes the behaviour to retry reclaim either until the
> current task goes below the 10ms grace period, or we are making no
> reclaim progress at all. In the latter case, we enter reclaim throttling
> as before.
>
> To a user, there's no intuitive reason for the reclaim behaviour to
> differ from hitting memory.high as part of a new allocation, as opposed
> to hitting memory.high because someone lowered its value. As such this
> also brings an added benefit: it unifies the reclaim behaviour between
> the two.

What was the initial reason to have the different behavior in the first
place?

> There's precedent for this behaviour: we already do reclaim retries when
> writing to memory.{high,max}, in max reclaim, and in the page allocator
> itself.
>
> Signed-off-by: Chris Down
> Cc: Andrew Morton
> Cc: Johannes Weiner
> Cc: Tejun Heo
> Cc: Michal Hocko
> ---
>  mm/memcontrol.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2df9510b7d64..b040951ccd6b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,6 +73,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>
>  struct mem_cgroup *root_mem_cgroup __read_mostly;
>
> +/* The number of times we should retry reclaim failures before giving up. */
>  #define MEM_CGROUP_RECLAIM_RETRIES 5
>
>  /* Socket memory accounting disabled? */
> @@ -2228,17 +2229,22 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>
> -static void reclaim_high(struct mem_cgroup *memcg,
> -			 unsigned int nr_pages,
> -			 gfp_t gfp_mask)
> +static unsigned long reclaim_high(struct mem_cgroup *memcg,
> +				  unsigned int nr_pages,
> +				  gfp_t gfp_mask)
>  {
> +	unsigned long nr_reclaimed = 0;
> +
>  	do {
>  		if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high))
>  			continue;
>  		memcg_memory_event(memcg, MEMCG_HIGH);
> -		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> +							     gfp_mask, true);
>  	} while ((memcg = parent_mem_cgroup(memcg)) &&
>  		 !mem_cgroup_is_root(memcg));
> +
> +	return nr_reclaimed;
>  }
>
>  static void high_work_func(struct work_struct *work)
> @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
>  {
>  	unsigned long penalty_jiffies;
>  	unsigned long pflags;
> +	unsigned long nr_reclaimed;
>  	unsigned int nr_pages = current->memcg_nr_pages_over_high;

Is there any benefit to keeping current->memcg_nr_pages_over_high after
this change? Why not just use SWAP_CLUSTER_MAX?
> +	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *memcg;
>
>  	if (likely(!nr_pages))
>  		return;
>
>  	memcg = get_mem_cgroup_from_mm(current->mm);
> -	reclaim_high(memcg, nr_pages, GFP_KERNEL);
>  	current->memcg_nr_pages_over_high = 0;
>
> +retry_reclaim:
> +	nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +
>  	/*
>  	 * memory.high is breached and reclaim is unable to keep up. Throttle
>  	 * allocators proactively to slow down excessive growth.
> @@ -2403,6 +2413,14 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
>
> +	/*
> +	 * If reclaim is making forward progress but we're still over
> +	 * memory.high, we want to encourage that rather than doing allocator
> +	 * throttling.
> +	 */
> +	if (nr_reclaimed || nr_retries--)
> +		goto retry_reclaim;
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> --
> 2.26.2
>