From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Ax4L=7D=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=3.0 tests=MAILING_LIST_MULTI,
	SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 17AFFC433DF
	for <linux-mm@archiver.kernel.org>; Thu, 21 May 2020 07:19:35 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id C58C820873
	for <linux-mm@archiver.kernel.org>; Thu, 21 May 2020 07:19:34 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C58C820873
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 59A7880019; Thu, 21 May 2020 03:19:34 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 54B4D80007; Thu, 21 May 2020 03:19:34 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 4399E80019; Thu, 21 May 2020 03:19:34 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0231.hostedemail.com [216.40.44.231])
	by kanga.kvack.org (Postfix) with ESMTP id 2858D80007
	for <linux-mm@kvack.org>; Thu, 21 May 2020 03:19:34 -0400 (EDT)
Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay05.hostedemail.com (Postfix) with ESMTP id DB2FF181AEF23
	for <linux-mm@kvack.org>; Thu, 21 May 2020 07:19:33 +0000 (UTC)
X-FDA: 76839875826.29.dust74_5376905d3b338
X-HE-Tag: dust74_5376905d3b338
X-Filterd-Recvd-Size: 5323
Received: from mail-ed1-f65.google.com (mail-ed1-f65.google.com [209.85.208.65])
	by imf09.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Thu, 21 May 2020 07:19:33 +0000 (UTC)
Received: by mail-ed1-f65.google.com with SMTP id k19so5918638edv.9
        for <linux-mm@kvack.org>; Thu, 21 May 2020 00:19:33 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=Dn0yp+M8pDTS/BMQHfDmlIcD8nK1n9G9HT0uq2RMC+I=;
        b=tgpj2cmKjMtaAixcMPCKHb0O8FncjKm8d2HclqjK07utx8Mx/GGRphCQKQm2zJ9yut
         cqCAtrfN2F4cMetBRjUIlEnRNXr0HMhpMBX6kjkyKQP1q2riJfjzMHZV7dJ+U9VZUE2x
         KtBwvn9/oR7z4Tn9E1CRyPQVXyIP5Cjx4VMxZGoTDHLJoC2xWkfA4D5T5ww5uTRHWsIH
         gN+4QSYpuOcOXpanyeBc79d+5iHQBQqlRIOsFkRZp4ezP4WOJS7hJMyTjI2A7x1b6iWx
         Wegs6Uu0JJuVMDJgP+l1jWsoeSNuvteZV/yBlpSGMAh+siziHzvHEVbkIoJwNXS2hknH
         8hDQ==
X-Gm-Message-State: AOAM530JWuQ0j/XNxjvUQciOooTkvXVRZZehuNMM4c84PCwUnnaZMrBt
	SHVPPTBbiNGmiTUemERhvmQ=
X-Google-Smtp-Source: ABdhPJyWCoyi6jQRuJ/5H5jy56KynMgozJYzWwRIK2sM6SOw6/OHWD4BJMo9eRmUsz/NSr84X4T00g==
X-Received: by 2002:a50:8b42:: with SMTP id l60mr6576276edl.55.1590045571294;
        Thu, 21 May 2020 00:19:31 -0700 (PDT)
Received: from localhost (ip-37-188-180-112.eurotel.cz. [37.188.180.112])
        by smtp.gmail.com with ESMTPSA id w4sm3932025edx.66.2020.05.21.00.19.30
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 21 May 2020 00:19:30 -0700 (PDT)
Date: Thu, 21 May 2020 09:19:29 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Chris Down <chris@chrisdown.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm, memcg: reclaim more aggressively before high
 allocator throttling
Message-ID: <20200521071929.GH6462@dhcp22.suse.cz>
References: <20200520143712.GA749486@chrisdown.name>
 <20200520160756.GE6462@dhcp22.suse.cz>
 <20200520202650.GB558281@chrisdown.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200520202650.GB558281@chrisdown.name>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed 20-05-20 21:26:50, Chris Down wrote:
> Michal Hocko writes:
> > Let me try to understand the actual problem. The high memory reclaim has
> > a target which is proportional to the amount of charged memory. For most
> > requests that would be SWAP_CLUSTER_MAX though (resp. N times that where
> > N is the number of memcgs in excess up the hierarchy). I can see to be
> > insufficient if the memcg is already in a large excess but if the
> > reclaim can make a forward progress this should just work fine because
> > each charging context should reclaim at least the contributed amount.
> > 
> > Do you have any insight on why this doesn't work in your situation?
> > Especially with such a large inactive file list I would be really
> > surprised if the reclaim was not able to make a forward progress.
> 
> Reclaim can fail for any number of reasons, which is why we have retries
> sprinkled all over for it already. It doesn't seem hard to believe that it
> might just fail for transient reasons and drive us deeper into the hole as a
> result.

Reclaim can certainly fail. It is however surprising to see it fail with
such a large inactive lru list and reasonably small reclaim target.
Having the full LRU of dirty pages sounds a bit unusual; IO throttling
for v2 and explicit throttling during reclaim for v1 should prevent
that. If the reclaim gives up too easily then this should be addressed
at the reclaim level.

> In this case, a.) the application is producing tons of dirty pages, and b.)
> we have really heavy systemwide I/O contention on the affected machines.
> This high load is one of the reasons that direct and kswapd reclaim cannot
> keep up, and thus nr_pages can become a number of orders of magnitude larger
> than SWAP_CLUSTER_MAX. This is trivially reproducible on these machines,
> it's not an edge case.

Please elaborate some more. memcg_nr_pages_over_high shouldn't really
depend on system-wide activity. It should scale with the requested
charges. So yes, it can get large for something like a large read/write
which does a lot of allocations in a single syscall before returning to
userspace.
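
To illustrate what I mean by scaling with the requested charges, here is
a toy userspace model. It is not kernel code; toy_task, toy_charge and
toy_return_to_user are hypothetical names, and the only assumption is
that charges made while over the high limit accumulate in a per-task
counter that is consumed on return to userspace.

```c
#include <stdbool.h>

/* Hypothetical toy model of a task, not the kernel's task_struct. */
struct toy_task {
	long nr_pages_over_high; /* pages charged while above the high limit */
};

/* Record a charge; only charges made while over the high limit count. */
static void toy_charge(struct toy_task *t, long nr_pages, bool over_high)
{
	if (over_high)
		t->nr_pages_over_high += nr_pages;
}

/* On return to userspace: hand back the accumulated target and reset it. */
static long toy_return_to_user(struct toy_task *t)
{
	long target = t->nr_pages_over_high;

	t->nr_pages_over_high = 0;
	return target;
}
```

In this model a single 32 page charge yields a target of 32, while 1000
such charges within one syscall yield 32000. Nothing outside the task's
own charges feeds into the counter.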
 
But ok, let's say that the reclaim target is large and a single reclaim
attempt might then fail. In that case I am wondering why your patch does
not target reclaiming memcg_nr_pages_over_high pages and instead pushes
reclaim all the way down to the high limit.

The main problem I see with that approach is that the loop could easily
lead to reclaim unfairness: a heavy producer which doesn't leave the
kernel (e.g. a large read/write call) can keep a different task doing
all the reclaim work. The loop is effectively unbounded as long as
reclaim makes progress, so the amount of work done before returning to
userspace is by no means proportional to the requested memory/charge.
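
A rough sketch of the concern, again as a toy userspace model rather
than kernel code. All names here (toy_memcg, toy_reclaim,
reclaim_contribution, reclaim_to_high) are hypothetical, and the
producer is modelled simply as a fixed number of pages added after each
reclaim batch.

```c
/* Hypothetical toy model of a memcg: current usage and the high limit. */
struct toy_memcg {
	long usage;
	long high;
};

/* Pretend reclaim: frees up to `target` pages, never more than usage. */
static long toy_reclaim(struct toy_memcg *cg, long target)
{
	long freed = target < cg->usage ? target : cg->usage;

	cg->usage -= freed;
	return freed;
}

/* Variant A: reclaim only what this task charged (bounded work). */
static long reclaim_contribution(struct toy_memcg *cg, long nr_charged)
{
	return toy_reclaim(cg, nr_charged);
}

/*
 * Variant B: keep reclaiming until usage drops to the high limit while
 * another task keeps charging `producer_rate` pages per iteration for
 * `producer_rounds` iterations.
 */
static long reclaim_to_high(struct toy_memcg *cg, long batch,
			    long producer_rate, int producer_rounds)
{
	long total = 0;

	while (cg->usage > cg->high) {
		total += toy_reclaim(cg, batch);
		if (producer_rounds > 0) {
			cg->usage += producer_rate;
			producer_rounds--;
		}
	}
	return total;
}
```

With usage at 1100 and high at 1000, a task that charged 32 pages
reclaims 32 pages in variant A. In variant B, with a batch of 32 and a
producer adding 30 pages per iteration, the same task reclaims 1600
pages before it gets back to userspace.
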
-- 
Michal Hocko
SUSE Labs