Date: Thu, 9 Apr 2020 11:50:48 +0100
From: Chris Down
To: Bruno Prémont
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner,
 Michal Hocko, Vladimir Davydov
Subject: Re: Memory CG and 5.1 to 5.6 uprade slows backup
Message-ID: <20200409105048.GA1040020@chrisdown.name>
In-Reply-To: <20200409112505.2e1fc150@hemera.lan.sysophe.eu>

Hi Bruno,

Bruno Prémont writes:
>Upgrading from 5.1 kernel to 5.6 kernel on a production system using
>cgroups (v2) and having backup process in a memory.high=2G cgroup
>sees backup being highly throttled (there are about 1.5T to be
>backuped).

Before 5.4, memory usage with memory.high=N is essentially unbounded if the
system is not able to reclaim pages for some reason. This is because all
memory.high throttling before that point is just based on forcing direct
reclaim for a cgroup, but there's no guarantee that we can actually reclaim
pages, or that it will serve as a time penalty.

In 5.4, my patch 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high") changes kernel behaviour to actively penalise
cgroups that exceed their memory.high by a large amount. That is, if reclaim
fails to bring the cgroup back below the high threshold, we actively
deschedule the allocating process for a number of jiffies that increases
exponentially with the amount of overage incurred. This is so that cgroups
using memory.high cannot simply have runaway memory usage without any
consequences.
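To make that a little more concrete, here is a rough sketch of the shape of
the logic, in Python. This is purely illustrative and not the kernel code:
the real implementation lives in mem_cgroup_handle_over_high(), works in
jiffies, and uses a different curve and different constants.

    # Sketch only: names and constants here are made up for illustration.
    MAX_PENALTY_MS = 2000  # a single stall is capped at roughly 2s (see below)

    def penalty_ms(usage_bytes, high_bytes):
        """Pick a sleep that grows steeply with the overage past memory.high."""
        if usage_bytes <= high_bytes:
            return 0
        overage = usage_bytes - high_bytes
        factor = overage / high_bytes
        # The kernel's curve is not exactly this; the point is only that small
        # overages cost little and large ones quickly become expensive.
        return min(int(factor * factor * MAX_PENALTY_MS), MAX_PENALTY_MS)

    # After reclaim fails to bring usage back under memory.high, the
    # allocating task is descheduled for roughly this long before it is
    # allowed to return to userspace.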
This is the patch that I'd particularly suspect is related to your problem.
However:

>Most memory usage in that cgroup is for file cache.
>
>Here are the memory details for the cgroup:
>memory.current:2147225600
>[...]
>memory.events:high 423774
>memory.events:max 31131
>memory.high:2147483648
>memory.max:2415919104

Your high limit is being exceeded heavily and you are failing to reclaim.
You have `max` events here, which means your application was, at least at
some point, more than 268 *mega*bytes over its memory.high.

So yes, we will penalise this cgroup heavily since we cannot reclaim from it.
The real question is why we can't reclaim from it :-)

>memory.low:33554432

You have a memory.low set, which will bias reclaim away from this cgroup
based on overage. It's not very large, though, so it shouldn't change the
semantics here, although it's worth noting since it also changed in another
one of my patches, 9783aa9917f8 ("mm, memcg: proportional memory.{low,min}
reclaim"), which is also in 5.4.

In 5.1, as soon as you exceed memory.low, you immediately lose all
protection. This is not ideal because it results in extremely binary,
back-and-forth behaviour for cgroups using it (see the changelog for more
information). This change means you will still receive some small amount of
protection based on your overage, but it's fairly insignificant in this case
(memory.current is about 64x larger than memory.low). What did you intend to
do with this in 5.1? :-)

>memory.stat:anon 10887168
>memory.stat:file 2062102528
>memory.stat:kernel_stack 73728
>memory.stat:slab 76148736
>memory.stat:sock 360448
>memory.stat:shmem 0
>memory.stat:file_mapped 12029952
>memory.stat:file_dirty 946176
>memory.stat:file_writeback 405504
>memory.stat:anon_thp 0
>memory.stat:inactive_anon 0
>memory.stat:active_anon 10121216
>memory.stat:inactive_file 1954959360
>memory.stat:active_file 106418176
>memory.stat:unevictable 0
>memory.stat:slab_reclaimable 75247616
>memory.stat:slab_unreclaimable 901120
>memory.stat:pgfault 8651676
>memory.stat:pgmajfault 2013
>memory.stat:workingset_refault 8670651
>memory.stat:workingset_activate 409200
>memory.stat:workingset_nodereclaim 62040
>memory.stat:pgrefill 1513537
>memory.stat:pgscan 47519855
>memory.stat:pgsteal 44933838
>memory.stat:pgactivate 7986
>memory.stat:pgdeactivate 1480623
>memory.stat:pglazyfree 0
>memory.stat:pglazyfreed 0
>memory.stat:thp_fault_alloc 0
>memory.stat:thp_collapse_alloc 0

It's hard to say from these statistics alone why we can't reclaim; if
anything, the kernel is usually *over*-eager to drop cache pages.

If the kernel thinks those file pages are too hot, though, it won't drop
them. However, we only have about 106M of active file compared to 2G of
memory.current, so it doesn't look like this is the issue.

Can you please show io.pressure, io.stat, and cpu.pressure during these
periods compared to baseline, both for this cgroup and globally (from
/proc/pressure)? My suspicion is that we are not able to reclaim fast enough
because memory management is getting stuck behind a slow disk. Swap
availability and usage information would also be helpful.

>Regularly the backup process seems to be blocked for about 2s, but not
>within a syscall according to strace.

2 seconds is significant: it's the maximum time we allow the allocator
throttler to throttle a single allocation :-)

If you want to verify, you can look at /proc/pid/stack during these stalls --
it should be in mem_cgroup_handle_over_high, or at an address related to
allocator throttling.
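If it helps with gathering that, here is a rough helper in Python (untested
against your setup; the cgroup path is a placeholder you'd need to adjust)
that samples the pressure files, io.stat, swap usage, and the backup
process's kernel stack once a second, so a stall can be compared against
baseline:

    #!/usr/bin/env python3
    # Hypothetical helper, not something from the kernel tree: samples PSI,
    # io.stat, swap usage, and the target task's kernel stack once a second.
    # Needs root for /proc/<pid>/stack. Adjust CGROUP for your layout.
    import sys
    import time

    CGROUP = "/sys/fs/cgroup/backup"   # placeholder: path to the backup cgroup
    PID = sys.argv[1]                  # PID of the backup process

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError as e:
            return "<unreadable: %s>" % e

    while True:
        print("===", time.strftime("%H:%M:%S"))
        for name in ("io.pressure", "cpu.pressure", "io.stat",
                     "memory.swap.current"):
            print(name, "\n" + read("%s/%s" % (CGROUP, name)))
        for path in ("/proc/pressure/io", "/proc/pressure/cpu",
                     "/proc/pressure/memory"):
            print(path, "\n" + read(path))
        # During a stall this should show mem_cgroup_handle_over_high.
        print("/proc/%s/stack" % PID, "\n" + read("/proc/%s/stack" % PID))
        time.sleep(1)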
>Is there a way to tell kernel that this cgroup should not be throttled

Huh? That's what memory.high is for, so why are you using it if you don't
want that?

>and its inactive file cache given up (rather quickly).

I suspect the kernel is reclaiming as far as it can, but is being stopped
from doing so for some reason, which is why I'd like to see io.pressure and
cpu.pressure.

>On a side note, I liked v1's mode of soft/hard memory limit where the
>memory amount between soft and hard could be used if system has enough
>free memory. For v2 the difference between high and max seems almost of
>no use.

For that use case, that's more or less what we've designed memory.low to do.
The difference is that v1's soft limit almost never worked: the heuristics
are extremely complicated, so complicated in fact that even we as memcg
maintainers cannot reason about them. If we cannot reason about them, I'm
quite sure it's not really doing what you expect :-)

In this case everything looks like it's working as intended; this is all just
the result of memory.high becoming less broken in 5.4. From your description,
I'm not sure that memory.high is what you want, either.

>A cgroup parameter for impacting RO file cache differently than
>anonymous memory or otherwise dirty memory would be great too.

We had vm.swappiness in v1 and it manifested extremely poorly. I won't go
into too much detail here though, since we already discussed it fairly
comprehensively in [0].

Please feel free to send over the io.pressure, io.stat, cpu.pressure, and
swap metrics at baseline and during these stalls when possible.

Thanks!

0: https://lore.kernel.org/patchwork/patch/1172080/