Date: Wed, 25 Nov 2020 12:39:56 +0100
From: Bruno Prémont
To: Yafang Shao
Cc: Chris Down, Michal Hocko, Johannes Weiner, cgroups@vger.kernel.org, linux-mm@kvack.org, Vladimir Davydov
Subject: Regression from 5.7.17 to 5.9.9 with memory.low cgroup constraints
Message-ID: <20201125123956.61d9e16a@hemera>

Hello,

On a production system I've encountered rather harsh behavior from the kernel in the context of memory cgroups (v2) after updating the kernel from the 5.7 series to the 5.9 series.

It seems like the kernel is reclaiming the file cache but leaving the inode cache (reclaimable slabs) alone, to the point that the server ends up thrashing and maxing out IO on one of its disks instead of doing actual work.
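(For reference, the file cache vs. reclaimable slab split can be watched per cgroup via memory.stat. Below is a minimal sketch in Python, assuming cgroup2 is mounted at /sys/fs/cgroup and 5.9-era stat field names; the cgroup names are just examples taken from the hierarchy described below.)

#!/usr/bin/env python3
# Minimal sketch: show the "file" vs "slab_reclaimable" split from memory.stat
# for a few cgroups (cgroup2 assumed mounted at /sys/fs/cgroup; field names
# as found on 5.9-era kernels).
from pathlib import Path

CGROUPS = ["system", "system/backup", "websrv", "website", "remote"]  # example names

def memory_stat(cgroup: str) -> dict:
    """Parse one cgroup's memory.stat into a {counter: bytes} dict."""
    text = (Path("/sys/fs/cgroup") / cgroup / "memory.stat").read_text()
    return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

for cg in CGROUPS:
    s = memory_stat(cg)
    print(f"{cg:16s} file={s.get('file', 0) >> 20}M"
          f" slab_reclaimable={s.get('slab_reclaimable', 0) >> 20}M")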
My setup (the server has 64G of RAM):

root
+ system   { min=0, low=128M, high=8G, max=8G }
  + base   { no specific constraints }
  + backup { min=0, low=32M, high=2G, max=2G }
  + shell  { no specific constraints }
+ websrv   { min=0, low=4G, high=32G, max=32G }
+ website  { min=0, low=16G, high=40T, max=40T }
  + website1 { min=0, low=64M, high=2G, max=2G }
  + website2 { min=0, low=64M, high=2G, max=2G }
  ...
+ remote   { min=0, low=1G, high=14G, max=14G }
  + webuser1 { min=0, low=64M, high=2G, max=2G }
  + webuser2 { min=0, low=64M, high=2G, max=2G }
  ...

When the server was struggling, the IO went mostly to the disk hosting the system processes and some cache files of the websrv processes. It seems that a running backup makes the issue much more probable.

The processes in websrv are the most impacted by the thrashing, and this is the cgroup with lots of disk cache and inode cache assigned to it. (Note: a helper running in the websrv cgroup scans the whole file system hierarchy once per hour, which keeps the inode cache pretty full.)

Dropping just the file cache (about 10G) did not unlock the situation, but dropping the reclaimable slabs (inode cache, about 30G) got the system running again.
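(For completeness, the cache drops mentioned above were done via the usual vm.drop_caches knob. A minimal sketch, assuming root privileges; per Documentation/admin-guide/sysctl/vm.rst, writing 1 drops the page cache, 2 the reclaimable slab objects (dentries and inodes), 3 both.)

#!/usr/bin/env python3
# Minimal sketch of the cache drops described above (needs root).
# vm.drop_caches: 1 = page cache, 2 = reclaimable slab (dentries/inodes), 3 = both.
import os
import sys

def drop_caches(what: int) -> None:
    os.sync()  # write back dirty pages first so more of the cache is droppable
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write(str(what))

if __name__ == "__main__":
    drop_caches(int(sys.argv[1]) if len(sys.argv) > 1 else 1)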
Some metrics I collected during a thrashing period (at roughly 5-minute intervals); unfortunately I don't have full memory.stat. Each line shows the value at the first snapshot, a comparison sign (=, < or >), and the value at the second snapshot; for memory.pressure the "->" marks the second snapshot:

system/memory.min           0 = 0
system/memory.low           134217728 = 134217728
system/memory.high          8589934592 = 8589934592
system/memory.max           8589934592 = 8589934592
system/memory.pressure      some avg10=54.41 avg60=59.28 avg300=69.46 total=7347640237
                            full avg10=27.45 avg60=22.19 avg300=29.28 total=3287847481
                         -> some avg10=77.25 avg60=73.24 avg300=69.63 total=7619662740
                            full avg10=23.04 avg60=25.26 avg300=27.97 total=3401421903
system/memory.current       262533120 < 263929856
system/memory.events.local  low 5399469 = 5399469  high 0 = 0  max 112303 = 112303  oom 0 = 0  oom_kill 0 = 0

system/base/memory.min           0 = 0
system/base/memory.low           0 = 0
system/base/memory.high          max = max
system/base/memory.max           max = max
system/base/memory.pressure      some avg10=18.89 avg60=20.34 avg300=24.95 total=5156816349
                                 full avg10=10.90 avg60=8.50 avg300=11.68 total=2253916169
                              -> some avg10=33.82 avg60=32.26 avg300=26.95 total=5258381824
                                 full avg10=12.51 avg60=13.01 avg300=12.05 total=2301375471
system/base/memory.current       31363072 < 32243712
system/base/memory.events.local  low 0 = 0  high 0 = 0  max 0 = 0  oom 0 = 0  oom_kill 0 = 0

system/backup/memory.min           0 = 0
system/backup/memory.low           33554432 = 33554432
system/backup/memory.high          2147483648 = 2147483648
system/backup/memory.max           2147483648 = 2147483648
system/backup/memory.pressure      some avg10=41.73 avg60=45.97 avg300=56.27 total=3385780085
                                   full avg10=21.78 avg60=18.15 avg300=25.35 total=1571263731
                                -> some avg10=60.27 avg60=55.44 avg300=54.37 total=3599850643
                                   full avg10=19.52 avg60=20.91 avg300=23.58 total=1667430954
system/backup/memory.current       222130176 < 222543872
system/backup/memory.events.local  low 5446 = 5446  high 0 = 0  max 0 = 0  oom 0 = 0  oom_kill 0 = 0

system/shell/memory.min           0 = 0
system/shell/memory.low           0 = 0
system/shell/memory.high          max = max
system/shell/memory.max           max = max
system/shell/memory.pressure      some avg10=0.00 avg60=0.12 avg300=0.25 total=1348427661
                                  full avg10=0.00 avg60=0.04 avg300=0.06 total=493582108
                               -> some avg10=0.00 avg60=0.00 avg300=0.06 total=1348516773
                                  full avg10=0.00 avg60=0.00 avg300=0.00 total=493591500
system/shell/memory.current       8814592 < 8888320
system/shell/memory.events.local  low 0 = 0  high 0 = 0  max 0 = 0  oom 0 = 0  oom_kill 0 = 0

website/memory.min           0 = 0
website/memory.low           17179869184 = 17179869184
website/memory.high          45131717672960 = 45131717672960
website/memory.max           45131717672960 = 45131717672960
website/memory.pressure      some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
                             full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
                          -> some avg10=0.00 avg60=0.00 avg300=0.00 total=415009408
                             full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current       11811520512 > 11456942080
website/memory.events.local  low 11372142 < 11377350  high 0 = 0  max 0 = 0  oom 0 = 0  oom_kill 0 = 0

remote/memory.min           0
remote/memory.low           1073741824
remote/memory.high          15032385536
remote/memory.max           15032385536
remote/memory.pressure      some avg10=0.00 avg60=0.25 avg300=0.50 total=2017364408
                            full avg10=0.00 avg60=0.00 avg300=0.01 total=738071296
                         ->
remote/memory.current       84439040 > 81797120
remote/memory.events.local  low 11372142 < 11377350  high 0 = 0  max 0 = 0  oom 0 = 0  oom_kill 0 = 0

websrv/memory.min           0 = 0
websrv/memory.low           4294967296 = 4294967296
websrv/memory.high          34359738368 = 34359738368
websrv/memory.max           34426847232 = 34426847232
websrv/memory.pressure      some avg10=40.38 avg60=62.58 avg300=68.83 total=7760096704
                            full avg10=7.80 avg60=10.78 avg300=12.64 total=2254679370
                         -> some avg10=89.97 avg60=83.78 avg300=72.99 total=8040513640
                            full avg10=11.46 avg60=11.49 avg300=11.47 total=2300116237
websrv/memory.current       18421673984 < 18421936128
websrv/memory.events.local  low 0 = 0  high 0 = 0  max 0 = 0  oom 0 = 0  oom_kill 0 = 0

Is there something important I'm missing in my setup that could prevent things from starving? Did the meaning of memory.low change between 5.7 and 5.9?

From the behavior it feels as if inodes are not accounted to the cgroups at all, and as if the kernel pushes cgroups down to their memory.low by reclaiming file cache whenever there is not enough free memory to hold all the promises (and not only when a cgroup tries to use up to its promised amount of memory): the system was thrashing just as much with the 10G of file cache dropped (i.e. completely unused memory) as with it in use.

I will try to create a test-case to reproduce this on a test machine so I can verify a fix or eventually bisect to the triggering patch (a rough sketch of what I have in mind follows below), but if this all rings a bell, please tell! Note that until I have a test-case I'm reluctant to just wait [on the production system] for the next occurrence (usually at impractical times) to gather more metrics.

Regards,
Bruno
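P.S.: A rough sketch of the kind of test-case mentioned above (the paths, sizes and cgroup names below are illustrative assumptions, not a confirmed reproducer): give one cgroup a large memory.low, fill it with page cache and inode/dentry cache, then generate reclaim pressure from a capped sibling and compare how the protected cgroup's file cache, slab_reclaimable and memory.events.local "low" count evolve between kernel versions.

#!/usr/bin/env python3
# Rough test-case sketch (hypothetical names/sizes). Assumes cgroup2 is mounted
# at /sys/fs/cgroup and the memory controller is enabled for child cgroups.
import os
from pathlib import Path

CGROOT    = Path("/sys/fs/cgroup")
PROTECTED = CGROOT / "lowtest-protected"   # plays the role of "website"/"websrv"
PRESSURE  = CGROOT / "lowtest-pressure"    # the antagonist generating reclaim

def setup() -> None:
    for cg in (PROTECTED, PRESSURE):
        cg.mkdir(exist_ok=True)
    (PROTECTED / "memory.low").write_text(str(16 << 30))  # 16G protection
    (PRESSURE / "memory.max").write_text(str(8 << 30))    # cap the antagonist

def populate(tree: str = "/usr", big_file: str = "/var/tmp/lowtest.dat") -> None:
    # Run from inside PROTECTED: stat() lots of files to fill the dentry/inode
    # cache, then stream a large pre-created file to fill the page cache.
    (PROTECTED / "cgroup.procs").write_text(str(os.getpid()))
    for root, _dirs, files in os.walk(tree):
        for name in files:
            try:
                os.stat(os.path.join(root, name))
            except OSError:
                pass
    with open(big_file, "rb") as f:
        while f.read(1 << 20):
            pass

def report() -> None:
    for cg in (PROTECTED, PRESSURE):
        stat = dict(line.split() for line in (cg / "memory.stat").read_text().splitlines())
        print(cg.name, "file =", stat.get("file"),
              "slab_reclaimable =", stat.get("slab_reclaimable"))
        print(cg.name, "events.local:", (cg / "memory.events.local").read_text().strip())

if __name__ == "__main__":
    setup()
    populate()
    # ... run a large read/allocation workload inside PRESSURE here ...
    report()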