Date: Wed, 18 Dec 2019 17:14:52 -0500
From: Johannes Weiner
To: Vlastimil Babka
Cc: linux-mm, Michal Hocko, Mel Gorman, Vladimir Davydov, Rik van Riel,
 Roman Gushchin
Subject: Re: workingset transition detection corner case
Message-ID: <20191218221452.GA232409@cmpxchg.org>

Hi Vlastimil,

My apologies for the delay.

On Fri, Dec 13, 2019 at 04:38:38PM +0100, Vlastimil Babka wrote:
> Hi Johannes,
>
> we have been debugging an issue reported against our 4.12-based kernel,
> where a DB-based workload would start thrashing badly at some point,
> making the system unusable. This didn't happen when replacing the
> kernel with an older 4.4-based one (and keeping everything else the
> same).
>
> Unfortunately we don't have the reproducer in-house, and the conditions
> might also be configuration-specific (rootfs is on NFS), but we
> provided vmstat monitoring instructions and later tracing. From the
> data we got, we found that the workload at some point fills almost the
> whole memory with anonymous pages (namely shmem), pushing almost the
> whole page cache out and filling part of the swap. The 4.4-based kernel
> then recovers quickly without excessive anon swapping, which suggests
> the shmem pages stop being frequently accessed. However, the 4.12-based
> kernel is unable to recover and grow the page cache back (both active
> and inactive) and keeps thrashing on it.
>
> We have considered the large upstream changes between 4.4 and 4.12,
> which include memcg awareness (but there's a single memcg and disabling
> memcg makes no difference) and node-based reclaim (there's no
> disproportionately sized zone).
> Then we suspected the 4.12 commit 2a2e48854d70 ("mm: vmscan: fix
> IO/refault regression in cache workingset transition") and how it
> affects inactive_list_is_low() when called from shrink_list(). The
> theory was that we decide to shrink the file active list too much (by
> setting inactive_ratio=0) due to refault detection, which in turn means
> we shrink file pages too much. This was confirmed by removing the
> inactive_ratio=0 part, after which the 4.12-based kernel stopped
> thrashing with the workload.
>
> Then we investigated what leads to the main condition of the logic,
> "lruvec->refaults != refaults", by adding some more tracing to
> inactive_list_is_low() and snapshot_refaults(). We suspected bad
> interactions due to multiple direct reclaimers, but what I mostly see
> is the following pattern of kswapd activity:
>
> - kswapd finishes balancing and makes a snapshot of lruvec->refaults.
> - After a while (can be up to a few seconds) kswapd is woken up again;
>   the number of refaults has meanwhile changed by some relatively small
>   number (tens or hundreds) since the snapshot, so the condition
>   "lruvec->refaults != refaults" becomes true.
> - inactive_list_is_low() keeps being called as part of kswapd
>   operation, and the condition is always true because the snapshot
>   didn't change. During that time, the refaults counter is either
>   unchanged or changes only by a few refaults. Thus, the whole kswapd
>   activity on the file lru is focused on the active lru.
>
> Since the intention of commit 2a2e48854d70 is to detect workingset
> transitions, it seems to me it's not working well in this case, as
> there's no such transition - the workload just cannot keep its page
> cache working set in memory, because it's excessively reclaimed instead
> of anonymous memory.
> The '!=' condition is perhaps too coarse and static, and doesn't
> reflect how many refaults there were or whether refaults keep happening
> during kswapd operation - a single refault between two kswapd runs can
> affect the whole second run. I wonder if there shouldn't be at least
> some kind of decay - when the condition triggers, update the snapshot
> to a value between the old snapshot and the current value, so that if
> refaults do not keep occurring, after some number of calls the
> condition will stop being true? What do you think?

Thanks for the detailed report.

I think the problem here is that we entangle two separate things: on one
hand, whether to protect active cache from refaulting cache; on the
other, whether to protect anonymous memory from cache. We should be able
to open up the page cache to transition without automatically reducing
pressure on anonymous memory. If we did that, we wouldn't have to worry
about how many refaults are actually occurring - it should always be
safe to open up the active set for re-testing.

[ We *could* be more graceful and, instead of dissolving the active
  protection entirely, simply restrict its size to a balance we have
  targeted historically, e.g. 50:50. But in the interest of keeping
  magic numbers out of the code, I would not lead with that. ]

> I should also mention that we don't have the relatively recent commit
> 2c012a4ad1a2 ("mm: vmscan: scan anonymous pages on file refaults") in
> the 4.12-based kernel. It could in theory make the problem also go
> away, as the "excessively true" condition would now also be considered
> when inactive_list_is_low() is called from get_scan_count() (in v5.4; I
> know there were big reorganizations in the last merge window), and
> perhaps change some SCAN_FILE outcomes to SCAN_FRACT. But I think it
> would be better to do something with the root cause first.

That patch should address the issue you are seeing in the interim.
My longer-term goal is still to implement pressure-based balancing
between the LRU types. A lot of prep work on the cgroup side was
necessary to make that patch set really work for cgrouped reclaim - the
new cgroup stat infrastructure, the recursive inactive:active balancing,
etc. I'm hoping to dust off those patches early next year.

Those patches separate anon/cache balancing from active/inactive
balancing, which I think will universally make better decisions.

Does that sound reasonable?