From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 62341C43381 for ; Tue, 19 Feb 2019 23:26:41 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 1C1062147A for ; Tue, 19 Feb 2019 23:26:41 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1C1062147A Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A3F318E0003; Tue, 19 Feb 2019 18:26:40 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9EE5A8E0002; Tue, 19 Feb 2019 18:26:40 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8DD0D8E0003; Tue, 19 Feb 2019 18:26:40 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from mail-qk1-f200.google.com (mail-qk1-f200.google.com [209.85.222.200]) by kanga.kvack.org (Postfix) with ESMTP id 656DB8E0002 for ; Tue, 19 Feb 2019 18:26:40 -0500 (EST) Received: by mail-qk1-f200.google.com with SMTP id f70so1046653qke.8 for ; Tue, 19 Feb 2019 15:26:40 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:date:from:to :cc:subject:message-id:references:mime-version:content-disposition :in-reply-to:user-agent; bh=meFASeVm8AgDUCXNN7Y8QzAbvDPsmwGcXRRsijRK3Ro=; b=cyCdWyo7c0su9qv474sJcK4HzKgY0SWXHOx+Uq1tYaRUeHO8+eLkamXxwhcCd+O3wT sLv0AM4+ZARcpukwcztiSCIeny1ZRFy74aFwoxwp0NktxtwtxqWLX9qpDkBhThCqLDBG a+D87YVJOj9SNADvY3H3N+m//iDw7FUQYK7u8SQkFZH0ZESyI8Tgp+7wzj2LRReuPilm 4soHkFyUApN9rrJjc0JIFobZtJwKlEkztnzvE/QpWWEMOsh5hmbrhtQHbnmH1salV7v7 VqGUoZQJzgAwkXgeCRalXBD0JHUw9fddTw4tdW4QqSQ0Lhf+t8hBaWtzRWTfUmLha9G5 chSw== X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of dchinner@redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=dchinner@redhat.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com X-Gm-Message-State: AHQUAuZhr8rXTqW0q1Ek2u8WwWPZjHCurHPLEpVOHnc89v6S8DhmFYXs u5I3M8hUSD59wJfCVbR1rWq2qLSEBaFEFveLT9U22R08tV+Xa5YobSFEgJxQLRbwZjWlPg2rpEW R91zgGxStX6z/7XUnZv625ToEkDtrh+Wcd7Nl/ADaAOZjPywY5/aYWkq0PKhO3Ga8sg== X-Received: by 2002:a0c:a326:: with SMTP id u35mr23918580qvu.190.1550618800124; Tue, 19 Feb 2019 15:26:40 -0800 (PST) X-Google-Smtp-Source: AHgI3Ib5jRo6hau3VluRxCb4BZax3HgE2zhvku9+SGnJ/g9cG3K99l30VUhW0SmAoq6zZT7TwxIF X-Received: by 2002:a0c:a326:: with SMTP id u35mr23918541qvu.190.1550618799066; Tue, 19 Feb 2019 15:26:39 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1550618799; cv=none; d=google.com; s=arc-20160816; b=UndKmQ2hzJJ3KPrtvXq8g+zNCVRvyTQ/OyVOAUcEbROnp5ySDLyZKpVp18sPR084UL a4bDRB0SzK+p1YHTCq3EC+x80nLTjlHTP4/F6WHrl6Sr4aQwI86qXyuc3yO2F701rQ/6 vN3W6YvHZwL3KGlas8fjoOckljDi/cKnbMs2P77CrxB+FhZ1FmhcBFpUbnFa7GOFmrC+ F8CXkf2tgB8NSKSl/R6fbFZgqtpnqunlOPihNKf0+L5gmH2AJ8bQkfl2Hdd4I8wC0bn+ TPd54epEF/y91y+GydlnIi+zN6AVrKLhfS5iqchjEH/5JkaQX7/dMWRcxdDsVrIjWhAT yFBA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=user-agent:in-reply-to:content-disposition:mime-version:references :message-id:subject:cc:to:from:date; bh=meFASeVm8AgDUCXNN7Y8QzAbvDPsmwGcXRRsijRK3Ro=; b=qoFPeVzwMWTvcIhKR+0oibtTFTsFjfg5FJimAc5r4+bRE/TLKRo4KdocT1mOU+sNUZ veAAN19wCgZ0PW0uGbGCQk6X/00Hy/X3IVyYM7KX9bCDCniFyndah+u6w559e32/Z/t5 Jrv9nRxjOFi+/Ko+UFkPnLc3vZCpW/ttIHTfXnIh7e/rZrDvfwALxy45ellRBPLmfI4P Ah81WVL+jrIPE7K+Ye0y5u25OvIWtFDWBs+KM3e/tTBVKyMQ3OAijUPC1E2OOtYc1wBl AuG01tCOEmSOT5ElNSO9RYhKWTewBUUTL8+BIG/M8oezY9ndhGTOboDXGnuvp7mKnbPc dAwA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of dchinner@redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=dchinner@redhat.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id o34si148209qva.1.2019.02.19.15.26.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 19 Feb 2019 15:26:39 -0800 (PST) Received-SPF: pass (google.com: domain of dchinner@redhat.com designates 209.132.183.28 as permitted sender) client-ip=209.132.183.28; Authentication-Results: mx.google.com; spf=pass (google.com: domain of dchinner@redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=dchinner@redhat.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id AEA9413A4C; Tue, 19 Feb 2019 23:26:37 +0000 (UTC) Received: from rh (ovpn-116-82.phx2.redhat.com [10.3.116.82]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 2656D66082; Tue, 19 Feb 2019 23:26:36 +0000 (UTC) Received: from [::1] (helo=rh) by rh with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.90_1) (envelope-from ) id 1gwEmU-0005Zu-A5; Wed, 20 Feb 2019 10:26:30 +1100 Date: Wed, 20 Feb 2019 10:26:27 +1100 From: Dave Chinner To: Rik van Riel Cc: Roman Gushchin , "lsf-pc@lists.linux-foundation.org" , "linux-mm@kvack.org" , "mhocko@kernel.org" , "guroan@gmail.com" , Kernel Team , "hannes@cmpxchg.org" Subject: Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues Message-ID: <20190219232627.GZ31397@rh> References: <20190219003140.GA5660@castle.DHCP.thefacebook.com> <20190219020448.GY31397@rh> <7f66dd5242ab4d305f43d85de1a8e514fc47c492.camel@surriel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <7f66dd5242ab4d305f43d85de1a8e514fc47c492.camel@surriel.com> User-Agent: Mutt/1.9.1 (2017-09-22) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.29]); Tue, 19 Feb 2019 23:26:38 +0000 (UTC) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote: > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote: > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote: > > > Sorry, resending with the fixed to/cc list. Please, ignore the > > > first letter. > > > > Please resend again with linux-fsdevel on the cc list, because this > > isn't a MM topic given the regressions from the shrinker patches > > have all been on the filesystem side of the shrinkers.... > > It looks like there are two separate things going on here. > > The first are an MM issues, one of potentially leaking memory > by not scanning slabs with few items on them, We don't leak memory. Slabs with very few freeable items on them just don't get scanned when there is only light memory pressure. That's /by design/ and it is behaviour we've tried hard over many years to preserve. Once memory pressure ramps up, they'll be scanned just like all the other slabs. e.g. commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if tight on memory") makes this commentary: [....] That said, this patch shouldn't change the vmscan behaviour if the memory pressure is low, but if we are tight on memory, we will do our best by trying to reclaim all available objects, which sounds reasonable. Which is essentially how we've tried to implement shrinker reclaim for a long, long time (bugs notwithstanding). > and having > such slabs stay around forever after the cgroup they were > created for has disappeared, That's a cgroup referencing and teardown problem, not a memory reclaim algorithm problem. To treat it as a memory reclaim problem smears memcg internal implementation bogosities all over the independent reclaim infrastructure. It violates the concepts of isolation, modularity, independence, abstraction layering, etc. > and the other of various other > bugs with shrinker invocation behavior (like the nr_deferred > fixes you posted a patch for). I believe these are MM topics. Except they interact directly with external shrinker behaviour. the conditions of deferral and the problems it is solving are a direct response to shrinker implementation constraints (e.g. GFP_NOFS deadlock avoidance for filesystems). i.e. we can't talk about the deferal algorithm without considering why work is deferred, how much work should be deferred, when it may be safe/best to execute the deferred work, etc. This all comes back to the fact that modifying the shrinker algorithms requires understanding what the shrinker implementations do and the constraints they operate under. It is not a "purely mm" discussion, and treating it as such results regressions like the ones we've recently seen. > The second is the filesystem (and maybe other) shrinker > functions' behavior being somewhat fragile and depending > on closely on current MM behavior, potentially up to > and including MM bugs. > > The lack of a contract between the MM and the shrinker > callbacks is a recurring issue, and something we may > want to discuss in a joint session. > > Some reflections on the shrinker/MM interaction: > - Since all memory (in a zone) could potentially be in > shrinker pools, shrinkers MUST eventually free some > memory. Which they cannot guarantee because all the objects they track may be in use. As such, shrinkers have never been asked to guarantee that they can free memory - they've only ever been asked to scan a number of objects and attempt to free those it can during the scan. > - Shrinkers should not block kswapd from making progress. > If kswapd got stuck in NFS inode writeback, and ended up > not being able to free clean pages to receive network > packets, that might cause a deadlock. Same can happen if kswapd got stuck on dirty page writeback from pageout(). i.e. pageout() can only run from kswapd and it issues IO, which can then block in the IO submission path waiting for IO to make progress, which may require substantial amounts of memory allocation. Yes, we can try to not block kswapd as much as possible just like page reclaim does, but the fact is kswapd is the only context where it is safe to do certain blocking operations to ensure memory reclaim can actually make progress. i.e. the rules for blocking kswapd need to be consistent across both page reclaim and shrinker reclaim, and right now page reclaim can and does block kswapd when it is necessary for forwards progress.... > - The MM should be able to deal with shrinkers doing > nothing at this call, but having some work pending > (eg. waiting on IO completion), without getting a false > OOM kill. How can we do this best? By integrating shrinkers into the same feedback loops as page reclaim. i.e. to allow individual shrinker instance state to be visible to the backoff/congestion decisions that the main page reclaim loops make. i.e. the problem here is that shrinkers only feedback to the main loop is "how many pages were freed" as a whole. They aren't seen as individual reclaim instances like zones for apge reclaim, they are just a huge amorphous blob that "frees some pages". i.e. They sit off to the side and run their own game between main loop scans and have no capability to run individual backoffs, schedule kswapd to do future work, don't have watermarks to provide reclaim goals, can't communicate progress to the main control algorithm, etc. IOWs, the first step we need to take here is to get rid of the shrink_slab() abstraction and make shrinkers a first class reclaim citizen.... > - Related to the above: stalling in the shrinker code is > unpredictable, and can take an arbitrarily long amount > of time. Is there a better way we can make reclaimers > wait for in-flight work to be completed? Look at it this way: what do you need to do to implement the main zone reclaim loops as individual shrinker instances? Complex shrinker implementations have to deal with all the same issues as the page reclaim loops (including managing cross-cache dependencies and balancing). If we can't answer this question, then we can't answer the questions that are being asked. So, at this point, I have to ask: if we need the same functionality for both page reclaim and shrinkers, then why shouldn't the goal be to make page reclaim just another set of opaque shrinker implementations? Cheers, Dave. -- Dave Chinner dchinner@redhat.com