MIME-Version: 1.0
References: <20200703081538.GO18446@dhcp22.suse.cz>
From: Yang Shi
Date: Fri, 10 Jul 2020 14:04:57 -0700
Subject: Re: Memcg stat for available memory
To: David Rientjes
Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
 Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On Fri, Jul 10, 2020 at 12:49 PM David Rientjes wrote:
>
> On Tue, 7 Jul 2020, David Rientjes wrote:
>
> > Another use case would be motivated by exactly the MemAvailable use case:
> > when bound to a memcg hierarchy, how much memory is available without
> > substantial swap or risk of oom for starting a new process or service?
> > This would not trigger any memory.low or PSI notification but is a
> > heuristic that can be used to determine what can and cannot be started
> > without incurring substantial memory reclaim.
> >
> > I'm indifferent to whether this would be a "reclaimable" or "available"
> > metric, with a slight preference toward making it as similar in
> > calculation to MemAvailable as possible, so I think the question is
> > whether this is something the user should be deriving themselves based on
> > memcg stats that are exported or whether we should solidify this based on
> > how the kernel handles reclaim as a metric that will carry over across
> > kernel versions?
> >
>
> To try to get more discussion on the subject, consider a malloc
> implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> to the system and how this freed memory is then described to userspace
> depending on the kernel implementation.
>
> [ For the sake of this discussion, consider we have precise memcg stats
>   available to us although the actual implementation allows for some
>   variance (MEMCG_CHARGE_BATCH). ]
>
> With a 64MB heap backed by thp on x86, for example, the vma starts with an
> rss of 64MB, all of which is anon and backed by hugepages. Imagine some
> aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB.
>
> Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> be the same for this vma (64MB). 64MB would also be charged to
> memory.current. That's all working as intended and to the expectation of
> userspace.
>
> After freeing, however, we have the kernel implementation specific detail
> of how huge pmd splitting is handled (rss) in comparison to the underlying
> split of the compound page (deferred split queue). The huge pmd is always
> split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> for this vma and none of it is backed by thp.
>
> What is charged to the memcg (memory.current) and what is on active_anon
> is unchanged, however, because the underlying compound pages are still
> charged to the memcg. The amount of anon and anon_thp are decreased
> in compliance with the splitting of the page tables, however.
>
> So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> active_anon = 64MB, memory.current = 64MB.
>
> In this case, because of the deferred split queue, which is a kernel
> implementation detail, userspace may be unclear on what is actually
> reclaimable -- and this memory is reclaimable under memory pressure.
> For the motivation of MemAvailable (what amount of memory is available
> for starting new work), userspace *could* determine this through the
> aforementioned active_anon - anon (or some combination of
> memory.current - anon - file - slab), but I think it's a fair point that
> userspace's view of reclaimable memory as the kernel implementation
> changes is something that can and should remain consistent between
> versions.
>
> Otherwise, an earlier implementation before deferred split queues could
> have safely assumed that active_anon was unreclaimable unless swap were
> enabled. It doesn't have the foresight based on future kernel
> implementation detail to reconcile what the amount of reclaimable memory
> actually is.
>
> Same discussion could happen for lazy free memory which is anon but now
> appears on the file lru stats and not the anon lru stats: it's easily
> reclaimable under memory pressure but you need to reconcile the difference
> between the anon metric and what is revealed in the anon lru stats.
>
> That gave way to my original thought of a si_mem_available()-like
> calculation ("avail") by doing
>
>     free = memory.high - memory.current

I'm wondering what happens if high or max is set to the max limit.
Wouldn't you end up seeing a super large memavail?

>     lazyfree = file - (active_file + inactive_file)

Isn't it (active_file + inactive_file) - file? It looks like MADV_FREE
just updates the inactive lru size.

>     deferred = active_anon - anon
>
>     avail = free + lazyfree + deferred +
>       (active_file + inactive_file + slab_reclaimable) / 2
>
> And we have the ability to change this formula based on kernel
> implementation details as they evolve. Idea is to provide a consistent
> field that userspace can use to determine the rough amount of reclaimable
> memory in a MemAvailable-like way.
>
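
To make the two questions above concrete, here is a rough userspace sketch
of the quoted "avail" calculation. It is only an illustration under stated
assumptions, not a proposed interface: the function names and the
fallback_limit parameter are hypothetical, the input is assumed to be the
"key value" text of memory.stat plus the contents of memory.current and
memory.high, and the lazyfree term is kept exactly as quoted even though
its sign is the thing being questioned. A memory.high of "max" is treated
as "no limit" and rejected unless the caller supplies some other ceiling,
which is one possible answer to the first question and not something the
quoted formula specifies.

```python
KB = 1024
MB = 1024 * KB


def parse_stat(text):
    """Parse 'key value' lines (the memory.stat layout) into a dict of ints."""
    stat = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stat[key] = int(value)
    return stat


def parse_limit(text):
    """Return a limit in bytes, or None when the file holds 'max' (no limit)."""
    text = text.strip()
    return None if text == "max" else int(text)


def memcg_avail(stat, current, high, fallback_limit=None):
    """Evaluate the quoted formula; every value is in bytes.

    fallback_limit is a hypothetical knob (not from the thread) used only
    when memory.high is unlimited, so that 'free' does not become huge.
    """
    limit = high if high is not None else fallback_limit
    if limit is None:
        raise ValueError("memory.high is 'max' and no fallback limit was given")

    free = limit - current
    # Kept as quoted; the reply above asks whether the sign should be flipped.
    lazyfree = stat["file"] - (stat["active_file"] + stat["inactive_file"])
    deferred = stat["active_anon"] - stat["anon"]
    # Integer division, since the counters are byte counts.
    cache = (stat["active_file"] + stat["inactive_file"]
             + stat["slab_reclaimable"]) // 2
    return free + lazyfree + deferred + cache


# The 64MB thp heap from the quoted example, after MADV_DONTNEED.
# The 128MB memory.high is an assumed value chosen for illustration.
example_stat = parse_stat(
    "anon %d\nactive_anon %d\nfile 0\nactive_file 0\n"
    "inactive_file 0\nslab_reclaimable 0\n" % (128 * KB, 64 * MB)
)

if __name__ == "__main__":
    print(memcg_avail(example_stat, current=64 * MB, high=128 * MB))
```

With those inputs, free is 64MB and deferred is 64MB - 128KB, so the
deferred split queue memory shows up as available even though it is still
charged to memory.current.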