Date: Fri, 10 Jul 2020 12:47:55 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
    Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM
Subject: Re: Memcg stat for available memory
References: <20200703081538.GO18446@dhcp22.suse.cz>

On Tue, 7 Jul 2020, David Rientjes wrote:

> Another use case would be motivated by exactly the MemAvailable use case:
> when bound to a memcg hierarchy, how much memory is available without
> substantial swap or risk of oom for starting a new process or service?
> This would not trigger any memory.low or PSI notification but is a
> heuristic that can be used to determine what can and cannot be started
> without incurring substantial memory reclaim.
>
> I'm indifferent to whether this would be a "reclaimable" or "available"
> metric, with a slight preference toward making it as similar in
> calculation to MemAvailable as possible, so I think the question is
> whether this is something the user should be deriving themselves based on
> memcg stats that are exported or whether we should solidify this based on
> how the kernel handles reclaim as a metric that will carry over across
> kernel versions?
>

To try to get more discussion on the subject, consider a malloc
implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
to the system and how this freed memory is then described to userspace
depending on the kernel implementation.

 [ For the sake of this discussion, consider we have precise memcg stats
   available to us although the actual implementation allows for some
   variance (MEMCG_CHARGE_BATCH). ]

With a 64MB heap backed by thp on x86, for example, the vma starts with an
rss of 64MB, all of which is anon and backed by hugepages. Imagine some
aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB.

Before freeing, anon, anon_thp, and active_anon in memory.stat would all be
the same for this vma (64MB). 64MB would also be charged to
memory.current. That's all working as intended and to the expectation of
userspace.

After freeing, however, we have the kernel implementation specific detail
of how huge pmd splitting is handled (rss) in comparison to the underlying
split of the compound page (deferred split queue). The huge pmd is always
split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
for this vma and none of it is backed by thp.

What is charged to the memcg (memory.current) and what is on active_anon is
unchanged, however, because the underlying compound pages are still charged
to the memcg. The amount of anon and anon_thp are decreased in compliance
with the splitting of the page tables, however.

So after freeing, for this vma: anon = 128KB, anon_thp = 0,
active_anon = 64MB, memory.current = 64MB.

In this case, because of the deferred split queue, which is a kernel
implementation detail, userspace may be unclear on what is actually
reclaimable -- and this memory is reclaimable under memory pressure.
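
To make the arithmetic in this example concrete, here is a minimal sketch
that just reproduces the numbers above. It models nothing in the kernel;
the function name and dictionary layout are made up for illustration, and
precise stats are assumed (no MEMCG_CHARGE_BATCH variance).

```python
KB = 1024
MB = 1024 * KB

HEAP_SIZE = 64 * MB   # heap backed entirely by thp before freeing
HUGE_PAGE = 2 * MB    # x86 huge pmd size
BASE_PAGE = 4 * KB    # one base page left mapped per 2MB range


def stats_after_dontneed(heap_size=HEAP_SIZE, huge_page=HUGE_PAGE,
                         base_page=BASE_PAGE):
    """Hypothetical helper: stats for the vma after MADV_DONTNEED.

    Mirrors the example in the text: page tables are split right away,
    while the compound pages wait on the deferred split queue and so
    stay charged to the memcg.
    """
    ranges = heap_size // huge_page   # 32 aligned 2MB ranges
    rss = ranges * base_page          # one 4KB page per range
    return {
        "rss": rss,
        "anon": rss,                  # follows the page table split
        "anon_thp": 0,                # nothing is mapped by thp anymore
        "active_anon": heap_size,     # compound pages still on the lru
        "memory.current": heap_size,  # and still charged
    }


after = stats_after_dontneed()
# The gap between what is on the lru and what is mapped is the memory
# that is reclaimable but not obvious from any single stat.
hidden = after["active_anon"] - after["anon"]
print(f"rss = {after['rss'] // KB}KB, hidden reclaimable = {hidden // KB}KB")
```

The "hidden" value is the active_anon - anon difference that userspace
would have to know to compute on its own.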

For the motivation of MemAvailable (what amount of memory is available for
starting new work), userspace *could* determine this through the
aforementioned active_anon - anon (or some combination of
memory.current - anon - file - slab), but I think it's a fair point that
userspace's view of reclaimable memory as the kernel implementation changes
is something that can and should remain consistent between versions.

Otherwise, an earlier implementation before deferred split queues could
have safely assumed that active_anon was unreclaimable unless swap were
enabled. It doesn't have the foresight based on future kernel
implementation detail to reconcile what the amount of reclaimable memory
actually is.

Same discussion could happen for lazy free memory which is anon but now
appears on the file lru stats and not the anon lru stats: it's easily
reclaimable under memory pressure but you need to reconcile the difference
between the anon metric and what is revealed in the anon lru stats.

That gave way to my original thought of a si_mem_available()-like
calculation ("avail") by doing

	free = memory.high - memory.current
	lazyfree = file - (active_file + inactive_file)
	deferred = active_anon - anon

	avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2

And we have the ability to change this formula based on kernel
implementation details as they evolve. Idea is to provide a consistent
field that userspace can use to determine the rough amount of reclaimable
memory in a MemAvailable-like way.
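
As a rough userspace illustration of the "avail" calculation, the sketch
below transcribes the formula term by term. The function names are
hypothetical, the input is assumed to be "key value" lines as in
memory.stat, memory.high is assumed to be a byte count rather than "max",
and integer division stands in for the "/ 2". It is not an existing kernel
or cgroup interface.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines (memory.stat style) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def memcg_avail(memory_high, memory_current, stat):
    """Hypothetical helper: the 'avail' formula from the text, in bytes."""
    free = memory_high - memory_current
    file_lru = stat["active_file"] + stat["inactive_file"]
    lazyfree = stat["file"] - file_lru
    deferred = stat["active_anon"] - stat["anon"]
    # Only half of the file lru and reclaimable slab is counted.
    return free + lazyfree + deferred + (file_lru + stat["slab_reclaimable"]) // 2


# Stats for the 64MB thp heap example after MADV_DONTNEED, with no file
# or slab usage and an assumed memory.high of 256MB.
sample = """\
anon 131072
file 0
slab_reclaimable 0
active_anon 67108864
active_file 0
inactive_file 0
"""

stat = parse_memory_stat(sample)
avail = memcg_avail(256 * 1024 * 1024, 64 * 1024 * 1024, stat)
print(f"avail = {avail} bytes")
```

With these inputs the deferred term accounts for the whole difference
between memory.high - memory.current and the result, which is the memory
sitting on the deferred split queue.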