From: David Rientjes <rientjes@google.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Roman Gushchin <guro@fb.com>, Greg Thelen <gthelen@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Cgroups <cgroups@vger.kernel.org>, Linux MM <linux-mm@kvack.org>
Subject: Re: Memcg stat for available memory
Date: Fri, 10 Jul 2020 12:47:55 -0700 (PDT)
Message-ID: <alpine.DEB.2.23.453.2007101223470.1178541@chino.kir.corp.google.com>
In-Reply-To: <alpine.DEB.2.23.453.2007071210410.396729@chino.kir.corp.google.com>
On Tue, 7 Jul 2020, David Rientjes wrote:
> Another use case would be motivated by exactly the MemAvailable use case:
> when bound to a memcg hierarchy, how much memory is available without
> substantial swap or risk of oom for starting a new process or service?
> This would not trigger any memory.low or PSI notification but is a
> heuristic that can be used to determine what can and cannot be started
> without incurring substantial memory reclaim.
>
> I'm indifferent to whether this would be a "reclaimable" or "available"
> metric, with a slight preference toward making it as similar in
> calculation to MemAvailable as possible, so I think the question is
> whether this is something the user should be deriving themselves based on
> memcg stats that are exported or whether we should solidify this based on
> how the kernel handles reclaim as a metric that will carry over across
> kernel versions?
>
To try to get more discussion on the subject, consider a malloc
implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
to the system and how this freed memory is then described to userspace
depending on the kernel implementation.
[ For the sake of this discussion, consider we have precise memcg stats
available to us although the actual implementation allows for some
variance (MEMCG_CHARGE_BATCH). ]
With a 64MB heap backed by thp on x86, for example, the vma starts with an
rss of 64MB, all of which is anon and backed by hugepages. Imagine some
aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB.
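The arithmetic in this example can be checked with a minimal sketch. The sizes come from the text above; the variable names are purely illustrative.

```python
# Worked arithmetic for the 64MB THP-backed heap example.
KB = 1024
MB = 1024 * KB

heap_size = 64 * MB        # heap size from the example
huge_page_size = 2 * MB    # huge page (pmd) size on x86
base_page_size = 4 * KB    # base page size on x86

# One 2MB-aligned range per huge page backing the heap.
ranges = heap_size // huge_page_size

# After the aggressive MADV_DONTNEED, a single base page stays mapped
# in each 2MB-aligned range.
rss_after = ranges * base_page_size

print(ranges, "ranges,", rss_after // KB, "KB rss")
```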
Before freeing, anon, anon_thp, and active_anon in memory.stat would all
be the same for this vma (64MB). 64MB would also be charged to
memory.current. That's all working as intended and to the expectation of
userspace.
After freeing, however, we have the kernel implementation specific detail
of how huge pmd splitting is handled (rss) in comparison to the underlying
split of the compound page (deferred split queue). The huge pmd is always
split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
for this vma and none of it is backed by thp.
What is charged to the memcg (memory.current) and what is on active_anon
is unchanged, however, because the underlying compound pages are still
charged to the memcg. The amounts of anon and anon_thp do decrease,
however, in line with the splitting of the page tables.
So after freeing, for this vma: anon = 128KB, anon_thp = 0,
active_anon = 64MB, memory.current = 64MB.
In this case, because of the deferred split queue, which is a kernel
implementation detail, userspace may be unclear on what is actually
reclaimable -- and this memory is reclaimable under memory pressure. For
the motivation of MemAvailable (what amount of memory is available for
starting new work), userspace *could* determine this through the
aforementioned active_anon - anon (or some combination of
memory.current - anon - file - slab), but I think it's a fair point that
userspace's view of reclaimable memory as the kernel implementation
changes is something that can and should remain consistent between
versions.
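As a rough illustration of the active_anon - anon derivation mentioned above, the following sketch parses text in the "key value" layout of a memcg memory.stat file and computes the difference. The helper names, the sample contents, and the clamp to zero are assumptions made for the example. The sample uses the post-free numbers from the 64MB heap case and assumes precise stats.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines (values in bytes) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def deferred_split_estimate(stats):
    """Memory still on the anon LRU but no longer counted as anon.

    Clamped at zero as a defensive assumption, since real stats are
    batched and may be slightly inconsistent with each other.
    """
    return max(0, stats["active_anon"] - stats["anon"])


# Hypothetical memory.stat contents matching the example after freeing.
SAMPLE = """\
anon 131072
anon_thp 0
active_anon 67108864
"""

stats = parse_memory_stat(SAMPLE)
deferred = deferred_split_estimate(stats)
print(deferred // 1024, "KB charged but no longer mapped")
```

On a live system the text would be read from the cgroup's memory.stat file instead of a string, and the result would only be approximate.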
Otherwise, userspace written against an earlier implementation, before
deferred split queues, could have safely assumed that active_anon was
unreclaimable unless swap were enabled. It cannot anticipate future kernel
implementation details, so it has no way to reconcile what the amount of
reclaimable memory actually is.
The same discussion could happen for lazy free memory, which is anon but
now appears in the file lru stats and not the anon lru stats: it's easily
reclaimable under memory pressure, but you need to reconcile the difference
between the anon metric and what is revealed in the anon lru stats.
That gave way to my original thought of a si_mem_available()-like
calculation ("avail") by doing
    free = memory.high - memory.current
    lazyfree = file - (active_file + inactive_file)
    deferred = active_anon - anon

    avail = free + lazyfree + deferred +
            (active_file + inactive_file + slab_reclaimable) / 2
We would then be able to change this formula as kernel implementation
details evolve. The idea is to provide a consistent field that userspace
can use to determine the rough amount of reclaimable memory in a
MemAvailable-like way.
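The "avail" calculation above can be transcribed into a short userspace sketch. This is only an illustration of the proposed formula, not an existing kernel or library interface. The function name, the 128MB memory.high value, and the sample stats are hypothetical, and a numeric memory.high is assumed (the "max" setting is not handled).

```python
def memcg_avail_estimate(high, current, stats):
    """Transcribe the proposed 'avail' heuristic; all values in bytes."""
    free = high - current
    # Terms are taken as written in the proposal and have not been
    # checked against the accounting of any particular kernel version.
    lazyfree = stats["file"] - (stats["active_file"] + stats["inactive_file"])
    deferred = stats["active_anon"] - stats["anon"]
    file_lru = stats["active_file"] + stats["inactive_file"]
    # Integer division keeps the result in whole bytes.
    return free + lazyfree + deferred + (file_lru + stats["slab_reclaimable"]) // 2


KB = 1024
MB = 1024 * KB

# Hypothetical memcg: the 64MB heap example after freeing, no file or slab.
sample_stats = {
    "anon": 128 * KB,
    "active_anon": 64 * MB,
    "file": 0,
    "active_file": 0,
    "inactive_file": 0,
    "slab_reclaimable": 0,
}

avail = memcg_avail_estimate(high=128 * MB, current=64 * MB, stats=sample_stats)
print(avail // KB, "KB estimated available")
```

With these inputs the estimate is the 64MB of headroom under memory.high plus the 64MB - 128KB that is still charged through the deferred split queue.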