From: John Hubbard <jhubbard@nvidia.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Zenghui Yu <yuzenghui@huawei.com>,
Matthew Wilcox <willy@infradead.org>,
David Hildenbrand <david@redhat.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>, Zi Yan <ziy@nvidia.com>,
Barry Song <21cnbao@gmail.com>,
Alistair Popple <apopple@nvidia.com>,
William Kucharski <william.kucharski@oracle.com>
Cc: linux-mm@kvack.org, Barry Song <v-songbaohua@oppo.com>
Subject: Re: [PATCH v2] tools/mm: Add thpmaps script to dump THP usage info
Date: Fri, 12 Jan 2024 11:14:52 -0800
Message-ID: <0f5b9444-fd79-49f0-b9d8-f5e04c044696@nvidia.com>
In-Reply-To: <22905bf7-570f-41a9-8dd0-b8a250c97de3@arm.com>
On 1/12/24 02:00, Ryan Roberts wrote:
>> ...
>> After spending a day or two exploring running systems with this, I'd
>> like to suggest:
>>
>> 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot
>> of information: mTHP is configured as expected, and is helping or not,
>> etc.
>
> There is a difference between how a THP is mapped (PTE vs PMD) and its size. A
> PMD-sized THP can still be mapped with PTEs. So I'd rather not completely filter
> out PMD-sized THPs, if that's your suggestion. But we could make a distinction
It's not...
> between THPs mapped by PTE and those mapped by PMD; the kernel interface doesn't
> directly give us this, but we can infer it from the AnonHugePages and *PmdMapped
> stats in smaps.
Yes, that would be excellent!
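As a rough illustration of that inference, here is a minimal sketch that sums the PMD-mapped fields out of smaps-formatted text. The function name and the sample text are made up for this example and are not taken from the thpmaps script; only the field names (AnonHugePages and the *PmdMapped family) come from the discussion above.

```python
import re

# Fields named in the thread: AnonHugePages plus the *PmdMapped family
# (for example ShmemPmdMapped and FilePmdMapped). Values are in kB.
PMD_FIELD = re.compile(r"^(AnonHugePages|\w*PmdMapped):\s+(\d+)\s+kB$",
                       re.MULTILINE)


def pmd_mapped_kb(smaps_text):
    """Sum each PMD-mapped field across all VMAs in smaps-formatted text.

    Hypothetical helper for illustration only.
    """
    totals = {}
    for name, value in PMD_FIELD.findall(smaps_text):
        totals[name] = totals.get(name, 0) + int(value)
    return totals


# Made-up sample covering two VMAs; real input would come from
# /proc/<pid>/smaps.
SAMPLE = """\
AnonHugePages:      2048 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
AnonHugePages:      4096 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:      2048 kB
"""

print(pmd_mapped_kb(SAMPLE))
```

Whatever is left in a PMD-sized counter after subtracting these totals would then be the pte-mapped portion.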
>
>>
>> 2) Not having to list out all the mTHP sizes would be nice. Instead,
>> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* ,
>> unless the user specifies sizes.
>
> This is exactly what the tool already does. Perhaps you haven't fully understood
> the counters that it outputs?
Oh yes, we are in perfect agreement about my not understanding these
counters. :) I'd even expound upon that a bit: despite having a fairly
good working understanding of the mTHP implementation in the kernel;
despite reading and re-reading the thpmaps documentation and peeking a
number of times at the thpmaps script; and despite poring over the
thpmaps output, I am still having a rough time with these counters.
Mainly because there is a set of hidden assumptions, many of which are
revealed below.
But it's actually just a few key points that were missing from the
documentation, plus the ability to clearly see the pte-mapped parts. And
your proposed changes below look great; I've got a few more to add and
that should finish the job.
>
> You *always* get the following counters (although note the tool *hides* all
Good. It was not clear that these counters were always active. The --cont
documentation misleads the reader a bit on that matter.
> counters whose value is 0 by default - show them with --inc-empty). This example
> is for a system with 4K base pages:
>
> # thpmaps --pid 1 --summary --inc-empty
>
> anon-thp-aligned-16kB:
> anon-thp-aligned-32kB:
> anon-thp-aligned-64kB:
> anon-thp-aligned-128kB:
> anon-thp-aligned-256kB:
> anon-thp-aligned-512kB:
> anon-thp-aligned-1024kB:
> anon-thp-aligned-2048kB:
> anon-thp-unaligned-16kB:
> anon-thp-unaligned-32kB:
> anon-thp-unaligned-64kB:
> anon-thp-unaligned-128kB:
> anon-thp-unaligned-256kB:
> anon-thp-unaligned-512kB:
> anon-thp-unaligned-1024kB:
> anon-thp-unaligned-2048kB:
> anon-thp-partial:
> file-thp-aligned-16kB:
> file-thp-aligned-32kB:
> file-thp-aligned-64kB:
> file-thp-aligned-128kB:
> file-thp-aligned-256kB:
> file-thp-aligned-512kB:
> file-thp-aligned-1024kB:
> file-thp-aligned-2048kB:
> file-thp-unaligned-16kB:
> file-thp-unaligned-32kB:
> file-thp-unaligned-64kB:
> file-thp-unaligned-128kB:
> file-thp-unaligned-256kB:
> file-thp-unaligned-512kB:
> file-thp-unaligned-1024kB:
> file-thp-unaligned-2048kB:
> file-thp-partial:
>
> So you have counters for every supported THP size in the system - they will be
> different for a 64K base page system.
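To make the shape of that list concrete, here is a small sketch that rebuilds the counter names from a base page size and a PMD order. Both helpers are hypothetical, and the smallest order of 2 is only inferred from the 16kB entry in the 4K listing above; the real tool may derive its sizes differently.

```python
def thp_sizes_kb(base_page_kb, pmd_order, min_order=2):
    """Return THP sizes in kB, from min_order up to and including PMD order.

    min_order=2 is an assumption inferred from the 4K-page listing, where
    the smallest counter is 16kB.
    """
    return [base_page_kb << order for order in range(min_order, pmd_order + 1)]


def counter_names(base_page_kb, pmd_order):
    """Build counter names in the same order as the listing above."""
    sizes = thp_sizes_kb(base_page_kb, pmd_order)
    names = []
    for kind in ("anon", "file"):
        for align in ("aligned", "unaligned"):
            names += [f"{kind}-thp-{align}-{size}kB" for size in sizes]
        names.append(f"{kind}-thp-partial")
    return names


# 4K base pages with a 2M PMD (order 9), as in the example above.
for name in counter_names(4, 9):
    print(name + ":")
```

Passing a different base page size and PMD order would produce a different set of names, which is the point being made about 64K systems.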
>
> anon vs file: hopefully obvious
>
> aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In
> the aligned cases it is mapped so that it is naturally aligned. So a 16K THP is
I think we should use "aligned" or "aligned to <size>", and stop saying
"naturally aligned", throughout. "Natural" adds no additional
information, and it makes the reader wonder if there is some other
aspect to the alignment (does natural imply PMD-mapped? etc) that they
are unaware of.
> mapped into VA space on a 16K boundary, a 32K THP on a 32K boundary, etc.
>
> partial: Parts of THPs that are partially mapped into VA space.
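The three categories can be sketched as a tiny classifier. This is a simplification under stated assumptions: the function name is invented here, the mapped range is taken to be a single contiguous run, and only virtual address arithmetic is considered.

```python
def classify(vaddr, mapped_bytes, thp_bytes):
    """Classify one THP mapping as 'aligned', 'unaligned' or 'partial'.

    Hypothetical helper: vaddr is where the mapping starts in VA space,
    mapped_bytes is how much of the THP is mapped (assumed contiguous),
    and thp_bytes is the THP size.
    """
    if mapped_bytes < thp_bytes:
        return "partial"
    return "aligned" if vaddr % thp_bytes == 0 else "unaligned"


KB = 1024
print(classify(64 * KB, 16 * KB, 16 * KB))  # fully mapped on a 16K boundary
print(classify(68 * KB, 16 * KB, 16 * KB))  # fully mapped, off the boundary
print(classify(64 * KB, 8 * KB, 16 * KB))   # only half of the THP is mapped
```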
>
> Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs.
> But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So
> only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and
> file-thp-aligned-2048kB. We can filter that out by subtracting the relevant
> smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or
That would work, though it is relatively awkward, but...
> I could rename all the existing counters to include "pte" and introduce 2 new
> counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB?
...this would be perfect, I think. The "pte" would help self-document, and
separating things out allows for a clearer view into the behavior.
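For what it's worth, the split could look roughly like the sketch below. Everything here is an assumption for illustration: counters are taken to be a dict of kB values, the "-pte-" placement in the renamed counter is a guess (only the "-pmd-" names were spelled out above), and the clamp is just a guard against the two data sources disagreeing.

```python
def split_pmd(counters, pmd_mapped_kb, kind="anon", pmd_size_kb=2048):
    """Split the PMD-sized aligned counter into pmd- and pte-mapped parts.

    Hypothetical helper. pmd_mapped_kb is the relevant smaps total
    (for example AnonHugePages for kind="anon").
    """
    total_key = f"{kind}-thp-aligned-{pmd_size_kb}kB"
    pmd_key = f"{kind}-thp-aligned-pmd-{pmd_size_kb}kB"
    # Assumed spelling for the renamed pte counter; not settled in the thread.
    pte_key = f"{kind}-thp-pte-aligned-{pmd_size_kb}kB"

    total = counters.get(total_key, 0)
    pmd = min(pmd_mapped_kb, total)  # clamp so the pte part never goes negative

    out = {k: v for k, v in counters.items() if k != total_key}
    out[pmd_key] = pmd
    out[pte_key] = total - pmd
    return out


before = {"anon-thp-aligned-2048kB": 8192, "anon-thp-partial": 512}
print(split_pmd(before, pmd_mapped_kb=6144))
```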
>
> The --cont option will add *additional* special counters, if specified. The idea
> here is to provide a view on what percentage of memory is getting
> contpte-mapped. So if you provide "--cont 64K" it will give you a counter
> showing how much memory is in 64K, naturally aligned blocks (actually 2
> counters; file and anon). Those blocks can come from fully mapped and aligned
> 64K THPs. But they can also come from bigger THPs - for example, if a 128K THP
> is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2
> 64K cont blocks, but it will be counted as unaligned in
> anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only its
> first 1M is mapped and aligned on a 64K boundary, then it will be counted in the
> *-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter.
>
Interesting, and completely undocumented until now. Let's add this to the
tool's help output! In fact, all of the above.
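The two examples above can be checked with a few lines of address arithmetic. This is only a sketch: the function name is invented, and it assumes the whole range is backed by a single THP so that only the virtual alignment matters.

```python
def cont_bytes(start, length, cont_size):
    """Bytes of [start, start + length) covered by whole, aligned cont blocks.

    Hypothetical helper: counts only blocks of cont_size bytes that begin
    on a cont_size boundary and lie entirely inside the range.
    """
    first = -(-start // cont_size) * cont_size        # round start up
    end = (start + length) // cont_size * cont_size   # round end down
    return max(0, end - first)


KB = 1024
MB = 1024 * KB

# 128K THP on a 64K boundary but not a 128K boundary: two 64K blocks.
print(cont_bytes(64 * KB, 128 * KB, 64 * KB) // (64 * KB))

# First 1M of a 2M THP mapped on a 64K boundary: adds 1M.
print(cont_bytes(2 * MB, 1 * MB, 64 * KB) // KB, "kB")
```

A range that starts off a 64K boundary loses the leading and trailing fragments, so a 64K mapping starting at a 4K offset contributes nothing.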
>
> Sorry if I've labored the point here. But I think the only thing the tool
> doesn't already do that you are asking for is to differentiate PTE- vs PMD-
> mappings?
That, plus explain itself, yes. :)
>
>>
>> ...
>> (e.g. /sys/fs/cgroup for cgroup-v2 or
>>>>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one
>>>>> of --pid and --cgroup must be provided.
>>>>
>>>> Maybe we could add "--global" to that list. That would look, in order,
>>>> inside cgroups2 and cgroups, for a list of pids, and then run as if
>>>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified.
>>>
>>> I think actually it might be better just to make global the default when neither
>>> --pid nor --cgroup are provided? And in this case, I'll just grab all the pids
>>> from /proc rather than traverse the cgroup hierarchy, that way it will work on
>>> systems without cgroups. Does that work for you?
>>
>> Yes! That was my initial idea, in fact, and after over-thinking it for
>> a while, it turned into the above. haha :)
>
> OK great - implemented for v3.
>
Sweet!
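For reference, collecting every pid from /proc needs very little code. The sketch below is an assumption about how it might be done, not the v3 implementation; the function name and the proc_root parameter (there so it can be pointed at a test directory) are invented here.

```python
import os


def all_pids(proc_root="/proc"):
    """Return the sorted pids found as numeric directories under proc_root.

    Hypothetical helper; processes may exit between listing and use, so
    callers would still need to tolerate missing entries.
    """
    return sorted(
        int(name)
        for name in os.listdir(proc_root)
        if name.isdecimal() and os.path.isdir(os.path.join(proc_root, name))
    )


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        print(len(all_pids()), "pids found")
```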
thanks,
--
John Hubbard
NVIDIA