Message-ID: <9acb1684-7c5a-41c4-9a23-edad73e55585@arm.com>
Date: Mon, 15 Jan 2024 15:56:11 +0000
Subject: Re: [PATCH v2] tools/mm: Add thpmaps script to dump THP usage info
From: Ryan Roberts
To: John Hubbard, Andrew Morton, Zenghui Yu, Matthew Wilcox, David Hildenbrand, Kefeng Wang, Zi Yan, Barry Song <21cnbao@gmail.com>, Alistair Popple, William Kucharski
Cc: linux-mm@kvack.org, Barry Song
In-Reply-To: <64f4fc88-b591-4a76-9a9f-3971225d0fa7@arm.com>

On 15/01/2024 09:48, Ryan Roberts wrote:
> On 12/01/2024 19:14, John Hubbard wrote:
>> On 1/12/24 02:00, Ryan Roberts wrote:
>>>> ...
>>>> After spending a day or two exploring running systems with this, I'd
>>>> like to suggest:
>>>>
>>>> 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot
>>>> of information: mTHP is configured as expected, and is helping or not,
>>>> etc.
>>>
>>> There is a difference between how a THP is mapped (PTE vs PMD) and its size. A
>>> PMD-sized THP can still be mapped with PTEs.
>>> So I'd rather not completely filter out PMD-sized THPs, if that's your
>>> suggestion. But we could make a distinction
>>
>> It's not...
>>
>>> between THPs mapped by PTE and those mapped by PMD; the kernel interface doesn't
>>> directly give us this, but we can infer it from the AnonHugePages and *PmdMapped
>>> stats in smaps.
>>
>> Yes, that would be excellent!
>>
>>>
>>>>
>>>> 2) Not having to list out all the mTHP sizes would be nice. Instead,
>>>> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/*,
>>>> unless the user specifies sizes.
>>>
>>> This is exactly what the tool already does. Perhaps you haven't fully understood
>>> the counters that it outputs?
>>
>> Oh yes, we are in perfect agreement about my not understanding these
>> counters. :) I'd even expound upon that a bit: despite having a fairly
>> good working understanding of the mTHP implementation in the kernel;
>> despite reading and re-reading the thpmaps documentation and peeking a
>> number of times at the thpmaps script; and despite poring over the
>> thpmaps output, I am still having a rough time with these counters.
>> Mainly because there is a set of hidden assumptions, many of which are
>> revealed below.
>
> Oh dear, sorry about that. Thanks for sticking with it and helping me get it right...
>
>>
>> But it's actually just a few key points that were missing from the
>> documentation, plus the ability to clearly see the pte-mapped parts. And
>> your proposed changes below look great; I've got a few more to add and
>> that should finish the job.
>
> OK good!
>
>>
>>>
>>> You *always* get the following counters (although note the tool *hides* all
>>
>> Good. It was not clear that these counters were always active. The --cont
>> documentation misleads the reader a bit on that matter.
>>
>>> counters whose value is 0 by default - show them with --inc-empty).
>>> This example is for a system with 4K base pages:
>>>
>>> # thpmaps --pid 1 --summary --inc-empty
>>>
>>> anon-thp-aligned-16kB:
>>> anon-thp-aligned-32kB:
>>> anon-thp-aligned-64kB:
>>> anon-thp-aligned-128kB:
>>> anon-thp-aligned-256kB:
>>> anon-thp-aligned-512kB:
>>> anon-thp-aligned-1024kB:
>>> anon-thp-aligned-2048kB:
>>> anon-thp-unaligned-16kB:
>>> anon-thp-unaligned-32kB:
>>> anon-thp-unaligned-64kB:
>>> anon-thp-unaligned-128kB:
>>> anon-thp-unaligned-256kB:
>>> anon-thp-unaligned-512kB:
>>> anon-thp-unaligned-1024kB:
>>> anon-thp-unaligned-2048kB:
>>> anon-thp-partial:
>>> file-thp-aligned-16kB:
>>> file-thp-aligned-32kB:
>>> file-thp-aligned-64kB:
>>> file-thp-aligned-128kB:
>>> file-thp-aligned-256kB:
>>> file-thp-aligned-512kB:
>>> file-thp-aligned-1024kB:
>>> file-thp-aligned-2048kB:
>>> file-thp-unaligned-16kB:
>>> file-thp-unaligned-32kB:
>>> file-thp-unaligned-64kB:
>>> file-thp-unaligned-128kB:
>>> file-thp-unaligned-256kB:
>>> file-thp-unaligned-512kB:
>>> file-thp-unaligned-1024kB:
>>> file-thp-unaligned-2048kB:
>>> file-thp-partial:
>>>
>>> So you have counters for every supported THP size in the system - they will be
>>> different for a 64K base page system.
>>>
>>> anon vs file: hopefully obvious
>>>
>>> aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In
>>> the aligned cases it is mapped so that it is naturally aligned. So a 16K THP is
>>
>> I think we should use "aligned" or "aligned to <size>", and stop saying
>> "naturally aligned", throughout. "Natural" adds no additional
>> information, and it makes the reader wonder if there is some other
>> aspect to the alignment (does natural imply PMD-mapped? etc) that they
>> are unaware of.
>
> OK. I thought "naturally aligned" was a fairly standard and well-understood
> term. Google says "We call a datum naturally aligned if its address is aligned
> to its size". But I'm happy to use the phrase "aligned to <size>" if that's clearer.
>
>>
>>
>>> mapped into VA space on a 16K boundary, a 32K THP on a 32K boundary, etc.
>>>
>>> partial: Parts of THPs that are partially mapped into VA space.
>>>
>>> Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs.
>>> But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So
>>> only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and
>>> file-thp-aligned-2048kB. We can filter that out by subtracting the relevant
>>> smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or
>>
>> That would work but is relatively awkward, but...
>>
>>> I could rename all the existing counters to include "pte" and introduce 2 new
>>> counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB?
>>
>> ...this would be perfect, I think. The "pte" would help self-document, and
>> separating things out allows for a clearer view into the behavior.
>>
>>>
>>> The --cont option will add *additional* special counters, if specified. The idea
>>> here is to provide a view on what percentage of memory is getting
>>> contpte-mapped. So if you provide "--cont 64K" it will give you a counter
>>> showing how much memory is in 64K, naturally aligned blocks (actually 2
>>> counters; file and anon). Those blocks can come from fully mapped and aligned
>>> 64K THPs. But they can also come from bigger THPs - for example, if a 128K THP
>>> is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2
>>> 64K cont blocks, but it will be counted as unaligned in
>>> anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only its
>>> first 1M is mapped and aligned on a 64K boundary, then it will be counted in the
>>> *-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter.
>>>
>>
>> Interesting, and completely undocumented until now. Let's add this to the
>> tool's help output! In fact, all of the above.
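
The --cont counting rule quoted above can be made concrete with a small Python
sketch. This is an illustration only, not code from thpmaps: the name
count_cont_blocks and its parameters are hypothetical, it assumes the range
passed in is one contiguously mapped piece of a single THP, and it looks at
virtual-address alignment only.

```python
KIB = 1024


def count_cont_blocks(va_start, mapped_len, block_size):
    """Count aligned blocks of block_size bytes that fit entirely inside
    the mapped range [va_start, va_start + mapped_len).

    Hypothetical helper for illustration; assumes the range is a single
    contiguous mapping backed by one THP.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of 2")
    mask = ~(block_size - 1)
    first = (va_start + block_size - 1) & mask  # round start up to a boundary
    last = (va_start + mapped_len) & mask       # round end down to a boundary
    return max(0, (last - first) // block_size)


# 128K THP mapped on a 64K (but not 128K) boundary: two 64K blocks.
blocks_128k = count_cont_blocks(64 * KIB, 128 * KIB, 64 * KIB)

# 2M THP with only its first 1M mapped, starting on a 2M boundary:
# sixteen 64K blocks, i.e. 1M of cont-aligned memory.
blocks_partial = count_cont_blocks(2048 * KIB, 1024 * KIB, 64 * KIB)
```

Rounding the start up and the end down means a range that merely straddles a
boundary contributes nothing, which matches the requirement that a block be
both fully mapped and aligned.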
>
> Well it already has this, which I intended to convey the same info:
>
>   --cont size[KMG]  Adds anon and file stats for naturally aligned,
>                     contiguously mapped blocks of the specified size. May be
>                     issued multiple times to track multiple sized blocks.
>                     Useful to infer e.g. arm64 contpte and hpa mappings. Size
>                     must be a power-of-2 number of pages.
>
> But yes, let me work up some improved documentation and send it out for your
> review. The reason it's a bit terse at the moment is that I'm using Python's
> ArgumentParser for the documentation, and it removes all line breaks from the
> description which makes it hard to format longer form docs. Anyway, that's a bad
> excuse for bad docs so I'll figure out a solution.

Here is my proposed documentation. If you could take a look and let me know if
it makes sense, then I'll modify the tool to conform:

--8<--
$ ./thpmaps --help
usage: thpmaps [-h] [--pid pid | --cgroup path] [--rollup]
               [--cont size[KMG]] [--inc-smaps] [--inc-empty]
               [--periodic sleep_ms]

Prints information about how transparent huge pages are mapped, either
system-wide, or for a specified process or cgroup.

A default set of statistics is always generated for THP mappings. However, it is
also possible to generate additional statistics for "contiguous block mappings"
where the block size is user-defined.

Statistics are maintained independently for anonymous and file-backed
(pagecache) memory and are shown both in kB and as a percentage of either total
anonymous or total file-backed memory as appropriate.

THP Statistics
--------------

Statistics are always generated for fully- and contiguously-mapped THPs whose
mapping address is aligned to their size, for each <size> supported by the
system. Separate counters describe THPs mapped by PTE vs those mapped by PMD.
(Although note a THP can only be mapped by PMD if it is PMD-sized):

- anon-thp-pte-aligned-<size>kB
- file-thp-pte-aligned-<size>kB
- anon-thp-pmd-aligned-<size>kB
- file-thp-pmd-aligned-<size>kB

Similarly, statistics are always generated for fully- and contiguously-mapped
THPs whose mapping address is *not* aligned to their size, for each <size>
supported by the system. Due to the unaligned mapping, it is impossible to map
by PMD, so there are only PTE counters for this case:

- anon-thp-pte-unaligned-<size>kB
- file-thp-pte-unaligned-<size>kB

Statistics are also always generated for mapped pages that belong to a THP but
where the THP is *not* fully- and contiguously-mapped. These "partial" mappings
are all counted in the same counter regardless of the size of the THP that is
partially mapped:

- anon-thp-pte-partial
- file-thp-pte-partial

Contiguous Block Statistics
---------------------------

An optional, additional set of statistics is generated for every contiguous
block size specified with `--cont <size>`. These statistics show how much memory
is mapped in contiguous blocks of <size> and also aligned to <size>. A given
contiguous block must all belong to the same THP, but there is no requirement
for it to be the *whole* THP. Separate counters describe contiguous blocks
mapped by PTE vs those mapped by PMD:

- anon-cont-pte-aligned-<size>kB
- file-cont-pte-aligned-<size>kB
- anon-cont-pmd-aligned-<size>kB
- file-cont-pmd-aligned-<size>kB

As an example, if monitoring 64K contiguous blocks (--cont 64K), there are a
number of sources that could provide such blocks: a fully- and
contiguously-mapped 64K THP that is aligned to a 64K boundary would provide 1
block. A fully- and contiguously-mapped 128K THP that is aligned to at least a
64K boundary would provide 2 blocks. Or a 128K THP that maps its first 100K, but
contiguously and starting at a 64K boundary would provide 1 block. A fully- and
contiguously-mapped 2M THP would provide 32 blocks. There are many other
possible permutations.

optional arguments:
  -h, --help           show this help message and exit
  --pid pid            Process id of the target process. --pid and --cgroup are
                       mutually exclusive. If neither are provided, all
                       processes are scanned to provide system-wide information.
  --cgroup path        Path to the target cgroup in sysfs. Iterates over every
                       pid in the cgroup and its children. --pid and --cgroup
                       are mutually exclusive. If neither are provided, all
                       processes are scanned to provide system-wide information.
  --rollup             Sum the per-vma statistics to provide a summary over the
                       whole system, process or cgroup.
  --cont size[KMG]     Adds stats for memory that is mapped in contiguous blocks
                       of <size> and also aligned to <size>. May be issued
                       multiple times to track multiple sized blocks. Useful to
                       infer e.g. arm64 contpte and hpa mappings. Size must be a
                       power-of-2 number of pages.
  --inc-smaps          Include all numerical, additive /proc/<pid>/smaps stats
                       in the output.
  --inc-empty          Show all statistics including those whose value is 0.
  --periodic sleep_ms  Run in a loop, polling every sleep_ms milliseconds.

Requires root privilege to access pagemap and kpageflags.
--8<--

Thanks,
Ryan

>
>
>>
>>>
>>> Sorry if I've labored the point here. But I think the only thing the tool
>>> doesn't already do that you are asking for is to differentiate PTE- vs PMD-
>>> mappings?
>>
>> That, plus explain itself, yes. :)
>
> Excellent! I'll post a follow up shortly.
>
>>
>>>
>>>>
>>>> ...
>>>>                           (e.g. /sys/fs/cgroup for cgroup-v2 or
>>>>>>>                           /sys/fs/cgroup/pids for cgroup-v1). Exactly one
>>>>>>>                           of --pid and --cgroup must be provided.
>>>>>>
>>>>>> Maybe we could add "--global" to that list. That would look, in order,
>>>>>> inside cgroups2 and cgroups, for a list of pids, and then run as if
>>>>>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified.
>>>>>
>>>>> I think actually it might be better just to make global the default when
>>>>> neither --pid nor --cgroup are provided? And in this case, I'll just grab
>>>>> all the pids from /proc rather than traverse the cgroup hierarchy, that
>>>>> way it will work on systems without cgroups. Does that work for you?
>>>>
>>>> Yes! That was my initial idea, in fact, and after over-thinking it for
>>>> a while, it turned into the above. haha :)
>>>
>>> OK great - implemented for v3.
>>>
>>
>> Sweet!
>>
>>
>> thanks,
>
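
For reference, the "grab all the pids from /proc" default discussed above could
look roughly like the Python sketch below. It is a minimal illustration rather
than the thpmaps code: list_pids and its proc_root parameter are hypothetical
names, and it assumes every all-digit directory under procfs is a process.

```python
import os


def list_pids(proc_root="/proc"):
    """Return the pids visible under proc_root, sorted numerically.

    Hypothetical helper for illustration; proc_root is a parameter only so
    the function can be pointed at a test directory.
    """
    pids = []
    for name in os.listdir(proc_root):
        # Process directories in procfs are named by their decimal pid;
        # entries such as "self" or "meminfo" are skipped.
        if name.isdecimal() and os.path.isdir(os.path.join(proc_root, name)):
            pids.append(int(name))
    return sorted(pids)
```

A process can exit between being listed and being scanned, so a caller would
need to tolerate a pid whose /proc files have disappeared by the time they are
opened.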