linux-mm.kvack.org archive mirror
From: Suren Baghdasaryan <surenb@google.com>
To: Casey Chen <cachen@purestorage.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>,
	linux-mm@kvack.org, yzhong@purestorage.com
Subject: Re: [PATCH 0/1] alloc_tag: add per-numa node stats
Date: Tue, 10 Jun 2025 08:56:31 -0700
Message-ID: <CAJuCfpFTpHYuEs-Ev76_XkVKqeANwXY_Za6b6O1U=EuX5fgrvQ@mail.gmail.com>
In-Reply-To: <CALCePG3wXJK_5L1_WbKGXDpWaEBdEoPe2bPc-V1AU2hBwS1U6A@mail.gmail.com>

On Mon, Jun 9, 2025 at 5:22 PM Casey Chen <cachen@purestorage.com> wrote:
>
> On Wed, Jun 4, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jun 3, 2025 at 5:55 PM Casey Chen <cachen@purestorage.com> wrote:
> > >
> > > On Tue, Jun 3, 2025 at 8:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Mon, Jun 2, 2025 at 2:32 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Mon, Jun 2, 2025 at 1:48 PM Casey Chen <cachen@purestorage.com> wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 5:05 PM Kent Overstreet
> > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > >
> > > > > > > On Fri, May 30, 2025 at 02:45:57PM -0700, Casey Chen wrote:
> > > > > > > > On Thu, May 29, 2025 at 6:11 PM Kent Overstreet
> > > > > > > > <kent.overstreet@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, May 29, 2025 at 06:39:43PM -0600, Casey Chen wrote:
> > > > > > > > > > The patch is based on 4aab42ee1e4e ("mm/zblock: make active_list rcu_list")
> > > > > > > > > > from branch mm-new of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > > > > > > > > >
> > > > > > > > > > The patch adds per-NUMA alloc_tag stats. Bytes/calls in total and per NUMA
> > > > > > > > > > node are displayed in a single row for each alloc_tag in /proc/allocinfo.
> > > > > > > > > > Also, percpu allocations are marked and their stats are stored on NUMA node 0.
> > > > > > > > > > For example, the resulting file looks like this:
> > > > > > > > > >
> > > > > > > > > > percpu y total         8588     2147 numa0         8588     2147 numa1            0        0 kernel/irq/irqdesc.c:425 func:alloc_desc
> > > > > > > > > > percpu n total       447232     1747 numa0       269568     1053 numa1       177664      694 lib/maple_tree.c:165 func:mt_alloc_bulk
> > > > > > > > > > percpu n total        83200      325 numa0        30976      121 numa1        52224      204 lib/maple_tree.c:160 func:mt_alloc_one
> > > > > > > > > > ...
> > > > > > > > > > percpu n total       364800     5700 numa0       109440     1710 numa1       255360     3990 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1410 [mlx5_core] func:mlx5_alloc_cmd_msg
> > > > > > > > > > percpu n total      1249280    39040 numa0       374784    11712 numa1       874496    27328 drivers/net/ethernet/mellanox/mlx5/core/cmd.c:1376 [mlx5_core] func:alloc_cmd_box
> > > > > > > > >
> > > > > > > > > Err, what is 'percpu y/n'?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Percpu allocations are marked with 'percpu y/n' because their
> > > > > > > > 'bytes' stat is a per-CPU figure; it has to be multiplied by the
> > > > > > > > number of CPUs to get the total bytes. Marking them lets us know
> > > > > > > > the exact amount of memory used, and any /proc/allocinfo parser
> > > > > > > > can understand it and make correct calculations.
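> > > > > > > >
> > > > > > > > For illustration, a minimal C sketch of that calculation (the
> > > > > > > > field layout is the one from the example output above; the CPU
> > > > > > > > count has to come from elsewhere, e.g. sysfs on the target
> > > > > > > > system):
> > > > > > > >
> > > > > > > >   #include <stdio.h>
> > > > > > > >
> > > > > > > >   /* line: "percpu <y|n> total <bytes> <calls> ..." */
> > > > > > > >   static unsigned long long total_bytes(const char *line,
> > > > > > > >                                         unsigned int ncpus)
> > > > > > > >   {
> > > > > > > >           char flag;
> > > > > > > >           unsigned long long bytes, calls;
> > > > > > > >
> > > > > > > >           if (sscanf(line, "percpu %c total %llu %llu",
> > > > > > > >                      &flag, &bytes, &calls) != 3)
> > > > > > > >                   return 0;
> > > > > > > >           /* 'percpu y' bytes are per-CPU: scale by CPU count */
> > > > > > > >           return flag == 'y' ? bytes * ncpus : bytes;
> > > > > > > >   }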
> > > > > > >
> > > > > > > Ok, just wanted to be sure it wasn't something else. Let's shorten
> > > > > > > that though; a single character should suffice (we already have a
> > > > > > > header that can explain what it is). If you're growing the width,
> > > > > > > we don't want to overflow.
> > > > > > >
> > > > > >
> > > > > > Does it have a header?
> > > > >
> > > > > Yes. See print_allocinfo_header().
> > > >
> > > > I was wondering whether, instead of changing the /proc/allocinfo
> > > > format to contain both total and per-node information, we could keep
> > > > it as is (containing only totals) while exposing the per-node
> > > > information in new /sys/devices/system/node/node<node_no>/allocinfo
> > > > files. That seems cleaner to me.
> > > >
> > >
> > > The output of /sys/devices/system/node/node<node_no>/allocinfo is
> > > strictly limited to a single PAGE_SIZE, so it cannot display stats
> > > for all tags.
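> > >
> > > For context, that limit comes from the sysfs ->show() convention of
> > > formatting into a single preallocated page; a hypothetical per-node
> > > attribute would look roughly like this sketch (names assumed):
> > >
> > >   static ssize_t allocinfo_show(struct device *dev,
> > >                                 struct device_attribute *attr,
> > >                                 char *buf)
> > >   {
> > >           /* buf is one page; sysfs_emit() caps output at PAGE_SIZE */
> > >           return sysfs_emit(buf, "node%d allocinfo ...\n", dev->id);
> > >   }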
> >
> > Ugh, that's a pity. Another option would be to add a "nid" column
> > like this when this config is specified:
> >
> > nid     bytes    calls
> > 0        8588     2147    kernel/irq/irqdesc.c:425 func:alloc_desc
> > 1           0        0    kernel/irq/irqdesc.c:425 func:alloc_desc
> > ...
> >
> > It bloats the file size but looks more structured to me.
> >
>
> How about this format?
>
> With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=y, /proc/allocinfo looks like:
> allocinfo - version: 1.0
> <nid>       <size> <calls> <tag info>
>                  0        0 init/main.c:1310 func:do_initcalls
>     0            0        0
>     1            0        0

If we go that way, then why not:

allocinfo - version: 2.0
<size> <calls> <tag info>
776704     1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq
        nid0       348672      681
        nid1       428032      836
6144        6 kernel/workqueue.c:4133 func:get_unbound_pool
        nid0         4096        4
        nid1         2048        2
...

If CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n, the file format will not change.
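
For what it's worth, parsing the nested layout stays simple. A minimal
userspace sketch in C, assuming the indented "nid<N>" continuation lines
shown above (this targets the proposed v2.0 format, not anything in the
current kernel):

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/allocinfo", "r");
          char line[1024];
          unsigned long long bytes, calls;
          int nid;

          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f)) {
                  /* indented continuation lines: "nid<N> <bytes> <calls>" */
                  if (sscanf(line, " nid%d %llu %llu", &nid, &bytes, &calls) == 3)
                          printf("  node %d: %llu bytes, %llu calls\n",
                                 nid, bytes, calls);
                  /* top-level lines: "<bytes> <calls> <tag info>" */
                  else if (sscanf(line, "%llu %llu", &bytes, &calls) == 2)
                          printf("total: %llu bytes, %llu calls\n",
                                 bytes, calls);
          }
          fclose(f);
          return 0;
  }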


> ...
>             776704     1517 kernel/workqueue.c:4301 func:alloc_unbound_pwq
>     0       348672      681
>     1       428032      836
>               6144        6 kernel/workqueue.c:4133 func:get_unbound_pool
>     0         4096        4
>     1         2048        2
>
> With CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=n, /proc/allocinfo
> stays the same as before:
> allocinfo - version: 1.0
> <nid>       <size> <calls> <tag info>
>                  0        0 init/main.c:1310 func:do_initcalls
>                  0        0 init/do_mounts.c:350 func:mount_nodev_root
>                  0        0 init/do_mounts.c:187 func:mount_root_generic
> ...
>
> > >
> > > > I'm also not a fan of "percpu y" tags, as that requires the reader
> > > > to know how many CPUs were in the system to make the calculation
> > > > (you might get the allocinfo content from a system you have no
> > > > access to, with no additional information). Maybe we can have
> > > > "per-cpu bytes" and "total bytes" columns instead? For per-cpu
> > > > allocations these will differ; for all other allocations the two
> > > > columns will contain the same number.
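> > > >
> > > > For example (hypothetical layout, assuming a 64-CPU system):
> > > >
> > > > per-cpu-bytes  total-bytes  calls
> > > >          8588       549632   2147  kernel/irq/irqdesc.c:425 func:alloc_desc
> > > >        447232       447232   1747  lib/maple_tree.c:165 func:mt_alloc_bulk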
> > >
> > > I plan to remove 'percpu y/n' from this patch and implement it later.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > To save memory, we dynamically allocate the per-NUMA-node stats
> > > > > > > > > > counters once the system boots up and knows how many NUMA nodes are
> > > > > > > > > > available. The percpu allocator is used for these allocations, hence
> > > > > > > > > > PERCPU_DYNAMIC_RESERVE is increased.
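> > > > > > > > > >
> > > > > > > > > > For illustration, the per-node sizing could look roughly like
> > > > > > > > > > this (a sketch only; exact names may differ from the patch):
> > > > > > > > > >
> > > > > > > > > >   /* one counters slot per possible NUMA node */
> > > > > > > > > >   tag->counters = __alloc_percpu_gfp(
> > > > > > > > > >           nr_node_ids * sizeof(struct alloc_tag_counters),
> > > > > > > > > >           __alignof__(struct alloc_tag_counters),
> > > > > > > > > >           GFP_KERNEL);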
> > > > > > > > > >
> > > > > > > > > > For in-kernel alloc_tags, pcpu_alloc_noprof() is called, so the memory
> > > > > > > > > > for these counters is not accounted in profiling stats.
> > > > > > > > > >
> > > > > > > > > > For loadable modules, __alloc_percpu_gfp() is called and the memory is accounted.
> > > > > > > > >
> > > > > > > > > Intriguing, but I'd make it a Kconfig option; AFAIK this would mainly
> > > > > > > > > be of interest to people optimizing allocations to make sure they're
> > > > > > > > > on the right NUMA node?
> > > > > > > >
> > > > > > > > Yes, to help us know if there is a NUMA imbalance issue and make
> > > > > > > > some optimizations. I can make it a Kconfig option. Does anybody
> > > > > > > > else have any opinion about this feature? Thanks!
> > > > > > >
> > > > > > > I would like to see some other opinions from potential users; have
> > > > > > > you been circulating it?
> > > > > >
> > > > > > We have been using it internally for a while. I don't know who the
> > > > > > potential users are or how to reach them, so I am sharing it here to
> > > > > > collect opinions from others.
> > > > >
> > > > > Should definitely have a separate Kconfig option. Have you measured
> > > > > the memory and performance overhead of this change?


Thread overview: 20+ messages
2025-05-30  0:39 Casey Chen
2025-05-30  0:39 ` [PATCH] " Casey Chen
2025-05-30  1:11 ` [PATCH 0/1] " Kent Overstreet
2025-05-30 21:45   ` Casey Chen
2025-05-31  0:05     ` Kent Overstreet
2025-06-02 20:48       ` Casey Chen
2025-06-02 21:32         ` Suren Baghdasaryan
2025-06-03 15:00           ` Suren Baghdasaryan
2025-06-03 17:34             ` Kent Overstreet
2025-06-04  0:55             ` Casey Chen
2025-06-04 15:21               ` Suren Baghdasaryan
2025-06-04 15:50                 ` Kent Overstreet
2025-06-10  0:21                 ` Casey Chen
2025-06-10 15:56                   ` Suren Baghdasaryan [this message]
2025-06-03 20:00           ` Casey Chen
2025-06-03 20:18             ` Suren Baghdasaryan
2025-06-02 21:52         ` Kent Overstreet
2025-06-02 22:08           ` Steven Rostedt
2025-06-02 23:35             ` Kent Overstreet
2025-06-03  6:46               ` Ian Rogers
