linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Barry Song <21cnbao@gmail.com>
To: Hongru Zhang <zhanghongru06@gmail.com>
Cc: zhongjinji@honor.com, Liam.Howlett@oracle.com,
	akpm@linux-foundation.org,  axelrasmussen@google.com,
	david@kernel.org, hannes@cmpxchg.org,  jackmanb@google.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	 lorenzo.stoakes@oracle.com, mhocko@suse.com, rppt@kernel.org,
	 surenb@google.com, vbabka@suse.cz, weixugc@google.com,
	yuanchu@google.com,  zhanghongru@xiaomi.com, ziy@nvidia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count
Date: Tue, 2 Dec 2025 02:54:28 +0800	[thread overview]
Message-ID: <CAGsJ_4wvwH8oMYhbbhnbJAdX-fqBG9z-hXYqfshEXU0ChRpFyg@mail.gmail.com> (raw)
In-Reply-To: <20251201122912.348142-1-zhanghongru@xiaomi.com>

On Mon, Dec 1, 2025 at 8:29 PM Hongru Zhang <zhanghongru06@gmail.com> wrote:
>
> > Right. If we want the freecount to accurately reflect the current system
> > state, we still need to take the zone lock.
>
> Yeah, as I mentioned in patch (2/3), this implementation has accuracy
> limitation:
>
>     "Accuracy. Both implementations have accuracy limitations. The previous
>     implementation required acquiring and releasing the zone lock for counting
>     each order and migratetype, making it potentially inaccurate. Under high
>     memory pressure, accuracy would further degrade due to zone lock
>     contention or fragmentation. The new implementation collects data within a
>     short time window, which helps maintain relatively small errors, and is
>     unaffected by memory pressure. Furthermore, user-space memory management
>     components inherently experience decision latency - by the time they
>     process the collected data and execute actions, the memory state has
>     already changed. This means that even perfectly accurate data at
>     collection time becomes stale by decision time. Considering these factors,
>     the accuracy trade-off introduced by the new implementation should be
>     acceptable for practical use cases, offering a balance between performance
>     and accuracy requirements."
>
> Additional data:
> 1. average latency of pagetypeinfo_showfree_print() over 1,000,000
>    times is 4.67 us
>
> 2. average latency is 125 ns, if seq_printf() is taken out of the loop
>
> Example code:
>
> +unsigned long total_lat = 0;
> +unsigned long total_count = 0;
> +
>  static void pagetypeinfo_showfree_print(struct seq_file *m,
>                                         pg_data_t *pgdat, struct zone *zone)
>  {
>         int order, mtype;
> +       ktime_t start;
> +       u64 lat;
> +       unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
> +
> +       start = ktime_get();
> +       for (order = 0; order < NR_PAGE_ORDERS; ++order)
> +               for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> +                       freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +
> +       lat = ktime_to_ns(ktime_sub(ktime_get(), start));
> +       total_count++;
> +       total_lat += lat;
>
>         for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>                 seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>                         bool overflow = false;
>
>                         /* Keep the same output format for user-space tools compatibility */
> -                       freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +                       freecount = freecounts[order][mtype];
>                         if (freecount >= 100000) {
>                                 overflow = true;
>                                 freecount = 100000;
> @@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
>  #endif /* CONFIG_PAGE_OWNER */
>  }
>
> I think both are small time window (if IRQ is disabled, latency is more
> deterministic).
>
> > Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> > correctness. They may ensure single-copy atomicity per access, but not for the
> > overall result.
>
> I know this does not guarantee correctness of the overall result.
> READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
> store tearing and read tearing caused by compiler optimizations.

Yes, I realized that correctness might not be a major concern, so I sent a
follow-up email [1] after replying to you.

>
> In fact, I have already noticed /proc/buddyinfo, which collects data under
> zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
> we could remove its zone lock as well, for the same reasons as
> /proc/pagetypeinfo.

That might be correct. However, if it doesn’t significantly affect performance
and buddyinfo is accessed much less frequently than the buddy list, we may
just leave it as is.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wUQdQyB_3y0Buf3uG34hvgpMAP3qHHwJM3=R01RJOuvw@mail.gmail.com/

Thanks
Barry


  reply	other threads:[~2025-12-01 18:54 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-28  3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
2025-11-28  3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
2025-11-29  0:34   ` Barry Song
2025-11-28  3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count Hongru Zhang
2025-11-28 12:03   ` zhongjinji
2025-11-29  0:00     ` Barry Song
2025-11-29  7:55       ` Barry Song
2025-12-01 12:29       ` Hongru Zhang
2025-12-01 18:54         ` Barry Song [this message]
2025-11-28  3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
2025-11-29  0:04   ` Barry Song
2025-11-29  9:24     ` Barry Song
2025-11-28  7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
2025-11-28  8:34   ` Hongru Zhang
2025-11-28  8:40     ` Lorenzo Stoakes
2025-11-28  9:24 ` Vlastimil Babka
2025-11-28 13:08   ` Johannes Weiner
2025-12-01  2:36   ` Hongru Zhang
2025-12-01 17:01     ` Zi Yan
2025-12-02  2:42       ` Hongru Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGsJ_4wvwH8oMYhbbhnbJAdX-fqBG9z-hXYqfshEXU0ChRpFyg@mail.gmail.com \
    --to=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=weixugc@google.com \
    --cc=yuanchu@google.com \
    --cc=zhanghongru06@gmail.com \
    --cc=zhanghongru@xiaomi.com \
    --cc=zhongjinji@honor.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox