Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Hongru Zhang <zhanghongru06@gmail.com>
To: 21cnbao@gmail.com, zhongjinji@honor.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
	axelrasmussen@google.com, david@kernel.org, hannes@cmpxchg.org,
	jackmanb@google.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, lorenzo.stoakes@oracle.com, mhocko@suse.com,
	rppt@kernel.org, surenb@google.com, vbabka@suse.cz,
	weixugc@google.com, yuanchu@google.com, zhanghongru06@gmail.com,
	zhanghongru@xiaomi.com, ziy@nvidia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count
Date: Mon,  1 Dec 2025 20:29:12 +0800	[thread overview]
Message-ID: <20251201122912.348142-1-zhanghongru@xiaomi.com> (raw)
In-Reply-To: <CAGsJ_4xtc7ipFKYNQkGa-dSn7C8S7-J8LURqYrehfgenfPT=+w@mail.gmail.com>

> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.

Yeah, as I mentioned in patch (2/3), this implementation has accuracy
limitation:

    "Accuracy. Both implementations have accuracy limitations. The previous
    implementation required acquiring and releasing the zone lock for counting
    each order and migratetype, making it potentially inaccurate. Under high
    memory pressure, accuracy would further degrade due to zone lock
    contention or fragmentation. The new implementation collects data within a
    short time window, which helps maintain relatively small errors, and is
    unaffected by memory pressure. Furthermore, user-space memory management
    components inherently experience decision latency - by the time they
    process the collected data and execute actions, the memory state has
    already changed. This means that even perfectly accurate data at
    collection time becomes stale by decision time. Considering these factors,
    the accuracy trade-off introduced by the new implementation should be
    acceptable for practical use cases, offering a balance between performance
    and accuracy requirements."

Additional data:
1. average latency of pagetypeinfo_showfree_print() over 1,000,000 
   times is 4.67 us

2. average latency is 125 ns, if seq_printf() is taken out of the loop

Example code:

+unsigned long total_lat = 0;
+unsigned long total_count = 0;
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
                                        pg_data_t *pgdat, struct zone *zone)
 {
        int order, mtype;
+       ktime_t start;
+       u64 lat;
+       unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
+
+       start = ktime_get();
+       for (order = 0; order < NR_PAGE_ORDERS; ++order)
+               for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+                       freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+
+       lat = ktime_to_ns(ktime_sub(ktime_get(), start));
+       total_count++;
+       total_lat += lat;
 
        for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
                seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
                        bool overflow = false;
 
                        /* Keep the same output format for user-space tools compatibility */
-                       freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+                       freecount = freecounts[order][mtype];
                        if (freecount >= 100000) {
                                overflow = true;
                                freecount = 100000;
@@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
 #endif /* CONFIG_PAGE_OWNER */
 }

I think both are small time window (if IRQ is disabled, latency is more
deterministic).

> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.

I know this does not guarantee correctness of the overall result.
READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
store tearing and read tearing caused by compiler optimizations.

In fact, I have already noticed /proc/buddyinfo, which collects data under
zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
we could remove its zone lock as well, for the same reasons as
/proc/pagetypeinfo.

next prev parent reply	other threads:[~2025-12-01 12:29 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-28  3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
2025-11-28  3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
2025-11-29  0:34   ` Barry Song
2025-11-28  3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count Hongru Zhang
2025-11-28 12:03   ` zhongjinji
2025-11-29  0:00     ` Barry Song
2025-11-29  7:55       ` Barry Song
2025-12-01 12:29       ` Hongru Zhang [this message]
2025-12-01 18:54         ` Barry Song
2025-11-28  3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
2025-11-29  0:04   ` Barry Song
2025-11-29  9:24     ` Barry Song
2025-11-28  7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
2025-11-28  8:34   ` Hongru Zhang
2025-11-28  8:40     ` Lorenzo Stoakes
2025-11-28  9:24 ` Vlastimil Babka
2025-11-28 13:08   ` Johannes Weiner
2025-12-01  2:36   ` Hongru Zhang
2025-12-01 17:01     ` Zi Yan
2025-12-02  2:42       ` Hongru Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251201122912.348142-1-zhanghongru@xiaomi.com \
    --to=zhanghongru06@gmail.com \
    --cc=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=weixugc@google.com \
    --cc=yuanchu@google.com \
    --cc=zhanghongru@xiaomi.com \
    --cc=zhongjinji@honor.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox