From: Barry Song <21cnbao@gmail.com>
To: Hongru Zhang <zhanghongru06@gmail.com>
Cc: zhongjinji@honor.com, Liam.Howlett@oracle.com,
akpm@linux-foundation.org, axelrasmussen@google.com,
david@kernel.org, hannes@cmpxchg.org, jackmanb@google.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
lorenzo.stoakes@oracle.com, mhocko@suse.com, rppt@kernel.org,
surenb@google.com, vbabka@suse.cz, weixugc@google.com,
yuanchu@google.com, zhanghongru@xiaomi.com, ziy@nvidia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count
Date: Tue, 2 Dec 2025 02:54:28 +0800 [thread overview]
Message-ID: <CAGsJ_4wvwH8oMYhbbhnbJAdX-fqBG9z-hXYqfshEXU0ChRpFyg@mail.gmail.com> (raw)
In-Reply-To: <20251201122912.348142-1-zhanghongru@xiaomi.com>
On Mon, Dec 1, 2025 at 8:29 PM Hongru Zhang <zhanghongru06@gmail.com> wrote:
>
> > Right. If we want the freecount to accurately reflect the current system
> > state, we still need to take the zone lock.
>
> Yeah, as I mentioned in patch (2/3), this implementation has accuracy
> limitation:
>
> "Accuracy. Both implementations have accuracy limitations. The previous
> implementation required acquiring and releasing the zone lock for counting
> each order and migratetype, making it potentially inaccurate. Under high
> memory pressure, accuracy would further degrade due to zone lock
> contention or fragmentation. The new implementation collects data within a
> short time window, which helps maintain relatively small errors, and is
> unaffected by memory pressure. Furthermore, user-space memory management
> components inherently experience decision latency - by the time they
> process the collected data and execute actions, the memory state has
> already changed. This means that even perfectly accurate data at
> collection time becomes stale by decision time. Considering these factors,
> the accuracy trade-off introduced by the new implementation should be
> acceptable for practical use cases, offering a balance between performance
> and accuracy requirements."
>
> Additional data:
> 1. average latency of pagetypeinfo_showfree_print() over 1,000,000
> times is 4.67 us
>
> 2. average latency is 125 ns, if seq_printf() is taken out of the loop
>
> Example code:
>
> +unsigned long total_lat = 0;
> +unsigned long total_count = 0;
> +
> static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + ktime_t start;
> + u64 lat;
> + unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
> +
> + start = ktime_get();
> + for (order = 0; order < NR_PAGE_ORDERS; ++order)
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> + freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +
> + lat = ktime_to_ns(ktime_sub(ktime_get(), start));
> + total_count++;
> + total_lat += lat;
>
> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> bool overflow = false;
>
> /* Keep the same output format for user-space tools compatibility */
> - freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> + freecount = freecounts[order][mtype];
> if (freecount >= 100000) {
> overflow = true;
> freecount = 100000;
> @@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
> #endif /* CONFIG_PAGE_OWNER */
> }
>
> I think both are small time window (if IRQ is disabled, latency is more
> deterministic).
>
> > Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> > correctness. They may ensure single-copy atomicity per access, but not for the
> > overall result.
>
> I know this does not guarantee correctness of the overall result.
> READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
> store tearing and read tearing caused by compiler optimizations.
Yes, I realized that correctness might not be a major concern, so I sent a
follow-up email [1] after replying to you.
>
> In fact, I have already noticed /proc/buddyinfo, which collects data under
> zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
> we could remove its zone lock as well, for the same reasons as
> /proc/pagetypeinfo.
That might be correct. However, if it doesn’t significantly affect performance
and buddyinfo is accessed much less frequently than the buddy list, we may
just leave it as is.
[1] https://lore.kernel.org/linux-mm/CAGsJ_4wUQdQyB_3y0Buf3uG34hvgpMAP3qHHwJM3=R01RJOuvw@mail.gmail.com/
Thanks
Barry
next prev parent reply other threads:[~2025-12-01 18:54 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-28 3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
2025-11-28 3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
2025-11-29 0:34 ` Barry Song
2025-11-28 3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count Hongru Zhang
2025-11-28 12:03 ` zhongjinji
2025-11-29 0:00 ` Barry Song
2025-11-29 7:55 ` Barry Song
2025-12-01 12:29 ` Hongru Zhang
2025-12-01 18:54 ` Barry Song [this message]
2025-11-28 3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
2025-11-29 0:04 ` Barry Song
2025-11-29 9:24 ` Barry Song
2025-11-28 7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
2025-11-28 8:34 ` Hongru Zhang
2025-11-28 8:40 ` Lorenzo Stoakes
2025-11-28 9:24 ` Vlastimil Babka
2025-11-28 13:08 ` Johannes Weiner
2025-12-01 2:36 ` Hongru Zhang
2025-12-01 17:01 ` Zi Yan
2025-12-02 2:42 ` Hongru Zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAGsJ_4wvwH8oMYhbbhnbJAdX-fqBG9z-hXYqfshEXU0ChRpFyg@mail.gmail.com \
--to=21cnbao@gmail.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=david@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=jackmanb@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhocko@suse.com \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=vbabka@suse.cz \
--cc=weixugc@google.com \
--cc=yuanchu@google.com \
--cc=zhanghongru06@gmail.com \
--cc=zhanghongru@xiaomi.com \
--cc=zhongjinji@honor.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox