From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Karim Manaouil <kmanaouil.dev@gmail.com>, Yu Zhao <yuzhao@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
David Bueso <dave@stgolabs.net>, Michal Hocko <mhocko@suse.com>,
Dan Williams <dan.j.williams@intel.com>,
John Hubbard <jhubbard@nvidia.com>,
Daniel Gomez <da.gomez@samsung.com>,
linux-mm <linux-mm@kvack.org>,
lsf-pc@lists.linux-foundation.org
Subject: Re: [LSFMM] automating measuring memory fragmentation
Date: Mon, 20 May 2024 16:34:54 +0200
Message-ID: <8d594a01-3e39-4347-9d4e-4db91845f6a3@kernel.org>
In-Reply-To: <ZkZ7fwkBQ_pBEImO@localhost.localdomain>

On 5/16/24 11:32 PM, Karim Manaouil wrote:
> On Thu, May 16, 2024 at 02:05:24PM -0600, Yu Zhao wrote:
>> For example, if we have two systems, one has lower fragmentation for
>> some orders but higher fragmentation for the rest, and the other is
>> the opposite. How would we be able to use a single measure to describe
>> this? IOW, I don't think a single measurement can describe all orders
>> in a comparable way, which would be the weakest requirement we would
>> have to impose.
>
>> As I (badly) explained earlier, a single value can't do that because
>> different orders are not on the same footing (or so to speak), unless
>> we are only interested in one non-zero order. So we would need
>> fragmentation_index[NR_non_zero_orders].
>
>> No, for example, A can allocate 4 order-1 but 0 order-2, and B can
>> allocate 2 order-1 *or* 1 order-2, which one would you say is better
>> or worse? This, IMO, depends on which order you are trying to
>> allocate. Does it make sense?
>
> But higher-order pages can always be broken down into lower-order pages.
> However, the inverse is not always guaranteed (they may not be buddies,
> or compaction/reclaim isn't helpful).
>
> Obviously, I would rather have one order-4 page than two order-3 pages.
> You can always satisfy an allocation for an order n if a page with an
> order higher than n is available.
>
> One way to measure fragmentation is to compare how far we are from some
> perfect value. The perfect value represents the case when all the free
> memory is available as blocks of pageblock_order or MAX_PAGE_ORDER.
>
> I can do this as a one-shot calculation, for example with:
>
> static void estimate_numa_fragmentation(void)
> {
>         pg_data_t *pgdat;
>         struct zone *z;
>         unsigned long fragscore;
>         unsigned long bestscore;
>         unsigned long nr_free;
>         int order;
>
>         for_each_online_pgdat(pgdat) {
>                 nr_free = fragscore = 0;
>                 z = pgdat->node_zones;
>                 while (z < (pgdat->node_zones + pgdat->nr_zones)) {
>                         if (!populated_zone(z)) {
>                                 z++;
>                                 continue;
>                         }
>                         spin_lock_irq(&z->lock);
>                         for (order = 0; order < NR_PAGE_ORDERS; order++) {
>                                 /* total free pages at this order */
>                                 nr_free += z->free_area[order].nr_free << order;
>                                 /* weight each free block by the square of its size */
>                                 fragscore += z->free_area[order].nr_free << (order * 2);
>                         }
>                         spin_unlock_irq(&z->lock);
>                         z++;
>                         cond_resched();
>                 }
>                 /* no free memory on this node, avoid dividing by zero */
>                 if (!nr_free)
>                         continue;
>                 bestscore = nr_free << MAX_PAGE_ORDER;
>                 fragscore = ((bestscore - fragscore) * 100) / bestscore;
>                 pr_info("fragscore on node %d: %lu\n", pgdat->node_id, fragscore);
>         }
> }

I've had a similar idea, which may be exactly the same as yours; I don't
immediately see whether that's the case, and I haven't formalized mine,
just have an intuitive explanation. But the basic premise is the same: if
"continuity" is between 0 and 100, then all free memory available in
MAX_PAGE_ORDER blocks should score 100.

Then e.g. if we have 50% of the memory in MAX_PAGE_ORDER blocks, having
the other 50% fully available in MAX_PAGE_ORDER-1 blocks should give us
another 25%, for a total of 75%, and so on. Maybe the decay of the
contribution to the continuity metric for each one-order decrease should
be less than 2x, not sure.
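
To write that intuition down (just my sketch, keeping the 2x decay; f_o
is the number of free pages sitting in order-o blocks and F the total
number of free pages):

  continuity = 100 * sum_o (f_o / F) * 2^(o - MAX_PAGE_ORDER)

so memory that's fully available in MAX_PAGE_ORDER blocks scores 100,
and a chunk's contribution halves for each order below that.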

In the case of Yu Zhao's example, "A can allocate 4 order-1 but 0
order-2, and B can allocate 2 order-1 *or* 1 order-2", I think this
should at least result in the same score? And of course it's not useful
if we don't know which allocations we actually need, but it may be
useful enough to compare two systems.
> But there must be a way to streamline the calculation and update the value
> with low overhead over time.

I think instead of trying to track anything in the kernel and set it in
stone, the metric should be calculated in userspace from the base values
read from /proc/buddyinfo (not fragindex). We could maybe ignore small
special zones like ZONE_DMA and sum the per-order numbers of the other
zones together; a rough sketch of that is below. We could also
differentiate per migratetype using /proc/pagetypeinfo, but then it's no
longer a single number.
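
Just to sketch what I mean (untested; the MAX_PAGE_ORDER value, the node
limit and skipping only ZONE_DMA are assumptions that depend on the
kernel config):

#include <stdio.h>
#include <string.h>

#define MAX_PAGE_ORDER  10      /* assumed, matches common configs */
#define NR_PAGE_ORDERS  (MAX_PAGE_ORDER + 1)
#define MAX_NODES       64      /* assumed limit for this sketch */

int main(void)
{
        unsigned long long pages[MAX_NODES][NR_PAGE_ORDERS] = { 0 };
        char line[512], zone[16];
        int max_node = -1;
        FILE *f = fopen("/proc/buddyinfo", "r");

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                int node, pos;
                char *p = line;

                if (sscanf(p, "Node %d, zone %15s%n", &node, zone, &pos) != 2)
                        continue;
                /* ignore the small special ZONE_DMA */
                if (node < 0 || node >= MAX_NODES || !strcmp(zone, "DMA"))
                        continue;
                p += pos;
                /* sum the per-order numbers of the remaining zones per node */
                for (int o = 0; o < NR_PAGE_ORDERS; o++) {
                        unsigned long cnt;

                        if (sscanf(p, " %lu%n", &cnt, &pos) != 1)
                                break;
                        p += pos;
                        pages[node][o] += cnt << o;     /* free pages at order o */
                }
                if (node > max_node)
                        max_node = node;
        }
        fclose(f);

        for (int node = 0; node <= max_node; node++) {
                double total = 0, score = 0;

                for (int o = 0; o < NR_PAGE_ORDERS; o++) {
                        total += pages[node][o];
                        /* 2x decay for each order below MAX_PAGE_ORDER */
                        score += (double)pages[node][o] /
                                 (1UL << (MAX_PAGE_ORDER - o));
                }
                if (total > 0)
                        printf("node %d continuity: %.1f\n",
                               node, 100.0 * score / total);
        }
        return 0;
}

Sampled periodically, the difference between two readings would show how
a node's memory fragments or recovers over time.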
> Cheers
> Karim
> PhD Student
> Edinburgh University
>