linux-mm.kvack.org archive mirror
* [LSFMM] automating measuring memory fragmentation
@ 2024-05-15 19:34 Luis Chamberlain
  2024-05-16  5:15 ` Yu Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Luis Chamberlain @ 2024-05-15 19:34 UTC (permalink / raw)
  To: Michal Hocko, Dan Williams
  Cc: Luis Chamberlain, Yu Zhao, John Hubbard, Daniel Gomez, linux-mm, lsf-pc

RFC to see if we have a breakout session today at LSFMM.

After the TAO talk today it occurred to me that it might make sense
to review how we're measuring memory fragmentation today. We're looking
to add automation support to kdevops for this, to help compare and
contrast memory fragmentation behaviour of one kernel against another.
A while ago, while mTHP was being evaluated, I asked generally how we
could measure fragmentation with a single value, and John Hubbard had
one recommendation [0]. Working through that proved we could simplify
things [1], but we could also just use the existing fragmentation index
and only consider the values that indicate fragmentation rather than
lack of memory. This raises the question of how folks are measuring
memory fragmentation in production today, and whether they have any
desired changes. The first approach being considered is to reproduce
the workloads Mel Gorman wrote and used for mmtests and leverage those
in kdevops, perhaps modernizing them, but before we do so, reviewing
how we measure fragmentation today might be useful to others too.

As for mmtests integration into kdevops, the first order of business
is just a few distro-friendly updates [2]; for the next steps after
that, it would be great to review the above.

[0] https://lore.kernel.org/all/5ac6a387-0ca7-45ca-bebc-c3bdd48452cb@nvidia.com/T/#u
[1] https://lkml.kernel.org/r/20240314005710.2964798-1-mcgrof@kernel.org
[2] https://lore.kernel.org/kdevops/20240319044621.2682968-1-mcgrof@kernel.org/

  Luis



* Re: [LSFMM] automating measuring memory fragmentation
  2024-05-15 19:34 [LSFMM] automating measuring memory fragmentation Luis Chamberlain
@ 2024-05-16  5:15 ` Yu Zhao
  2024-05-16  6:23   ` Luis Chamberlain
  0 siblings, 1 reply; 7+ messages in thread
From: Yu Zhao @ 2024-05-16  5:15 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Michal Hocko, Dan Williams, John Hubbard, Daniel Gomez, linux-mm, lsf-pc

On Wed, May 15, 2024 at 1:34 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> RFC to see if we have a breakout session today at LSFMM.
>
> After the TAO talk today it occurred to me that it might make sense
> to review how we're measuring memory fragmentation today. We're looking
> to add automation support to kdevops for this, to help compare and
> contrast memory fragmentation behaviour of one kernel against another.
> A while ago, while mTHP was being evaluated, I asked generally how we
> could measure fragmentation with a single value, and John Hubbard had
> one recommendation [0]. Working through that proved we could simplify
> things [1], but we could also just use the existing fragmentation index
> and only consider the values that indicate fragmentation rather than
> lack of memory. This raises the question of how folks are measuring
> memory fragmentation in production today, and whether they have any
> desired changes. The first approach being considered is to reproduce
> the workloads Mel Gorman wrote and used for mmtests and leverage those
> in kdevops, perhaps modernizing them, but before we do so, reviewing
> how we measure fragmentation today might be useful to others too.
>
> As for mmtests integration into kdevops, the first order of business
> is just a few distro-friendly updates [2]; for the next steps after
> that, it would be great to review the above.
>
> [0] https://lore.kernel.org/all/5ac6a387-0ca7-45ca-bebc-c3bdd48452cb@nvidia.com/T/#u
> [1] https://lkml.kernel.org/r/20240314005710.2964798-1-mcgrof@kernel.org
> [2] https://lore.kernel.org/kdevops/20240319044621.2682968-1-mcgrof@kernel.org/

Please correct me if I'm wrong -- I don't think we can use a single
measure to describe fragmentation in an actionable way.

IMO, we would need at least multiple values, e.g., fragmentation index
for each non-zero order, to describe how fragmented the memory is with
respect to the order of interest. Of course we could encode multiple
fragmentation indices into a single value, but that's not really one
measure.

Fragmentation index of an order can tell whether reclaim+compaction
can theoretically result in a free area of that order. As an average,
fragmentation index can't tell which actionable unit area, e.g.,
pageblock, would be the best candidate for reclaim and/or compaction.
That would require a ranking model, e.g., a cost function and weights
for reclaim and compaction operations, and calculations of the cost to
produce a free area of a requested order for each pageblock, i.e., a
2-dimensional measure
costs_to_produce_free_area[NR_non_zero_orders][NR_pageblocks].
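
The first part, at least, is cheap: the per-order fragmentation index
needs only the per-order free block counts. A minimal userspace sketch,
assuming the formula the kernel uses in mm/vmstat.c (index ~= 1 - (1 +
free_pages/requested) / free_blocks_total, negative when a suitable free
block already exists) and a made-up free-area layout:

#include <stdio.h>

#define NR_ORDERS 11	/* MAX_PAGE_ORDER + 1 on common configs */

/* < 0: request satisfiable; ~0: out of memory; -> 1: fragmented */
static double frag_index(const unsigned long nr_free[NR_ORDERS],
			 unsigned int order)
{
	unsigned long requested = 1UL << order;
	unsigned long free_pages = 0, total = 0, suitable = 0;
	unsigned int o;

	for (o = 0; o < NR_ORDERS; o++) {
		total += nr_free[o];
		free_pages += nr_free[o] << o;
		if (o >= order)
			suitable += nr_free[o] << (o - order);
	}
	if (!total)
		return 0.0;	/* no free memory at all */
	if (suitable)
		return -1.0;	/* a suitable free block already exists */
	return 1.0 - (1.0 + (double)free_pages / requested) / total;
}

int main(void)
{
	/* hypothetical zone: plenty of order-0/1 pages, nothing larger */
	unsigned long nr_free[NR_ORDERS] = { 300, 150 };
	unsigned int order;

	for (order = 1; order < NR_ORDERS; order++)
		printf("order %u: %.3f\n", order, frag_index(nr_free, order));
	return 0;
}

On the made-up layout the index climbs toward 1 with increasing order,
i.e. failures at the larger orders would be due to fragmentation rather
than lack of memory.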



* Re: [LSFMM] automating measuring memory fragmentation
  2024-05-16  5:15 ` Yu Zhao
@ 2024-05-16  6:23   ` Luis Chamberlain
  2024-05-16 20:05     ` Yu Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Luis Chamberlain @ 2024-05-16  6:23 UTC (permalink / raw)
  To: Yu Zhao, David Bueso
  Cc: Michal Hocko, Dan Williams, John Hubbard, Daniel Gomez, linux-mm, lsf-pc

On Wed, May 15, 2024 at 11:15:58PM -0600, Yu Zhao wrote:
> On Wed, May 15, 2024 at 1:34 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > RFC to see if we have a breakout session today at LSFMM.
> >
> > After the TAO talk today it occurred to me that it might make sense
> > to review how we're measuring memory fragmentation today. We're looking
> > to add automation support to kdevops for this, to help compare and
> > contrast memory fragmentation behaviour of one kernel against another.
> > A while ago, while mTHP was being evaluated, I asked generally how we
> > could measure fragmentation with a single value, and John Hubbard had
> > one recommendation [0]. Working through that proved we could simplify
> > things [1], but we could also just use the existing fragmentation index
> > and only consider the values that indicate fragmentation rather than
> > lack of memory. This raises the question of how folks are measuring
> > memory fragmentation in production today, and whether they have any
> > desired changes. The first approach being considered is to reproduce
> > the workloads Mel Gorman wrote and used for mmtests and leverage those
> > in kdevops, perhaps modernizing them, but before we do so, reviewing
> > how we measure fragmentation today might be useful to others too.
> >
> > As for mmtests integration into kdevops, the first order of business
> > is just a few distro-friendly updates [2]; for the next steps after
> > that, it would be great to review the above.
> >
> > [0] https://lore.kernel.org/all/5ac6a387-0ca7-45ca-bebc-c3bdd48452cb@nvidia.com/T/#u
> > [1] https://lkml.kernel.org/r/20240314005710.2964798-1-mcgrof@kernel.org
> > [2] https://lore.kernel.org/kdevops/20240319044621.2682968-1-mcgrof@kernel.org/
> 
> Please correct me if I'm wrong -- I don't think we can use a single
> measure to describe fragmentation in an actionable way.
                                          ^^^^^^^^^^ ^^^
Two key words: actionable way.

Even in that sense, to say that we need more would suggest that either
compaction does not suffice to address memory fragmentation, or that we
can reduce memory fragmentation through other means. Both are possible,
and only measurements can prove that.

But my point was not about taking measurements in an *actionable way*
to address memory fragmentation; it was simply about measuring memory
fragmentation in environment A and environment B, to answer the
question: under which environment is memory fragmentation worse? That
said, I am *also* interested in solutions to address memory
fragmentation, but that's a secondary step; first I'd like to measure,
not take action.

That does not mean that measurements intended to drive actionable
decisions aren't also useful for a single snapshot of memory
fragmentation. In fact, if more information is better, or we're lacking
other sources of measurement of memory fragmentation, it would be great
to improve them.

As noted in the URL above, John Hubbard provided a simple metric
recommendation, and I tried to implement it, but as the patch in [1]
notes, the missing semantic would be in-use folios per order, and
adding that would, as far as my last review could tell, be expensive
today (perhaps I am wrong). Hence my approach of only seeking one value
and seeing if it's positive, and if so, how high.
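
As a concrete sketch of that approach: assuming debugfs is mounted and
the usual "Node N, zone NAME v0 v1 ..." layout of
/sys/kernel/debug/extfrag/extfrag_index, something like the following
(run as root) scans all zones and orders and reports only the worst
positive index:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/extfrag/extfrag_index", "r");
	char line[512];
	double worst = -1.0;

	if (!f) {
		perror("extfrag_index");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char *tok;

		for (tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
			char *end;
			double v = strtod(tok, &end);

			/* keep only whole numeric tokens such as -1.000 or 0.945 */
			if (*end == '\0' && strchr(tok, '.') && v > worst)
				worst = v;
		}
	}
	fclose(f);
	if (worst > 0)
		printf("worst extfrag index: %.3f\n", worst);
	else
		printf("no order is limited by fragmentation\n");
	return 0;
}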

> IMO, we would need at least multiple values, e.g., fragmentation index
> for each non-zero order, to describe how fragmented the memory is with
> respect to the order of interest.

Here you seem to accept you can measure how memory is fragmented with the
existing fragmentation index for each order, is that right?

Or is it that this is the only tool we have today, but likely we could
improve the metric?

> Of course we could encode multiple
> fragmentation indices into a single value, but that's not really one
> measure.

If I am not looking for an actionable measure, but just want a single
quantifiable metric of "how badly fragmented is this system", is
a single value not useful for that purpose?

For my purpose, it was about evaluating whether the general situation is
worse in environment A vs B; in that world, would a single metric work?

> Fragmentation index of an order can tell whether reclaim+compaction
> can theoretically result in a free area of that order.

Indeed, my interest is in the positive values, i.e., when a system's
memory is fragmented.

> As an average,
> fragmentation index can't tell which actionable unit area,
                                       ^^^^^^^^^^ 
In the simple A vs B measurement introspection situation, one is not
considering an action but just being a silly memory
fragmentation voyeur.

> e.g.,
> pageblock, would be the best candidate for reclaim and/or compaction.
> That would require a ranking model, e.g., a cost function and weights
> for reclaim and compaction operations, and calculations of the cost to
> produce a free area of a requested order for each pageblock, i.e., a
> 2-dimensional measure
> costs_to_produce_free_area[NR_non_zero_orders][NR_pageblocks].

This all makes sense!

  Luis



* Re: [LSFMM] automating measuring memory fragmentation
  2024-05-16  6:23   ` Luis Chamberlain
@ 2024-05-16 20:05     ` Yu Zhao
  2024-05-16 21:32       ` Karim Manaouil
  0 siblings, 1 reply; 7+ messages in thread
From: Yu Zhao @ 2024-05-16 20:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Bueso, Michal Hocko, Dan Williams, John Hubbard,
	Daniel Gomez, linux-mm, lsf-pc

On Thu, May 16, 2024 at 12:23 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Wed, May 15, 2024 at 11:15:58PM -0600, Yu Zhao wrote:
> > On Wed, May 15, 2024 at 1:34 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > >
> > > RFC to see if we have a breakout session today at LSFMM.
> > >
> > > After the TAO talk today it occurred to me that it might make sense
> > > to review how we're measuring memory fragmentation today. We're looking
> > > to add automation support to kdevops for this, to help compare and
> > > contrast memory fragmentation behaviour of one kernel against another.
> > > A while ago, while mTHP was being evaluated, I asked generally how we
> > > could measure fragmentation with a single value, and John Hubbard had
> > > one recommendation [0]. Working through that proved we could simplify
> > > things [1], but we could also just use the existing fragmentation index
> > > and only consider the values that indicate fragmentation rather than
> > > lack of memory. This raises the question of how folks are measuring
> > > memory fragmentation in production today, and whether they have any
> > > desired changes. The first approach being considered is to reproduce
> > > the workloads Mel Gorman wrote and used for mmtests and leverage those
> > > in kdevops, perhaps modernizing them, but before we do so, reviewing
> > > how we measure fragmentation today might be useful to others too.
> > >
> > > As for mmtests integration into kdevops, the first order of business
> > > is just a few distro-friendly updates [2]; for the next steps after
> > > that, it would be great to review the above.
> > >
> > > [0] https://lore.kernel.org/all/5ac6a387-0ca7-45ca-bebc-c3bdd48452cb@nvidia.com/T/#u
> > > [1] https://lkml.kernel.org/r/20240314005710.2964798-1-mcgrof@kernel.org
> > > [2] https://lore.kernel.org/kdevops/20240319044621.2682968-1-mcgrof@kernel.org/
> >
> > Please correct me if I'm wrong -- I don't think we can use a single
> > measure to describe fragmentation in an actionable way.
>                                           ^^^^^^^^^^ ^^^
> Two key words: actionable way.
>
> Even in that sense, to say that we need more would suggest that either
> compaction does not suffice to address memory fragmentation, or that we
> can reduce memory fragmentation through other means. Both are possible,
> and only measurements can prove that.
>
> But my point was not about taking measurements in an *actionable way*
> to address memory fragmentation; it was simply about measuring memory
> fragmentation in environment A and environment B, to answer the
> question: under which environment is memory fragmentation worse? That
> said, I am *also* interested in solutions to address memory
> fragmentation, but that's a secondary step; first I'd like to measure,
> not take action.
>
> That does not mean that measurements intended to drive actionable
> decisions aren't also useful for a single snapshot of memory
> fragmentation. In fact, if more information is better, or we're lacking
> other sources of measurement of memory fragmentation, it would be great
> to improve them.
>
> As noted in the URL above, John Hubbard provided a simple metric
> recommendation, and I tried to implement it, but as the patch in [1]
> notes, the missing semantic would be in-use folios per order, and
> adding that would, as far as my last review could tell, be expensive
> today (perhaps I am wrong). Hence my approach of only seeking one value
> and seeing if it's positive, and if so, how high.

Thanks. IIUC, the metric(s) you have in mind would be used to compare
fragmentation over time or across different systems.

I still don't think a single measurement can do that, because different
orders are not on the same footing (so to speak), unless we are
only interested in one non-zero order.

For example, if we have two systems, one has lower fragmentation for
some orders but higher fragmentation for the rest, and the other is
the opposite. How would we be able to use a single measure to describe
this? IOW, I don't think a single measurement can describe all orders
in a comparable way, which would be the weakest requirement we would
have to impose.

> > IMO, we would need at least multiple values, e.g., fragmentation index
> > for each non-zero order, to describe how fragmented the memory is with
> > respect to the order of interest.
>
> Here you seem to accept you can measure how memory is fragmented with the
> existing fragmentation index for each order, is that right?

Correct. Fragmentation indices for all orders are what we have now.

> Or is it that this is the only tool we have today, but likely we could
> improve the metric?

With them we can compare fragmentation in a system over time, or
fragmentation between systems. In addition to comparison, we can also
tell whether reclaim+compaction would be able to make an allocation of
a specific order possible, as I mentioned earlier.

> > Of course we could encode multiple
> > fragmentation indices into a single value, but that's not really one
> > measure.
>
> If I am not looking for an actionable measure, but just want a single
> quantifiable metric of "how badly fragmented is this system", is
> a single value not useful for that purpose?

As I (badly) explained earlier, a single value can't do that because
different orders are not on the same footing (so to speak), unless
we are only interested in one non-zero order. So we would need
fragmentation_index[NR_non_zero_orders].

> For my purpose, it was about evaluating whether the general situation is
> worse in environment A vs B; in that world, would a single metric work?

No, for example, A can allocate 4 order-1 but 0 order-2, and B can
allocate 2 order-1 *or* 1 order-2, which one would you say is better
or worse? This, IMO, depends on which order you are trying to
allocate. Does it make sense?
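
To put numbers on that, here is a small sketch applying the vmstat-style
fragmentation index to those two hypothetical layouts (A: four free
order-1 blocks, B: one free order-2 block):

#include <stdio.h>

/* vmstat-style index: < 0 satisfiable, ~0 low memory, -> 1 fragmented */
static double fi(unsigned long free_pages, unsigned long blocks_total,
		 unsigned long blocks_suitable, unsigned int order)
{
	if (blocks_suitable)
		return -1.0;
	return 1.0 - (1.0 + (double)free_pages / (1UL << order)) / blocks_total;
}

int main(void)
{
	/* order-1 requests: both A and B can satisfy them directly */
	printf("A o1=%.2f B o1=%.2f\n", fi(8, 4, 4, 1), fi(4, 1, 2, 1));
	/* order-2 requests: only B can; A reads as mildly fragmented */
	printf("A o2=%.2f B o2=%.2f\n", fi(8, 4, 0, 2), fi(4, 1, 1, 2));
	return 0;
}

By the index alone B never looks worse than A, yet A has twice the free
memory and satisfies more order-1 requests in total; the ranking really
does depend on the order of interest.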

> > Fragmentation index of an order can tell whether reclaim+compaction
> > can theoretically result in a free area of that order.
>
> Indeed, my interest is in the positive values, i.e., when a system's
> memory is fragmented.
>
> > As an average,
> > fragmentation index can't tell which actionable unit area,
>                                        ^^^^^^^^^^
> In the simple A vs B measurement introspection situation, one is not
> considering an action but just being a silly memory
> fragmentation voyeur.
>
> > e.g.,
> > pageblock, would be the best candidate for reclaim and/or compaction.
> > That would require a ranking model, e.g., a cost function and weights
> > for reclaim and compaction operations, and calculations of the cost to
> > produce a free area of a requested order for each pageblock, i.e., a
> > 2-dimensional measure
> > costs_to_produce_free_area[NR_non_zero_orders][NR_pageblocks].
>
> This all makes sense!

Thank you!



* Re: [LSFMM] automating measuring memory fragmentation
  2024-05-16 20:05     ` Yu Zhao
@ 2024-05-16 21:32       ` Karim Manaouil
  2024-05-16 21:36         ` Yu Zhao
  2024-05-20 14:34         ` Vlastimil Babka (SUSE)
  0 siblings, 2 replies; 7+ messages in thread
From: Karim Manaouil @ 2024-05-16 21:32 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Luis Chamberlain, David Bueso, Michal Hocko, Dan Williams,
	John Hubbard, Daniel Gomez, linux-mm, lsf-pc, Karim Manaouil

On Thu, May 16, 2024 at 02:05:24PM -0600, Yu Zhao wrote: 
> For example, if we have two systems, one has lower fragmentation for
> some orders but higher fragmentation for the rest, and the other is
> the opposite. How would we be able to use a single measure to describe
> this? IOW, I don't think a single measurement can describe all orders
> in a comparable way, which would be the weakest requirement we would
> have to impose.

> As I (badly) explained earlier, a single value can't do that because
> different orders are not on the same footing (so to speak), unless
> we are only interested in one non-zero order. So we would need
> fragmentation_index[NR_non_zero_orders].

> No, for example, A can allocate 4 order-1 but 0 order-2, and B can
> allocate 2 order-1 *or* 1 order-2, which one would you say is better
> or worse? This, IMO, depends on which order you are trying to
> allocate. Does it make sense?

But higher-order pages can always be broken down into lower-order
pages. However, the inverse is not guaranteed (the lower-order pages
may not be buddies, or compaction/reclaim may not help).

Obviously, I would rather have one order-4 page than two order-3 pages.
You can always satisfy an order-n allocation if a page of order n or
higher is available.

One way to measure fragmentation is to compare how far we are from some
perfect value. The perfect value represents the case when all the free
memory is available as blocks of pageblock_order or MAX_PAGE_ORDER.

I can do this as a one-shot calculation, for example with:

static void estimate_numa_fragmentation(void)
{
	pg_data_t *pgdat;
	struct zone *z;
	unsigned long fragscore;
	unsigned long bestscore;
	unsigned long nr_free;
	int order;

	for_each_online_pgdat(pgdat) {
		nr_free = fragscore = 0;
		z = pgdat->node_zones;
		while (z < (pgdat->node_zones + pgdat->nr_zones)) {
			if (!populated_zone(z)) {
				z++;
				continue;
			}
			spin_lock_irq(&z->lock);
			for (order = 0; order < NR_PAGE_ORDERS; order++) {
				nr_free += z->free_area[order].nr_free << order;
				fragscore += z->free_area[order].nr_free << (order * 2);
			}
			spin_unlock_irq(&z->lock);
			z++;
			cond_resched();
		}
		if (!nr_free)
			continue;	/* no free pages on this node; avoid div by zero */
		bestscore = nr_free << MAX_PAGE_ORDER;
		fragscore = ((bestscore - fragscore) * 100) / bestscore;
		pr_info("fragscore on node %d: %lu\n", pgdat->node_id, fragscore);
	}
}

But there must be a way to streamline the calculation and update the value
with low overhead over time.

Cheers
Karim
PhD Student
Edinburgh University



* Re: [LSFMM] automating measuring memory fragmentation
  2024-05-16 21:32       ` Karim Manaouil
@ 2024-05-16 21:36         ` Yu Zhao
  2024-05-20 14:34         ` Vlastimil Babka (SUSE)
  1 sibling, 0 replies; 7+ messages in thread
From: Yu Zhao @ 2024-05-16 21:36 UTC (permalink / raw)
  To: Karim Manaouil
  Cc: Luis Chamberlain, David Bueso, Michal Hocko, Dan Williams,
	John Hubbard, Daniel Gomez, linux-mm, lsf-pc

On Thu, May 16, 2024 at 3:32 PM Karim Manaouil <kmanaouil.dev@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 02:05:24PM -0600, Yu Zhao wrote:
> > For example, if we have two systems, one has lower fragmentation for
> > some orders but higher fragmentation for the rest, and the other is
> > the opposite. How would we be able to use a single measure to describe
> > this? IOW, I don't think a single measurement can describe all orders
> > in a comparable way, which would be the weakest requirement we would
> > have to impose.
>
> > As I (badly) explained earlier, a single value can't do that because
> > different orders are not on the same footing (so to speak), unless
> > we are only interested in one non-zero order. So we would need
> > fragmentation_index[NR_non_zero_orders].
>
> > No, for example, A can allocate 4 order-1 but 0 order-2, and B can
> > allocate 2 order-1 *or* 1 order-2, which one would you say is better
> > or worse? This, IMO, depends on which order you are trying to
> > allocate. Does it make sense?
>
> But higher-order pages can always be broken down into lower-order
> pages. However, the inverse is not guaranteed (the lower-order pages
> may not be buddies, or compaction/reclaim may not help).

Please read my example again, carefully.

> Obviously, I would rather have one order-4 page than two order-3 pages.
> You can always satisfy an order-n allocation if a page of order n or
> higher is available.
>
> One way to measure fragmentation is to compare how far we are from some
> perfect value. The perfect value represents the case when all the free
> memory is available as blocks of pageblock_order or MAX_PAGE_ORDER.
>
> I can do this as a one-shot calculation, for example with:
>
> static void estimate_numa_fragmentation(void)
> {
>         pg_data_t *pgdat;
>         struct zone *z;
>         unsigned long fragscore;
>         unsigned long bestscore;
>         unsigned long nr_free;
>         int order;
>
>         for_each_online_pgdat(pgdat) {
>                 nr_free = fragscore = 0;
>                 z = pgdat->node_zones;
>                 while (z < (pgdat->node_zones + pgdat->nr_zones)) {
>                         if (!populated_zone(z)) {
>                                 z++;
>                                 continue;
>                         }
>                         spin_lock_irq(&z->lock);
>                         for (order = 0; order < NR_PAGE_ORDERS; order++) {
>                                 nr_free += z->free_area[order].nr_free << order;
>                                 fragscore += z->free_area[order].nr_free << (order * 2);
>                         }
>                         spin_unlock_irq(&z->lock);
>                         z++;
>                         cond_resched();
>                 }
>                 if (!nr_free)
>                         continue;       /* no free pages on this node; avoid div by zero */
>                 bestscore = nr_free << MAX_PAGE_ORDER;
>                 fragscore = ((bestscore - fragscore) * 100) / bestscore;
>                 pr_info("fragscore on node %d: %lu\n", pgdat->node_id, fragscore);
>         }
> }
>
> But there must be a way to streamline the calculation and update the value
> with low overhead over time.
>
> Cheers
> Karim
> PhD Student
> Edinburgh University



* Re: [LSFMM] automating measuring memory fragmentation
  2024-05-16 21:32       ` Karim Manaouil
  2024-05-16 21:36         ` Yu Zhao
@ 2024-05-20 14:34         ` Vlastimil Babka (SUSE)
  1 sibling, 0 replies; 7+ messages in thread
From: Vlastimil Babka (SUSE) @ 2024-05-20 14:34 UTC (permalink / raw)
  To: Karim Manaouil, Yu Zhao
  Cc: Luis Chamberlain, David Bueso, Michal Hocko, Dan Williams,
	John Hubbard, Daniel Gomez, linux-mm, lsf-pc

On 5/16/24 11:32 PM, Karim Manaouil wrote:
> On Thu, May 16, 2024 at 02:05:24PM -0600, Yu Zhao wrote: 
>> For example, if we have two systems, one has lower fragmentation for
>> some orders but higher fragmentation for the rest, and the other is
>> the opposite. How would we be able to use a single measure to describe
>> this? IOW, I don't think a single measurement can describe all orders
>> in a comparable way, which would be the weakest requirement we would
>> have to impose.
> 
>> As I (badly) explained earlier, a single value can't do that because
>> different orders are not on the same footing (so to speak), unless
>> we are only interested in one non-zero order. So we would need
>> fragmentation_index[NR_non_zero_orders].
> 
>> No, for example, A can allocate 4 order-1 but 0 order-2, and B can
>> allocate 2 order-1 *or* 1 order-2, which one would you say is better
>> or worse? This, IMO, depends on which order you are trying to
>> allocate. Does it make sense?
> 
> But higher-order pages can always be broken down into lower-order
> pages. However, the inverse is not guaranteed (the lower-order pages
> may not be buddies, or compaction/reclaim may not help).
> 
> Obviously, I would rather have one order-4 page than two order-3 pages.
> You can always satisfy an order-n allocation if a page of order n or
> higher is available.
> 
> One way to measure fragmentation is to compare how far we are from some
> perfect value. The perfect value represents the case when all the free
> memory is available as blocks of pageblock_order or MAX_PAGE_ORDER.
> 
> I can do this as a one-shot calculation, for example with:
> 
> static void estimate_numa_fragmentation(void)
> {
> 	pg_data_t *pgdat;
> 	struct zone *z;
> 	unsigned long fragscore;
> 	unsigned long bestscore;
> 	unsigned long nr_free;
> 	int order;
> 
> 	for_each_online_pgdat(pgdat) {
> 		nr_free = fragscore = 0;
> 		z = pgdat->node_zones;
> 		while (z < (pgdat->node_zones + pgdat->nr_zones)) {
> 			if (!populated_zone(z)) {
> 				z++;
> 				continue;
> 			}
> 			spin_lock_irq(&z->lock);
> 			for (order = 0; order < NR_PAGE_ORDERS; order++) {
> 				nr_free += z->free_area[order].nr_free << order;
> 				fragscore += z->free_area[order].nr_free << (order * 2);
> 			}
> 			spin_unlock_irq(&z->lock);
> 			z++;
> 			cond_resched();
> 		}
> 	if (!nr_free)
> 		continue;	/* no free pages on this node; avoid div by zero */
> 	bestscore = nr_free << MAX_PAGE_ORDER;
> 	fragscore = ((bestscore - fragscore) * 100) / bestscore;
> 	pr_info("fragscore on node %d: %lu\n", pgdat->node_id, fragscore);
> 	}
> }

I've had a similar idea, which may be exactly the same as yours; I don't
immediately see whether that's the case, and I have not formalized mine, just
an intuitive explanation. But the basic premise is the same: if
"continuity" is between 0 and 100, then all memory available in
MAX_PAGE_ORDER should score 100.
Then e.g. if we have 50% of the memory in MAX_PAGE_ORDER, having the
other 50% fully available in MAX_PAGE_ORDER-1 should give us another 25%,
for a total of 75%, and so on. Maybe the decay of the contribution to the
continuity metric for decreasing an order by 1 should be less than 2x, not
sure.

In the case of Yu Zhao's example, "A can allocate 4 order-1 but 0
order-2, and B can allocate 2 order-1 *or* 1 order-2", I think this should
result in the same score at least? And of course it's not useful if we don't
know what allocations we actually need, but it may be useful enough to
compare two systems.
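
A minimal sketch of that weighting as a hypothetical helper (the strict
2x decay and the MAX_ORDER value are assumptions, per the above): a free
block of order o contributes its pages weighted by 2^(o - MAX_ORDER), so
memory entirely in MAX_ORDER blocks scores 100, and 50% in MAX_ORDER plus
50% in MAX_ORDER-1 scores 75. If the algebra is right, this is exactly
the complement of Karim's fragscore, i.e. continuity == 100 - fragscore:

#define MAX_ORDER	10	/* MAX_PAGE_ORDER on common configs */
#define NR_ORDERS	(MAX_ORDER + 1)

/* nr_free[o] = number of free blocks of order o; returns 0..100 */
static double continuity(const unsigned long nr_free[NR_ORDERS])
{
	double weighted = 0.0, pages = 0.0;
	int o;

	for (o = 0; o < NR_ORDERS; o++) {
		double p = (double)nr_free[o] * (1UL << o);

		pages += p;
		/* each page at order o counts as 2^(o - MAX_ORDER) */
		weighted += p * (1UL << o) / (1UL << MAX_ORDER);
	}
	return pages ? 100.0 * weighted / pages : 0.0;
}

For what it's worth, plugging Yu Zhao's example into this exact weighting
does not give the same score: B ends up with twice A's score, since all
of B's free pages sit one order higher. That supports making the
per-order decay weaker than 2x if those two cases should land closer
together.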

> But there must be a way to streamline the calculation and update the value
> with low overhead over time.

I think that instead of trying to track anything in the kernel and set it in
stone, the metric should be calculated in userspace after reading the base
values from /proc/buddyinfo (not fragindex); we could ignore small special
zones like ZONE_DMA, and the per-order numbers of the other zones could be
summed together. We could also differentiate per migratetype using
/proc/pagetypeinfo, but then it's no longer a single number.
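
A rough userspace sketch of that (assuming the usual "Node N, zone NAME
c0 c1 ..." layout of /proc/buddyinfo and NR_ORDERS == 11): sum the
per-order counts of everything but DMA, ready to feed into a score such
as the continuity() above or Karim's fragscore:

#include <stdio.h>
#include <string.h>

#define NR_ORDERS 11	/* MAX_PAGE_ORDER + 1 on common configs */

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	unsigned long nr_free[NR_ORDERS] = { 0 };
	char line[512], zone[16];
	int node, o;

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char *p = line;
		int n;

		if (sscanf(p, "Node %d, zone %15s%n", &node, zone, &n) != 2)
			continue;
		if (!strcmp(zone, "DMA"))	/* ignore the tiny DMA zone */
			continue;
		p += n;
		for (o = 0; o < NR_ORDERS; o++) {
			unsigned long c;

			if (sscanf(p, "%lu%n", &c, &n) != 1)
				break;
			nr_free[o] += c;
			p += n;
		}
	}
	fclose(f);
	for (o = 0; o < NR_ORDERS; o++)
		printf("order %2d: %lu free blocks\n", o, nr_free[o]);
	return 0;
}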

> Cheers
> Karim
> PhD Student
> Edinburgh University
> 



