[Resend] Possible bug in __fragmentation

* [Resend] Possible bug in __fragmentation_index()
@ 2018-02-02 14:16 Robert Harris
  2018-02-02 17:47 ` Mel Gorman
  0 siblings, 1 reply; 3+ messages in thread
From: Robert Harris @ 2018-02-02 14:16 UTC
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Vlastimil Babka,
	Kemi Wang, ying.huang, David Rientjes, Vinayak Menon, Mel Gorman

I was planning to annotate the opaque calculation in
__fragmentation_index() but on closer inspection I think there may be a
bug.  I could use some feedback.

Firstly, for the case of fragmentation and ignoring the scaling,
__fragmentation_index() purports to return a value in the range 0 to 1.
Generally, however, the lower bound is actually 0.5.  Here's an
illustration using a zone that I fragmented with selective calls to
__alloc_pages() and __free_pages().  The fragmentation for order-1 could
not be minimised further, yet it is reported as 0.5:

# head -1 /proc/buddyinfo
Node 0, zone      DMA   1983      0      0      0      0      0      0      0      0      0      0 
# head -1 /sys/kernel/debug/extfrag/extfrag_index 
Node 0, zone      DMA -1.000 0.500 0.750 0.875 0.937 0.969 0.984 0.992 0.996 0.998 0.999 
#

This is significant because 0.5 is the default value of
sysctl_extfrag_threshold, meaning that compaction will not be suppressed
for larger blocks when memory is scarce rather than fragmented.  Of
course, sysctl_extfrag_threshold is a tuneable so the first question is:
does this even matter?
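
To make the threshold comparison concrete, here is a minimal sketch in
Python (not kernel code).  It assumes that compaction is skipped when the
index lies between 0 and the threshold inclusive, with both values scaled
by 1000.  The function name and the exact form of the comparison are
assumptions made for illustration only.

```python
# Hypothetical model of the threshold test; this is not kernel code.
EXTFRAG_THRESHOLD_DEFAULT = 500  # 0.5 scaled by 1000, the default quoted above


def compaction_suppressed(fragindex, threshold=EXTFRAG_THRESHOLD_DEFAULT):
    """Return True when the index attributes the failure to lack of memory."""
    # Assumption: a negative index (-1000) means the request would succeed,
    # so the threshold test does not apply to it.
    return 0 <= fragindex <= threshold


# Index values for orders 1..10 from the extfrag_index line above, times 1000.
reported = [500, 750, 875, 937, 969, 984, 992, 996, 998, 999]

suppressed_orders = [order for order, idx in enumerate(reported, start=1)
                     if compaction_suppressed(idx)]
print(suppressed_orders)
```

Under these assumptions only order-1 sits at the threshold.  Every larger
order reports an index above 0.5 even though the zone is as unfragmented
as it can be for that amount of free memory.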

The calculation in __fragmentation_index() isn't documented but the
apparent error in the lower bound may be explained by showing that the
index is approximated by

F ~ 1 - 1/N

where N is (conceptually) the number of free blocks into which each
potential requested-size block has been split.  I.e. if all free space
were compacted then there would be B free blocks of the requested size
where

B = info->free_pages/requested

and thus

N = info->free_blocks_total/B
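
As a rough check of the approximation, the following Python sketch
evaluates B, N and F for the DMA zone shown earlier (1983 free order-0
pages).  The function name approx_index is hypothetical, and floating
point is used for readability rather than the kernel's scaled integers.

```python
def approx_index(free_pages, free_blocks_total, order):
    """Approximate index F ~ 1 - 1/N, following the definitions above."""
    requested = 1 << order
    b = free_pages / requested    # B: requested-size blocks after ideal compaction
    n = free_blocks_total / b     # N: free blocks per potential requested-size block
    return 1 - 1 / n


# 1983 free pages, all of them on the order-0 free list.
for order in (1, 2, 3):
    print(order, approx_index(1983, 1983, order))
```

For order-1 this gives N = 2 and F = 0.5; for order-2 it gives N = 4 and
F = 0.75, in line with the extfrag_index output.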

The case of least fragmentation must be when all of the requested-size
blocks have been split just once to form twice as many blocks in the
next lowest free list.  Thus the lowest value of N is 2 and the lowest
value of F is 0.5.  I readied a patch that, in essence, defined
F = 1 - 2/N and thereby set the bounds of __fragmentation_index() as
0 <= F < 1.  Before sending it, I realised that, during testing, I *had* seen
the occasional instance of F < 0.5, e.g. F = 0.499.  Revisiting the
calculation, I see that the actual implementation is

F = 1 - [1/N + 1/info->free_blocks_total]

meaning that a very severe shortage of free memory *could* tip the
balance in favour of "low fragmentation".  Although this seems highly
unlikely to occur outside testing, it does reflect the directive in the
comment above the function, i.e. favour page reclaim when fragmentation
is low.  My second question: is the current implementation of F
intentional and, if not, what is the actual intent?
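
The formula above can be checked against the buddyinfo line earlier in
this message with a short Python sketch.  It assumes the kernel evaluates
the expression in integer arithmetic scaled by 1000, i.e.
1000 - (1000 + free_pages * 1000 / requested) / free_blocks_total, and
that it returns -1000 when a suitable block is already free.  That form
is an assumption consistent with F = 1 - [1/N + 1/info->free_blocks_total],
not a copy of the kernel source; the function name and the list
representation are illustrative.

```python
def fragmentation_index(free_counts, order):
    """Scaled index for `order`; free_counts[i] is the number of free
    blocks of order i, as listed in /proc/buddyinfo."""
    requested = 1 << order
    free_pages = sum(n << i for i, n in enumerate(free_counts))
    free_blocks_total = sum(free_counts)
    # Non-zero when at least one free block is large enough for the request.
    free_blocks_suitable = sum(free_counts[order:])

    if free_blocks_total == 0:
        return 0
    if free_blocks_suitable:
        return -1000  # the request would succeed, so the index is meaningless
    # Assumed integer form of F = 1 - [1/N + 1/free_blocks_total], times 1000.
    return 1000 - (1000 + free_pages * 1000 // requested) // free_blocks_total


dma = [1983] + [0] * 10          # the DMA zone shown earlier
indices = [fragmentation_index(dma, o) for o in range(11)]
print(indices)

scarce = [600] + [0] * 10        # hypothetical zone with far less free memory
print(fragmentation_index(scarce, 1))
```

With the 1983-page zone this yields -1000, 500, 750, 875, 937, 969, 984,
992, 996, 998, 999, matching the extfrag_index output.  The hypothetical
600-page zone yields 499 for order-1, showing how the extra
1/info->free_blocks_total term can push the result just below 0.5.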

The comments in compaction_suitable() suggest that the compaction/page
reclaim decision is one of cost but, as compaction is linear, this isn't
what __fragmentation_index() is calculating.  A more reasonable argument
is that there's generally some lower limit on the fragmentation
achievable through compaction, given the inevitable presence of
non-migratable pages.  Is there anything else going on?

Robert Harris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [Resend] Possible bug in __fragmentation_index()
  2018-02-02 14:16 [Resend] Possible bug in __fragmentation_index() Robert Harris
@ 2018-02-02 17:47 ` Mel Gorman
  2018-02-09 14:43   ` Robert Harris
  0 siblings, 1 reply; 3+ messages in thread
From: Mel Gorman @ 2018-02-02 17:47 UTC
  To: Robert Harris
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Johannes Weiner, Vlastimil Babka, Kemi Wang, ying.huang,
	David Rientjes, Vinayak Menon

On Fri, Feb 02, 2018 at 02:16:39PM +0000, Robert Harris wrote:
> I was planning to annotate the opaque calculation in
> __fragmentation_index() but on closer inspection I think there may be a
> bug.  I could use some feedback.
> 
> Firstly, for the case of fragmentation and ignoring the scaling,
> __fragmentation_index() purports to return a value in the range 0 to 1.
> Generally, however, the lower bound is actually 0.5.  Here's an
> illustration using a zone that I fragmented with selective calls to
> __alloc_pages() and __free_pages().  The fragmentation for order-1 could
> not be minimised further, yet it is reported as 0.5:
> 
> # head -1 /proc/buddyinfo
> Node 0, zone      DMA   1983      0      0      0      0      0      0      0      0      0      0 
> # head -1 /sys/kernel/debug/extfrag/extfrag_index 
> Node 0, zone      DMA -1.000 0.500 0.750 0.875 0.937 0.969 0.984 0.992 0.996 0.998 0.999 
> #
> 
> This is significant because 0.5 is the default value of
> sysctl_extfrag_threshold, meaning that compaction will not be suppressed
> for larger blocks when memory is scarce rather than fragmented.  Of
> course, sysctl_extfrag_threshold is a tuneable so the first question is:
> does this even matter?
> 

It's now 8 years since it was written so my memory is rusty. While the bounds
could be adjusted, it's not without risk. The bounds and the sysctl were left
as-is to avoid the possibility of excessive reclaim, something early
implementations suffered from badly. At the time of implementation, the index
was used as a rough estimate for monitoring purposes but, on an allocation
failure, it was always page reclaim that was used to try the allocation again.

At a later time, compaction was introduced to avoid excessive reclaim
but the cutoff was set to only happen for extreme memory shortage (and
the bounds should have been corrected at the time but were not).  It was
a long time before all the excessive reclaim bugs in kswapd were ironed
out but bugs of runaway kswapd at 100% CPU usage were common for a while.
There were also several problems with compaction overhead that were
addressed in other ways. It may have reached the point where revisiting
the sysctl is potentially safe given that reclaim is considerably better
than it used to be.

> meaning that a very severe shortage of free memory *could* tip the
> balance in favour of "low fragmentation".  Although this seems highly
> unlikely to occur outside testing, it does reflect the directive in the
> comment above the function, i.e. favour page reclaim when fragmentation
> is low.  My second question: is the current implementation of F
> intentional and, if not, what is the actual intent?
> 

It's intentional but could be fixed to give a real bound of 0 to 1 instead
of half the range as it currently gives. The sysctl_extfrag_threshold should
also be adjusted at that time. After that, the real work is determining
if it's safe to strike a balance between reclaim/compaction that avoids
unnecessary compaction while not being too aggressive about reclaim or
having kswapd enter a runaway loop with a reintroduction of the "kswapd
stuck at 100% CPU time" problems.

Alternatively, delete references to it entirely as the cutoff is not really
being used and the monitoring information is too specialised to be of
general use.

> The comments in compaction_suitable() suggest that the compaction/page
> reclaim decision is one of cost but, as compaction is linear, this isn't
> what __fragmentation_index() is calculating. 

The index was not intended as an estimate of the cost of compaction. It
was originally intended to act as an estimator of whether it's better to
spend time reclaiming or compacting. Compacting was favoured on the
grounds that high order allocations were meant to be able to fail, whereas
reclaiming potentially useful data could have other consequences.

-- 
Mel Gorman
SUSE Labs

* Re: [Resend] Possible bug in __fragmentation_index()
  2018-02-02 17:47 ` Mel Gorman
@ 2018-02-09 14:43   ` Robert Harris
  0 siblings, 0 replies; 3+ messages in thread
From: Robert Harris @ 2018-02-09 14:43 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Johannes Weiner, Vlastimil Babka, Kemi Wang, ying.huang,
	David Rientjes, Vinayak Menon

On 2 Feb 2018, at 17:47, Mel Gorman wrote:

> On Fri, Feb 02, 2018 at 02:16:39PM +0000, Robert Harris wrote:
>> I was planning to annotate the opaque calculation in
>> __fragmentation_index() but on closer inspection I think there may be a
>> bug.  I could use some feedback.

A belated thank you for the reply.

> It's intentional but could be fixed to give a real bound of 0 to 1 instead
> of half the range as it currently gives. The sysctl_extfrag_threshold should
> also be adjusted at that time. After that, the real work is determining
> if it's safe to strike a balance between reclaim/compaction that avoids
> unnecessary compaction while not being too aggressive about reclaim or
> having kswapd enter a runaway loop with a reintroduction of the "kswapd
> stuck at 100% CPU time" problems.

In my (incomplete) view, striking the balance is a case of determining the
cost of memory regeneration through compaction versus reclaim and choosing
the cheaper.  I'm reasonably confident that this could be achieved for
compaction, which is why the calculation in __fragmentation_index() caught
my eye in the first place, but reclaim/swapping is probably significantly
harder to quantify.  Similarly, a cost function for allocation failure
is also necessary but not obvious.

All of the above is just a nebulous plan for now;  in the meantime, I'll
change __fragmentation_index() and the threshold as you suggest.

Robert Harris
