From: Andrea Arcangeli <aarcange@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org
Subject: Re: [LSF/MM ATTEND] 2017 userfaultfd-WP, node reclaim vs zone compaction, THP
Date: Fri, 13 Jan 2017 17:24:26 +0100	[thread overview]
Message-ID: <20170113162426.GA20475@redhat.com> (raw)
In-Reply-To: <73b60b0a-33c2-739c-3d1e-d74b73f204e9@suse.cz>

On Thu, Jan 12, 2017 at 10:58:46PM +0100, Vlastimil Babka wrote:
> On 01/12/2017 08:26 PM, Andrea Arcangeli wrote:
> >    To avoid dropping some patches that implement "compaction aware
> >    zone_reclaim_mode" (i.e. now node_reclaim_mode) I'm still running
> >    with zone LRU, although I don't disagree with the node LRU per se,
> >    my only issue is that compaction still work zone based and that
> >    collides with those changes.
> >
> >    With reclaim working node based and compaction working zone
> >    based, I would need to call a blind for_each_zone(node)
> >    compaction() loop which is far from ideal compared to compaction
> >    crossing the zone boundary.
> 
> Compaction does a lot of watermark checking, which is also per-zone based, so we 
> would likely have to do these for_each_zone() dances for the watermark checks, 
> I'm afraid. At the same time it should make sure that it doesn't exhaust free 
> pages of each single zone below the watermark. The result would look ugly, 
> unless we switch to per-node watermarks.

compaction aware zone_reclaim looks like this:

static int zone_reclaim_compact(struct zone *preferred_zone,
				struct zone *zone, gfp_t gfp_mask,
				unsigned int order,
				bool sync_compaction,
				bool *need_compaction,
				int alloc_flags, int classzone_idx)
{
	/*
	 * Skip compaction if it was recently deferred for this order,
	 * for order-0 allocations, or if the allocation cannot do both
	 * FS and IO.
	 */
	if (compaction_deferred(preferred_zone, order) ||
	    !order ||
	    (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
		*need_compaction = false;
		return COMPACT_SKIPPED;
	}

	if (!zone_reclaimable(zone))
		return ZONE_RECLAIM_FULL;

	*need_compaction = true;
	return compact_zone_order(zone, order, gfp_mask,
				  sync_compaction ? DEF_COMPACT_PRIORITY :
				  COMPACT_PRIO_ASYNC, alloc_flags,
				  classzone_idx);
}

int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
		 gfp_t gfp_mask, unsigned int order,
		 unsigned long mark, int alloc_flags, int classzone_idx)
{
	int node_id;
	int ret, c_ret;
	bool sync_compaction = false, need_compaction = false;

	/*
	 * Do not scan if the allocation should not be delayed.
	 */
	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
		return ZONE_RECLAIM_NOSCAN;

	/*
	 * Only run zone reclaim on the local zone or on zones that do not
	 * have associated processors. This will favor the local processor
	 * over remote processors and spread off node memory allocations
	 * as wide as possible.
	 */
	node_id = zone_to_nid(zone);
	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
		return ZONE_RECLAIM_NOSCAN;

repeat_compaction:
	/*
	 * If this allocation may be satisfied by memory compaction,
	 * run compaction before reclaim.
	 */
	c_ret = zone_reclaim_compact(preferred_zone, zone, gfp_mask, order,
				     sync_compaction, &need_compaction,
				     alloc_flags, classzone_idx);
	if (need_compaction &&
	    c_ret != COMPACT_SKIPPED &&
	    zone_watermark_ok(zone, order, mark,
			      classzone_idx,
			      alloc_flags)) {
#ifdef CONFIG_COMPACTION
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
#endif
		return ZONE_RECLAIM_SUCCESS;
	}

	/*
	 * reclaim if compaction failed because not enough memory was
	 * available or if compaction didn't run (order 0) or didn't
	 * succeed.
	 */
	ret = __zone_reclaim(zone, gfp_mask, order);
	if (ret == ZONE_RECLAIM_SUCCESS) {
		if (zone_watermark_ok(zone, order, mark,
				      classzone_idx,
				      alloc_flags))
			return ZONE_RECLAIM_SUCCESS;

		/*
		 * If compaction run but it was skipped and reclaim was
		 * successful keep going.
		 */
		if (need_compaction && c_ret == COMPACT_SKIPPED) {
			/*
			 * If it's ok to wait for I/O we can as well run sync
			 * compaction
			 */
			sync_compaction = !!(zone_reclaim_mode &
					     (RECLAIM_WRITE|RECLAIM_UNMAP));
			cond_resched();
			goto repeat_compaction;
		}
	}
	if (need_compaction)
		defer_compaction(preferred_zone, order);

	if (!ret)
		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);

	return ret;
}

In principle it's:

repeat_compaction:
   zone_reclaim_compact(zone)
   if still not enough high order pages in zone:
	ret = __zone_reclaim(zone, gfp_mask, order);
	if still not enough high order pages in zone but order 0
	   reclaim was successful and compaction was skipped because
	   not enough order 0 free pages:
	       goto repeat_compaction;
  
And actually this isn't different from what happens all the time; it
just has to happen within the same reclaim code, because
zone_reclaim_mode > 0 requires staying aggressively within the node.
This is why in this case it is not ok to invoke compaction only in
page_alloc.c before getting into reclaim and then let reclaim go
through the whole zonelist: with zone_reclaim_mode > 0 the current
node has to be shrunk before moving on to the next node in the
zonelist. The zonelist of course must also be ordered node based for
zone_reclaim_mode > 0 (now it would be node_reclaim_mode, but
zone/node doesn't matter here) to be effective, with the proper boot
option (normally the default).
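
To make the ordering requirement concrete, this is roughly the
allocator-side flow I have in mind. It's only an illustration, not
the real get_page_from_freelist: try_allocation() is a made-up
placeholder for the actual allocation attempt, and it assumes a node
ordered zonelist so zones are walked node by node:

	for_each_zone_zonelist(zone, z, zonelist, classzone_idx) {
		if (zone_watermark_ok(zone, order, mark,
				      classzone_idx, alloc_flags))
			return try_allocation(zone);

		if (!zone_reclaim_mode)
			continue;	/* plain fallback to the next zone/node */

		/* shrink the current node before falling back */
		if (zone_reclaim(preferred_zone, zone, gfp_mask, order,
				 mark, alloc_flags, classzone_idx) ==
		    ZONE_RECLAIM_SUCCESS)
			return try_allocation(zone);
	}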

The watermark mess you mention for node based compaction I'm afraid
already exists, and it in turn breaks stuff too. page_alloc.c is not
fundamentally different from the above loop; there the current
watermark mess simply shows up as worsened compaction behavior, and
it's not as visible as it becomes if you try to fix the above
compaction-aware code on top of the node LRU model.

And I wouldn't be surprised if the bzip2 regression is just because
compaction got worse as a result of not being node based while
reclaim is node based, with nobody having solved the watermark mess;
instead of an outright failure this just shows up as lower THP
utilization.

When compaction fails because of the zone watermark checks and you
then call reclaim on the node, the zone watermarks don't improve, so
compaction fails again despite reclaim having run in between.

Reclaim has the classzone concept, so it can concentrate on the lower
zones only (i.e. the classzone of the allocation), but it will never
concentrate on the highest zone only, and that is what's needed if
compaction is to succeed when tried again on the highest zone. The
RAM freed in the lower zones won't help as long as compaction can't
cross the zone boundary.
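
To be explicit about why freeing memory in the lower zones doesn't
help: before doing anything compaction checks the order-0 free pages
of the zone it is about to compact against that zone's watermark,
roughly like this (simplified and from memory, the exact slack may
differ):

	/* per-zone entry check compaction does before starting */
	watermark = low_wmark_pages(zone) + (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark,
			       classzone_idx, alloc_flags))
		return COMPACT_SKIPPED;

Free pages sitting in a lower zone of the node don't count here, only
this zone's free pages do, so node wide reclaim doesn't move this
check.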

So short of doing the blind unconditional for_each_zone(node)
compact_zone() after reclaim succeeds on the node, I'm not sure how
else to fix this fundamental watermark inaccuracy in compaction. If
compaction worked node based and could cross the zone boundary, this
watermark mess wouldn't exist: freeing memory in the lower zones
would still allow the next invocation of compaction to succeed,
because it would see the freed memory in the lower zones.

> >    Most pages that can be migrated by
> >    compaction can go in any zone, not all but we could record the page
> >    classzone.
> 
> Finding space for that in struct page also wouldn't be easy.

We'd need to find a way not to store that in the page struct indeed.

I'm not sure that is a big concern though: I think the classzone
restriction applies only to buffer heads or other extreme cases,
which should be possible to single out by noticing they're not
standard user memory or normal pagecache. For example they will also
have a page->mapping->gfp_mask that we can check if needed, and it
will tell which classzone they're part of. The vast majority of
movable memory can cross the zone boundary with no problem.
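
Just to illustrate the idea, a made-up helper
(page_movable_classzone() doesn't exist, it only shows where the
information would come from):

/*
 * Illustration only: most movable pages can go in any zone; the few
 * tied to a mapping whose gfp_mask lacks __GFP_HIGHMEM/__GFP_MOVABLE
 * are limited to the zone implied by that mask.
 */
static enum zone_type page_movable_classzone(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (mapping)
		return gfp_zone(mapping->gfp_mask);

	/* anon or other unrestricted movable memory */
	return ZONE_MOVABLE;
}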

I already mentioned this issue once but nothing happened, and the
idea of doing a blind for_each_zone() compact_zone() loop to solve
it, breaking out of the loop if compaction succeeded in any of the
lower zones (because node based reclaim actually freed memory in the
lower zones), didn't feel optimal. However if you're sure that's the
way to go I'll have to think some more about it.
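
For reference, the blind loop I mean would look more or less like
this, only a sketch, reusing compact_zone_order() the way the
zone_reclaim code above already calls it:

static bool node_compact_all_zones(pg_data_t *pgdat, gfp_t gfp_mask,
				   unsigned int order, unsigned long mark,
				   int alloc_flags, int classzone_idx)
{
	struct zone *zone;

	for (zone = pgdat->node_zones;
	     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
		if (!populated_zone(zone))
			continue;

		compact_zone_order(zone, order, gfp_mask,
				   COMPACT_PRIO_ASYNC, alloc_flags,
				   classzone_idx);

		/*
		 * Break out at the first zone that now has enough
		 * high order pages: node based reclaim may have
		 * freed the memory in a lower zone.
		 */
		if (zone_watermark_ok(zone, order, mark,
				      classzone_idx, alloc_flags))
			return true;
	}
	return false;
}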
