From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
David Rientjes <rientjes@google.com>,
linux-mm@kvack.org, Mel Gorman <mgorman@techsingularity.net>,
Michal Hocko <mhocko@kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Rik van Riel <riel@redhat.com>
Subject: Re: [RFC PATCH 0/6] proactive kcompactd
Date: Thu, 24 Aug 2017 15:24:57 +0900
Message-ID: <20170824062457.GA24656@js1304-P5Q-DELUXE>
In-Reply-To: <502d438b-7167-5b78-c66c-0e1b47ba2434@suse.cz>
On Wed, Aug 23, 2017 at 10:12:14AM +0200, Vlastimil Babka wrote:
> On 08/23/2017 07:36 AM, Joonsoo Kim wrote:
> > On Mon, Aug 21, 2017 at 10:10:14AM -0400, Johannes Weiner wrote:
> >> On Wed, Aug 09, 2017 at 01:58:42PM -0700, David Rientjes wrote:
> >>> On Thu, 27 Jul 2017, Vlastimil Babka wrote:
> >>>
> >>>> As we discussed at last LSF/MM [1], the goal here is to shift more compaction
> >>>> work to kcompactd, which currently just makes a single high-order page
> >>>> available and then goes to sleep. The last patch, evolved from the initial RFC
> >>>> [2], does this by recording, for each order > 0, how many allocations would
> >>>> potentially have been able to skip direct compaction if the memory weren't fragmented.
> >>>> Kcompactd then tries to compact as long as it takes to make that many
> >>>> allocations satisfiable. This approach avoids any hooks in allocator fast
> >>>> paths. There are more details to this, see the last patch.
> >>>>
> >>>
> >>> I think I would have liked to have seen "less proactive" :)
> >>>
> >>> Kcompactd currently has the problem that it uses MIGRATE_SYNC_LIGHT, so it
> >>> continues until it can defragment memory. On a host with 128GB of memory
> >>> and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd
> >>> wakeups for order-2 memory allocation. The stats are pretty bad:
> >>>
> >>> compact_migrate_scanned 2931254031294
> >>> compact_free_scanned 102707804816705
> >>> compact_isolated 1309145254
> >>>
> >>> 0.0012% of memory scanned is ever actually isolated. We constantly see
> >>> very high cpu for compaction_alloc() because kcompactd is almost always
> >>> running in the background and iterating most memory completely needlessly
> >>> (define needless as 0.0012% of memory scanned being isolated).
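(As a sanity check on that number: 1,309,145,254 pages isolated out of
2,931,254,031,294 + 102,707,804,816,705 ≈ 1.06e14 pages scanned is about
1.2e-5, i.e. the quoted ~0.0012%.)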
> >>
> >> The free page scanner will inevitably wade through mostly used memory,
> >> but 0.0012% is lower than what systems usually have free. I'm guessing
> >> this is because of concurrent allocation & free cycles racing with the
> >> scanner? There could also be an issue with how we do partial scans.
> >>
> >> Anyway, we've also noticed scalability issues with the current scanner
> >> on 128G and 256G machines. Even with better efficiency - actually finding
> >> the ~1% of free memory - that's still a ton of linear search space.
> >>
> >> I've been toying around with the below patch. It adds a free page
> >> bitmap, allowing the free scanner to quickly skip over the vast areas
> >> of used memory. I don't have good data on skip-efficiency at higher
> >> uptimes and the resulting fragmentation yet. The overhead added to the
> >> page allocator is concerning, but I cannot think of a better way to
> >> make the search more efficient. What do you guys think?
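A toy userspace model of the skipping mechanics may help here (all names
are mine, not from the patch; the real thing works on struct page and has
to hook the allocator fast paths): one bit per page, set on free, cleared
on allocation, and the scanner tests whole 64-bit words so it can jump
over fully used ranges instead of inspecting every page.

#include <stdint.h>
#include <stdio.h>

#define NR_PAGES 4096

/* One bit per page: 1 = free, 0 = used. */
static uint64_t free_bitmap[NR_PAGES / 64];

static void mark_free(unsigned long pfn)
{
        free_bitmap[pfn / 64] |= 1ULL << (pfn % 64);
}

static void mark_used(unsigned long pfn)
{
        free_bitmap[pfn / 64] &= ~(1ULL << (pfn % 64));
}

/*
 * Find the first free page at or after @pfn. A zero word lets the
 * scanner skip 64 pages at a time.
 */
static long find_next_free(unsigned long pfn)
{
        unsigned long word;

        for (word = pfn / 64; word < NR_PAGES / 64; word++) {
                uint64_t w = free_bitmap[word];

                if (word == pfn / 64)
                        w &= ~0ULL << (pfn % 64); /* ignore bits below @pfn */
                if (w)
                        return word * 64 + __builtin_ctzll(w);
        }
        return -1;
}

int main(void)
{
        mark_free(100);
        mark_free(3000);
        mark_used(100); /* allocation clears the bit again */
        printf("next free page: %ld\n", find_next_free(0)); /* 3000 */
        return 0;
}

The overhead you worry about is visible even in the toy: mark_free() and
mark_used() sit on the alloc/free fast paths, which is exactly the cost the
real patch has to pay.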
> >
> > Hello, Johannes.
> >
> > I think the best solution is for compaction not to do a linear scan at
> > all. Vlastimil has already suggested that idea.
>
> I was going to bring this up here, thanks :)
>
> > mm, compaction: direct freepage allocation for async direct
> > compaction
> >
> > lkml.kernel.org/r/<1459414236-9219-5-git-send-email-vbabka@suse.cz>
> >
> > It uses the buddy allocator to get a freepage, so there is no linear
> > scan. It would completely remove the scalability issue.
>
> Another big advantage is that the migration scanner would get to see the
> whole zone, and not be biased towards the first 1/3 until it meets the
> free scanner. And another advantage is that we wouldn't be splitting
> free pages needlessly.
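The scalability difference is easy to see in a toy model (again, all names
are mine, this is not the actual patch): taking a migration target from a
freelist is O(1) and touches only free memory, while the free scanner has
to walk page states linearly from the end of the zone.

#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 4096

static bool page_is_free[NR_PAGES];
static long freelist[NR_PAGES];
static long freelist_top;

static void free_one_page(long pfn)
{
        page_is_free[pfn] = true;
        freelist[freelist_top++] = pfn;
}

/* Free scanner: O(zone size) walk per migration target. */
static long scan_for_target(void)
{
        long pfn;

        for (pfn = NR_PAGES - 1; pfn >= 0; pfn--)
                if (page_is_free[pfn])
                        return pfn;
        return -1;
}

/* Direct allocation: O(1), like asking the buddy allocator. */
static long alloc_target(void)
{
        return freelist_top ? freelist[--freelist_top] : -1;
}

int main(void)
{
        free_one_page(42);
        printf("scanner found %ld, freelist gave %ld\n",
               scan_for_target(), alloc_target());
        return 0;
}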
>
> > Unfortunately, he applied this idea only to async compaction, since
> > changing the other compaction modes would probably cause long-term
> > fragmentation. I disagreed with that idea at the time, since having
> > different compaction logic for different compaction modes would make
> > the system more unpredictable.
> >
> > I doubt long-term fragmentation is a real issue in practice. We lose too
> > much trying to prevent it. I think it's time to fix the real issue (yours
> > and David's) by giving up on guarding against long-term fragmentation.
>
> I'm now also more convinced that this direction should be pursued, and
> wanted to get to it after the proactive kcompactd part. My biggest
> concern is that freelists can give us pages from the same block that we
> (or somebody else) are trying to compact (migrate away from). Isolating
> (i.e. MIGRATE_ISOLATE) the block first would work, but the overhead of
> the isolation could be significant. But I have some alternative ideas
> that could be tried.
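Just to make that concern concrete, here is a naive toy guard (my own
illustration, not one of the alternatives you mention, and it ignores the
races that a real MIGRATE_ISOLATE-based scheme would have to handle):
reject any target page that falls inside the pageblock being migrated,
stash it, and put it back afterwards.

#include <stdio.h>

#define PAGEBLOCK_ORDER 9 /* 512 4KB pages per pageblock, as on x86 */
#define FREELIST_SIZE 16

static long freelist[FREELIST_SIZE];
static long freelist_top;

static int same_pageblock(long a, long b)
{
        return (a >> PAGEBLOCK_ORDER) == (b >> PAGEBLOCK_ORDER);
}

/*
 * Pop a target from the freelist, but refuse pages that sit in the
 * pageblock we are migrating out of; stash them and give them back
 * when done.
 */
static long alloc_target_outside(long migrating_pfn)
{
        long stash[FREELIST_SIZE];
        long nr_stashed = 0;
        long pfn = -1;

        while (freelist_top) {
                pfn = freelist[--freelist_top];
                if (!same_pageblock(pfn, migrating_pfn))
                        break;
                stash[nr_stashed++] = pfn; /* from the source block */
                pfn = -1;
        }
        while (nr_stashed)
                freelist[freelist_top++] = stash[--nr_stashed];
        return pfn;
}

int main(void)
{
        freelist[freelist_top++] = 2048; /* some other pageblock */
        freelist[freelist_top++] = 600;  /* same block as the source */

        printf("target: %ld\n", alloc_target_outside(700)); /* 2048 */
        printf("put back: %ld\n", freelist[0]);             /* 600 */
        return 0;
}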
>
> > If someone doesn't agree with the above solution, your approach looks
> > like the second best to me. Though there is something to optimize.
> >
> > I don't think we need to track the pageblock's freepage state precisely.
> > Compaction is a far rarer event than page allocation, so compaction can
> > tolerate false positives.
> >
> > So, my suggestion is:
> >
> > 1) Use 1 bit per pageblock. Reusing PB_migrate_skip looks best to me.
>
> Wouldn't that reuse cripple the original use for the migration scanner?
I don't think there is a serious problem. A problem only happens if we set
PB_migrate_skip wrongly. Consider the following two cases that set
PB_migrate_skip:
1) The migration scanner finds that all pages in the pageblock are pinned
-> set skip -> the bit is cleared once one of the pages is freed. No
problem.
There is a possibility that a temporarily pinned page is unpinned and we
miss this pageblock, but that would be a minor case.
2) The migration scanner finds that all pages in the pageblock are free
-> set skip -> we can miss the pageblock for a long time.
We need to fix case 2) in order to reuse PB_migrate_skip. I guess that just
counting the number of free pages in isolate_migratepages_block() and
taking that count into account before setting PB_migrate_skip will work.
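Roughly like the following toy model (all names are mine; in the kernel
the free-page count would come from the PageBuddy() pages that
isolate_migratepages_block() already walks over anyway):

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_BLOCK 512

enum page_state { PAGE_FREE, PAGE_PINNED, PAGE_MOVABLE };

/*
 * Toy migration scan of one pageblock: returns whether the skip bit
 * should be set. A mostly-free block must stay visible to a free
 * scanner that shares the same bit, so it is never marked.
 */
static bool should_set_skip(const enum page_state *block)
{
        long nr_isolated = 0, nr_free = 0;
        int i;

        for (i = 0; i < PAGES_PER_BLOCK; i++) {
                if (block[i] == PAGE_FREE)
                        nr_free++;
                else if (block[i] == PAGE_MOVABLE)
                        nr_isolated++; /* would be queued for migration */
        }

        /*
         * Old rule: nothing isolated -> skip next time.
         * New rule: additionally require that the block is not
         * mostly free, so case 2) above no longer sets the bit.
         */
        return nr_isolated == 0 && nr_free < PAGES_PER_BLOCK / 2;
}

int main(void)
{
        enum page_state pinned[PAGES_PER_BLOCK], empty[PAGES_PER_BLOCK];
        int i;

        for (i = 0; i < PAGES_PER_BLOCK; i++) {
                pinned[i] = PAGE_PINNED;
                empty[i] = PAGE_FREE;
        }
        printf("all pinned -> set skip: %d\n", should_set_skip(pinned)); /* 1 */
        printf("all free   -> set skip: %d\n", should_set_skip(empty));  /* 0 */
        return 0;
}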
>
> > 2) Clear PB_migrate_skip in the free path, only when needed, and set it
> > in compaction when the freepage scan fails in that pageblock. The
> > freepage scan can then skip any pageblock whose PB_migrate_skip is set,
> > since the bit means there is no freepage in that pageblock.
> >
> > Following is some code about my suggestion.
>
> Otherwise it sounds like it could work until the direct allocation
> approach is fully developed (or turns out to be infeasible).
Agreed.
Thanks.