From: Michal Hocko <mhocko@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mgorman@suse.de>,
Andrea Arcangeli <aarcange@redhat.com>,
Vlastimil Babka <vbabka@suse.cz>, Zi Yan <zi.yan@cs.rutgers.edu>,
Stefan Priebe - Profihost AG <s.priebe@profihost.ag>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] Revert "mm, thp: restore node-local hugepage allocations"
Date: Fri, 7 Jun 2019 10:32:55 +0200
Message-ID: <20190607083255.GA18435@dhcp22.suse.cz>
In-Reply-To: <alpine.DEB.2.21.1906061451001.121338@chino.kir.corp.google.com>
On Thu 06-06-19 15:12:40, David Rientjes wrote:
> On Wed, 5 Jun 2019, Michal Hocko wrote:
>
> > > That's fine, but we also must be mindful of users who have used
> > > MADV_HUGEPAGE over the past four years based on its hard-coded behavior
> > > that would now regress as a result.
> >
> > Absolutely, I am all for helping those usecases. First of all we need to
> > understand what those usecases are though. So far we have only seen very
> > vague claims about artificial worst case examples when a remote access
> > dominates the overall cost but that doesn't seem to be the case in real
> > life in my experience (e.g. numa balancing will correct things or the
> > over aggressive node reclaim tends to cause problems elsewhere etc.).
> >
>
> The usecase is a remap of a binary's text segment to transparent hugepages
> by doing mmap() -> madvise(MADV_HUGEPAGE) -> mremap() and when this
> happens on a locally fragmented node. This happens at startup when we
> aren't concerned about allocation latency: we want to compact. We are
> concerned with access latency thereafter as long as the process is
> running.
You have indicated this previously, but no call for a standalone
reproducer was successful. It is really hard to optimize for such a
specialized workload without anything to play with. Btw. this is exactly
a case where I would expect numa balancing to converge to the optimal
placement. And if numa balancing is not an option then an explicit
mempolicy (e.g. the one suggested here) would be a good fit.
[...]
I will defer the compaction related stuff to Vlastimil and Mel who are
much more familiar with the current code.
> So my proposed change would be:
> - give the page allocator a consistent indicator that compaction failed
> because we are low on memory (make COMPACT_SKIPPED really mean this),
> - if we get this in the page allocator and we are allocating thp, fail,
> reclaim is unlikely to help here and is much more likely to be
> disruptive
> - we could retry compaction if we haven't scanned all memory and
> were contended,
> - if the hugepage allocation fails, have thp check watermarks for order-0
> pages without any padding,
> - if watermarks succeed, fail the thp allocation: we can't allocate
> because of fragmentation and it's better to return node local memory,
Doesn't this lead to the same low THP success rate we have seen with one
of the previous patches though?
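
To make the quoted proposal easier to discuss, here is a rough userspace
model of the decision sequence as I read it. This is only a sketch: every
name below (model_thp_decide, struct model_thp_state, the MODEL_* values)
is hypothetical and none of it is kernel API. Placing the compaction retry
under the "low on memory" case is an assumption about how the quoted list
nests, and the case where the order-0 watermark check fails is not covered
by the quoted text, so the model reports it as unspecified.

```c
#include <stdbool.h>

/* Hypothetical model of the quoted proposal; not kernel code. */
enum model_compact_result {
	MODEL_COMPACT_SKIPPED,	/* proposal: strictly "too low on memory" */
	MODEL_COMPACT_FAILED,	/* compaction ran but gave no huge page */
};

enum model_thp_action {
	MODEL_THP_FAIL_WITHOUT_RECLAIM,		/* reclaim unlikely to help */
	MODEL_THP_RETRY_COMPACTION,		/* contended, scan incomplete */
	MODEL_THP_FAIL_USE_LOCAL_BASE_PAGES,	/* fragmentation, stay local */
	MODEL_THP_NOT_SPECIFIED,		/* quoted text is silent here */
};

struct model_thp_state {
	enum model_compact_result compact_result;
	bool scanned_all_memory;	/* compaction covered all of memory */
	bool contended;			/* compaction backed off on contention */
	bool order0_watermark_ok;	/* order-0 watermark check, no padding */
};

enum model_thp_action model_thp_decide(const struct model_thp_state *s)
{
	if (s->compact_result == MODEL_COMPACT_SKIPPED) {
		/* Assumed nesting: the retry is an exception to failing. */
		if (!s->scanned_all_memory && s->contended)
			return MODEL_THP_RETRY_COMPACTION;
		return MODEL_THP_FAIL_WITHOUT_RECLAIM;
	}

	/* The hugepage allocation failed for some other reason. */
	if (s->order0_watermark_ok)
		return MODEL_THP_FAIL_USE_LOCAL_BASE_PAGES;

	return MODEL_THP_NOT_SPECIFIED;
}
```

Note that in this model every outcome the quoted text spells out either
retries compaction or gives up on the THP, which is where the success rate
question above comes from.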
Let me remind you of the semantic I proposed previously in
http://lkml.kernel.org/r/20181206091405.GD1286@dhcp22.suse.cz, which
didn't get shot down. Linus had some follow-up ideas on what exactly
the fallback order should look like and that is fine. We should just
measure the difference between a cheap local-node base page and a remote
THP on _real_ workloads. Any microbenchmark which just measures latency
is inherently misleading.
And really, the fundamental problem here is that MADV_HUGEPAGE has gained
a NUMA semantic without due scrutiny, leading to a broken interface
with side effects that simply make the interface unusable for a
large part of the usecases that the madvise was originally designed for.
Until we find an agreement on this point we will be looping in a dead-end
discussion, I am afraid.
--
Michal Hocko
SUSE Labs