From: Vlastimil Babka <vbabka@suse.cz>
To: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Minchan Kim <minchan@kernel.org>, Mel Gorman <mgorman@suse.de>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Michal Nazarewicz <mina86@mina86.com>,
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP
Date: Mon, 23 Jun 2014 11:52:28 +0200 [thread overview]
Message-ID: <53A7F8DC.8090106@suse.cz> (raw)
In-Reply-To: <53A7BD91.8020802@cn.fujitsu.com>
On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
> Hello
>
> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>> dropped before write lock is taken on the same mmap_sem.
>>>
>>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>>> vma needs to be rechecked after taking the write lock anyway. Furthemore, huge
>>> page allocation might involve a rather long sync compaction, and thus block
>>> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
>>> or mprotect oterations.
>>>
>>> This patch simply releases the read lock before allocating a huge page. It
>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>> ("mm: thp: khugepaged: add policy for finding target node").
>>
>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>> all. Please, move up_read() outside khugepaged_alloc_page().
>>
Well there's also currently no point in passing several parameters to
khugepaged_alloc_page(). So I could clean it up as well, but I imagine
later we would perhaps reintroduce them back, as I don't think the
current situation is ideal for at least two reasons.
1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for
finding target node"), it's based on a report where somebody found that
mempolicy is not observed properly when collapsing THP's. But the
'policy' introduced by the commit isn't based on real mempolicy, it
might just under certain conditions results in an interleave, which
happens to be what the reporter was trying.
So ideally, it should be making node allocation decisions based on where
the original 4KB pages are located. For example, allocate a THP only if
all the 4KB pages are on the same node. That would also automatically
obey any policy that has lead to the allocation of those 4KB pages.
And for this, it will need again the parameters and mmap_sem in read
mode. It would be however still a good idea to drop mmap_sem before the
allocation itself, since compaction/reclaim might take some time...
2. (less related) I'd expect khugepaged to first allocate a hugepage and
then scan for collapsing. Yes there's khugepaged_prealloc_page, but that
only does something on !NUMA systems and these are not the future.
Although I don't have the data, I expect allocating a hugepage is a
bigger issue than finding something that could be collapsed. So why scan
for collapsing if in the end I cannot allocate a hugepage? And if I
really cannot find something to collapse, would e.g. caching a single
hugepage per node be a big hit? Also, if there's really nothing to
collapse, then it means khugepaged won't compact. And since khugepaged
is becoming the only source of sync compaction that doesn't give up
easily and tries to e.g. migrate movable pages out of unmovable
pageblocks, this might have bad effects on fragmentation.
I believe this could be done smarter.
> I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round again
> do the for loop to get the next vma and handle it. Does we do this without holding
> the mmap_sem in any mode?
>
> And if the loop end, we have another up_read in breakouterloop. What if we have
> released the mmap_sem in collapse_huge_page()?
collapse_huge_page() is only called from khugepaged_scan_pmd() in the if
(ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for
the return value of khugepaged_scan_pmd() to break out of the loop (and
not doing up_read() again). So I think this is correct and moving
up_read from khugepaged_alloc_page() to collapse_huge_page() wouldn't
change this?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-06-23 9:52 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-06-20 15:49 [PATCH v3 00/13] compaction: balancing overhead and success rates Vlastimil Babka
2014-06-20 15:49 ` [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP Vlastimil Babka
2014-06-20 17:45 ` Kirill A. Shutemov
2014-06-23 5:39 ` Zhang Yanfei
2014-06-23 9:52 ` Vlastimil Babka [this message]
2014-06-23 10:40 ` Zhang Yanfei
2014-06-20 15:49 ` [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone Vlastimil Babka
2014-06-23 2:24 ` Minchan Kim
2014-06-23 6:26 ` Zhang Yanfei
2014-06-24 8:23 ` Joonsoo Kim
2014-06-24 15:29 ` Vlastimil Babka
2014-06-25 1:02 ` Joonsoo Kim
2014-06-20 15:49 ` [PATCH v3 03/13] mm, compaction: do not recheck suitable_migration_target under lock Vlastimil Babka
2014-06-20 15:49 ` [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range() Vlastimil Babka
2014-06-23 6:57 ` Zhang Yanfei
2014-06-24 4:52 ` Naoya Horiguchi
2014-06-24 15:34 ` Vlastimil Babka
2014-06-24 16:58 ` Naoya Horiguchi
2014-06-25 8:50 ` Vlastimil Babka
2014-06-25 15:46 ` Naoya Horiguchi
2014-06-24 8:33 ` Joonsoo Kim
2014-06-24 15:42 ` Vlastimil Babka
2014-06-25 0:53 ` Joonsoo Kim
2014-06-25 8:59 ` Vlastimil Babka
2014-06-27 5:57 ` Joonsoo Kim
2014-06-20 15:49 ` [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention Vlastimil Babka
2014-06-23 1:39 ` Minchan Kim
2014-06-23 8:55 ` Zhang Yanfei
2014-06-23 23:35 ` Minchan Kim
2014-06-24 1:07 ` Zhang Yanfei
2014-07-11 8:28 ` Vlastimil Babka
2014-07-11 9:38 ` Vlastimil Babka
2014-06-20 15:49 ` [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners Vlastimil Babka
2014-06-23 2:53 ` Minchan Kim
2014-07-11 12:03 ` Vlastimil Babka
2014-06-23 9:13 ` Zhang Yanfei
2014-06-24 15:39 ` Naoya Horiguchi
2014-06-24 15:44 ` Vlastimil Babka
2014-06-20 15:49 ` [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held Vlastimil Babka
2014-06-23 9:16 ` Zhang Yanfei
2014-06-24 18:55 ` Naoya Horiguchi
2014-06-20 15:49 ` [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner Vlastimil Babka
2014-06-23 3:04 ` Minchan Kim
2014-06-23 9:17 ` Zhang Yanfei
2014-06-24 19:09 ` Naoya Horiguchi
2014-06-20 15:49 ` [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner Vlastimil Babka
2014-06-23 3:05 ` Minchan Kim
2014-06-23 9:29 ` Zhang Yanfei
2014-06-20 15:49 ` [PATCH v3 10/13] mm: rename allocflags_to_migratetype for clarity Vlastimil Babka
2014-06-24 20:34 ` Naoya Horiguchi
2014-06-20 15:49 ` [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control Vlastimil Babka
2014-06-23 3:06 ` Minchan Kim
2014-06-23 9:31 ` Zhang Yanfei
2014-06-20 15:49 ` [PATCH v3 12/13] mm, compaction: try to capture the just-created high-order freepage Vlastimil Babka
2014-06-25 1:57 ` Naoya Horiguchi
2014-06-25 8:57 ` Vlastimil Babka
2014-06-20 15:49 ` [RFC PATCH v3 13/13] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation Vlastimil Babka
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53A7F8DC.8090106@suse.cz \
--to=vbabka@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=cl@linux.com \
--cc=iamjoonsoo.kim@lge.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mina86@mina86.com \
--cc=minchan@kernel.org \
--cc=n-horiguchi@ah.jp.nec.com \
--cc=riel@redhat.com \
--cc=rientjes@google.com \
--cc=zhangyanfei@cn.fujitsu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox