From: William Kucharski <william.kucharski@oracle.com>
To: lsf-pc@lists.linux-foundation.org, Linux-MM <linux-mm@kvack.org>,
linux-fsdevel@vger.kernel.org
Subject: [LSF/MM TOPIC][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
Date: Wed, 20 Feb 2019 04:17:13 -0700
Message-ID: <379F21DD-006F-4E33-9BD5-F81F9BA75C10@oracle.com>
For the past year or so I have been working on further developing my original
prototype for mapping read-only program text using large THP pages.
The prototype, described below, is something I continue to work on, but the
major issues I have yet to solve involve page cache integration and filesystem
support.
At present, the conventional methodology of reading a single base PAGE and
using readahead to fill in additional pages isn't useful, as the entire
PMD-sized page (in my prototype) needs to be read in before it can be mapped
(and at that point it is unclear whether readahead of additional PMD-sized
pages would be of benefit or too costly).
Additionally, there are no good interfaces at present to tell filesystem layers
that content is desired in chunks larger than a hardcoded limit of 64K, or to
read disk blocks in chunks appropriate for PMD-sized pages.
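To illustrate what I mean, the existing readahead entry point could in principle
be asked for a full PMD-sized chunk, but the request is still subject to the
limits described above; this is only a sketch, and the surrounding context
(mapping, file, vmf) is assumed:

        /*
         * Sketch only: ask readahead for a PMD-sized chunk starting at the
         * PMD-aligned index of the faulting page.  The request is still
         * subject to the readahead limits described above, so this does not
         * guarantee that HPAGE_PMD_NR pages are actually brought in.
         */
        pgoff_t index = vmf->pgoff & ~(HPAGE_PMD_NR - 1);

        page_cache_sync_readahead(mapping, &file->f_ra, file, index,
                                  HPAGE_PMD_NR);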
I very briefly discussed some of this work with Kirill in the past, and am
currently somewhat blocked on progress with my prototype due to issues with
multiorder page size support in the radix tree page cache. I don't feel it is
worth the time to debug those issues since the radix tree page cache is dead,
and it's much more useful to help Matthew Wilcox get multiorder page support
for XArray tested and approved upstream.
The following is a backgrounder on the work I have done to date and some
performance numbers.
Since it's just a prototype, I am unsure whether it would make a good topic for
a discussion talk per se, but should I be invited to attend it could certainly
engender a good amount of discussion as a BoF/cross-discipline topic between
the MM and FS tracks.
Thanks,
William Kucharski
========================================
One of the downsides of THP as currently implemented is that it only supports
large page mappings for anonymous pages.
I embarked upon this prototype on the theory that it would be advantageous to
be able to map large ranges of read-only text pages using THP as well.
The idea is that the kernel will attempt to allocate and map the range using a
PMD-sized THP page upon first fault; if the allocation is successful, the page
will be populated (at present using a call to kernel_read()) and mapped at the
PMD level. If the memory allocation fails, the page fault routines will drop
through to the conventional PAGESIZE-oriented routines to map the faulting
page.
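Very roughly, and with locking and error handling omitted, the first-fault path
is conceptually the following (install_text_pmd() is a placeholder for the code
that actually sets up the PMD mapping, not a real function):

        struct page *page;
        loff_t off = (loff_t)(vmf->pgoff & ~(HPAGE_PMD_NR - 1)) << PAGE_SHIFT;

        /* Try for a PMD-sized page; fall back to the PAGESIZE path if not. */
        page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);
        if (!page)
                return VM_FAULT_FALLBACK;

        /* Populate the entire PMD-sized page from the backing file. */
        if (kernel_read(vmf->vma->vm_file, page_address(page), HPAGE_PMD_SIZE,
                        &off) != HPAGE_PMD_SIZE) {
                __free_pages(page, HPAGE_PMD_ORDER);
                return VM_FAULT_FALLBACK;
        }

        return install_text_pmd(vmf, page);     /* map it at the PMD level */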
Since this approach maps a PMD-sized block of the address space at a time, we
should see a slight uptick in time spent in disk I/O but a substantial drop in
page faults, as well as a reduction in iTLB misses as address ranges will be
mapped with the larger page. Analysis of a test program that consists of a very
large text area (483,138,032 bytes in size) and thrashes D$ and I$ shows this
does occur, along with a slight reduction in program execution time.
The text segment as seen from readelf:
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000001ccc19f0 0x000000001ccc19f0  R E    0x200000
As currently implemented for test purposes, the prototype will only use large
pages to map an executable with a particular filename ("testr"), enabling easy
comparison of the same executable using 4K and 2M (x64) pages on the same
kernel. It is understood that this is just a proof-of-concept implementation
and that much more work on enabling the feature and on overall system usage
would need to be done before it could be submitted as a kernel patch. However,
I felt it would be worthwhile to send it out as an RFC to find out whether
there are strong objections from the community to doing this at all, or to gain
a better understanding of the major concerns that must be addressed before it
would even be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to
the equivalent of "always" and bypass some checks for anonymous pages by simply
#ifdefing the code out; obviously I would need to determine the right thing to
do in those cases.
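The gating check is conceptually nothing more than a name comparison along
these lines (a sketch, not the literal prototype code):

        /*
         * Sketch of the hack used to restrict the prototype to a single
         * executable name; obviously not something a real submission would do.
         */
        static bool text_thp_wanted(struct vm_area_struct *vma)
        {
                return vma->vm_file &&
                       !strcmp(vma->vm_file->f_path.dentry->d_name.name,
                               "testr");
        }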
Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10"
follow; the 4K pagesize program was named "foo" and the 2M pagesize program
"testr" (as noted above) - please note that these numbers do vary from run to
run, but the orders of magnitude of the differences between the two versions
remain relatively constant:
4K Pages:
=========
Performance counter stats for './foo' (10 runs):
307054.450421 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.21% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,728 page-faults:u # 0.025 K/sec ( +- 0.00% )
1,401,295,823,265 cycles:u # 4.564 GHz ( +- 0.19% ) (30.77%)
562,704,668,718 instructions:u # 0.40 insn per cycle ( +- 0.00% ) (38.46%)
20,100,243,102 branches:u # 65.461 M/sec ( +- 0.00% ) (38.46%)
2,628,944 branch-misses:u # 0.01% of all branches ( +- 3.32% ) (38.46%)
180,885,880,185 L1-dcache-loads:u # 589.100 M/sec ( +- 0.00% ) (38.46%)
40,374,420,279 L1-dcache-load-misses:u # 22.32% of all L1-dcache hits ( +- 0.01% ) (38.46%)
232,184,583 LLC-loads:u # 0.756 M/sec ( +- 1.48% ) (30.77%)
23,990,082 LLC-load-misses:u # 10.33% of all LL-cache hits ( +- 1.48% ) (30.77%)
<not supported> L1-icache-loads:u
74,897,499,234 L1-icache-load-misses:u ( +- 0.00% ) (30.77%)
180,990,026,447 dTLB-loads:u # 589.440 M/sec ( +- 0.00% ) (30.77%)
707,373 dTLB-load-misses:u # 0.00% of all dTLB cache hits ( +- 4.62% ) (30.77%)
5,583,675 iTLB-loads:u # 0.018 M/sec ( +- 0.31% ) (30.77%)
1,219,514,499 iTLB-load-misses:u # 21840.71% of all iTLB cache hits ( +- 0.01% ) (30.77%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
307.093088771 seconds time elapsed ( +- 0.20% )
2M Pages:
=========
Performance counter stats for './testr' (10 runs):
289504.209769 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.19% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
598 page-faults:u # 0.002 K/sec ( +- 0.03% )
1,323,835,488,984 cycles:u # 4.573 GHz ( +- 0.19% ) (30.77%)
562,658,682,055 instructions:u # 0.43 insn per cycle ( +- 0.00% ) (38.46%)
20,099,662,528 branches:u # 69.428 M/sec ( +- 0.00% ) (38.46%)
2,877,086 branch-misses:u # 0.01% of all branches ( +- 4.52% ) (38.46%)
180,899,297,017 L1-dcache-loads:u # 624.859 M/sec ( +- 0.00% ) (38.46%)
40,209,140,089 L1-dcache-load-misses:u # 22.23% of all L1-dcache hits ( +- 0.00% ) (38.46%)
135,968,232 LLC-loads:u # 0.470 M/sec ( +- 1.56% ) (30.77%)
6,704,890 LLC-load-misses:u # 4.93% of all LL-cache hits ( +- 1.92% ) (30.77%)
<not supported> L1-icache-loads:u
74,955,673,747 L1-icache-load-misses:u ( +- 0.00% ) (30.77%)
180,987,794,366 dTLB-loads:u # 625.165 M/sec ( +- 0.00% ) (30.77%)
835 dTLB-load-misses:u # 0.00% of all dTLB cache hits ( +- 14.35% ) (30.77%)
6,386,207 iTLB-loads:u # 0.022 M/sec ( +- 0.42% ) (30.77%)
51,929,869 iTLB-load-misses:u # 813.16% of all iTLB cache hits ( +- 1.61% ) (30.77%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
289.551551387 seconds time elapsed ( +- 0.20% )
A check of /proc/meminfo with the test program running shows the large mappings:
ShmemPmdMapped: 471040 kB
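For reference, 471040 kB corresponds to 471040 / 2048 = 230 PMD-sized (2 MB)
mappings, which lines up with the 483,138,032-byte text segment shown above
(483,138,032 / 2,097,152 is roughly 230.4); presumably the remaining tail of
the segment is mapped with PAGESIZE pages.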
The obvious problem with this first pass is that the large pages are not placed
into the page cache, so, for example, multiple concurrent executions of the
test program each allocate and map their own large pages.
A greater architectural issue is the best way to support large pages in the page
cache, which is something Matthew Wilcox's multiorder page support in XArray
should solve.
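As I understand the in-progress XArray work, inserting a PMD-sized compound
page into the page cache as a single multi-order entry would then look roughly
like the sketch below (reference counting and page->mapping setup omitted, and
the exact interfaces are assumptions on my part):

        /* Store one PMD-sized compound page as a single order-9 entry. */
        XA_STATE_ORDER(xas, &mapping->i_pages, index, HPAGE_PMD_ORDER);

        do {
                xas_lock_irq(&xas);
                xas_store(&xas, page);
                xas_unlock_irq(&xas);
        } while (xas_nomem(&xas, GFP_KERNEL));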
Some questions:
* What is the best approach to deal with large pages when PAGESIZE mappings of
the same range exist? At present, the prototype evicts PAGESIZE pages from the
page cache and replaces them with a mapping for the large page; future mappings
of a PAGESIZE range should then map using an offset into the PMD-sized physical
page backing the PMD-sized virtual page (see the sketch following these
questions).
* Do we need to create per-filesystem routines to handle large pages, or can
we delay that? (Ideally we would want to be able to read in the contents of
a large page without having to read_iter however many PAGESIZE pages we
need.)
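For the first question, the intent is roughly the following, assuming the
compound head page ("head" below) has already been found in the page cache and
glossing over locking and reference counting:

        /*
         * A PAGESIZE fault within a range already backed by a PMD-sized page
         * maps the corresponding subpage of the compound page rather than
         * reading in a new PAGESIZE page.
         */
        struct page *subpage = head + (vmf->pgoff & (HPAGE_PMD_NR - 1));

        get_page(subpage);
        vmf->page = subpage;    /* the normal PTE fault code installs it */
        return 0;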
I am happy to take whatever approach is best to add large pages to the page
cache, but it seems useful and crucial that a way be provided for the system to
automatically use THP to map large text regions if so desired, read-only to
begin with but eventually read/write to accommodate applications that
self-modify code, such as databases and Java.
========================================