* [PATCH 0/2] mm,thp: Add filemap_huge_fault() for THP
@ 2019-07-28 22:47 William Kucharski
2019-07-28 22:47 ` [PATCH 1/2] mm: Allow the page cache to allocate large pages William Kucharski
2019-07-28 22:47 ` [PATCH 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
0 siblings, 2 replies; 4+ messages in thread
From: William Kucharski @ 2019-07-28 22:47 UTC (permalink / raw)
To: ceph-devel, linux-afs, linux-btrfs, linux-kernel, linux-mm,
netdev, Chris Mason, David S. Miller, David Sterba, Josef Bacik
Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
William Kucharski, Chad Mynhier, Kirill A. Shutemov,
Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
Thomas Gleixner, Jérôme Glisse, Amir Goldstein,
Jason Gunthorpe, Michal Hocko, Jann Horn, David Howells,
John Hubbard, Souptick Joarder, john.hubbard, Jan Kara,
Andrey Konovalov, Arun KS, Aneesh Kumar K.V, Jeff Layton,
Yangtao Li, Andrew Morton, Robin Murphy, Mike Rapoport,
David Rientjes, Andrey Ryabinin, Yafang Shao, Huang Shijie,
Yang Shi, Miklos Szeredi, Pavel Tatashin, Kirill Tkhai,
Sage Weil, Ira Weiny, Dan Williams, Darrick J. Wong, Gao Xiang,
Bartlomiej Zolnierkiewicz, Ross Zwisler
This set of patches is the first step towards a mechanism for automatically
mapping read-only text areas of appropriate size and alignment to THPs whenever
possible.
For now, the central routine, filemap_huge_fault(), and various support
routines are only included if the experimental kernel configuration option
RO_EXEC_FILEMAP_HUGE_FAULT_THP
is enabled.
This is because filemap_huge_fault() is dependent upon the
address_space_operations vector readpage() pointing to a routine that
will read and fill an entire large page at a time without polluting the
page cache with PAGESIZE entries for the large page being mapped or
performing readahead that would pollute the page cache entries for
succeeding large pages. Unfortunately, there is no good way to determine
how many bytes were read by readpage(). At present, if filemap_huge_fault()
were to call a conventional readpage() routine, it would only fill the first
PAGESIZE bytes of the large page, which is definitely NOT the desired behavior.
However, by making the code available now it is hoped that filesystem
maintainers who have pledged to provide such a mechanism will do so more
rapidly.
The first part of the patch adds an order field to __page_cache_alloc(),
allowing callers to directly request page cache pages of various sizes.
This code was provided by Matthew Wilcox.
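As a rough userspace illustration of what the new order argument means (a
sketch only, not kernel code): an order-N request covers 2^N contiguous base
pages. The PAGE_SHIFT and PMD_SHIFT values below are assumed x86-64 defaults
with 4 KiB base pages, and order_to_bytes() is a hypothetical helper that does
not exist in the patch.

```c
#include <assert.h>
#include <limits.h>

/* Assumed values for x86-64 with 4 KiB base pages; other configs differ. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define HPAGE_PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)

/* Hypothetical helper: bytes covered by a page allocation of this order. */
static unsigned long order_to_bytes(unsigned int order)
{
	/* Guard against shifting past the width of unsigned long. */
	assert(PAGE_SHIFT + order < sizeof(unsigned long) * CHAR_BIT);
	return 1UL << (PAGE_SHIFT + order);
}
```

Under these assumptions, order 0 is a single 4 KiB page and HPAGE_PMD_ORDER
(9) is one 2 MiB PMD-sized page, which is the size the second patch requests.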
The second part of the patch implements the filemap_huge_fault() mechanism as
described above.
Matthew Wilcox (1):
mm: Allow the page cache to allocate large pages
William Kucharski (2):
mm: Allow the page cache to allocate large pages
mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
fs/afs/dir.c | 2 +-
fs/btrfs/compression.c | 2 +-
fs/cachefiles/rdwr.c | 4 +-
fs/ceph/addr.c | 2 +-
fs/ceph/file.c | 2 +-
include/linux/huge_mm.h | 16 +-
include/linux/mm.h | 6 +
include/linux/pagemap.h | 13 +-
mm/Kconfig | 15 ++
mm/filemap.c | 322 ++++++++++++++++++++++++++++++++++++++--
mm/huge_memory.c | 3 +
mm/mmap.c | 36 ++++-
mm/readahead.c | 2 +-
mm/rmap.c | 8 +
net/ceph/pagelist.c | 4 +-
net/ceph/pagevec.c | 2 +-
16 files changed, 404 insertions(+), 35 deletions(-)
--
2.21.0
* [PATCH 1/2] mm: Allow the page cache to allocate large pages
2019-07-28 22:47 [PATCH 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
@ 2019-07-28 22:47 ` William Kucharski
2019-07-29 20:00 ` kbuild test robot
2019-07-28 22:47 ` [PATCH 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP William Kucharski
1 sibling, 1 reply; 4+ messages in thread
From: William Kucharski @ 2019-07-28 22:47 UTC (permalink / raw)
To: ceph-devel, linux-afs, linux-btrfs, linux-kernel, linux-mm,
netdev, Chris Mason, David S. Miller, David Sterba, Josef Bacik
Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
William Kucharski, Chad Mynhier, Kirill A. Shutemov,
Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
Thomas Gleixner, Jérôme Glisse, Amir Goldstein,
Jason Gunthorpe, Michal Hocko, Jann Horn, David Howells,
John Hubbard, Souptick Joarder, john.hubbard, Jan Kara,
Andrey Konovalov, Arun KS, Aneesh Kumar K.V, Jeff Layton,
Yangtao Li, Andrew Morton, Robin Murphy, Mike Rapoport,
David Rientjes, Andrey Ryabinin, Yafang Shao, Huang Shijie,
Yang Shi, Miklos Szeredi, Pavel Tatashin, Kirill Tkhai,
Sage Weil, Ira Weiny, Dan Williams, Darrick J. Wong, Gao Xiang,
Bartlomiej Zolnierkiewicz, Ross Zwisler
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
---
fs/afs/dir.c | 2 +-
fs/btrfs/compression.c | 2 +-
fs/cachefiles/rdwr.c | 4 ++--
fs/ceph/addr.c | 2 +-
fs/ceph/file.c | 2 +-
include/linux/pagemap.h | 13 +++++++++----
mm/filemap.c | 25 +++++++++++++------------
mm/readahead.c | 2 +-
net/ceph/pagelist.c | 4 ++--
net/ceph/pagevec.c | 2 +-
10 files changed, 32 insertions(+), 26 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index e640d67274be..0a392214f71e 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key)
afs_stat_v(dvnode, n_inval);
ret = -ENOMEM;
- req->pages[i] = __page_cache_alloc(gfp);
+ req->pages[i] = __page_cache_alloc(gfp, 0);
if (!req->pages[i])
goto error;
ret = add_to_page_cache_lru(req->pages[i],
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 60c47b417a4b..5280e7477b7e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
}
page = __page_cache_alloc(mapping_gfp_constraint(mapping,
- ~__GFP_FS));
+ ~__GFP_FS), 0);
if (!page)
break;
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 44a3ce1e4ce4..11d30212745f 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
goto backing_page_already_present;
if (!newpage) {
- newpage = __page_cache_alloc(cachefiles_gfp);
+ newpage = __page_cache_alloc(cachefiles_gfp, 0);
if (!newpage)
goto nomem_monitor;
}
@@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
goto backing_page_already_present;
if (!newpage) {
- newpage = __page_cache_alloc(cachefiles_gfp);
+ newpage = __page_cache_alloc(cachefiles_gfp, 0);
if (!newpage)
goto nomem;
}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index e078cc55b989..bcb41fbee533 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1707,7 +1707,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
if (len > PAGE_SIZE)
len = PAGE_SIZE;
} else {
- page = __page_cache_alloc(GFP_NOFS);
+ page = __page_cache_alloc(GFP_NOFS, 0);
if (!page) {
err = -ENOMEM;
goto out;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..ae58d7c31aa4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct page *page = NULL;
loff_t i_size;
if (retry_op == READ_INLINE) {
- page = __page_cache_alloc(GFP_KERNEL);
+ page = __page_cache_alloc(GFP_KERNEL, 0);
if (!page)
return -ENOMEM;
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c7552459a15f..e9004e3cb6a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -208,17 +208,17 @@ static inline int page_cache_add_speculative(struct page *page, int count)
}
#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
#endif
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x), 0);
}
static inline gfp_t readahead_gfp_mask(struct address_space *x)
@@ -240,6 +240,11 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
#define FGP_NOFS 0x00000010
#define FGP_NOWAIT 0x00000020
#define FGP_FOR_MMAP 0x00000040
+/* If you add more flags, increment FGP_ORDER_SHIFT */
+#define FGP_ORDER_SHIFT 7
+#define FGP_PMD ((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
+#define FGP_PUD ((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
+#define fgp_get_order(fgp) ((fgp) >> FGP_ORDER_SHIFT)
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
int fgp_flags, gfp_t cache_gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..eb4c87428099 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -954,7 +954,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, unsigned int order)
{
int n;
struct page *page;
@@ -964,12 +964,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
do {
cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
- page = __alloc_pages_node(n, gfp, 0);
+ page = __alloc_pages_node(n, gfp, order);
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
return page;
}
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif
@@ -1597,12 +1597,12 @@ EXPORT_SYMBOL(find_lock_entry);
* pagecache_get_page - find and get a page reference
* @mapping: the address_space to search
* @offset: the page index
- * @fgp_flags: PCG flags
+ * @fgp_flags: FGP flags
* @gfp_mask: gfp mask to use for the page cache data page allocation
*
* Looks up the page cache slot at @mapping & @offset.
*
- * PCG flags modify how the page is returned.
+ * FGP flags modify how the page is returned.
*
* @fgp_flags can be:
*
@@ -1615,6 +1615,7 @@ EXPORT_SYMBOL(find_lock_entry);
* - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
* its own locking dance if the page is already in cache, or unlock the page
* before returning if we had to add the page to pagecache.
+ * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
*
* If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
* if the GFP flags specified for FGP_CREAT are atomic.
@@ -1660,12 +1661,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
no_page:
if (!page && (fgp_flags & FGP_CREAT)) {
int err;
- if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+ if ((fgp_flags & FGP_WRITE) &&
+ mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
gfp_mask &= ~__GFP_FS;
- page = __page_cache_alloc(gfp_mask);
+ page = __page_cache_alloc(gfp_mask, fgp_order(fgp_flags));
if (!page)
return NULL;
@@ -2802,15 +2804,14 @@ static struct page *wait_on_page_read(struct page *page)
static struct page *do_read_cache_page(struct address_space *mapping,
pgoff_t index,
int (*filler)(void *, struct page *),
- void *data,
- gfp_t gfp)
+ void *data, unsigned int order, gfp_t gfp)
{
struct page *page;
int err;
repeat:
page = find_get_page(mapping, index);
if (!page) {
- page = __page_cache_alloc(gfp);
+ page = __page_cache_alloc(gfp, order);
if (!page)
return ERR_PTR(-ENOMEM);
err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2917,7 +2918,7 @@ struct page *read_cache_page(struct address_space *mapping,
int (*filler)(void *, struct page *),
void *data)
{
- return do_read_cache_page(mapping, index, filler, data,
+ return do_read_cache_page(mapping, index, filler, data, 0,
mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_page);
@@ -2939,7 +2940,7 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
pgoff_t index,
gfp_t gfp)
{
- return do_read_cache_page(mapping, index, NULL, NULL, gfp);
+ return do_read_cache_page(mapping, index, NULL, NULL, 0, gfp);
}
EXPORT_SYMBOL(read_cache_page_gfp);
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..954760a612ea 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -193,7 +193,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
continue;
}
- page = __page_cache_alloc(gfp_mask);
+ page = __page_cache_alloc(gfp_mask, 0);
if (!page)
break;
page->index = page_offset;
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 65e34f78b05d..0c3face908dc 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -56,7 +56,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
struct page *page;
if (!pl->num_pages_free) {
- page = __page_cache_alloc(GFP_NOFS);
+ page = __page_cache_alloc(GFP_NOFS, 0);
} else {
page = list_first_entry(&pl->free_list, struct page, lru);
list_del(&page->lru);
@@ -107,7 +107,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT; /* conv to num pages */
while (space > pl->num_pages_free) {
- struct page *page = __page_cache_alloc(GFP_NOFS);
+ struct page *page = __page_cache_alloc(GFP_NOFS, 0);
if (!page)
return -ENOMEM;
list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 64305e7056a1..1d07e639216d 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -45,7 +45,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
if (!pages)
return ERR_PTR(-ENOMEM);
for (i = 0; i < num_pages; i++) {
- pages[i] = __page_cache_alloc(flags);
+ pages[i] = __page_cache_alloc(flags, 0);
if (pages[i] == NULL) {
ceph_release_page_vector(pages, i);
return ERR_PTR(-ENOMEM);
--
2.21.0
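The FGP_* additions in the pagemap.h hunk above pack an allocation order into
the bits above the existing flag bits. The following userspace sketch shows how
that encoding round-trips. The flag values and the FGP_ORDER_SHIFT, FGP_PMD and
fgp_get_order() definitions are copied from the hunk; the PAGE_SHIFT and
PMD_SHIFT values are assumed x86-64 defaults, and fgp_flag_bits() is a
hypothetical helper added only for illustration. Note that the
pagecache_get_page() hunk in this patch calls fgp_order(), while the header
defines fgp_get_order(); patch 2/2 changes the call to fgp_get_order().

```c
#include <assert.h>

/* Assumed x86-64 values; the kernel derives these per architecture. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21

/* Copied from the pagemap.h hunk in this patch. */
#define FGP_NOFS		0x00000010
#define FGP_FOR_MMAP		0x00000040
#define FGP_ORDER_SHIFT		7
#define FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
#define fgp_get_order(fgp)	((fgp) >> FGP_ORDER_SHIFT)

/* Hypothetical helper: drop the order bits, keeping only the flag bits. */
static int fgp_flag_bits(int fgp)
{
	assert(fgp >= 0);
	return fgp & ((1 << FGP_ORDER_SHIFT) - 1);
}
```

A caller that passes no order bits gets order 0, so existing users of
pagecache_get_page() keep allocating single pages.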
* [PATCH 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
2019-07-28 22:47 [PATCH 0/2] mm,thp: Add filemap_huge_fault() for THP William Kucharski
2019-07-28 22:47 ` [PATCH 1/2] mm: Allow the page cache to allocate large pages William Kucharski
@ 2019-07-28 22:47 ` William Kucharski
1 sibling, 0 replies; 4+ messages in thread
From: William Kucharski @ 2019-07-28 22:47 UTC (permalink / raw)
To: ceph-devel, linux-afs, linux-btrfs, linux-kernel, linux-mm,
netdev, Chris Mason, David S. Miller, David Sterba, Josef Bacik
Cc: Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
William Kucharski, Chad Mynhier, Kirill A. Shutemov,
Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
Thomas Gleixner, Jérôme Glisse, Amir Goldstein,
Jason Gunthorpe, Michal Hocko, Jann Horn, David Howells,
John Hubbard, Souptick Joarder, john.hubbard, Jan Kara,
Andrey Konovalov, Arun KS, Aneesh Kumar K.V, Jeff Layton,
Yangtao Li, Andrew Morton, Robin Murphy, Mike Rapoport,
David Rientjes, Andrey Ryabinin, Yafang Shao, Huang Shijie,
Yang Shi, Miklos Szeredi, Pavel Tatashin, Kirill Tkhai,
Sage Weil, Ira Weiny, Dan Williams, Darrick J. Wong, Gao Xiang,
Bartlomiej Zolnierkiewicz, Ross Zwisler, root
From: root <root@localhost.localdomain>
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
---
include/linux/huge_mm.h | 16 ++-
include/linux/mm.h | 6 +
mm/Kconfig | 15 ++
mm/filemap.c | 301 +++++++++++++++++++++++++++++++++++++++-
mm/huge_memory.c | 3 +
mm/mmap.c | 36 ++++-
mm/rmap.c | 8 ++
7 files changed, 374 insertions(+), 11 deletions(-)
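The address arithmetic at the top of filemap_huge_fault() in the diff below can
be sketched in userspace as follows. This is an illustration under assumed
x86-64 constants (4 KiB base pages, 2 MiB PMD pages); struct huge_range and
pmd_range_fits() are hypothetical names that do not appear in the patch, and
the kernel code additionally rejects write faults and non-PMD entry sizes.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed x86-64 values; the kernel takes these from arch headers. */
#define PAGE_SHIFT		12
#define PMD_SHIFT		21
#define HPAGE_PMD_ORDER		(PMD_SHIFT - PAGE_SHIFT)
#define HPAGE_PMD_NR		(1UL << HPAGE_PMD_ORDER)
#define HPAGE_PMD_SIZE		(1UL << PMD_SHIFT)
#define HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
#define HPAGE_PMD_MASK		(~HPAGE_PMD_OFFSET)

/* Hypothetical container for the two values the fault handler derives. */
struct huge_range {
	unsigned long haddr;	/* PMD-aligned virtual address */
	unsigned long hindex;	/* first page cache index of the huge page */
};

/*
 * Hypothetical helper: returns true and fills *out when the PMD-sized
 * range containing 'address' lies entirely within [vm_start, vm_end).
 */
static bool pmd_range_fits(unsigned long address, unsigned long pgoff,
			   unsigned long vm_start, unsigned long vm_end,
			   struct huge_range *out)
{
	unsigned long haddr = address & HPAGE_PMD_MASK;

	/* The faulting address is expected to lie inside the vma. */
	assert(vm_start <= address && address < vm_end);

	if (haddr < vm_start || haddr + HPAGE_PMD_SIZE > vm_end)
		return false;

	out->haddr = haddr;
	/* Equivalent to round_down(pgoff, HPAGE_PMD_NR) for a power of two. */
	out->hindex = pgoff & ~(HPAGE_PMD_NR - 1);
	return true;
}
```

When the helper returns false, the corresponding kernel path returns
VM_FAULT_FALLBACK and the fault is retried with base pages.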
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 45ede62aa85b..34723f7e75d0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -79,13 +79,15 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define HPAGE_PMD_SHIFT PMD_SHIFT
-#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
-#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
-
-#define HPAGE_PUD_SHIFT PUD_SHIFT
-#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
-#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
+#define HPAGE_PMD_OFFSET (HPAGE_PMD_SIZE - 1)
+#define HPAGE_PMD_MASK (~(HPAGE_PMD_OFFSET))
+
+#define HPAGE_PUD_SHIFT PUD_SHIFT
+#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
+#define HPAGE_PUD_OFFSET (HPAGE_PUD_SIZE - 1)
+#define HPAGE_PUD_MASK (~(HPAGE_PUD_OFFSET))
extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..ba24b515468a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2433,6 +2433,12 @@ extern void truncate_inode_pages_final(struct address_space *);
/* generic vm_area_ops exported for stackable file systems */
extern vm_fault_t filemap_fault(struct vm_fault *vmf);
+
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size);
+#endif
+
extern void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..2debaded0e4d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL
config ARCH_HAS_HUGEPD
bool
+config RO_EXEC_FILEMAP_HUGE_FAULT_THP
+ bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM
+
+ help
+ Introduce filemap_huge_fault() to automatically map executable
+ read-only pages of mapped files of suitable size and alignment
+ using THP if possible.
+
+ This is marked experimental because it is a new feature and is
+ dependent upon filesystems implementing readpages() in a way
+ that will recognize large THP pages and read file content to
+ them without polluting the pagecache with PAGESIZE pages due
+ to readahead.
+
endmenu
diff --git a/mm/filemap.c b/mm/filemap.c
index eb4c87428099..4e7287db0d8e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -199,6 +199,8 @@ static void unaccount_page_cache_page(struct address_space *mapping,
nr = hpage_nr_pages(page);
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
if (PageSwapBacked(page)) {
__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
if (PageTransHuge(page))
@@ -206,6 +208,13 @@ static void unaccount_page_cache_page(struct address_space *mapping,
} else {
VM_BUG_ON_PAGE(PageTransHuge(page), page);
}
+#else
+ if (PageSwapBacked(page))
+ __mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
+
+ if (PageTransHuge(page))
+ __dec_node_page_state(page, NR_SHMEM_THPS);
+#endif
/*
* At this point page must be either written or cleaned by
@@ -1615,7 +1624,7 @@ EXPORT_SYMBOL(find_lock_entry);
* - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
* its own locking dance if the page is already in cache, or unlock the page
* before returning if we had to add the page to pagecache.
- * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page.
+ * - FGP_PMD: If FGP_CREAT is specified, attempt to allocate a PMD-sized page
*
* If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
* if the GFP flags specified for FGP_CREAT are atomic.
@@ -1667,7 +1676,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
if (fgp_flags & FGP_NOFS)
gfp_mask &= ~__GFP_FS;
- page = __page_cache_alloc(gfp_mask, fgp_order(fgp_flags));
+ page = __page_cache_alloc(gfp_mask, fgp_get_order(fgp_flags));
if (!page)
return NULL;
@@ -2642,6 +2651,291 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+/*
+ * Check for an entry in the page cache which would conflict with the address
+ * range we wish to map using a THP or is otherwise unusable to map a large
+ * cached page.
+ *
+ * The routine will return true if a usable page is found in the page cache
+ * (and *pagep will be set to the address of the cached page), or if no
+ * cached page is found (and *pagep will be set to NULL).
+ */
+static bool
+filemap_huge_check_pagecache_usable(struct xa_state *xasp,
+ struct page **pagep, pgoff_t hindex, pgoff_t hindex_max)
+{
+ struct page *page;
+
+ while (1) {
+ page = xas_find(xasp, hindex_max);
+
+ if (xas_retry(xasp, page)) {
+ xas_set(xasp, hindex);
+ continue;
+ }
+
+ /*
+ * A found entry is unusable if:
+ * + the entry is an Xarray value, not a pointer
+ * + the entry is an internal Xarray node
+ * + the entry is not a Transparent Huge Page
+ * + the entry is not a compound page
+ * + the entry is not the head of a compound page
+ * + the entry is a page with an order other than
+ * HPAGE_PMD_ORDER
+ * + the page's index is not what we expect it to be
+ * + the page is not up-to-date
+ * + the page is unlocked
+ */
+ if ((page) && (xa_is_value(page) || xa_is_internal(page) ||
+ (!PageCompound(page)) || (PageHuge(page)) ||
+ (!PageTransCompound(page)) ||
+ page != compound_head(page) ||
+ compound_order(page) != HPAGE_PMD_ORDER ||
+ page->index != hindex || (!PageUptodate(page)) ||
+ (!PageLocked(page))))
+ return false;
+
+ break;
+ }
+
+ xas_set(xasp, hindex);
+ *pagep = page;
+ return true;
+}
+
+/**
+ * filemap_huge_fault - read in file data for page fault handling to THP
+ * @vmf: struct vm_fault containing details of the fault
+ * @pe_size: large page size to map, currently this must be PE_SIZE_PMD
+ *
+ * filemap_huge_fault() is invoked via the vma operations vector for a
+ * mapped memory region to read in file data to a transparent huge page during
+ * a page fault.
+ *
+ * If for any reason we can't allocate a THP, map it or add it to the page
+ * cache, VM_FAULT_FALLBACK will be returned which will cause the fault
+ * handler to try mapping the page using a PAGESIZE page, usually via
+ * filemap_fault() if so specified in the vma operations vector.
+ *
+ * Returns either VM_FAULT_FALLBACK or the result of calling alloc_set_pte()
+ * to map the new THP.
+ *
+ * NOTE: This routine depends upon the file system's readpage routine as
+ * specified in the address space operations vector to recognize when it
+ * is being passed a large page and to read the appropriate amount of data
+ * in full and without polluting the page cache for the large page itself
+ * with PAGESIZE pages to perform a buffered read or to pollute what
+ * would be the page cache space for any succeeding pages with PAGESIZE
+ * pages due to readahead.
+ *
+ * It is VITAL that this routine not be enabled without such filesystem
+ * support. As there is no way to determine how many bytes were read by
+ * the readpage() operation, if only a PAGESIZE page is read, this routine
+ * will map the THP containing only the first PAGESIZE bytes of file data
+ * to satisfy the fault, which is never the result desired.
+ */
+vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size)
+{
+ struct file *filp = vmf->vma->vm_file;
+ struct address_space *mapping = filp->f_mapping;
+ struct vm_area_struct *vma = vmf->vma;
+
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ pgoff_t hindex = round_down(vmf->pgoff, HPAGE_PMD_NR);
+ pgoff_t hindex_max = hindex + HPAGE_PMD_NR;
+
+ struct page *cached_page, *hugepage;
+ struct page *new_page = NULL;
+
+ vm_fault_t ret = VM_FAULT_FALLBACK;
+ int error;
+
+ XA_STATE_ORDER(xas, &mapping->i_pages, hindex, HPAGE_PMD_ORDER);
+
+ /*
+ * Return VM_FAULT_FALLBACK if:
+ *
+ * + pe_size != PE_SIZE_PMD
+ * + FAULT_FLAG_WRITE is set in vmf->flags
+ * + vma isn't aligned to allow a PMD mapping
+ * + PMD would extend beyond the end of the vma
+ */
+ if (pe_size != PE_SIZE_PMD || (vmf->flags & FAULT_FLAG_WRITE) ||
+ (haddr < vma->vm_start ||
+ (haddr + HPAGE_PMD_SIZE > vma->vm_end)))
+ return ret;
+
+ xas_lock_irq(&xas);
+
+retry_xas_locked:
+ if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
+ hindex_max)) {
+ /* found a conflicting entry in the page cache, so fall back */
+ goto unlock;
+ } else if (cached_page) {
+ /* found a valid cached page, so map it */
+ hugepage = cached_page;
+ goto map_huge;
+ }
+
+ xas_unlock_irq(&xas);
+
+ /* allocate huge THP page in VMA */
+ new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP |
+ __GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER);
+
+ if (unlikely(!new_page))
+ return ret;
+
+ if (unlikely(!(PageCompound(new_page)))) {
+ put_page(new_page);
+ return ret;
+ }
+
+ prep_transhuge_page(new_page);
+ new_page->index = hindex;
+ new_page->mapping = mapping;
+
+ __SetPageLocked(new_page);
+
+ /*
+ * The readpage() operation below is expected to fill the large
+ * page with data without polluting the page cache with
+ * PAGESIZE entries due to a buffered read and/or readahead().
+ *
+ * A filesystem's vm_operations_struct huge_fault field should
+ * never point to this routine without such a capability, and
+ * without it a call to this routine would eventually just
+ * fall through to the normal fault op anyway.
+ */
+ error = mapping->a_ops->readpage(vmf->vma->vm_file, new_page);
+
+ if (unlikely(error)) {
+ put_page(new_page);
+ return ret;
+ }
+
+ /* XXX - use wait_on_page_locked_killable() instead? */
+ wait_on_page_locked(new_page);
+
+ if (!PageUptodate(new_page)) {
+ /* EIO */
+ new_page->mapping = NULL;
+ put_page(new_page);
+ return ret;
+ }
+
+ do {
+ xas_lock_irq(&xas);
+ xas_set(&xas, hindex);
+ xas_create_range(&xas);
+
+ if (!(xas_error(&xas)))
+ break;
+
+ if (!xas_nomem(&xas, GFP_KERNEL)) {
+ if (new_page) {
+ new_page->mapping = NULL;
+ put_page(new_page);
+ }
+
+ goto unlock;
+ }
+
+ xas_unlock_irq(&xas);
+ } while (1);
+
+ /*
+ * Double check that an entry did not sneak into the page cache while
+ * creating Xarray entries for the new page.
+ */
+ if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex,
+ hindex_max)) {
+ /*
+ * An unusable entry was found, so delete the newly allocated
+ * page and fall back.
+ */
+ new_page->mapping = NULL;
+ put_page(new_page);
+ goto unlock;
+ } else if (cached_page) {
+ /*
+ * A valid large page was found in the page cache, so free the
+ * newly allocated page and map the cached page instead.
+ */
+ new_page->mapping = NULL;
+ put_page(new_page);
+ new_page = NULL;
+ hugepage = cached_page;
+ goto map_huge;
+ }
+
+ __SetPageLocked(new_page);
+
+ /* did it get truncated? */
+ if (unlikely(new_page->mapping != mapping)) {
+ unlock_page(new_page);
+ put_page(new_page);
+ goto retry_xas_locked;
+ }
+
+ hugepage = new_page;
+
+map_huge:
+ /* map hugepage at the PMD level */
+ ret = alloc_set_pte(vmf, NULL, hugepage);
+
+ VM_BUG_ON_PAGE((!(pmd_trans_huge(*vmf->pmd))), hugepage);
+
+ if (likely(!(ret & VM_FAULT_ERROR))) {
+ /*
+ * The alloc_set_pte() succeeded without error, so
+ * add the page to the page cache if it is new, and
+ * increment page statistics accordingly.
+ */
+ if (new_page) {
+ unsigned long nr;
+
+ xas_set(&xas, hindex);
+
+ for (nr = 0; nr < HPAGE_PMD_NR; nr++) {
+#ifndef COMPOUND_PAGES_HEAD_ONLY
+ xas_store(&xas, new_page + nr);
+#else
+ xas_store(&xas, new_page);
+#endif
+ xas_next(&xas);
+ }
+
+ count_vm_event(THP_FILE_ALLOC);
+ __inc_node_page_state(new_page, NR_SHMEM_THPS);
+ __mod_node_page_state(page_pgdat(new_page),
+ NR_FILE_PAGES, HPAGE_PMD_NR);
+ __mod_node_page_state(page_pgdat(new_page),
+ NR_SHMEM, HPAGE_PMD_NR);
+ }
+
+ vmf->address = haddr;
+ vmf->page = hugepage;
+
+ page_ref_add(hugepage, HPAGE_PMD_NR);
+ count_vm_event(THP_FILE_MAPPED);
+ } else if (new_page) {
+ /* there was an error mapping the new page, so release it */
+ new_page->mapping = NULL;
+ put_page(new_page);
+ }
+
+unlock:
+ xas_unlock_irq(&xas);
+ return ret;
+}
+EXPORT_SYMBOL(filemap_huge_fault);
+#endif
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
@@ -2924,7 +3218,8 @@ struct page *read_cache_page(struct address_space *mapping,
EXPORT_SYMBOL(read_cache_page);
/**
- * read_cache_page_gfp - read into page cache, using specified page allocation flags.
+ * read_cache_page_gfp - read into page cache, using specified page allocation
+ * flags.
* @mapping: the page's address_space
* @index: the page index
* @gfp: the page allocator flags to use if allocating
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1334ede667a8..26d74466d1f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -543,8 +543,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
if (addr)
goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
goto out;
+#endif
addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
if (addr)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..96ff80d2a8fb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1391,6 +1391,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
int pkey = 0;
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+ unsigned long vm_maywrite = VM_MAYWRITE;
+#endif
+
*populate = 0;
if (!len)
@@ -1429,7 +1433,33 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
- addr = get_unmapped_area(file, addr, len, pgoff, flags);
+
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+ /*
+ * If THP is enabled, it's a read-only executable that is
+ * MAP_PRIVATE mapped, the length is larger than a PMD page
+ * and either it's not a MAP_FIXED mapping or the passed address is
+ * properly aligned for a PMD page, attempt to get an appropriate
+ * address at which to map a PMD-sized THP page, otherwise call the
+ * normal routine.
+ */
+ if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
+ (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
+ (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE &&
+ (!(addr & HPAGE_PMD_OFFSET))) {
+ addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
+
+ if (addr && (!(addr & HPAGE_PMD_OFFSET)))
+ vm_maywrite = 0;
+ else
+ addr = get_unmapped_area(file, addr, len, pgoff, flags);
+ } else {
+#endif
+ addr = get_unmapped_area(file, addr, len, pgoff, flags);
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+ }
+#endif
+
if (offset_in_page(addr))
return addr;
@@ -1451,7 +1481,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* of the memory object, so we don't do any here.
*/
vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
+ mm->def_flags | VM_MAYREAD | vm_maywrite | VM_MAYEXEC;
+#else
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+#endif
if (flags & MAP_LOCKED)
if (!can_do_mlock())
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..503612d3b52b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1192,7 +1192,11 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
+
__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (PageTransCompound(page) && page_mapping(page)) {
@@ -1232,7 +1236,11 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+
+#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
+
__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
--
2.21.0
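The eligibility test added to do_mmap() above can be expressed as a standalone
predicate. This is a userspace sketch: wants_thp_area() is a hypothetical name,
and HPAGE_PMD_SIZE is assumed to be 2 MiB as on x86-64. It mirrors the
condition as written in the hunk, which requires both that MAP_FIXED is not set
and that the hint address is PMD-aligned, and which accepts a length equal to
HPAGE_PMD_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed x86-64 PMD size (2 MiB); the kernel derives it from PMD_SHIFT. */
#define HPAGE_PMD_SIZE		(1UL << 21)
#define HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)

/*
 * Hypothetical predicate mirroring the do_mmap() condition: a private,
 * read-only executable mapping of at least one PMD in length, not
 * MAP_FIXED, whose hint address (possibly 0) is PMD-aligned.
 */
static bool wants_thp_area(unsigned long addr, size_t len, int prot, int flags)
{
	return (prot & PROT_READ) && (prot & PROT_EXEC) &&
	       !(prot & PROT_WRITE) && (flags & MAP_PRIVATE) &&
	       !(flags & MAP_FIXED) && len >= HPAGE_PMD_SIZE &&
	       !(addr & HPAGE_PMD_OFFSET);
}
```

In the patch, a mapping that passes this test and receives a PMD-aligned
address from thp_get_unmapped_area() also has VM_MAYWRITE withheld; otherwise
the code falls back to get_unmapped_area().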
* Re: [PATCH 1/2] mm: Allow the page cache to allocate large pages
2019-07-28 22:47 ` [PATCH 1/2] mm: Allow the page cache to allocate large pages William Kucharski
@ 2019-07-29 20:00 ` kbuild test robot
0 siblings, 0 replies; 4+ messages in thread
From: kbuild test robot @ 2019-07-29 20:00 UTC (permalink / raw)
To: William Kucharski
Cc: kbuild-all, ceph-devel, linux-afs, linux-btrfs, linux-kernel,
linux-mm, netdev, Chris Mason, David S. Miller, David Sterba,
Josef Bacik, Dave Hansen, Song Liu, Bob Kasten, Mike Kravetz,
William Kucharski, Chad Mynhier, Kirill A. Shutemov,
Johannes Weiner, Matthew Wilcox, Dave Airlie, Vlastimil Babka,
Keith Busch, Ralph Campbell, Steve Capper, Dave Chinner,
Sean Christopherson, Hugh Dickins, Ilya Dryomov, Alexander Duyck,
Thomas Gleixner, Jérôme Glisse,
Amir Goldstein, Jason Gunthorpe, Michal Hocko, Jann Horn,
David Howells, John Hubbard, Souptick Joarder, john.hubbard,
Jan Kara, Andrey Konovalov, Arun KS, Aneesh Kumar K.V,
Jeff Layton, Yangtao Li, Andrew Morton, Robin Murphy,
Mike Rapoport, David Rientjes, Andrey Ryabinin, Yafang Shao,
Huang Shijie, Yang Shi, Miklos Szeredi, Pavel Tatashin,
Kirill Tkhai, Sage Weil, Ira Weiny, Dan Williams,
Darrick J. Wong, Gao Xiang, Bartlomiej Zolnierkiewicz,
Ross Zwisler
[-- Attachment #1: Type: text/plain, Size: 1467 bytes --]
Hi William,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc2 next-20190729]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/William-Kucharski/mm-thp-Add-filemap_huge_fault-for-THP/20190730-012407
config: i386-allnoconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-10) 7.4.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>
Note: the linux-review/William-Kucharski/mm-thp-Add-filemap_huge_fault-for-THP/20190730-012407 HEAD f8fb164fdd02af659d9a2ae4e1b6c790057bdcd4 builds fine.
It only hurts bisectability.
All errors (new ones prefixed by >>):
mm/filemap.c: In function 'pagecache_get_page':
>> mm/filemap.c:1670:39: error: implicit declaration of function 'fgp_order'; did you mean 'page_order'? [-Werror=implicit-function-declaration]
page = __page_cache_alloc(gfp_mask, fgp_order(fgp_flags));
^~~~~~~~~
page_order
cc1: some warnings being treated as errors
vim +1670 mm/filemap.c
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7282 bytes --]