* [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again
@ 2023-05-02 23:56 Ackerley Tng
2023-05-02 23:56 ` [PATCH 1/2] mm: filemap: Add filemap_has_folio function Ackerley Tng
2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng
0 siblings, 2 replies; 7+ messages in thread
From: Ackerley Tng @ 2023-05-02 23:56 UTC (permalink / raw)
To: akpm, linux-fsdevel, linux-kernel, linux-mm, mike.kravetz,
muchun.song, willy, sidhartha.kumar, jhubbard
Cc: vannapurve, erdemaktas, Ackerley Tng
When fallocate() is used twice on the same offset, it should succeed
and return 0 to userspace. The second fallocate() in
test_multiple_fallocates() in the following code will return -EEXIST
for hugetlbfs, but not tmpfs/shmem.
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <linux/memfd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
void test_multiple_fallocates(unsigned int flags)
{
int fd = memfd_create("foo", flags);
if (fallocate(fd, 0, 0, 0x1000))
printf("error with fallocate %m\n");
if (fallocate(fd, 0, 0, 0x1000))
printf("error with second fallocate %m\n");
}
int main(void) {
printf("Testing tmpfs:\n");
test_multiple_fallocates(0);
printf("Testing hugetlbfs:\n");
test_multiple_fallocates(MFD_HUGETLB | MFD_HUGE_2MB);
}
This patch series replaces page_cache_next_miss(), used to determine a
page cache hit, with a more direct filemap_has_folio() (a new
function).
I hope that this is also the desired refactoring as mentioned in [1].
[1] https://lore.kernel.org/all/Y8oqEOICcNV762IA@casper.infradead.org/
---
Ackerley Tng (2):
mm: filemap: Add filemap_has_folio function
fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
fs/hugetlbfs/inode.c | 6 +-----
include/linux/pagemap.h | 1 +
mm/filemap.c | 17 +++++++++++++++++
3 files changed, 19 insertions(+), 5 deletions(-)
--
2.40.1.495.gc816e09b53d-goog
^ permalink raw reply [flat|nested] 7+ messages in thread* [PATCH 1/2] mm: filemap: Add filemap_has_folio function 2023-05-02 23:56 [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again Ackerley Tng @ 2023-05-02 23:56 ` Ackerley Tng 2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng 1 sibling, 0 replies; 7+ messages in thread From: Ackerley Tng @ 2023-05-02 23:56 UTC (permalink / raw) To: akpm, linux-fsdevel, linux-kernel, linux-mm, mike.kravetz, muchun.song, willy, sidhartha.kumar, jhubbard Cc: vannapurve, erdemaktas, Ackerley Tng filemap_has_folio() will return whether there is a folio at a given index in a mapping. This function does not affect the folio refcount. Signed-off-by: Ackerley Tng <ackerleytng@google.com> --- include/linux/pagemap.h | 1 + mm/filemap.c | 17 +++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a56308a9d1a4..e49f07cdbff7 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -508,6 +508,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping, #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE) +bool filemap_has_folio(struct address_space *mapping, pgoff_t index); void *filemap_get_entry(struct address_space *mapping, pgoff_t index); struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, int fgp_flags, gfp_t gfp); diff --git a/mm/filemap.c b/mm/filemap.c index a34abfe8c654..a7a6e229e33d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1835,6 +1835,23 @@ EXPORT_SYMBOL(page_cache_prev_miss); * folio_put(). */ +/** + * filemap_has_folio - Check if filemap has a folio at given index + * @mapping: The address_space to search. + * @index: The page index. + * + * Unlike filemap_get_entry, this does not increment refcount of the folio. + * + * Return: true if folio exists else false. + */ +bool filemap_has_folio(struct address_space *mapping, pgoff_t index) +{ + void *entry = xa_load(&mapping->i_pages, index); + + return entry && !xa_is_value(entry); +} +EXPORT_SYMBOL(filemap_has_folio); + /* * filemap_get_entry - Get a page cache entry. * @mapping: the address_space to search -- 2.40.1.495.gc816e09b53d-goog ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache 2023-05-02 23:56 [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again Ackerley Tng 2023-05-02 23:56 ` [PATCH 1/2] mm: filemap: Add filemap_has_folio function Ackerley Tng @ 2023-05-02 23:56 ` Ackerley Tng 2023-05-03 3:05 ` Mike Kravetz 1 sibling, 1 reply; 7+ messages in thread From: Ackerley Tng @ 2023-05-02 23:56 UTC (permalink / raw) To: akpm, linux-fsdevel, linux-kernel, linux-mm, mike.kravetz, muchun.song, willy, sidhartha.kumar, jhubbard Cc: vannapurve, erdemaktas, Ackerley Tng When fallocate() is called twice on the same offset in the file, the second fallocate() should succeed. page_cache_next_miss() always advances index before returning, so even on a page cache hit, the check would set present to false. Signed-off-by: Ackerley Tng <ackerleytng@google.com> --- fs/hugetlbfs/inode.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ecfdfb2529a3..f640cff1bbce 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -821,7 +821,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, */ struct folio *folio; unsigned long addr; - bool present; cond_resched(); @@ -845,10 +844,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, mutex_lock(&hugetlb_fault_mutex_table[hash]); /* See if already present in mapping to avoid alloc/free */ - rcu_read_lock(); - present = page_cache_next_miss(mapping, index, 1) != index; - rcu_read_unlock(); - if (present) { + if (filemap_has_folio(mapping, index)) { mutex_unlock(&hugetlb_fault_mutex_table[hash]); hugetlb_drop_vma_policy(&pseudo_vma); continue; -- 2.40.1.495.gc816e09b53d-goog ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache 2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng @ 2023-05-03 3:05 ` Mike Kravetz 2023-05-04 0:14 ` Mike Kravetz 0 siblings, 1 reply; 7+ messages in thread From: Mike Kravetz @ 2023-05-03 3:05 UTC (permalink / raw) To: Ackerley Tng, willy, sidhartha.kumar Cc: akpm, linux-fsdevel, linux-kernel, linux-mm, muchun.song, jhubbard, vannapurve, erdemaktas On 05/02/23 23:56, Ackerley Tng wrote: > When fallocate() is called twice on the same offset in the file, the > second fallocate() should succeed. > > page_cache_next_miss() always advances index before returning, so even > on a page cache hit, the check would set present to false. Thank you Ackerley for finding this! When I read the description of page_cache_next_miss(), I assumed present = page_cache_next_miss(mapping, index, 1) != index; would tell us if there was a page at index in the cache. However, when looking closer at the code it does not check for a page at index, but rather starts looking at index+1. Perhaps that is why it is named next? Matthew, I think the use of the above statement was your suggestion. And you know the xarray code better than anyone. I just want to make sure page_cache_next_miss is operating as designed/expected. If so, then the changes suggested here make sense. In addition, the same code is in hugetlbfs_pagecache_present and will have this same issue. -- Mike Kravetz > > Signed-off-by: Ackerley Tng <ackerleytng@google.com> > --- > fs/hugetlbfs/inode.c | 6 +----- > 1 file changed, 1 insertion(+), 5 deletions(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index ecfdfb2529a3..f640cff1bbce 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -821,7 +821,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, > */ > struct folio *folio; > unsigned long addr; > - bool present; > > cond_resched(); > > @@ -845,10 +844,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, > mutex_lock(&hugetlb_fault_mutex_table[hash]); > > /* See if already present in mapping to avoid alloc/free */ > - rcu_read_lock(); > - present = page_cache_next_miss(mapping, index, 1) != index; > - rcu_read_unlock(); > - if (present) { > + if (filemap_has_folio(mapping, index)) { > mutex_unlock(&hugetlb_fault_mutex_table[hash]); > hugetlb_drop_vma_policy(&pseudo_vma); > continue; > -- > 2.40.1.495.gc816e09b53d-goog > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache 2023-05-03 3:05 ` Mike Kravetz @ 2023-05-04 0:14 ` Mike Kravetz 2023-05-08 16:29 ` Ackerley Tng 0 siblings, 1 reply; 7+ messages in thread From: Mike Kravetz @ 2023-05-04 0:14 UTC (permalink / raw) To: Ackerley Tng, willy, sidhartha.kumar Cc: akpm, linux-fsdevel, linux-kernel, linux-mm, muchun.song, jhubbard, vannapurve, erdemaktas On 05/02/23 20:05, Mike Kravetz wrote: > On 05/02/23 23:56, Ackerley Tng wrote: > > When fallocate() is called twice on the same offset in the file, the > > second fallocate() should succeed. > > > > page_cache_next_miss() always advances index before returning, so even > > on a page cache hit, the check would set present to false. > > Thank you Ackerley for finding this! > > When I read the description of page_cache_next_miss(), I assumed > > present = page_cache_next_miss(mapping, index, 1) != index; > > would tell us if there was a page at index in the cache. > > However, when looking closer at the code it does not check for a page > at index, but rather starts looking at index+1. Perhaps that is why > it is named next? > > Matthew, I think the use of the above statement was your suggestion. > And you know the xarray code better than anyone. I just want to make > sure page_cache_next_miss is operating as designed/expected. If so, > then the changes suggested here make sense. I took a closer look at the code today. page_cache_next_miss has a 'special case' for index 0. The function description says: * Return: The index of the gap if found, otherwise an index outside the * range specified (in which case 'return - index >= max_scan' will be true). * In the rare case of index wrap-around, 0 will be returned. And, the loop in the routine does: while (max_scan--) { void *entry = xas_next(&xas); if (!entry || xa_is_value(entry)) break; if (xas.xa_index == 0) break; } At first glance, I thought xas_next always went to the next entry but now see that is not the case here because this is a new state with xa_node = XAS_RESTART. So, xas_next is effectively a xas_load. This means in the case were index == 0, page_cache_next_miss(mapping, index, 1) will ALWAYS return zero even if a page is present. I need to look at the xarray code and this rare index wrap-around case to see if we can somehow modify that check for xas.xa_index == 0 in page_cache_next_miss. -- Mike Kravetz ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache 2023-05-04 0:14 ` Mike Kravetz @ 2023-05-08 16:29 ` Ackerley Tng 2023-05-08 20:40 ` Ackerley Tng 0 siblings, 1 reply; 7+ messages in thread From: Ackerley Tng @ 2023-05-08 16:29 UTC (permalink / raw) To: Mike Kravetz Cc: willy, sidhartha.kumar, akpm, linux-fsdevel, linux-kernel, linux-mm, muchun.song, jhubbard, vannapurve, erdemaktas I wrote a selftest to verify the behavior of page_cache_next_miss() and indeed you are right about the special case. I'll continue to look into this to figure out why it isn't hitting in the page cache. Thanks Mike! ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache 2023-05-08 16:29 ` Ackerley Tng @ 2023-05-08 20:40 ` Ackerley Tng 0 siblings, 0 replies; 7+ messages in thread From: Ackerley Tng @ 2023-05-08 20:40 UTC (permalink / raw) To: Ackerley Tng Cc: mike.kravetz, willy, sidhartha.kumar, akpm, linux-fsdevel, linux-kernel, linux-mm, muchun.song, jhubbard, vannapurve, erdemaktas, rientjes I figured it out, the bug is still in the use of page_cache_next_miss(), but my earlier suggestion that xas_next always advances the pointer is wrong. Mike is right in that xas_next() is effectively xas_load() for the first call after an XA_STATE() initialiation. However, when max_scan is set to 1, xas.xa_index has no chance to be advanced since the loop condition while(max_scan--) terminates the loop after 1 iteration. Hence, after loading happens with xas_next() regardless of the checks within the loop (!entry, or xa_is_value(entry), etc), xa.xas_index is not advanced, and the index returned always == the index passed in to page_cache_next_miss(). Hence, in hugetlb_fallocate(), it always appears to be a page cache miss. Here's code from a selftest that can be added to lib/test_xarray.c: /* Modified from page_cache_next_miss() */ static unsigned long xa_next_empty(struct xarray *xa, unsigned long start, unsigned long max_scan) { XA_STATE(xas, xa, start); while (max_scan--) { void *entry = xas_next(&xas); if (!entry) { printk("entry not present"); break; } if (xa_is_value(entry)) { printk("xa_is_value instead of pointer"); } if (xas.xa_index == 0) { printk("wraparound"); break; } } if (max_scan == -1) printk("exceeded max_scan"); return xas.xa_index; } /* Replace this function in lib/test_xarray.c to run */ static noinline void check_find(struct xarray *xa) { unsigned long max_scan; xa_init(&xa); xa_store_range(&xa, 3, 5, malloc(10), GFP_KERNEL); max_scan = 1; for (int i = 1; i < 8; i++) printk(" => xa_next_empty(xa, %d, %ld): %ld\n", i, max_scan, xa_next_empty(&xa, i, max_scan)); printk("\n"); max_scan = 2; for (int i = 1; i < 8; i++) printk(" => xa_next_empty(xa, %d, %ld): %ld\n", i, max_scan, xa_next_empty(&xa, i, max_scan)); } Result: entry not present => xa_next_empty(xa, 1, 1): 1 entry not present => xa_next_empty(xa, 2, 1): 2 exceeded max_scan => xa_next_empty(xa, 3, 1): 3 exceeded max_scan => xa_next_empty(xa, 4, 1): 4 exceeded max_scan => xa_next_empty(xa, 5, 1): 5 entry not present => xa_next_empty(xa, 6, 1): 6 entry not present => xa_next_empty(xa, 7, 1): 7 entry not present => xa_next_empty(xa, 1, 2): 1 entry not present => xa_next_empty(xa, 2, 2): 2 exceeded max_scan => xa_next_empty(xa, 3, 2): 4 exceeded max_scan => xa_next_empty(xa, 4, 2): 5 exceeded max_scan => xa_next_empty(xa, 5, 2): 6 entry not present => xa_next_empty(xa, 6, 2): 6 entry not present => xa_next_empty(xa, 7, 2): 7 Since the xarray was set up with pointers in indices 3, 4 and 5, we expect xa_next_empty() or page_cache_next_miss() to return the next index (4, 5 and 6 respectively), but when used with a max_scan of 1, we just get the index passed in. While max_scan could be increased to fix this bug, I feel that having a separate function like filemap_has_folio() makes the intent more explicit and is less reliant on internal values of struct xa_state. xas.xa_index could take other values to indicate wraparound or sibling nodes and I think it is better to use a higher level abstraction like xa_load() (used in filemap_has_folio()). In addition xa_load() also takes the locks it needs, which helps :). I could refactor the other usage of page_cache_next_miss() in hugetlbfs_pagecache_present() if you'd like! On this note, the docstring of page_cache_next_miss() is also inaccurate. The return value is not always outside the range specified if max_scan is too small. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-05-08 20:40 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-05-02 23:56 [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again Ackerley Tng 2023-05-02 23:56 ` [PATCH 1/2] mm: filemap: Add filemap_has_folio function Ackerley Tng 2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng 2023-05-03 3:05 ` Mike Kravetz 2023-05-04 0:14 ` Mike Kravetz 2023-05-08 16:29 ` Ackerley Tng 2023-05-08 20:40 ` Ackerley Tng
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox