linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again
@ 2023-05-02 23:56 Ackerley Tng
  2023-05-02 23:56 ` [PATCH 1/2] mm: filemap: Add filemap_has_folio function Ackerley Tng
  2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng
  0 siblings, 2 replies; 7+ messages in thread
From: Ackerley Tng @ 2023-05-02 23:56 UTC (permalink / raw)
  To: akpm, linux-fsdevel, linux-kernel, linux-mm, mike.kravetz,
	muchun.song, willy, sidhartha.kumar, jhubbard
  Cc: vannapurve, erdemaktas, Ackerley Tng

When fallocate() is used twice on the same offset, it should succeed
and return 0 to userspace. The second fallocate() in
test_multiple_fallocates() in the following code will return -EEXIST
for hugetlbfs, but not tmpfs/shmem.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <linux/memfd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

void test_multiple_fallocates(unsigned int flags)
{
	int fd = memfd_create("foo", flags);

	if (fallocate(fd, 0, 0, 0x1000))
		printf("error with fallocate %m\n");

	if (fallocate(fd, 0, 0, 0x1000))
		printf("error with second fallocate %m\n");
}

int main(void) {

	printf("Testing tmpfs:\n");
	test_multiple_fallocates(0);

	printf("Testing hugetlbfs:\n");
	test_multiple_fallocates(MFD_HUGETLB | MFD_HUGE_2MB);

}

This patch series replaces page_cache_next_miss(), used to determine a
page cache hit, with a more direct filemap_has_folio() (a new
function).

I hope that this is also the desired refactoring as mentioned in [1].

[1] https://lore.kernel.org/all/Y8oqEOICcNV762IA@casper.infradead.org/

---

Ackerley Tng (2):
  mm: filemap: Add filemap_has_folio function
  fs: hugetlbfs: Fix logic to skip allocation on hit in page cache

 fs/hugetlbfs/inode.c    |  6 +-----
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 17 +++++++++++++++++
 3 files changed, 19 insertions(+), 5 deletions(-)

--
2.40.1.495.gc816e09b53d-goog


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] mm: filemap: Add filemap_has_folio function
  2023-05-02 23:56 [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again Ackerley Tng
@ 2023-05-02 23:56 ` Ackerley Tng
  2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng
  1 sibling, 0 replies; 7+ messages in thread
From: Ackerley Tng @ 2023-05-02 23:56 UTC (permalink / raw)
  To: akpm, linux-fsdevel, linux-kernel, linux-mm, mike.kravetz,
	muchun.song, willy, sidhartha.kumar, jhubbard
  Cc: vannapurve, erdemaktas, Ackerley Tng

filemap_has_folio() will return whether there is a folio at a given
index in a mapping. This function does not affect the folio refcount.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..e49f07cdbff7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -508,6 +508,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+bool filemap_has_folio(struct address_space *mapping, pgoff_t index);
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		int fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..a7a6e229e33d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1835,6 +1835,23 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  * folio_put().
  */
 
+/**
+ * filemap_has_folio - Check if filemap has a folio at given index
+ * @mapping: The address_space to search.
+ * @index: The page index.
+ *
+ * Unlike filemap_get_entry, this does not increment refcount of the folio.
+ *
+ * Return: true if folio exists else false.
+ */
+bool filemap_has_folio(struct address_space *mapping, pgoff_t index)
+{
+	void *entry = xa_load(&mapping->i_pages, index);
+
+	return entry && !xa_is_value(entry);
+}
+EXPORT_SYMBOL(filemap_has_folio);
+
 /*
  * filemap_get_entry - Get a page cache entry.
  * @mapping: the address_space to search
-- 
2.40.1.495.gc816e09b53d-goog


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
  2023-05-02 23:56 [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again Ackerley Tng
  2023-05-02 23:56 ` [PATCH 1/2] mm: filemap: Add filemap_has_folio function Ackerley Tng
@ 2023-05-02 23:56 ` Ackerley Tng
  2023-05-03  3:05   ` Mike Kravetz
  1 sibling, 1 reply; 7+ messages in thread
From: Ackerley Tng @ 2023-05-02 23:56 UTC (permalink / raw)
  To: akpm, linux-fsdevel, linux-kernel, linux-mm, mike.kravetz,
	muchun.song, willy, sidhartha.kumar, jhubbard
  Cc: vannapurve, erdemaktas, Ackerley Tng

When fallocate() is called twice on the same offset in the file, the
second fallocate() should succeed.

page_cache_next_miss() always advances index before returning, so even
on a page cache hit, the check would set present to false.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ecfdfb2529a3..f640cff1bbce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -821,7 +821,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 */
 		struct folio *folio;
 		unsigned long addr;
-		bool present;
 
 		cond_resched();
 
@@ -845,10 +844,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		/* See if already present in mapping to avoid alloc/free */
-		rcu_read_lock();
-		present = page_cache_next_miss(mapping, index, 1) != index;
-		rcu_read_unlock();
-		if (present) {
+		if (filemap_has_folio(mapping, index)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			hugetlb_drop_vma_policy(&pseudo_vma);
 			continue;
-- 
2.40.1.495.gc816e09b53d-goog


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
  2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng
@ 2023-05-03  3:05   ` Mike Kravetz
  2023-05-04  0:14     ` Mike Kravetz
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Kravetz @ 2023-05-03  3:05 UTC (permalink / raw)
  To: Ackerley Tng, willy, sidhartha.kumar
  Cc: akpm, linux-fsdevel, linux-kernel, linux-mm, muchun.song,
	jhubbard, vannapurve, erdemaktas

On 05/02/23 23:56, Ackerley Tng wrote:
> When fallocate() is called twice on the same offset in the file, the
> second fallocate() should succeed.
> 
> page_cache_next_miss() always advances index before returning, so even
> on a page cache hit, the check would set present to false.

Thank you Ackerley for finding this!

When I read the description of page_cache_next_miss(), I assumed

	present = page_cache_next_miss(mapping, index, 1) != index;

would tell us if there was a page at index in the cache.

However, when looking closer at the code it does not check for a page
at index, but rather starts looking at index+1.  Perhaps that is why
it is named next?

Matthew, I think the use of the above statement was your suggestion.
And you know the xarray code better than anyone.  I just want to make
sure page_cache_next_miss is operating as designed/expected.  If so,
then the changes suggested here make sense.

In addition, the same code is in hugetlbfs_pagecache_present and will
have this same issue.
-- 
Mike Kravetz

> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  fs/hugetlbfs/inode.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index ecfdfb2529a3..f640cff1bbce 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -821,7 +821,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		 */
>  		struct folio *folio;
>  		unsigned long addr;
> -		bool present;
>  
>  		cond_resched();
>  
> @@ -845,10 +844,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  
>  		/* See if already present in mapping to avoid alloc/free */
> -		rcu_read_lock();
> -		present = page_cache_next_miss(mapping, index, 1) != index;
> -		rcu_read_unlock();
> -		if (present) {
> +		if (filemap_has_folio(mapping, index)) {
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			hugetlb_drop_vma_policy(&pseudo_vma);
>  			continue;
> -- 
> 2.40.1.495.gc816e09b53d-goog
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
  2023-05-03  3:05   ` Mike Kravetz
@ 2023-05-04  0:14     ` Mike Kravetz
  2023-05-08 16:29       ` Ackerley Tng
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Kravetz @ 2023-05-04  0:14 UTC (permalink / raw)
  To: Ackerley Tng, willy, sidhartha.kumar
  Cc: akpm, linux-fsdevel, linux-kernel, linux-mm, muchun.song,
	jhubbard, vannapurve, erdemaktas

On 05/02/23 20:05, Mike Kravetz wrote:
> On 05/02/23 23:56, Ackerley Tng wrote:
> > When fallocate() is called twice on the same offset in the file, the
> > second fallocate() should succeed.
> > 
> > page_cache_next_miss() always advances index before returning, so even
> > on a page cache hit, the check would set present to false.
> 
> Thank you Ackerley for finding this!
> 
> When I read the description of page_cache_next_miss(), I assumed
> 
> 	present = page_cache_next_miss(mapping, index, 1) != index;
> 
> would tell us if there was a page at index in the cache.
> 
> However, when looking closer at the code it does not check for a page
> at index, but rather starts looking at index+1.  Perhaps that is why
> it is named next?
> 
> Matthew, I think the use of the above statement was your suggestion.
> And you know the xarray code better than anyone.  I just want to make
> sure page_cache_next_miss is operating as designed/expected.  If so,
> then the changes suggested here make sense.

I took a closer look at the code today.

page_cache_next_miss has a 'special case' for index 0.  The function
description says:

 * Return: The index of the gap if found, otherwise an index outside the
 * range specified (in which case 'return - index >= max_scan' will be true).
 * In the rare case of index wrap-around, 0 will be returned.

And, the loop in the routine does:

	while (max_scan--) {
		void *entry = xas_next(&xas);
		if (!entry || xa_is_value(entry))
			break;
		if (xas.xa_index == 0)
			break;
	}

At first glance, I thought xas_next always went to the next entry but
now see that is not the case here because this is a new state with
xa_node = XAS_RESTART.  So, xas_next is effectively a xas_load.

This means in the case were index == 0,

	page_cache_next_miss(mapping, index, 1)

will ALWAYS return zero even if a page is present.

I need to look at the xarray code and this rare index wrap-around case
to see if we can somehow modify that check for xas.xa_index == 0 in
page_cache_next_miss.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
  2023-05-04  0:14     ` Mike Kravetz
@ 2023-05-08 16:29       ` Ackerley Tng
  2023-05-08 20:40         ` Ackerley Tng
  0 siblings, 1 reply; 7+ messages in thread
From: Ackerley Tng @ 2023-05-08 16:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: willy, sidhartha.kumar, akpm, linux-fsdevel, linux-kernel,
	linux-mm, muchun.song, jhubbard, vannapurve, erdemaktas



I wrote a selftest to verify the behavior of page_cache_next_miss() and
indeed you are right about the special case.

I'll continue to look into this to figure out why it isn't hitting in
the page cache.

Thanks Mike!


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
  2023-05-08 16:29       ` Ackerley Tng
@ 2023-05-08 20:40         ` Ackerley Tng
  0 siblings, 0 replies; 7+ messages in thread
From: Ackerley Tng @ 2023-05-08 20:40 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: mike.kravetz, willy, sidhartha.kumar, akpm, linux-fsdevel,
	linux-kernel, linux-mm, muchun.song, jhubbard, vannapurve,
	erdemaktas, rientjes


I figured it out, the bug is still in the use of page_cache_next_miss(),
but my earlier suggestion that xas_next always advances the pointer is
wrong.

Mike is right in that xas_next() is effectively xas_load() for the first
call after an XA_STATE() initialiation.

However, when max_scan is set to 1, xas.xa_index has no chance to be
advanced since the loop condition while(max_scan--) terminates the loop
after 1 iteration.

Hence, after loading happens with xas_next() regardless of the checks
within the loop (!entry, or xa_is_value(entry), etc), xa.xas_index is
not advanced, and the index returned always == the index passed in to
page_cache_next_miss().

Hence, in hugetlb_fallocate(), it always appears to be a page cache
miss.

Here's code from a selftest that can be added to lib/test_xarray.c:

/* Modified from page_cache_next_miss() */
static unsigned long xa_next_empty(struct xarray *xa, unsigned long start,  
unsigned long max_scan)
{
	XA_STATE(xas, xa, start);

	while (max_scan--) {
		void *entry = xas_next(&xas);
		if (!entry) {
			printk("entry not present");
			break;
		}
		if (xa_is_value(entry)) {
			printk("xa_is_value instead of pointer");
		}
		if (xas.xa_index == 0) {
			printk("wraparound");
			break;
		}
	}

	if (max_scan == -1)
		printk("exceeded max_scan");

	return xas.xa_index;
}

/* Replace this function in lib/test_xarray.c to run */
static noinline void check_find(struct xarray *xa)
{
	unsigned long max_scan;

	xa_init(&xa);
	xa_store_range(&xa, 3, 5, malloc(10), GFP_KERNEL);

	max_scan = 1;
	for (int i = 1; i < 8; i++)
		printk(" => xa_next_empty(xa, %d, %ld): %ld\n", i, max_scan,  
xa_next_empty(&xa, i, max_scan));

	printk("\n");

	max_scan = 2;
	for (int i = 1; i < 8; i++)
		printk(" => xa_next_empty(xa, %d, %ld): %ld\n", i, max_scan,  
xa_next_empty(&xa, i, max_scan));
}

Result:

entry not present => xa_next_empty(xa, 1, 1): 1
entry not present => xa_next_empty(xa, 2, 1): 2
exceeded max_scan => xa_next_empty(xa, 3, 1): 3
exceeded max_scan => xa_next_empty(xa, 4, 1): 4
exceeded max_scan => xa_next_empty(xa, 5, 1): 5
entry not present => xa_next_empty(xa, 6, 1): 6
entry not present => xa_next_empty(xa, 7, 1): 7

entry not present => xa_next_empty(xa, 1, 2): 1
entry not present => xa_next_empty(xa, 2, 2): 2
exceeded max_scan => xa_next_empty(xa, 3, 2): 4
exceeded max_scan => xa_next_empty(xa, 4, 2): 5
exceeded max_scan => xa_next_empty(xa, 5, 2): 6
entry not present => xa_next_empty(xa, 6, 2): 6
entry not present => xa_next_empty(xa, 7, 2): 7

Since the xarray was set up with pointers in indices 3, 4 and 5, we
expect xa_next_empty() or page_cache_next_miss() to return the next
index (4, 5 and 6 respectively), but when used with a max_scan of 1, we
just get the index passed in.

While max_scan could be increased to fix this bug, I feel that having a
separate function like filemap_has_folio() makes the intent more
explicit and is less reliant on internal values of struct xa_state.

xas.xa_index could take other values to indicate wraparound or sibling
nodes and I think it is better to use a higher level abstraction like
xa_load() (used in filemap_has_folio()). In addition xa_load() also
takes the locks it needs, which helps :).

I could refactor the other usage of page_cache_next_miss() in
hugetlbfs_pagecache_present() if you'd like!

On this note, the docstring of page_cache_next_miss() is also
inaccurate. The return value is not always outside the range specified
if max_scan is too small.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-05-08 20:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-02 23:56 [PATCH 0/2] Fix fallocate error in hugetlbfs when fallocating again Ackerley Tng
2023-05-02 23:56 ` [PATCH 1/2] mm: filemap: Add filemap_has_folio function Ackerley Tng
2023-05-02 23:56 ` [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache Ackerley Tng
2023-05-03  3:05   ` Mike Kravetz
2023-05-04  0:14     ` Mike Kravetz
2023-05-08 16:29       ` Ackerley Tng
2023-05-08 20:40         ` Ackerley Tng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox