* [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults
@ 2015-10-16 22:08 Mike Kravetz
2015-10-16 22:08 ` [PATCH 1/3] mm/hugetlb: Define hugetlb_falloc structure for hole punch race Mike Kravetz
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Mike Kravetz @ 2015-10-16 22:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Dave Hansen, Naoya Horiguchi, Hugh Dickins, Davidlohr Bueso,
Andrew Morton, Mike Kravetz
The hugetlbfs fallocate hole punch code can race with page faults. The
result is that after a hole punch operation, pages may remain within the
hole. No other side effects of this race were observed.
In preparation for adding userfaultfd support to hugetlbfs, it is desirable
to plug or significantly shrink this hole. This patch set uses the same
mechanism employed in shmem (see commit f00cdc6df7).
hugetlb_fault_mutex_table is already used in hugetlbfs for fault
synchronization, and there is no swap for hugetlbfs, so this code is
simpler than in shmem. In fact, hugetlb_fault_mutex_table alone could
handle races with small hole punch operations. However, we need something
that will work for large holes as well.
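The scheme can be illustrated with a minimal userspace sketch (pthreads
stand in for inode->i_lock and the on-stack wait queue; all names here are
illustrative, not kernel API): the hole-punch side publishes a
stack-allocated state struct, faulters in the hole wait on it, and the
puncher unpublishes and wakes everyone when done.

```c
#include <pthread.h>
#include <assert.h>

/* Userspace model of struct hugetlb_falloc: the puncher publishes
 * this (stack-allocated) struct; faulters in [start, end] wait. */
struct falloc_state {
	pthread_cond_t *waitq;	/* models the on-stack wait queue */
	long start, end;	/* hole range in page-index units */
};

static pthread_mutex_t i_lock = PTHREAD_MUTEX_INITIALIZER; /* inode->i_lock */
static struct falloc_state *i_private;			   /* inode->i_private */
static int punch_done;

/* Fault path: if a punch covering idx is in progress, wait for it. */
static void fault(long idx)
{
	pthread_mutex_lock(&i_lock);
	while (i_private && idx >= i_private->start && idx <= i_private->end)
		pthread_cond_wait(i_private->waitq, &i_lock);
	pthread_mutex_unlock(&i_lock);
	/* ...proceed with the fault... */
}

/* Hole punch: publish state, do the work, unpublish and wake waiters. */
static void punch_hole(long start, long end)
{
	pthread_cond_t waitq = PTHREAD_COND_INITIALIZER;
	struct falloc_state state = {
		.waitq = &waitq, .start = start, .end = end
	};

	pthread_mutex_lock(&i_lock);
	i_private = &state;
	pthread_mutex_unlock(&i_lock);

	/* ...unmap and remove pages in [start, end]... */

	pthread_mutex_lock(&i_lock);
	i_private = NULL;
	punch_done = 1;
	pthread_cond_broadcast(&waitq);	/* wake_up_all() */
	pthread_mutex_unlock(&i_lock);
}
```

Note the kernel version waits once via prepare_to_wait()/schedule() rather
than a condition-variable loop; the condvar is only a convenience here.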
Mike Kravetz (3):
mm/hugetlb: Define hugetlb_falloc structure for hole punch race
mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
mm/hugetlb: page faults check for fallocate hole punch in progress and
wait
fs/hugetlbfs/inode.c | 26 +++++++++++++++++++++++---
include/linux/hugetlb.h | 10 ++++++++++
mm/hugetlb.c | 37 +++++++++++++++++++++++++++++++++++++
3 files changed, 70 insertions(+), 3 deletions(-)
--
2.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [PATCH 1/3] mm/hugetlb: Define hugetlb_falloc structure for hole punch race
2015-10-16 22:08 [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Mike Kravetz
@ 2015-10-16 22:08 ` Mike Kravetz
2015-10-16 22:08 ` [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch Mike Kravetz
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2015-10-16 22:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Dave Hansen, Naoya Horiguchi, Hugh Dickins, Davidlohr Bueso,
Andrew Morton, Mike Kravetz
A hugetlb_falloc structure is pointed to by i_private during fallocate
hole punch operations. Page faults check this structure and if they are
in the hole, wait for the operation to finish.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
include/linux/hugetlb.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 685c262..4be35b9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -42,6 +42,16 @@ struct resv_map {
extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);
+/*
+ * hugetlb_falloc is used to prevent page faults during falloc hole punch
+ * operations. During hole punch, inode->i_private points to this struct.
+ */
+struct hugetlb_falloc {
+ wait_queue_head_t *waitq; /* Page faults waiting on hole punch */
+ pgoff_t start; /* Start of fallocate hole */
+ pgoff_t end; /* End of fallocate hole */
+};
+
extern spinlock_t hugetlb_lock;
extern int hugetlb_max_hstate __read_mostly;
#define for_each_hstate(h) \
--
2.4.3
* [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
2015-10-16 22:08 [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Mike Kravetz
2015-10-16 22:08 ` [PATCH 1/3] mm/hugetlb: Define hugetlb_falloc structure for hole punch race Mike Kravetz
@ 2015-10-16 22:08 ` Mike Kravetz
2015-10-19 23:16 ` Andrew Morton
2015-10-16 22:08 ` [PATCH 3/3] mm/hugetlb: page faults check for fallocate hole punch in progress and wait Mike Kravetz
2015-10-19 23:18 ` [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Andrew Morton
3 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2015-10-16 22:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Dave Hansen, Naoya Horiguchi, Hugh Dickins, Davidlohr Bueso,
Andrew Morton, Mike Kravetz
When performing a fallocate hole punch, set up a hugetlb_falloc struct
and make i_private point to it. i_private will point to this struct for
the duration of the operation. At the end of the operation, wake up
anyone who faulted on the hole and is on the waitq.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
fs/hugetlbfs/inode.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 316adb9..b847e72 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -507,7 +507,9 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
struct hstate *h = hstate_inode(inode);
loff_t hpage_size = huge_page_size(h);
+ unsigned long hpage_shift = huge_page_shift(h);
loff_t hole_start, hole_end;
+ struct hugetlb_falloc hugetlb_falloc;
/*
* For hole punch round up the beginning offset of the hole and
@@ -518,8 +520,23 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (hole_end > hole_start) {
struct address_space *mapping = inode->i_mapping;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(hugetlb_falloc_waitq);
+
+ /*
+ * Page faults on the area to be hole punched must be
+ * stopped during the operation. Initialize struct and
+ * have inode->i_private point to it.
+ */
+ hugetlb_falloc.waitq = &hugetlb_falloc_waitq;
+ hugetlb_falloc.start = hole_start >> hpage_shift;
+ hugetlb_falloc.end = hole_end >> hpage_shift;
mutex_lock(&inode->i_mutex);
+
+ spin_lock(&inode->i_lock);
+ inode->i_private = &hugetlb_falloc;
+ spin_unlock(&inode->i_lock);
+
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmdelete_list(&mapping->i_mmap,
@@ -527,6 +544,12 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hole_end >> PAGE_SHIFT);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+
+ spin_lock(&inode->i_lock);
+ inode->i_private = NULL;
+ wake_up_all(&hugetlb_falloc_waitq);
+ spin_unlock(&inode->i_lock);
+
mutex_unlock(&inode->i_mutex);
}
@@ -647,9 +670,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
i_size_write(inode, offset + len);
inode->i_ctime = CURRENT_TIME;
- spin_lock(&inode->i_lock);
- inode->i_private = NULL;
- spin_unlock(&inode->i_lock);
out:
mutex_unlock(&inode->i_mutex);
return error;
--
2.4.3
* [PATCH 3/3] mm/hugetlb: page faults check for fallocate hole punch in progress and wait
2015-10-16 22:08 [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Mike Kravetz
2015-10-16 22:08 ` [PATCH 1/3] mm/hugetlb: Define hugetlb_falloc structure for hole punch race Mike Kravetz
2015-10-16 22:08 ` [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch Mike Kravetz
@ 2015-10-16 22:08 ` Mike Kravetz
2015-10-19 23:18 ` [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Andrew Morton
3 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2015-10-16 22:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Dave Hansen, Naoya Horiguchi, Hugh Dickins, Davidlohr Bueso,
Andrew Morton, Mike Kravetz
At page fault time, check i_private which indicates a fallocate hole punch
is in progress. If the fault falls within the hole, wait for the hole
punch operation to complete before proceeding with the fault.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c7db92..540d3a79 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3580,6 +3580,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *pagecache_page = NULL;
struct hstate *h = hstate_vma(vma);
struct address_space *mapping;
+ struct inode *inode = file_inode(vma->vm_file);
int need_wait_lock = 0;
address &= huge_page_mask(h);
@@ -3603,6 +3604,42 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
idx = vma_hugecache_offset(h, vma, address);
/*
+ * page faults could race with fallocate hole punch. If a page
+ * is faulted between unmap and deallocation, it will still remain
+ * in the punched hole. During hole punch operations, a hugetlb_falloc
+ * structure will be pointed to by i_private. If this fault is for
+ * a page in a hole being punched, wait for the operation to finish
+ * before proceeding.
+ *
+ * Even with this strategy, it is still possible for a page fault to
+ * race with hole punch. However, the race window is considerably
+ * smaller.
+ */
+ if (unlikely(inode->i_private)) {
+ struct hugetlb_falloc *hugetlb_falloc;
+
+ spin_lock(&inode->i_lock);
+ hugetlb_falloc = inode->i_private;
+ if (hugetlb_falloc && hugetlb_falloc->waitq &&
+ idx >= hugetlb_falloc->start &&
+ idx <= hugetlb_falloc->end) {
+ wait_queue_head_t *hugetlb_falloc_waitq;
+ DEFINE_WAIT(hugetlb_fault_wait);
+
+ hugetlb_falloc_waitq = hugetlb_falloc->waitq;
+ prepare_to_wait(hugetlb_falloc_waitq,
+ &hugetlb_fault_wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode->i_lock);
+ schedule();
+
+ spin_lock(&inode->i_lock);
+ finish_wait(hugetlb_falloc_waitq, &hugetlb_fault_wait);
+ }
+ spin_unlock(&inode->i_lock);
+ }
+
+ /*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
--
2.4.3
* Re: [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
2015-10-16 22:08 ` [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch Mike Kravetz
@ 2015-10-19 23:16 ` Andrew Morton
2015-10-20 1:41 ` Mike Kravetz
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-10-19 23:16 UTC (permalink / raw)
To: Mike Kravetz
Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
Hugh Dickins, Davidlohr Bueso
On Fri, 16 Oct 2015 15:08:29 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> When performing a fallocate hole punch, set up a hugetlb_falloc struct
> and make i_private point to it. i_private will point to this struct for
> the duration of the operation. At the end of the operation, wake up
> anyone who faulted on the hole and is on the waitq.
>
> ...
>
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -507,7 +507,9 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> {
> struct hstate *h = hstate_inode(inode);
> loff_t hpage_size = huge_page_size(h);
> + unsigned long hpage_shift = huge_page_shift(h);
> loff_t hole_start, hole_end;
> + struct hugetlb_falloc hugetlb_falloc;
>
> /*
> * For hole punch round up the beginning offset of the hole and
> @@ -518,8 +520,23 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>
> if (hole_end > hole_start) {
> struct address_space *mapping = inode->i_mapping;
> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(hugetlb_falloc_waitq);
> +
> + /*
> + * Page faults on the area to be hole punched must be
> + * stopped during the operation. Initialize struct and
> + * have inode->i_private point to it.
> + */
> + hugetlb_falloc.waitq = &hugetlb_falloc_waitq;
> + hugetlb_falloc.start = hole_start >> hpage_shift;
> + hugetlb_falloc.end = hole_end >> hpage_shift;
This is a bit neater:
--- a/fs/hugetlbfs/inode.c~mm-hugetlb-setup-hugetlb_falloc-during-fallocate-hole-punch-fix
+++ a/fs/hugetlbfs/inode.c
@@ -509,7 +509,6 @@ static long hugetlbfs_punch_hole(struct
loff_t hpage_size = huge_page_size(h);
unsigned long hpage_shift = huge_page_shift(h);
loff_t hole_start, hole_end;
- struct hugetlb_falloc hugetlb_falloc;
/*
* For hole punch round up the beginning offset of the hole and
@@ -521,15 +520,16 @@ static long hugetlbfs_punch_hole(struct
if (hole_end > hole_start) {
struct address_space *mapping = inode->i_mapping;
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(hugetlb_falloc_waitq);
-
/*
- * Page faults on the area to be hole punched must be
- * stopped during the operation. Initialize struct and
- * have inode->i_private point to it.
+ * Page faults on the area to be hole punched must be stopped
+ * during the operation. Initialize struct and have
+ * inode->i_private point to it.
*/
- hugetlb_falloc.waitq = &hugetlb_falloc_waitq;
- hugetlb_falloc.start = hole_start >> hpage_shift;
- hugetlb_falloc.end = hole_end >> hpage_shift;
+ struct hugetlb_falloc hugetlb_falloc = {
+ .waitq = &hugetlb_falloc_waitq,
+ .start = hole_start >> hpage_shift,
+ .end = hole_end >> hpage_shift
+ };
mutex_lock(&inode->i_mutex);
> mutex_lock(&inode->i_mutex);
> +
> + spin_lock(&inode->i_lock);
> + inode->i_private = &hugetlb_falloc;
> + spin_unlock(&inode->i_lock);
Locking around a single atomic assignment is a bit peculiar. I can
kinda see that it kinda protects the logic in hugetlb_fault(), but I
would like to hear (in comment form) your description of how this logic
works?
> i_mmap_lock_write(mapping);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap))
> hugetlb_vmdelete_list(&mapping->i_mmap,
* Re: [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults
2015-10-16 22:08 [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Mike Kravetz
` (2 preceding siblings ...)
2015-10-16 22:08 ` [PATCH 3/3] mm/hugetlb: page faults check for fallocate hole punch in progress and wait Mike Kravetz
@ 2015-10-19 23:18 ` Andrew Morton
2015-10-20 1:54 ` Mike Kravetz
3 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-10-19 23:18 UTC (permalink / raw)
To: Mike Kravetz
Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
Hugh Dickins, Davidlohr Bueso
On Fri, 16 Oct 2015 15:08:27 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> The hugetlbfs fallocate hole punch code can race with page faults. The
> result is that after a hole punch operation, pages may remain within the
> hole. No other side effects of this race were observed.
>
> In preparation for adding userfaultfd support to hugetlbfs, it is desirable
> to plug or significantly shrink this hole. This patch set uses the same
> mechanism employed in shmem (see commit f00cdc6df7).
>
"still buggy but not as bad as before" isn't what we strive for ;) What
would it take to fix this for real? An exhaustive description of the
bug would be a good starting point, thanks.
* Re: [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
2015-10-19 23:16 ` Andrew Morton
@ 2015-10-20 1:41 ` Mike Kravetz
2015-10-20 2:22 ` Hugh Dickins
0 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2015-10-20 1:41 UTC (permalink / raw)
To: Andrew Morton, Hugh Dickins
Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi, Davidlohr Bueso
On 10/19/2015 04:16 PM, Andrew Morton wrote:
> On Fri, 16 Oct 2015 15:08:29 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
>> When performing a fallocate hole punch, set up a hugetlb_falloc struct
>> and make i_private point to it. i_private will point to this struct for
>> the duration of the operation. At the end of the operation, wake up
>> anyone who faulted on the hole and is on the waitq.
>>
>> ...
>>
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -507,7 +507,9 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>> {
>> struct hstate *h = hstate_inode(inode);
>> loff_t hpage_size = huge_page_size(h);
>> + unsigned long hpage_shift = huge_page_shift(h);
>> loff_t hole_start, hole_end;
>> + struct hugetlb_falloc hugetlb_falloc;
>>
>> /*
>> * For hole punch round up the beginning offset of the hole and
>> @@ -518,8 +520,23 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>
>> if (hole_end > hole_start) {
>> struct address_space *mapping = inode->i_mapping;
>> + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(hugetlb_falloc_waitq);
>> +
>> + /*
>> + * Page faults on the area to be hole punched must be
>> + * stopped during the operation. Initialize struct and
>> + * have inode->i_private point to it.
>> + */
>> + hugetlb_falloc.waitq = &hugetlb_falloc_waitq;
>> + hugetlb_falloc.start = hole_start >> hpage_shift;
>> + hugetlb_falloc.end = hole_end >> hpage_shift;
>
> This is a bit neater:
>
> --- a/fs/hugetlbfs/inode.c~mm-hugetlb-setup-hugetlb_falloc-during-fallocate-hole-punch-fix
> +++ a/fs/hugetlbfs/inode.c
> @@ -509,7 +509,6 @@ static long hugetlbfs_punch_hole(struct
> loff_t hpage_size = huge_page_size(h);
> unsigned long hpage_shift = huge_page_shift(h);
> loff_t hole_start, hole_end;
> - struct hugetlb_falloc hugetlb_falloc;
>
> /*
> * For hole punch round up the beginning offset of the hole and
> @@ -521,15 +520,16 @@ static long hugetlbfs_punch_hole(struct
> if (hole_end > hole_start) {
> struct address_space *mapping = inode->i_mapping;
> DECLARE_WAIT_QUEUE_HEAD_ONSTACK(hugetlb_falloc_waitq);
> -
> /*
> - * Page faults on the area to be hole punched must be
> - * stopped during the operation. Initialize struct and
> - * have inode->i_private point to it.
> + * Page faults on the area to be hole punched must be stopped
> + * during the operation. Initialize struct and have
> + * inode->i_private point to it.
> */
> - hugetlb_falloc.waitq = &hugetlb_falloc_waitq;
> - hugetlb_falloc.start = hole_start >> hpage_shift;
> - hugetlb_falloc.end = hole_end >> hpage_shift;
> + struct hugetlb_falloc hugetlb_falloc = {
> + .waitq = &hugetlb_falloc_waitq,
> + .start = hole_start >> hpage_shift,
> + .end = hole_end >> hpage_shift
> + };
>
> mutex_lock(&inode->i_mutex);
>
>
Thanks!
>> mutex_lock(&inode->i_mutex);
>> +
>> + spin_lock(&inode->i_lock);
>> + inode->i_private = &hugetlb_falloc;
>> + spin_unlock(&inode->i_lock);
>
> Locking around a single atomic assignment is a bit peculiar. I can
> kinda see that it kinda protects the logic in hugetlb_fault(), but I
> would like to hear (in comment form) your description of how this logic
> works?
To be honest, this code/scheme was copied from shmem as it addresses
the same situation there. I did not notice how strange this looks until
you pointed it out. At first glance, the locking does appear to be
unnecessary. The fault code initially checks this value outside the
lock. However, the fault code (on another CPU) will take the lock
and access values within the structure. Without the locking or some other
type of memory barrier here, there is no guarantee that the structure
will be initialized before setting i_private. So, the faulting code
could see invalid values in the structure.
Hugh, is that accurate? You provided the shmem code.
--
Mike Kravetz
>> i_mmap_lock_write(mapping);
>> if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>> hugetlb_vmdelete_list(&mapping->i_mmap,
>
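The ordering Mike describes above (initialize the structure's fields first,
then set i_private under the lock) is release/acquire publication: the
unlock acts as a release barrier, so a reader that observes the pointer is
guaranteed to observe the initialized fields. A hedged userspace
illustration with C11 atomics (names illustrative, not the kernel code):

```c
#include <stdatomic.h>
#include <assert.h>

struct falloc_state { long start, end; };

static _Atomic(struct falloc_state *) i_private;

/* Publisher (fallocate side): initialize fields first, then publish
 * with release semantics, as the locked assignment does in the patch. */
static void publish(struct falloc_state *s, long start, long end)
{
	s->start = start;
	s->end = end;
	atomic_store_explicit(&i_private, s, memory_order_release);
}

/* Reader (fault side): an acquire load pairs with the release store,
 * so the fields are guaranteed initialized if the pointer is seen. */
static struct falloc_state *snap(void)
{
	return atomic_load_explicit(&i_private, memory_order_acquire);
}
```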
* Re: [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults
2015-10-19 23:18 ` [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Andrew Morton
@ 2015-10-20 1:54 ` Mike Kravetz
0 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2015-10-20 1:54 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Dave Hansen, Naoya Horiguchi,
Hugh Dickins, Davidlohr Bueso
On 10/19/2015 04:18 PM, Andrew Morton wrote:
> On Fri, 16 Oct 2015 15:08:27 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
>> The hugetlbfs fallocate hole punch code can race with page faults. The
>> result is that after a hole punch operation, pages may remain within the
>> hole. No other side effects of this race were observed.
>>
>> In preparation for adding userfaultfd support to hugetlbfs, it is desirable
>> to plug or significantly shrink this hole. This patch set uses the same
>> mechanism employed in shmem (see commit f00cdc6df7).
>>
>
> "still buggy but not as bad as before" isn't what we strive for ;) What
> would it take to fix this for real? An exhaustive description of the
> bug would be a good starting point, thanks.
>
Thanks for asking, it made me look closer at ways to resolve this.
The current code in remove_inode_hugepages() does nothing with a page if
it is still mapped. The only way it can be mapped is if we race and take
a page fault after unmapping, but before the page is removed. This patch
set makes that window much smaller, but it still exists.
Instead of "giving up" on a mapped page, remove_inode_hugepages() can go
back and unmap it. I'll code this up tomorrow. Fortunately, it is
pretty easy to hit these races and verify proper behavior.
I'll create a new patch set with this combined code for a complete fix.
--
Mike Kravetz
* Re: [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
2015-10-20 1:41 ` Mike Kravetz
@ 2015-10-20 2:22 ` Hugh Dickins
2015-10-20 3:12 ` Mike Kravetz
0 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2015-10-20 2:22 UTC (permalink / raw)
To: Mike Kravetz
Cc: Andrew Morton, Hugh Dickins, linux-mm, linux-kernel, Dave Hansen,
Naoya Horiguchi, Davidlohr Bueso
On Mon, 19 Oct 2015, Mike Kravetz wrote:
> On 10/19/2015 04:16 PM, Andrew Morton wrote:
> > On Fri, 16 Oct 2015 15:08:29 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> >> mutex_lock(&inode->i_mutex);
> >> +
> >> + spin_lock(&inode->i_lock);
> >> + inode->i_private = &hugetlb_falloc;
> >> + spin_unlock(&inode->i_lock);
> >
> > Locking around a single atomic assignment is a bit peculiar. I can
> > kinda see that it kinda protects the logic in hugetlb_fault(), but I
> > would like to hear (in comment form) your description of how this logic
> > works?
>
> To be honest, this code/scheme was copied from shmem as it addresses
> the same situation there. I did not notice how strange this looks until
> you pointed it out. At first glance, the locking does appear to be
> unnecessary. The fault code initially checks this value outside the
> lock. However, the fault code (on another CPU) will take the lock
> and access values within the structure. Without the locking or some other
> type of memory barrier here, there is no guarantee that the structure
> will be initialized before setting i_private. So, the faulting code
> could see invalid values in the structure.
>
> Hugh, is that accurate? You provided the shmem code.
Yes, I think that's accurate; but confess I'm replying now for the
sake of replying in a rare timely fashion, before having spent any
time looking through your hugetlbfs reimplementation of the same.
The peculiar thing in the shmem case, was that the structure being
pointed to is on the kernel stack of the fallocating task (with
i_mutex guaranteeing only one at a time per file could be doing this):
so the faulting end has to be careful that it's not accessing the
stale memory after the fallocator has retreated back up its stack.
And in the shmem case, this "temporary inode extension" also had to
communicate to shmem_writepage(), the swapout end of things. Which
is not a complication you have with hugetlbfs: perhaps it could be
simpler if just between fallocate and fault, or perhaps not.
Whilst it does all work for tmpfs, it looks as if tmpfs was ahead of
the pack (or trinity was attacking tmpfs before other filesystems),
and the issue of faulting versus holepunching (and DAX) has captured
wider interest recently, with Dave Chinner formulating answers in XFS,
and hoping to set an example for other filesystems.
If that work were further along, and if I had had time to digest any
of what he is doing about it, I would point you in his direction rather
than this; but since this does work for tmpfs, I shouldn't discourage you.
I'll try to take a look through yours in the coming days, but there's
several other patchsets I need to look through too, plus a few more
patches from me, if I can find time to send them in: juggling priorities.
Hugh
* Re: [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch
2015-10-20 2:22 ` Hugh Dickins
@ 2015-10-20 3:12 ` Mike Kravetz
0 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2015-10-20 3:12 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, linux-mm, linux-kernel, Dave Hansen,
Naoya Horiguchi, Davidlohr Bueso
On 10/19/2015 07:22 PM, Hugh Dickins wrote:
> On Mon, 19 Oct 2015, Mike Kravetz wrote:
>> On 10/19/2015 04:16 PM, Andrew Morton wrote:
>>> On Fri, 16 Oct 2015 15:08:29 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>>>> mutex_lock(&inode->i_mutex);
>>>> +
>>>> + spin_lock(&inode->i_lock);
>>>> + inode->i_private = &hugetlb_falloc;
>>>> + spin_unlock(&inode->i_lock);
>>>
>>> Locking around a single atomic assignment is a bit peculiar. I can
>>> kinda see that it kinda protects the logic in hugetlb_fault(), but I
>>> would like to hear (in comment form) your description of how this logic
>>> works?
>>
>> To be honest, this code/scheme was copied from shmem as it addresses
>> the same situation there. I did not notice how strange this looks until
>> you pointed it out. At first glance, the locking does appear to be
>> unnecessary. The fault code initially checks this value outside the
>> lock. However, the fault code (on another CPU) will take the lock
>> and access values within the structure. Without the locking or some other
>> type of memory barrier here, there is no guarantee that the structure
>> will be initialized before setting i_private. So, the faulting code
>> could see invalid values in the structure.
>>
>> Hugh, is that accurate? You provided the shmem code.
>
> Yes, I think that's accurate; but confess I'm replying now for the
> sake of replying in a rare timely fashion, before having spent any
> time looking through your hugetlbfs reimplementation of the same.
>
> The peculiar thing in the shmem case, was that the structure being
> pointed to is on the kernel stack of the fallocating task (with
> i_mutex guaranteeing only one at a time per file could be doing this):
> so the faulting end has to be careful that it's not accessing the
> stale memory after the fallocator has retreated back up its stack.
I used the same code/scheme for hugetlbfs.
> And in the shmem case, this "temporary inode extension" also had to
> communicate to shmem_writepage(), the swapout end of things. Which
> is not a complication you have with hugetlbfs: perhaps it could be
> simpler if just between fallocate and fault, or perhaps not.
Yes, I think it is simpler. At first I was excited because hugetlbfs
already has a table of mutexes to synchronize page faults, and I
thought about using those. If the hole being punched is small, this
works fine. But for large holes you could end up taking all the mutexes
and preventing all huge page faults. :( So, I went back to the scheme
employed by shmem.
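The scaling problem is easy to see: hugetlb hashes each fault to one slot
of a fixed-size mutex table, so serializing a punch would mean holding
every slot the hole's pages hash to, and a large hole quickly covers the
whole table. A small sketch (the hash and table size are illustrative
stand-ins, not the kernel's):

```c
#include <assert.h>
#include <string.h>

#define NUM_FAULT_MUTEXES 8	/* the kernel sizes this from CPU count */

/* Toy stand-in for hugetlb_fault_mutex_hash(): page index -> slot. */
static unsigned long fault_mutex_hash(unsigned long idx)
{
	return idx % NUM_FAULT_MUTEXES;
}

/* Count how many distinct mutex slots a hole of 'npages' touches. */
static int slots_touched(unsigned long start, unsigned long npages)
{
	char seen[NUM_FAULT_MUTEXES];
	int count = 0;
	unsigned long i;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < npages; i++) {
		unsigned long s = fault_mutex_hash(start + i);

		if (!seen[s]) {
			seen[s] = 1;
			count++;
		}
	}
	return count;
}
```

A two-page hole touches two slots, but any hole wider than the table
touches every slot, blocking all huge page faults on the system.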
> Whilst it does all work for tmpfs, it looks as if tmpfs was ahead of
> the pack (or trinity was attacking tmpfs before other filesystems),
> and the issue of faulting versus holepunching (and DAX) has captured
> wider interest recently, with Dave Chinner formulating answers in XFS,
> and hoping to set an example for other filesystems.
>
> If that work were further along, and if I had had time to digest any
> of what he is doing about it, I would point you in his direction rather
> than this; but since this does work for tmpfs, I shouldn't discourage you.
>
> I'll try to take a look through yours in the coming days, but there's
> several other patchsets I need to look through too, plus a few more
> patches from me, if I can find time to send them in: juggling priorities.
>
> Hugh
Thanks, and no hurry. I need to add a little more code to this patch
set to completely handle the race. A new patch will be out in a day or two.
--
Mike Kravetz
end of thread, other threads:[~2015-10-20 3:15 UTC | newest]
Thread overview: 10+ messages
2015-10-16 22:08 [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Mike Kravetz
2015-10-16 22:08 ` [PATCH 1/3] mm/hugetlb: Define hugetlb_falloc structure for hole punch race Mike Kravetz
2015-10-16 22:08 ` [PATCH 2/3] mm/hugetlb: Setup hugetlb_falloc during fallocate hole punch Mike Kravetz
2015-10-19 23:16 ` Andrew Morton
2015-10-20 1:41 ` Mike Kravetz
2015-10-20 2:22 ` Hugh Dickins
2015-10-20 3:12 ` Mike Kravetz
2015-10-16 22:08 ` [PATCH 3/3] mm/hugetlb: page faults check for fallocate hole punch in progress and wait Mike Kravetz
2015-10-19 23:18 ` [PATCH 0/3] hugetlbfs fallocate hole punch race with page faults Andrew Morton
2015-10-20 1:54 ` Mike Kravetz