* [PATCH 2/4] holepunch: fix shmem_truncate_range punch locking
2007-03-28 14:50 [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Hugh Dickins
@ 2007-03-28 14:51 ` Hugh Dickins
2007-03-29 11:32 ` Miklos Szeredi
2007-03-28 14:52 ` [PATCH 3/4] holepunch: fix disconnected pages after second truncate Hugh Dickins
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2007-03-28 14:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: Miklos Szeredi, Badari Pulavarty, linux-mm
Miklos Szeredi observes that during truncation of shmem page directories,
info->lock is released to improve latency (after lowering i_size and
next_index to exclude races); but this is quite wrong for holepunching,
which receives no such protection from i_size or next_index, and is left
vulnerable to races with shmem_unuse, shmem_getpage and shmem_writepage.
Hold info->lock throughout when holepunching? No, any user could prevent
rescheduling for far too long. Instead take info->lock just when needed:
in shmem_free_swp when removing the swap entries, and whenever removing
a directory page from the level above. But so long as we remove before
scanning, we can safely skip taking the lock at the lower levels, except
at misaligned start and end of the hole.
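For readability, here is the whole of shmem_free_swp as it looks after
this patch, assembled from the hunks below (a reference sketch with the
locking annotated; the diff itself is authoritative):

	static int shmem_free_swp(swp_entry_t *dir, swp_entry_t *edir,
				  spinlock_t *punch_lock)
	{
		spinlock_t *punch_unlock = NULL;
		swp_entry_t *ptr;
		int freed = 0;

		for (ptr = dir; ptr < edir; ptr++) {
			if (ptr->val) {
				/*
				 * punch_lock is NULL for truncation (i_size
				 * and next_index already exclude racers) and
				 * &info->lock for holepunch: take it at most
				 * once, on the first live entry, then hold it
				 * for the remainder of the scan.
				 */
				if (unlikely(punch_lock)) {
					punch_unlock = punch_lock;
					punch_lock = NULL;
					spin_lock(punch_unlock);
					if (!ptr->val)	/* raced: freed already */
						continue;
				}
				free_swap_and_cache(*ptr);
				*ptr = (swp_entry_t){0};
				freed++;
			}
		}
		if (punch_unlock)
			spin_unlock(punch_unlock);
		return freed;
	}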
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
---
Patch against 1/4: intended for 2.6.21.
mm/shmem.c | 96 ++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 73 insertions(+), 23 deletions(-)
--- punch1/mm/shmem.c 2007-03-28 11:50:57.000000000 +0100
+++ punch2/mm/shmem.c 2007-03-28 11:50:58.000000000 +0100
@@ -402,26 +402,38 @@ static swp_entry_t *shmem_swp_alloc(stru
/*
* shmem_free_swp - free some swap entries in a directory
*
- * @dir: pointer to the directory
- * @edir: pointer after last entry of the directory
+ * @dir: pointer to the directory
+ * @edir: pointer after last entry of the directory
+ * @punch_lock: pointer to spinlock when needed for the holepunch case
*/
-static int shmem_free_swp(swp_entry_t *dir, swp_entry_t *edir)
+static int shmem_free_swp(swp_entry_t *dir, swp_entry_t *edir,
+ spinlock_t *punch_lock)
{
+ spinlock_t *punch_unlock = NULL;
swp_entry_t *ptr;
int freed = 0;
for (ptr = dir; ptr < edir; ptr++) {
if (ptr->val) {
+ if (unlikely(punch_lock)) {
+ punch_unlock = punch_lock;
+ punch_lock = NULL;
+ spin_lock(punch_unlock);
+ if (!ptr->val)
+ continue;
+ }
free_swap_and_cache(*ptr);
*ptr = (swp_entry_t){0};
freed++;
}
}
+ if (punch_unlock)
+ spin_unlock(punch_unlock);
return freed;
}
-static int shmem_map_and_free_swp(struct page *subdir,
- int offset, int limit, struct page ***dir)
+static int shmem_map_and_free_swp(struct page *subdir, int offset,
+ int limit, struct page ***dir, spinlock_t *punch_lock)
{
swp_entry_t *ptr;
int freed = 0;
@@ -431,7 +443,8 @@ static int shmem_map_and_free_swp(struct
int size = limit - offset;
if (size > LATENCY_LIMIT)
size = LATENCY_LIMIT;
- freed += shmem_free_swp(ptr+offset, ptr+offset+size);
+ freed += shmem_free_swp(ptr+offset, ptr+offset+size,
+ punch_lock);
if (need_resched()) {
shmem_swp_unmap(ptr);
if (*dir) {
@@ -482,6 +495,8 @@ static void shmem_truncate_range(struct
int offset;
int freed;
int punch_hole;
+ spinlock_t *needs_lock;
+ spinlock_t *punch_lock;
unsigned long upper_limit;
inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -495,6 +510,7 @@ static void shmem_truncate_range(struct
limit = info->next_index;
upper_limit = SHMEM_MAX_INDEX;
info->next_index = idx;
+ needs_lock = NULL;
punch_hole = 0;
} else {
if (end + 1 >= inode->i_size) { /* we may free a little more */
@@ -505,6 +521,7 @@ static void shmem_truncate_range(struct
limit = (end + 1) >> PAGE_CACHE_SHIFT;
upper_limit = limit;
}
+ needs_lock = &info->lock;
punch_hole = 1;
}
@@ -521,7 +538,7 @@ static void shmem_truncate_range(struct
size = limit;
if (size > SHMEM_NR_DIRECT)
size = SHMEM_NR_DIRECT;
- nr_swaps_freed = shmem_free_swp(ptr+idx, ptr+size);
+ nr_swaps_freed = shmem_free_swp(ptr+idx, ptr+size, needs_lock);
}
/*
@@ -531,6 +548,19 @@ static void shmem_truncate_range(struct
if (!topdir || limit <= SHMEM_NR_DIRECT)
goto done2;
+ /*
+ * The truncation case has already dropped info->lock, and we're safe
+ * because i_size and next_index have already been lowered, preventing
+ * access beyond. But in the punch_hole case, we still need to take
+ * the lock when updating the swap directory, because there might be
+ * racing accesses by shmem_getpage(SGP_CACHE), shmem_unuse_inode or
+ * shmem_writepage. However, whenever we find we can remove a whole
+ * directory page (not at the misaligned start or end of the range),
+ * we first NULLify its pointer in the level above, and then have no
+ * need to take the lock when updating its contents: needs_lock and
+ * punch_lock (either pointing to info->lock or NULL) manage this.
+ */
+
upper_limit -= SHMEM_NR_DIRECT;
limit -= SHMEM_NR_DIRECT;
idx = (idx > SHMEM_NR_DIRECT)? (idx - SHMEM_NR_DIRECT): 0;
@@ -552,7 +582,13 @@ static void shmem_truncate_range(struct
diroff = ((idx - ENTRIES_PER_PAGEPAGE/2) %
ENTRIES_PER_PAGEPAGE) / ENTRIES_PER_PAGE;
if (!diroff && !offset && upper_limit >= stage) {
- *dir = NULL;
+ if (needs_lock) {
+ spin_lock(needs_lock);
+ *dir = NULL;
+ spin_unlock(needs_lock);
+ needs_lock = NULL;
+ } else
+ *dir = NULL;
nr_pages_to_free++;
list_add(&middir->lru, &pages_to_free);
}
@@ -578,8 +614,16 @@ static void shmem_truncate_range(struct
}
stage = idx + ENTRIES_PER_PAGEPAGE;
middir = *dir;
+ if (punch_hole)
+ needs_lock = &info->lock;
if (upper_limit >= stage) {
- *dir = NULL;
+ if (needs_lock) {
+ spin_lock(needs_lock);
+ *dir = NULL;
+ spin_unlock(needs_lock);
+ needs_lock = NULL;
+ } else
+ *dir = NULL;
nr_pages_to_free++;
list_add(&middir->lru, &pages_to_free);
}
@@ -588,31 +632,37 @@ static void shmem_truncate_range(struct
dir = shmem_dir_map(middir);
diroff = 0;
}
+ punch_lock = needs_lock;
subdir = dir[diroff];
- if (subdir && page_private(subdir)) {
+ if (subdir && !offset && upper_limit-idx >= ENTRIES_PER_PAGE) {
+ if (needs_lock) {
+ spin_lock(needs_lock);
+ dir[diroff] = NULL;
+ spin_unlock(needs_lock);
+ punch_lock = NULL;
+ } else
+ dir[diroff] = NULL;
+ nr_pages_to_free++;
+ list_add(&subdir->lru, &pages_to_free);
+ }
+ if (subdir && page_private(subdir) /* has swap entries */) {
size = limit - idx;
if (size > ENTRIES_PER_PAGE)
size = ENTRIES_PER_PAGE;
freed = shmem_map_and_free_swp(subdir,
- offset, size, &dir);
+ offset, size, &dir, punch_lock);
if (!dir)
dir = shmem_dir_map(middir);
nr_swaps_freed += freed;
- if (offset)
+ if (offset || punch_lock) {
spin_lock(&info->lock);
- set_page_private(subdir, page_private(subdir) - freed);
- if (offset)
+ set_page_private(subdir,
+ page_private(subdir) - freed);
spin_unlock(&info->lock);
- if (!punch_hole)
- BUG_ON(page_private(subdir) > offset);
- }
- if (offset)
- offset = 0;
- else if (subdir && upper_limit - idx >= ENTRIES_PER_PAGE) {
- dir[diroff] = NULL;
- nr_pages_to_free++;
- list_add(&subdir->lru, &pages_to_free);
+ } else
+ BUG_ON(page_private(subdir) != freed);
}
+ offset = 0;
}
done1:
shmem_dir_unmap(dir);
* Re: [PATCH 2/4] holepunch: fix shmem_truncate_range punch locking
2007-03-28 14:51 ` [PATCH 2/4] holepunch: fix shmem_truncate_range punch locking Hugh Dickins
@ 2007-03-29 11:32 ` Miklos Szeredi
0 siblings, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2007-03-29 11:32 UTC (permalink / raw)
To: hugh; +Cc: akpm, mszeredi, pbadari, linux-mm
> Miklos Szeredi observes that during truncation of shmem page directories,
> info->lock is released to improve latency (after lowering i_size and
> next_index to exclude races); but this is quite wrong for holepunching,
> which receives no such protection from i_size or next_index, and is left
> vulnerable to races with shmem_unuse, shmem_getpage and shmem_writepage.
>
> Hold info->lock throughout when holepunching? No, any user could prevent
> rescheduling for far too long. Instead take info->lock just when needed:
> in shmem_free_swp when removing the swap entries, and whenever removing
> a directory page from the level above. But so long as we remove before
> scanning, we can safely skip taking the lock at the lower levels, except
> at misaligned start and end of the hole.
ACK, but I think this has become way too complex, and none of us will
understand it in a month or so :(
Miklos
* [PATCH 3/4] holepunch: fix disconnected pages after second truncate
2007-03-28 14:50 [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Hugh Dickins
2007-03-28 14:51 ` [PATCH 2/4] holepunch: fix shmem_truncate_range punch locking Hugh Dickins
@ 2007-03-28 14:52 ` Hugh Dickins
2007-03-28 14:54 ` [PATCH 4/4] holepunch: fix mmap_sem i_mutex deadlock Hugh Dickins
2007-03-29 10:57 ` [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Miklos Szeredi
3 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2007-03-28 14:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: Miklos Szeredi, Badari Pulavarty, Nick Piggin, linux-mm
shmem_truncate_range has its own truncate_inode_pages_range, to free any
pages racily instantiated while it was in progress: a SHMEM_PAGEIN flag
is set when this might have happened. But holepunching gets no chance
to clear that flag at the start of vmtruncate_range, so it's always set
(unless a truncate came just before), so holepunch almost always does
this second truncate_inode_pages_range.
shmem holepunch has unlikely swap<->file races hereabouts whatever we do
(without a fuller rework than is fit for this release): I was going to
skip the second truncate in the punch_hole case, but Miklos points out
that would make holepunch correctness more vulnerable to swapoff. So
keep the second truncate, but follow it by an unmap_mapping_range to
eliminate the disconnected pages (freed from pagecache while still
mapped in userspace) that it might have left behind.
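For context, the prototype of the call being added, as declared in
include/linux/mm.h of this era, with its final even_cows argument
annotated (a reference sketch, revisited later in this thread):

	void unmap_mapping_range(struct address_space *mapping,
				 loff_t const holebegin, loff_t const holelen,
				 int even_cows);

	/*
	 * Passing even_cows == 1 unmaps everything mapped over the hole,
	 * including privately COWed copies, so pages the second truncate
	 * freed from pagecache cannot linger in user page tables.
	 */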
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
---
This patch can safely go into 2.6.21 and 2.6.21-rc-mm, but Nick's
pagefault/truncation patches in mm do change the rules, so it's
probably overkill there - but there's unmap_mapping_range issues
I need to clarify with Nick before undoing this in mm.
mm/shmem.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- punch2/mm/shmem.c 2007-03-28 11:50:58.000000000 +0100
+++ punch3/mm/shmem.c 2007-03-28 11:50:59.000000000 +0100
@@ -674,8 +674,16 @@ done2:
* generic_delete_inode did it, before we lowered next_index.
* Also, though shmem_getpage checks i_size before adding to
* cache, no recheck after: so fix the narrow window there too.
+ *
+ * Recalling truncate_inode_pages_range and unmap_mapping_range
+ * every time for punch_hole (which never got a chance to clear
+ * SHMEM_PAGEIN at the start of vmtruncate_range) is expensive,
+ * yet hardly ever necessary: try to optimize them out later.
*/
truncate_inode_pages_range(inode->i_mapping, start, end);
+ if (punch_hole)
+ unmap_mapping_range(inode->i_mapping, start,
+ end - start, 1);
}
spin_lock(&info->lock);
* [PATCH 4/4] holepunch: fix mmap_sem i_mutex deadlock
2007-03-28 14:50 [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Hugh Dickins
2007-03-28 14:51 ` [PATCH 2/4] holepunch: fix shmem_truncate_range punch locking Hugh Dickins
2007-03-28 14:52 ` [PATCH 3/4] holepunch: fix disconnected pages after second truncate Hugh Dickins
@ 2007-03-28 14:54 ` Hugh Dickins
2007-03-29 10:57 ` [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Miklos Szeredi
3 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2007-03-28 14:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: Miklos Szeredi, Badari Pulavarty, Nick Piggin, linux-mm
sys_madvise has down_write of mmap_sem, then madvise_remove calls
vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can
easily devise deadlocks from that ordering.
Let madvise_remove drop mmap_sem while calling vmtruncate_range: luckily,
since madvise_remove doesn't split or merge vmas, it's easy to handle
this case with a NULL prev, without restructuring sys_madvise. (Though
sad to retake mmap_sem when it's unlikely to be needed, and certainly
down_read is sufficient for MADV_REMOVE, unlike the other madvices.)
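To illustrate the ordering problem, one hypothetical interleaving (not a
trace from this thread):

	/*
	 * Task A: sys_madvise(MADV_REMOVE)    Task B: write() to same file
	 *   down_write(mmap_sem);               mutex_lock(i_mutex);
	 *   madvise_remove()                    faults on its user buffer:
	 *     vmtruncate_range()
	 *       mutex_lock(i_mutex); <-waits      down_read(mmap_sem); <-waits
	 *
	 * A holds mmap_sem and wants i_mutex; B holds i_mutex and wants
	 * mmap_sem: an ABBA deadlock, avoided by dropping mmap_sem first.
	 */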
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
---
mm/madvise.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
--- punch3/mm/madvise.c 2007-03-26 07:30:54.000000000 +0100
+++ punch4/mm/madvise.c 2007-03-28 11:51:01.000000000 +0100
@@ -159,9 +159,10 @@ static long madvise_remove(struct vm_are
unsigned long start, unsigned long end)
{
struct address_space *mapping;
- loff_t offset, endoff;
+ loff_t offset, endoff;
+ int error;
- *prev = vma;
+ *prev = NULL; /* tell sys_madvise we drop mmap_sem */
if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB))
return -EINVAL;
@@ -180,7 +181,12 @@ static long madvise_remove(struct vm_are
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
endoff = (loff_t)(end - vma->vm_start - 1)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
- return vmtruncate_range(mapping->host, offset, endoff);
+
+ /* vmtruncate_range needs to take i_mutex and i_alloc_sem */
+ up_write(&current->mm->mmap_sem);
+ error = vmtruncate_range(mapping->host, offset, endoff);
+ down_write(&current->mm->mmap_sem);
+ return error;
}
static long
@@ -315,12 +321,15 @@ asmlinkage long sys_madvise(unsigned lon
if (error)
goto out;
start = tmp;
- if (start < prev->vm_end)
+ if (prev && start < prev->vm_end)
start = prev->vm_end;
error = unmapped_error;
if (start >= end)
goto out;
- vma = prev->vm_next;
+ if (prev)
+ vma = prev->vm_next;
+ else /* madvise_remove dropped mmap_sem */
+ vma = find_vma(current->mm, start);
}
out:
up_write(&current->mm->mmap_sem);
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-28 14:50 [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Hugh Dickins
` (2 preceding siblings ...)
2007-03-28 14:54 ` [PATCH 4/4] holepunch: fix mmap_sem i_mutex deadlock Hugh Dickins
@ 2007-03-29 10:57 ` Miklos Szeredi
2007-03-29 11:56 ` Hugh Dickins
2007-03-29 22:48 ` Peter Chubb
3 siblings, 2 replies; 12+ messages in thread
From: Miklos Szeredi @ 2007-03-29 10:57 UTC (permalink / raw)
To: hugh; +Cc: akpm, mszeredi, pbadari, linux-mm
> Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered
> in rare circumstances, because shmem_truncate_range() erroneously
> removes partially truncated directory pages at the end of the range:
> later reclaim on pages pointing to these removed directories triggers
> the BUG. Indeed, and it can also cause data loss beyond the hole.
>
> Fix this as in the patch proposed by Miklos, but distinguish between
> "limit" (how far we need to search: ignore truncation's next_index
> optimization in the holepunch case - if there are races it's more
> consistent to act on the whole range specified) and "upper_limit"
> (how far we can free directory pages: generally we must be careful
> to keep partially punched pages, but can relax at end of file -
> i_size being held stable by i_mutex).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> Cc: Miklos Szeredi <mszeredi@suse.cz>
> Cc: Badari Pulavarty <pbadari@us.ibm.com>
> ---
> Patch is against 2.6.21-rc5: intended for 2.6.21.
> To apply this series to -mm, please first revert Miklos'
> shmem-dont-release-lock-for-hole-punching.patch
> shmem-fix-bug-in-shmem_writepage.patch
> which these replace.
>
> mm/shmem.c | 32 +++++++++++++++++++++-----------
> 1 file changed, 21 insertions(+), 11 deletions(-)
>
> --- 2.6.21-rc5/mm/shmem.c 2007-03-07 13:09:01.000000000 +0000
> +++ punch1/mm/shmem.c 2007-03-28 11:50:57.000000000 +0100
> @@ -481,7 +481,8 @@ static void shmem_truncate_range(struct
> long nr_swaps_freed = 0;
> int offset;
> int freed;
> - int punch_hole = 0;
> + int punch_hole;
> + unsigned long upper_limit;
>
> inode->i_ctime = inode->i_mtime = CURRENT_TIME;
> idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> @@ -492,11 +493,18 @@ static void shmem_truncate_range(struct
> info->flags |= SHMEM_TRUNCATE;
> if (likely(end == (loff_t) -1)) {
> limit = info->next_index;
> + upper_limit = SHMEM_MAX_INDEX;
> info->next_index = idx;
> + punch_hole = 0;
> } else {
> - limit = (end + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> - if (limit > info->next_index)
> - limit = info->next_index;
> + if (end + 1 >= inode->i_size) { /* we may free a little more */
Why end + 1? If the hole end is at 4096 and the file size is 4097 we
surely don't want to truncate that second page also?
Otherwise ACK.
Miklos
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-29 10:57 ` [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Miklos Szeredi
@ 2007-03-29 11:56 ` Hugh Dickins
2007-03-29 12:11 ` Miklos Szeredi
2007-03-29 22:48 ` Peter Chubb
1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2007-03-29 11:56 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, mszeredi, pbadari, linux-mm
On Thu, 29 Mar 2007, Miklos Szeredi wrote:
> > } else {
> > - limit = (end + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> > - if (limit > info->next_index)
> > - limit = info->next_index;
> > + if (end + 1 >= inode->i_size) { /* we may free a little more */
>
> Why end + 1? If the hole end is at 4096 and the file size is 4097 we
> surely don't want to truncate that second page also?
I spent ages over that! It's the fix to one of those silly little bugs
that were in the interim Not-Yet-Signed-off-by version I sent you earlier.
In that interim version indeed I had just "end": I thought the original
(see the first - line above) was wrong to be rounding up "end" to a page
boundary, since "start" has been rounded up to a page boundary - the
right thing would be to round up start and round down end (though it's
academic so long as the only way here is through sys_madvise, which
enforces page alignment of start and rounds up len).
For a long time I was mystified why my final page's swap entry wasn't
getting punched out. Eventually I printk'ed the arguments coming in,
then found this line in madvise_remove:
endoff = (loff_t)(end - vma->vm_start - 1)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
It subtracts 1 from the "end" I'm used to dealing with (offset of
byte beyond the area in question), to make it the offset of the
last byte of the hole, then passes that down as "lend" or "end"
to truncate_inode_pages_range and shmem_truncate_range.
Personally, I'd call that "last"; or rather, I wouldn't do it like
that at all - I find it terribly confusing to deal with; but I
guess Badari himself found it more natural that way, or it fitted
better with the convention of -1 he was using for full truncation.
So in your example above, hole end 4096 means the last byte of
the hole is at offset 4096, which is indeed the last byte of the
size 4097 file: so we're right to chop it all off.
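A concrete instance of that arithmetic (hypothetical numbers, 4096-byte
pages): madvise(addr, 8192, MADV_REMOVE) on a vma with vm_start == addr
and vm_pgoff == 0, so start == addr and end == addr + 8192:

	offset = (loff_t)(start - vma->vm_start)
		+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);	/* 0 */
	endoff = (loff_t)(end - vma->vm_start - 1)
		+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);	/* 8191 */

	/* endoff names the last byte of the hole, not the first byte
	 * beyond it: hence the end + 1 tests further down the stack. */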
Of course, you can then question the args I gave to the additional
unmap_mapping_range: I dithered over that for a long time too (and
again got it wrong in the interim version), in the end decided to say
the same as vmtruncate_range (to avoid questions!), though strictly
unmap_mapping_range(mapping, offset, 1 + end - offset, 1)
would be more correct - though again, it's academic since
unmap_mapping_range does its own rounding.
> Otherwise ACK.
Thanks,
Hugh
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-29 11:56 ` Hugh Dickins
@ 2007-03-29 12:11 ` Miklos Szeredi
2007-03-29 13:39 ` Hugh Dickins
0 siblings, 1 reply; 12+ messages in thread
From: Miklos Szeredi @ 2007-03-29 12:11 UTC (permalink / raw)
To: hugh; +Cc: miklos, akpm, mszeredi, pbadari, linux-mm
> > > } else {
> > > - limit = (end + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> > > - if (limit > info->next_index)
> > > - limit = info->next_index;
> > > + if (end + 1 >= inode->i_size) { /* we may free a little more */
> >
> > Why end + 1? If the hole end is at 4096 and the file size is 4097 we
> > surely don't want to truncate that second page also?
>
> I spent ages over that! It's the fix to one of those silly little bugs
> that were in the interim Not-Yet-Signed-off-by version I sent you earlier.
>
> In that interim version indeed I had just "end": I thought the original
> (see the first - line above) was wrong to be rounding up "end" to a page
> boundary, since "start" has been rounded up to a page boundary - the
> right thing would be to round up start and round down end (though it's
> academic so long as the only way here is through sys_madvise, which
> enforces page alignment of start and rounds up len).
>
> For a long time I was mystified why my final page's swap entry wasn't
> getting punched out. Eventually I printk'ed the arguments coming in,
> then found this line in madvise_remove:
>
> endoff = (loff_t)(end - vma->vm_start - 1)
> + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> It subtracts 1 from the "end" I'm used to dealing with (offset of
> byte beyond the area in question), to make it the offset of the
> last byte of the hole, then passes that down as "lend" or "end"
> to truncate_inode_pages_range and shmem_truncate_range.
>
> Personally, I'd call that "last"; or rather, I wouldn't do it like
> that at all - I find it terribly confusing to deal with; but I
> guess Badari himself found it more natural that way, or it fitted
> better with the convention of -1 he was using for full truncation.
Oh, it _is_ confusing.
> Of course, you can then question the args I gave to the additional
> unmap_mapping_range: I dithered over that for a long time too (and
> again got it wrong in the interim version), in the end decided to say
> the same as vmtruncate_range (to avoid questions!), though strictly
> unmap_mapping_range(mapping, offset, 1 + end - offset, 1)
> would be more correct - though again, it's academic since
> unmap_mapping_range does its own rounding.
I think we should at least have a
BUG_ON((end + 1) % PAGE_CACHE_SIZE);
or something, to remind us about this wart.
Miklos
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-29 12:11 ` Miklos Szeredi
@ 2007-03-29 13:39 ` Hugh Dickins
2007-03-29 14:35 ` Miklos Szeredi
0 siblings, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2007-03-29 13:39 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, mszeredi, pbadari, linux-mm
On Thu, 29 Mar 2007, Miklos Szeredi wrote:
>
> I think we should at least have a
>
> BUG_ON((end + 1) % PAGE_CACHE_SIZE);
>
> or something, to remind us about this wart.
truncate_inode_pages_range does indeed have
BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
but you're right that falls short of covering all the places we might
make such a mistake.
And in future I'd expect them to be extended to allow non-page-sized
holes, zeroing the partial areas at each end of the hole: whereupon
no such BUG_ON will be possible.
I'd much prefer to change the interface to vmtruncate_range,
truncate_inode_pages_range, shmem_truncate_range,
i_op->truncate_range, to take the expected end offset.
There are other interface changes needed to eradicate
(rather than paper over) the races we've mentioned in private mail.
shmem_truncate_range, and I believe any other implementation of an
i_op->truncate_range, needs to know when the holepunch is beginning
(if the prior unmap_mapping_range and truncate_inode_pages_range
aren't just to be a waste of time that has to be repeated).
Easiest is just to move those calls into each i_op->truncate_range.
But are we free to make such interface changes now?
Might third parties have observed MADV_REMOVE and ->truncate_range,
and be implementing them in their own filesystems?
And another change I'd like to suggest: at present holepunching
is using unmap_mapping_range(,,,1) which discards privately COWed
pages from vmas. It's certainly easier to implement unracily if
we change it only to unmap the shared file pages: and I argue that
it's more correct that way, that the madvise(,,MADV_REMOVE) caller
should not be discarding private data from others' address spaces.
That behaviour is mandated by standards for truncation, and probably
necessary for consistent SIGBUS-beyond-EOF behaviour: but I question
(a year late!) whether we should be extending it to holepunching.
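In code, the suggestion amounts to flipping the final even_cows argument
of the holepunch unmapping (a sketch of the two forms):

	/* as it stands: discards private COWs from every address space */
	unmap_mapping_range(mapping, offset, 1 + end - offset, 1);

	/* proposed: unmap only pages still backed by the file, leaving
	 * privately COWed data in other address spaces untouched */
	unmap_mapping_range(mapping, offset, 1 + end - offset, 0);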
Hugh
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-29 13:39 ` Hugh Dickins
@ 2007-03-29 14:35 ` Miklos Szeredi
0 siblings, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2007-03-29 14:35 UTC (permalink / raw)
To: hugh; +Cc: akpm, mszeredi, pbadari, linux-mm
> On Thu, 29 Mar 2007, Miklos Szeredi wrote:
> >
> > I think we should at least have a
> >
> > BUG_ON((end + 1) % PAGE_CACHE_SIZE);
> >
> > or something, to remind us about this wart.
>
> truncate_inode_pages_range does indeed have
> BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
> but you're right that falls short of covering all the places we might
> make such a mistake.
>
> And in future I'd expect them to be extended to allow non-page-sized
> holes, zeroing the partial areas at each end of the hole: whereupon
> no such BUG_ON will be possible.
>
> I'd much prefer to change the interface to vmtruncate_range,
> truncate_inode_pages_range, shmem_truncate_range,
> i_op->truncate_range, to take the expected end offset.
I agree 100%
> There are other interface changes needed to eradicate
> (rather than paper over) the races we've mentioned in private mail.
> shmem_truncate_range, and I believe any other implementation of an
> i_op->truncate_range, needs to know when the holepunch is beginning
> (if the prior unmap_mapping_range and truncate_inode_pages_range
> aren't just to be a waste of time that has to be repeated).
> Easiest is just to move those calls into each i_op->truncate_range.
>
> But are we free to make such interface changes now?
> Might third parties have observed MADV_REMOVE and ->truncate_range,
> and be implementing them in their own filesystems?
I think that's very unlikely, considering that there were two bugs
(the infinite loop and the ABBA deadlock) in the generic MADV_REMOVE
code, and nobody noticed until now.
The sooner that change is done, the better.
> And another change I'd like to suggest: at present holepunching
> is using unmap_mapping_range(,,,1) which discards privately COWed
> pages from vmas. It's certainly easier to implement unracily if
> we change it only to unmap the shared file pages: and I argue that
> it's more correct that way, that the madvise(,,MADV_REMOVE) caller
> should not be discarding private data from others' address spaces.
Yes, that makes sense too.
Miklos
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-29 10:57 ` [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far Miklos Szeredi
2007-03-29 11:56 ` Hugh Dickins
@ 2007-03-29 22:48 ` Peter Chubb
2007-03-30 4:01 ` Hugh Dickins
1 sibling, 1 reply; 12+ messages in thread
From: Peter Chubb @ 2007-03-29 22:48 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: hugh, akpm, mszeredi, pbadari, linux-mm
>>>>> "Miklos" == Miklos Szeredi <miklos@szeredi.hu> writes:
>> + punch_hole = 0;
>> } else {
>> - limit = (end + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> - if (limit > info->next_index)
>> - limit = info->next_index;
>> + if (end + 1 >= inode->i_size) { /* we may free a little more */
Miklos> Why end + 1? If the hole end is at 4096 and the file size is
Miklos> 4097 we surely don't want to truncate that second page also?
Why not simplify it to end > inode->i_size? I think it then makes
more sense.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
* Re: [PATCH 1/4] holepunch: fix shmem_truncate_range punching too far
2007-03-29 22:48 ` Peter Chubb
@ 2007-03-30 4:01 ` Hugh Dickins
0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2007-03-30 4:01 UTC (permalink / raw)
To: Peter Chubb; +Cc: Miklos Szeredi, akpm, mszeredi, pbadari, linux-mm
On Fri, 30 Mar 2007, Peter Chubb wrote:
> >> + if (end + 1 >= inode->i_size) { /* we may free a little more */
>
> Miklos> Why end + 1? If the hole end is at 4096 and the file size is
> Miklos> 4097 we surely don't want to truncate that second page also?
>
> Why not simplify it to end > inode->i_size? I think it then makes
> more sense.
One reason would be, that 2 + 1 >= 3, but I don't think 2 > 3 ;)
Hugh