linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
@ 2006-08-01 11:36 Nick Piggin
  2006-08-01 14:27 ` Andrea Arcangeli
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Nick Piggin @ 2006-08-01 11:36 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Hugh Dickins, Linux Memory Management

[-- Attachment #1: Type: text/plain, Size: 333 bytes --]

Hi,

Just like to get some thoughts on another possible approach to this
problem, and whether my changelog and implementation actually capture
the problem. This fix is actually something Andrea had proposed, so
credit really goes to him.

I suppose we should think about fixing it some day?

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-dnp-invp-race.patch --]
[-- Type: text/plain, Size: 14424 bytes --]

Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of
pages from pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the
page, before it can be discarded from the pagecache. Between shooting
down ptes to a particular page, and actually dropping the struct page
from the pagecache, do_no_page from any process might fault on that
page and establish a new mapping to the page just before it gets
discarded from the pagecache.

The most common case where such invalidation is used is in file
truncation. This case was catered for by doing a sort of open-coded
seqlock between the file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then
find the page if it is within i_size, and then check truncate_count
under the page table lock and back out and retry if it had
subsequently been changed (ptl will serialise against unmapping, and
ensure a potentially updated truncate_count is actually visible).

Complexity and documentation issues aside, the locking protocol fails
in the case where we would like to invalidate pagecache inside i_size.
do_no_page can come in anytime and filemap_nopage is not aware of the
invalidation in progress (as it is when it is outside i_size). The
end result is that dangling (->mapping == NULL) pages that appear to
be from a particular file may be mapped into userspace with nonsense
data. Valid mappings to the same place will see a different page.

Andrea implemented two working fixes, one using a real seqlock,
another using a page->flags bit. He also proposed using the page lock
in do_no_page, but that was initially considered too heavyweight.
However, it is not a global or per-file lock, and the page cacheline
is modified in do_no_page to increment _count and _mapcount anyway, so
a further modification should not be a large performance hit.
Scalability is not an issue.

This patch implements this latter approach. ->nopage implementations
return with the page locked if it is possible for their underlying
file to be invalidated (in that case, they must set a special vm_flags
bit to indicate so). do_no_page only unlocks the page after setting
up the mapping completely. invalidation is excluded because it holds
the page lock during invalidation of each page.

This allows significant simplifications in do_no_page.

kbuild performance is, surprisingly, maybe slightly improved:
2xP4 Xeon 3.6GHz with HT:
kbuild -j8 in shmfs
original:
505.16user 28.59system 2:15.87elapsed 392%CPU
505.08user 28.28system 2:15.81elapsed 392%CPU
504.16user 29.28system 2:15.85elapsed 392%CPU

patched:
502.35user 26.86system 2:14.77elapsed 392%CPU
501.89user 27.15system 2:14.86elapsed 392%CPU
502.22user 27.01system 2:14.87elapsed 392%CPU

kbuild -j8 in ext3
original:
505.65user 27.72system 2:15.75elapsed 392%CPU
506.38user 26.73system 2:16.38elapsed 390%CPU
507.00user 26.21system 2:15.76elapsed 392%CPU

patched:
501.03user 28.67system 2:14.98elapsed 392%CPU
500.84user 29.34system 2:14.99elapsed 392%CPU
501.18user 28.76system 2:15.05elapsed 392%CPU

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2006-07-26 18:00:47.000000000 +1000
+++ linux-2.6/include/linux/mm.h	2006-07-31 15:37:45.000000000 +1000
@@ -166,6 +166,11 @@ extern unsigned int kobjsize(const void 
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_CAN_INVLD	0x04000000	/* The mapping may be invalidated,
+					 * eg. truncate or invalidate_inode_*.
+					 * In this case, do_no_page must
+					 * return with the page locked.
+					 */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -205,6 +210,7 @@ struct vm_operations_struct {
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr);
 #endif
+	unsigned long vm_flags; /* vm_flags to copy into any mapping vmas */
 };
 
 struct mmu_gather;
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2006-07-31 15:37:42.000000000 +1000
+++ linux-2.6/mm/filemap.c	2006-07-31 16:06:04.000000000 +1000
@@ -1279,6 +1279,8 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	BUG_ON(!(area->vm_flags & VM_CAN_INVLD));
+
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
 retry_all:
@@ -1303,7 +1305,7 @@ retry_all:
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
-	page = find_get_page(mapping, pgoff);
+	page = find_lock_page(mapping, pgoff);
 	if (!page) {
 		unsigned long ra_pages;
 
@@ -1337,7 +1339,7 @@ retry_find:
 				start = pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
 		}
-		page = find_get_page(mapping, pgoff);
+		page = find_lock_page(mapping, pgoff);
 		if (!page)
 			goto no_cached_page;
 	}
@@ -1399,30 +1401,6 @@ page_not_uptodate:
 		majmin = VM_FAULT_MAJOR;
 		inc_page_state(pgmajfault);
 	}
-	lock_page(page);
-
-	/* Did it get unhashed while we waited for it? */
-	if (!page->mapping) {
-		unlock_page(page);
-		page_cache_release(page);
-		goto retry_all;
-	}
-
-	/* Did somebody else get it up-to-date? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
-
-	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry_find;
-	}
 
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
@@ -1430,20 +1408,6 @@ page_not_uptodate:
 	 * because there really aren't any performance issues here
 	 * and we need to check for errors.
 	 */
-	lock_page(page);
-
-	/* Somebody truncated the page on us? */
-	if (!page->mapping) {
-		unlock_page(page);
-		page_cache_release(page);
-		goto retry_all;
-	}
-
-	/* Somebody else successfully read it in? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
 	ClearPageError(page);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
@@ -1462,7 +1426,6 @@ page_not_uptodate:
 	page_cache_release(page);
 	return NULL;
 }
-
 EXPORT_SYMBOL(filemap_nopage);
 
 static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
@@ -1641,6 +1604,7 @@ EXPORT_SYMBOL(filemap_populate);
 struct vm_operations_struct generic_file_vm_ops = {
 	.nopage		= filemap_nopage,
 	.populate	= filemap_populate,
+	.vm_flags	= VM_CAN_INVLD,
 };
 
 /* This is used for a general mmap of a disk file */
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-07-26 18:00:47.000000000 +1000
+++ linux-2.6/mm/memory.c	2006-07-31 16:06:40.000000000 +1000
@@ -1577,6 +1577,13 @@ static int unmap_mapping_range_vma(struc
 	unsigned long restart_addr;
 	int need_break;
 
+	/*
+	 * files that support invalidating or truncating portions of the
+	 * file from under mmaped areas must set the VM_CAN_INVLD flag, and
+	 * have their .nopage function return the page locked.
+	 */
+	BUG_ON(!(vma->vm_flags & VM_CAN_INVLD));
+
 again:
 	restart_addr = vma->vm_truncate_count;
 	if (is_restart_addr(restart_addr) && start_addr < restart_addr) {
@@ -1707,17 +1714,8 @@ void unmap_mapping_range(struct address_
 
 	spin_lock(&mapping->i_mmap_lock);
 
-	/* serialize i_size write against truncate_count write */
-	smp_wmb();
-	/* Protect against page faults, and endless unmapping loops */
+	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
-	/*
-	 * For archs where spin_lock has inclusive semantics like ia64
-	 * this smp_mb() will prevent to read pagetable contents
-	 * before the truncate_count increment is visible to
-	 * other cpus.
-	 */
-	smp_mb();
 	if (unlikely(is_restart_addr(mapping->truncate_count))) {
 		if (mapping->truncate_count == 0)
 			reset_vma_truncate_counts(mapping);
@@ -1729,6 +1727,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
+
 	spin_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
@@ -2040,36 +2039,23 @@ static int do_no_page(struct mm_struct *
 		int write_access)
 {
 	spinlock_t *ptl;
-	struct page *new_page;
-	struct address_space *mapping = NULL;
+	struct page *new_page, *old_page;
 	pte_t entry;
-	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
 
-	if (vma->vm_file) {
-		mapping = vma->vm_file->f_mapping;
-		sequence = mapping->truncate_count;
-		smp_rmb(); /* serializes i_size against truncate_count */
-	}
-retry:
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
-	/*
-	 * No smp_rmb is needed here as long as there's a full
-	 * spin_lock/unlock sequence inside the ->nopage callback
-	 * (for the pagecache lookup) that acts as an implicit
-	 * smp_mb() and prevents the i_size read to happen
-	 * after the next truncate_count read.
-	 */
-
 	/* no page was available -- either SIGBUS or OOM */
 	if (new_page == NOPAGE_SIGBUS)
 		return VM_FAULT_SIGBUS;
 	if (new_page == NOPAGE_OOM)
 		return VM_FAULT_OOM;
+	old_page = new_page;
+
+	BUG_ON(vma->vm_flags & VM_CAN_INVLD && !PageLocked(new_page));
 
 	/*
 	 * Should we do an early C-O-W break?
@@ -2089,19 +2075,6 @@ retry:
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	/*
-	 * For a file-backed vma, someone could have truncated or otherwise
-	 * invalidated this page.  If unmap_mapping_range got called,
-	 * retry getting the page.
-	 */
-	if (mapping && unlikely(sequence != mapping->truncate_count)) {
-		pte_unmap_unlock(page_table, ptl);
-		page_cache_release(new_page);
-		cond_resched();
-		sequence = mapping->truncate_count;
-		smp_rmb();
-		goto retry;
-	}
 
 	/*
 	 * This silly early PAGE_DIRTY setting removes a race
@@ -2139,10 +2112,15 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+out:
+	if (likely(vma->vm_flags & VM_CAN_INVLD))
+		unlock_page(old_page);
 	return ret;
+
 oom:
 	page_cache_release(new_page);
-	return VM_FAULT_OOM;
+	ret = VM_FAULT_OOM;
+	goto out;
 }
 
 /*
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2006-07-26 18:00:47.000000000 +1000
+++ linux-2.6/mm/shmem.c	2006-07-31 16:54:48.000000000 +1000
@@ -80,6 +80,7 @@ enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
 	SGP_WRITE,	/* may exceed i_size, may allocate page */
+	SGP_NOPAGE,	/* same as SGP_CACHE, return with page locked */
 };
 
 static int shmem_getpage(struct inode *inode, unsigned long idx,
@@ -1211,8 +1212,10 @@ repeat:
 	}
 done:
 	if (*pagep != filepage) {
-		unlock_page(filepage);
 		*pagep = filepage;
+		if (sgp != SGP_NOPAGE)
+			unlock_page(filepage);
+
 	}
 	return 0;
 
@@ -1231,13 +1234,15 @@ struct page *shmem_nopage(struct vm_area
 	unsigned long idx;
 	int error;
 
+	BUG_ON(!(vma->vm_flags & VM_CAN_INVLD));
+
 	idx = (address - vma->vm_start) >> PAGE_SHIFT;
 	idx += vma->vm_pgoff;
 	idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
 	if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		return NOPAGE_SIGBUS;
 
-	error = shmem_getpage(inode, idx, &page, SGP_CACHE, type);
+	error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type);
 	if (error)
 		return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
 
@@ -2230,6 +2235,7 @@ static struct vm_operations_struct shmem
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+	.vm_flags	= VM_CAN_INVLD,
 };
 
 
Index: linux-2.6/fs/ncpfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/mmap.c	2004-10-19 17:20:33.000000000 +1000
+++ linux-2.6/fs/ncpfs/mmap.c	2006-07-31 15:39:23.000000000 +1000
@@ -100,6 +100,7 @@ static struct page* ncp_file_mmap_nopage
 static struct vm_operations_struct ncp_file_mmap =
 {
 	.nopage	= ncp_file_mmap_nopage,
+	.vm_flags = VM_CAN_INVLD,
 };
 
 
Index: linux-2.6/fs/ocfs2/mmap.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/mmap.c	2006-02-09 18:04:58.000000000 +1100
+++ linux-2.6/fs/ocfs2/mmap.c	2006-07-31 15:39:56.000000000 +1000
@@ -78,6 +78,7 @@ out:
 
 static struct vm_operations_struct ocfs2_file_vm_ops = {
 	.nopage = ocfs2_nopage,
+	.vm_flags = VM_CAN_INVLD,
 };
 
 int ocfs2_mmap(struct file *file, struct vm_area_struct *vma)
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c	2006-04-20 18:55:03.000000000 +1000
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2006-07-31 15:39:47.000000000 +1000
@@ -634,6 +634,7 @@ const struct file_operations xfs_dir_fil
 static struct vm_operations_struct xfs_file_vm_ops = {
 	.nopage		= filemap_nopage,
 	.populate	= filemap_populate,
+	.vm_flags	= VM_CAN_INVLD,
 };
 
 #ifdef CONFIG_XFS_DMAPI
@@ -643,5 +644,6 @@ static struct vm_operations_struct xfs_d
 #ifdef HAVE_VMOP_MPROTECT
 	.mprotect	= xfs_vm_mprotect,
 #endif
+	.vm_flags	= VM_CAN_INVLD,
 };
 #endif /* CONFIG_XFS_DMAPI */
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-07-26 18:00:47.000000000 +1000
+++ linux-2.6/mm/mmap.c	2006-07-31 16:03:58.000000000 +1000
@@ -1089,6 +1089,9 @@ munmap_back:
 			goto free_vma;
 	}
 
+	if (vma->vm_ops)
+		vma->vm_flags |= vma->vm_ops->vm_flags;
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-01 11:36 [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race? Nick Piggin
@ 2006-08-01 14:27 ` Andrea Arcangeli
  2006-08-02  0:19   ` Nick Piggin
  2006-08-03 16:04 ` Hugh Dickins
  2006-08-03 16:34 ` David Howells
  2 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2006-08-01 14:27 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Hugh Dickins, Linux Memory Management

On Tue, Aug 01, 2006 at 09:36:23PM +1000, Nick Piggin wrote:
> Hi,
> 
> Just like to get some thoughts on another possible approach to this
> problem, and whether my changelog and implementation actually capture
> the problem. This fix is actually something Andrea had proposed, so
> credit really goes to him.

The credit for this third possible fix goes to you for sorting out all
the details ;). This is perhaps the cleanest fix even if more
intrusive.

> I suppose we should think about fixing it some day?

I was thinking about this every few days too, but I already submitted
two fixes and I got somewhat contradictory reviews of them, so I
wasn't sure what to do given that for mainline it's mostly a DoS
because the VM lacks the proper bugchecks in the objrmap layer to
autodetect the leak (the bugchecks I'm talking about only exists only
in the sles9 VM, Hugh removed them while merging objrmap into
mainline, and the fact they existed in sles9 is why we noticed and
tracked down this leak). We already fixed the bug in sles9 a while ago
with my second fix, but I obviously agree we have to fix it in
mainline as well some day too, infact I wouldn't mind to add the
bugchecks too to be sure something like this doesn't go unnoticed
again (especially now that in sles10 we're in VM sync with mainline).

I really appreciate this third way being implemented. It looks quite
nice. Great work.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-01 14:27 ` Andrea Arcangeli
@ 2006-08-02  0:19   ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2006-08-02  0:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Hugh Dickins, Linux Memory Management, Christoph Lameter

Andrea Arcangeli wrote:
> On Tue, Aug 01, 2006 at 09:36:23PM +1000, Nick Piggin wrote:

>>I suppose we should think about fixing it some day?
> 
> 
> I was thinking about this every few days too, but I already submitted
> two fixes and I got somewhat contradictory reviews of them, so I
> wasn't sure what to do given that for mainline it's mostly a DoS
> because the VM lacks the proper bugchecks in the objrmap layer to
> autodetect the leak (the bugchecks I'm talking about only exists only
> in the sles9 VM, Hugh removed them while merging objrmap into
> mainline, and the fact they existed in sles9 is why we noticed and
> tracked down this leak). We already fixed the bug in sles9 a while ago
> with my second fix, but I obviously agree we have to fix it in
> mainline as well some day too, infact I wouldn't mind to add the
> bugchecks too to be sure something like this doesn't go unnoticed
> again (especially now that in sles10 we're in VM sync with mainline).

Well, I don't think it would be out of the question to add bug checks
for obscure conditions if they might have actually detected bugs for
us. It is common to see bug fixes also adding a BUG_ON to ensure similar
problems or future regressions are noticed. I can't remember Hugh's
reason for not wanting the check there, though.

> 
> I really appreciate this third way being implemented. It looks quite
> nice. Great work.

Thanks, glad you like it :)

Hugh did mention a 2% slowdown in lmbench tests due to the lock page
(but maybe that was before removing the truncate_count stuff?). That
isn't good, but it actually seems to speed up kbuild (which is very
do_no_page intensive). I didn't get enough samples to be statistically
significant, but it looks like 1s speedup on shmfs, and 3/4s on ext3.

So if the lmbench slowdown is still there, I think kbuild trumps it.
However, my testing is just on a P4, so it would be good to check a
wider range of architectures too. I'll spend a bit of time with that
when I get a chance.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-01 11:36 [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race? Nick Piggin
  2006-08-01 14:27 ` Andrea Arcangeli
@ 2006-08-03 16:04 ` Hugh Dickins
  2006-08-05  3:52   ` Nick Piggin
  2006-08-07 14:18   ` Nick Piggin
  2006-08-03 16:34 ` David Howells
  2 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2006-08-03 16:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

(David, I've added you to CC because way down below
there's an issue of interaction with page_mkwrite.)

On Tue, 1 Aug 2006, Nick Piggin wrote:
> 
> Just like to get some thoughts on another possible approach to this
> problem, and whether my changelog and implementation actually capture

Good changelog, promising implementation.

> the problem. This fix is actually something Andrea had proposed, so
> credit really goes to him.
> 
> I suppose we should think about fixing it some day?

Certainly we should fix it, and I've failed to come up with anything
better than this, despite mulling over it from time to time.

I believe I was the one who most strongly resisted (even ridiculed)
this approach, others liked it; and it's grown more attractive
since Christoph (Mr Scalability) approved the idea at OLS.

Though I'm still not entirely convinced.

> Fix the race between invalidate_inode_pages and do_no_page.
> 
> Andrea Arcangeli identified a subtle race between invalidation of
> pages from pagecache with userspace mappings, and do_no_page.
> 
> The issue is that invalidation has to shoot down all mappings to the
> page, before it can be discarded from the pagecache. Between shooting
> down ptes to a particular page, and actually dropping the struct page
> from the pagecache, do_no_page from any process might fault on that
> page and establish a new mapping to the page just before it gets
> discarded from the pagecache.
> 
> The most common case where such invalidation is used is in file
> truncation. This case was catered for by doing a sort of open-coded
> seqlock between the file's i_size, and its truncate_count.
> 
> Truncation will decrease i_size, then increment truncate_count before
> unmapping userspace pages; do_no_page will read truncate_count, then
> find the page if it is within i_size, and then check truncate_count
> under the page table lock and back out and retry if it had
> subsequently been changed (ptl will serialise against unmapping, and
> ensure a potentially updated truncate_count is actually visible).
> 
> Complexity and documentation issues aside, the locking protocol fails
> in the case where we would like to invalidate pagecache inside i_size.
> do_no_page can come in anytime and filemap_nopage is not aware of the
> invalidation in progress (as it is when it is outside i_size). The
> end result is that dangling (->mapping == NULL) pages that appear to
> be from a particular file may be mapped into userspace with nonsense
> data. Valid mappings to the same place will see a different page.

I think it was some NFS or cluster FS case that showed the problem,
Andrea would know.  But Badari's MADV_REMOVE, punching a hole within
a file, has added another case which the i_size/truncate_count
technique cannot properly guard against.

> Andrea implemented two working fixes, one using a real seqlock,
> another using a page->flags bit. He also proposed using the page lock
> in do_no_page, but that was initially considered too heavyweight.
> However, it is not a global or per-file lock, and the page cacheline
> is modified in do_no_page to increment _count and _mapcount anyway, so
> a further modification should not be a large performance hit.
> Scalability is not an issue.

Scalability is not an issue, that's nice - but I don't see how you
arrive at that certainty.  Obviously the per-page lock means it's
less of a scalability issue than global or per-file; and the fact
that tmpfs' shmem_getpage has always taken page lock internally
adds good evidence that it can't be too bad.

But I worry a little about shared libraries, and suspect that there
will be cases degraded by the additional locking - perhaps benchmarks
(with processes falling into lockstep) rather than real-life loads.  I
think it's fair to say "Scalability is unlikely to be much of an issue".

> This patch implements this latter approach. ->nopage implementations
> return with the page locked if it is possible for their underlying
> file to be invalidated (in that case, they must set a special vm_flags
> bit to indicate so). do_no_page only unlocks the page after setting
> up the mapping completely. invalidation is excluded because it holds
> the page lock during invalidation of each page.
> 
> This allows significant simplifications in do_no_page.
> 
> kbuild performance is, surprisingly, maybe slightly improved:

Emphasis on maybe.  It would be surprising, and your ext3 system
times go the other way, and I find the reverse of what you find:
slightly regressed, say ~0.5%, on kbuilds and some lmbenchs.

Perhaps because my kbuilds or machines are different; or perhaps
it's all just in the noise, and an accident of text moving into or
out of different cachelines.

Nothing too alarming, anyway, and I don't really trust myself for
performance numbers.  Let's wait to hear from Martin and co.

> 2xP4 Xeon 3.6GHz with HT:
> kbuild -j8 in shmfs
> original:
> 505.16user 28.59system 2:15.87elapsed 392%CPU
> 505.08user 28.28system 2:15.81elapsed 392%CPU
> 504.16user 29.28system 2:15.85elapsed 392%CPU
> 
> patched:
> 502.35user 26.86system 2:14.77elapsed 392%CPU
> 501.89user 27.15system 2:14.86elapsed 392%CPU
> 502.22user 27.01system 2:14.87elapsed 392%CPU
> 
> kbuild -j8 in ext3
> original:
> 505.65user 27.72system 2:15.75elapsed 392%CPU
> 506.38user 26.73system 2:16.38elapsed 390%CPU
> 507.00user 26.21system 2:15.76elapsed 392%CPU
> 
> patched:
> 501.03user 28.67system 2:14.98elapsed 392%CPU
> 500.84user 29.34system 2:14.99elapsed 392%CPU
> 501.18user 28.76system 2:15.05elapsed 392%CPU
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2006-07-26 18:00:47.000000000 +1000
> +++ linux-2.6/include/linux/mm.h	2006-07-31 15:37:45.000000000 +1000
> @@ -166,6 +166,11 @@ extern unsigned int kobjsize(const void 
>  #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
>  #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
>  #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
> +#define VM_CAN_INVLD	0x04000000	/* The mapping may be invalidated,
> +					 * eg. truncate or invalidate_inode_*.
> +					 * In this case, do_no_page must
> +					 * return with the page locked.
> +					 */

I didn't care for "INVLD", and gather now that it's being changed to
"INVALIDATE" (I'd have suggested "INVAL").  But actually, I'd rather
a name that says what's actually being assumed: VM_NOPAGE_LOCKED?

With "revoke" in the air, I suspect that we're going to want to be
able to invalidate the pages of _any_ mapping, whether the driver
locks them in its nopage or not.  (Or am I thereby just encouraging
the idea of a racy revoke?)

>  
>  #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -205,6 +210,7 @@ struct vm_operations_struct {
>  	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
>  					unsigned long addr);
>  #endif
> +	unsigned long vm_flags; /* vm_flags to copy into any mapping vmas */
>  };

I suppose this is quite efficient, but I find it confusing.
We have lots and lots of drivers already setting vm_flags in their
mmap methods, now you add an alternative way of doing the same thing.
Can't you just set VM_NOPAGE_LOCKED in the relevant mmap methods?
Or did you try it that way and it worked out messy?

>  struct mmu_gather;
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2006-07-31 15:37:42.000000000 +1000
> +++ linux-2.6/mm/filemap.c	2006-07-31 16:06:04.000000000 +1000
> @@ -1279,6 +1279,8 @@ struct page *filemap_nopage(struct vm_ar
>  	unsigned long size, pgoff;
>  	int did_readaround = 0, majmin = VM_FAULT_MINOR;
>  
> +	BUG_ON(!(area->vm_flags & VM_CAN_INVLD));
> +
>  	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
>  
>  retry_all:
> @@ -1303,7 +1305,7 @@ retry_all:
>  	 * Do we have something in the page cache already?
>  	 */
>  retry_find:
> -	page = find_get_page(mapping, pgoff);
> +	page = find_lock_page(mapping, pgoff);
>  	if (!page) {
>  		unsigned long ra_pages;
>  
> @@ -1337,7 +1339,7 @@ retry_find:
>  				start = pgoff - ra_pages / 2;
>  			do_page_cache_readahead(mapping, file, start, ra_pages);
>  		}
> -		page = find_get_page(mapping, pgoff);
> +		page = find_lock_page(mapping, pgoff);
>  		if (!page)
>  			goto no_cached_page;
>  	}
> @@ -1399,30 +1401,6 @@ page_not_uptodate:
>  		majmin = VM_FAULT_MAJOR;
>  		inc_page_state(pgmajfault);
>  	}
> -	lock_page(page);
> -
> -	/* Did it get unhashed while we waited for it? */
> -	if (!page->mapping) {
> -		unlock_page(page);
> -		page_cache_release(page);
> -		goto retry_all;
> -	}
> -
> -	/* Did somebody else get it up-to-date? */
> -	if (PageUptodate(page)) {
> -		unlock_page(page);
> -		goto success;
> -	}
> -
> -	error = mapping->a_ops->readpage(file, page);
> -	if (!error) {
> -		wait_on_page_locked(page);
> -		if (PageUptodate(page))
> -			goto success;
> -	} else if (error == AOP_TRUNCATED_PAGE) {
> -		page_cache_release(page);
> -		goto retry_find;
> -	}

This is a delicious piece of clutter removal: I think you're correct.

>  
>  	/*
>  	 * Umm, take care of errors if the page isn't up-to-date.
> @@ -1430,20 +1408,6 @@ page_not_uptodate:
>  	 * because there really aren't any performance issues here
>  	 * and we need to check for errors.
>  	 */
> -	lock_page(page);
> -
> -	/* Somebody truncated the page on us? */
> -	if (!page->mapping) {
> -		unlock_page(page);
> -		page_cache_release(page);
> -		goto retry_all;
> -	}
> -
> -	/* Somebody else successfully read it in? */
> -	if (PageUptodate(page)) {
> -		unlock_page(page);
> -		goto success;
> -	}

Ditto.

>  	ClearPageError(page);
>  	error = mapping->a_ops->readpage(file, page);
>  	if (!error) {
> @@ -1462,7 +1426,6 @@ page_not_uptodate:
>  	page_cache_release(page);
>  	return NULL;

But here I think you're missing something: the wait_on_page_locked
after ->readpage needs to become a lock_page before going to success?
with unlock_page if it doesn't.

>  }
> -
>  EXPORT_SYMBOL(filemap_nopage);
>  
>  static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
> @@ -1641,6 +1604,7 @@ EXPORT_SYMBOL(filemap_populate);
>  struct vm_operations_struct generic_file_vm_ops = {
>  	.nopage		= filemap_nopage,
>  	.populate	= filemap_populate,
> +	.vm_flags	= VM_CAN_INVLD,
>  };
>  
>  /* This is used for a general mmap of a disk file */
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c	2006-07-26 18:00:47.000000000 +1000
> +++ linux-2.6/mm/memory.c	2006-07-31 16:06:40.000000000 +1000
> @@ -1577,6 +1577,13 @@ static int unmap_mapping_range_vma(struc
>  	unsigned long restart_addr;
>  	int need_break;
>  
> +	/*
> +	 * files that support invalidating or truncating portions of the
> +	 * file from under mmaped areas must set the VM_CAN_INVLD flag, and
> +	 * have their .nopage function return the page locked.
> +	 */
> +	BUG_ON(!(vma->vm_flags & VM_CAN_INVLD));
> +

I think we shall end up wanting to apply unmap_mapping_range even
to "unlocked nopage" vmas (the revoke idea) - unless we decide we
have to make every nopage vma do the locking.  

Would the BUG_ON be better as a WARN_ON, or nothing at all?  It'll
give trouble until out-of-tree filesystems/drivers are updated; or
do we want to give them active trouble there, I'm not sure?

>  again:
>  	restart_addr = vma->vm_truncate_count;
>  	if (is_restart_addr(restart_addr) && start_addr < restart_addr) {
> @@ -1707,17 +1714,8 @@ void unmap_mapping_range(struct address_
>  
>  	spin_lock(&mapping->i_mmap_lock);
>  
> -	/* serialize i_size write against truncate_count write */
> -	smp_wmb();
> -	/* Protect against page faults, and endless unmapping loops */
> +	/* Protect against endless unmapping loops */
>  	mapping->truncate_count++;
> -	/*
> -	 * For archs where spin_lock has inclusive semantics like ia64
> -	 * this smp_mb() will prevent to read pagetable contents
> -	 * before the truncate_count increment is visible to
> -	 * other cpus.
> -	 */
> -	smp_mb();

Yes, that's right, leave truncate_count itself for prio_tree restart,
but it's grand to be deleting all those barriers and their comments.

>  	if (unlikely(is_restart_addr(mapping->truncate_count))) {
>  		if (mapping->truncate_count == 0)
>  			reset_vma_truncate_counts(mapping);
> @@ -1729,6 +1727,7 @@ void unmap_mapping_range(struct address_
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
>  	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
>  		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
> +
>  	spin_unlock(&mapping->i_mmap_lock);
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
> @@ -2040,36 +2039,23 @@ static int do_no_page(struct mm_struct *
>  		int write_access)
>  {
>  	spinlock_t *ptl;
> -	struct page *new_page;
> -	struct address_space *mapping = NULL;
> +	struct page *new_page, *old_page;

I think it's much clearer to call it "locked_page" than "old_page",
particularly when you see it alonside Peter's "dirty_page".

>  	pte_t entry;
> -	unsigned int sequence = 0;
>  	int ret = VM_FAULT_MINOR;
>  	int anon = 0;
>  
>  	pte_unmap(page_table);
>  	BUG_ON(vma->vm_flags & VM_PFNMAP);
>  
> -	if (vma->vm_file) {
> -		mapping = vma->vm_file->f_mapping;
> -		sequence = mapping->truncate_count;
> -		smp_rmb(); /* serializes i_size against truncate_count */
> -	}
> -retry:
>  	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
> -	/*
> -	 * No smp_rmb is needed here as long as there's a full
> -	 * spin_lock/unlock sequence inside the ->nopage callback
> -	 * (for the pagecache lookup) that acts as an implicit
> -	 * smp_mb() and prevents the i_size read to happen
> -	 * after the next truncate_count read.
> -	 */
> -

More great removals.

>  	/* no page was available -- either SIGBUS or OOM */
>  	if (new_page == NOPAGE_SIGBUS)
>  		return VM_FAULT_SIGBUS;
>  	if (new_page == NOPAGE_OOM)
>  		return VM_FAULT_OOM;
> +	old_page = new_page;
> +
> +	BUG_ON(vma->vm_flags & VM_CAN_INVLD && !PageLocked(new_page));

Maybe
	if (vma->vm_flags & VM_NOPAGE_LOCKED) {
		locked_page = new_page;
		BUG_ON(!PageLocked(locked_page));
	} else
		locked_page = NULL;

But what I hate about this do_no_page is that sometimes we're going
through it with the page locked, and sometimes we're going through it
with the page not locked.  Now I've not noticed any actual problem
from that (aside from where page_mkwrite fails), and it is well-defined
which case is which, but it is confusing and does make do_no_page harder
to audit at any time.

(I did toy with a separate do_no_page_locked, and nopage_locked
methods for the filesystems; but duplicating so much code doesn't
really solve anything.)

And when you factor in Peter's dirty_page stuff, it's a nuisance:
because he has had to get_page(dirty_page) then put_page(dirty_page),
in case page already got freed by vmscan after ptl dropped: which is
redundant if the page is locked throughout, but you can't rely on that
because (for a while at least) some fs'es won't set VM_NOPAGE_LOCKED.

How about
	if (vma->vm_flags & VM_NOPAGE_LOCKED)
		BUG_ON(!PageLocked(new_page));
	else
		lock_page(new_page);
	locked_page = new_page;
?

And then proceed through the rest of do_no_page sure in the knowledge
that we have the page locked, simplifying whatever might be simplified
by that (removing Peter's get_page,put_page at least).  I can see this
adds a little overhead to some less important cases, but it does make
the rules much easier to grasp.

>  
>  	/*
>  	 * Should we do an early C-O-W break?

Somewhere below here you're missing a hunk to deal with a failed
page_mkwrite, needing to unlock_page(locked_page).  We don't have
an example of a page_mkwrite in tree at present, but it seems
reasonable to suppose that we not it should unlock the page.

Hmm, David Howells has an afs_file_page_mkwrite which sits waiting
for an FsMisc page flag to be cleared: might that deadlock with the
page lock held?  If so, it may need to unlock and relock the page,
rechecking for truncation.

Hmmm, page_mkwrite when called from do_wp_page would not expect to
be holding page lock: we don't want it called with in one case and
without in the other.  Maybe do_no_page needs to unlock_page before
calling page_mkwrite, lock_page after, and check page->mapping when
VM_NOPAGE_LOCKED??

> @@ -2089,19 +2075,6 @@ retry:
>  	}
>  
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> -	/*
> -	 * For a file-backed vma, someone could have truncated or otherwise
> -	 * invalidated this page.  If unmap_mapping_range got called,
> -	 * retry getting the page.
> -	 */
> -	if (mapping && unlikely(sequence != mapping->truncate_count)) {
> -		pte_unmap_unlock(page_table, ptl);
> -		page_cache_release(new_page);
> -		cond_resched();
> -		sequence = mapping->truncate_count;
> -		smp_rmb();
> -		goto retry;
> -	}

More pleasure.

>  
>  	/*
>  	 * This silly early PAGE_DIRTY setting removes a race
> @@ -2139,10 +2112,15 @@ retry:
>  	lazy_mmu_prot_update(entry);
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
> +out:
> +	if (likely(vma->vm_flags & VM_CAN_INVLD))
> +		unlock_page(old_page);

If you agree above, that becomes unconditional unlock_page(locked_page);

>  	return ret;
> +
>  oom:
>  	page_cache_release(new_page);
> -	return VM_FAULT_OOM;
> +	ret = VM_FAULT_OOM;
> +	goto out;
>  }
>  
>  /*
> Index: linux-2.6/mm/shmem.c
> ===================================================================
> --- linux-2.6.orig/mm/shmem.c	2006-07-26 18:00:47.000000000 +1000
> +++ linux-2.6/mm/shmem.c	2006-07-31 16:54:48.000000000 +1000
> @@ -80,6 +80,7 @@ enum sgp_type {
>  	SGP_READ,	/* don't exceed i_size, don't allocate page */
>  	SGP_CACHE,	/* don't exceed i_size, may allocate page */
>  	SGP_WRITE,	/* may exceed i_size, may allocate page */
> +	SGP_NOPAGE,	/* same as SGP_CACHE, return with page locked */
>  };

I don't think you need to add another type for this, SGP_CACHE should do.

Perhaps you avoided that because it's also used by shmem_populate.
But another point I want to make is that you do need to update
filemap_populate, shmem_populate, install_page and whatever to
make use the same locked page fix: they've been relying on the
i_size and page->mapping checks, which are not quite enough,
isn't that right? (now my grasp of the race has fallen out of my
left ear, and I'd better finish this mail before regrasping it)

(If you think those checks are enough, then wouldn't it follow that
we don't need to hold page locked in do_no_page at all, just have a
vm_flag to say NULL page->mapping in do_no_page means invalidated?
I think we've all played with that in the past and found it wanting.)

Not as sigificant as the do_no_page fix, but necessary to plug the hole.

>  
>  static int shmem_getpage(struct inode *inode, unsigned long idx,
> @@ -1211,8 +1212,10 @@ repeat:
>  	}
>  done:
>  	if (*pagep != filepage) {
> -		unlock_page(filepage);
>  		*pagep = filepage;
> +		if (sgp != SGP_NOPAGE)
> +			unlock_page(filepage);
> +

You've inserted that blank line just to upset me.

>  	}
>  	return 0;
>  
> @@ -1231,13 +1234,15 @@ struct page *shmem_nopage(struct vm_area
>  	unsigned long idx;
>  	int error;
>  
> +	BUG_ON(!(vma->vm_flags & VM_CAN_INVLD));
> +

We've separately observed that you need to add VM_CAN_INVLD or
VM_CAN_INVALIDATE or VM_NOPAGE_LOCKED into ipc/shm.c too.

>  	idx = (address - vma->vm_start) >> PAGE_SHIFT;
>  	idx += vma->vm_pgoff;
>  	idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
>  	if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode))
>  		return NOPAGE_SIGBUS;
>  
> -	error = shmem_getpage(inode, idx, &page, SGP_CACHE, type);
> +	error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type);
>  	if (error)
>  		return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
>  
> @@ -2230,6 +2235,7 @@ static struct vm_operations_struct shmem
>  	.set_policy     = shmem_set_policy,
>  	.get_policy     = shmem_get_policy,
>  #endif
> +	.vm_flags	= VM_CAN_INVLD,
>  };
>  
>  
> Index: linux-2.6/fs/ncpfs/mmap.c
> ===================================================================
> --- linux-2.6.orig/fs/ncpfs/mmap.c	2004-10-19 17:20:33.000000000 +1000
> +++ linux-2.6/fs/ncpfs/mmap.c	2006-07-31 15:39:23.000000000 +1000
> @@ -100,6 +100,7 @@ static struct page* ncp_file_mmap_nopage
>  static struct vm_operations_struct ncp_file_mmap =
>  {
>  	.nopage	= ncp_file_mmap_nopage,
> +	.vm_flags = VM_CAN_INVLD,
>  };
>  
>  
> Index: linux-2.6/fs/ocfs2/mmap.c
> ===================================================================
> --- linux-2.6.orig/fs/ocfs2/mmap.c	2006-02-09 18:04:58.000000000 +1100
> +++ linux-2.6/fs/ocfs2/mmap.c	2006-07-31 15:39:56.000000000 +1000
> @@ -78,6 +78,7 @@ out:
>  
>  static struct vm_operations_struct ocfs2_file_vm_ops = {
>  	.nopage = ocfs2_nopage,
> +	.vm_flags = VM_CAN_INVLD,
>  };
>  
>  int ocfs2_mmap(struct file *file, struct vm_area_struct *vma)
> Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c	2006-04-20 18:55:03.000000000 +1000
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2006-07-31 15:39:47.000000000 +1000
> @@ -634,6 +634,7 @@ const struct file_operations xfs_dir_fil
>  static struct vm_operations_struct xfs_file_vm_ops = {
>  	.nopage		= filemap_nopage,
>  	.populate	= filemap_populate,
> +	.vm_flags	= VM_CAN_INVLD,
>  };
>  
>  #ifdef CONFIG_XFS_DMAPI
> @@ -643,5 +644,6 @@ static struct vm_operations_struct xfs_d
>  #ifdef HAVE_VMOP_MPROTECT
>  	.mprotect	= xfs_vm_mprotect,
>  #endif
> +	.vm_flags	= VM_CAN_INVLD,
>  };
>  #endif /* CONFIG_XFS_DMAPI */
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c	2006-07-26 18:00:47.000000000 +1000
> +++ linux-2.6/mm/mmap.c	2006-07-31 16:03:58.000000000 +1000
> @@ -1089,6 +1089,9 @@ munmap_back:
>  			goto free_vma;
>  	}
>  
> +	if (vma->vm_ops)
> +		vma->vm_flags |= vma->vm_ops->vm_flags;
> +

Mmm, I'd prefer not to have this additional way of setting vm_flags.

>  	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
>  	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
>  	 * that memory reservation must be checked; but that reservation

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-01 11:36 [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race? Nick Piggin
  2006-08-01 14:27 ` Andrea Arcangeli
  2006-08-03 16:04 ` Hugh Dickins
@ 2006-08-03 16:34 ` David Howells
  2 siblings, 0 replies; 13+ messages in thread
From: David Howells @ 2006-08-03 16:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Andrea Arcangeli, Andrew Morton, David Howells,
	Linux Memory Management

Hugh Dickins <hugh@veritas.com> wrote:

> >  
> >  	/*
> >  	 * Should we do an early C-O-W break?
> 
> Somewhere below here you're missing a hunk to deal with a failed
> page_mkwrite, needing to unlock_page(locked_page).  We don't have
> an example of a page_mkwrite in tree at present, but it seems
> reasonable to suppose that we not it should unlock the page.
> 
> Hmm, David Howells has an afs_file_page_mkwrite which sits waiting
> for an FsMisc page flag to be cleared: might that deadlock with the
> page lock held?  If so, it may need to unlock and relock the page,
> rechecking for truncation.
> 
> Hmmm, page_mkwrite when called from do_wp_page would not expect to
> be holding page lock: we don't want it called with in one case and
> without in the other.  Maybe do_no_page needs to unlock_page before
> calling page_mkwrite, lock_page after, and check page->mapping when
> VM_NOPAGE_LOCKED??

For what I'm using page_mkwrite() and PG_fs_misc for, there shouldn't be a
deadlock:

 (1) PG_fs_misc has to be set whilst we are holding the page lock after
     reading the page, since the VM could race with page release otherwise.

 (2) If the cache refuses to store the page, PG_fs_misc is cleared immediately
     before the page lock is released.

 (3) If the cache agrees to store the page, it will start the process of
     storing the page to disk, and doesn't require the page to remain locked.

     When the I/O is complete, the cache will call back into the netfs (AFS or
     NFS, for example) and the netfs will clear PG_fs_misc and wake up anyone
     waiting for it.  It will not attempt to lock the page.

 (4) If page_mkwrite() is called, it will simply call wait_on_page_fs_misc(),
     and will not attempt to lock the page.

I don't think the caller of page_mkwrite() ever has the page locked
currently.  It's possible someone else has the page locked, and that
page_mkwrite() might have to wait for it to become unlocked, depending on what
it wants to do.

But for the page being locked during reading, do_no_page() makes sure the page
is unlocked before page_mkwrite() can be called.

David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-03 16:04 ` Hugh Dickins
@ 2006-08-05  3:52   ` Nick Piggin
  2006-08-07 14:18   ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2006-08-05  3:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

Hugh Dickins wrote:
> (David, I've added you to CC because way down below
> there's an issue of interaction with page_mkwrite.)
> 
> On Tue, 1 Aug 2006, Nick Piggin wrote:
> 
>>Just like to get some thoughts on another possible approach to this
>>problem, and whether my changelog and implementation actually capture
> 
> 
> Good changelog, promising implementation.

... thanks for the thorough review, Hugh, as always. I'll find
time to respond early next week.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-03 16:04 ` Hugh Dickins
  2006-08-05  3:52   ` Nick Piggin
@ 2006-08-07 14:18   ` Nick Piggin
  2006-08-07 14:58     ` Nick Piggin
  2006-08-07 17:05     ` Hugh Dickins
  1 sibling, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2006-08-07 14:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

Hugh Dickins wrote:
> (David, I've added you to CC because way down below
> there's an issue of interaction with page_mkwrite.)

Apparently not a problem yet (phew). But the sooner we make a decision
here, the more future page_mkwrite problems we might prevent ;)


>>Complexity and documentation issues aside, the locking protocol fails
>>in the case where we would like to invalidate pagecache inside i_size.
>>do_no_page can come in anytime and filemap_nopage is not aware of the
>>invalidation in progress (as it is when it is outside i_size). The
>>end result is that dangling (->mapping == NULL) pages that appear to
>>be from a particular file may be mapped into userspace with nonsense
>>data. Valid mappings to the same place will see a different page.
> 
> 
> I think it was some NFS or cluster FS case that showed the problem,
> Andrea would know.  But Badari's MADV_REMOVE, punching a hole within
> a file, has added another case which the i_size/truncate_count
> technique cannot properly guard against.

Yes. I guess I should mention some examples of these users. Direct
IO under mmapped pagecache AFAIKS would have the race too.

> 
> 
>>Andrea implemented two working fixes, one using a real seqlock,
>>another using a page->flags bit. He also proposed using the page lock
>>in do_no_page, but that was initially considered too heavyweight.
>>However, it is not a global or per-file lock, and the page cacheline
>>is modified in do_no_page to increment _count and _mapcount anyway, so
>>a further modification should not be a large performance hit.
>>Scalability is not an issue.
> 
> 
> Scalability is not an issue, that's nice - but I don't see how you
> arrive at that certainty.  Obviously the per-page lock means it's
> less of a scalability issue than global or per-file; and the fact
> that tmpfs' shmem_getpage has always taken page lock internally
> adds good evidence that it can't be too bad.

Well, scalability because there is no extra cacheline transfer
anywhere. The cacheline must already be local at both the site
where we take the page lock, and the site where we unlock it.

> But I worry a little about shared libraries, and suspect that there
> will be cases degraded by the additional locking - perhaps benchmarks
> (with processes falling into lockstep) rather than real-life loads.  I
> think it's fair to say "Scalability is unlikely to be much of an issue".

I don't expect them to be an issue. Even if tree_lock were removed
from the find_get/lock_page path, there is still the per-page count
and mapcount to contend with. Sure, lock_page may prevent a very
*slight* amount of parallelism...

Anyway, we do need a bugfix, and it is something we do have to give
up performance for, if nothing better can be found.

If there is a really bad problem (which I doubt), then I can trade
them a per-page lock for the per-inode tree_lock and truncate_count
bouncing around in do_no_page()...

No, seriously: maybe we could re-evaluate one of Andrea's
implementations, or look at holding lock_page for shorter times. I
don't know until I see ;)

> 
> 
>>This patch implements this latter approach. ->nopage implementations
>>return with the page locked if it is possible for their underlying
>>file to be invalidated (in that case, they must set a special vm_flags
>>bit to indicate so). do_no_page only unlocks the page after setting
>>up the mapping completely. invalidation is excluded because it holds
>>the page lock during invalidation of each page.
>>
>>This allows significant simplifications in do_no_page.
>>
>>kbuild performance is, surprisingly, maybe slightly improved:
> 
> 
> Emphasis on maybe.  It would be surprising, and your ext3 system
> times go the other way, and I find the reverse of what you find:
> slightly regressed, say ~0.5%, on kbuilds and some lmbenchs.

No the ext3 times are improved too. Elapsed time, that is.

I don't doubt you see some regressions. I still haven't got numbers
for a wide range of architectures yet. Implementation still in
flux...

> I didn't care for "INVLD", and gather now that it's being changed to
> "INVALIDATE" (I'd have suggested "INVAL").  But actually, I'd rather
> a name that says what's actually being assumed: VM_NOPAGE_LOCKED?
> 
> With "revoke" in the air, I suspect that we're going to want to be
> able to invalidate the pages of _any_ mapping, whether the driver
> locks them in its nopage or not.  (Or am I thereby just encouraging
> the idea of a racy revoke?)

I'd like to fix pagecache and let revoke sort that out. Good point
though.

> 
> 
>> 
>> #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
>> #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
>>@@ -205,6 +210,7 @@ struct vm_operations_struct {
>> 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
>> 					unsigned long addr);
>> #endif
>>+	unsigned long vm_flags; /* vm_flags to copy into any mapping vmas */
>> };
> 
> 
> I suppose this is quite efficient, but I find it confusing.
> We have lots and lots of drivers already setting vm_flags in their
> mmap methods, now you add an alternative way of doing the same thing.
> Can't you just set VM_NOPAGE_LOCKED in the relevant mmap methods?
> Or did you try it that way and it worked out messy?

Generic pagecache doesn't have an mmap method, which is where
I stopped looking. I guess you could add the |= to filemap_nopage,
but that's much uglier.

I don't find it at all confusing, just maybe a bit of a violation
because the structure is technically only for "ops".

> 
> 
>> struct mmu_gather;
>>Index: linux-2.6/mm/filemap.c
>>===================================================================
>>--- linux-2.6.orig/mm/filemap.c	2006-07-31 15:37:42.000000000 +1000
>>+++ linux-2.6/mm/filemap.c	2006-07-31 16:06:04.000000000 +1000
>>@@ -1279,6 +1279,8 @@ struct page *filemap_nopage(struct vm_ar

>>@@ -1462,7 +1426,6 @@ page_not_uptodate:
>> 	page_cache_release(page);
>> 	return NULL;
> 
> 
> But here I think you're missing something: the wait_on_page_locked
> after ->readpage needs to become a lock_page before going to success?
> with unlock_page if it doesn't.

Good catch, thanks.

>>Index: linux-2.6/mm/memory.c
>>===================================================================
>>--- linux-2.6.orig/mm/memory.c	2006-07-26 18:00:47.000000000 +1000
>>+++ linux-2.6/mm/memory.c	2006-07-31 16:06:40.000000000 +1000
>>@@ -1577,6 +1577,13 @@ static int unmap_mapping_range_vma(struc
>> 	unsigned long restart_addr;
>> 	int need_break;
>> 
>>+	/*
>>+	 * files that support invalidating or truncating portions of the
>>+	 * file from under mmaped areas must set the VM_CAN_INVLD flag, and
>>+	 * have their .nopage function return the page locked.
>>+	 */
>>+	BUG_ON(!(vma->vm_flags & VM_CAN_INVLD));
>>+
> 
> 
> I think we shall end up wanting to apply unmap_mapping_range even
> to "unlocked nopage" vmas (the revoke idea) - unless we decide we
> have to make every nopage vma do the locking.  

Yes, I think we should tackle that when we see it?

> 
> Would the BUG_ON be better as a WARN_ON, or nothing at all?  It'll
> give trouble until out-of-tree filesystems/drivers are updated; or
> do we want to give them active trouble there, I'm not sure?

I guess we do because now all the old truncate race handling is gone
they'll see corruption much sooner if they don't lock the page.

>>@@ -2040,36 +2039,23 @@ static int do_no_page(struct mm_struct *
>> 		int write_access)
>> {
>> 	spinlock_t *ptl;
>>-	struct page *new_page;
>>-	struct address_space *mapping = NULL;
>>+	struct page *new_page, *old_page;
> 
> 
> I think it's much clearer to call it "locked_page" than "old_page",
> particularly when you see it alonside Peter's "dirty_page".

OK.

>>+	BUG_ON(vma->vm_flags & VM_CAN_INVLD && !PageLocked(new_page));
> 
> 
> Maybe
> 	if (vma->vm_flags & VM_NOPAGE_LOCKED) {
> 		locked_page = new_page;
> 		BUG_ON(!PageLocked(locked_page));
> 	} else
> 		locked_page = NULL;
> 
> But what I hate about this do_no_page is that sometimes we're going
> through it with the page locked, and sometimes we're going through it
> with the page not locked.  Now I've not noticed any actual problem
> from that (aside from where page_mkwrite fails), and it is well-defined
> which case is which, but it is confusing and does make do_no_page harder
> to audit at any time.
> 
> (I did toy with a separate do_no_page_locked, and nopage_locked
> methods for the filesystems; but duplicating so much code doesn't
> really solve anything.)
> 
> And when you factor in Peter's dirty_page stuff, it's a nuisance:
> because he has had to get_page(dirty_page) then put_page(dirty_page),
> in case page already got freed by vmscan after ptl dropped: which is
> redundant if the page is locked throughout, but you can't rely on that
> because (for a while at least) some fs'es won't set VM_NOPAGE_LOCKED.
> 
> How about
> 	if (vma->vm_flags & VM_NOPAGE_LOCKED)
> 		BUG_ON(!PageLocked(new_page));
> 	else
> 		lock_page(new_page);
> 	locked_page = new_page;
> ?
> 
> And then proceed through the rest of do_no_page sure in the knowledge
> that we have the page locked, simplifying whatever might be simplified
> by that (removing Peter's get_page,put_page at least).  I can see this
> adds a little overhead to some less important cases, but it does make
> the rules much easier to grasp.

OK, I hadn't looked at either the dirty page or the page_mkwrite stuff
with a mind to this patch, to be honest (which is why the page_mkwrite
is broken).

Sorry, it was a diff against what I was working on (2.6.17), and I
hadn't brought it uptodate before posting for comments.


> 
> 
>> 
>> 	/*
>> 	 * Should we do an early C-O-W break?
> 
> 
> Somewhere below here you're missing a hunk to deal with a failed
> page_mkwrite, needing to unlock_page(locked_page).  We don't have
> an example of a page_mkwrite in tree at present, but it seems
> reasonable to suppose that we not it should unlock the page.
> 
> Hmm, David Howells has an afs_file_page_mkwrite which sits waiting
> for an FsMisc page flag to be cleared: might that deadlock with the
> page lock held?  If so, it may need to unlock and relock the page,
> rechecking for truncation.
> 
> Hmmm, page_mkwrite when called from do_wp_page would not expect to
> be holding page lock: we don't want it called with in one case and
> without in the other.  Maybe do_no_page needs to unlock_page before
> calling page_mkwrite, lock_page after, and check page->mapping when
> VM_NOPAGE_LOCKED??

That's pretty foul. I'll take a bit of a look. Is it really a problem
to call in either state, if it is well documented? (we could even
send a flag down if needed). I thought filesystem code loved this
kind of spaghetti locking?

>>Index: linux-2.6/mm/shmem.c
>>===================================================================
>>--- linux-2.6.orig/mm/shmem.c	2006-07-26 18:00:47.000000000 +1000
>>+++ linux-2.6/mm/shmem.c	2006-07-31 16:54:48.000000000 +1000
>>@@ -80,6 +80,7 @@ enum sgp_type {
>> 	SGP_READ,	/* don't exceed i_size, don't allocate page */
>> 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
>> 	SGP_WRITE,	/* may exceed i_size, may allocate page */
>>+	SGP_NOPAGE,	/* same as SGP_CACHE, return with page locked */
>> };
> 
> 
> I don't think you need to add another type for this, SGP_CACHE should do.
> 
> Perhaps you avoided that because it's also used by shmem_populate.
> But another point I want to make is that you do need to update
> filemap_populate, shmem_populate, install_page and whatever to
> make use the same locked page fix: they've been relying on the
> i_size and page->mapping checks, which are not quite enough,
> isn't that right? (now my grasp of the race has fallen out of my
> left ear, and I'd better finish this mail before regrasping it)

Yeah, it isn't enough. Too bad ->populate is implemented as it is,
so it can't take advantage of the generic race finding code in
memory.c (even though that isn't yet sufficient either ;)).

I don't think ->populate has ever particularly troubled itself with
these kinds of theoretical races. I was really hoping to fix linear
pagecache first before getting bogged down with nonlinear.

>> static int shmem_getpage(struct inode *inode, unsigned long idx,
>>@@ -1211,8 +1212,10 @@ repeat:
>> 	}
>> done:
>> 	if (*pagep != filepage) {
>>-		unlock_page(filepage);
>> 		*pagep = filepage;
>>+		if (sgp != SGP_NOPAGE)
>>+			unlock_page(filepage);
>>+
> 
> 
> You've inserted that blank line just to upset me.

I did. I was trying to sneak that one past you.

>>Index: linux-2.6/mm/mmap.c
>>===================================================================
>>--- linux-2.6.orig/mm/mmap.c	2006-07-26 18:00:47.000000000 +1000
>>+++ linux-2.6/mm/mmap.c	2006-07-31 16:03:58.000000000 +1000
>>@@ -1089,6 +1089,9 @@ munmap_back:
>> 			goto free_vma;
>> 	}
>> 
>>+	if (vma->vm_ops)
>>+		vma->vm_flags |= vma->vm_ops->vm_flags;
>>+
> 
> 
> Mmm, I'd prefer not to have this additional way of setting vm_flags.

OK fair enough. However, that's probably not the biggest issue I face.

After thinking about it a bit more, I think I've found my filemap_nopage
wanting. Suppose i_size is shrunk and the page truncated before the
first find_lock_page. OK, no we'll allocate a new page, add it to the
pagecache, and do a ->readpage().

readpage should notice we're reading outside i_size and zero fill it
(which seems like a silly thing to do, but it must be for a good reason,
Andrew? can we read holes past i_size? or is it for some awful ptrace
crud?)

Anyway, that is all expecting do_no_page to notice the truncation that
happened after we sampled i_size, and retry. But I removed the notice-
and-retry logic.

What should possibly be done is to recheck i_size under the page lock.

Regular invalidates (of the type we're trying to plug the hole in) are
fine, because they only ask that ->readpage be rerun after they complete.

Will think a bit more. Thanks for the input. I see Andrew's dropped it:
that's fine for now.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-07 14:18   ` Nick Piggin
@ 2006-08-07 14:58     ` Nick Piggin
  2006-08-07 15:25       ` Hugh Dickins
  2006-08-07 17:05     ` Hugh Dickins
  1 sibling, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2006-08-07 14:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

Nick Piggin wrote:
> Hugh Dickins wrote:

>> I suppose this is quite efficient, but I find it confusing.
>> We have lots and lots of drivers already setting vm_flags in their
>> mmap methods, now you add an alternative way of doing the same thing.
>> Can't you just set VM_NOPAGE_LOCKED in the relevant mmap methods?
>> Or did you try it that way and it worked out messy?
> 
> 
> Generic pagecache doesn't have an mmap method, which is where
> I stopped looking. I guess you could add the |= to filemap_nopage,
> but that's much uglier.
> 
> I don't find it at all confusing, just maybe a bit of a violation
> because the structure is technically only for "ops".

Hmm, I guess adding a new mmap method solely to set that flag
would actually be cleaner. And it would allow any filesystems
that override .nopage bug end up calling filemap_nopage could
equivalently override their mmap but still call filemap_mmap.

Yes that might be nicer.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-07 14:58     ` Nick Piggin
@ 2006-08-07 15:25       ` Hugh Dickins
  2006-08-08  1:17         ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2006-08-07 15:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

On Tue, 8 Aug 2006, Nick Piggin wrote:
> Nick Piggin wrote:
> > 
> > Generic pagecache doesn't have an mmap method, which is where
> > I stopped looking. I guess you could add the |= to filemap_nopage,
> > but that's much uglier.

You can't |= vm_flags in nopage, mmap_sem isn't exclusive there.
But what's the matter with generic_file_mmap?

> Hmm, I guess adding a new mmap method solely to set that flag
> would actually be cleaner. And it would allow any filesystems
> that override .nopage bug end up calling filemap_nopage could
> equivalently override their mmap but still call filemap_mmap.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-07 14:18   ` Nick Piggin
  2006-08-07 14:58     ` Nick Piggin
@ 2006-08-07 17:05     ` Hugh Dickins
  2006-08-08  1:14       ` Nick Piggin
  1 sibling, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2006-08-07 17:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

On Tue, 8 Aug 2006, Nick Piggin wrote:
> Hugh Dickins wrote:
> > 
> > Hmmm, page_mkwrite when called from do_wp_page would not expect to
> > be holding page lock: we don't want it called with in one case and
> > without in the other.  Maybe do_no_page needs to unlock_page before
> > calling page_mkwrite, lock_page after, and check page->mapping when
> > VM_NOPAGE_LOCKED??
> 
> That's pretty foul. I'll take a bit of a look. Is it really a problem
> to call in either state, if it is well documented? (we could even
> send a flag down if needed). I thought filesystem code loved this
> kind of spaghetti locking?

Agreed foul.  David's helpful mail reassures not an immediate problem,
but I'm pretty sure other future uses of page_mkwrite would need to
know if the page is held locked or not.  Yes, could be done by a flag,
though that's not pretty (gives the ->page_mkwrite implementation much
the same schizophrenia as I was disliking here in do_no_page).

> I don't think ->populate has ever particularly troubled itself with
> these kinds of theoretical races. I was really hoping to fix linear
> pagecache first before getting bogged down with nonlinear.

install_page has had mapping & i_size check for quite a while, but
perhaps by theoretical races you mean Andrea's invalidate case.
The nonlinear case is much less a concern than MAP_POPULATE
(though I don't know if anyone really uses that).

> After thinking about it a bit more, I think I've found my filemap_nopage
> wanting. Suppose i_size is shrunk and the page truncated before the
> first find_lock_page. OK, no we'll allocate a new page, add it to the
> pagecache, and do a ->readpage().

I've got a bit lost between merges against different trees,
I'll let you sort that one out.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-07 17:05     ` Hugh Dickins
@ 2006-08-08  1:14       ` Nick Piggin
  2006-08-08 18:19         ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2006-08-08  1:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

Hugh Dickins wrote:
> On Tue, 8 Aug 2006, Nick Piggin wrote:
> 
>>Hugh Dickins wrote:
>>
>>>Hmmm, page_mkwrite when called from do_wp_page would not expect to
>>>be holding page lock: we don't want it called with in one case and
>>>without in the other.  Maybe do_no_page needs to unlock_page before
>>>calling page_mkwrite, lock_page after, and check page->mapping when
>>>VM_NOPAGE_LOCKED??
>>
>>That's pretty foul. I'll take a bit of a look. Is it really a problem
>>to call in either state, if it is well documented? (we could even
>>send a flag down if needed). I thought filesystem code loved this
>>kind of spaghetti locking?
> 
> 
> Agreed foul.  David's helpful mail reassures not an immediate problem,
> but I'm pretty sure other future uses of page_mkwrite would need to
> know if the page is held locked or not.  Yes, could be done by a flag,
> though that's not pretty (gives the ->page_mkwrite implementation much
> the same schizophrenia as I was disliking here in do_no_page).

But it would be better to do it in the theoretical page_mkwrite that
cares, maybe? Imagine one that did want to have the page locked.

Well I'll leave this issue alone for the next iteration.

> 
> 
>>I don't think ->populate has ever particularly troubled itself with
>>these kinds of theoretical races. I was really hoping to fix linear
>>pagecache first before getting bogged down with nonlinear.
> 
> 
> install_page has had mapping & i_size check for quite a while, but
> perhaps by theoretical races you mean Andrea's invalidate case.
> The nonlinear case is much less a concern than MAP_POPULATE
> (though I don't know if anyone really uses that).

Sure but it doesn't do any truncate_count checking. So nothing in my
patch will make it worse than it already is.

> 
> 
>>After thinking about it a bit more, I think I've found my filemap_nopage
>>wanting. Suppose i_size is shrunk and the page truncated before the
>>first find_lock_page. OK, no we'll allocate a new page, add it to the
>>pagecache, and do a ->readpage().
> 
> 
> I've got a bit lost between merges against different trees,
> I'll let you sort that one out.

I wonder if we should have the i_size check (under the page lock) in
do_no_page or down in the ->nopage implementations?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-07 15:25       ` Hugh Dickins
@ 2006-08-08  1:17         ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2006-08-08  1:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

Hugh Dickins wrote:
> On Tue, 8 Aug 2006, Nick Piggin wrote:
> 
>>Nick Piggin wrote:
>>
>>>Generic pagecache doesn't have an mmap method, which is where
>>>I stopped looking. I guess you could add the |= to filemap_nopage,
>>>but that's much uglier.
> 
> 
> You can't |= vm_flags in nopage, mmap_sem isn't exclusive there.

Well you *could*. So long as nobody else modifies vm_flags under
a read lock ;)

> But what's the matter with generic_file_mmap?

Umm... I don't know. Maybe my eyes?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race?
  2006-08-08  1:14       ` Nick Piggin
@ 2006-08-08 18:19         ` Hugh Dickins
  0 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2006-08-08 18:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Andrew Morton, David Howells, Linux Memory Management

On Tue, 8 Aug 2006, Nick Piggin wrote:
> 
> I wonder if we should have the i_size check (under the page lock) in
> do_no_page or down in the ->nopage implementations?

I'm inclined to say down in the ->nopage implementations.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-08-08 18:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-01 11:36 [patch][rfc] possible lock_page fix for Andrea's nopage vs invalidate race? Nick Piggin
2006-08-01 14:27 ` Andrea Arcangeli
2006-08-02  0:19   ` Nick Piggin
2006-08-03 16:04 ` Hugh Dickins
2006-08-05  3:52   ` Nick Piggin
2006-08-07 14:18   ` Nick Piggin
2006-08-07 14:58     ` Nick Piggin
2006-08-07 15:25       ` Hugh Dickins
2006-08-08  1:17         ` Nick Piggin
2006-08-07 17:05     ` Hugh Dickins
2006-08-08  1:14       ` Nick Piggin
2006-08-08 18:19         ` Hugh Dickins
2006-08-03 16:34 ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox