Re: [PATCH] Fix vmtruncate race and distributed filesystem race

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrea Arcangeli <andrea@suse.de>
To: Dave McCracken <dmccr@us.ibm.com>
Cc: Andrew Morton <akpm@digeo.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Fix vmtruncate race and distributed filesystem race
Date: Fri, 20 Jun 2003 02:17:43 +0200	[thread overview]
Message-ID: <20030620001743.GI18317@dualathlon.random> (raw)
In-Reply-To: <184910000.1055458610@baldur.austin.ibm.com>

Hi,

On Thu, Jun 12, 2003 at 05:56:50PM -0500, Dave McCracken wrote:
> --- 2.5.70-mm8/./mm/memory.c	2003-06-12 13:37:31.000000000 -0500
> +++ 2.5.70-mm8-trunc/./mm/memory.c	2003-06-12 17:51:55.000000000 -0500
> @@ -1138,6 +1138,8 @@ invalidate_mmap_range(struct address_spa
>  			hlen = ULONG_MAX - hba + 1;
>  	}
>  	down(&mapping->i_shared_sem);
> +	/* Protect against page fault */
> +	atomic_inc(&mapping->truncate_count);
>  	if (unlikely(!list_empty(&mapping->i_mmap)))
>  		invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen);
>  	if (unlikely(!list_empty(&mapping->i_mmap_shared)))
> @@ -1390,8 +1392,10 @@ do_no_page(struct mm_struct *mm, struct 
>  	unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
>  {
>  	struct page * new_page;
> +	struct address_space *mapping;
>  	pte_t entry;
>  	struct pte_chain *pte_chain;
> +	unsigned sequence;
>  	int ret;
>  
>  	if (!vma->vm_ops || !vma->vm_ops->nopage)
> @@ -1400,6 +1404,9 @@ do_no_page(struct mm_struct *mm, struct 
>  	pte_unmap(page_table);
>  	spin_unlock(&mm->page_table_lock);
>  
> +	mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
> +retry:
> +	sequence = atomic_read(&mapping->truncate_count);
>  	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
>  
>  	/* no page was available -- either SIGBUS or OOM */
> @@ -1428,6 +1435,16 @@ do_no_page(struct mm_struct *mm, struct 
>  	}
>  
>  	spin_lock(&mm->page_table_lock);
> +	/*
> +	 * For a file-backed vma, someone could have truncated or otherwise
> +	 * invalidated this page.  If invalidate_mmap_range got called,
> +	 * retry getting the page.
> +	 */
> +	if (unlikely(sequence != atomic_read(&mapping->truncate_count))) {
> +		spin_unlock(&mm->page_table_lock);
> +		page_cache_release(new_page);
> +		goto retry;
> +	}
>  	page_table = pte_offset_map(pmd, address);

maybe I'm missing something silly but this fixes nothing IMHO. It's not
a coincidence I used the seq_lock (aka frlock in 2.4-aa) in my fix, a
single counter increment isn't nearly enough, you definitely need _both_
an entry and exit point in do_truncate or you'll never know if
vmtruncate has been running under you. The first increment is like the
down_read, the second increment is the up_read. Both are necessary to
trap any vmtruncate during the do_no_page.

Your patch traps this timing case:

	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate
				read counter

	increment counter
	vmtruncate
				->nopage
				read counter again -> different so retry


but you can't trap this with a single counter increment in do_truncate:

	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate
	increment counter
				read counter
				->nopage
	vmtruncate
				read counter again -> different so retry

thanks to the second counter increment after vmtruncate in my fix, the
above race couldn't happen.

About the down(&inode->i_sem); up(), that you dropped under Andrew's
suggestion, while that maybe ugly, it will have a chance to save cpu,
and since it's a slow path such goto, it's definitely worthwhile to keep
it IMHO. Otherwise one cpu will keep scheduling in a loop until truncate
returns, and it can take time since it may have to do I/O or wait on
some I/O semaphore. It wouldn't be DoSable, because the
ret-from-exception will check need_resched, but still it's bad for cpu
utilization and such a waste can be avoided trivially as in my fix.

I was chatting with Daniel about those hooks a few minutes ago, and he
suggested to make do_no_page a callback itself (instead of having
do_no_page call into a ->nopage further callback). And to provide by
default a generic implementation that would be equivalent to the current
do_no_page. As far as I can tell that will save both the new pointer to
function for the DSM hook (that IMHO has to be taken twice, both before
->nopage and after ->nopage, not only before the ->nopage, for the
reason explained above) and the ->nopage hook itself. So maybe that
could be a cleaner solution to avoid the DSM hooks enterely, so we don't
have more hooks but less, and a library call. This sounds the best for
performance and flexibility. (talking only about 2.5 of course, 2.4 I
think is just fine with my ""production"" 8) fix here:

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc8aa1/2.4.21rc8aa1/9999_truncate-nopage-race-1

)

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

next prev parent reply	other threads:[~2003-06-20  0:17 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-06-12 20:16 Dave McCracken
2003-06-12 20:49 ` Andrew Morton
2003-06-12 21:00   ` Andrew Morton
2003-06-12 21:08     ` Dave McCracken
2003-06-12 21:44       ` Andrew Morton
2003-06-12 22:56         ` Dave McCracken
2003-06-12 23:07           ` Andrew Morton
2003-06-20  0:17           ` Andrea Arcangeli [this message]
2003-06-23  3:28             ` Paul E. McKenney
2003-06-23  6:29               ` Andrea Arcangeli
2003-06-23  6:32               ` Andrew Morton
2003-06-23  7:43                 ` Andrea Arcangeli
2003-06-23  7:56                   ` Andrew Morton
2003-06-23  8:10                     ` Andrea Arcangeli
2003-06-24  1:37                       ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030620001743.GI18317@dualathlon.random \
    --to=andrea@suse.de \
    --cc=akpm@digeo.com \
    --cc=dmccr@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox