Re: [PATCH] Fix vmtruncate race and distributed filesystem race

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrea Arcangeli <andrea@suse.de>
To: Andrew Morton <akpm@digeo.com>
Cc: paulmck@us.ibm.com, dmccr@us.ibm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Fix vmtruncate race and distributed filesystem race
Date: Mon, 23 Jun 2003 09:43:53 +0200	[thread overview]
Message-ID: <20030623074353.GE19940@dualathlon.random> (raw)
In-Reply-To: <20030622233235.0924364d.akpm@digeo.com>

On Sun, Jun 22, 2003 at 11:32:35PM -0700, Andrew Morton wrote:
> "Paul E. McKenney" <paulmck@us.ibm.com> wrote:
> >
> > > but you can't trap this with a single counter increment in do_truncate:
> >  > 
> >  > 	CPU 0			CPU 1
> >  > 	----------		-----------
> >  > 				do_no_page
> >  > 	truncate
> 
>         i_size = new_i_size;
> 
> >  > 	increment counter
> >  > 				read counter
> >  > 				->nopage
> 
>                                 check i_size
> 
> >  > 	vmtruncate
> >  > 				read counter again -> different so retry
> >  > 
> >  > thanks to the second counter increment after vmtruncate in my fix, the
> >  > above race couldn't happen.
> > 
> >  The trick is that CPU 0 is expected to have updated the filesystem's
> >  idea of what pages are available before calling vmtruncate,
> >  invalidate_mmap_range() or whichever.
> 
> i_size has been updated, and filemap_nopage() will return NULL.

ok, I finally see your object now, so it's the i_size that was in theory
supposed to act as a "second" increment on the truncate side to
serialize against in do_no_page.

I was searching for any additional SMP locking, and infact there is
none.  No way the current code can serialize against the i_size.

The problem is that the last patch posted on l-k isn't guaranteeing
anything like what you wrote above.  What can happen is that the i_size
can be read still way before the counter.

 	CPU 0			CPU 1
	----------		-----------
				do_no_page
	truncate

                                specualtive read i_size into l2 cache
        i_size = new_i_size;

	increment counter
 				read counter
				->nopage
				check i_size from l2 cache

 	vmtruncate
 				read counter again -> different so retry

For the single counter increment patch to close the race using the
implicit i_size update as a second increment, you need to add an
absolutely necessary smb_rmb() here:

        sequence = atomic_read(&mapping->truncate_count);
+	smp_rmb();	
        new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);

And for non-x86 you need a smb_wmb() between the i_size = new_i_size and
the atomic_inc on the truncate side too, because advanced archs with
finegrined locking can implement a one-way barrier, where the counter
increment could pass the down(&mapping->i_shared_sem). Last but not the
least it was IMHO misleading and dirty to do the counter increment in
truncate after taking a completely unrelated semaphore just to take
advantage of the implicit barrier on the x86. So doing it with a proper
smp_wmb() besides fixing advanced archs will kind of make the code
readable, so you can understand the i_size update is the second
increment (actually decrement because it's the i_size).

Now that I see what the second increment was supposed to be, it's a cute
idea to use the i_size for that, and I can appreciate its beauty (I
mean, in purist terms, in practice IMHO the seq_lock was more obivously
correct code with all the smp reordering hidden into the lock semantics
and I'm 100% sure you couldn't measure any performance difference in
truncate due the second increment ;). But if you prefer to keep the most
finegrined approch of using the i_size with the explicit memory
barriers, I'm of course not against it after you add the needed fixed I
outlined above that will finally close the race.

thanks,

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

next prev parent reply	other threads:[~2003-06-23  7:43 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-06-12 20:16 Dave McCracken
2003-06-12 20:49 ` Andrew Morton
2003-06-12 21:00   ` Andrew Morton
2003-06-12 21:08     ` Dave McCracken
2003-06-12 21:44       ` Andrew Morton
2003-06-12 22:56         ` Dave McCracken
2003-06-12 23:07           ` Andrew Morton
2003-06-20  0:17           ` Andrea Arcangeli
2003-06-23  3:28             ` Paul E. McKenney
2003-06-23  6:29               ` Andrea Arcangeli
2003-06-23  6:32               ` Andrew Morton
2003-06-23  7:43                 ` Andrea Arcangeli [this message]
2003-06-23  7:56                   ` Andrew Morton
2003-06-23  8:10                     ` Andrea Arcangeli
2003-06-24  1:37                       ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030623074353.GE19940@dualathlon.random \
    --to=andrea@suse.de \
    --cc=akpm@digeo.com \
    --cc=dmccr@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=paulmck@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox