Re: [RFC] page-table walkers vs memory order

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Hugh Dickins <hughd@google.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Nick Piggin <npiggin@kernel.dk>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC] page-table walkers vs memory order
Date: Tue, 24 Jul 2012 14:51:05 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.00.1207241356350.2094@eggly.anvils> (raw)
In-Reply-To: <1343064870.26034.23.camel@twins>

On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> 
> While staring at mm/huge_memory.c I found a very under-commented
> smp_wmb() in __split_huge_page_map(). It turns out that its copied from
> __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> thereto).
> 
> Now, afaict we're not good, as per that comment. Paul has since
> convinced some of us that compiler writers are pure evil and out to get
> us.
> 
> Therefore we should do what rcu_dereference() does and use
> ACCESS_ONCE()/barrier() followed smp_read_barrier_depends() every time
> we dereference a page-table pointer.
> 
> 
> In particular it looks like things like
> mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> itself) are quite broken on Alpha

The Alpha pmd_offset() and pte_offset_map() already contain an
smp_read_barrier_depends() (362a61ad from Nick); with comment that
it's not needed on the pgd, and I presume the pud level is folded.
Does Alpha really need more of them, as you have put below?

> and subtly broken for those of us with 'creative' compilers.

I don't want to fight against ACCESS_ONCE() (or barrier(): that's
interesting, thank you, I hadn't seen it used as an ACCESS_ONCE()
substitute before); but I do want to question it a little.

I'm totally unclear whether the kernel ever gets built with these
'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
of where some future compiler would be permitted to mess with our
assumptions?  Or is it actually saving us already today?  Would we
know?  Could there be a boottime test that would tell us?  Is it
likely that a future compiler would have an "--access_once"
option that the kernel build would want to turn on?

Those may all be questions for Paul!

> 
> Should I go do a more extensive audit of page-table walkers or are we
> happy with the status quo?

I do love the status quo, but an audit would be welcome.  When
it comes to patches, personally I tend to prefer ACCESS_ONCE() and
smp_read_barrier_depends() and accompanying comments to be hidden away
in the underlying macros or inlines where reasonable, rather than
repeated all over; but I may have my priorities wrong on that.

The last time we rewrote the main pgd-pud-pmd-pte walkers,
we believed that no ACCESS_ONCE() was necessary, because although a
pgd-pud-pmd entry might be racily instantiated at any instant, it
could never change beneath us - the freeing of page tables happens
only when we cannot reach them by other routes.

(Never quite true: those _clear_bad() things can zero entries at any
instant, but we're already in a bad place when those come into play,
so we never worried about racing against them.)

Since then, I think THP has made the rules more complicated; but I
believe Andrea paid a great deal of attention to that kind of issue.

I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
gup_fast() breaks as many rules as it can, and in particular may
be racing with the freeing of page tables; but I'm not so sure
about the pagewalk mods - we could say "cannot do any harm",
but I don't like adding lines on that basis.

Hugh

> 
> ---
>  arch/x86/mm/gup.c |    6 +++---
>  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  
>  	pmdp = pmd_offset(&pud, addr);
>  	do {
> -		pmd_t pmd = *pmdp;
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
>  		/*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  
>  	pudp = pud_offset(&pgd, addr);
>  	do {
> -		pud_t pud = *pudp;
> +		pud_t pud = ACCESS_ONCE(*pudp);
>  
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	local_irq_save(flags);
>  	pgdp = pgd_offset(mm, addr);
>  	do {
> -		pgd_t pgd = *pgdp;
> +		pgd_t pgd = ACCESS_ONCE(*pgdp);
>  
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pte = pte_offset_map(pmd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	for (;;) {
>  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
> @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pmd = pmd_offset(pud, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  again:
>  		next = pmd_addr_end(addr, end);
> @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pud = pud_offset(pgd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud)) {
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2012-07-24 21:51 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-23 17:34 Peter Zijlstra
2012-07-23 19:27 ` Paul E. McKenney
2012-07-24 21:51 ` Hugh Dickins [this message]
2012-07-25 17:56   ` Paul E. McKenney
2012-07-25 20:26     ` Hugh Dickins
2012-07-25 21:12       ` Paul E. McKenney
2012-07-25 22:09         ` Hugh Dickins
2012-07-25 22:37           ` Paul E. McKenney
2012-07-26  8:11           ` Peter Zijlstra
2012-07-30 19:21         ` Jamie Lokier
2012-07-30 20:28           ` Paul E. McKenney
2012-07-26 20:39   ` Peter Zijlstra
2012-07-27 19:22     ` Hugh Dickins
2012-07-27 19:39       ` Paul E. McKenney
2012-08-04 14:37   ` Andrea Arcangeli
2012-08-04 22:02     ` Paul E. McKenney
2012-08-04 22:47       ` Andrea Arcangeli
2012-08-04 22:59         ` Dr. David Alan Gilbert
2012-08-04 23:11           ` Paul E. McKenney
2012-08-05  0:10             ` Dr. David Alan Gilbert
2012-08-04 23:06         ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.00.1207241356350.2094@eggly.anvils \
    --to=hughd@google.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@kernel.dk \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox