Re: ptep_get_lockless() on 32-bit x86/mips/sh looks wrong

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jann Horn <jannh@google.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Linux-MM <linux-mm@kvack.org>,
	Peter Zijlstra <peterz@infradead.org>,
	 Christoph Hellwig <hch@lst.de>,
	kernel list <linux-kernel@vger.kernel.org>,
	 David Hildenbrand <david@redhat.com>
Subject: Re: ptep_get_lockless() on 32-bit x86/mips/sh looks wrong
Date: Thu, 6 Oct 2022 17:55:54 +0200	[thread overview]
Message-ID: <CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com> (raw)
In-Reply-To: <Yz7316NubHtw2xCa@nvidia.com>

On Thu, Oct 6, 2022 at 5:44 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote:
> > ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
> >
> > pte_t pte;
> > do {
> >   pte.pte_low = ptep->pte_low;
> >   smp_rmb();
> >   pte.pte_high = ptep->pte_high;
> >   smp_rmb();
> > } while (unlikely(pte.pte_low != ptep->pte_low));
> >
> > It has a comment above it that argues that this is correct because:
> > 1. A present PTE can't become non-present and then become a present
> > PTE pointing to another page without a TLB flush in between.
> > 2. TLB flushes involve IPIs.
> >
> > As far as I can tell, in particular on x86, _both_ of those
> > assumptions are false; perhaps on mips and sh only one of them is?
> >
> > Number 2 is straightforward: X86 can run under hypervisors, and when
> > it runs under hypervisors, the MMU paravirtualization code (including
> > the KVM version) can implement remote TLB flushes without IPIs.
> >
> > Number 1 is gnarlier, because breaking that assumption implies that
> > there can be a situation where different threads see different memory
> > at the same virtual address because their TLBs are incoherent. But as
> > far as I know, it can happen when MADV_DONTNEED races with an
> > anonymous page fault, because zap_pte_range() does not always flush
> > stale TLB entries before dropping the page table lock. I think that's
> > probably fine, since it's a "garbage in, garbage out" kind of
> > situation - but if a concurrent GUP-fast can then theoretically end up
> > returning a completely unrelated page, that's bad.
> >
> >
> > Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
> > just change ptep_get_lockless() to use arch_cmpxchg_double() and be
> > done with it...
>
> I think the argument here has nothing to do with IPIs, but is more a
> statement on memory ordering.

The comment above the definition of ptep_get_lockless() claims: "it
will not switch to a completely different present page without a TLB
flush in between; something that we are blocking by holding interrupts
off."

> What we want to get is a non-torn load
> of low/high, under some restricted rules.
>
> PTE writes should be ordered so that the present/not present bit is
> properly:
>
> Zapping a PTE:
>
> write_low (not present)
> wmb()
> write_high (a)
> wmb()
>
> Reestablish a PTE:
>
> write_high (b)
> wmb()
> write_low (present)
> wmb()
>
> This ordering is necessary to make the TLB's atomic 64 bit load work
> properly, otherwise the TLB could read a present entry with a bogus
> other half!
>
> For ptep_get_lockless() we define non-torn as meaning the same as for the TLB:
>
>      pre-zap low / high (present)
>   restablish low / high (b) (present)
>          any low / any high (not present)
>
> Other combinations are forbidden.
>
> The read side has a corresponding list of reads:
>
> read_low
> rmb()
> read_high
> rmb()
> read_low
>
> So, it seems plausible this could be OK based only on atomics (I did
> not check that the present bit is properly placed in the right
> low/high). Do you see a way the atomics don't work out?

The race would be something like this, where A is one thread doing
ptep_get_lockless() and B, C and D are other threads:

<PTE initially points to address 0x0001000100010000>
A: read ptep->pte_low, sees low address half 0x00010000
B: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
C: page fault installs a new PTE pointing to address 0x0001000200020000
A: read ptep->pte_high, sees high address half 0x00010002
C: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
D: page fault installs a new PTE pointing to address 0x0001000300010000
A: re-read ptep->pte_low, sees low address half 0x00010000 matching
the first one
A: returns physical address 0x000100020x00010000, which was never
actually in the PTE

So it's not a problem with the memory ordering, it's just that it's
not possible to atomically read a 64-bit PTE with 32-bit reads when
the PTE can completely change under you - and ptep_get_lockless() was
written under the assumption that this can't happen because of TLB
flush IPIs.

next prev parent reply	other threads:[~2022-10-06 15:56 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-06 15:23 Jann Horn
2022-10-06 15:44 ` Jason Gunthorpe
2022-10-06 15:55   ` Jann Horn [this message]
2022-10-06 15:58     ` Jann Horn
2022-10-06 16:15     ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com' \
    --to=jannh@google.com \
    --cc=david@redhat.com \
    --cc=hch@lst.de \
    --cc=jgg@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox