From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E49FC433F5 for ; Thu, 6 Oct 2022 15:56:43 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 298B76B0072; Thu, 6 Oct 2022 11:56:43 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 248358E0002; Thu, 6 Oct 2022 11:56:43 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0C1A18E0001; Thu, 6 Oct 2022 11:56:43 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id E777B6B0072 for ; Thu, 6 Oct 2022 11:56:42 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id B9ADB1202E4 for ; Thu, 6 Oct 2022 15:56:42 +0000 (UTC) X-FDA: 79990977444.10.6C3BEDE Received: from mail-pf1-f181.google.com (mail-pf1-f181.google.com [209.85.210.181]) by imf15.hostedemail.com (Postfix) with ESMTP id 5D908A001C for ; Thu, 6 Oct 2022 15:56:42 +0000 (UTC) Received: by mail-pf1-f181.google.com with SMTP id y136so2446065pfb.3 for ; Thu, 06 Oct 2022 08:56:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=CjxW0PKCyTrNxuUlcWq36jveOujZNX1r5syNeeGoP8s=; b=dYWAXBCfGNdt2FEEr/XAmuM7IcXTbwrbh6ObZ10wCExgLJnFqd7ULWVxtf8SpzsIG2 xsAwTOGs5muyf1Ek1vxwQDgI6lZcSLPZxoQxbC9yEJlyWJnY9J71hAiD2wS0apDSMGpy pV5OUUrYFN5ggX769ElWdweoKedia0K/HYzpH1LVfPgRcuvzzr+lcfXPq5afXHFoVzbX Q0JUf8e0b+0XygU3i40KcPg+H+h2FmEktr+/Hc5KZg14AEyg/ARvuaT3odGj4Xxvyg7A nxMuU2PF/AkOq1ARlHMI2QbMqJ1SbFUOxUqCIZgJM1O6NwXGF1TNEAjl94MdR7SR44Lo Y6iQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=CjxW0PKCyTrNxuUlcWq36jveOujZNX1r5syNeeGoP8s=; b=Zg12cx39lGUM0+UolQb/tdvGFHDOP50gdBKL4NkBPTXyKkKB5Tyac+HThKJhds6/Wn VbPU2NhXwkjnFIzsfi0Mt2tF4Hwq+aKe9iWyzyn5lgyVIbMF+kaCJg+lPzOLkn59TmJh 5C5SyYYQyE6wJgGQNEAJG3ZjeUXbjmL/ZzqO0eG10IMuTs3dA+aJwNbMQ0Kch6BrBnJa 8Fq+tiLQJWREnLXTdGhNj4ayAWrnKYF2f+b7ImTSdd9Vn3Du67cQAYPPUqmoXw3zDlrn McHiULQnqCLXzZL7vlJU+ZSEd0FmWeQ3Hoc9xF8vyCt9YgyQBTMkgpJL34x7BsqddYfA 9GpA== X-Gm-Message-State: ACrzQf0GCGwiaI3j36khbb61wvFpiepydkHUxDe1MLKjAasa6SC3i+L/ 9aMoEEtSaXOha8oZJ/jwcjeyDnhct3eL/aHrF/LX1hwe2/ZMbw== X-Google-Smtp-Source: AMsMyM60Fv/1iOXbHNVfyXudpJGF9twOcRqcHv9MyUpEc9+XYdMr/+CfmeGaJmP/s5m+16actayvNHIFg5LWPz/vnIQ= X-Received: by 2002:a05:6602:2dc7:b0:6a5:14e5:d709 with SMTP id l7-20020a0566022dc700b006a514e5d709mr229751iow.54.1665071790552; Thu, 06 Oct 2022 08:56:30 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Jann Horn Date: Thu, 6 Oct 2022 17:55:54 +0200 Message-ID: Subject: Re: ptep_get_lockless() on 32-bit x86/mips/sh looks wrong To: Jason Gunthorpe Cc: Linux-MM , Peter Zijlstra , Christoph Hellwig , kernel list , David Hildenbrand Content-Type: text/plain; charset="UTF-8" ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1665071802; a=rsa-sha256; cv=none; b=JjPQqGr5QV0liMSzWbsyFDN/P33l8AlDSKXRujhNlwkN79vVwIFodf9dnjxfLqPGcIIt8P kvghQK1JudeJ7fvbXMwhissLCpW3VhSW99+DlH4NXUIlTI535sHtoDRrvCdCPlAgFXGILl 6JU53NQ7H28gvNHZJ2P9HuZqsFYJ2Zs= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=dYWAXBCf; spf=pass (imf15.hostedemail.com: domain of jannh@google.com designates 209.85.210.181 as permitted sender) smtp.mailfrom=jannh@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1665071802; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CjxW0PKCyTrNxuUlcWq36jveOujZNX1r5syNeeGoP8s=; b=N5N6VQFwJHBdl6WaR0L+Jz4OS+VIhZ5okjICWUTO8OAQY49SX9n1Tx/sEhgt3lQCMLhtAG KWp/8xyXhgSJuFCYWqUtDKKFdf1drWGZaFkfps71A9wpZJv0LU2Ubop4yButy8T02ZswNw fFYtUwvFOKG6WtoyQqfKUYL3gqIEsYk= X-Rspam-User: Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=dYWAXBCf; spf=pass (imf15.hostedemail.com: domain of jannh@google.com designates 209.85.210.181 as permitted sender) smtp.mailfrom=jannh@google.com; dmarc=pass (policy=reject) header.from=google.com X-Stat-Signature: oyoax9wx7xs1h156q954myskspa39sb4 X-Rspamd-Queue-Id: 5D908A001C X-Rspamd-Server: rspam01 X-HE-Tag: 1665071802-375768 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Oct 6, 2022 at 5:44 PM Jason Gunthorpe wrote: > On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote: > > ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH: > > > > pte_t pte; > > do { > > pte.pte_low = ptep->pte_low; > > smp_rmb(); > > pte.pte_high = ptep->pte_high; > > smp_rmb(); > > } while (unlikely(pte.pte_low != ptep->pte_low)); > > > > It has a comment above it that argues that this is correct because: > > 1. A present PTE can't become non-present and then become a present > > PTE pointing to another page without a TLB flush in between. > > 2. TLB flushes involve IPIs. > > > > As far as I can tell, in particular on x86, _both_ of those > > assumptions are false; perhaps on mips and sh only one of them is? > > > > Number 2 is straightforward: X86 can run under hypervisors, and when > > it runs under hypervisors, the MMU paravirtualization code (including > > the KVM version) can implement remote TLB flushes without IPIs. > > > > Number 1 is gnarlier, because breaking that assumption implies that > > there can be a situation where different threads see different memory > > at the same virtual address because their TLBs are incoherent. But as > > far as I know, it can happen when MADV_DONTNEED races with an > > anonymous page fault, because zap_pte_range() does not always flush > > stale TLB entries before dropping the page table lock. I think that's > > probably fine, since it's a "garbage in, garbage out" kind of > > situation - but if a concurrent GUP-fast can then theoretically end up > > returning a completely unrelated page, that's bad. > > > > > > Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't > > just change ptep_get_lockless() to use arch_cmpxchg_double() and be > > done with it... > > I think the argument here has nothing to do with IPIs, but is more a > statement on memory ordering. The comment above the definition of ptep_get_lockless() claims: "it will not switch to a completely different present page without a TLB flush in between; something that we are blocking by holding interrupts off." > What we want to get is a non-torn load > of low/high, under some restricted rules. > > PTE writes should be ordered so that the present/not present bit is > properly: > > Zapping a PTE: > > write_low (not present) > wmb() > write_high (a) > wmb() > > Reestablish a PTE: > > write_high (b) > wmb() > write_low (present) > wmb() > > This ordering is necessary to make the TLB's atomic 64 bit load work > properly, otherwise the TLB could read a present entry with a bogus > other half! > > For ptep_get_lockless() we define non-torn as meaning the same as for the TLB: > > pre-zap low / high (present) > restablish low / high (b) (present) > any low / any high (not present) > > Other combinations are forbidden. > > The read side has a corresponding list of reads: > > read_low > rmb() > read_high > rmb() > read_low > > So, it seems plausible this could be OK based only on atomics (I did > not check that the present bit is properly placed in the right > low/high). Do you see a way the atomics don't work out? The race would be something like this, where A is one thread doing ptep_get_lockless() and B, C and D are other threads: A: read ptep->pte_low, sees low address half 0x00010000 B: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet C: page fault installs a new PTE pointing to address 0x0001000200020000 A: read ptep->pte_high, sees high address half 0x00010002 C: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet D: page fault installs a new PTE pointing to address 0x0001000300010000 A: re-read ptep->pte_low, sees low address half 0x00010000 matching the first one A: returns physical address 0x000100020x00010000, which was never actually in the PTE So it's not a problem with the memory ordering, it's just that it's not possible to atomically read a 64-bit PTE with 32-bit reads when the PTE can completely change under you - and ptep_get_lockless() was written under the assumption that this can't happen because of TLB flush IPIs.