From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4FABBECAAA1
	for <linux-mm@archiver.kernel.org>; Thu, 27 Oct 2022 18:14:16 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id AD5706B0071; Thu, 27 Oct 2022 14:14:15 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id A864A6B0073; Thu, 27 Oct 2022 14:14:15 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 94CD06B0074; Thu, 27 Oct 2022 14:14:15 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 82DE96B0071
	for <linux-mm@kvack.org>; Thu, 27 Oct 2022 14:14:15 -0400 (EDT)
Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id 274591606D0
	for <linux-mm@kvack.org>; Thu, 27 Oct 2022 18:14:15 +0000 (UTC)
X-FDA: 80067528870.12.E9DD64D
Received: from mail-qk1-f170.google.com (mail-qk1-f170.google.com [209.85.222.170])
	by imf08.hostedemail.com (Postfix) with ESMTP id B559016000E
	for <linux-mm@kvack.org>; Thu, 27 Oct 2022 18:14:14 +0000 (UTC)
Received: by mail-qk1-f170.google.com with SMTP id x26so1662121qki.0
        for <linux-mm@kvack.org>; Thu, 27 Oct 2022 11:14:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linux-foundation.org; s=google;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=c9TtS5EyPfrJtXKFUijRKZY8i/FB2qm2pzEph6K2Voc=;
        b=OGWIzf//yDrzpIU5CUpEyJ7Dm5rnyRzXTwtOsRe4hwmqHu+knWWFjSY5Bsl8zdN2JR
         AbhywM5MxQHd0TDQb+wMPn9yzk03EzWqxeNIV3iaekqGTUDPJgEKXXQd5WJ0UVzk037R
         SPcYzev0TYZ8UCj2B/vUiwWZh4URP/tTZ7s+E=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=c9TtS5EyPfrJtXKFUijRKZY8i/FB2qm2pzEph6K2Voc=;
        b=vaNQmZtKUVLSSRgTt2TVolZaEPva6uhxOmYKkA24SMc1tYYQd6hMc/Es7yVBnL2Dst
         kR29yiHRaO1kBfrU/XqCpcA1ye/PS57lPg//LmlrMxPRpONoPS+G6uwzc87unECHk6cj
         Vh72MiugWgucz+eyWHofSmj8jGKzYElO4MsE7FKhLBVcUMOfBku1dtc2cvpiplRF8xtT
         1HM7VSLUKbKUnABDCX/86kdmmZXejFWvsBTuWwlNtUEevH0FtZORdobzBue2rVX2+jE5
         ADM2D/HT2KFd4WlXcnboWIfraudVQDiQti1Qhaek560+Dhm3G/fJsEhvVlAVakhZ74ZA
         LHuA==
X-Gm-Message-State: ACrzQf0jPpcSygOI4YtP1Nax7Ab/dwGNr+5PIDrGKUreewB8kNqsIxA4
	Eas0E4RgXojx9lk0D3Mm5sKWMpdZsyyKJQ==
X-Google-Smtp-Source: AMsMyM6OJsU/5/+/DRAYUJomj846sOsFr15gGsVwEJ170jldqQvFskVrwlwAoiU5/WAZIWwHudosLA==
X-Received: by 2002:a05:620a:4884:b0:6f3:91b8:4bf7 with SMTP id ea4-20020a05620a488400b006f391b84bf7mr19190311qkb.167.1666894453465;
        Thu, 27 Oct 2022 11:14:13 -0700 (PDT)
Received: from mail-yb1-f177.google.com (mail-yb1-f177.google.com. [209.85.219.177])
        by smtp.gmail.com with ESMTPSA id w10-20020a05620a444a00b006f9e103260dsm1439105qkp.91.2022.10.27.11.14.12
        for <linux-mm@kvack.org>
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Thu, 27 Oct 2022 11:14:12 -0700 (PDT)
Received: by mail-yb1-f177.google.com with SMTP id z192so3255826yba.0
        for <linux-mm@kvack.org>; Thu, 27 Oct 2022 11:14:12 -0700 (PDT)
X-Received: by 2002:a5b:984:0:b0:6ca:9345:b2ee with SMTP id
 c4-20020a5b0984000000b006ca9345b2eemr8216571ybq.362.1666894452121; Thu, 27
 Oct 2022 11:14:12 -0700 (PDT)
MIME-Version: 1.0
References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org>
 <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> <Y1ZGNwUEdm15W6Xz@hirez.programming.kicks-ass.net>
 <CAG48ez3fG=WnvbiE5mJaD66_gJj_hohnV8CqBG9eNdjd7pJW3A@mail.gmail.com>
 <Y1fsYwshJ93FT21r@hirez.programming.kicks-ass.net> <CAG48ez3VE+3dVdUMK+Pg_942gR+h_TCcSaFxGwCbNfh3W+mfOA@mail.gmail.com>
 <Y1f7YvKuwOl1XEwU@hirez.programming.kicks-ass.net> <CAG48ez05gBiB2w7bL_3O_OUoBWazmHnRwMiuni_wQyXBUcaxbQ@mail.gmail.com>
 <Y1oub9MvqwGBlHkq@hirez.programming.kicks-ass.net>
In-Reply-To: <Y1oub9MvqwGBlHkq@hirez.programming.kicks-ass.net>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 27 Oct 2022 11:13:55 -0700
X-Gmail-Original-Message-ID: <CAHk-=wihPdOtXgJ32pLB6Xd-UvnaZW9YvOOjM24JGYRjNHeykA@mail.gmail.com>
Message-ID: <CAHk-=wihPdOtXgJ32pLB6Xd-UvnaZW9YvOOjM24JGYRjNHeykA@mail.gmail.com>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
To: Peter Zijlstra <peterz@infradead.org>
Cc: Jann Horn <jannh@google.com>, John Hubbard <jhubbard@nvidia.com>, x86@kernel.org, 
	willy@infradead.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, 
	linux-mm@kvack.org, aarcange@redhat.com, kirill.shutemov@linux.intel.com, 
	jroedel@suse.de, ubizjak@gmail.com
Content-Type: text/plain; charset="UTF-8"
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1666894454;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=c9TtS5EyPfrJtXKFUijRKZY8i/FB2qm2pzEph6K2Voc=;
	b=SRgZ6KcV3Id68KYTPba4BOJ7qm6YezTbbidE+olwkB2fU+oW7ZDtjjqKcYBAvRoQnd5zt5
	+lrgU44YsRPu3RKjrZfO2zDOs4Wt4vaEAaedDNVBWva4IARyV5qiNV/MU7RNVIUY2LOLpJ
	27U66RQm9OVKeuQp1SznVTDAricPBkM=
ARC-Authentication-Results: i=1;
	imf08.hostedemail.com;
	dkim=pass header.d=linux-foundation.org header.s=google header.b="OGWIzf//";
	spf=pass (imf08.hostedemail.com: domain of torvalds@linuxfoundation.org designates 209.85.222.170 as permitted sender) smtp.mailfrom=torvalds@linuxfoundation.org;
	dmarc=none
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1666894454; a=rsa-sha256;
	cv=none;
	b=UpzwEi+TxtYVCoZdTIWw27Ci3GnWtMgfMsPh+DdlCVh9ZbC8OvUsy1PU+/EKirFTLsxd0R
	dq35YCE1E0x0sEl0ZbaWocr7KwZLqwmttFUEM8ALI67GmKVX7urXQ0i+y6H14HFil4h3bA
	iau2VvCocHCqj401ELA/sqkQfZF4YUA=
X-Stat-Signature: nk1hid57p8kmgpr4ebkpc8w9ikumk8px
X-Rspamd-Queue-Id: B559016000E
X-Rspam-User: 
Authentication-Results: imf08.hostedemail.com;
	dkim=pass header.d=linux-foundation.org header.s=google header.b="OGWIzf//";
	spf=pass (imf08.hostedemail.com: domain of torvalds@linuxfoundation.org designates 209.85.222.170 as permitted sender) smtp.mailfrom=torvalds@linuxfoundation.org;
	dmarc=none
X-Rspamd-Server: rspam04
X-HE-Tag: 1666894454-484955
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I thought the general rule was that if you modify a PTE and have not
> unmapped things -- IOW, there's actual concurrency possible on the
> thing, then the TLB invalidate needs to happen under pte_lock, since
> that is what controls concurrency at the pte level.

Yeah, I think we should really think about the TLB issue and make the
rules very clear.

A lot of our thinking behind "fix TLB races" have been about
"use-after-free wrt TLB on another CPU", and the design in
zap_page_ranges() is almost entirely about that: making sure that the
TLB flush happens before the underlying pages are free'd.

I say "almost entirely", because you can see the effects of the
*other* kind of race in

                        if (!PageAnon(page)) {
                                if (pte_dirty(ptent)) {
                                        force_flush = 1;
                                        set_page_dirty(page);
                                }

where it isn't about the lifetime of the page, but about the state
coherency wrt other users.

But I think even that code is quite questionable, and we really should
write down the rules much more strictly somewhere.

For example, wrt that pte_dirty() causing force_flush, I'd love to
have some core set of rules that would explain

 (a) why it is only relevant for shared pages

 (b) do even shared pages need it for the "fullmm" case when we tear
down the whole VM?

 (c) what are the force_flush interactions wrt holding the mmap lock
for writing vs reading?

In the above, I think (b)/(c) are related - I suspect "fullmm" is
mostly equivalent to "mmap held for writing", in that it guarantees
that there should be no concurrent faults or other active sw ops on
the table.

But "fullmm" is probably even stronger than "mmap write-lock" in that
it should also mean "no other CPU can be actively using this" - either
for hardware page table walking, or for GUP.

Both both have the vmscan code and rmap that can still get to the
pages - and the vmscan code clearly only cares about the page table
lock. But the vmscan code itself also shouldn't care about some stale
TLB value, so ...

> As it stands MADV_DONTNEED seems to blatatly violate that general rule.

So let's talk about exactly what and why the TLB would need to be
flushed before dropping the page table lock.

For example, MADV_DONTNEED does this all with just the mmap lock held
for reading, so we *unless* we have that 'force_flush', we can

 (a) have another CPU continue to use the old stale TLB entry for quite a while

 (b) yet another CPU (that didn't have a TLB entry, or wanted to write
to a read-only one ) could take a page fault, and install a *new* PTE
entry in the same slot, all at the same time.

Now, that's clearly *very* confusing. But being confusing may not mean
"wrong" - we're still delaying the free of the old entry, so there's
no use-after-free.

The biggest problem I can see is that this means that user space
memory ordering might be screwed up: the CPU in (a) will see not just
an old TLB entry, but simply old *data*, when the CPU in (b) may be
writing to that same address with new data.

So I think we clearly do *NOT* serialize as much as we could for
MADV_DONTNEED, and I think the above is a real semantic result of
that.

But equally arguably if you do this kind of unserialized MADV_DONTNEED
in user space, that's a case of "you asked for the breakage - you get
to keep both pieces".

So is that ever an actual problem? If the worst issue is that some
CPU's may see old ("pre-DONTNEED") data, while other CPU's then see
new data while the MADV_DONTNEED is executing, I think maybe *THAT*
part is fine.

But because it's so very confusing, maybe we have other problems in
this area, which is why I think it would be really good to have an
actual set of real hard documented rules about this all, and exactly
when we need to flush TLBs synchronously with the page table lock, and
when we do not.

Anybody willing to try to write up the rules (and have each rule
document *why* it's a rule - not just "by fiat", but an actual "these
are the rules and this is *why* they are the rules").

Because right now I think all of our rules are almost entirely just
encoded in the code, with a couple of comments, and a few people who
just remember why we do what we do.

                   Linus