From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4FABBECAAA1 for ; Thu, 27 Oct 2022 18:14:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AD5706B0071; Thu, 27 Oct 2022 14:14:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A864A6B0073; Thu, 27 Oct 2022 14:14:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 94CD06B0074; Thu, 27 Oct 2022 14:14:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 82DE96B0071 for ; Thu, 27 Oct 2022 14:14:15 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 274591606D0 for ; Thu, 27 Oct 2022 18:14:15 +0000 (UTC) X-FDA: 80067528870.12.E9DD64D Received: from mail-qk1-f170.google.com (mail-qk1-f170.google.com [209.85.222.170]) by imf08.hostedemail.com (Postfix) with ESMTP id B559016000E for ; Thu, 27 Oct 2022 18:14:14 +0000 (UTC) Received: by mail-qk1-f170.google.com with SMTP id x26so1662121qki.0 for ; Thu, 27 Oct 2022 11:14:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=c9TtS5EyPfrJtXKFUijRKZY8i/FB2qm2pzEph6K2Voc=; b=OGWIzf//yDrzpIU5CUpEyJ7Dm5rnyRzXTwtOsRe4hwmqHu+knWWFjSY5Bsl8zdN2JR AbhywM5MxQHd0TDQb+wMPn9yzk03EzWqxeNIV3iaekqGTUDPJgEKXXQd5WJ0UVzk037R SPcYzev0TYZ8UCj2B/vUiwWZh4URP/tTZ7s+E= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=c9TtS5EyPfrJtXKFUijRKZY8i/FB2qm2pzEph6K2Voc=; b=vaNQmZtKUVLSSRgTt2TVolZaEPva6uhxOmYKkA24SMc1tYYQd6hMc/Es7yVBnL2Dst kR29yiHRaO1kBfrU/XqCpcA1ye/PS57lPg//LmlrMxPRpONoPS+G6uwzc87unECHk6cj Vh72MiugWgucz+eyWHofSmj8jGKzYElO4MsE7FKhLBVcUMOfBku1dtc2cvpiplRF8xtT 1HM7VSLUKbKUnABDCX/86kdmmZXejFWvsBTuWwlNtUEevH0FtZORdobzBue2rVX2+jE5 ADM2D/HT2KFd4WlXcnboWIfraudVQDiQti1Qhaek560+Dhm3G/fJsEhvVlAVakhZ74ZA LHuA== X-Gm-Message-State: ACrzQf0jPpcSygOI4YtP1Nax7Ab/dwGNr+5PIDrGKUreewB8kNqsIxA4 Eas0E4RgXojx9lk0D3Mm5sKWMpdZsyyKJQ== X-Google-Smtp-Source: AMsMyM6OJsU/5/+/DRAYUJomj846sOsFr15gGsVwEJ170jldqQvFskVrwlwAoiU5/WAZIWwHudosLA== X-Received: by 2002:a05:620a:4884:b0:6f3:91b8:4bf7 with SMTP id ea4-20020a05620a488400b006f391b84bf7mr19190311qkb.167.1666894453465; Thu, 27 Oct 2022 11:14:13 -0700 (PDT) Received: from mail-yb1-f177.google.com (mail-yb1-f177.google.com. [209.85.219.177]) by smtp.gmail.com with ESMTPSA id w10-20020a05620a444a00b006f9e103260dsm1439105qkp.91.2022.10.27.11.14.12 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 27 Oct 2022 11:14:12 -0700 (PDT) Received: by mail-yb1-f177.google.com with SMTP id z192so3255826yba.0 for ; Thu, 27 Oct 2022 11:14:12 -0700 (PDT) X-Received: by 2002:a5b:984:0:b0:6ca:9345:b2ee with SMTP id c4-20020a5b0984000000b006ca9345b2eemr8216571ybq.362.1666894452121; Thu, 27 Oct 2022 11:14:12 -0700 (PDT) MIME-Version: 1.0 References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org> <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> In-Reply-To: From: Linus Torvalds Date: Thu, 27 Oct 2022 11:13:55 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment To: Peter Zijlstra Cc: Jann Horn , John Hubbard , x86@kernel.org, willy@infradead.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, aarcange@redhat.com, kirill.shutemov@linux.intel.com, jroedel@suse.de, ubizjak@gmail.com Content-Type: text/plain; charset="UTF-8" ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1666894454; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=c9TtS5EyPfrJtXKFUijRKZY8i/FB2qm2pzEph6K2Voc=; b=SRgZ6KcV3Id68KYTPba4BOJ7qm6YezTbbidE+olwkB2fU+oW7ZDtjjqKcYBAvRoQnd5zt5 +lrgU44YsRPu3RKjrZfO2zDOs4Wt4vaEAaedDNVBWva4IARyV5qiNV/MU7RNVIUY2LOLpJ 27U66RQm9OVKeuQp1SznVTDAricPBkM= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=linux-foundation.org header.s=google header.b="OGWIzf//"; spf=pass (imf08.hostedemail.com: domain of torvalds@linuxfoundation.org designates 209.85.222.170 as permitted sender) smtp.mailfrom=torvalds@linuxfoundation.org; dmarc=none ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1666894454; a=rsa-sha256; cv=none; b=UpzwEi+TxtYVCoZdTIWw27Ci3GnWtMgfMsPh+DdlCVh9ZbC8OvUsy1PU+/EKirFTLsxd0R dq35YCE1E0x0sEl0ZbaWocr7KwZLqwmttFUEM8ALI67GmKVX7urXQ0i+y6H14HFil4h3bA iau2VvCocHCqj401ELA/sqkQfZF4YUA= X-Stat-Signature: nk1hid57p8kmgpr4ebkpc8w9ikumk8px X-Rspamd-Queue-Id: B559016000E X-Rspam-User: Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=linux-foundation.org header.s=google header.b="OGWIzf//"; spf=pass (imf08.hostedemail.com: domain of torvalds@linuxfoundation.org designates 209.85.222.170 as permitted sender) smtp.mailfrom=torvalds@linuxfoundation.org; dmarc=none X-Rspamd-Server: rspam04 X-HE-Tag: 1666894454-484955 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra wrote: > > So I thought the general rule was that if you modify a PTE and have not > unmapped things -- IOW, there's actual concurrency possible on the > thing, then the TLB invalidate needs to happen under pte_lock, since > that is what controls concurrency at the pte level. Yeah, I think we should really think about the TLB issue and make the rules very clear. A lot of our thinking behind "fix TLB races" have been about "use-after-free wrt TLB on another CPU", and the design in zap_page_ranges() is almost entirely about that: making sure that the TLB flush happens before the underlying pages are free'd. I say "almost entirely", because you can see the effects of the *other* kind of race in if (!PageAnon(page)) { if (pte_dirty(ptent)) { force_flush = 1; set_page_dirty(page); } where it isn't about the lifetime of the page, but about the state coherency wrt other users. But I think even that code is quite questionable, and we really should write down the rules much more strictly somewhere. For example, wrt that pte_dirty() causing force_flush, I'd love to have some core set of rules that would explain (a) why it is only relevant for shared pages (b) do even shared pages need it for the "fullmm" case when we tear down the whole VM? (c) what are the force_flush interactions wrt holding the mmap lock for writing vs reading? In the above, I think (b)/(c) are related - I suspect "fullmm" is mostly equivalent to "mmap held for writing", in that it guarantees that there should be no concurrent faults or other active sw ops on the table. But "fullmm" is probably even stronger than "mmap write-lock" in that it should also mean "no other CPU can be actively using this" - either for hardware page table walking, or for GUP. Both both have the vmscan code and rmap that can still get to the pages - and the vmscan code clearly only cares about the page table lock. But the vmscan code itself also shouldn't care about some stale TLB value, so ... > As it stands MADV_DONTNEED seems to blatatly violate that general rule. So let's talk about exactly what and why the TLB would need to be flushed before dropping the page table lock. For example, MADV_DONTNEED does this all with just the mmap lock held for reading, so we *unless* we have that 'force_flush', we can (a) have another CPU continue to use the old stale TLB entry for quite a while (b) yet another CPU (that didn't have a TLB entry, or wanted to write to a read-only one ) could take a page fault, and install a *new* PTE entry in the same slot, all at the same time. Now, that's clearly *very* confusing. But being confusing may not mean "wrong" - we're still delaying the free of the old entry, so there's no use-after-free. The biggest problem I can see is that this means that user space memory ordering might be screwed up: the CPU in (a) will see not just an old TLB entry, but simply old *data*, when the CPU in (b) may be writing to that same address with new data. So I think we clearly do *NOT* serialize as much as we could for MADV_DONTNEED, and I think the above is a real semantic result of that. But equally arguably if you do this kind of unserialized MADV_DONTNEED in user space, that's a case of "you asked for the breakage - you get to keep both pieces". So is that ever an actual problem? If the worst issue is that some CPU's may see old ("pre-DONTNEED") data, while other CPU's then see new data while the MADV_DONTNEED is executing, I think maybe *THAT* part is fine. But because it's so very confusing, maybe we have other problems in this area, which is why I think it would be really good to have an actual set of real hard documented rules about this all, and exactly when we need to flush TLBs synchronously with the page table lock, and when we do not. Anybody willing to try to write up the rules (and have each rule document *why* it's a rule - not just "by fiat", but an actual "these are the rules and this is *why* they are the rules"). Because right now I think all of our rules are almost entirely just encoded in the code, with a couple of comments, and a few people who just remember why we do what we do. Linus