From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7173CC38A02 for ; Mon, 31 Oct 2022 05:00:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1C5C68E0001; Mon, 31 Oct 2022 01:00:52 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 175486B0074; Mon, 31 Oct 2022 01:00:52 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 03E008E0001; Mon, 31 Oct 2022 01:00:51 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id D30116B0073 for ; Mon, 31 Oct 2022 01:00:51 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 888B9AAF94 for ; Mon, 31 Oct 2022 05:00:51 +0000 (UTC) X-FDA: 80080044702.14.AC5EF20 Received: from mail-qt1-f174.google.com (mail-qt1-f174.google.com [209.85.160.174]) by imf16.hostedemail.com (Postfix) with ESMTP id F412318001B for ; Mon, 31 Oct 2022 05:00:50 +0000 (UTC) Received: by mail-qt1-f174.google.com with SMTP id h24so6849800qta.7 for ; Sun, 30 Oct 2022 22:00:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=KMoYPf5QlXxpUEaNojn2TFOmvOKnRgIek8nj+6VnLpI=; b=gDfyWnj8mazCtlw/gyjBjCoBw+EAoW04RRcZ+OnjPIkjADjY+8JLLNDXRI4fHh9aHW c5LKzgN4CbVicZAJMiq0u6LN3wVFv1CsntXdZyXlxitg+Jl6emrhrhtudeT5+5pmWQNb xDFlixN7NTaGR92MioTb0yjQ7OeVPhlyfUpUg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KMoYPf5QlXxpUEaNojn2TFOmvOKnRgIek8nj+6VnLpI=; b=QLJscfq/Sb80nR9mNxvaDNULYkth87OY86r4JPrXi7mN9MjJn2ppbC0XVjWhZfDgCk qayzqjHifBKbKlbvvWFxp55CLqbg9j3cehXTbHnth/vU6/qUHiASWZAtBC/167VAEr1Y TQ/f8veJs5tJfiZJx3nntWgk/zvj54Pfk1FwcpVdReOLHjmp95kiFr/Gk8f+kiweu0Cy xF6zxGoVlqATp30uKIV/XAV2SZpyj7HqM5rDJhw49zRCzBtuQwAXWv0jNAPkAsgxpR1F iqu7q7Q2q7683e0OaQbG6vpEWiQ/7e/ibS30w2voq9RSAlTjnBnkat+C7m+MB1a4NjfZ Z8iw== X-Gm-Message-State: ACrzQf3LjpbKsDT4M0uEoHnHnrOv7dkKB6MK7Mn7vFofPJM/c5YlmiTL NjPXCxRnVjiEJZOCGd040bxWdDx3KwfI1A== X-Google-Smtp-Source: AMsMyM5JAJkJQ9jwTm7a8/ikug7b+zO1tzD8CTFtJRgknUTuHJb9eFBNeIwPeLxJnwKCmzll9gOIRA== X-Received: by 2002:ac8:5f89:0:b0:39c:e5a2:6db9 with SMTP id j9-20020ac85f89000000b0039ce5a26db9mr9079004qta.138.1667192450047; Sun, 30 Oct 2022 22:00:50 -0700 (PDT) Received: from mail-yb1-f171.google.com (mail-yb1-f171.google.com. [209.85.219.171]) by smtp.gmail.com with ESMTPSA id s13-20020a05620a254d00b006cf8fc6e922sm4041284qko.119.2022.10.30.22.00.49 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 30 Oct 2022 22:00:49 -0700 (PDT) Received: by mail-yb1-f171.google.com with SMTP id 63so12499684ybq.4 for ; Sun, 30 Oct 2022 22:00:49 -0700 (PDT) X-Received: by 2002:a25:bb02:0:b0:6ca:9345:b2ee with SMTP id z2-20020a25bb02000000b006ca9345b2eemr93317ybg.362.1667192449279; Sun, 30 Oct 2022 22:00:49 -0700 (PDT) MIME-Version: 1.0 References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org> <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> <6C548A9A-3AF3-4EC1-B1E5-47A7FFBEB761@gmail.com> <47678198-C502-47E1-B7C8-8A12352CDA95@gmail.com> <140B437E-B994-45B7-8DAC-E9B66885BEEF@gmail.com> In-Reply-To: From: Linus Torvalds Date: Sun, 30 Oct 2022 22:00:32 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment To: Nadav Amit Cc: Peter Zijlstra , Jann Horn , John Hubbard , X86 ML , Matthew Wilcox , Andrew Morton , kernel list , Linux-MM , Andrea Arcangeli , "Kirill A . Shutemov" , jroedel@suse.de, ubizjak@gmail.com, Alistair Popple Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1667192451; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=KMoYPf5QlXxpUEaNojn2TFOmvOKnRgIek8nj+6VnLpI=; b=lbdui2C9mfi2NJGQJQ/Fh76jWlFaD95sMHPkcvUZeazyqfumdxChrX6H/S7n5FAAsTo/YU SnYz+bC+Z98b1JbykRctou6aNJ2Ra1tJGCkJJACTU2BkZ99RcLsxZ449rXdq0f7HB2H+NV FvZTTa4H87Go2OZibxuq35gG1hdJ6OY= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=linux-foundation.org header.s=google header.b=gDfyWnj8; dmarc=none; spf=pass (imf16.hostedemail.com: domain of torvalds@linuxfoundation.org designates 209.85.160.174 as permitted sender) smtp.mailfrom=torvalds@linuxfoundation.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1667192451; a=rsa-sha256; cv=none; b=tOQwFeNKgkgnT926oIcsI2+8vYkkrDWFN3wqYRvJqw0D+9vRxlc99+G/6ysjWeX66ZZ2ix U7syjY4fNPt2n/99iPK5bwiYc9Hxx7CQdh6FTFoomqijNPA8vNvtUSN7xzjeh4MGRNH2so CRi5wqD6XePb2gw62zsrkUrE3KqDSuw= X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: F412318001B X-Rspam-User: Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=linux-foundation.org header.s=google header.b=gDfyWnj8; dmarc=none; spf=pass (imf16.hostedemail.com: domain of torvalds@linuxfoundation.org designates 209.85.160.174 as permitted sender) smtp.mailfrom=torvalds@linuxfoundation.org X-Stat-Signature: sh38ftqd9t8ghe5k7m9bnxr16arjuqdo X-HE-Tag: 1667192450-289221 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit wrote: > > I am sorry for not managing to make it reproducible on your system. Heh, that's very much *not* your fault. Honestly, I didn't try very much or very hard. I felt like I understood the problem cause sufficiently that I didn't really need to have a reproducer, and I much prefer to just think the solution through and try to make it really robust. Or, put another way - I'm just lazy. > Anyhow, I ran the tests with the patches and there are no failures. Lovely. > Thanks for addressing this issue. Well, I'm not sure the issue is "addressed" yet. I think the patch series is likely the right thing to do, but others may disagree with this approach. And regardless of that, this still leaves some questions open. (a) there's the issue of s390, which does its own version of __tlb_remove_page_size. I *think* s390 basically does the TLB flush synchronously in zap_pte_range(), and that it would be for that reason trivial to just add that 'flags' argument to the s390 __tlb_remove_page_size(), and make it do if (flags & TLB_ZAP_RMAP) page_zap_pte_rmap(page); at the top synchronously too. But some s390 person would need to look at it= . I *think* the issue is literally that straightforward and not a big deal, but it's probably not even worth bothering the s390 people until VM people have decided "yes, this makes sense". (b) the issue I mentioned with the currently useless "page_mapcount(page) < 0" test with that patch. Again, this is mostly just janitorial stuff associated with that patch seri= es. (c) whether to worry about back-porting I don't *think* this is worth backporting, but if it causes other changes, then maybe.. > I understand from the code that you decided to drop the deferring of > set_page_dirty(), which could - at least for the munmap case (where > mmap_lock is taken for write) - prevent the need for =E2=80=9Cforce_flush= =E2=80=9D and > potentially save TLB flushes. I really liked my dirty patch, but your warning case really made it obvious that it was just broken. The thing is, moving the "set_page_dirty()" to later is really nice, and really makes a *lot* of sense from a conceptual standpoint: only after that TLB flush do we really have no more people who can dirty it. BUT. Even if we just used another bit for in the array for "dirty", and did the set_page_dirty() later (but still before getting rid of the rmap), that wouldn't actually *work*. Why? Because the race with folio_mkclean() would just come back. Yes, now we'd have the rmap data, so mkclean would be forced to serialize with the page table lock. But if we get rid of the "force_flush" for the dirty bit, that serialization won't help, simply because we've *dropped* the page table lock before we actually then do the set_page_dirty() again. So the mkclean serialization needs *both* the late rmap dropping _and_ the page table lock being kept. So deferring set_page_dirty() is conceptually the right thing to do from a pure "just track dirty bit" standpoint, but it doesn't work with the way we currently expect mkclean to work. > I was just wondering whether the reason for that is that you wanted > to have small backportable and conservative patches, or whether you > changed your mind about it. See above: I still think it would be the right thing in a perfect world. But with the current folio_mkclean(), we just can't do it. I had completely forgotten / repressed that horror-show. So the current ordering rules are basically that we need to do set_page_dirty() *and* we need to flush the TLB's before dropping the page table lock. That's what gets us serialized with "mkclean". The whole "drop rmap" can then happen at any later time, the only important thing was that it was kept to at least after the TLB flush. We could do the rmap drop still inside the page table lock, but honestly, it just makes more sense to do as we free the batched pages anyway. Am I missing something still? And again, this is about our horrid serialization between folio_mkclean and set_page_dirty(). It's related to how GUP + set_page_dirty() is also fundamentally problematic. So that dirty bit situation *may* change if the rules for folio_mkclean() change... Linus