From: Linus Torvalds <torvalds@linux-foundation.org>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Jann Horn <jannh@google.com>, John Hubbard <jhubbard@nvidia.com>,
X86 ML <x86@kernel.org>, Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
kernel list <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
jroedel@suse.de, ubizjak@gmail.com,
Alistair Popple <apopple@nvidia.com>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
Date: Sun, 30 Oct 2022 22:00:32 -0700
Message-ID: <CAHk-=wjZnVURfhWMmWiDX3D0kuqnJ0PLM_Na-U7ufzqPMxucjw@mail.gmail.com>
In-Reply-To: <A48A5CEB-2C02-4101-B315-6792D042C605@gmail.com>
On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I am sorry for not managing to make it reproducible on your system.
Heh, that's very much *not* your fault. Honestly, I didn't try very
much or very hard.
I felt like I understood the cause of the problem well enough that I
didn't really need a reproducer, and I much prefer to just think the
solution through and try to make it really robust.
Or, put another way - I'm just lazy.
> Anyhow, I ran the tests with the patches and there are no failures.
Lovely.
> Thanks for addressing this issue.
Well, I'm not sure the issue is "addressed" yet. I think the patch
series is likely the right thing to do, but others may disagree with
this approach.
And regardless of that, this still leaves some questions open.
(a) there's the issue of s390, which does its own version of
__tlb_remove_page_size.
I *think* s390 basically does the TLB flush synchronously in
zap_pte_range(), so for that reason it would be trivial to just
add that 'flags' argument to the s390 __tlb_remove_page_size(), and
make it do

        if (flags & TLB_ZAP_RMAP)
                page_zap_pte_rmap(page);

at the top synchronously too. But some s390 person would need to look at it.
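Something like the below, purely as a sketch of what I mean - I'm going
from memory on the s390 side, so the real prototype and details are
almost certainly a bit different:

        /* hypothetical sketch only - not the real s390 code */
        static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
                                                  struct page *page,
                                                  unsigned int flags,
                                                  int page_size)
        {
                /*
                 * s390 has already invalidated the TLB by the time we
                 * get here, so the rmap can be dropped right away
                 * instead of being deferred to the batched page free.
                 */
                if (flags & TLB_ZAP_RMAP)
                        page_zap_pte_rmap(page);

                free_page_and_swap_cache(page);
                return false;
        }

The only point being that the rmap drop can happen right there, because
the hardware invalidation has already been done.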
I *think* the issue is literally that straightforward and not a big
deal, but it's probably not even worth bothering the s390 people until
VM people have decided "yes, this makes sense".
(b) the issue I mentioned with the currently useless
"page_mapcount(page) < 0" test with that patch.
Again, this is mostly just janitorial stuff associated with that patch series.
(c) whether to worry about back-porting
I don't *think* this is worth backporting, but if it causes other
changes, then maybe..
> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for “force_flush” and
> potentially save TLB flushes.
I really liked my dirty patch, but your warning case really made it
obvious that it was just broken.
The thing is, moving the "set_page_dirty()" to later is really nice,
and really makes a *lot* of sense from a conceptual standpoint: only
after that TLB flush do we really have no more people who can dirty
it.
BUT.
Even if we just used another bit in the array for "dirty", and did
the set_page_dirty() later (but still before getting rid of the rmap),
that wouldn't actually *work*.
Why? Because the race with folio_mkclean() would just come back. Yes,
now we'd have the rmap data, so mkclean would be forced to serialize
with the page table lock.
But if we get rid of the "force_flush" for the dirty bit, that
serialization won't help, simply because we've *dropped* the page
table lock before we actually then do the set_page_dirty() again.
So the mkclean serialization needs *both* the late rmap dropping _and_
the page table lock being kept.
So deferring set_page_dirty() is conceptually the right thing to do
from a pure "just track dirty bit" standpoint, but it doesn't work
with the way we currently expect mkclean to work.
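To spell out the interleaving I'm worried about (just a pseudo-trace of
the argument above, with set_page_dirty() deferred and no force_flush):

        CPU 0 (zap, dirty deferred)        CPU 1 (folio_mkclean)
        ---------------------------        ---------------------
        take ptl
        clear the dirty PTE, remember
          "dirty" in the batch
        drop ptl
                                           rmap still exists, so it finds
                                           the page table and takes the
                                           ptl ... but there is no PTE
                                           left to clean, so it returns
                                           and the caller can go on
                                           believing the page is clean
        flush TLB
        set_page_dirty()   <- too late: it happens outside the ptl, so
                              the mkclean serialization didn't cover it
        drop rmap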
> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.
See above: I still think it would be the right thing in a perfect world.
But with the current folio_mkclean(), we just can't do it. I had
completely forgotten / repressed that horror-show.
So the current ordering rules are basically that we need to do
set_page_dirty() *and* we need to flush the TLBs before dropping the
page table lock. That's what gets us serialized with "mkclean".
The whole "drop rmap" can then happen at any later time; the only
important thing is that it happens after the TLB flush.
We could still do the rmap drop inside the page table lock, but
honestly, it just makes more sense to do it as we free the batched
pages anyway.
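So as a rough pseudo-code sketch of that constraint (not the actual
zap_pte_range() code, just the ordering):

        take ptl
        pte = clear the PTE
        if (pte_dirty(pte))
                set_page_dirty(page);   /* must happen under the ptl */
        ...
        flush TLB                       /* likewise before dropping the
                                           ptl (the force_flush case) */
        drop ptl                        /* only now can mkclean make
                                           progress                   */
        ...
        page_zap_pte_rmap(page);        /* any time after the flush is
                                           fine - done as the batched
                                           pages get freed             */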
Am I missing something still?
And again, this is about our horrid serialization between
folio_mkclean and set_page_dirty(). It's related to how GUP +
set_page_dirty() is also fundamentally problematic. So that dirty bit
situation *may* change if the rules for folio_mkclean() change...
Linus