linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Muchun Song <muchun.song@linux.dev>,
	Oscar Salvador <osalvador@suse.de>,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [PATCH v1 1/2] mm: let pte_lockptr() consume a pte_t pointer
Date: Fri, 26 Jul 2024 23:48:01 +0200	[thread overview]
Message-ID: <9e671388-a5c6-4de1-8c85-b7af8aee7f44@redhat.com> (raw)
In-Reply-To: <ZqQVDwv4RM-wIW7S@x1n>

On 26.07.24 23:28, Peter Xu wrote:
> On Fri, Jul 26, 2024 at 06:02:17PM +0200, David Hildenbrand wrote:
>> On 26.07.24 17:36, Peter Xu wrote:
>>> On Thu, Jul 25, 2024 at 08:39:54PM +0200, David Hildenbrand wrote:
>>>> pte_lockptr() is the only *_lockptr() function that doesn't consume
>>>> what would be expected: it consumes a pmd_t pointer instead of a pte_t
>>>> pointer.
>>>>
>>>> Let's change that. The two callers in pgtable-generic.c are easily
>>>> adjusted. Adjust khugepaged.c:retract_page_tables() to simply do a
>>>> pte_offset_map_nolock() to obtain the lock, even though we won't actually
>>>> be traversing the page table.
>>>>
>>>> This makes the code more similar to the other variants and avoids other
>>>> hacks to make the new pte_lockptr() version happy. pte_lockptr() users
>>>> reside now only in  pgtable-generic.c.
>>>>
>>>> Maybe, using pte_offset_map_nolock() is the right thing to do because
>>>> the PTE table could have been removed in the meantime? At least it sounds
>>>> more future proof if we ever have other means of page table reclaim.
>>>
>>> I think it can't change, because anyone who wants to race against this
>>> should try to take the pmd lock first (which was held already)?
>>
>> That doesn't explain why it is safe for us to assume that after we took the
>> PMD lock that the PMD actually still points at a completely empty page
>> table. Likely it currently works by accident, because we only have a single
>> such user that makes this assumption. It might certainly be a different once
>> we asynchronously reclaim page tables.
> 
> I think it's safe because find_pmd_or_thp_or_none() returned SUCCEED, and
> we're holding i_mmap lock for read.  I don't see any way that this pmd can
> become a non-pgtable-page.
> 
> I meant, AFAIU tearing down pgtable in whatever sane way will need to at
> least take both mmap write lock and i_mmap write lock (in this case, a file
> mapping), no?

Skimming over [1] where I still owe a review I think we can now do it 
now purely under the read locks, with the PMD lock held.

I think this is also what collapse_pte_mapped_thp() ends up doing: 
replace a PTE table that maps a folio by a PMD (present or none, 
depends) that maps a folio only while holding the mmap lock in read 
mode. Of course, here the table is not empty but we need similar ways of 
making PT walkers aware of concurrent page table retraction.

IIRC, that was the magic added to __pte_offset_map(), such that 
pte_offset_map_nolock/pte_offset_map_lock can fail on races.


But if we hold the PMD lock, nothing should actually change (so far my 
understanding) -- we cannot suddenly rip out a page table.

[1] 
https://lkml.kernel.org/r/cover.1719570849.git.zhengqi.arch@bytedance.com

> 
>>
>> But yes, the PMD cannot get modified while we hold the PMD lock, otherwise
>> we'd be in trouble
>>
>>>
>>> I wonder an open coded "ptlock_ptr(page_ptdesc(pmd_page(*pmd)))" would be
>>> nicer here, but only if my understanding is correct.
>>
>> I really don't like open-coding that. Fortunately we were able to limit the
>> use of ptlock_ptr to a single user outside of arch/x86/xen/mmu_pv.c so far.
> 
> I'm fine if you prefer like that; I don't see it a huge deal to me.

Let's keep it like that, unless we can come up with something neater. At 
least it makes the code also more consistent with similar code in that 
file and the overhead should be  minimal.

I was briefly thinking about actually testing if the PT is full of 
pte_none(), either as a debugging check or to also handle what is 
currently handled via:

if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {

Seems wasteful just because some part of a VMA might have a private page 
mapped / uffd-wp active to let all other parts suffer.

Will think about if that is really worth it.

... also because I still want to understand why the PTL of the PMD table 
is required at all. What if we lock it first and somebody else wants to 
lock it after us while we already ripped it out? Sure there must be some 
reason for the lock, I just don't understand it yet :/.

-- 
Cheers,

David / dhildenb



  reply	other threads:[~2024-07-26 21:48 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-25 18:39 [PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
2024-07-25 18:39 ` [PATCH v1 1/2] mm: let pte_lockptr() consume a pte_t pointer David Hildenbrand
2024-07-26 15:36   ` Peter Xu
2024-07-26 16:02     ` David Hildenbrand
2024-07-26 21:28       ` Peter Xu
2024-07-26 21:48         ` David Hildenbrand [this message]
2024-07-29  6:19           ` Qi Zheng
2024-07-30  8:40             ` David Hildenbrand
2024-07-30  9:10               ` Qi Zheng
2024-07-29 16:26           ` Peter Xu
2024-07-29 16:39             ` Peter Xu
2024-07-29 17:46               ` David Hildenbrand
2024-07-30 18:44                 ` Peter Xu
2024-07-30 19:49                   ` David Hildenbrand
2024-07-29  7:48   ` Qi Zheng
2024-07-29  8:46     ` David Hildenbrand
2024-07-29  8:52       ` Qi Zheng
     [not found]   ` <CGME20240730153058eucas1p2319e4cc985dcdc6e98d08398c33fcfd3@eucas1p2.samsung.com>
2024-07-30 15:30     ` Marek Szyprowski
2024-07-30 15:45       ` David Hildenbrand
2024-07-30 15:49         ` David Hildenbrand
2024-07-30 16:08           ` Marek Szyprowski
2024-07-30 16:10             ` David Hildenbrand
2024-07-25 18:39 ` [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking David Hildenbrand
2024-07-26  2:33   ` Baolin Wang
2024-07-26  3:03     ` Baolin Wang
2024-07-26  8:04       ` David Hildenbrand
2024-07-26  8:04     ` David Hildenbrand
2024-07-26  9:38       ` Baolin Wang
2024-07-26 11:40         ` David Hildenbrand
2024-07-29  1:48           ` Baolin Wang
2024-07-26  8:18   ` Muchun Song
2024-07-26 15:26   ` Peter Xu
2024-07-26 15:32     ` David Hildenbrand
2024-07-29  4:51   ` Oscar Salvador
2024-07-25 20:41 ` [PATCH v1 0/2] " Andrew Morton
2024-07-26  9:19   ` David Hildenbrand
2024-07-26 14:45     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e671388-a5c6-4de1-8c85-b7af8aee7f44@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    --cc=peterx@redhat.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox