linux-mm.kvack.org archive mirror
* Re: [QUESTION] mmap lock in khugepaged
       [not found]     ` <3d4c57dd-0821-4684-9018-db8274c170ec@redhat.com>
@ 2024-11-29  7:10       ` Dev Jain
  2024-11-29 11:30         ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Dev Jain @ 2024-11-29  7:10 UTC (permalink / raw)
  To: David Hildenbrand, ryan.roberts, kirill.shutemov, willy,
	linux-kernel, linux-mm, ziy


+the list and Zi Yan.

On 28/11/24 5:04 pm, David Hildenbrand wrote:
>>
>>
>>> Maybe, we'd have to do the isolation+copy under the PMD lock. And
>>> currently, we have to drop the PMD lock in order to have the
>>> pte_offset_map_lock() work IIRC.
>>>
>>
>> Is there a problem in holding two page table locks simultaneously?
>
> Depending on CONFIG_SPLIT_PTE_PTLOCKS, it might not even be two locks 
> (I assume we could have such configs with khugepaged).
>
> Not sure if there could be an issue with lock inversion.
>
> So I suspect this may not be 100% trivial :)
>
>>
>>
>>>
>>> Most importantly, the copy that currently runs under no spinlocks
>>> would now run under a spinlock. Up to 512 MiB on arm64 with 64K pages, not sure if
>>> that can be a problem ... we currently seem to take care of that
>>
>> But we already are taking mmap_write_lock(), so that should not matter?
>
> We are dealing with a spinlock vs. a rwsem.
>
> We usually want to avoid holding spinlocks for an excessive amount of 
> time, because all other CPUs waiting for that lock will ... spin with 
> preemption disabled instead of rescheduling and doing something useful.
>

Ah okay.


>
> Further, without CONFIG_SPLIT_PMD_PTLOCKS, in fact everybody who wants 
> to take a PMD lock in that process would be spinning on the same PMD 
> lock :)
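
Right -- IIUC that's because of where the PMD lock actually lives; a
simplified sketch of pmd_lockptr() (adapted from include/linux/mm.h,
details elided):

static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
#if USE_SPLIT_PMD_PTLOCKS
	return ptlock_ptr(pmd_ptdesc(pmd));	/* per-page-table lock */
#else
	return &mm->page_table_lock;		/* one lock for the whole mm */
#endif
}

So without split PMD ptlocks, every pmd_lock() in the process boils
down to mm->page_table_lock.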
>
>
>> I mean, if we can get rid of the mmap exclusive lock, then the copying
>> would still be a bottleneck, and all fault handlers will back off, but
>
> I'm trying to digest it once again, but I'm afraid I don't see how 
> fault handlers will back off.
>
> Won't they either see pmd_none(), to then either call pte_alloc_one() 
> where they would spin on the PMD lock, or try allocating a PMD THP to 
> insert it, and then spin on the PMD lock, to figure out later that it 
> was all in vain?

Yeah, that's what I meant when I said "back off" -- sorry I wasn't
clear; I didn't mean VM_FAULT_FALLBACK/RETRY or anything like that.
Yes, the work will be in vain and the MMU will refault.
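Something like this pattern in the fault path, IIUC (a simplified
sketch, not the exact kernel code):

	pgtable = pte_alloc_one(mm);	/* allocate, possibly in vain */
	ptl = pmd_lock(mm, pmd);	/* spins while collapse holds the lock */
	if (!pmd_none(*pmd)) {		/* collapse won the race */
		spin_unlock(ptl);
		pte_free(mm, pgtable);	/* throw the work away... */
		return 0;		/* ...and let the MMU refault */
	}
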
>
>
> Thinking about it, I am also not sure if most other code can deal with 
> temporary pmd_none(). These don't necessarily take the PMD lock, 
> because "why would they" right now.
>
> See walk_pmd_range() as one example: if it spots pmd_none() it assumes 
> "there really is nothing" without checking the PMD lock.
>
> As a more concrete example, assume someone calls MADV_DONTNEED and we 
> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing" 
> we'll skip that entry without getting the PMD lock involved. That 
> would mean that you would be breaking MADV_DONTNEED if you managed to 
> collapse or not -- memory would not get discarded.
>
> This is a real problem with anonymous memory.
>
Yes, I thought of different locking schemes, but the problem seems to be
that any pagetable walker will choose an action path according to the
value it sees.
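
E.g., the decision in zap_pmd_range()/walk_pmd_range() boils down to
something like this (simplified sketch):

	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))	/* "really nothing" -- no PMD lock taken */
		goto next;	/* a transient pmd_none() is skipped for good */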

>
> Unless I am missing something it's all very tricky and there might be 
> a lot of such code that assumes "if I hold a mmap lock / VMA lock in 
> read mode, pmd_none() means there is nothing even without holding the 
> PMD lock when checking".

Yes, I was betting on the fact that, if the code assumes that pmd_none()
means there is nothing, then eventually, when it takes the PMD lock to
write to it, it will check whether the PMD changed and back off. I
wasn't aware of the MADV_DONTNEED thingy, thanks.
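
I.e., I was assuming every writer eventually does something like this
(sketch):

	old = pmdp_get_lockless(pmd);
	/* ... decide what to do based on 'old' ... */
	ptl = pmd_lock(mm, pmd);
	if (!pmd_same(*pmd, old)) {	/* the PMD changed under us */
		spin_unlock(ptl);
		return;			/* back off */
	}

...but as you point out, zap_pmd_range() never reaches such a recheck
for a pmd_none() entry -- it skips it without ever taking the lock.
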
>
>
>> at least processes will be able to mmap() and do stuff with their VMAs,
>> and I would guess that this is worth optimizing...
>
> It would certainly be interesting to get rid of the mmap lock in write 
> mode here, but it's all rather tricky (and the code has rather nasty 
> hidden implications).
>
>>>
>>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> if (pte) {
>>>      ...
>>>      spin_unlock(pte_ptl);
>>> } ...
>>>
>>> result = __collapse_huge_page_copy(...);
>>> pte_unmap(pte);
>>>
>>>
>>> Deep in __collapse_huge_page_copy() we seem to re-take the PTL lock.
>>> No-split-spinlock configs might be problematic ...
>>
>> Could you elaborate a little? I haven't read about the older config...
>
> See above regarding CONFIG_SPLIT_PTE_PTLOCKS and friends.
>
>>>
>>>
>>> I recall that for shmem that's "easier", because we don't have to
>>> reinstall a PMD immediately, we can be sure that the page table is
>>> kept empty/unmodified, ...
>>>
>>

All in all, the general solution seems to be: if I can make every
pagetable walker see an invalid state and back off, then I am safe.
For example, we do not zero out the PMD; we take the pte PTL, do our
work, and then, instead of setting the PTEs to zero, we set them to a
universal invalid state upon which no pagetable walker can take an
action; an instance of that could be setting the PTE to a swap entry
such that the fault handler ends up in do_swap_page() ->
print_bad_pte(). So now we can take the PMD lock (in fact we don't
need it, since any pagetable walking is rendered useless) and populate
the PMD to resume normal pagetable walking. Another *ridiculous* idea
may be to remember the PGD we came from and nuke it (but I guess there
is an alternate path for that in walk_pgd_range() and so on?)
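
A (hypothetical) sketch of that idea -- PTE_MARKER_COLLAPSING does not
exist, and I'm eliding the split-ptlock details:

	start_pte = pte = pte_offset_map(pmd, addr);
	spin_lock(pte_lockptr(mm, pmd));
	for (; addr != end; pte++, addr += PAGE_SIZE)
		set_pte_at(mm, addr, pte,
			   make_pte_marker(PTE_MARKER_COLLAPSING));
	spin_unlock(pte_lockptr(mm, pmd));
	pte_unmap(start_pte);
	/* every walker now sees "something invalid" rather than
	   pte_none(), and has to back off until the PMD leaf is
	   installed */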





* Re: [QUESTION] mmap lock in khugepaged
  2024-11-29  7:10       ` [QUESTION] mmap lock in khugepaged Dev Jain
@ 2024-11-29 11:30         ` David Hildenbrand
  2024-12-02  9:54           ` Dev Jain
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2024-11-29 11:30 UTC (permalink / raw)
  To: Dev Jain, ryan.roberts, kirill.shutemov, willy, linux-kernel,
	linux-mm, ziy

>>
>> Thinking about it, I am also not sure if most other code can deal with
>> temporary pmd_none(). These don't necessarily take the PMD lock,
>> because "why would they" right now.
>>
>> See walk_pmd_range() as one example: if it spots pmd_none() it assumes
>> "there really is nothing" without checking the PMD lock.
>>
>> As a more concrete example, assume someone calls MADV_DONTNEED and we
>> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
>> we'll skip that entry without getting the PMD lock involved. That
>> would mean that you would be breaking MADV_DONTNEED if you managed to
>> collapse or not -- memory would not get discarded.
>>
>> This is a real problem with anonymous memory.
>>
> Yes, I thought of different locking schemes, but the problem seems to be
> that any pagetable walker will choose an action path according to the
> value it sees.
> 
>>
>> Unless I am missing something it's all very tricky and there might be
>> a lot of such code that assumes "if I hold a mmap lock / VMA lock in
>> read mode, pmd_none() means there is nothing even without holding the
>> PMD lock when checking".
> 
> Yes, I was betting on the fact that, if the code assumes that pmd_none()
> means there is nothing, then eventually, when it takes the PMD lock to
> write to it, it will check whether the PMD changed and back off. I
> wasn't aware of the MADV_DONTNEED thingy, thanks.

Note that this is just the tip of the iceberg. Most page table walkers 
that deal with anonymous memory have the same requirement.

>>>>
>>>>
>>>> I recall that for shmem that's "easier", because we don't have to
>>>> reinstall a PMD immediately, we can be sure that the page table is
>>>> kept empty/unmodified, ...
>>>>
>>>
> 
> All in all, the general solution seems to be: if I can make every
> pagetable walker see an invalid state and back off, then I am safe.
> For example, we do not zero out the PMD; we take the pte PTL, do our
> work, and then, instead of setting the PTEs to zero, we set them to a
> universal invalid state upon which no pagetable walker can take an
> action; an instance of that could be setting the PTE to a swap entry
> such that the fault handler ends up in do_swap_page() ->
> print_bad_pte(). So now we can take the PMD lock (in fact we don't
> need it, since any pagetable walking is rendered useless) and populate
> the PMD to resume normal pagetable walking. Another *ridiculous* idea
> may be to remember the PGD we came from and nuke it (but I guess there
> is an alternate path for that in walk_pgd_range() and so on?)


What might work is introducing a PMD marker (note: we don't have PMD 
markers yet) for that purpose. Likely, the PMD marker may only be set 
while the PMD lock is being held, and we must not drop the PMD lock 
temporarily (otherwise people will spin on the value by repeatedly 
(un)locking the PMD lock, which is stupid).
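
Roughly like this (a hypothetical sketch; make_pmd_marker() does not
exist, that would be the new part, and huge_pmd is the prepared leaf
entry):

	ptl = pmd_lock(mm, pmd);
	pmdp_collapse_flush(vma, haddr, pmd);	/* unhook PTE table + flush TLB */
	set_pmd_at(mm, haddr, pmd, make_pmd_marker());	/* lockless walkers see
							   "in flight", not pmd_none() */
	/* ... isolate + copy, without ever dropping the PMD lock ... */
	set_pmd_at(mm, haddr, pmd, huge_pmd);	/* install the PMD leaf */
	spin_unlock(ptl);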


Then you have to make sure that each and every page table walker handles 
that properly.

Then, there is the problem with holding the PMD lock for too long as I 
mentioned.

Last but not least, there are more issues that I haven't even described 
before (putting aside the other issue):


Our page table walkers can handle the transitions:

PMD leaf -> PMD none
* concurrent zap / MADV_DONTNEED / !anon PMD split

PMD leaf -> PTE table
* concurrent anon PMD split
* concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
   allocation

PTE table (empty) -> PMD none
* concurrent page table reclaim, part of collapsing file THPs
  * collapse_pte_mapped_thp()
  * retract_page_tables()


I think they *cannot* tolerate the following transition *properly*:

PTE table (non-empty) -> PMD leaf
* what you want to do ;)

-> Observe how we skip to the next PMD in all page table walkers /  give 
up if pte_offset_map_lock() and friends fail! I even think there are 
more issues hiding there with pte_offset_map_ro_nolock().
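
The shape of that pattern, simplified (compare zap_pte_range() and
friends):

	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!start_pte)		/* the PMD is no longer a PTE table */
		return addr;	/* give up; the caller moves on to the
				   next PMD */

A concurrent "PTE table (non-empty) -> PMD leaf" transition in the
middle of such a PTE walk is exactly what these callers do not expect.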


Of course, on top of all of this, you should not significantly degrade 
the ordinary page table walker performance ...


I don't want to discourage you, but it's all extremely complicated.

-- 
Cheers,

David / dhildenb




* Re: [QUESTION] mmap lock in khugepaged
  2024-11-29 11:30         ` David Hildenbrand
@ 2024-12-02  9:54           ` Dev Jain
  0 siblings, 0 replies; 5+ messages in thread
From: Dev Jain @ 2024-12-02  9:54 UTC (permalink / raw)
  To: David Hildenbrand, ryan.roberts, kirill.shutemov, willy,
	linux-kernel, linux-mm, ziy


On 29/11/24 5:00 pm, David Hildenbrand wrote:
>>>
>>> Thinking about it, I am also not sure if most other code can deal with
>>> temporary pmd_none(). These don't necessarily take the PMD lock,
>>> because "why would they" right now.
>>>
>>> See walk_pmd_range() as one example: if it spots pmd_none() it assumes
>>> "there really is nothing" without checking the PMD lock.
>>>
>>> As a more concrete example, assume someone calls MADV_DONTNEED and we
>>> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
>>> we'll skip that entry without getting the PMD lock involved. That
>>> would mean that you would be breaking MADV_DONTNEED if you managed to
>>> collapse or not -- memory would not get discarded.
>>>
>>> This is a real problem with anonymous memory.
>>>
>> Yes, I thought of different locking schemes, but the problem seems to be
>> that any pagetable walker will choose an action path according to the
>> value it sees.
>>
>>>
>>> Unless I am missing something it's all very tricky and there might be
>>> a lot of such code that assumes "if I hold a mmap lock / VMA lock in
>>> read mode, pmd_none() means there is nothing even without holding the
>>> PMD lock when checking".
>>
>> Yes, I was betting on the fact that, if the code assumes that pmd_none()
>> means there is nothing, then eventually, when it takes the PMD lock to
>> write to it, it will check whether the PMD changed and back off. I
>> wasn't aware of the MADV_DONTNEED thingy, thanks.
>
> Note that this is just the tip of the iceberg. Most page table walkers 
> that deal with anonymous memory have the same requirement.
>
>>>>>
>>>>>
>>>>> I recall that for shmem that's "easier", because we don't have to
>>>>> reinstall a PMD immediately, we can be sure that the page table is
>>>>> kept empty/unmodified, ...
>>>>>
>>>>
>>
>> All in all, the general solution seems to be: if I can make every
>> pagetable walker see an invalid state and back off, then I am safe.
>> For example, we do not zero out the PMD; we take the pte PTL, do our
>> work, and then, instead of setting the PTEs to zero, we set them to a
>> universal invalid state upon which no pagetable walker can take an
>> action; an instance of that could be setting the PTE to a swap entry
>> such that the fault handler ends up in do_swap_page() ->
>> print_bad_pte(). So now we can take the PMD lock (in fact we don't
>> need it, since any pagetable walking is rendered useless) and populate
>> the PMD to resume normal pagetable walking. Another *ridiculous* idea
>> may be to remember the PGD we came from and nuke it (but I guess there
>> is an alternate path for that in walk_pgd_range() and so on?)
>
>
> What might work is introducing a PMD marker (note: we don't have PMD 
> markers yet) for that purpose. Likely, the PMD marker may only be set 
> while the PMD lock is being held, and we must not drop the PMD lock 
> temporarily (otherwise people will spin on the value by repeatedly 
> (un)locking the PMD lock, which is stupid).
>
>
> Then you have to make sure that each and every page table walker 
> handles that properly.
>
> Then, there is the problem with holding the PMD lock for too long as I 
> mentioned.
>
> Last but not least, there are more issues that I haven't even 
> described before (putting aside the other issue):
>
>
> Our page table walkers can handle the transitions:
>
> PMD leaf -> PMD none
> * concurrent zap / MADV_DONTNEED / !anon PMD split
>
> PMD leaf -> PTE table
> * concurrent anon PMD split
> * concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
>   allocation
>
> PTE table (empty) -> PMD none
> * concurrent page table reclaim, part of collapsing file THPs
>  * collapse_pte_mapped_thp()
>  * retract_page_tables()
>
>
> I think they *cannot* tolerate the following transition *properly*:
>
> PTE table (non-empty) -> PMD leaf
> * what you want to do ;)
>
> -> Observe how we skip to the next PMD in all page table walkers /  
> give up if pte_offset_map_lock() and friends fail! I even think there 
> are more issues hiding there with pte_offset_map_ro_nolock().
>
>
> Of course, on top of all of this, you should not significantly degrade 
> the ordinary page table walker performance ...
>
>
> I don't want to discourage you, but it's all extremely complicated.
>
Thanks for the detailed reply!



* Re: [QUESTION] anon_vma lock in khugepaged
       [not found] <20241128062641.59096-1-dev.jain@arm.com>
       [not found] ` <d8d29152-ce3f-406d-9e95-a0e8ea2eabbf@redhat.com>
@ 2024-12-05 10:10 ` Dev Jain
  2024-12-05 10:53   ` David Hildenbrand
  1 sibling, 1 reply; 5+ messages in thread
From: Dev Jain @ 2024-12-05 10:10 UTC (permalink / raw)
  To: ryan.roberts, david, kirill.shutemov, willy, ziy, linux-kernel, linux-mm


On 28/11/24 11:56 am, Dev Jain wrote:
> Hi, I was looking at the khugepaged code and I cannot figure out what the problem would be
> if we took the mmap lock in read mode. Shouldn't just taking the PMD lock, then the PTL,
> then unlocking the PTL, then unlocking the PMD, solve any races with page table walkers?
>
>

Similar questions:

1. Why do we need anon_vma_lock_write() in collapse_huge_page()? AFAIK we need to walk anon_vma's either
    when we are forking or when we are unmapping a folio and need to find all VMAs mapping it; the former path takes the
    mmap_write_lock(), so we have no problem there. For the latter, if we just had anon_vma_lock_read(), it could
    happen that kswapd isolates the folio from the LRU, traverses the rmap, and swaps the folio out, so that khugepaged
    fails in folio_isolate_lru() -- but that is not fatal, just a performance degradation due to a race (and the entire
    code is racy anyway). What am I missing?

2. In which scenarios does rmap come into play? Fork and swapping out -- any others I am missing?

3. Please confirm the correctness of this: in stark contrast to page migration, we do not need to do an rmap walk and nuke all
    PTEs referencing the folio, because for anon non-shmem folios the only way the folio can be shared is fork,
    and in that case folio_put() will not release the folio in __collapse_huge_page_copy_succeeded() -> free_page_and_swap_cache(),
    so the old folio is still there and child processes can read from it. Page migration, in contrast, requires that we
    be able to deallocate the old folios.




* Re: [QUESTION] anon_vma lock in khugepaged
  2024-12-05 10:10 ` [QUESTION] anon_vma " Dev Jain
@ 2024-12-05 10:53   ` David Hildenbrand
  0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2024-12-05 10:53 UTC (permalink / raw)
  To: Dev Jain, ryan.roberts, kirill.shutemov, willy, ziy,
	linux-kernel, linux-mm

On 05.12.24 11:10, Dev Jain wrote:
> 
> On 28/11/24 11:56 am, Dev Jain wrote:
>> Hi, I was looking at the khugepaged code and I cannot figure out what the problem would be
>> if we took the mmap lock in read mode. Shouldn't just taking the PMD lock, then the PTL,
>> then unlocking the PTL, then unlocking the PMD, solve any races with page table walkers?
>>
>>
> 
> Similar questions:
> 
> 1. Why do we need anon_vma_lock_write() in collapse_huge_page()? AFAIK we need to walk anon_vma's either
>      when we are forking or when we are unmapping a folio and need to find all VMAs mapping it; the former path takes the
>      mmap_write_lock(), so we have no problem there. For the latter, if we just had anon_vma_lock_read(), it could
>      happen that kswapd isolates the folio from the LRU, traverses the rmap, and swaps the folio out, so that khugepaged
>      fails in folio_isolate_lru() -- but that is not fatal, just a performance degradation due to a race (and the entire
>      code is racy anyway). What am I missing?
> 
> 2. In which scenarios does rmap come into play? Fork and swapping out -- any others I am missing?
> 
> 3. Please confirm the correctness of this: in stark contrast to page migration, we do not need to do an rmap walk and nuke all
>      PTEs referencing the folio, because for anon non-shmem folios the only way the folio can be shared is fork,
>      and in that case folio_put() will not release the folio in __collapse_huge_page_copy_succeeded() -> free_page_and_swap_cache(),
>      so the old folio is still there and child processes can read from it. Page migration, in contrast, requires that we
>      be able to deallocate the old folios.
> 

Without going too much into detail: I think one reason is that 
try_to_unmap() and friends should not fail (not finding all PTEs/PMDs) 
simply because there is concurrent collapse going on.

Observe how try_to_unmap() and friends take the folio lock first, and 
how collapse tries taking the folio locks only after it already 
invalidated the PMDP.

Likely there are other reasons, but this is the "obvious" one I can 
think of.

We really try preventing *any* concurrent page table walkers, and that 
requires the mmap lock+vma lock in write + rmap lock in write.
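
Schematically (a rough sketch of the lock ordering; details elided):

	/* rmap walk side, e.g. try_to_unmap(): */
	folio_lock(folio);
	anon_vma = folio_lock_anon_vma_read(folio, rwc);
	/* ... expects to find all PTEs/PMDs mapping the folio ... */

	/* collapse side: */
	anon_vma_lock_write(vma->anon_vma);	/* excludes the rmap walk above */
	pmdp_collapse_flush(vma, address, pmd);	/* PMDP invalidated; the folio
						   locks are only taken later */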

-- 
Cheers,

David / dhildenb



