Re: [QUESTION] mmap lock in khugepaged

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dev Jain <dev.jain@arm.com>
To: David Hildenbrand <david@redhat.com>,
	ryan.roberts@arm.com, kirill.shutemov@linux.intel.com,
	willy@infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, ziy@nvidia.com
Subject: Re: [QUESTION] mmap lock in khugepaged
Date: Mon, 2 Dec 2024 15:24:18 +0530	[thread overview]
Message-ID: <fe96ea70-5796-483b-ac1f-e58feeb63676@arm.com> (raw)
In-Reply-To: <d5b827a4-81ae-4283-9bb4-2e38adc47d71@redhat.com>


On 29/11/24 5:00 pm, David Hildenbrand wrote:
>>>
>>> Thinking about it, I am also not sure if most other code can deal with
>>> temporary pmd_none(). These don't necessarily take the PMD lock,
>>> because "why would they" right now.
>>>
>>> See walk_pmd_range() as one example, if it spots pmd_none() it assumes
>>> "there really is nothing" without checking the PMD lock.
>>>
>>> As a more concrete example, assume someone calls MADV_DONTNEED and we
>>> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
>>> we'll skip that entry without getting the PMD lock involved. That
>>> would mean that you would be breaking MADV_DONTNEED if you managed to
>>> collapse or not -- memory would not get discarded.
>>>
>>> This is a real problem with anonymous memory.
>>>
>> Yes, I thought of different locking fashions but the problem seems to be
>> that any pagetable walker will choose an action path according to the 
>> value
>> it sees.
>>
>>>
>>> Unless I am missing something it's all very tricky and there might be
>>> a lot of such code that assumes "if I hold a mmap lock / VMA lock in
>>> read mode, pmd_none() means there is nothing even without holding the
>>> PMD lock when checking".
>>
>> Yes, I was betting on the fact that, if the code assumes that pmd_none()
>> means there is nothing, eventually when it will take the PMD lock to
>> write to it, it will check whether
>> the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED
>> thingy, thanks.
>
> Note that this is just the tip of the iceberg. Most page table walkers 
> that deal with anonymous memory have the same requirement.
>
>>>>>
>>>>>
>>>>> I recall that for shmem that's "easier", because we don't have to
>>>>> reinstall a PMD immediately, we cna be sure that the page table is
>>>>> kept empty/unmodified, ...
>>>>>
>>>>
>>
>> All in all, the general solution seems to be that, if I can take all
>> pagetable walkers into an invalid state and make them backoff, then I am
>> safe.
> > For example, we do not zero out the PMD, we take the pte PTL, we do> 
> stuff, then instead of setting the PTEs to zero, we set it to a universal
>> invalid state upon which no pagetable walker can take an action; an
>> instance of that can be to set the PTE to a swap entry such that the 
>> fault
>> handler ends up in do_swap_page() ->print_bad_pte(). So now we can take
>> the PMD lock (in fact we don't need it since any pagetable walking
>> is rendered useless) and populate the PMD to resume the new pagetable
>> walking. Another *ridiculous* idea may be to remember the PGD we
>> came from and nuke it (but I guess there is an alternate path for that
>> in walk_pgd_range() and so on?)
>
>
> What might work is introducing a PMD marker (note: we don't have PMD 
> markers yet) for that purpose. Likely, the PMD marker may only be set 
> while the PMD lock is being held, and we must not drop the PMD lock 
> temporarily (otherwise people will spin on the value by repeadetly 
> (un)locking the PMD lock, which is stupid).
>
>
> Then you have to make sure that each and every page table walker 
> handles that properly.
>
> Then, there is the problem with holding the PMD lock for too long as I 
> mentioned.
>
> Last but not least, there are more issues that I haven't even 
> described before (putting aside the other issue):
>
>
> Our page table walkers can handle the transitions:
>
> PMD leaf -> PMD none
> * concurrent zap / MADV_DONTNEED / !anon PMD split
>
> PMD leaf -> PTE table
> * concurrent anon PMD split
> * concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
>   allocation
>
> PTE table (empty) -> PMD none
> * concurrent page table reclaim, part of collapsing file THPs
>  * collapse_pte_mapped_thp()
>  * retract_page_tables()
>
>
> I think they *cannot* tolerate the transition *properly*:
>
> PTE table (non-empty) -> PMD leaf
> * what you want do do ;)
>
> -> Observe how we skip to the next PMD in all page table walkers /  
> give up if pte_offset_map_lock() and friends fail! I even think there 
> are more issues hiding there with pte_offset_map_ro_nolock().
>
>
> Of course, on top of all of this, you should net significantly degrade 
> the ordinary page table walker performance ...
>
>
> I don't want to discourage you, but it's all extremely complicated.
>
Thanks for the detailed reply!

next prev parent reply	other threads:[~2024-12-02  9:54 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20241128062641.59096-1-dev.jain@arm.com>
     [not found] ` <d8d29152-ce3f-406d-9e95-a0e8ea2eabbf@redhat.com>
     [not found]   ` <4cb26a06-d982-4ca3-a5f7-7c6a6c63428c@arm.com>
     [not found]     ` <3d4c57dd-0821-4684-9018-db8274c170ec@redhat.com>
2024-11-29  7:10       ` Dev Jain
2024-11-29 11:30         ` David Hildenbrand
2024-12-02  9:54           ` Dev Jain [this message]
2024-12-05 10:10 ` [QUESTION] anon_vma " Dev Jain
2024-12-05 10:53   ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fe96ea70-5796-483b-ac1f-e58feeb63676@arm.com \
    --to=dev.jain@arm.com \
    --cc=david@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ryan.roberts@arm.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox