From: David Hildenbrand <david@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>,
akpm@linux-foundation.org, tglx@linutronix.de,
kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, songmuchun@bytedance.com,
zhouchengming@bytedance.com
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages
Date: Wed, 10 Nov 2021 16:37:14 +0100
Message-ID: <6ac9cc0d-7dea-0e19-51b3-625ec6561ac7@redhat.com>
In-Reply-To: <20211110143859.GS1740502@nvidia.com>
On 10.11.21 15:38, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
>> On 10.11.21 13:56, Jason Gunthorpe wrote:
>>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>>
>>>> In this patch series, we add a pte_refcount field to the struct page of page
>>>> table to track how many users of PTE page table. Similar to the mechanism of
>>>> page refcount, the user of PTE page table should hold a refcount to it before
>>>> accessing. The PTE page table page will be freed when the last refcount is
>>>> dropped.
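
Just to restate the pattern described in the cover letter as code -- an
illustrative sketch only, where pte_try_get()/pte_put() are placeholder
names for the kind of helpers such a series would add, not necessarily
its actual API:

#include <linux/mm.h>

static int touch_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        /* Pin the PTE page table page; fails if it was already freed. */
        if (!pte_try_get(mm, pmd))
                return -EAGAIN;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        /* ... read or update *pte under the PTE-PTL ... */
        pte_unmap_unlock(pte, ptl);

        /* Drop the pin; dropping the last reference frees the page table page. */
        pte_put(mm, pmd, addr);
        return 0;
}

The refcount increment in pte_try_get() and the decrement in pte_put()
are the two extra atomics per access discussed below.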
>>>
>>> So, this approach basically adds two atomics on every PTE map
>>>
>>> If I have it right the reason that zap cannot clean the PTEs today is
>>> because zap cannot obtain the mmap lock due to a lock ordering issue
>>> with the inode lock vs mmap lock.
>>
>> There are different ways to zap: madvise(DONTNEED) vs
>> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
>> coming: a process page table walker or the rmap.
>
> AFAIK rmap is the same issue, it can't lock the mmap_sem
>
>> The way locking currently works doesn't allow removing a page table
>> just by holding the mmap lock, not even in write mode.
>
> I'm not sure I understand this. If the goal is to free the PTE tables
> then the main concern is use-after free on page table walkers (which
> this series is addressing). Ignoring bugs, we have only three ways to
> read the page table:
Yes, use-after-free and reuse-while-freeing are the two challenges AFAICS.
>
> - Fully locked. Under the PTLs (gup slow is an example)
> - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
> - hw-locked. Barriered with TLB flush (gup fast is an example)
Three additions as far as I can tell:
1. Fully locked currently needs the read mmap lock OR the rmap lock in
read mode. PTLs on their own are not sufficient AFAICT.
2. #1 and #2 can currently only walk within VMA ranges.
3. We can theoretically walk page tables outside VMA ranges with the
write mmap lock, because page tables get removed with the mmap lock in
read mode and heavy-weight operations (VMA layout, khugepaged) are
performed under the write mmap lock.
The rmap locks protect against modifications for which we want to exclude
rmap walkers, similar to how we grab the mmap lock in write mode where the
PTLs are not sufficient.
See mm/mremap.c:move_ptes() for an example that performs both VMA layout
and page table modifications, and khugepaged, which performs page table
modifications but no VMA layout changes.
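
For reference, the rmap-lock side of move_ptes() boils down to roughly the
following (a simplified sketch from memory of what its take_rmap_locks()
helper does; THP handling, lock-ordering caveats and the matching unlock
path are omitted):

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/rmap.h>

/* Exclude rmap walkers, which take these locks in read mode. */
static void take_rmap_locks_sketch(struct vm_area_struct *vma)
{
        if (vma->vm_file)
                i_mmap_lock_write(vma->vm_file->f_mapping);
        if (vma->anon_vma)
                anon_vma_lock_write(vma->anon_vma);
}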
>
> #1 should be completely safe as the PTLs will protect everything
> #2 is safe so long as the write side is held during any layout changes
> #3 interacts with the TLB flush, and is also safe with zap
>
> rmap itself is a #1 page table walker, ie it gets the PTLs under
> page_vma_mapped_walk().
When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?
Because the PMD-PTLs are usually not taken when we already know there is a
page table (nothing would currently change it without heavy locking). And
if they are taken, they are only held while allocating/checking a PMD
entry, not while actually *using* the page table mapped by that entry.
For example, walk_page_range() requires the mmap lock in read and grabs
the PTE-PTLs.
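
As a concrete example of that (only a sketch with made-up helper names,
error handling trimmed): the caller holds the mmap lock in read mode, and
walk_page_range() invokes ->pte_entry with the PTE-PTL of the page table
mapping the current range held -- the PMD-PTL is not held while the table
is being used.

#include <linux/mm.h>
#include <linux/mmap_lock.h>
#include <linux/pagewalk.h>

static int count_present_pte(pte_t *pte, unsigned long addr,
                             unsigned long next, struct mm_walk *walk)
{
        unsigned long *count = walk->private;

        /* Runs with the PTE-PTL of this page table held. */
        if (pte_present(*pte))
                (*count)++;
        return 0;
}

static const struct mm_walk_ops count_ops = {
        .pte_entry = count_present_pte,
};

static unsigned long count_present_ptes(struct mm_struct *mm,
                                        unsigned long start, unsigned long end)
{
        unsigned long count = 0;

        mmap_read_lock(mm);
        walk_page_range(mm, start, end, &count_ops, &count);
        mmap_read_unlock(mm);

        return count;
}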
>
> The sin we have committed here is that both the mmap lock and the PTLs
> are being used to protect the page table itself with a very
> complicated dual semantic.
>
> Splitting the sleeping mmap lock into 'covers vma' and 'covers page
> tables' lets us solve the lock ordering and semi-locked can become
> more fully locked by the new lock, instead of by abusing mmap sem.
It would still be fairly coarse-grained locking; I am not sure that is a
step in the right direction. If you want to modify *some* page table in
your process, you have to exclude each and every page table walker. Or
did I misinterpret what you were saying?
>
> I'd suggest to make this new lock a special rwsem which allows either
> concurrent read access OR concurrent PTL access, but not both. This
I looked into such a lock recently in a similar context and something
like that does not exist yet (and fairness will be challenging). Existing
primitives give you either a single exclusive holder or multiple shared
holders, not two classes of concurrent holders that exclude each other.
I'd be interested if someone knows of something like that.
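
Just to spell out the semantics being asked for, a naive and completely
unfair sketch (not an existing kernel primitive -- the missing fairness is
exactly the hard part): any number of class-A holders may run concurrently,
or any number of class-B holders, but never a mix.

#include <linux/spinlock.h>
#include <linux/wait.h>

struct ab_lock {
        spinlock_t lock;
        int a_holders;
        int b_holders;
        wait_queue_head_t wq;
};

static void ab_lock_init(struct ab_lock *ab)
{
        spin_lock_init(&ab->lock);
        ab->a_holders = 0;
        ab->b_holders = 0;
        init_waitqueue_head(&ab->wq);
}

static void ab_lock_a(struct ab_lock *ab)
{
        spin_lock(&ab->lock);
        while (ab->b_holders) {
                /* Sleep until all class-B holders are gone, then recheck. */
                spin_unlock(&ab->lock);
                wait_event(ab->wq, !READ_ONCE(ab->b_holders));
                spin_lock(&ab->lock);
        }
        ab->a_holders++;
        spin_unlock(&ab->lock);
}

static void ab_unlock_a(struct ab_lock *ab)
{
        spin_lock(&ab->lock);
        if (!--ab->a_holders)
                wake_up_all(&ab->wq);
        spin_unlock(&ab->lock);
}

/*
 * ab_lock_b()/ab_unlock_b() are the mirror image with the counters swapped.
 * A steady stream of A holders can starve B forever (and vice versa), which
 * is the fairness problem mentioned above.
 */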
--
Thanks,
David / dhildenb