Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>, Barry Song <21cnbao@gmail.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Suren Baghdasaryan <surenb@google.com>,
	Lokesh Gidra <lokeshgidra@google.com>,
	Tangquan Zheng <zhengtangquan@oppo.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
Date: Thu, 5 Jun 2025 11:23:18 +0800	[thread overview]
Message-ID: <7cb990bf-57d4-4fc9-b44c-f30175c0fb7a@bytedance.com> (raw)
In-Reply-To: <c813c03a-5d95-43a6-9415-0ceb845eb62c@lucifer.local>



On 6/5/25 1:50 AM, Lorenzo Stoakes wrote:
> On Wed, Jun 04, 2025 at 02:02:12PM +0800, Qi Zheng wrote:
>> Hi Lorenzo,
>>
>> On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
>>> On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
>>>> Hi Jann,
>>>>
>>>> On 5/30/25 10:06 PM, Jann Horn wrote:
>>>>> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>>>> frequently than other madvise options, particularly in native and Java
>>>>>> heaps for dynamic memory management.
>>>>>>
>>>>>> Currently, the mmap_lock is always held during these operations, even when
>>>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>>>> hold the lock and block higher-priority threads.
>>>>>>
>>>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>>>
>>>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>>>> only 1,231 fell back to mmap_lock.
>>>>>>
>>>>>> To simplify handling, the implementation falls back to the standard
>>>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>>>> userfaultfd_remove().
>>>>>
>>>>> One important quirk of this is that it can, from what I can see, cause
>>>>> freeing of page tables (through pt_reclaim) without holding the mmap
>>>>> lock at all:
>>>>>
>>>>> do_madvise [behavior=MADV_DONTNEED]
>>>>>      madvise_lock
>>>>>        lock_vma_under_rcu
>>>>>      madvise_do_behavior
>>>>>        madvise_single_locked_vma
>>>>>          madvise_vma_behavior
>>>>>            madvise_dontneed_free
>>>>>              madvise_dontneed_single_vma
>>>>>                zap_page_range_single_batched [.reclaim_pt = true]
>>>>>                  unmap_single_vma
>>>>>                    unmap_page_range
>>>>>                      zap_p4d_range
>>>>>                        zap_pud_range
>>>>>                          zap_pmd_range
>>>>>                            zap_pte_range
>>>>>                              try_get_and_clear_pmd
>>>>>                              free_pte
>>>>>
>>>>> This clashes with the assumption in walk_page_range_novma() that
>>>>> holding the mmap lock in write mode is sufficient to prevent
>>>>> concurrent page table freeing, so it can probably lead to page table
>>>>> UAF through the ptdump interface (see ptdump_walk_pgd()).
>>>>
>>>> Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
>>>> following case:
>>>>
>>>> cpu 0				cpu 1
>>>>
>>>> ptdump_walk_pgd
>>>> --> walk_pte_range
>>>>       --> pte_offset_map (hold RCU read lock)
>>>> 				zap_pte_range
>>>> 				--> free_pte (via RCU)
>>>>           walk_pte_range_inner
>>>>           --> ptdump_pte_entry (the PTE page is not freed at this time)
>>>>
>>>> IIUC, there is no UAF issue here?
>>>>
>>>> If I missed anything please let me know.
> 
> Seems to me that we don't need the VMA locks then unless I'm missing
> something? :) Jann?
> 
> Would this RCU-lock-acquired-by-pte_offset_map also save us from the
> munmap() downgraded read lock scenario also? Or is the problem there
> intermediate page table teardown I guess?
> 

Right. Currently, page table pages other than PTE pages are not
protected by RCU, so mmap write lock still needed in the munmap path
to wait for all readers of the page table pages to exit the critical
section.

In other words, once we have achieved that all page table pages are
protected by RCU, we can completely remove the page table pages from
the protection of mmap locks.

Here are some of my previous thoughts:

```
Another plan
============

Currently, page table modification are protected by page table locks
(page_table_lock or split pmd/pte lock), but the life cycle of page
table pages are protected by mmap_lock (and vma lock). For more details,
please refer to the latest added Documentation/mm/process_addrs.rst file.

Currently we try to free the PTE pages through RCU when
CONFIG_PT_RECLAIM is turned on. In this case, we will no longer
need to hold mmap_lock for the read/write op on the PTE pages.

So maybe we can remove the page table from the protection of the mmap
lock (which is too big), like this:

1. free all levels of page table pages by RCU, not just PTE pages, but
    also pmd, pud, etc.
2. similar to pte_offset_map/pte_unmap, add
    [pmd|pud]_offset_map/[pmd|pud]_unmap, and make them all contain
    rcu_read_lock/rcu_read_unlcok, and make them accept failure.

In this way, we no longer need the mmap lock. For readers, such as page
table wallers, we are already in the critical section of RCU. For
writers, we only need to hold the page table lock.

But there is a difficulty here, that is, the RCU critical section is not
allowed to sleep, but it is possible to sleep in the callback function
of .pmd_entry, such as mmu_notifier_invalidate_range_start().

Use SRCU instead? Or use RCU + refcount method? Not sure. But I think
it's an interesting thing to try.
```

Thanks!

next prev parent reply	other threads:[~2025-06-05  3:23 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-30 10:44 Barry Song
2025-05-30 14:06 ` Jann Horn
2025-05-30 14:34   ` Lorenzo Stoakes
2025-05-30 20:17     ` Barry Song
2025-06-02 17:35       ` SeongJae Park
2025-06-02 17:53         ` SeongJae Park
2025-05-30 20:40     ` Jann Horn
2025-06-02 11:50       ` Lorenzo Stoakes
2025-06-03  1:06         ` Barry Song
2025-06-03  9:48           ` Lorenzo Stoakes
2025-06-03  7:06       ` Barry Song
2025-06-03 16:52         ` Jann Horn
2025-06-05 10:27           ` Barry Song
2025-05-30 22:00   ` Barry Song
2025-06-02 14:55     ` Jann Horn
2025-06-03  7:51       ` Barry Song
2025-06-03  7:24   ` Qi Zheng
2025-06-03  9:54     ` Lorenzo Stoakes
2025-06-04  6:02       ` Qi Zheng
2025-06-04 17:50         ` Lorenzo Stoakes
2025-06-05  3:23           ` Qi Zheng [this message]
2025-06-05 14:04             ` Lorenzo Stoakes
2025-06-06  3:55               ` Qi Zheng
2025-06-06 10:44                 ` Lorenzo Stoakes
2025-06-09  6:40                   ` Qi Zheng
2025-06-09 15:08                     ` Lorenzo Stoakes
2025-06-10  7:20                     ` David Hildenbrand
2025-06-06 11:07           ` Jann Horn
2025-06-03 18:43 ` Lorenzo Stoakes
2025-06-03 20:17   ` Suren Baghdasaryan
2025-06-04  5:22     ` Lorenzo Stoakes
2025-06-06  7:18     ` Barry Song
2025-06-06 10:16       ` Lorenzo Stoakes
2025-06-03 20:59   ` Pedro Falcato
2025-06-04  5:23     ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7cb990bf-57d4-4fc9-b44c-f30175c0fb7a@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lokeshgidra@google.com \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=surenb@google.com \
    --cc=v-songbaohua@oppo.com \
    --cc=vbabka@suse.cz \
    --cc=zhengtangquan@oppo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox