From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
Date: Fri, 6 Jun 2025 11:55:56 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Lorenzo Stoakes
Cc: Jann Horn, Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
 "Liam R. Howlett", David Hildenbrand, Vlastimil Babka,
 Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
 <0fb74598-1fee-428e-987b-c52276bfb975@bytedance.com>
 <3cb53060-9769-43f4-996d-355189df107d@bytedance.com>
 <7cb990bf-57d4-4fc9-b44c-f30175c0fb7a@bytedance.com>
From: Qi Zheng
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Hi Lorenzo,

On 6/5/25 10:04 PM, Lorenzo Stoakes wrote:
> On Thu, Jun 05, 2025 at 11:23:18AM +0800, Qi Zheng wrote:
>>
>>
>> On 6/5/25 1:50 AM, Lorenzo Stoakes wrote:
>>> On Wed, Jun 04, 2025 at 02:02:12PM +0800, Qi Zheng wrote:
>>>> Hi Lorenzo,
>>>>
>>>> On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
>>>>> On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
>>>>>> Hi Jann,
>>>>>>
>>>>>> On 5/30/25 10:06 PM, Jann Horn wrote:
>>>>>>> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>>>>>> frequently than other madvise options, particularly in native and Java
>>>>>>>> heaps for dynamic memory management.
>>>>>>>>
>>>>>>>> Currently, the mmap_lock is always held during these operations, even when
>>>>>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>>>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>>>>>> hold the lock and block higher-priority threads.
>>>>>>>>
>>>>>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>>>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>>>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>>>>>
>>>>>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>>>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>>>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>>>>>> only 1,231 fell back to mmap_lock.
>>>>>>>>
>>>>>>>> To simplify handling, the implementation falls back to the standard
>>>>>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>>>>>> userfaultfd_remove().
>>>>>>>
>>>>>>> One important quirk of this is that it can, from what I can see, cause
>>>>>>> freeing of page tables (through pt_reclaim) without holding the mmap
>>>>>>> lock at all:
>>>>>>>
>>>>>>> do_madvise [behavior=MADV_DONTNEED]
>>>>>>>   madvise_lock
>>>>>>>   lock_vma_under_rcu
>>>>>>>   madvise_do_behavior
>>>>>>>     madvise_single_locked_vma
>>>>>>>       madvise_vma_behavior
>>>>>>>         madvise_dontneed_free
>>>>>>>           madvise_dontneed_single_vma
>>>>>>>             zap_page_range_single_batched [.reclaim_pt = true]
>>>>>>>               unmap_single_vma
>>>>>>>                 unmap_page_range
>>>>>>>                   zap_p4d_range
>>>>>>>                     zap_pud_range
>>>>>>>                       zap_pmd_range
>>>>>>>                         zap_pte_range
>>>>>>>                           try_get_and_clear_pmd
>>>>>>>                           free_pte
>>>>>>>
>>>>>>> This clashes with the assumption in walk_page_range_novma() that
>>>>>>> holding the mmap lock in write mode is sufficient to prevent
>>>>>>> concurrent page table freeing, so it can probably lead to page table
>>>>>>> UAF through the ptdump interface (see ptdump_walk_pgd()).
>>>>>>
>>>>>> Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
>>>>>> following case:
>>>>>>
>>>>>> cpu 0                                    cpu 1
>>>>>>
>>>>>> ptdump_walk_pgd
>>>>>> --> walk_pte_range
>>>>>>     --> pte_offset_map (hold RCU read lock)
>>>>>>                                          zap_pte_range
>>>>>>                                          --> free_pte (via RCU)
>>>>>>     walk_pte_range_inner
>>>>>>     --> ptdump_pte_entry (the PTE page is not freed at this time)
>>>>>>
>>>>>> IIUC, there is no UAF issue here?
>>>>>>
>>>>>> If I missed anything please let me know.
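[For what it's worth, the interleaving above can be replayed as a toy
single-threaded userspace model. This is NOT kernel code: the `*_model`
helpers and `demo()` are made-up names, and the model only encodes the
one assumption that matters here, namely that a grace period cannot
elapse while any pre-existing reader is still inside its read-side
critical section.]

```c
#include <stdlib.h>

/* Toy model: readers counts rcu_read_lock() nesting, pending_free
 * models one queued RCU callback. The callback only runs once the
 * last reader leaves, mimicking a grace period. */
static int   readers;
static void *pending_free;

static void rcu_read_lock_model(void)  { readers++; }
static void call_rcu_model(void *p)    { pending_free = p; }

static void rcu_read_unlock_model(void)
{
    if (--readers == 0 && pending_free) {   /* grace period elapses */
        free(pending_free);
        pending_free = NULL;
    }
}

/* Replays the cpu 0 / cpu 1 interleaving; returns 0 if the reader
 * never observes freed memory. */
static int demo(void)
{
    int *pte_page = malloc(sizeof *pte_page);

    if (!pte_page)
        return -1;
    *pte_page = 42;

    rcu_read_lock_model();      /* cpu 0: pte_offset_map()            */
    call_rcu_model(pte_page);   /* cpu 1: free_pte() queues the free  */

    if (*pte_page != 42)        /* cpu 0: ptdump_pte_entry() is safe  */
        return 1;

    rcu_read_unlock_model();    /* cpu 0 exits; page may now be freed */
    return pending_free ? 2 : 0;
}
```

[The point is only that free_pte() queues the page rather than freeing
it immediately, so the memory stays valid until the walker leaves its
RCU critical section.]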
>>>
>>> Seems to me that we don't need the VMA locks then unless I'm missing
>>> something? :) Jann?
>>>
>>> Would this RCU-lock-acquired-by-pte_offset_map also save us from the
>>> munmap() downgraded read lock scenario also? Or is the problem there
>>> intermediate page table teardown I guess?
>>>
>>
>> Right. Currently, page table pages other than PTE pages are not
>> protected by RCU, so the mmap write lock is still needed in the munmap
>> path to wait for all readers of the page table pages to exit the
>> critical section.
>>
>> In other words, once all page table pages are protected by RCU, we can
>> completely remove the page tables from the protection of the mmap
>> locks.
>
> Interesting - so on reclaim/migrate we are just clearing PTE entries with
> the rmap lock right? Would this lead to a future where we could also tear
> down page tables there?
>
> Another point to remember is that when we are clearing down higher level
> page tables in the general case, the logic assumes nothing else can touch
> anything... we hold both rmap lock AND mmap/vma locks at this point.
>
> But I guess if we're RCU-safe, we're safe even from rmap right?

Yeah, and we have already done something similar. For more details,
please refer to retract_page_tables(). It only holds the i_mmap_rwsem
read lock and then calls pte_free_defer() to free the PTE page through
RCU.

For the migrate case, the PTE entry will hold a migration entry, right?
A new physical page will be installed soon through a page fault, so I
don't think it is necessary to free the corresponding PTE page.

For the reclaim case, the problem is that only the PTE entries that map
a physical page are operated on each time. If we want to free the entire
PTE page, we need to check the adjacent PTE entries as well. Maybe MGLRU
can help with this; I remember that MGLRU has an optimization that
checks the adjacent PTE entries.
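[Roughly, the retract_page_tables() pattern referred to above looks like
the sketch below. This is a heavily simplified paraphrase, not the
actual khugepaged code: error paths, mmu notifier calls, and several
checks are omitted, so treat the exact ordering as approximate.]

```c
/* Sketch: detach a PTE table while holding only the i_mmap_rwsem read
 * lock plus the pmd lock, then let RCU delay the actual free past any
 * lockless walker that mapped the table via pte_offset_map(). */
i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
        /* ... eligibility checks elided ... */
        pml = pmd_lock(mm, pmd);                /* exclude other writers */
        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
        spin_unlock(pml);
        /* Freed only after an RCU grace period, so concurrent lockless
         * readers that already mapped the table stay safe. */
        pte_free_defer(mm, pmd_pgtable(pgt_pmd));
}
i_mmap_unlock_read(mapping);
```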
>
>> Here are some of my previous thoughts:
>>
>> ```
>> Another plan
>> ============
>>
>> Currently, page table modifications are protected by page table locks
>> (page_table_lock or the split pmd/pte locks), but the life cycle of
>> page table pages is protected by mmap_lock (and the vma lock). For
>> more details, please refer to the recently added
>> Documentation/mm/process_addrs.rst file.
>>
>> Currently we try to free the PTE pages through RCU when
>> CONFIG_PT_RECLAIM is turned on. In this case, we no longer need to
>> hold mmap_lock for read/write operations on the PTE pages.
>>
>> So maybe we can remove the page tables from the protection of the
>> mmap lock (which is too big), like this:
>>
>> 1. free all levels of page table pages by RCU, not just PTE pages,
>>    but also pmd, pud, etc.
>> 2. similar to pte_offset_map/pte_unmap, add
>>    [pmd|pud]_offset_map/[pmd|pud]_unmap, make them all take
>>    rcu_read_lock/rcu_read_unlock, and make them accept failure.
>>
>> In this way, we no longer need the mmap lock. For readers, such as
>> page table walkers, we are already in the RCU critical section. For
>> writers, we only need to hold the page table lock.
>>
>> But there is a difficulty here: the RCU critical section is not
>> allowed to sleep, yet it is possible to sleep in the callback
>> function of .pmd_entry, such as
>> mmu_notifier_invalidate_range_start().
>>
>> Use SRCU instead? Or RCU + refcount? Not sure. But I think it's an
>> interesting thing to try.
>
> Thanks for the information, RCU freeing of page tables is something of a

RCU freeing is relatively simple: tlb_remove_table() can easily be
changed to free all levels of page table pages through RCU. The more
difficult part is protecting the page table pages above the PTE level
with the RCU lock.

> long-term TODO discussed back and forth :) might take a look myself if
> somebody else hasn't grabbed when I have a second...
This is awesome. I'm stuck with some other stuff at the moment; I'll
also take a look at it later when I have time.

>
> Is it _only_ the mmu notifier sleeping in this scenario? Or are there
> other examples?

I'm not sure, it needs some investigation.

>
> We could in theory always add another callback .pmd_entry_sleep or
> something for this one case and document the requirement...

Maybe, but the SRCU critical section cannot prevent the PTE page from
being freed via RCU. :(

Thanks!

>
>> ```
>>
>> Thanks!
>>
>>
>
> Cheers, Lorenzo