Message-ID: <8d752d25-b9b2-4bf9-9a81-254aeb3ab0f6@arm.com>
Date: Thu, 2 Jan 2025 15:38:29 +0530
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
From: Dev Jain <dev.jain@arm.com>
To: David Hildenbrand, akpm@linux-foundation.org, willy@infradead.org,
 kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu,
 haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241216165105.56185-1-dev.jain@arm.com>
 <20241216165105.56185-10-dev.jain@arm.com>
 <2215dd8e-233a-427b-b15c-a2ffbce8f46d@redhat.com>
 <28013908-65d8-462e-b975-cd0f63d226b1@arm.com>
 <0368f4f2-cb0f-4633-a86d-5c3f75839b4e@redhat.com>

On 18/12/24 2:05 pm, Dev Jain wrote:
>
> On 17/12/24 4:02 pm, David Hildenbrand wrote:
>> On 17.12.24 11:07, Dev Jain wrote:
>>>
>>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>>> On 16.12.24 17:51, Dev Jain wrote:
>>>>> In contrast to PMD collapse, we do not need to operate on two levels
>>>>> of the pagetable simultaneously. Therefore, downgrade the mmap lock
>>>>> from write to read mode. Still take the anon_vma lock in exclusive
>>>>> mode so as not to waste time in the rmap path, which is going to fail
>>>>> anyway since the PTEs are going to be changed. Under the PTL, copy the
>>>>> page contents, clear the PTEs, remove folio pins, and (try to) unmap
>>>>> the old folios. Set the PTEs to the new folio using the set_ptes()
>>>>> API.
>>>>>
>>>>> Signed-off-by: Dev Jain
>>>>> ---
>>>>> Note: I have been trying hard to get rid of the locks in here: we are
>>>>> still taking the PTL around the page copying; dropping the PTL and
>>>>> taking it after the copying should lead to a deadlock, for example:
>>>>>
>>>>> khugepaged                        madvise(MADV_COLD)
>>>>> folio_lock()                      lock(ptl)
>>>>> lock(ptl)                         folio_lock()
>>>>>
>>>>> We can create a locked folio list, altogether drop both the locks,
>>>>> take the PTL, do everything which __collapse_huge_page_isolate() does
>>>>> *except* the isolation, and again try locking the folios, but then it
>>>>> will reduce the efficiency of khugepaged and almost looks like a
>>>>> forced solution :)
>>>>> Please note the following discussion if anyone is interested:
>>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>>>
>>>>> (Apologies for not CCing the mailing list from the start)
>>>>>
>>>>>  mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>>>>>  1 file changed, 87 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 88beebef773e..8040b130e677 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>>                          struct vm_area_struct *vma,
>>>>>                          unsigned long address,
>>>>>                          spinlock_t *ptl,
>>>>> -                        struct list_head *compound_pagelist)
>>>>> +                        struct list_head *compound_pagelist, int order)
>>>>>  {
>>>>>      struct folio *src, *tmp;
>>>>>      pte_t *_pte;
>>>>>      pte_t pteval;
>>>>>
>>>>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> +    for (_pte = pte; _pte < pte + (1UL << order);
>>>>>           _pte++, address += PAGE_SIZE) {
>>>>>          pteval = ptep_get(_pte);
>>>>>          if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>>>              add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>>>              if (is_zero_pfn(pte_pfn(pteval))) {
>>>>> -                /*
>>>>> -                 * ptl mostly unnecessary.
>>>>> -                 */
>>>>> -                spin_lock(ptl);
>>>>> -                ptep_clear(vma->vm_mm, address, _pte);
>>>>> -                spin_unlock(ptl);
>>>>> +                if (order == HPAGE_PMD_ORDER) {
>>>>> +                    /*
>>>>> +                     * ptl mostly unnecessary.
>>>>> +                     */
>>>>> +                    spin_lock(ptl);
>>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                    spin_unlock(ptl);
>>>>> +                } else {
>>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                }
>>>>>                  ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>>>              }
>>>>>          } else {
>>>>> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>>              src = page_folio(src_page);
>>>>>              if (!folio_test_large(src))
>>>>>                  release_pte_folio(src);
>>>>> -            /*
>>>>> -             * ptl mostly unnecessary, but preempt has to
>>>>> -             * be disabled to update the per-cpu stats
>>>>> -             * inside folio_remove_rmap_pte().
>>>>> -             */
>>>>> -            spin_lock(ptl);
>>>>> -            ptep_clear(vma->vm_mm, address, _pte);
>>>>> -            folio_remove_rmap_pte(src, src_page, vma);
>>>>> -            spin_unlock(ptl);
>>>>> +            if (order == HPAGE_PMD_ORDER) {
>>>>> +                /*
>>>>> +                 * ptl mostly unnecessary, but preempt has to
>>>>> +                 * be disabled to update the per-cpu stats
>>>>> +                 * inside folio_remove_rmap_pte().
>>>>> +                 */
>>>>> +                spin_lock(ptl);
>>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>>
>>>>
>>>>
>>>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>>>> +                spin_unlock(ptl);
>>>>> +            } else {
>>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>>>> +            }
>>>>
>>>> As I've talked to Nico about this code recently ... :)
>>>>
>>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>>> TLB flush?
>>>>
>>>> How do you sync against concurrent write access + GUP-fast?
>>>>
>>>> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check
>>>> if there are unexpected page references (e.g., GUP) and if so back
>>>> off, (3) copy page content, (4) set updated PTE/PMD.
>>>
>>> Thanks...we need to ensure GUP-fast does not write while we are copying
>>> the contents, so (2) will ensure that GUP-fast sees the cleared PTE and
>>> backs off.
>>
>> Yes, and of course also that the CPU cannot still modify the page
>> content concurrently while/after you copy it, but before you
>> unmap+flush.
>>
>>>>
>>>> To Nico, I suggested doing it simple initially, and still clearing the
>>>> high-level PMD entry + flushing under the mmap write lock, then
>>>> re-mapping the PTE table after modifying the page table. It's not as
>>>> efficient, but "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the order of the copying and the
>>> PTE clearing to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole
>> period: from when you temporarily clear the PTEs + flush the TLB,
>> through the copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following
> algorithm theoretically correct: (1) Take the PTL, scan the PTEs,
> isolate and lock the folios, set the PTEs to migration entries, and
> check the folio references. This solves the concurrent-write races.
> Now we can drop the PTL: no one can write to the old folios because
> (1) rmap cannot run and (2) the folio cannot be derived from the PTE.
> Note that the migration_entry_wait_on_locked() path can be scheduled
> out, so this is not the same as the fault handlers spinning on the
> PTL. We can now safely copy the old folios into the new folio, then
> take the PTL: the PTL is available because every pagetable walker will
> see a migration entry and back off. We batch-set the PTEs now and
> release the folio locks, letting the fault handlers get out of
> migration_entry_wait_on_locked(). Compared to the old code, the point
> of failure we need to handle is when copying fails, or when folio
> isolation fails at some point...therefore, we need to maintain a list
> of the old PTEs corresponding to the PTEs set to migration entries.
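To make the above sequence concrete, something like the sketch below is
what I have in mind. It is rough and untested: the helper name is made
up, folio isolation, the refcount checks, write permissions and the
back-off path that restores the saved PTEs are only hinted at or elided,
all PTEs are assumed to map normal anon pages, and the on-stack PTE
array is just for illustration.

/*
 * Sketch only, not the actual patch: illustrates the ordering
 *   (a) PTL: lock folios, stash the old PTEs, install migration entries
 *   (b) no PTL: copy page contents
 *   (c) PTL: set_ptes() to the new folio
 * Isolation, refcount checks, write permissions and the back-off path
 * (restoring old_ptes[]) are elided.
 */
static void collapse_via_migration_entries(struct mm_struct *mm,
		struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long haddr, struct folio *new_folio, int order)
{
	unsigned int i, nr = 1 << order;
	pte_t old_ptes[HPAGE_PMD_NR];	/* on-stack only for the sketch */
	pte_t *start_pte, *ptep;
	unsigned long addr;
	spinlock_t *ptl;

	/* (a) Replace each PTE with a migration entry under the PTL. */
	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	for (i = 0, addr = haddr, ptep = start_pte; i < nr;
	     i++, addr += PAGE_SIZE, ptep++) {
		struct page *page;
		swp_entry_t entry;

		old_ptes[i] = ptep_get(ptep);
		page = vm_normal_page(vma, addr, old_ptes[i]);
		if (!folio_trylock(page_folio(page))) {
			/* back off: restore old_ptes[0..i) and bail (elided) */
		}
		entry = make_readable_migration_entry(page_to_pfn(page));
		set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));
	}
	pte_unmap_unlock(start_pte, ptl);
	flush_tlb_range(vma, haddr, haddr + nr * PAGE_SIZE);

	/*
	 * (b) PTL dropped: faults hit the migration entries and sleep in
	 * migration_entry_wait_on_locked() on the locked old folios, so
	 * the contents are stable while we copy.
	 */
	for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE)
		copy_user_highpage(folio_page(new_folio, i),
				   vm_normal_page(vma, addr, old_ptes[i]),
				   addr, vma);

	/* (c) Retake the PTL, batch-map the new folio, wake the waiters. */
	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	set_ptes(mm, haddr, start_pte,
		 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), nr);
	pte_unmap_unlock(start_pte, ptl);
	/*
	 * Write permission (maybe_mkwrite/pte_mkdirty), rmap accounting and
	 * folio_unlock()/folio_put() of the old folios elided.
	 */
}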
>
> Note that I had suggested this "setting the PTEs to a global invalid
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and the PTE was the main problem
> there, since the walkers do not take a lock on the PMD to check if
> someone is changing it, when what they really are interested in is
> making a change at the PTE level. In fact, leaving all specifics like
> racing with a specific pagetable walker aside, I do not see why the
> following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being, in khugepaged we need the VMA to be stable,
> hence have to take the mmap_read_lock(), and have to "migrate" to a
> large folio instead of individual pages.
>
> If my theory is correct at all, I'll leave it to the community to
> decide whether it's worth going through my brain-rot :)

If more thinking is required on this sequence, I can postpone this to a
future optimization patch.

>
>>
>> When having to back off (restore the original PTEs), or for copying,
>> you'll likely need access to the original PTEs, which were already
>> cleared. So you likely need a temporary copy of the original PTEs
>> somehow.
>>
>> That's why temporarily clearing the PMD under the mmap write lock is
>> easier to implement, at the cost of requiring the mmap lock in write
>> mode like PMD collapse.

Why do I need to clear the PMD if I am taking the mmap_write_lock() and
operating only on the PTE?

>>
>>
> So, I understand the following: some CPU spinning on the PTL for a
> long time is worse than taking the mmap_write_lock(). The latter
> blocks this process from doing mmap()s, which, in my limited
> knowledge, is bad for memory-intensive processes (aligned with the
> fact that the maple tree was introduced to optimize VMA operations),
> while the former literally nukes one unit of computation from the
> system for a long time.
>
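To convince myself about that trade-off, and about your point that the
back-off path needs the original PTEs, the fully-under-PTL ordering you
describe would look roughly like the sketch below. Again rough and
untested: the helper name is invented, the expected-refcount check is a
simplified stand-in, the none/zero-PTE cases, folio isolation/unlock and
rmap accounting are elided, and the on-stack PTE array is only for
illustration.

/*
 * Sketch only, not the actual patch: the clear -> flush -> check refs
 * -> copy -> remap ordering, with the PTL held across the whole window.
 */
static int collapse_under_ptl(struct mm_struct *mm,
		struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long haddr, struct folio *new_folio, int order)
{
	unsigned int i, nr = 1 << order;
	pte_t old_ptes[HPAGE_PMD_NR];	/* kept around for the back-off path */
	pte_t *start_pte, *ptep;
	unsigned long addr;
	spinlock_t *ptl;
	int ret = 0;

	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);

	/* (1) Clear the PTEs and flush the TLB. */
	for (i = 0, addr = haddr, ptep = start_pte; i < nr;
	     i++, addr += PAGE_SIZE, ptep++)
		old_ptes[i] = ptep_get_and_clear(mm, addr, ptep);
	flush_tlb_range(vma, haddr, haddr + nr * PAGE_SIZE);

	/* (2) Check for unexpected references (e.g. GUP); back off if any. */
	for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE) {
		struct folio *folio =
			page_folio(vm_normal_page(vma, addr, old_ptes[i]));

		/* Simplified stand-in for the real expected-refcount check. */
		if (folio_ref_count(folio) != 1 + folio_mapcount(folio)) {
			ret = -EAGAIN;
			goto restore;
		}
	}

	/* (3) Copy the old page contents into the new large folio. */
	for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE)
		copy_user_highpage(folio_page(new_folio, i),
				   vm_normal_page(vma, addr, old_ptes[i]),
				   addr, vma);

	/* (4) Map the new folio with one batched update (permissions elided). */
	set_ptes(mm, haddr, start_pte,
		 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), nr);
	goto out;

restore:
	/* This is why the original PTEs must be stashed before clearing. */
	for (i = 0, addr = haddr, ptep = start_pte; i < nr;
	     i++, addr += PAGE_SIZE, ptep++)
		set_pte_at(mm, addr, ptep, old_ptes[i]);
out:
	pte_unmap_unlock(start_pte, ptl);
	return ret;
}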