Message-ID: <8d752d25-b9b2-4bf9-9a81-254aeb3ab0f6@arm.com>
Date: Thu, 2 Jan 2025 15:38:29 +0530
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
From: Dev Jain <dev.jain@arm.com>
To: David Hildenbrand, akpm@linux-foundation.org, willy@infradead.org,
 kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu,
 haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241216165105.56185-1-dev.jain@arm.com>
 <20241216165105.56185-10-dev.jain@arm.com>
 <2215dd8e-233a-427b-b15c-a2ffbce8f46d@redhat.com>
 <28013908-65d8-462e-b975-cd0f63d226b1@arm.com>
 <0368f4f2-cb0f-4633-a86d-5c3f75839b4e@redhat.com>

On 18/12/24 2:05 pm, Dev Jain wrote:
>
> On 17/12/24 4:02 pm, David Hildenbrand wrote:
>> On 17.12.24 11:07, Dev Jain wrote:
>>>
>>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>>> On 16.12.24 17:51, Dev Jain wrote:
>>>>> In contrast to PMD collapse, we do not need to operate on two levels
>>>>> of the pagetable simultaneously. Therefore, downgrade the mmap lock
>>>>> from write to read mode. Still take the anon_vma lock in exclusive
>>>>> mode so as not to waste time in the rmap path, which is going to fail
>>>>> anyway since the PTEs are going to be changed. Under the PTL, copy the
>>>>> page contents, clear the PTEs, remove folio pins, and (try to) unmap
>>>>> the old folios. Set the PTEs to the new folio using the set_ptes()
>>>>> API.
>>>>>
>>>>> Signed-off-by: Dev Jain
>>>>> ---
>>>>> Note: I have been trying hard to get rid of the locks in here: we are
>>>>> still taking the PTL around the page copying; dropping the PTL and
>>>>> taking it after the copying should lead to a deadlock, for example:
>>>>>
>>>>> khugepaged                        madvise(MADV_COLD)
>>>>> folio_lock()                      lock(ptl)
>>>>> lock(ptl)                         folio_lock()
>>>>>
>>>>> We can create a locked folio list, altogether drop both the locks,
>>>>> take the PTL, do everything which __collapse_huge_page_isolate() does
>>>>> *except* the isolation, and again try locking the folios, but then it
>>>>> will reduce the efficiency of khugepaged and almost looks like a
>>>>> forced solution :)
>>>>> Please note the following discussion if anyone is interested:
>>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>>>
>>>>> (Apologies for not CCing the mailing list from the start)
>>>>>
>>>>>  mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>>>>>  1 file changed, 87 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 88beebef773e..8040b130e677 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>>                          struct vm_area_struct *vma,
>>>>>                          unsigned long address,
>>>>>                          spinlock_t *ptl,
>>>>> -                        struct list_head *compound_pagelist)
>>>>> +                        struct list_head *compound_pagelist, int order)
>>>>>  {
>>>>>      struct folio *src, *tmp;
>>>>>      pte_t *_pte;
>>>>>      pte_t pteval;
>>>>>
>>>>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> +    for (_pte = pte; _pte < pte + (1UL << order);
>>>>>           _pte++, address += PAGE_SIZE) {
>>>>>          pteval = ptep_get(_pte);
>>>>>          if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>>>              add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>>>              if (is_zero_pfn(pte_pfn(pteval))) {
>>>>> -                /*
>>>>> -                 * ptl mostly unnecessary.
>>>>> -                 */
>>>>> -                spin_lock(ptl);
>>>>> -                ptep_clear(vma->vm_mm, address, _pte);
>>>>> -                spin_unlock(ptl);
>>>>> +                if (order == HPAGE_PMD_ORDER) {
>>>>> +                    /*
>>>>> +                     * ptl mostly unnecessary.
>>>>> +                     */
>>>>> +                    spin_lock(ptl);
>>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                    spin_unlock(ptl);
>>>>> +                } else {
>>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                }
>>>>>                  ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>>>              }
>>>>>          } else {
>>>>> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>>              src = page_folio(src_page);
>>>>>              if (!folio_test_large(src))
>>>>>                  release_pte_folio(src);
>>>>> -            /*
>>>>> -             * ptl mostly unnecessary, but preempt has to
>>>>> -             * be disabled to update the per-cpu stats
>>>>> -             * inside folio_remove_rmap_pte().
>>>>> -             */
>>>>> -            spin_lock(ptl);
>>>>> -            ptep_clear(vma->vm_mm, address, _pte);
>>>>> -            folio_remove_rmap_pte(src, src_page, vma);
>>>>> -            spin_unlock(ptl);
>>>>> +            if (order == HPAGE_PMD_ORDER) {
>>>>> +                /*
>>>>> +                 * ptl mostly unnecessary, but preempt has to
>>>>> +                 * be disabled to update the per-cpu stats
>>>>> +                 * inside folio_remove_rmap_pte().
>>>>> +                 */
>>>>> +                spin_lock(ptl);
>>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>>
>>>>
>>>>
>>>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>>>> +                spin_unlock(ptl);
>>>>> +            } else {
>>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>>>> +            }
>>>>
>>>> As I've talked to Nico about this code recently ... :)
>>>>
>>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>>> TLB flush?
>>>>
>>>> How do you sync against concurrent write access + GUP-fast?
>>>>
>>>> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check
>>>> if there are unexpected page references (e.g., GUP) and if so back
>>>> off, (3) copy page content, (4) set updated PTE/PMD.
>>>
>>> Thanks...we need to ensure GUP-fast does not write while we are copying
>>> the contents, so (2) will ensure that GUP-fast sees the cleared PTE and
>>> backs off.
>>
>> Yes, and of course also that the CPU cannot still modify the page
>> content concurrently while/after you copy it, but before you
>> unmap+flush.
>>
>>>>
>>>> To Nico, I suggested doing it simple initially, and still clearing the
>>>> high-level PMD entry + flushing under the mmap write lock, then
>>>> re-mapping the PTE table after modifying the page table. It's not as
>>>> efficient, but "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the order of the copying and the
>>> PTE clearing to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole
>> period: from when you temporarily clear the PTEs + flush the TLB,
>> through the copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following
> algorithm theoretically correct: (1) Take the PTL, scan the PTEs,
> isolate and lock the folios, set the PTEs to migration entries, and
> check the folio references. This solves the concurrent-write races.
> Now we can drop the PTL: no one can write to the old folios because
> (1) rmap cannot run and (2) the folio cannot be derived from the PTE.
> Note that the migration_entry_wait_on_locked() path can be scheduled
> out, so this is not the same as the fault handlers spinning on the
> PTL. We can now safely copy the old folios into the new folio, then
> take the PTL: the PTL is available because every pagetable walker will
> see a migration entry and back off. We batch-set the PTEs now and
> release the folio locks, letting the fault handlers get out of
> migration_entry_wait_on_locked(). Compared to the old code, the point
> of failure we need to handle is when copying fails, or when folio
> isolation fails at some point...therefore, we need to maintain a list
> of the old PTEs corresponding to the PTEs set to migration entries.
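To make the above sequence concrete, something like the sketch below is
what I have in mind. It is rough and untested: the helper name is made
up, folio isolation, the refcount checks, write permissions and the
back-off path that restores the saved PTEs are only hinted at or elided,
all PTEs are assumed to map normal anon pages, and the on-stack PTE
array is just for illustration.

/*
 * Sketch only, not the actual patch: illustrates the ordering
 *   (a) PTL: lock folios, stash the old PTEs, install migration entries
 *   (b) no PTL: copy page contents
 *   (c) PTL: set_ptes() to the new folio
 * Isolation, refcount checks, write permissions and the back-off path
 * (restoring old_ptes[]) are elided.
 */
static void collapse_via_migration_entries(struct mm_struct *mm,
		struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long haddr, struct folio *new_folio, int order)
{
	unsigned int i, nr = 1 << order;
	pte_t old_ptes[HPAGE_PMD_NR];	/* on-stack only for the sketch */
	pte_t *start_pte, *ptep;
	unsigned long addr;
	spinlock_t *ptl;

	/* (a) Replace each PTE with a migration entry under the PTL. */
	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	for (i = 0, addr = haddr, ptep = start_pte; i < nr;
	     i++, addr += PAGE_SIZE, ptep++) {
		struct page *page;
		swp_entry_t entry;

		old_ptes[i] = ptep_get(ptep);
		page = vm_normal_page(vma, addr, old_ptes[i]);
		if (!folio_trylock(page_folio(page))) {
			/* back off: restore old_ptes[0..i) and bail (elided) */
		}
		entry = make_readable_migration_entry(page_to_pfn(page));
		set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));
	}
	pte_unmap_unlock(start_pte, ptl);
	flush_tlb_range(vma, haddr, haddr + nr * PAGE_SIZE);

	/*
	 * (b) PTL dropped: faults hit the migration entries and sleep in
	 * migration_entry_wait_on_locked() on the locked old folios, so
	 * the contents are stable while we copy.
	 */
	for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE)
		copy_user_highpage(folio_page(new_folio, i),
				   vm_normal_page(vma, addr, old_ptes[i]),
				   addr, vma);

	/* (c) Retake the PTL, batch-map the new folio, wake the waiters. */
	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	set_ptes(mm, haddr, start_pte,
		 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), nr);
	pte_unmap_unlock(start_pte, ptl);
	/*
	 * Write permission (maybe_mkwrite/pte_mkdirty), rmap accounting and
	 * folio_unlock()/folio_put() of the old folios elided.
	 */
}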
>
> Note that I had suggested this "setting the PTEs to a global invalid
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and the PTE was the main problem
> there, since the walkers do not take a lock on the PMD to check if
> someone is changing it, when what they really are interested in is
> making a change at the PTE level. In fact, leaving all specifics like
> racing with a specific pagetable walker aside, I do not see why the
> following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being, in khugepaged we need the VMA to be stable,
> hence have to take the mmap_read_lock(), and have to "migrate" to a
> large folio instead of individual pages.
>
> If my theory is correct at all, I'll leave it to the community to
> decide whether it's worth going through my brain-rot :)

If more thinking is required on this sequence, I can postpone this to a
future optimization patch.

>
>>
>> When having to back off (restore the original PTEs), or for copying,
>> you'll likely need access to the original PTEs, which were already
>> cleared. So you likely need a temporary copy of the original PTEs
>> somehow.
>>
>> That's why temporarily clearing the PMD under the mmap write lock is
>> easier to implement, at the cost of requiring the mmap lock in write
>> mode like PMD collapse.

Why do I need to clear the PMD if I am taking the mmap_write_lock() and
operating only on the PTE?

>>
>>
> So, I understand the following: some CPU spinning on the PTL for a
> long time is worse than taking the mmap_write_lock(). The latter
> blocks this process from doing mmap()s, which, in my limited
> knowledge, is bad for memory-intensive processes (aligned with the
> fact that the maple tree was introduced to optimize VMA operations),
> while the former literally nukes one unit of computation from the
> system for a long time.
>
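To convince myself about that trade-off, and about your point that the
back-off path needs the original PTEs, the fully-under-PTL ordering you
describe would look roughly like the sketch below. Again rough and
untested: the helper name is invented, the expected-refcount check is a
simplified stand-in, the none/zero-PTE cases, folio isolation/unlock and
rmap accounting are elided, and the on-stack PTE array is only for
illustration.

/*
 * Sketch only, not the actual patch: the clear -> flush -> check refs
 * -> copy -> remap ordering, with the PTL held across the whole window.
 */
static int collapse_under_ptl(struct mm_struct *mm,
		struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long haddr, struct folio *new_folio, int order)
{
	unsigned int i, nr = 1 << order;
	pte_t old_ptes[HPAGE_PMD_NR];	/* kept around for the back-off path */
	pte_t *start_pte, *ptep;
	unsigned long addr;
	spinlock_t *ptl;
	int ret = 0;

	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);

	/* (1) Clear the PTEs and flush the TLB. */
	for (i = 0, addr = haddr, ptep = start_pte; i < nr;
	     i++, addr += PAGE_SIZE, ptep++)
		old_ptes[i] = ptep_get_and_clear(mm, addr, ptep);
	flush_tlb_range(vma, haddr, haddr + nr * PAGE_SIZE);

	/* (2) Check for unexpected references (e.g. GUP); back off if any. */
	for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE) {
		struct folio *folio =
			page_folio(vm_normal_page(vma, addr, old_ptes[i]));

		/* Simplified stand-in for the real expected-refcount check. */
		if (folio_ref_count(folio) != 1 + folio_mapcount(folio)) {
			ret = -EAGAIN;
			goto restore;
		}
	}

	/* (3) Copy the old page contents into the new large folio. */
	for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE)
		copy_user_highpage(folio_page(new_folio, i),
				   vm_normal_page(vma, addr, old_ptes[i]),
				   addr, vma);

	/* (4) Map the new folio with one batched update (permissions elided). */
	set_ptes(mm, haddr, start_pte,
		 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), nr);
	goto out;

restore:
	/* This is why the original PTEs must be stashed before clearing. */
	for (i = 0, addr = haddr, ptep = start_pte; i < nr;
	     i++, addr += PAGE_SIZE, ptep++)
		set_pte_at(mm, addr, ptep, old_ptes[i]);
out:
	pte_unmap_unlock(start_pte, ptl);
	return ret;
}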