From: Dev Jain <dev.jain@arm.com>
Date: Wed, 18 Dec 2024 21:29:48 +0530
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
Message-ID: <8cd184ef-9a66-4018-b4db-1103f05a6184@arm.com>
In-Reply-To: <2215dd8e-233a-427b-b15c-a2ffbce8f46d@redhat.com>
References: <20241216165105.56185-1-dev.jain@arm.com>
 <20241216165105.56185-10-dev.jain@arm.com>
 <2215dd8e-233a-427b-b15c-a2ffbce8f46d@redhat.com>
To: David Hildenbrand, akpm@linux-foundation.org, willy@infradead.org,
 kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
 ziy@nvidia.com, jglisse@google.com, surenb@google.com,
 vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
 jhubbard@nvidia.com, 21cnbao@gmail.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
On 16/12/24 10:36 pm, David Hildenbrand wrote:
> On 16.12.24 17:51, Dev Jain wrote:
>> In contrast to PMD-collapse, we do not need to operate on two levels
>> of pagetable simultaneously. Therefore, downgrade the mmap lock from
>> write to read mode. Still take the anon_vma lock in exclusive mode so
>> as to not waste time in the rmap path, which is anyway going to fail
>> since the PTEs are going to be changed. Under the PTL, copy page
>> contents, clear the PTEs, remove folio pins, and (try to) unmap the
>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Note: I have been trying hard to get rid of the locks in here: we
>> are still taking the PTL around the page copying; dropping the PTL
>> and taking it after the copying would lead to a deadlock, for
>> example:
>>
>> khugepaged                        madvise(MADV_COLD)
>> folio_lock()                      lock(ptl)
>> lock(ptl)                        folio_lock()
>>
>> We can create a locked-folio list, drop both locks altogether, take
>> the PTL, do everything which __collapse_huge_page_isolate() does
>> *except* the isolation, and again try locking the folios, but that
>> would reduce the efficiency of khugepaged and almost looks like a
>> forced solution :)
>> Please note the following discussion if anyone is interested:
>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>> (Apologies for not CCing the mailing list from the start)
>>
>>   mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 87 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 88beebef773e..8040b130e677 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>                           struct vm_area_struct *vma,
>>                           unsigned long address,
>>                           spinlock_t *ptl,
>> -                        struct list_head *compound_pagelist)
>> +                        struct list_head *compound_pagelist, int order)
>>   {
>>       struct folio *src, *tmp;
>>       pte_t *_pte;
>>       pte_t pteval;
>>
>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +    for (_pte = pte; _pte < pte + (1UL << order);
>>            _pte++, address += PAGE_SIZE) {
>>           pteval = ptep_get(_pte);
>>           if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>               add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>               if (is_zero_pfn(pte_pfn(pteval))) {
>> -                /*
>> -                 * ptl mostly unnecessary.
>> -                 */
>> -                spin_lock(ptl);
>> -                ptep_clear(vma->vm_mm, address, _pte);
>> -                spin_unlock(ptl);
>> +                if (order == HPAGE_PMD_ORDER) {
>> +                    /*
>> +                     * ptl mostly unnecessary.
>> +                     */
>> +                    spin_lock(ptl);
>> +                    ptep_clear(vma->vm_mm, address, _pte);
>> +                    spin_unlock(ptl);
>> +                } else {
>> +                    ptep_clear(vma->vm_mm, address, _pte);
>> +                }
>>                   ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>               }
>>           } else {
>> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>               src = page_folio(src_page);
>>               if (!folio_test_large(src))
>>                   release_pte_folio(src);
>> -            /*
>> -             * ptl mostly unnecessary, but preempt has to
>> -             * be disabled to update the per-cpu stats
>> -             * inside folio_remove_rmap_pte().
>> -             */
>> -            spin_lock(ptl);
>> -            ptep_clear(vma->vm_mm, address, _pte);
>> -            folio_remove_rmap_pte(src, src_page, vma);
>> -            spin_unlock(ptl);
>> +            if (order == HPAGE_PMD_ORDER) {
>> +                /*
>> +                 * ptl mostly unnecessary, but preempt has to
>> +                 * be disabled to update the per-cpu stats
>> +                 * inside folio_remove_rmap_pte().
>> +                 */
>> +                spin_lock(ptl);
>> +                ptep_clear(vma->vm_mm, address, _pte);
>> +                folio_remove_rmap_pte(src, src_page, vma);
>> +                spin_unlock(ptl);
>> +            } else {
>> +                ptep_clear(vma->vm_mm, address, _pte);
>> +                folio_remove_rmap_pte(src, src_page, vma);
>> +            }
>
> As I've talked to Nico about this code recently ... :)
>
> Are you clearing the PTE after the copy succeeded? If so, where is the
> TLB flush?
>
> How do you sync against concurrent write access + GUP-fast?
>
> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check
> if there are unexpected page references (e.g., GUP) and if so back
> off, (3) copy page content, (4) set updated PTE/PMD.
>
> To Nico, I suggested doing it simple initially, and still clearing the
> high-level PMD entry + flushing under the mmap write lock, then
> re-mapping the PTE table after modifying the page table. It's not as
> efficient, but "harder to get wrong".

Btw, if we are taking the mmap write lock, then we do not require
flushing the PMD entry for mTHP collapse, right?

> Maybe that's already happening, but I stumbled over this clearing
> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.