From: Dev Jain <dev.jain@arm.com>
Date: Fri, 24 Jan 2025 13:08:11 +0530
Subject: Re: [RFC 00/11] khugepaged: mTHP support
To: Nico Pache
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com,
 ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com,
 jglisse@google.com, surenb@google.com, vishal.moola@gmail.com,
 zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, raquini@redhat.com,
 sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com,
 akpm@linux-foundation.org
In-Reply-To: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>
 <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>

On 24/01/25 12:43 pm, Dev Jain wrote:
>
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain wrote:
>>>
>>> --- snip ---
>>>>>
>>>>> Although to be honest, it's not super clear to me what the benefit
>>>>> of the bitmap is vs just iterating through the PTEs like Dev does;
>>>>> is there a significant cost saving in practice? On the face of it,
>>>>> it seems like it might be unneeded complexity.
>>>> The bitmap was to encode the state of the PMD without needing a
>>>> rescan (or refactoring a lot of code). We keep the scan runtime
>>>> constant at 512 (for x86). Dev did some good analysis for this here:
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In __collapse_huge_page_isolate(), we check the PTEs against the
>>> sysfs tunables again, since we dropped the lock.
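Concretely, the recheck I am referring to has roughly this shape. This
is a standalone toy model, not the kernel code: the enum and
collapse_still_justified() are invented for illustration, and the
threshold values are just the khugepaged defaults.

#include <stdbool.h>
#include <stddef.h>

/* Toy PTE states; the real kernel derives these from the PTE bits. */
enum pte_state { PTE_FILLED, PTE_NONE, PTE_SHARED, PTE_SWAP };

/* Invented stand-ins for the khugepaged sysfs tunables. */
static unsigned int max_ptes_none   = 511;
static unsigned int max_ptes_shared = 256;
static unsigned int max_ptes_swap   = 64;

/*
 * Recount the live PTEs and recheck the tunables.  Run once under
 * the read lock to pick a candidate, and again under the write lock
 * before collapsing, because a fork (more shared PTEs) or a munmap
 * (more none PTEs) may have happened while no lock was held, making
 * the first scan's answer stale.
 */
static bool collapse_still_justified(const enum pte_state *ptes, size_t nr)
{
	unsigned int none = 0, shared = 0, swap = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		switch (ptes[i]) {
		case PTE_NONE:   none++;   break;
		case PTE_SHARED: shared++; break;
		case PTE_SWAP:   swap++;   break;
		default:                   break;
		}
	}
	return none <= max_ptes_none && shared <= max_ptes_shared &&
	       swap <= max_ptes_swap;
}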
>>> The bitmap thingy which you are doing, and in general, any algorithm
>>> which tries to remember the state of the PMD, violates the entire
>>> point of max_ptes_*. Take for example: suppose the PTE table had a
>>> lot of shared PTEs. After you drop the PTL, you do this:
>>> scan_bitmap() -> read_unlock() -> alloc_charge_folio() ->
>>> read_lock() -> read_unlock()... which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock()
>> and made it a conditional unlock
>
> That's not the one I was talking about here...
>
>>> of stuff. Now, you do write_lock(), which means that you need to
>>> wait for all faulting/forking/mremap/mmap etc. to stop. Suppose this
>>> process forks and then a lot of PTEs become shared. The point of
>>> max_ptes_shared is to stop the collapse here, since we do not want
>>> memory bloat (collapse will grab more memory from the buddy, and the
>>> old memory won't be freed because it has a reference from the
>>> parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and I've
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
>
> I am guessing you are following the policy of letting the creep
> happen for none PTEs, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping
mTHP if max_ptes_none is neither 0 nor 511. In any case, we need to
scan the range to check whether we have at least one filled PTE / all
filled PTEs, and that none of them are shared or swap.

>>
>> As for *why* remembering the state is correct. It just prevents
>> needing to rescan.
>
> That is what I am saying... if collapse_huge_page() fails, then you
> have dropped the mmap write lock, so now the state of the PTEs may
> have changed, so you must rescan...
>
>>> Another example would be: a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When
>>> you scan the PTE table you justify the collapse. After you drop the
>>> PTL and the mmap_lock, a munmap() happens in the region, no longer
>>> justifying the collapse. If you have a lot of VMAs of size <= 2MB,
>>> then any munmap() on a VMA will happen on the single PTE table
>>> present.
>>>
>>> So, IMHO, before even jumping into analyzing the bitmap algorithm,
>>> we need to ask whether any algorithm remembering the state of the
>>> PMD is even conceptually right.
>>
>> Both the issues you raised don't really have to do with the bitmap...
>
> Correct, my issue is with any general algorithm remembering PTE state.
>
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. I've cleaned up the max_ptes_none interactions, and now
>> that we don't plan to initially support swap/shared, both these
>> problems are 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not rendered futile by its own
>>> overhead. From a high-level mathematical PoV, you are saving
>>> iterations. Any mathematical analysis has the underlying assumption
>>> that every iteration is equal. But the list [pte, pte + 1, ...,
>>> pte + (1 << order)] is virtually and physically contiguous in
>>> memory, so prefetching helps us. You are trying to save on PTE
>>> memory references, but then look at the number of bitmap memory
>>> references you have created, not to mention that you are doing a
>>> (costly?) division operation in there, you have a while loop, a
>>> stack, new structs, and if conditions. I do not see how this is any
>>> faster than a naive linear scan.
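To make "naive linear scan" concrete: a standalone toy model, not the
actual patch. block_collapsible() and the enum are invented, and
scaled_max_none stands in for whatever per-order scaling of
max_ptes_none gets chosen. The point is that the whole check is one
forward pass over a small, contiguous slice of the table.

#include <stdbool.h>
#include <stddef.h>

enum pte_state { PTE_FILLED, PTE_NONE, PTE_SHARED, PTE_SWAP };

/*
 * One pass over a naturally aligned 1 << order block of entries.
 * The entries are contiguous, so the prefetcher hides most of the
 * memory latency, and there is no bitmap, stack, division or extra
 * struct to keep in sync -- the overheads questioned above.
 */
static bool block_collapsible(const enum pte_state *ptes,
			      unsigned int order,
			      unsigned int scaled_max_none)
{
	unsigned int none = 0;
	size_t i;

	for (i = 0; i < ((size_t)1 << order); i++) {
		/* mTHP policy discussed above: no shared or swap PTEs */
		if (ptes[i] == PTE_SHARED || ptes[i] == PTE_SWAP)
			return false;
		if (ptes[i] == PTE_NONE && ++none > scaled_max_none)
			return false;
	}
	return true;
}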
>>
>> Yeah it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>> This prevents needing to hold the read lock for longer, and
>>>> prevents needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you
>>> mean to say is, I need to reacquire the lock, and this is by design,
>>> to
>> yes sorry.
>>> ensure correctness, which boils down to what I wrote above.
>>
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight into potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
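For the record, that division of labour between the two locks, as a
compilable toy outline. Every function here is an invented stub rather
than the real kernel API; only the ordering is the point.

#include <stdbool.h>

/* Invented stubs so the outline compiles; none of these are the
 * real kernel interfaces. */
static void mmap_read_lock(void)     {}
static void mmap_read_unlock(void)   {}
static void mmap_write_lock(void)    {}
static void mmap_write_unlock(void)  {}
static bool scan_ptes_ok(void)       { return true; }
static void alloc_charge_folio(void) {}
static int  do_collapse(void)        { return 1; }

static int khugepaged_try_collapse(void)
{
	bool candidate;
	int ret = 0;

	mmap_read_lock();		/* cheap: spot a candidate        */
	candidate = scan_ptes_ok();	/* result may go stale; that's ok */
	mmap_read_unlock();

	if (!candidate)
		return 0;

	alloc_charge_folio();		/* no locks held: may sleep       */

	mmap_write_lock();		/* exclusive: faults etc. blocked */
	if (scan_ptes_ok())		/* rescan: the table may have     */
		ret = do_collapse();	/* changed while unlocked         */
	mmap_write_unlock();
	return ret;
}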