From: Dev Jain <dev.jain@arm.com>
Date: Fri, 24 Jan 2025 13:08:11 +0530
Subject: Re: [RFC 00/11] khugepaged: mTHP support
To: Nico Pache
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com,
 ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com,
 jglisse@google.com, surenb@google.com, vishal.moola@gmail.com,
 zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, raquini@redhat.com,
 sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com,
 akpm@linux-foundation.org
In-Reply-To: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>
 <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>

On 24/01/25 12:43 pm, Dev Jain wrote:
>
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain wrote:
>>>
>>> --- snip ---
>>>>>
>>>>> Although to be honest, it's not super clear to me what the benefit
>>>>> of the bitmap is vs just iterating through the PTEs like Dev does;
>>>>> is there a significant cost saving in practice? On the face of it,
>>>>> it seems like it might be unneeded complexity.
>>>> The bitmap was to encode the state of the PMD without needing a
>>>> rescan (or refactoring a lot of code). We keep the scan runtime
>>>> constant at 512 (for x86). Dev did some good analysis for this here:
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In __collapse_huge_page_isolate(), we check the PTEs against the
>>> sysfs tunables again, since we dropped the lock.
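Concretely, the recheck I am referring to has roughly this shape. This
is a standalone toy model, not the kernel code: the enum and
collapse_still_justified() are invented for illustration, and the
threshold values are just the khugepaged defaults.

#include <stdbool.h>
#include <stddef.h>

/* Toy PTE states; the real kernel derives these from the PTE bits. */
enum pte_state { PTE_FILLED, PTE_NONE, PTE_SHARED, PTE_SWAP };

/* Invented stand-ins for the khugepaged sysfs tunables. */
static unsigned int max_ptes_none   = 511;
static unsigned int max_ptes_shared = 256;
static unsigned int max_ptes_swap   = 64;

/*
 * Recount the live PTEs and recheck the tunables.  Run once under
 * the read lock to pick a candidate, and again under the write lock
 * before collapsing, because a fork (more shared PTEs) or a munmap
 * (more none PTEs) may have happened while no lock was held, making
 * the first scan's answer stale.
 */
static bool collapse_still_justified(const enum pte_state *ptes, size_t nr)
{
	unsigned int none = 0, shared = 0, swap = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		switch (ptes[i]) {
		case PTE_NONE:   none++;   break;
		case PTE_SHARED: shared++; break;
		case PTE_SWAP:   swap++;   break;
		default:                   break;
		}
	}
	return none <= max_ptes_none && shared <= max_ptes_shared &&
	       swap <= max_ptes_swap;
}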
>>> The bitmap thingy which you are doing, and in general, any algorithm
>>> which tries to remember the state of the PMD, violates the entire
>>> point of max_ptes_*. Take for example: suppose the PTE table had a
>>> lot of shared PTEs. After you drop the PTL, you do this:
>>> scan_bitmap() -> read_unlock() -> alloc_charge_folio() ->
>>> read_lock() -> read_unlock()... which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock()
>> and made it a conditional unlock
>
> That's not the one I was talking about here...
>
>>> of stuff. Now, you do write_lock(), which means that you need to
>>> wait for all faulting/forking/mremap/mmap etc. to stop. Suppose this
>>> process forks and then a lot of PTEs become shared. The point of
>>> max_ptes_shared is to stop the collapse here, since we do not want
>>> memory bloat (collapse will grab more memory from the buddy, and the
>>> old memory won't be freed because it has a reference from the
>>> parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and I've
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
>
> I am guessing you are following the policy of letting the creep
> happen for none PTEs, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping
mTHP if max_ptes_none is neither 0 nor 511. In any case, we need to
scan the range to check whether we have at least one filled PTE / all
filled PTEs, and that none of them are shared or swap.

>>
>> As for *why* remembering the state is correct. It just prevents
>> needing to rescan.
>
> That is what I am saying... if collapse_huge_page() fails, then you
> have dropped the mmap write lock, so now the state of the PTEs may
> have changed, so you must rescan...
>
>>> Another example would be: a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When
>>> you scan the PTE table you justify the collapse. After you drop the
>>> PTL and the mmap_lock, a munmap() happens in the region, no longer
>>> justifying the collapse. If you have a lot of VMAs of size <= 2MB,
>>> then any munmap() on a VMA will happen on the single PTE table
>>> present.
>>>
>>> So, IMHO, before even jumping into analyzing the bitmap algorithm,
>>> we need to ask whether any algorithm remembering the state of the
>>> PMD is even conceptually right.
>>
>> Both the issues you raised don't really have to do with the bitmap...
>
> Correct, my issue is with any general algorithm remembering PTE state.
>
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. I've cleaned up the max_ptes_none interactions, and now
>> that we don't plan to initially support swap/shared, both these
>> problems are 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not rendered futile by its own
>>> overhead. From a high-level mathematical PoV, you are saving
>>> iterations. Any mathematical analysis has the underlying assumption
>>> that every iteration is equal. But the list [pte, pte + 1, ...,
>>> pte + (1 << order)] is virtually and physically contiguous in
>>> memory, so prefetching helps us. You are trying to save on PTE
>>> memory references, but then look at the number of bitmap memory
>>> references you have created, not to mention that you are doing a
>>> (costly?) division operation in there, you have a while loop, a
>>> stack, new structs, and if conditions. I do not see how this is any
>>> faster than a naive linear scan.
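To make "naive linear scan" concrete: a standalone toy model, not the
actual patch. block_collapsible() and the enum are invented, and
scaled_max_none stands in for whatever per-order scaling of
max_ptes_none gets chosen. The point is that the whole check is one
forward pass over a small, contiguous slice of the table.

#include <stdbool.h>
#include <stddef.h>

enum pte_state { PTE_FILLED, PTE_NONE, PTE_SHARED, PTE_SWAP };

/*
 * One pass over a naturally aligned 1 << order block of entries.
 * The entries are contiguous, so the prefetcher hides most of the
 * memory latency, and there is no bitmap, stack, division or extra
 * struct to keep in sync -- the overheads questioned above.
 */
static bool block_collapsible(const enum pte_state *ptes,
			      unsigned int order,
			      unsigned int scaled_max_none)
{
	unsigned int none = 0;
	size_t i;

	for (i = 0; i < ((size_t)1 << order); i++) {
		/* mTHP policy discussed above: no shared or swap PTEs */
		if (ptes[i] == PTE_SHARED || ptes[i] == PTE_SWAP)
			return false;
		if (ptes[i] == PTE_NONE && ++none > scaled_max_none)
			return false;
	}
	return true;
}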
>>
>> Yeah it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>> This prevents needing to hold the read lock for longer, and
>>>> prevents needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you
>>> mean to say is, I need to reacquire the lock, and this is by design,
>>> to
>> yes sorry.
>>> ensure correctness, which boils down to what I wrote above.
>>
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight into potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
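For the record, that division of labour between the two locks, as a
compilable toy outline. Every function here is an invented stub rather
than the real kernel API; only the ordering is the point.

#include <stdbool.h>

/* Invented stubs so the outline compiles; none of these are the
 * real kernel interfaces. */
static void mmap_read_lock(void)     {}
static void mmap_read_unlock(void)   {}
static void mmap_write_lock(void)    {}
static void mmap_write_unlock(void)  {}
static bool scan_ptes_ok(void)       { return true; }
static void alloc_charge_folio(void) {}
static int  do_collapse(void)        { return 1; }

static int khugepaged_try_collapse(void)
{
	bool candidate;
	int ret = 0;

	mmap_read_lock();		/* cheap: spot a candidate        */
	candidate = scan_ptes_ok();	/* result may go stale; that's ok */
	mmap_read_unlock();

	if (!candidate)
		return 0;

	alloc_charge_folio();		/* no locks held: may sleep       */

	mmap_write_lock();		/* exclusive: faults etc. blocked */
	if (scan_ptes_ok())		/* rescan: the table may have     */
		ret = do_collapse();	/* changed while unlocked         */
	mmap_write_unlock();
	return ret;
}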