From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>
Date: Mon, 20 Jan 2025 10:47:57 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC 00/11] khugepaged: mTHP support
To: Nico Pache <npache@redhat.com>, Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
 ziy@nvidia.com, jglisse@google.com, surenb@google.com,
 vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
 jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org,
 kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com,
 raquini@redhat.com, sunnanyong@huawei.com, usamaarif642@gmail.com,
 audra@redhat.com, akpm@linux-foundation.org
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <CAA1CXcBejAuvUpqBKmY-VPy6TnVCWwDEwxqbyb08JTX5iBTENQ@mail.gmail.com>
Content-Language: en-US
From: Dev Jain <dev.jain@arm.com>
In-Reply-To: <CAA1CXcBejAuvUpqBKmY-VPy6TnVCWwDEwxqbyb08JTX5iBTENQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


--- snip ---
>>
>> Although to be honest, it's not super clear to me what the benefit of the bitmap
>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>> saving in practice? On the face of it, it seems like it might be unneeded complexity.
> The bitmap was to encode the state of PMD without needing rescanning
> (or refactor a lot of code). We keep the scan runtime constant at 512
> (for x86). Dev did some good analysis for this here
> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/

I think I strayed and over-analyzed, and probably did not make my 
main objection clear enough, so let us cut to the chase.
*Why* is it correct to remember the state of the PMD?

In __collapse_huge_page_isolate(), we check the PTEs against the sysfs 
tunables again, since we dropped the lock. The bitmap approach, and in 
general any algorithm which tries to remember the state of the PMD, 
violates the entire point of max_ptes_*. Take for 
example: Suppose the PTE table had a lot of shared ptes. After you drop 
the PTL, you do this: scan_bitmap() -> read_unlock() -> 
alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot 
of stuff. Now, you do write_lock(), which means that you need to wait 
for all faulting/forking/mremap/mmap etc to stop. Suppose this process 
forks and then a lot of PTEs become shared. The point of max_ptes_shared 
is to stop the collapse here, since we do not want memory bloat 
(collapse will grab more memory from the buddy and the old memory won't 
be freed because it has a reference from the parent/child).
Another example: a sysadmin does not want too much memory 
wastage from khugepaged, so they set max_ptes_none low. When you 
scan the PTE table, the scan justifies the collapse. After you drop the PTL and 
the mmap_lock, a munmap() happens in the region, no longer justifying 
the collapse. If you have a lot of VMAs of size <= 2MB, then any 
munmap() on a VMA will happen on the single PTE table present.
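
To make the two races above concrete, here is a minimal user-space 
sketch. It is not kernel code: the toy_* names, the three-state PTE 
model and the limit values are all made up for illustration. It only 
shows that a decision cached before the locks are dropped can disagree 
with a fresh scan once a fork() or munmap() has slipped in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PTES 512 /* assumed: one PTE table, 4K pages, 2M PMD */

/* Toy PTE states; the kernel inspects pte_t and the mapped page instead. */
enum toy_pte { TOY_NONE, TOY_PRIVATE, TOY_SHARED };

/* Hypothetical stand-in for the max_ptes_none/max_ptes_shared tunables. */
struct toy_limits {
	size_t max_ptes_none;
	size_t max_ptes_shared;
};

/* Scan the table; true means the limits currently allow a collapse. */
bool toy_scan_ok(const enum toy_pte *ptes, size_t n,
		 const struct toy_limits *lim)
{
	size_t none = 0, shared = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i] == TOY_NONE)
			none++;
		else if (ptes[i] == TOY_SHARED)
			shared++;
	}
	return none <= lim->max_ptes_none && shared <= lim->max_ptes_shared;
}

/* Model fork(): every populated private entry becomes shared. */
void toy_fork(enum toy_pte *ptes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (ptes[i] == TOY_PRIVATE)
			ptes[i] = TOY_SHARED;
}

/* Model munmap() of entries [start, start + len). */
void toy_munmap(enum toy_pte *ptes, size_t n, size_t start, size_t len)
{
	assert(start + len <= n);
	for (size_t i = start; i < start + len; i++)
		ptes[i] = TOY_NONE;
}

/* True if the cached decision is stale after a fork in the lock gap. */
bool toy_stale_after_fork(void)
{
	enum toy_pte ptes[NR_PTES];
	struct toy_limits lim = { .max_ptes_none = 511, .max_ptes_shared = 256 };

	for (size_t i = 0; i < NR_PTES; i++)
		ptes[i] = TOY_PRIVATE;

	bool cached = toy_scan_ok(ptes, NR_PTES, &lim); /* before lock drop */
	toy_fork(ptes, NR_PTES);                        /* happens in the gap */
	bool fresh = toy_scan_ok(ptes, NR_PTES, &lim);  /* recheck under lock */

	return cached && !fresh;
}

/* True if the cached decision is stale after a munmap in the lock gap. */
bool toy_stale_after_munmap(void)
{
	enum toy_pte ptes[NR_PTES];
	struct toy_limits lim = { .max_ptes_none = 64, .max_ptes_shared = 256 };

	for (size_t i = 0; i < NR_PTES; i++)
		ptes[i] = TOY_PRIVATE;

	bool cached = toy_scan_ok(ptes, NR_PTES, &lim);
	toy_munmap(ptes, NR_PTES, 0, 128);
	bool fresh = toy_scan_ok(ptes, NR_PTES, &lim);

	return cached && !fresh;
}
```

In both helpers the cached result says "collapse" while the recheck 
says "do not collapse", which is the situation the second check under 
the lock is there to catch.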

So, IMHO before even jumping on analyzing the bitmap algorithm, we need 
to ask whether any algorithm remembering the state of the PMD is even 
conceptually right.

Then, you have the harder task of proving that your optimization is 
actually an optimization, and that it is not rendered futile by its 
own overhead. From a high-level mathematical PoV, you are saving 
iterations. But any such analysis has the underlying assumption that 
every iteration costs the same. The list [pte, pte + 1, ....., pte + (1 << 
order) - 1] is virtually and physically contiguous in memory, so prefetching 
helps us. You are trying to save on pte memory references, but then look 
at the number of bitmap memory references you have created, not to 
mention that you are doing a (costly?) division operation in there, and you 
have a while loop, a stack, new structs, and if conditions. I do not see 
how this is any faster than a naive linear scan.
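
For reference, the naive linear scan I have in mind looks roughly like 
the sketch below. It is a user-space toy, not the kernel 
implementation: the toy_* names are hypothetical, a PTE is modelled as 
a plain uint64_t, and "none" is assumed to mean an all-zero entry. The 
point is only that the loop walks 1 << order adjacent words with a 
single compare per entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PMD_ORDER 9 /* assumed: 512 entries per PTE table */

/* Count empty entries among the 1 << order contiguous slots at pte. */
size_t toy_count_none(const uint64_t *pte, unsigned int order)
{
	size_t none = 0;

	assert(order <= TOY_PMD_ORDER);
	for (size_t i = 0; i < ((size_t)1 << order); i++)
		none += (pte[i] == 0); /* assumption: zero means "none" */
	return none;
}

/* Nonzero if the range has at most max_none empty entries. */
int toy_range_ok(const uint64_t *pte, unsigned int order, size_t max_none)
{
	return toy_count_none(pte, order) <= max_none;
}
```

Whether this actually beats the bitmap walk is something only a 
measurement can settle; the sketch just shows how little work each 
iteration of the linear scan does.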

> This prevents needing to hold the read lock for longer, and prevents
> needing to reacquire it too.

My implementation does not hold the read lock for longer. What you mean 
to say is, I need to reacquire the lock, and this is by design, to 
ensure correctness, which boils down to what I wrote above.