From: Dev Jain <dev.jain@arm.com>
Date: Fri, 24 Jan 2025 12:43:39 +0530
Message-ID: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
Subject: Re: [RFC 00/11] khugepaged: mTHP support
To: Nico Pache
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com,
 aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
 usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
References: <20250108233128.14484-1-npache@redhat.com>
 <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
 <209ba507-fab8-4011-bc36-1cc38a303800@arm.com>

On 24/01/25 1:54 am, Nico Pache wrote:
> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain wrote:
>>
>> --- snip ---
>>
>>>> Although to be honest, it's not super clear to me what the benefit
>>>> of the bitmap is vs just iterating through the PTEs like Dev does;
>>>> is there a significant cost saving in practice? On the face of it,
>>>> it seems like it might be unneeded complexity.
>>> The bitmap was to encode the state of the PMD without needing to
>>> rescan (or refactor a lot of code). We keep the scan runtime
>>> constant at 512 (for x86). Dev did some good analysis of this here:
>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>> I think I strayed and over-analyzed, and probably did not make my
>> main objection clear enough, so let us cut to the chase.
>> *Why* is it correct to remember the state of the PMD?
>>
>> In __collapse_huge_page_isolate(), we check the PTEs against the
>> sysfs tunables again, since we dropped the lock.
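
[ For concreteness, the recheck I mean looks roughly like this -- a
  simplified sketch, not the actual kernel code: the loop shape and the
  khugepaged_max_ptes_* limits follow mm/khugepaged.c, but
  revalidate_ptes() and pte_is_shared() are illustrative stand-ins for
  the real isolate/vm_normal_page()/mapcount logic:

	/*
	 * Recount under the lock: the counts from the earlier scan may
	 * be stale, so bail out if the sysfs limits are now exceeded.
	 */
	static bool revalidate_ptes(pte_t *pte, int nr_ptes)
	{
		int none = 0, shared = 0, i;

		for (i = 0; i < nr_ptes; i++) {
			pte_t entry = ptep_get(pte + i);

			if (pte_none(entry))
				none++;
			else if (pte_is_shared(entry))	/* stand-in */
				shared++;

			if (none > khugepaged_max_ptes_none ||
			    shared > khugepaged_max_ptes_shared)
				return false;
		}
		return true;
	}
]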
>> The bitmap thingy which you are doing, and in general, any algorithm
>> which tries to remember the state of the PMD, violates the entire
>> point of max_ptes_*. Take an example: suppose the PTE table had a
>> lot of shared PTEs. After you drop the PTL, you do this:
>> scan_bitmap() -> read_unlock() -> alloc_charge_folio() ->
>> read_lock() -> read_unlock()... which is a lot
>
> per your recommendation I dropped the read_lock() -> read_unlock() and
> made it a conditional unlock

That's not the one I was talking about here...

>> of stuff. Now, you do write_lock(), which means that you need to wait
>> for all faulting/forking/mremap/mmap etc. to stop. Suppose this process
>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>> is to stop the collapse here, since we do not want memory bloat
>> (collapse will grab more memory from the buddy and the old memory won't
>> be freed because it has a reference from the parent/child).
>
> That's a fair point, but given the other feedback, my current
> implementation now requires mTHPs to have no shared/swap, and I've
> improved the sysctl interactions for the set_bitmap and the
> max_ptes_none check in the _isolate function.

I am guessing you are following the policy of letting the creep happen
for none PTEs, and assuming shared and swap to be zero.

>
> As for *why* remembering the state is correct. It just prevents
> needing to rescan.

That is what I am saying... if collapse_huge_page() fails, then you
have dropped the mmap write lock, so now the state of the PTEs may have
changed, and you must rescan...

>
>> Another example: a sysadmin does not want too much memory wastage
>> from khugepaged, so we decide to set max_ptes_none low. When you
>> scan the PTE table you justify the collapse. After you drop the PTL
>> and the mmap_lock, a munmap() happens in the region, no longer
>> justifying the collapse. If you have a lot of VMAs of size <= 2MB,
>> then any munmap() on a VMA will happen on the single PTE table
>> present.
>>
>> So, IMHO, before even jumping into analyzing the bitmap algorithm, we
>> need to ask whether any algorithm remembering the state of the PMD is
>> even conceptually right.
>
> Both the issues you raised don't really have to do with the bitmap...

Correct, my issue is with any general algorithm remembering PTE state.

> they are fair points, but they are more of a criticism of my sysctl
> handling. I've cleaned up the max_ptes_none interactions, and now that
> we don't plan to initially support swap/shared, both these problems
> are 'gone'.
>
>> Then you have the harder task of proving that your optimization is
>> actually an optimization, and that it is not rendered futile by its
>> own overhead. From a high-level mathematical PoV, you are saving
>> iterations. Any such analysis has the underlying assumption that
>> every iteration is equal. But the list [pte, pte + 1, ..., pte +
>> (1 << order)] is virtually and physically contiguous in memory, so
>> prefetching helps us. You are trying to save on PTE memory
>> references, but then look at the number of bitmap memory references
>> you have created, not to mention that you are doing a (costly?)
>> division operation in there, and you have a while loop, a stack, new
>> structs, and if conditions. I do not see how this is any faster than
>> a naive linear scan.
>
> Yeah, it's hard to say without real performance testing. I hope to
> include some performance results with my next post.
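
For reference, the naive linear scan I am comparing against is just
this (a sketch; the classification body is elided):

	/*
	 * One sequential pass over a virtually and physically
	 * contiguous PTE table -- the hardware prefetcher hides most
	 * of the memory cost. No bitmap, no division, no extra state
	 * to keep in sync with the PTEs.
	 */
	for (i = 0; i < (1 << order); i++) {
		pte_t entry = ptep_get(pte + i);

		/* classify entry: none / swap / shared / present */
	}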
>
>>
>>> This prevents needing to hold the read lock for longer, and it
>>> prevents needing to reacquire it too.
>>
>> My implementation does not hold the read lock for longer. What you mean
>> to say is, I need to reacquire the lock, and this is by design, to
>
> yes sorry.
>
>> ensure correctness, which boils down to what I wrote above.
>
> The write lock is what ensures correctness, not the read lock. The
> read lock is to gain insight into potential collapse candidates while
> avoiding the cost of the write lock.
>
> Cheers!
> -- Nico
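
For completeness, the locking split we both seem to agree on is roughly
this (a sketch, using the mmap_read_lock()/mmap_write_lock() API):

	mmap_read_lock(mm);
	/* cheap scan: identify a collapse candidate */
	mmap_read_unlock(mm);

	/* allocate and charge the folio; the PTEs may change meanwhile */

	mmap_write_lock(mm);
	/* faults/fork/mremap are now excluded: re-validate against the
	 * max_ptes_* tunables, and only then collapse */
	mmap_write_unlock(mm);

The write lock indeed makes the collapse safe; my point is only that
whatever was computed before taking it must be recomputed after.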