From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dev Jain <dev.jain@arm.com>
Date: Mon, 27 Jan 2025 15:01:44 +0530
Subject: Re: [RFC 00/11] khugepaged: mTHP support
To: David Hildenbrand, Ryan Roberts, Nico Pache
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com, aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
Message-ID: <41d85d62-3234-478e-8cd7-571a49cfc031@arm.com>
References: <20250108233128.14484-1-npache@redhat.com> <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com> <37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com> <0a318ea8-7836-405a-a033-f073efdc958f@arm.com> <8305ddf7-1ada-4a75-a2c3-385b530b25d4@redhat.com> <9bf875ad-3e31-464d-bccd-7c737a2c53bc@arm.com> <95472249-44f6-4764-a5fa-fac834eb5a49@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 21/01/25 3:49 pm, David Hildenbrand wrote:
>> Hmm that's an interesting idea; if I've understood, we would effectively
>> test the PMD for collapse as if we were collapsing to PMD-size, but then
>> do the actual collapse to the "highest allowed order" (dictated by what's
>> enabled + the MADV_HUGEPAGE config).
>>
>> I'm not so sure this is a good way to go; there would be no way to
>> support VMAs (or parts of VMAs) that don't span a full PMD.
>
> In Nico's approach to locking, we temporarily have to remove the PTE
> table either way. While holding the mmap lock in write mode, the VMAs
> cannot go away, so we could scan the whole PTE table to figure it out.
>
> To just figure out "none" vs. "non-none" vs. "swap PTE", we probably
> wouldn't need the other VMA information. Figuring out "shared" is
> trickier, because we have to obtain the folio and would have to walk the
> other VMAs.
>
> It's a good question whether we would have to VMA-write-lock the other
> affected VMAs as well in order to temporarily remove a PTE table that
> crosses multiple VMAs, or whether we'd need something different (a
> collapse PMD marker) so the page table walkers could handle that case
> properly -- keep retrying or fall back to the mmap lock.

I missed this reply; it could have saved me some trouble :) When collapsing
for VMAs smaller than a PMD, we *will* have to write-lock the VMAs,
write-lock the anon_vmas, and write-lock vma->vm_file->f_mapping for file
VMAs, otherwise someone may fault through another VMA mapping the same PTE
table.
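A userspace toy of the deduplication problem this runs into (all names
hypothetical, not kernel code; the real thing would follow the
mm_take_all_locks() pattern): when several VMAs share an anon_vma root, a
plain "is it already locked?" test cannot distinguish "locked earlier by
this very walk" from "locked by a third party", so a mark bit is needed:

```c
/* Toy model: 'locked' stands in for the anon_vma root rwsem,
 * 'marked_by_us' for an AS_MM_ALL_LOCKS-style bit set by this walk. */
struct fake_anon_vma {
	int locked;
	int marked_by_us;
};

struct fake_vma {
	struct fake_anon_vma *anon_vma;
};

/* Lock each distinct anon_vma root exactly once.  The mark bit is what
 * lets us skip a root we already locked ourselves while still refusing
 * to proceed past a root locked by someone else.  Returns the number of
 * locks taken, or -1 if a root is held by a third party. */
static int lock_all_anon_vmas(struct fake_vma *vmas, int n)
{
	int taken = 0;

	for (int i = 0; i < n; i++) {
		struct fake_anon_vma *av = vmas[i].anon_vma;

		if (av->marked_by_us)
			continue;	/* locked by this walk: safe to skip */
		if (av->locked)
			return -1;	/* locked by someone else: must not skip */
		av->locked = 1;
		av->marked_by_us = 1;
		taken++;
	}
	return taken;
}
```

With only rwsem_is_locked()-style logic (the 'locked' field alone), the
second VMA sharing a root would be indistinguishable from the third-party
case, which is exactly the problem below.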
I was trying to implement this, but cannot find a clean way: we would have
to implement it like mm_take_all_locks(), with a marker bit similar to
AS_MM_ALL_LOCKS. Suppose we need to lock all the anon_vmas: two VMAs may
share the same anon_vma, so we cannot get away with the check "lock only if
!rwsem_is_locked(&vma->anon_vma->root->rwsem)", since we must skip the lock
only when it is khugepaged itself that has already taken it, not when
someone else holds it. I guess the way to go then is the PMD-marker
approach, which I am not very familiar with.

>> And I can imagine we might see memory bloat; imagine you have 2M=madvise,
>> 64K=always, max_ptes_none=511, and let's say we have a 2M (aligned
>> portion of a) VMA that does NOT have MADV_HUGEPAGE set and has a single
>> page populated. It passes the PMD-size test, but we opt to collapse to
>> 64K (since 2M=madvise). So now we end up with 32x 64K folios, 31 of
>> which are all zeros. We have spent the same amount of memory as if
>> 2M=always. Perhaps that's a detail that could be solved by ignoring
>> fully-none 64K blocks when collapsing to 64K...
>
> Yes, that's what I had in mind. No need to collapse where there is
> nothing at all ...
>
>> Personally, I think your "enforce simplification of the tunables for
>> mTHP collapse" idea is the best we have so far.
>
> Right.
>
>> But I'll just push against your pushback of the per-VMA cursor idea
>> briefly. It strikes me that this could be useful for khugepaged
>> regardless of mTHP support.
>
> Not a clear pushback; as you say, to me this is a different optimization,
> and I am missing how it could really solve the problem at hand here.
>
> Note that we're already fighting with not growing VMAs (see the VMA
> locking changes under review), but maybe we could still squeeze it in
> there without requiring a bigger slab.
>
>> Today, it starts scanning a VMA, collapses the first PMD it finds that
>> meets the requirements, then switches to scanning another VMA.
>> When it eventually gets back to scanning the first VMA, it starts from
>> the beginning again. Wouldn't a cursor help reduce the amount of
>> scanning it has to do?
>
> Yes, that whole scanning approach sounds weird. I would have assumed that
> it might nowadays be smarter to just scan the MM sequentially, and not
> jump between VMAs.
>
> Assume you only have a handful of large VMAs (like in a VMM): you'd end
> up scanning the same handful of VMAs over and over again.
>
> I think a lot of the khugepaged codebase is just full of historical
> baggage that must be cleaned up and re-validated to see whether it is
> still required ...
>
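For the cursor idea, a minimal userspace sketch of what "resume where the
last scan stopped" could look like (hypothetical types and names; the real
scan state lives in khugepaged's own bookkeeping, and a per-VMA field would
have the slab-growth cost mentioned above):

```c
/* Toy VMA with a resumable scan cursor.  'cursor' is the next address to
 * scan; 0 means "no scan yet, start at vma->start" (so this sketch
 * assumes start != 0). */
struct toy_vma {
	unsigned long start, end;	/* [start, end), PMD-aligned here */
	unsigned long cursor;
};

#define TOY_PMD_SIZE (2UL << 20)	/* 2M, the usual x86-64 PMD size */

/* Return the next PMD-sized range to examine and advance the cursor,
 * wrapping back to vma->start once the end is reached.  Without the
 * cursor, every visit would rescan from vma->start. */
static unsigned long next_scan_addr(struct toy_vma *vma)
{
	unsigned long addr = vma->cursor ? vma->cursor : vma->start;

	if (addr >= vma->end)
		addr = vma->start;	/* wrap: begin a fresh pass */
	vma->cursor = addr + TOY_PMD_SIZE;
	return addr;
}
```

In the "handful of large VMAs in a VMM" case this is where the win would
be: after a collapse, the next visit continues at the following PMD instead
of rescanning everything already found unsuitable.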