Message-ID: <582ffff0-c1ed-4a25-9130-1b1c9d290998@gmail.com>
Date: Wed, 4 Feb 2026 21:46:12 -0800
Subject: Re: [RFC 00/12] mm: PUD (1GB) THP implementation
To: Frank van der Linden
Cc: Zi Yan, Andrew Morton, David Hildenbrand, lorenzo.stoakes@oracle.com,
 linux-mm@kvack.org, hannes@cmpxchg.org, riel@surriel.com,
 shakeel.butt@linux.dev, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
 baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com,
 ryan.roberts@arm.com, vbabka@suse.cz, lance.yang@linux.dev,
 linux-kernel@vger.kernel.org, kernel-team@meta.com
References: <20260202005451.774496-1-usamaarif642@gmail.com>
 <3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com>
 <20f92576-e932-435f-bb7b-de49eb84b012@gmail.com>
From: Usama Arif <usamaarif642@gmail.com>
On 03/02/2026 16:08, Frank van der Linden wrote:
> On Tue, Feb 3, 2026 at 3:29 PM Usama Arif wrote:
>>
>> On 02/02/2026 08:24, Zi Yan wrote:
>>> On 1 Feb 2026, at 19:50, Usama Arif wrote:
>>>
>>>> This is an RFC series to implement 1GB PUD-level THPs, allowing
>>>> applications to benefit from reduced TLB pressure without requiring
>>>> hugetlbfs. The patches are based on top of
>>>> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
>>>
>>> It is nice to see you are working on 1GB THP.
>>>
>>>> Motivation: Why 1GB THP over hugetlbfs?
>>>> =======================================
>>>>
>>>> While hugetlbfs provides 1GB huge pages today, it has significant
>>>> limitations that make it unsuitable for many workloads:
>>>>
>>>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at
>>>> boot or runtime, taking that memory away from the rest of the system.
>>>> This requires capacity planning and administrative overhead, and makes
>>>> workload orchestration much more complex, especially when colocating
>>>> with workloads that don't use hugetlbfs.
>>>
>>> But you are using CMA, the same allocation mechanism as hugetlb_cma.
>>> What is the difference?
>>>
>> So we don't really need to use CMA. CMA can help a lot of course, but we
>> don't *need* it. For example, I can run the very simple case [1] of
>> trying to get 1G pages in the upstream kernel without CMA on my server
>> and it works. The server has been up for more than a week (so pretty
>> fragmented), is running a bunch of stuff in the background, uses 0 CMA
>> memory, and I tried to get 20x1G pages on it and it worked.
>> It uses folio_alloc_gigantic, which is exactly what this series uses:
>>
>> $ uptime -p
>> up 1 week, 3 days, 5 hours, 7 minutes
>> $ cat /proc/meminfo | grep -i cma
>> CmaTotal:              0 kB
>> CmaFree:               0 kB
>> $ echo 20 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> 20
>> $ free -h
>>                total        used        free      shared  buff/cache   available
>> Mem:           1.0Ti       142Gi       292Gi       143Mi       583Gi       868Gi
>> Swap:          129Gi       3.5Gi       126Gi
>> $ ./map_1g_hugepages
>> Mapping 20 x 1GB huge pages (20 GB total)
>> Mapped at 0x7f43c0000000
>> Touched page 0 at 0x7f43c0000000
>> Touched page 1 at 0x7f4400000000
>> Touched page 2 at 0x7f4440000000
>> Touched page 3 at 0x7f4480000000
>> Touched page 4 at 0x7f44c0000000
>> Touched page 5 at 0x7f4500000000
>> Touched page 6 at 0x7f4540000000
>> Touched page 7 at 0x7f4580000000
>> Touched page 8 at 0x7f45c0000000
>> Touched page 9 at 0x7f4600000000
>> Touched page 10 at 0x7f4640000000
>> Touched page 11 at 0x7f4680000000
>> Touched page 12 at 0x7f46c0000000
>> Touched page 13 at 0x7f4700000000
>> Touched page 14 at 0x7f4740000000
>> Touched page 15 at 0x7f4780000000
>> Touched page 16 at 0x7f47c0000000
>> Touched page 17 at 0x7f4800000000
>> Touched page 18 at 0x7f4840000000
>> Touched page 19 at 0x7f4880000000
>> Unmapped successfully
>>
>>>> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>>>> rather than falling back to smaller pages. This makes it fragile under
>>>> memory pressure.
>>>
>>> True.
>>>
>>>> 4. No Splitting: hugetlbfs pages cannot be split when only partial
>>>> access is needed, leading to memory waste and preventing partial
>>>> reclaim.
>>>
>>> Since you have a PUD THP implementation, have you run any workload on
>>> it? How often do you see a PUD THP split?
>>>
>> Ah, so running non-upstream kernels in production is a bit more difficult
>> (and also risky). I was trying to use the 512M experiment on ARM as a
>> comparison, although I know it's not the same thing with PAGE_SIZE and
>> pageblock order.
>>
>> I can try some other upstream benchmarks if it helps? Although I will
>> need to find ones that create VMAs > 1G.
>>
>>> Oh, you actually ran 512MB THP on ARM64 (I saw it below), do you have
>>> any split stats to show the necessity of THP split?
>>>
>>>> 5. Memory Accounting: hugetlbfs memory is accounted separately and
>>>> cannot be easily shared with regular memory pools.
>>>
>>> True.
>>>
>>>> PUD THP solves these limitations by integrating 1GB pages into the
>>>> existing THP infrastructure.
>>>
>>> The main advantage of PUD THP over hugetlb is that it can be split and
>>> mapped at sub-folio level. Do you have any data to support the necessity
>>> of them? I wonder if it would be easier to just support 1GB folio in
>>> core-mm first and we can add 1GB THP split and sub-folio mapping later.
>>> With that, we can move hugetlb users to 1GB folio.
>>>
>> I would say it's not the main advantage? But it's definitely one of them.
>> The 2 main areas where split would be helpful are munmap of a partial
>> range and reclaim (MADV_PAGEOUT). For example, jemalloc/tcmalloc can now
>> start taking advantage of 1G pages. My knowledge is not that great when
>> it comes to memory allocators, but I believe they track how long certain
>> areas have been cold and can trigger reclaim as an example. Then split
>> will be useful. Having memory allocators use hugetlb is probably going to
>> be a no?
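For reference, the ./map_1g_hugepages test in the transcript above is
essentially the following. This is a minimal sketch rather than the exact
program from [1]; it assumes the mapping is created with
MAP_HUGETLB | MAP_HUGE_1GB against the hugetlb pool sized via nr_hugepages
as shown earlier:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Older glibc may not expose the 1GB hugepage mmap flag. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30U << MAP_HUGE_SHIFT)
#endif

#define NR_PAGES 20
#define PAGE_1G  (1UL << 30)

int main(void)
{
        size_t len = (size_t)NR_PAGES * PAGE_1G;

        printf("Mapping %d x 1GB huge pages (%zu GB total)\n",
               NR_PAGES, len >> 30);

        /* Back the mapping with 1GB hugetlb pages from the pool above. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("Mapped at %p\n", (void *)p);

        /* Touch one byte per 1GB page to fault the huge pages in. */
        for (int i = 0; i < NR_PAGES; i++) {
                p[(size_t)i * PAGE_1G] = 1;
                printf("Touched page %d at %p\n", i,
                       (void *)(p + (size_t)i * PAGE_1G));
        }

        if (munmap(p, len)) {
                perror("munmap");
                return 1;
        }
        printf("Unmapped successfully\n");
        return 0;
}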
>>
>>> BTW, without split support, you can apply HVO to 1GB folio to save
>>> memory. That is a disadvantage of PUD THP. Have you taken that into
>>> consideration? Basically, switching from hugetlb to PUD THP, you will
>>> lose memory due to vmemmap usage.
>>>
>> Yeah, so HVO saves 16M per 1G, and the page deposit mechanism adds ~2M
>> per 1G. We have HVO enabled in the Meta fleet. I think we should not only
>> think of PUD THP as a replacement for hugetlb, but also enable further
>> use cases where hugetlb would not be feasible.
>>
>> After the basic infrastructure for 1G is there, we can work on
>> optimizing; I think there would be a lot of interesting work we can do.
>> HVO for 1G THP would be one of them?
>>
>>>> Performance Results
>>>> ===================
>>>>
>>>> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>>>>
>>>> Test: True Random Memory Access [1] test of a 4GB memory region with a
>>>> pointer-chasing workload (4M random pointer dereferences through
>>>> memory):
>>>>
>>>> | Metric          | PUD THP (1GB) | PMD THP (2MB) | Change      |
>>>> |-----------------|---------------|---------------|-------------|
>>>> | Memory access   | 88 ms         | 134 ms        | 34% faster  |
>>>> | Page fault time | 898 ms        | 331 ms        | 2.7x slower |
>>>>
>>>> Page faulting 1G pages is 2.7x slower (allocating 1G pages is hard :)).
>>>> For long-running workloads this will be a one-off cost, and the 34%
>>>> improvement in access latency provides significant benefit.
>>>>
>>>> ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a
>>>> CPU-bound workload running on a large number of ARM servers (256G). I
>>>> enabled the 512M THP setting to "always" for 100 servers in production
>>>> (didn't really have high expectations :)). The average memory used for
>>>> the workload increased from 217G to 233G. The amount of memory backed
>>>> by 512M pages was 68G! The dTLB misses went down by 26% and the PID
>>>> multiplier increased input by 5.9% (this is a very significant
>>>> improvement in workload performance). A significant number of these
>>>> THPs were faulted in at application start and were present across
>>>> different VMAs. Of course, getting these 512M pages is easier on ARM
>>>> due to the bigger PAGE_SIZE and pageblock order.
>>>>
>>>> I am hoping that these patches for 1G THP can be used to provide
>>>> similar benefits for x86. I expect workloads to fault them in at start
>>>> time when there is plenty of free memory available.
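As a side note, the pointer-chasing pattern in the test above boils down to
something like the sketch below. This is only an illustration, not the exact
benchmark: it links a 4GB region into one random cycle and then follows it
for 4M dependent loads. It relies on MADV_HUGEPAGE, so whether it is backed
by PMD or PUD THPs depends on the kernel and sysfs settings, and it assumes
glibc's rand() (whose RAND_MAX exceeds the slot count):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

#define REGION_SIZE (4UL << 30)                      /* 4GB test region */
#define NR_SLOTS    (REGION_SIZE / sizeof(uint64_t))
#define NR_DEREFS   (4UL << 20)                      /* 4M dependent loads */

static double now_ms(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
        uint64_t *slots = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (slots == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* Ask for THP backing; the size used depends on kernel/sysfs policy. */
        madvise(slots, REGION_SIZE, MADV_HUGEPAGE);

        /* The region is faulted in here, while writing the identity map. */
        double t0 = now_ms();
        for (uint64_t i = 0; i < NR_SLOTS; i++)
                slots[i] = i;
        printf("Page fault/init time: %.0f ms\n", now_ms() - t0);

        /* Sattolo's shuffle turns the identity map into one random cycle.
         * Assumes RAND_MAX > NR_SLOTS (true for glibc). */
        srand(42);
        for (uint64_t i = NR_SLOTS - 1; i > 0; i--) {
                uint64_t j = (uint64_t)rand() % i;   /* j < i keeps a single cycle */
                uint64_t tmp = slots[i];

                slots[i] = slots[j];
                slots[j] = tmp;
        }
        /* Replace each successor index with the successor's address. */
        for (uint64_t i = 0; i < NR_SLOTS; i++)
                slots[i] = (uint64_t)(uintptr_t)&slots[slots[i]];

        /* Chase the cycle: every load depends on the one before it. */
        t0 = now_ms();
        uint64_t cur = (uint64_t)(uintptr_t)&slots[0];
        for (uint64_t i = 0; i < NR_DEREFS; i++)
                cur = *(uint64_t *)(uintptr_t)cur;
        printf("Memory access time: %.0f ms (end %#lx)\n",
               now_ms() - t0, (unsigned long)cur);

        munmap(slots, REGION_SIZE);
        return 0;
}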
>>>>
>>>> Previous attempt by Zi Yan
>>>> ==========================
>>>>
>>>> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
>>>> significant changes in the kernel since then, including the folio
>>>> conversion, the mTHP framework, ptdesc, rmap changes, etc. I found it
>>>> easier to use the current PMD code as reference for making 1G PUD THP
>>>> work. I am hoping Zi can provide guidance on these patches!
>>>
>>> I am more than happy to help you. :)
>>>
>> Thanks!!!
>>
>>>> Major Design Decisions
>>>> ======================
>>>>
>>>> 1. No shared 1G zero page: The memory cost would be quite significant!
>>>>
>>>> 2. Page Table Pre-deposit Strategy
>>>> PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>>>> page tables (one for each potential PMD entry after split).
>>>> We allocate a PMD page table and use its pmd_huge_pte list to store
>>>> the deposited PTE tables. This ensures split operations don't fail due
>>>> to page table allocation failures (at the cost of 2M per PUD THP).
>>>>
>>>> 3. Split to Base Pages
>>>> When a PUD THP must be split (COW, partial unmap, mprotect), we split
>>>> directly to base pages (262,144 PTEs). The ideal thing would be to
>>>> split to 2M pages and then to 4K pages if needed. However, this would
>>>> require significant rmap and mapcount tracking changes.
>>>>
>>>> 4. COW and fork handling via split
>>>> Copy-on-write and fork for PUD THP trigger a split to base pages, then
>>>> use the existing PTE-level COW infrastructure. Getting another 1G
>>>> region is hard and could fail. If only a 4K page is written, copying 1G
>>>> is a waste. Probably this should only be done on CoW and not fork?
>>>>
>>>> 5. Migration via split
>>>> Split the PUD to PTEs and migrate individual pages. It is going to be
>>>> difficult to find 1G of contiguous memory to migrate to. Maybe it's
>>>> better to not allow migration of PUDs at all? I am more tempted to not
>>>> allow migration, but have kept splitting in this RFC.
>>>
>>> Without migration, PUD THP loses its flexibility and transparency. But
>>> with its 1GB size, I also wonder what the purpose of PUD THP migration
>>> can be. It does not create memory fragmentation, since it is the largest
>>> folio size we have and contiguous. NUMA balancing 1GB THP seems too much
>>> work.
>>
>> Yeah, this is exactly what I was thinking as well. It is going to be
>> expensive and difficult to migrate 1G pages, and I am not sure if what we
>> get out of it is worth it? I kept the splitting code in this RFC as I
>> wanted to show that it's possible to split and migrate, and the code to
>> reject migration is a lot easier.
>>
>>> BTW, I posted many questions, but that does not mean I object to the
>>> patchset. I just want to understand your use case better, reduce
>>> unnecessary code changes, and hopefully get it upstreamed this time. :)
>>>
>>> Thank you for the work.
>>>
>> Ah no, this is awesome! Thanks for the questions! It's basically the
>> discussion I wanted to start with the RFC.
>>
>> [1] https://gist.github.com/uarif1/35dcd63f9d76048b07eb5c16ace85991
>>
>
> It looks like the scenario you're going for is an application that
> allocates a sizeable chunk of memory upfront, and would like it to be
> 1G pages as much as possible, right?
>

Hello! Yes. But also, it doesn't need to be a single chunk (VMA).

> You can do that with 1G THPs, the advantage being that any failures to
> get 1G pages are not explicit, so you're not left with having to grow
> the number of hugetlb pages yourself, and see how many you can use.
>
> 1G THPs seem useful for that. I don't recall all of the discussion
> here, but I assume that hooking 1G THP support into khugepaged is
> quite something else - the potential churn to get a 1G page could
> well cause more system interference than you'd like.
>

Yes, completely agree.

> The CMA scenario Rik was talking about is similar: you set
> hugetlb_cma=NG, and then, when you need 1G pages, you grow the hugetlb
> pool and use them. Disadvantage: you have to do it explicitly.
>
> However, hugetlb_cma does give you a much larger chance of getting
> those 1G pages. The example you give, 20 1G pages on a 1T system where
> there is 292G free, isn't much of a problem in my experience. You
> should have no problem getting that amount of 1G pages. Things get
> more difficult when most of your memory is taken - hugetlb_cma really
> helps there.
> E.g. we have systems that have 90% hugetlb_cma, and there is a pretty
> good success rate converting back and forth between hugetlb and normal
> page allocator pages with hugetlb_cma, while operating close to that
> 90% hugetlb coverage. Without CMA, the success rate drops quite a bit
> at that level.

Yes, agreed.

> CMA balancing is a related issue, for hugetlb. It fixes a problem that
> has been known for years: the more memory you set aside for movable-only
> allocations (e.g. hugetlb_cma), the less breathing room you have for
> unmovable allocations. So you risk the 'false OOM' scenario, where the
> kernel can't make an unmovable allocation, even though there is enough
> memory available, even outside of CMA. It's just that those MOVABLE
> pageblocks were used for movable allocations. So ideally, you would
> migrate those movable allocations to CMA under those circumstances.
> Which is what CMA balancing does. It's worked out very well for us in
> the scenario I list above (most memory being hugetlb_cma).
>
> Anyway, I'm rambling on a bit. Let's see if I got this right:
>
> 1G THP
> - advantages: transparent interface
> - disadvantages: no HVO, lower success rate under higher memory
>   pressure than hugetlb_cma
>

Yes! But also, the problem of having no HVO for THPs I think can be worked
on once the support for it is there. The lower success rate is a much more
difficult problem to solve.

> hugetlb_cma
> - disadvantages: explicit interface, for higher values needs 'false
>   OOM' avoidance
> - advantage: better success rate under pressure.
>
> I think 1G THPs are a good solution for "nice to have" scenarios, but
> there will still be use cases where a higher success rate matters and
> HugeTLB is preferred.
>

Agreed. I don't think 1G THPs can completely replace hugetlb. Maybe after
several years of work to optimize it there might be a path to that, but
not at the very start.

> Lastly, there's also the ZONE_MOVABLE story. I think 1G THPs and
> ZONE_MOVABLE could work well together, improving the success rate. But
> then the issue of pinning raises its head again, and whether that
> should be allowed or configurable per zone...
>

Ack

> - Frank